Module 5: Model Packaging & Deployment

This module covers the essential steps and techniques for packaging and deploying machine learning models, enabling them to be used in production environments.

5.1 Model Serialization

Before deployment, models need to be saved in a format that can be loaded and used efficiently. This process is called serialization.

Common Serialization Formats:

  • Pickle:

    • Python's standard library module for serializing and deserializing Python object structures.
    • Pros: Simple to use, supports most Python objects.
    • Cons: Python-specific, potential security risks when loading untrusted data, version compatibility issues between Python versions.
    • Use Cases: Quick saving and loading of models within Python environments (a joblib-based variant, common for scikit-learn models, is sketched after this list).
    import pickle
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import load_iris
    
    # Train a simple model
    X, y = load_iris(return_X_y=True)
    model = LogisticRegression()
    model.fit(X, y)
    
    # Save the model
    with open('logistic_regression_model.pkl', 'wb') as f:
        pickle.dump(model, f)
    
    # Load the model
    with open('logistic_regression_model.pkl', 'rb') as f:
        loaded_model = pickle.load(f)
    
    # Make predictions with the loaded model
    predictions = loaded_model.predict(X[:5])
    print(predictions)
  • ONNX (Open Neural Network Exchange):

    • An open format built to represent machine learning models, enabling interoperability between different frameworks.
    • Pros: Framework agnostic (PyTorch, TensorFlow, scikit-learn, etc.), hardware acceleration potential, good for cross-platform deployment.
    • Cons: Not all model components or operations are supported by every exporter and runtime.
    • Use Cases: Deploying models across diverse platforms and frameworks, optimizing for inference speed.
    import torch
    import torch.nn as nn
    import onnxruntime as ort
    
    # Define a simple PyTorch model
    class SimpleModel(nn.Module):
        def __init__(self):
            super(SimpleModel, self).__init__()
            self.linear = nn.Linear(10, 2)
    
        def forward(self, x):
            return self.linear(x)
    
    model = SimpleModel()
    
    # Export the model to ONNX
    dummy_input = torch.randn(1, 10)
    torch.onnx.export(model, dummy_input, "simple_model.onnx", verbose=True)
    
    # Load and run the ONNX model using ONNX Runtime
    ort_session = ort.InferenceSession("simple_model.onnx")
    input_name = ort_session.get_inputs()[0].name
    output_name = ort_session.get_outputs()[0].name
    
    # Prepare input data (matching the dummy input shape and type)
    input_data = torch.randn(1, 10).numpy().astype("float32")
    ort_outputs = ort_session.run([output_name], {input_name: input_data})
    
    print(ort_outputs)
  • TorchScript:

    • A way to serialize PyTorch models so they can be loaded in C++ or used in environments where Python is not available.
    • Pros: Enables deployment without a Python interpreter, performance optimizations.
    • Cons: Primarily for PyTorch models, requires tracing or scripting the model.
    • Use Cases: Deploying PyTorch models in C++ environments, mobile applications.
    import torch
    import torch.nn as nn
    
    # Define a simple PyTorch model
    class SimpleModel(nn.Module):
        def __init__(self):
            super(SimpleModel, self).__init__()
            self.linear = nn.Linear(10, 2)
    
        def forward(self, x):
            return self.linear(x)
    
    model = SimpleModel()
    
    # Trace the model (a scripting variant is sketched after this list)
    dummy_input = torch.randn(1, 10)
    traced_model = torch.jit.trace(model, dummy_input)
    
    # Save the TorchScript model
    traced_model.save("simple_model.pt")
    
    # Load the TorchScript model
    loaded_traced_model = torch.jit.load("simple_model.pt")
    
    # Make predictions
    predictions = loaded_traced_model(dummy_input)
    print(predictions)
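
TorchScript's other path is scripting. Instead of recording the operations run for one example input, torch.jit.script compiles the model from its Python source, which preserves data-dependent control flow that tracing would miss. A minimal, self-contained sketch (the model mirrors SimpleModel from the example above):

    import torch
    import torch.nn as nn

    class SimpleModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = nn.Linear(10, 2)

        def forward(self, x):
            return self.linear(x)

    # Script the model instead of tracing it
    scripted_model = torch.jit.script(SimpleModel())
    scripted_model.save("simple_model_scripted.pt")

    # Load and use it exactly like the traced version
    loaded_scripted_model = torch.jit.load("simple_model_scripted.pt")
    print(loaded_scripted_model(torch.randn(1, 10)))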

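Closely related to pickle is joblib, which the API examples in the next section use to load models: it relies on the same pickle mechanism but handles objects containing large NumPy arrays more efficiently, so it is commonly preferred for scikit-learn models. A minimal sketch (the filename is illustrative):

    import joblib
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    # Train and save a model with joblib
    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    joblib.dump(model, "logistic_regression_model.joblib")

    # Load it back and predict
    loaded_model = joblib.load("logistic_regression_model.joblib")
    print(loaded_model.predict(X[:5]))
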
5.2 REST API Development

Exposing your trained model as a RESTful API allows other applications to request predictions easily.

Frameworks for API Development:

  • FastAPI:

    • A modern, fast (high-performance) web framework for building APIs with Python 3.7+, based on standard Python type hints.
    • Pros: Very fast, automatic data validation, interactive API documentation (Swagger UI, ReDoc), easy to learn.
    • Use Cases: Building high-performance ML inference APIs.
    from fastapi import FastAPI
    from pydantic import BaseModel
    import joblib # Or your preferred model loading library
    import numpy as np
    
    # Load your trained model (replace with your actual model path and loading logic)
    try:
        model = joblib.load("your_trained_model.pkl")
    except FileNotFoundError:
        # Create a dummy model for demonstration if the file doesn't exist
        from sklearn.linear_model import LogisticRegression
        from sklearn.datasets import load_iris
        X, y = load_iris(return_X_y=True)
        model = LogisticRegression()
        model.fit(X, y)
        joblib.dump(model, "your_trained_model.pkl") # Save it for future runs
        print("Dummy model created and saved as 'your_trained_model.pkl'")
    
    
    # Define the request body schema using Pydantic
    class PredictionRequest(BaseModel):
        features: list[float] # Assuming your model expects a list of floats
    
    # Initialize FastAPI app
    app = FastAPI()
    
    @app.post("/predict/")
    async def predict_item(request: PredictionRequest):
        """
        Accepts a list of features and returns a prediction from the loaded model.
        """
        try:
            # Convert input features to a numpy array
            input_data = np.array(request.features).reshape(1, -1) # Reshape for single prediction
            prediction = model.predict(input_data)
            return {"prediction": prediction.tolist()} # Return prediction as a list
        except Exception as e:
            return {"error": str(e)}
    
    # To run this:
    # 1. Save the code as main.py
    # 2. Install uvicorn: pip install uvicorn fastapi joblib scikit-learn numpy
    # 3. Run: uvicorn main:app --reload
    # Then access http://127.0.0.1:8000/docs for interactive API documentation.
  • Flask:

    • A lightweight WSGI web application framework in Python.
    • Pros: Simple to start with, flexible, mature ecosystem.
    • Cons: Less performant out-of-the-box compared to FastAPI, requires extensions for features like automatic docs.
    • Use Cases: Simple ML APIs, prototyping.
    from flask import Flask, request, jsonify
    import joblib # Or your preferred model loading library
    import numpy as np
    
    # Load your trained model (replace with your actual model path and loading logic)
    try:
        model = joblib.load("your_trained_model.pkl")
    except FileNotFoundError:
        # Create a dummy model for demonstration if the file doesn't exist
        from sklearn.linear_model import LogisticRegression
        from sklearn.datasets import load_iris
        X, y = load_iris(return_X_y=True)
        model = LogisticRegression()
        model.fit(X, y)
        joblib.dump(model, "your_trained_model.pkl") # Save it for future runs
        print("Dummy model created and saved as 'your_trained_model.pkl'")
    
    
    app = Flask(__name__)
    
    @app.route('/predict', methods=['POST'])
    def predict():
        """
        Accepts JSON input with 'features' and returns a prediction.
        """
        data = request.get_json()
        if not data or 'features' not in data:
            return jsonify({"error": "Invalid input. 'features' field is required."}), 400
    
        try:
            # Convert input features to a numpy array
            input_data = np.array(data['features']).reshape(1, -1) # Reshape for single prediction
            prediction = model.predict(input_data)
            return jsonify({"prediction": prediction.tolist()}) # Return prediction as a list
        except Exception as e:
            return jsonify({"error": str(e)}), 500
    
    if __name__ == '__main__':
        # Start the development server so that `python app.py` works as described below
        app.run(host="0.0.0.0", port=5000)
    
    # To run this:
    # 1. Save the code as app.py
    # 2. Install Flask: pip install Flask joblib scikit-learn numpy
    # 3. Run: python app.py
    # Then send POST requests to http://127.0.0.1:5000/predict with JSON body like:
    # {"features": [5.1, 3.5, 1.4, 0.2]}
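
Either API can then be called from any HTTP client. As an illustration, a request to the Flask endpoint above could be sent from Python using the requests library (assuming the server is running locally on port 5000; for the FastAPI example the URL would be http://127.0.0.1:8000/predict/):

    import requests

    # Send one iris-style sample to the prediction endpoint
    response = requests.post(
        "http://127.0.0.1:5000/predict",
        json={"features": [5.1, 3.5, 1.4, 0.2]},
    )
    print(response.status_code)
    print(response.json())  # e.g. {"prediction": [0]}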

5.3 Deployment Strategies

Once your model is serialized and accessible via an API (or directly), you need to choose how to deploy it.

5.3.1 Local Deployment

  • Description: Running your model directly on a developer's machine or a dedicated server.
  • Pros: Simple setup, good for testing and development, full control.
  • Cons: Limited scalability, maintenance burden, not suitable for high traffic.
  • Use Cases: Local development environments, internal tools.

5.3.2 Cloud Deployment

  • Description: Deploying your model on cloud platforms such as AWS, Google Cloud, or Azure.
  • Pros: Scalability, managed services, high availability, pay-as-you-go cost model.
  • Cons: Can be complex to manage, potential vendor lock-in, cost monitoring is crucial.
  • Use Cases: Most production applications requiring scalability and reliability.
    • Managed ML Platforms: Services like AWS SageMaker, Google Vertex AI (formerly AI Platform), and Azure Machine Learning offer end-to-end solutions for deploying models.
    • Container Orchestration: Docker and Kubernetes are crucial here (see 5.4).

5.3.3 Serverless Deployment

  • Description: Deploying your model as functions that are triggered by events (e.g., HTTP requests) and automatically scale based on demand. Examples include AWS Lambda, Google Cloud Functions, Azure Functions.
  • Pros: Automatic scaling, pay-per-execution, no server management.
  • Cons: Cold start issues (initial latency), execution time limits, limited memory/storage, not ideal for very large models or complex dependencies.
  • Use Cases: Infrequent or event-driven predictions, microservices.
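
To make this concrete, serverless predictions are usually implemented as a small handler function. The sketch below shows roughly what an AWS Lambda-style handler behind an API Gateway proxy integration might look like; it is an illustration only, and assumes the pickled iris model from 5.1 (and its scikit-learn dependency) is bundled with the function code:

    import json
    import pickle

    import numpy as np

    # Load the model once, outside the handler, so warm invocations reuse it;
    # this load is the main contributor to cold-start latency.
    with open("logistic_regression_model.pkl", "rb") as f:
        model = pickle.load(f)

    def lambda_handler(event, context):
        """Expects an API Gateway proxy event whose body is JSON like
        {"features": [5.1, 3.5, 1.4, 0.2]}."""
        body = json.loads(event.get("body") or "{}")
        features = body.get("features")
        if features is None:
            return {"statusCode": 400,
                    "body": json.dumps({"error": "'features' field is required."})}

        input_data = np.array(features, dtype=float).reshape(1, -1)
        prediction = model.predict(input_data)
        return {"statusCode": 200,
                "body": json.dumps({"prediction": prediction.tolist()})}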

5.3.4 Edge Deployment

  • Description: Deploying models directly onto end-user devices (e.g., mobile phones, IoT devices).
  • Pros: Low latency, offline capabilities, privacy benefits, reduced bandwidth usage.
  • Cons: Limited computational resources on edge devices, model size constraints, complex deployment and updates.
  • Use Cases: Real-time applications on mobile, autonomous systems, smart devices.
    • Frameworks like TensorFlow Lite and PyTorch Mobile are commonly used.
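
As a rough illustration of the edge workflow, the sketch below converts a small Keras model to the TensorFlow Lite format; the model here is only a placeholder for a trained network, and on-device inference would use the TensorFlow Lite interpreter for the target platform:

    import tensorflow as tf

    # Placeholder Keras model standing in for a trained network
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(10,)),
        tf.keras.layers.Dense(2),
    ])

    # Convert to the TensorFlow Lite flatbuffer format
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    tflite_model = converter.convert()

    # Save the converted model for bundling with a mobile or IoT app
    with open("model.tflite", "wb") as f:
        f.write(tflite_model)

    # The same file can be loaded with the TensorFlow Lite interpreter (on-device or locally for testing)
    interpreter = tf.lite.Interpreter(model_path="model.tflite")
    interpreter.allocate_tensors()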

5.4 Deploying with Docker/Kubernetes

Containerization with Docker and orchestration with Kubernetes are industry-standard for robust and scalable deployments.

5.4.1 Dockerizing ML Models

Docker allows you to package your model, its dependencies, and your API code into a portable container.

  • Dockerfile Example:

    # Use an official Python runtime as a parent image
    FROM python:3.9-slim
    
    # Set the working directory in the container
    WORKDIR /app
    
    # Copy the requirements file into the container
    COPY requirements.txt .
    
    # Install any needed packages specified in requirements.txt
    RUN pip install --no-cache-dir -r requirements.txt
    
    # Copy the rest of the application code into the container
    COPY . .
    
    # Make port 80 available to the world outside this container
    EXPOSE 80
    
    # Define an environment variable with the model location (a sketch of reading it in the app follows this list)
    ENV MODEL_PATH=/app/your_trained_model.pkl
    
    # Run your API application when the container launches
    # Replace 'main:app' with your FastAPI/Flask app entry point
    CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "80"]
    # Or for Flask: CMD ["python", "app.py"] (make sure app.run() binds to 0.0.0.0 and the exposed port)
  • requirements.txt Example:

    fastapi
    uvicorn
    python-multipart
    joblib
    scikit-learn
    numpy
  • Building and Running a Docker Image:

    # Build the Docker image (from the directory containing Dockerfile and your code)
    docker build -t my-ml-api .
    
    # Run the Docker container
    docker run -p 8000:80 my-ml-api

    This will start your API, typically accessible at http://localhost:8000.
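
The Dockerfile above also sets a MODEL_PATH environment variable. One way to use it is to have the API code read the model location from the environment instead of hard-coding the filename; a minimal sketch of how the loading line in main.py or app.py could be adapted:

    import os

    import joblib

    # Fall back to the local filename when the variable is not set (e.g. when running outside Docker)
    MODEL_PATH = os.environ.get("MODEL_PATH", "your_trained_model.pkl")
    model = joblib.load(MODEL_PATH)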

5.4.2 Kubernetes Deployment

Kubernetes is used to automate the deployment, scaling, and management of containerized applications.

  • Kubernetes Deployment Manifest (example deployment.yaml):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: ml-api-deployment
      labels:
        app: ml-api
    spec:
      replicas: 3 # Number of desired pods
      selector:
        matchLabels:
          app: ml-api
      template:
        metadata:
          labels:
            app: ml-api
        spec:
          containers:
          - name: ml-api-container
            image: your-dockerhub-username/my-ml-api:latest # Replace with your image
            ports:
            - containerPort: 80 # Port your app listens on inside the container
            resources: # Optional: define resource requests and limits
              requests:
                memory: "256Mi"
                cpu: "500m"
              limits:
                memory: "512Mi"
                cpu: "1000m"
            # You can also mount models as volumes here if not baked into the image
  • Kubernetes Service Manifest (example service.yaml):

    apiVersion: v1
    kind: Service
    metadata:
      name: ml-api-service
    spec:
      selector:
        app: ml-api # Matches the labels in your Deployment template
      ports:
      - protocol: TCP
        port: 80 # The port the service will be accessible on
        targetPort: 80 # The port your container listens on
      type: LoadBalancer # Or ClusterIP, NodePort depending on your needs
  • Applying Manifests:

    # Apply the deployment
    kubectl apply -f deployment.yaml
    
    # Apply the service
    kubectl apply -f service.yaml

This setup allows Kubernetes to manage multiple instances of your containerized API, automatically restart failed containers, and scale your application up or down based on traffic.