Module 3: Model Development & Versioning

This module focuses on the crucial aspects of developing, tracking, and managing machine learning models to ensure reproducibility and maintainability.

3.1 Data Versioning & Pipeline Reproducibility

Reproducibility in machine learning starts with reproducible data. This section covers strategies for versioning your datasets and ensuring that your entire machine learning pipeline can be reliably recreated.

3.1.1 Data Versioning Tools

  • DVC (Data Version Control): DVC is an open-source version control system for machine learning projects. It works alongside Git, allowing you to version large datasets and models without bloating your Git repository. DVC tracks data and model files using lightweight pointers stored in Git, while the actual data resides in remote storage (e.g., cloud storage, network drives).

    • Key Features:

      • Version control for large files.
      • Integration with Git.
      • Support for various remote storage solutions.
      • Pipeline definition and execution.
      • Experiment tracking.
    • Basic Workflow:

      1. Initialize DVC:
        dvc init
      2. Add data to DVC:
        dvc add data/my_dataset.csv
        This creates a small pointer file, data/my_dataset.csv.dvc, and adds the raw data file to data/.gitignore so Git ignores it.
      3. Commit data pointers to Git:
        git add data/.gitignore data/my_dataset.csv.dvc .dvc/config
        git commit -m "Add dataset versioning"
      4. Push data to remote storage (requires a configured remote; see below):
        dvc push
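    • Configuring a Remote: dvc push requires a default remote to be set up first. A minimal sketch, assuming a local directory serves as remote storage (the localstore name and path are illustrative):

      dvc remote add -d localstore /tmp/dvc-storage
      git add .dvc/config
      git commit -m "Configure default DVC remote"
      dvc push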

3.1.2 Pipeline Reproducibility

  • DVC Pipelines: DVC allows you to define your ML pipeline as a directed acyclic graph (DAG) of dependencies. This ensures that each step in your pipeline is reproducible. When you run dvc repro, DVC intelligently executes only the necessary steps based on changes in input data or code.

    • Defining a Pipeline: Pipelines are defined in a dvc.yaml file.

      stages:
        prepare_data:
          cmd: python scripts/prepare_data.py --input data/raw.csv --output data/processed.csv
          deps:
            - data/raw.csv
            - scripts/prepare_data.py
          outs:
            - data/processed.csv
      
        train_model:
          cmd: python scripts/train.py --data data/processed.csv --model models/model.pkl
          deps:
            - data/processed.csv
            - scripts/train.py
          outs:
            - models/model.pkl
    • Running the Pipeline:

      dvc repro

      This command executes the defined stages in dependency order, skipping any stage whose inputs have not changed since the last run.
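    • Inspecting the Pipeline: To review what would run without executing anything, DVC can print the stage graph or perform a dry run:

      dvc dag
      dvc repro --dry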

3.2 Experiment Tracking with MLflow/DVC

Keeping track of your experiments is vital for understanding model performance, comparing different approaches, and debugging. Both MLflow and DVC offer robust solutions for experiment tracking.

3.2.1 MLflow

MLflow is an open-source platform for managing the ML lifecycle, including experimentation, reproducibility, and deployment.

  • Key Features:

    • MLflow Tracking: Log parameters, code versions, metrics, and output files.
    • MLflow Projects: Package your code in a reusable and reproducible format.
    • MLflow Models: Package models in a standard format that can be used in various downstream tools.
    • MLflow Model Registry: A centralized model store for managing model lifecycles.
  • Basic Tracking Workflow:

    1. Install MLflow:
      pip install mlflow
    2. Start tracking in your Python script:
      import mlflow
      import mlflow.sklearn
      from sklearn.datasets import load_breast_cancer  # example dataset
      from sklearn.model_selection import train_test_split
      from sklearn.linear_model import LogisticRegression
      from sklearn.metrics import accuracy_score
      
      # Start an MLflow run
      with mlflow.start_run():
          # Log parameters
          mlflow.log_param("solver", "liblinear")
          mlflow.log_param("C", 0.1)
      
          # Load data (example dataset; substitute your own loading code)
          X, y = load_breast_cancer(return_X_y=True)
          X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
      
          # Train model
          model = LogisticRegression(solver="liblinear", C=0.1)
          model.fit(X_train, y_train)
      
          # Make predictions and evaluate
          y_pred = model.predict(X_test)
          accuracy = accuracy_score(y_test, y_pred)
      
          # Log metrics
          mlflow.log_metric("accuracy", accuracy)
      
          # Log the model
          mlflow.sklearn.log_model(model, "logistic_regression_model")
      
          print(f"Logged run with accuracy: {accuracy}")
          print(f"MLFlow run ID: {mlflow.active_run().info.run_id}")
    3. Launch the MLflow UI to view experiments:
      mlflow ui
      Open your browser to http://localhost:5000 to see the logged experiments.
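    4. Optionally, query logged runs programmatically. A minimal sketch using mlflow.search_runs, assuming the default local tracking store and the run logged above:
      import mlflow

      # Returns a pandas DataFrame with one row per logged run
      runs = mlflow.search_runs(order_by=["metrics.accuracy DESC"])
      print(runs[["run_id", "metrics.accuracy", "params.C"]].head())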

3.2.2 DVC with Experiment Tracking

DVC also provides experiment tracking capabilities, often integrated with its pipeline and data versioning features. DVC experiments allow you to quickly iterate on different model configurations and compare results.

  • Key Features:

    • DVC Experiments: Record and manage changes to parameters, metrics, and outputs for different model runs.
    • Branching: Experiments are recorded as lightweight Git references and can be promoted to full branches (dvc exp branch) when you want to keep them.
    • Metrics Comparison: Easily compare metrics across different experiment runs.
  • Basic Experiment Tracking Workflow:

    1. Define an experiment script (e.g., scripts/train.py) that logs metrics and outputs. This script should be compatible with DVC's pipeline definitions.
    2. Scaffold a basic dvc.yaml pipeline if your project does not have one yet:
      dvc exp init
    3. Create a new experiment run:
      dvc exp run -n my_first_experiment
      This command will execute the pipeline defined in dvc.yaml and record the results as an experiment. You can also override parameters for a single run, as long as they are declared in params.yaml (see the sketch after this list):
      dvc exp run -n experiment_with_different_params -S model.learning_rate=0.005
    4. View experiments:
      dvc exp show
      This will display a table of your experiments, their parameters, and metrics.
    5. Compare experiments:
      dvc exp diff my_first_experiment experiment_with_different_params
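  • Parameter Files: The -S/--set-param override above assumes the parameter exists in params.yaml and is referenced by a stage in dvc.yaml. A minimal sketch (the model.learning_rate name is illustrative):

    # params.yaml
    model:
      learning_rate: 0.01

    # dvc.yaml (train_model stage, excerpt)
    train_model:
      cmd: python scripts/train.py --data data/processed.csv --model models/model.pkl
      params:
        - model.learning_rate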

3.3 Model Training Scripts with Best Practices

Writing clean, modular, and well-documented model training scripts is crucial for maintainability and collaboration.

3.3.1 Script Structure

A typical model training script should include:

  1. Imports: All necessary libraries.
  2. Argument Parsing: Use libraries like argparse to make scripts configurable.
  3. Data Loading and Preprocessing: Load and prepare data. This can often be a separate function or module.
  4. Model Definition: Instantiate your model.
  5. Training: Train the model.
  6. Evaluation: Evaluate the model's performance.
  7. Saving: Save the trained model and any necessary artifacts (e.g., scalers, encoders).
  8. Logging: Log parameters, metrics, and potentially the model itself (e.g., with MLflow).

3.3.2 Best Practices

  • Modularity: Break down the training process into functions (e.g., load_data, preprocess, train_model, evaluate).
  • Configuration: Use command-line arguments (argparse) or configuration files (YAML, JSON) to manage hyperparameters and paths.
  • Reproducibility:
    • Set random seeds (random.seed(), np.random.seed(), tf.random.set_seed()); see the helper sketch after this list.
    • Use specific versions of libraries.
    • Track your code with Git.
  • Logging: Log key information (hyperparameters, metrics, input data versions) consistently.
  • Error Handling: Implement robust error handling and informative error messages.
  • Testing: Consider writing unit tests for critical functions (e.g., data preprocessing).
  • Docstrings: Add clear docstrings to functions and classes to explain their purpose, arguments, and return values.
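
Example seed helper (a minimal sketch covering the standard library and NumPy; add framework-specific calls such as tf.random.set_seed() or torch.manual_seed() as needed):

import random

import numpy as np

def set_seed(seed: int = 42) -> None:
    """Seed the standard-library and NumPy random number generators."""
    random.seed(seed)
    np.random.seed(seed)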

Example train.py snippet:

import argparse
import os
import yaml
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib # For saving models
import mlflow

def load_config(config_path):
    """Loads configuration from a YAML file."""
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)
    return config

def train_model(data_path, model_save_path, params):
    """Trains a RandomForestClassifier and saves the model."""
    print(f"Loading data from: {data_path}")
    df = pd.read_csv(data_path)

    # Assuming 'target' is the column to predict
    X = df.drop('target', axis=1)
    y = df['target']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=params['test_size'], random_state=params['random_state']
    )

    print("Training model...")
    model = RandomForestClassifier(
        n_estimators=params['n_estimators'],
        max_depth=params['max_depth'],
        random_state=params['random_state']
    )
    model.fit(X_train, y_train)

    print("Evaluating model...")
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.4f}")

    # Log metrics and parameters with MLflow
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_params({
        "n_estimators": params['n_estimators'],
        "max_depth": params['max_depth'],
        "test_size": params['test_size'],
        "random_state": params['random_state']
    })

    # Save the model
    os.makedirs(os.path.dirname(model_save_path) or ".", exist_ok=True)
    joblib.dump(model, model_save_path)
    print(f"Model saved to: {model_save_path}")

    return accuracy

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train a machine learning model.")
    parser.add_argument("--data", required=True, help="Path to the processed data CSV file.")
    parser.add_argument("--model-output", default="models/rf_model.joblib", help="Path to save the trained model.")
    parser.add_argument("--config", default="config/train_params.yaml", help="Path to the training parameters YAML file.")

    args = parser.parse_args()

    # Load configuration
    train_params = load_config(args.config)

    # Start MLflow run
    with mlflow.start_run():
        train_model(args.data, args.model_output, train_params)
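
Assuming the paths used above, the script could be invoked as:

python scripts/train.py --data data/processed.csv --model-output models/rf_model.joblib --config config/train_params.yaml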

Example config/train_params.yaml:

test_size: 0.2
random_state: 42
n_estimators: 100
max_depth: 10

3.4 Setting Up Virtual Environments and Dependency Tracking

Virtual environments isolate your project's dependencies, preventing conflicts with other Python projects. Proper dependency tracking ensures that your project can be reliably set up on any machine.

3.4.1 Virtual Environments

  • venv (Built-in): Python's standard library module for creating lightweight virtual environments.

    • Creating a virtual environment:
      python -m venv .venv
      This command creates a .venv directory in your project root.
    • Activating the environment:
      • On Windows:
        .venv\Scripts\activate
      • On macOS/Linux:
        source .venv/bin/activate
      Your terminal prompt will usually change to indicate the active environment (e.g., (.venv) $).
    • Deactivating the environment:
      deactivate
  • conda (Anaconda/Miniconda): A popular package and environment manager, especially for data science.

    • Creating a conda environment:
      conda create --name myenv python=3.9
    • Activating the environment:
      conda activate myenv
    • Deactivating the environment:
      conda deactivate

3.4.2 Dependency Tracking

  • pip freeze > requirements.txt: This command captures all installed packages in the current Python environment and their exact versions, saving them to a requirements.txt file.

    • Generating requirements.txt:
      # Ensure your virtual environment is activated
      pip freeze > requirements.txt
    • Installing dependencies from requirements.txt:
      # Create a new virtual environment and activate it first
      pip install -r requirements.txt
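    • Example requirements.txt: For illustration only; pin the versions your project actually uses:
      pandas==2.0.3
      scikit-learn==1.3.0
      mlflow==2.9.2
      joblib==1.3.2
      PyYAML==6.0.1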
  • conda env export > environment.yaml: For conda environments, this exports the environment's configuration, including the Python version and channel information, into an environment.yaml file.

    • Generating environment.yaml:
      # Ensure your conda environment is activated
      conda env export > environment.yaml
    • Creating an environment from environment.yaml:
      conda env create -f environment.yaml
  • Poetry/Pipenv (More advanced): Tools like Poetry and Pipenv offer more sophisticated dependency management, including dependency resolution, locking, and virtual environment management within the project. They typically use pyproject.toml (Poetry) or Pipfile (Pipenv) for dependency specifications. These are recommended for larger, more complex projects.
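
    • Example pyproject.toml (Poetry): A minimal sketch; the project name and dependency versions are illustrative:

      [tool.poetry]
      name = "my-ml-project"
      version = "0.1.0"
      description = "Example ML project"
      authors = ["Your Name <you@example.com>"]

      [tool.poetry.dependencies]
      python = "^3.9"
      scikit-learn = "^1.3"
      pandas = "^2.0"

      [build-system]
      requires = ["poetry-core"]
      build-backend = "poetry.core.masonry.api"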