Enhancing Data Science Workflow with Unit Testing

In this blog post, we’ll explore how to effectively use unit tests for a data science project using the Pytest framework. We’ll use the popular Iris dataset as an example to demonstrate how to write tests for data preprocessing, model training, and evaluation functions.

What is Unit Testing?

Unit testing means testing individual units of code (functions, methods, classes) in isolation to verify that they behave as intended. It lets you detect and fix bugs early in development, and it acts as a safety net so that later changes to your code are less likely to introduce unnoticed regressions.
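
As a minimal sketch of the idea, the hypothetical helper scale_to_unit_range below (it is not part of this project) is a single unit, and each test function checks one behaviour of it. Pytest collects any function whose name starts with test_ and reports a failure when an assert is false.

```python
def scale_to_unit_range(values):
    """Rescale a list of numbers linearly so they span 0.0 to 1.0."""
    low, high = min(values), max(values)
    if low == high:
        # Avoid dividing by zero when every value is identical
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def test_scale_to_unit_range():
    assert scale_to_unit_range([2, 4, 6]) == [0.0, 0.5, 1.0]


def test_scale_to_unit_range_constant_input():
    # Edge case: a constant input should not raise ZeroDivisionError
    assert scale_to_unit_range([3, 3]) == [0.0, 0.0]
```

Each test covers one case, so a failure points directly at the behaviour that broke.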

Setting Up the Project

Before we dive into writing tests, let’s set up our project structure:

data-science-project/
│
├── src/
│   ├── preprocessing.py
│   └── model.py
├── tests/
│   ├── test_preprocessing.py
│   └── test_model.py
│
├── data/
│   └── iris.csv
│
└── requirements.txt

src/: Directory containing Python modules for data preprocessing (preprocessing.py) and model training (model.py).
tests/: Directory for storing test modules (test_preprocessing.py and test_model.py).
data/: Directory containing the Iris dataset (iris.csv).
requirements.txt: File listing project dependencies (e.g., pytest, pandas, scikit-learn).
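
The post does not say where data/iris.csv comes from. One way to produce a file with the column names used in the tests later on is to export the copy of the dataset bundled with scikit-learn. The following is a sketch under that assumption; build_iris_frame and write_iris_csv are illustrative names, not part of the project.

```python
from pathlib import Path

from sklearn.datasets import load_iris


def build_iris_frame():
    """Return the Iris data as a DataFrame with a string 'species' column."""
    iris = load_iris(as_frame=True)
    df = iris.data.copy()  # four float64 measurement columns
    # iris.target holds integer class ids; map them to the species names
    df['species'] = iris.target_names[iris.target.to_numpy()]
    return df


def write_iris_csv(out_path):
    """Write the frame to out_path, creating the parent directory if needed."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    build_iris_frame().to_csv(out_path, index=False)
    return out_path
```

Calling write_iris_csv('data/iris.csv') from the project root would create the file in the location shown in the tree above.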

Installing Dependencies

Make sure you have conda installed. Create a conda environment with Python 3.10, activate it, and install the required packages with pip:

conda create --name=ml_test_project python=3.10
conda activate ml_test_project
pip install -r requirements.txt

requirements.txt

pandas
pytest
scikit-learn

Writing Unit Tests with Pytest

Let’s start by writing unit tests for our data preprocessing and modeling functions using Pytest.

Writing Tests for Data Preprocessing

Create src/preprocessing.py with one function to load the Iris dataset and another to preprocess it:

import pandas as pd

def load_iris_dataset(file_path):
    return pd.read_csv(file_path)

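def preprocess_data(df):
    # Drop rows with missing values; the explicit copy guards against
    # SettingWithCopyWarning on the assignment below
    df = df.dropna().copy()
    # Encode the target variable as integer category codes
    df['species'] = df['species'].astype('category').cat.codes
    return df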
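
To see what the two steps in preprocess_data do, here is a sketch on a three-row toy frame (the values are made up for illustration). It also recovers the code-to-label mapping, which is worth keeping because cat.codes numbers the categories in sorted order of the labels actually present in the data.

```python
import pandas as pd

toy = pd.DataFrame({
    'sepal length (cm)': [5.1, None, 6.3],
    'species': ['setosa', 'versicolor', 'virginica'],
})

# Drop the row with the missing measurement
clean = toy.dropna().copy()

# Convert to a categorical once so codes and labels stay in sync
species = clean['species'].astype('category')
code_to_label = dict(enumerate(species.cat.categories))
clean['species'] = species.cat.codes

print(clean['species'].tolist())  # [0, 1]
print(code_to_label)              # {0: 'setosa', 1: 'virginica'}
```

Here virginica is encoded as 1 because the versicolor row was dropped, whereas on the full dataset it would be 2. If the codes need to be stable across different subsets of the data, store the mapping alongside the model.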

Now, create tests/test_preprocessing.py to write unit tests for the preprocessing functions:

import pytest
from pathlib import Path
from src.preprocessing import load_iris_dataset, preprocess_data

data_dir = Path(__file__).parent.parent / 'data'  # Navigate up to the project root
iris_path = data_dir / 'iris.csv'


@pytest.fixture
def iris_data():
    return load_iris_dataset(iris_path)

def test_load_iris_dataset():
    df = load_iris_dataset(iris_path)
    assert not df.empty

def test_preprocess_data(iris_data):
    preprocessed_df = preprocess_data(iris_data)
    assert preprocessed_df.isna().sum().sum() == 0
    assert 'species' in preprocessed_df.columns

def test_missing_values():
    df = load_iris_dataset(iris_path)
    assert not df.isnull().values.any(), "Dataset contains missing values"

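def test_no_duplicates():
    # Note: widely distributed copies of Iris contain a few repeated rows,
    # so this check can fail on an unmodified file
    df = load_iris_dataset(iris_path)
    assert not df.duplicated().any(), "Dataset contains duplicate records"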


def test_column_datatypes():
    df = load_iris_dataset(iris_path)
    float_columns = [
        'sepal length (cm)',
        'sepal width (cm)',
        'petal length (cm)',
        'petal width (cm)',
    ]
    for col in float_columns:
        assert df[col].dtype == 'float64', f"Unexpected datatype for column {col}"
    # read_csv loads species as plain strings, not as a pandas categorical;
    # it only becomes integer codes after preprocess_data
    assert all(isinstance(value, str) for value in df['species']), "species should hold strings"
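
The tests above all read data/iris.csv, so they double as data-quality checks on that file. To test the functions themselves in isolation, it can help to feed them small hand-built inputs. The sketch below assumes the two functions from src/preprocessing.py (lightly adapted copies are repeated here so the example runs on its own) and uses pytest's built-in tmp_path fixture. The toy values and the make_toy_frame helper are invented for illustration.

```python
import pandas as pd


# Copies of the functions in src/preprocessing.py, repeated so this runs alone;
# in the project you would import them instead.
def load_iris_dataset(file_path):
    return pd.read_csv(file_path)


def preprocess_data(df):
    df = df.dropna().copy()
    df['species'] = df['species'].astype('category').cat.codes
    return df


def make_toy_frame():
    """Three rows, one of which has a missing measurement."""
    return pd.DataFrame({
        'petal length (cm)': [1.4, None, 5.1],
        'species': ['setosa', 'versicolor', 'virginica'],
    })


def test_preprocess_drops_incomplete_rows():
    result = preprocess_data(make_toy_frame())
    assert len(result) == 2
    assert result.isna().sum().sum() == 0


def test_preprocess_encodes_species_as_integers():
    result = preprocess_data(make_toy_frame())
    assert pd.api.types.is_integer_dtype(result['species'])
    assert sorted(result['species'].unique()) == [0, 1]


def test_load_round_trips_a_csv(tmp_path):
    # tmp_path is a pytest fixture providing a fresh temporary directory
    csv_path = tmp_path / 'toy.csv'
    make_toy_frame().to_csv(csv_path, index=False)
    loaded = load_iris_dataset(csv_path)
    assert list(loaded.columns) == ['petal length (cm)', 'species']
    assert len(loaded) == 3
```

Because these tests build their own input, they keep passing or failing for reasons tied to the code, regardless of what is in the data directory.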

Writing Tests for Model Training and Evaluation

Create src/model.py with a function that trains a simple classifier on the Iris dataset and evaluates it on a held-out split:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def train_and_evaluate_model(df):
    X = df.drop('species', axis=1)
    y = df['species']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    return model, accuracy

Now, create tests/test_model.py to write unit tests for the model training and evaluation function:

import pytest
from pathlib import Path
from src.model import train_and_evaluate_model
from src.preprocessing import load_iris_dataset, preprocess_data

data_dir = Path(__file__).parent.parent / 'data'  # Navigate up to the project root
iris_path = data_dir / 'iris.csv'

@pytest.fixture
def preprocessed_iris_data():
    df = load_iris_dataset(iris_path)
    return preprocess_data(df)

def test_train_and_evaluate_model(preprocessed_iris_data):
    model, accuracy = train_and_evaluate_model(preprocessed_iris_data)
    assert accuracy > 0.8
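
A single accuracy threshold says little about whether the function behaves sensibly. As a sketch of further checks one might add, the tests below verify that the accuracy is a valid fraction, that the returned model predicts only known class codes, and that the fixed random_state makes the result reproducible. To stay self-contained, the example repeats train_and_evaluate_model and builds its input from the Iris data bundled with scikit-learn rather than data/iris.csv; make_encoded_iris is a hypothetical helper.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


# Copy of the function in src/model.py, repeated so this example runs alone
def train_and_evaluate_model(df):
    X = df.drop('species', axis=1)
    y = df['species']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = RandomForestClassifier(random_state=42)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    return model, accuracy


def make_encoded_iris():
    """Build a frame shaped like the output of preprocess_data."""
    iris = load_iris(as_frame=True)
    df = iris.data.copy()
    df['species'] = iris.target  # already integer codes 0, 1, 2
    return df


def test_accuracy_is_a_valid_fraction():
    _, accuracy = train_and_evaluate_model(make_encoded_iris())
    assert 0.0 <= accuracy <= 1.0


def test_model_predicts_only_known_classes():
    df = make_encoded_iris()
    model, _ = train_and_evaluate_model(df)
    predictions = model.predict(df.drop('species', axis=1))
    assert len(predictions) == len(df)
    assert set(predictions) <= set(df['species'])


def test_training_is_reproducible():
    # Both the split and the forest use random_state=42
    df = make_encoded_iris()
    _, first = train_and_evaluate_model(df)
    _, second = train_and_evaluate_model(df)
    assert first == second
```

Checks like these fail for structural reasons (wrong labels, nondeterminism) rather than because a metric drifted slightly, which makes them easier to act on.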

Running the Tests

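To run the tests using Pytest, navigate to the project root (data-science-project/) and execute:

python -m pytest

Invoking Pytest through python -m adds the current directory to sys.path, which is what lets the test modules import from src.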
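
If you would rather invoke pytest directly, from any directory inside the project, the project root has to be on the import path because the tests import from src. One way to arrange that, assuming pytest 7.0 or newer (where the pythonpath option was introduced), is a small pytest.ini in the project root. This file is an addition for illustration and is not part of the structure shown earlier.

```ini
[pytest]
# Add the project root (the directory containing this file) to sys.path
pythonpath = .
# Look for tests here when pytest is run without arguments from the root
testpaths = tests
```

With this in place, pytest -v lists each test by name, and pytest tests/test_model.py runs a single module.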

Conclusion

Unit tests give you an early warning when a change breaks your data preprocessing, model training, or evaluation code. Keep expanding and refining the suite to cover more scenarios and edge cases, such as missing values, unexpected column types, and degenerate inputs, so that your data science projects stay reliable as they grow.
