How to Write Clean Code: A Data Scientist’s Guide
Introduction
Picture this: You’re opening a Jupyter notebook you created three months ago. As you scroll through cells filled with data preprocessing, model training, and evaluation code, you realize you can barely understand what you wrote. Sound familiar?
This scenario plays out daily in data science teams worldwide. While we focus intensely on model accuracy, feature engineering, and hyperparameter tuning, we often overlook a crucial aspect of our work: writing clean, maintainable code.
Let’s look at a typical example of code you might find in a data science project:
def p(d, ft):
# preprocess data
d = d.fillna(0)
# extract features
f = pd.get_dummies(d[ft])
# train model
X = f.values
y = d['target'].values
m = RandomForestClassifier()
m.fit(X, y)
return m
While this code works, it’s challenging to understand its purpose, maintain it, or collaborate on it. Here’s the same functionality written with clean code principles:
def train_classifier(data: pd.DataFrame, feature_columns: List[str]) -> RandomForestClassifier:
"""
Trains a Random Forest classifier on the given data.
Args:
data: Input DataFrame containing features and target
feature_columns: List of column names to use as features
Returns:
Trained RandomForestClassifier model
"""
preprocessed_data = preprocess_features(data)
feature_matrix = engineer_features(preprocessed_data, feature_columns)
model = train_model(feature_matrix, preprocessed_data['target'])
return model
def preprocess_features(data: pd.DataFrame) -> pd.DataFrame:
"""Handles missing values in the dataset."""
return data.fillna(0)
def engineer_features(data: pd.DataFrame, feature_columns: List[str]) -> np.ndarray:
"""Converts categorical variables into dummy variables."""
return pd.get_dummies(data[feature_columns]).values
def train_model(features: np.ndarray, target: np.ndarray) -> RandomForestClassifier:
"""Trains a Random Forest classifier."""
model = RandomForestClassifier()
model.fit(features, target)
return model
The difference is striking. The second version is not just more readable—it’s also easier to test, modify, and share with teammates. This transformation illustrates the core principle of clean code: writing code that is easy to understand and maintain.
While software engineers have long emphasized clean code practices, these principles haven’t received as much attention in data science, perhaps because we often work in notebooks and focus on experimentation and results rather than code quality. However, as data science projects grow in complexity and team size, the ability to write clean code becomes increasingly crucial.
This guide will walk you through essential principles of clean code, tailored specifically for data scientists. We’ll cover naming conventions, code organization, function design, and more—all with practical examples from data science workflows. Whether you’re building machine learning pipelines, performing statistical analyses, or creating data visualizations, these principles will help you write code that’s not just functional, but also maintainable and scalable.
The Art of Naming: Making Code Self-Explanatory
Imagine you’re a data scientist opening a Jupyter notebook you created a few months ago. You scroll through cells filled with variables like df, x, and clf. Would you immediately remember what these represent? This scenario highlights why naming is such an important part of writing clean code: if poor names are chosen, most other clean code practices won’t help much.
The Core Purpose of Naming
Names have one simple purpose: They should describe what’s stored in a variable or property, what a function or method does, or what kind of object will be created when instantiating a class. With this principle in mind, coming up with good names becomes more straightforward, though finding the best name will often require multiple iterations.
Let’s look at a typical example of code you might find in a data science project:
# Poor naming example
def proc(df, cols, tgt):
X = df[cols].copy()
y = df[tgt]
X = X.fillna(X.mean())
X_std = StandardScaler().fit_transform(X)
clf = LogisticRegression()
clf.fit(X_std, y)
return clf
# Usage
cols = ['dur', 'amnt', 'freq']
model = proc(customer_data, cols, 'churned')
Let’s rewrite this with clear, descriptive names:
def train_churn_classifier(
customer_data: pd.DataFrame,
feature_columns: List[str],
target_column: str
) -> LogisticRegression:
"""
Trains a logistic regression model for customer churn prediction.
Args:
customer_data: DataFrame containing customer information
feature_columns: Columns to use as predictors
target_column: Column containing churn information (1 for churned, 0 for active)
Returns:
Trained logistic regression model
"""
features = customer_data[feature_columns].copy()
target = customer_data[target_column]
cleaned_features = handle_missing_values(features)
scaled_features = scale_features(cleaned_features)
churn_classifier = LogisticRegression()
churn_classifier.fit(scaled_features, target)
return churn_classifier
# Usage
feature_columns = [
'subscription_duration',
'transaction_amount',
'purchase_frequency'
]
churn_model = train_churn_classifier(
customer_data,
feature_columns,
'churned'
)
Naming Guidelines for Data Scientists
Variables and Properties
Variables and properties hold data - numbers, text (strings), boolean values, objects, lists, arrays, maps, etc. Hence, the name should imply which kind of data is being stored. Names should typically be nouns or short phrases with adjectives, especially for boolean values.
# Poor naming
X = np.array([[1, 2], [3, 4]])
y = np.array([0, 1])
df1 = pd.read_csv('data.csv')
flag = True
# Better naming
feature_matrix = np.array([[1, 2], [3, 4]])
target_labels = np.array([0, 1])
raw_sales_data = pd.read_csv('data.csv')
is_valid_input = True
# For DataFrames, describe what they contain
preprocessed_customer_data = raw_sales_data[raw_sales_data['revenue'] > 0]
# Boolean variables should use is_, has_, did_ prefixes
is_outlier = np.abs(z_score) > 3
has_missing_values = data.isnull().any()
did_converge = model.n_iter_ < model.max_iter
Functions and Methods
Functions and methods execute code - they perform tasks and operations. Their names should be verbs that describe the action being performed.
# Poor naming - sounds like properties
def email(text_data):
return process(text_data)
def user(id):
return find(id)
# Better naming - clear actions
def extract_text_features(text_data: List[str]) -> sparse.csr_matrix:
"""Converts text data into bag-of-words representation."""
return CountVectorizer().fit_transform(text_data)
def get_user_by_id(user_id: int) -> User:
"""Retrieves user information from database."""
return database.query(User).filter_by(id=user_id).first()
Classes
Classes are used to create objects (unless it’s a static class). The class name should describe the kind of object it will create. Even for static classes, the name should describe what kind of container it represents. Class names should be nouns.
# Poor naming
class ML:
def __init__(self, data):
self.data = data
# Better naming
class TextPreprocessor:
def __init__(self, text_data: List[str]):
self.text_data = text_data
def remove_stopwords(self) -> List[str]:
pass
def stem_words(self) -> List[str]:
pass
Avoid Generic Names
In most situations, you should avoid generic names like handle(), process(), data, or item. While there can be situations where these make sense, you should typically either make these names more specific or choose a different kind of name:
# Too generic
def process(data):
return data.transform()
# More specific
def normalize_feature_values(feature_data: pd.DataFrame) -> pd.DataFrame:
return feature_data.transform()
Be Consistent
An important part of using proper names is consistency. If you use fetch_users() in one part of your code, you should use fetch_products() - not get_products() - in another part of that same codebase. While it generally doesn’t matter whether you prefer fetch_, get_, or retrieve_, you should stick to one term throughout your codebase.
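As a quick sketch (the data-access helpers here are hypothetical), a consistent prefix keeps the whole API predictable:
from typing import Dict, List

# Hypothetical data-access helpers - the point is the consistent fetch_ prefix
def fetch_users() -> List[Dict]:
    return [{'id': 1, 'name': 'Ada'}]

def fetch_products() -> List[Dict]:
    return [{'id': 10, 'name': 'Keyboard'}]

# Avoid mixing get_products() or retrieve_products() in alongside fetch_users()
users = fetch_users()
products = fetch_products()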
Practical Tips for Data Science Projects
- Dataset Naming: When working with multiple versions of your dataset, use names that reflect the processing stage:
raw_data = pd.read_csv('sales.csv')
cleaned_data = remove_outliers(raw_data)
feature_engineered_data = add_time_features(cleaned_data)
- Model Naming: When experimenting with multiple models, use names that reflect their purpose and configuration:
baseline_model = RandomForestClassifier()
tuned_model = RandomForestClassifier(**optimal_parameters)
production_model = load_model('prod_v2.pkl')
- Feature Names: Use descriptive names in feature engineering:
# Poor naming
df['t_diff'] = df['t2'] - df['t1']
# Better naming
df['time_since_last_purchase'] = (
df['current_transaction_date'] -
df['last_transaction_date']
).dt.days
Remember: The few extra characters you type in descriptive names will save hours of confusion later. Think of variable and function names as documentation - when you return to your code months later, you shouldn’t need to decipher what proc or df1 means; the names should tell the story of what your code does.
Comments and Formatting: When Less Is More
In data science, we often work with complex transformations, mathematical formulas, and multi-step processes. It might be tempting to explain everything with comments, but let’s see why this isn’t always the best approach.
Comments
Comments in code can be both helpful and harmful. Let’s explore when to use them and when to avoid them with data science examples.
When Comments Are Useful
- Legal Information:
# Copyright (c) 2024 Research Institute
# Licensed under the MIT License
# This code implements the algorithm described in:
# Smith et al. (2023) "Novel Approach to Time Series Forecasting"
# Journal of Machine Learning, Vol 42, pp. 101-120
import numpy as np
import pandas as pd
- Complex Mathematical Transformations:
def calculate_mahalanobis_distance(data: pd.DataFrame) -> np.ndarray:
"""Calculates the Mahalanobis distance for multivariate outlier detection."""
# Mahalanobis distance is defined as: sqrt((x-μ)ᵀ Σ⁻¹ (x-μ))
# where μ is the mean vector and Σ is the covariance matrix
mean = np.mean(data, axis=0)
covariance_matrix = np.cov(data, rowvar=False)
inv_covmat = np.linalg.inv(covariance_matrix)
return np.sqrt(np.sum(np.dot(data - mean, inv_covmat) * (data - mean), axis=1))
- Regular Expressions:
def extract_timestamp(text: str) -> str:
# Matches timestamps in format: YYYY-MM-DD HH:MM:SS.mmm
timestamp_pattern = r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\.\d{3})'
return re.search(timestamp_pattern, text).group()
- Important Warnings:
def split_time_series(data: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
# WARNING: Data must be sorted chronologically
# Shuffling would cause future data leakage
split_idx = int(len(data) * 0.8)
return data[:split_idx], data[split_idx:]
- TODO Notes (Use Sparingly):
def train_model(features: pd.DataFrame, target: pd.Series) -> RandomForestClassifier:
# TODO: Add cross-validation when more data is available
# TODO: Implement early stopping based on validation loss
model = RandomForestClassifier()
return model.fit(features, target)
When to Avoid Comments
- Commented-Out Code:
# Bad Example
def preprocess_features(data: pd.DataFrame) -> pd.DataFrame:
# Remove outliers using IQR method
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
# # Old method using z-score
# # z_scores = np.abs(stats.zscore(data))
# # data = data[(z_scores < 3).all(axis=1)]
return data[~((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).any(axis=1)]
# Better: Delete unused code and use version control to track changes
- Misleading Comments:
# Bad Example
def process_features(df: pd.DataFrame) -> pd.DataFrame:
# Normalize the data
standardized = StandardScaler().fit_transform(df) # Actually standardizing!
return standardized
# Better Example
def standardize_features(df: pd.DataFrame) -> pd.DataFrame:
"""Standardizes features to zero mean and unit variance."""
return StandardScaler().fit_transform(df)
- Redundant Comments:
# Bad Example
def calculate_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> Dict:
# Calculate accuracy
acc = accuracy_score(y_true, y_pred) # Don't state the obvious
# Calculate precision
prec = precision_score(y_true, y_pred)
# Return metrics dictionary
return {'accuracy': acc, 'precision': prec}
# Better Example
def calculate_classification_metrics(
y_true: np.ndarray,
y_pred: np.ndarray
) -> Dict[str, float]:
"""Returns a dictionary of classification performance metrics."""
return {
'accuracy': accuracy_score(y_true, y_pred),
'precision': precision_score(y_true, y_pred)
}
Key Principles for Comments
- Comments should explain the “why”, not the “what”:
# Bad: Explains what (obvious from code)
x = x + 1 # Increment x
# Good: Explains why
n_iterations += 1 # Compensate for warm-up period in MCMC
- Let code be self-documenting when possible:
# Bad: Relies on comments for clarity
def p(d): # Process data
r = d.sum() # Calculate sum
return r # Return result
# Good: Self-documenting code
def calculate_total_sales(daily_sales: pd.Series) -> float:
return daily_sales.sum()
- Use docstrings for documentation:
def detect_anomalies(
time_series: pd.Series,
window_size: int = 30
) -> pd.Series:
"""
Detects anomalies using rolling statistics.
Args:
time_series: Time series data
window_size: Size of rolling window
Returns:
Boolean series where True indicates anomalies
"""
rolling_mean = time_series.rolling(window_size).mean()
rolling_std = time_series.rolling(window_size).std()
z_scores = np.abs((time_series - rolling_mean) / rolling_std)
return z_scores > 3
Remember: The goal is to write code that is so clear that it doesn’t need comments to be understood. Comments should be used only when they add value that can’t be conveyed through better code structure and naming.
Code Formatting: Making Your Data Science Code More Readable
Vertical Formatting: The Art of Spacing
Think of your code like a well-written research paper - it should have clear paragraphs, logical sections, and a natural flow. In code, we achieve this through vertical formatting.
Bad Example - No Vertical Spacing
def preprocess_dataset(data: pd.DataFrame) -> pd.DataFrame:
numeric_columns = data.select_dtypes(include=[np.number]).columns
data[numeric_columns] = data[numeric_columns].fillna(data[numeric_columns].mean())
categorical_columns = data.select_dtypes(include=['object']).columns
data[categorical_columns] = data[categorical_columns].fillna(data[categorical_columns].mode().iloc[0])
for column in categorical_columns:
data[column] = LabelEncoder().fit_transform(data[column])
scaled_features = StandardScaler().fit_transform(data[numeric_columns])
data[numeric_columns] = scaled_features
return data
Good Example - With Proper Vertical Spacing
def preprocess_dataset(data: pd.DataFrame) -> pd.DataFrame:
"""Preprocesses dataset by handling missing values and encoding categories."""
# Handle numeric features
numeric_columns = data.select_dtypes(include=[np.number]).columns
data[numeric_columns] = data[numeric_columns].fillna(
data[numeric_columns].mean()
)
# Handle categorical features
categorical_columns = data.select_dtypes(include=['object']).columns
data[categorical_columns] = data[categorical_columns].fillna(
data[categorical_columns].mode().iloc[0]
)
# Encode categorical variables
for column in categorical_columns:
data[column] = LabelEncoder().fit_transform(data[column])
# Scale numeric features
scaled_features = StandardScaler().fit_transform(data[numeric_columns])
data[numeric_columns] = scaled_features
return data
Key Principles of Vertical Formatting:
- Vertical Density: Keep related concepts together
# Good: Related operations are grouped
def calculate_feature_importance(model, X: pd.DataFrame) -> pd.DataFrame:
# Get importance scores
importance = model.feature_importances_
feature_names = X.columns
# Create and sort importance DataFrame
importance_df = pd.DataFrame({
'feature': feature_names,
'importance': importance
})
return importance_df.sort_values('importance', ascending=False)
- Vertical Distance: Separate distinct concepts with blank lines
class ModelTrainer:
def __init__(self, model_params: Dict):
self.model_params = model_params
self.model = None
def prepare_data(self, X: pd.DataFrame, y: pd.Series) -> Tuple:
"""Prepares training and validation data."""
return train_test_split(X, y, test_size=0.2, random_state=42)
def train_model(self, X_train: pd.DataFrame, y_train: pd.Series) -> None:
"""Trains the model on provided data."""
self.model = RandomForestClassifier(**self.model_params)
self.model.fit(X_train, y_train)
- File Organization: Follow the “stepdown rule” - called functions should appear below their callers
def train_and_evaluate_model(data: pd.DataFrame) -> Dict[str, float]:
"""Main function that orchestrates model training and evaluation."""
features, target = prepare_features_and_target(data)
model = train_model(features, target)
return evaluate_model(model, features, target)
def prepare_features_and_target(data: pd.DataFrame) -> Tuple:
"""Prepares features and target for modeling."""
return data.drop('target', axis=1), data['target']
def train_model(features: pd.DataFrame, target: pd.Series) -> RandomForestClassifier:
"""Trains the model."""
return RandomForestClassifier().fit(features, target)
def evaluate_model(
model: RandomForestClassifier,
features: pd.DataFrame,
target: pd.Series
) -> Dict[str, float]:
"""Evaluates model performance."""
return {
'accuracy': accuracy_score(target, model.predict(features)),
'f1': f1_score(target, model.predict(features))
}
Horizontal Formatting: Managing Line Length and Readability
Just as vertical spacing helps organize code sections, horizontal formatting makes individual lines more readable.
Bad Example - Long Lines
result = pd.DataFrame({'predicted_values': model.predict(X_test), 'actual_values': y_test, 'feature_1': X_test['feature_1'], 'feature_2': X_test['feature_2'], 'probability_class_1': model.predict_proba(X_test)[:, 1]})
Good Example - Breaking Long Lines
result = pd.DataFrame({
'predicted_values': model.predict(X_test),
'actual_values': y_test,
'feature_1': X_test['feature_1'],
'feature_2': X_test['feature_2'],
'probability_class_1': model.predict_proba(X_test)[:, 1]
})
Key Principles of Horizontal Formatting:
- Line Length: Keep lines under 88 characters (Python black formatter standard)
# Bad
correlation_matrix = data[['feature1', 'feature2', 'feature3', 'feature4', 'feature5']].corr()
# Good
correlation_matrix = data[[
'feature1', 'feature2', 'feature3',
'feature4', 'feature5'
]].corr()
- Parameter Lists: Break long parameter lists into multiple lines
# Bad
def train_complex_model(X_train, y_train, n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features, random_state):
# Good
def train_complex_model(
X_train: pd.DataFrame,
y_train: pd.Series,
n_estimators: int = 100,
max_depth: int = None,
min_samples_split: int = 2,
min_samples_leaf: int = 1,
max_features: str = 'auto',
random_state: int = 42
) -> RandomForestClassifier:
- Method Chaining: Break long method chains into multiple lines
# Bad
cleaned_data = data.dropna().reset_index(drop=True).drop_duplicates().sort_values('date').reset_index(drop=True)
# Good
cleaned_data = (
data
.dropna()
.reset_index(drop=True)
.drop_duplicates()
.sort_values('date')
.reset_index(drop=True)
)
As your code gets bigger, consider splitting it across multiple files and using modules and import statements to connect the pieces.
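As a rough sketch (the module and function names below are hypothetical), such a split might look like this:
# features.py - hypothetical module holding feature-engineering helpers
import pandas as pd

def add_time_features(data: pd.DataFrame) -> pd.DataFrame:
    data = data.copy()
    data['day_of_week'] = data['transaction_date'].dt.dayofweek
    return data

# train.py - hypothetical entry point that imports the helper above
import pandas as pd
from features import add_time_features

def build_training_data(path: str) -> pd.DataFrame:
    raw_data = pd.read_csv(path, parse_dates=['transaction_date'])
    return add_time_features(raw_data)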
Remember: The goal of formatting is to make your code tell a clear story. Well-formatted code helps others (and your future self) understand your data science workflow more easily.
Writing Clean Functions: The Building Blocks of Data Science Code
Functions are the fundamental building blocks of data science code. They help us organize code, make it reusable, and maintain clarity in our data processing pipelines. Let’s explore the key concepts that make functions effective.
Function Size and Responsibility
Functions should:
- Do exactly one thing
- Be small and focused
- Operate at a single level of abstraction
- Have a clear and specific purpose
# Too many responsibilities
def process_data(df):
# Handle missing values
df = df.fillna(df.mean())
# Remove outliers
z_scores = stats.zscore(df)
df = df[(z_scores < 3).all(axis=1)]
# Scale features
return StandardScaler().fit_transform(df)
# Clear, single responsibilities
def handle_missing_values(data: pd.DataFrame) -> pd.DataFrame:
"""Fills missing values with column means."""
return data.fillna(data.mean())
def remove_outliers(
data: pd.DataFrame,
threshold: float = 3.0
) -> pd.DataFrame:
"""Removes outliers based on z-score threshold."""
z_scores = stats.zscore(data)
return data[(np.abs(z_scores) < threshold).all(axis=1)]
def scale_features(data: pd.DataFrame) -> pd.DataFrame:
"""Scales features using StandardScaler."""
return StandardScaler().fit_transform(data)
Pure Functions
A pure function is a function that:
- Always produces the same output for the same input (deterministic behavior)
- Has no side effects (doesn’t modify external state, print to console, or write to files)
- Relies only on its input parameters
Note: Reading module-level constants is fine inside a pure function; what breaks purity is relying on or modifying mutable global state.
# Not a pure function - modifies input and has side effects
global_scaler = StandardScaler() # Global mutable state
def preprocess_features(df):
# Modifies global state
global_scaler.fit(df)
# Modifies input df in place
df['normalized'] = df['value'] / 100
# Side effect: prints to console
print("Processing complete")
return df
# Pure function - no side effects, doesn't modify inputs
def scale_features(
data: pd.DataFrame,
columns: List[str],
scaler: Optional[StandardScaler] = None
) -> Tuple[pd.DataFrame, StandardScaler]:
"""
Scales specified columns without modifying the input DataFrame.
Args:
data: Input DataFrame
columns: Columns to scale
scaler: Optional pre-fitted scaler
Returns:
Tuple containing:
- DataFrame with scaled features
- Fitted scaler for reuse
"""
result = data.copy()
scaler = scaler or StandardScaler()
result[columns] = scaler.fit_transform(data[columns])
return result, scaler
While pure functions are ideal for readability and testing, they’re not always possible in data science due to necessary side effects:
# Bad: Side effect not clear from name
def process_data(data):
data.to_csv('processed.csv') # Unexpected file I/O
return data
# Good: Name indicates side effect
def save_data_to_csv(data, filepath):
data.to_csv(filepath)
return data
# Bad: Hidden database interaction
def get_user(user_id):
db.connect() # Unexpected database connection
return db.query(f"SELECT * FROM users WHERE id={user_id}")
# Good: Clear about database interaction
def fetch_user_from_database(user_id):
db.connect()
return db.query(f"SELECT * FROM users WHERE id={user_id}")
# Bad: Hidden state modification
def train(model, data):
model.random_state = 42 # Unexpected state change
model.fit(data)
return model
# Good: Name indicates state modification
def initialize_and_train_model(model, data, random_state=42):
model.random_state = random_state
model.fit(data)
return model
Common situations requiring impure functions in data science:
- File operations (use verbs like ‘save’, ‘write’, ‘load’, ‘read’)
- Database interactions (use verbs like ‘fetch’, ‘store’, ‘query’)
- Model training (use verbs like ‘train’, ‘fit’, ‘initialize’)
- Logging/printing (use verbs like ‘log’, ‘print’, ‘display’)
- Random operations (include ‘random’ in name if result varies)
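For the last two situations in this list, here is a brief hypothetical sketch of how those verbs can surface in function names:
import logging
import pandas as pd

def log_validation_metrics(metrics: dict) -> None:
    """The 'log' verb makes the side effect (writing to the log) explicit."""
    for name, value in metrics.items():
        logging.info('%s: %.4f', name, value)

def sample_random_rows(data: pd.DataFrame, n_rows: int, random_state: int = 42) -> pd.DataFrame:
    """'random' in the name signals that the output depends on the random state."""
    return data.sample(n=n_rows, random_state=random_state)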
Levels of Abstraction
Functions should operate at consistent levels of abstraction. Each function should have operations that are at the same conceptual level, which should be one level below what the function name implies.
High-Level vs Low-Level Operations
# Mixed levels of abstraction - Hard to understand
def train_model(data):
X = data.drop('target', axis=1) # Low-level operation
model = RandomForestClassifier() # High-level operation
model.fit(X, data['target']) # Mid-level operation
print('Training complete') # Low-level operation
return model
# Consistent level of abstraction - Clear and maintainable
def train_model(data: pd.DataFrame) -> RandomForestClassifier:
"""Trains a random forest model on the provided data."""
features = prepare_features(data)
target = extract_target(data)
model = create_model()
fit_model(model, features, target)
return model
def prepare_features(data: pd.DataFrame) -> pd.DataFrame:
"""Extracts and preprocesses feature columns."""
return data.drop('target', axis=1)
def extract_target(data: pd.DataFrame) -> pd.Series:
"""Extracts the target variable."""
return data['target']
When to Split Functions
Follow these guidelines to decide when to split functions:
- Extract code that works on the same functionality

# Before
def update_user(user_data):
    validate_user_data(user_data)
    user = find_user_by_id(user_data.id)
    user.set_age(user_data.age)
    user.set_name(user_data.name)
    user.save()

# After
def update_user(user_data):
    validate_user_data(user_data)
    apply_update(user_data)

def apply_update(user_data):
    user = find_user_by_id(user_data.id)
    update_user_fields(user, user_data)
    user.save()

- Extract code that requires more interpretation

# Before
def process_transaction(transaction):
    if transaction.type == 'UNKNOWN':
        raise ValueError('Invalid transaction type.')
    if transaction.type == 'PAYMENT':
        process_payment(transaction)

# After
def process_transaction(transaction):
    validate_transaction(transaction)
    if is_payment(transaction):
        process_payment(transaction)
Common Pitfalls to Avoid
- Over-splitting functions (see the sketch after this list)
  - Don’t create functions just for the sake of extraction
  - Ensure each function adds meaningful abstraction
- Inconsistent abstraction levels
  - Keep operations within a function at the same conceptual level
  - Don’t mix high-level business logic with low-level implementation details
- Hidden side effects
  - Make side effects obvious through function names
  - Document any state changes in function documentation
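To make the first pitfall concrete, here is a small hypothetical sketch contrasting a wrapper that adds no abstraction with an extraction that captures a real decision:
# Over-split: the wrapper merely renames an existing pandas call
def get_column_mean(data, column):
    return data[column].mean()

# Meaningful abstraction: the function encapsulates an agreed-upon strategy
def impute_missing_with_column_means(data):
    """Applies the team's chosen imputation strategy to numeric columns."""
    return data.fillna(data.mean(numeric_only=True))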
Minimizing Function Parameters
The fewer parameters a function has, the easier it is to read, understand, and call. Here’s a guide to parameter counts:
No Parameters
Functions without parameters are very easy to read and digest:
create_session()
user.save()
However, “no parameters” isn’t always an option - parameters make functions dynamic and flexible.
One Parameter
Functions with one parameter are typically straightforward:
is_valid(email)
file.write(data)
Two Parameters
Two parameters can be okay, but context matters:
Good examples (clear and intuitive):
login('[email protected]', 'testers')
create_product('Carpet', 12.99)
Confusing examples (parameter order not obvious):
create_session('abc', 'temp')
sort_users('email', 'asc')
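One way to soften this in Python (a suggestion beyond the original examples, using a hypothetical create_session helper) is to require keyword arguments, so every call site spells out what each value means:
def create_session(*, session_id: str, session_type: str) -> dict:
    # The bare * forces callers to name both arguments
    return {'id': session_id, 'type': session_type}

# The call is now self-explanatory
session = create_session(session_id='abc', session_type='temp')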
More than Two Parameters
Should generally be avoided - they become hard to read and use:
# Hard to understand
create_rectangle(10, 9, 30, 12)
create_user('[email protected]', 31, 'max')
Solutions for Many Parameters
When working with data science projects, you’ll often find yourself dealing with functions and classes that require many parameters. Whether it’s configuring a machine learning model, setting up data preprocessing steps, or defining experiment parameters, managing these parameters effectively is crucial for maintaining clean and maintainable code.
Let’s explore three powerful patterns that can make your data science code more organized and easier to understand.
Using Objects/Maps Instead of Multiple Parameters
The Problem
Consider a typical data preprocessing function:
def preprocess_data(
data,
fill_missing_strategy="mean",
scaling_method="standard",
categorical_encoding="one-hot",
drop_columns=None,
handle_outliers=True,
outlier_threshold=3,
create_polynomial_features=False,
polynomial_degree=2,
feature_selection_method="mutual_info",
n_features_to_select=10
):
# Implementation here
pass
This function is difficult to use and maintain because:
- It’s hard to remember the order of parameters
- Default values are scattered throughout the signature
- Adding new parameters requires changing function calls everywhere
The Solution
Instead, use a dictionary or object with named parameters:
def preprocess_data(config):
"""
Preprocess data according to the configuration.
Args:
config: dict with preprocessing parameters
"""
scaling_method = config.get("scaling_method", "standard")
categorical_encoding = config.get("categorical_encoding", "one-hot")
# Rest of implementation
# Usage
preprocessing_config = {
"scaling_method": "minmax",
"categorical_encoding": "label",
"handle_outliers": True,
"outlier_threshold": 2.5
}
preprocessed_data = preprocess_data(preprocessing_config)
Configuration Objects: Type-Safe Parameter Management
Configuration objects take the previous concept further by adding type safety and validation. This is especially valuable in data science where incorrect parameter types can cause subtle bugs.
Real-World Example: Training Configuration
from dataclasses import dataclass
from typing import List, Optional, Union
from pathlib import Path
@dataclass
class TrainingConfig:
# Model parameters
model_type: str
hidden_layers: List[int]
activation: str
dropout_rate: float
# Training parameters
batch_size: int
learning_rate: float
n_epochs: int
# Data parameters
train_data_path: Path
validation_split: float
target_column: str
feature_columns: List[str]
# Optional parameters
early_stopping_patience: Optional[int] = None
model_checkpoint_path: Optional[Path] = None
def validate(self):
"""Validate configuration parameters."""
assert 0 < self.dropout_rate < 1, "Dropout rate must be between 0 and 1"
assert self.batch_size > 0, "Batch size must be positive"
assert 0 < self.validation_split < 1, "Validation split must be between 0 and 1"
assert self.train_data_path.exists(), "Training data path must exist"
# Usage example
config = TrainingConfig(
model_type="feedforward",
hidden_layers=[128, 64, 32],
activation="relu",
dropout_rate=0.3,
batch_size=64,
learning_rate=0.001,
n_epochs=100,
train_data_path=Path("data/train.csv"),
validation_split=0.2,
target_column="target",
feature_columns=["feature1", "feature2", "feature3"]
)
def train_model(config: TrainingConfig):
config.validate()
# Training implementation here
Benefits:
- Type hints provide IDE support and catch errors early
- Validation ensures parameters are correct before training starts
- Documentation is built into the structure
- Easy to serialize/deserialize for experiment tracking
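For instance, a minimal sketch of that last point: serializing the TrainingConfig defined above so it can be stored next to experiment results (the output path is hypothetical):
import json
from dataclasses import asdict
from pathlib import Path

# Convert the dataclass into a plain dict; Path objects become strings
# so the result is JSON-serializable
config_dict = {
    key: str(value) if isinstance(value, Path) else value
    for key, value in asdict(config).items()
}

with open('experiments/run_001_config.json', 'w') as f:
    json.dump(config_dict, f, indent=2)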
Builder Pattern: Complex Object Construction Made Clear
The Builder pattern is particularly useful for experiment configuration where you might want to modify only certain parameters while keeping others at their default values.
Real-World Example: Experiment Configuration Builder
class ExperimentBuilder:
"""Builder for machine learning experiment configuration."""
def __init__(self):
self._config = {
"model_params": {},
"training_params": {},
"data_params": {}
}
def with_model(self, model_type: str, **kwargs):
"""Configure model architecture."""
self._config["model_params"] = {
"type": model_type,
**kwargs
}
return self
def with_training_params(self, **kwargs):
"""Configure training parameters."""
self._config["training_params"].update(kwargs)
return self
def with_data_preprocessing(self, **kwargs):
"""Configure data preprocessing steps."""
self._config["data_params"].update(kwargs)
return self
def build(self) -> dict:
"""Validate and return the final configuration."""
self._validate_config()
return self._config
def _validate_config(self):
"""Ensure all required parameters are set."""
required_model_params = ["type"]
required_training_params = ["learning_rate", "n_epochs"]
missing_model = [param for param in required_model_params
if param not in self._config["model_params"]]
if missing_model:
raise ValueError(f"Missing required model parameters: {missing_model}")
missing_training = [param for param in required_training_params
if param not in self._config["training_params"]]
if missing_training:
raise ValueError(f"Missing required training parameters: {missing_training}")
# Usage example for a deep learning experiment
experiment_config = (
ExperimentBuilder()
.with_model(
model_type="transformer",
n_layers=6,
n_heads=8,
d_model=512
)
.with_training_params(
learning_rate=0.0001,
n_epochs=50,
batch_size=32,
gradient_clip=1.0
)
.with_data_preprocessing(
sequence_length=128,
vocab_size=10000,
padding="post",
truncating="post"
)
.build()
)
This pattern is particularly valuable when:
- You’re running multiple experiments with different configurations
- Some parameters are interdependent
- You want to ensure all required parameters are set before running
- You need to maintain default configurations while allowing customization
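For instance, one way to cover the last two points (a sketch building on the ExperimentBuilder above, with illustrative defaults) is a small helper that returns a pre-loaded builder, so similar experiments differ only in the parameters you override:
def default_experiment_builder() -> ExperimentBuilder:
    """Hypothetical helper returning a builder pre-loaded with team defaults."""
    return (
        ExperimentBuilder()
        .with_model(model_type='transformer', n_layers=6, n_heads=8, d_model=512)
        .with_training_params(learning_rate=0.0001, n_epochs=50, batch_size=32)
    )

# Two similar experiments that differ only in the learning rate
baseline_config = default_experiment_builder().build()
high_lr_config = (
    default_experiment_builder()
    .with_training_params(learning_rate=0.001)
    .build()
)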
Best Practices Summary
- Use simple parameter objects for functions with more than 2 parameters
- Use configuration objects with type hints for complex settings that need validation
- Use the builder pattern when you need to:
- Create complex configurations step by step
- Maintain multiple similar configurations
- Ensure parameter validity before execution
- Make configuration creation more readable and maintainable
Remember: The goal is to make your code more maintainable and less error-prone. Choose the pattern that best fits your specific use case and team’s needs.
Control Structures: Taming Complexity in Data Pipelines
Control structures (if statements, for loops, while loops, switch-case statements) are fundamental for coordinating code flow. While essential, they can lead to suboptimal or hard-to-maintain code if not used carefully. Here are three key areas for improvement:
1. Prefer Positive Checks
Using positive wording in if checks can make code more readable.
# Less clear - requires more mental processing
if not has_content(blog_content):
    raise ValueError('Invalid input')

# More clear - instantly understandable
if is_empty(blog_content):
    raise ValueError('Invalid input')
2. Avoid Deep Nesting
Deep nesting makes code hard to read and maintain. Here are four techniques to avoid it:
a. Use Guards and Fail Fast
# Deeply nested - hard to follow
def message_user(user, message):
    if user:
        if message:
            if user.accepts_messages:
                success = user.send_message(message)
                if success:
                    print('Message sent!')

# Using guards - clear and flat
def message_user(user, message):
    if not user or not message or not user.accepts_messages:
        return
    user.send_message(message)
    print('Message sent!')
b. Extract Control Structures into Functions
# Complex nested logic
def load_dataset(path):
if not path:
raise ValueError('Path to dataset is required!')
if path.endswith('.csv'):
df = pd.read_csv(path)
if df.empty:
if os.path.exists(path + '.parquet'):
return pd.read_parquet(path + '.parquet')
else:
raise ValueError('Dataset is empty!')
elif path.endswith('.parquet'):
df = pd.read_parquet(path)
if df.empty:
raise ValueError('Dataset is empty!')
else:
raise ValueError('Unsupported file format!')
return df
# Extracted into focused functions
def load_dataset(path):
validate_path(path)
file_type = get_file_type(path)
return read_data(path, file_type)
def validate_path(path):
if not path:
raise ValueError('Path to dataset is required!')
if not os.path.exists(path):
raise FileNotFoundError(f'Dataset not found at {path}')
def get_file_type(path):
supported_formats = {'.csv': 'csv', '.parquet': 'parquet'}
file_ext = os.path.splitext(path)[1].lower()
if file_ext not in supported_formats:
raise ValueError(f'Unsupported format. Use: {list(supported_formats.keys())}')
return supported_formats[file_ext]
def read_data(path, file_type):
readers = {
'csv': pd.read_csv,
'parquet': pd.read_parquet
}
df = readers[file_type](path)
validate_dataframe(df)
return df
def validate_dataframe(df):
if df.empty:
raise ValueError('Dataset is empty!')
c. Use Factory Functions & Polymorphism
# Repetitive checks and nested logic
def process_dataset(data):
if is_timeseries(data):
if needs_resampling(data):
process_timeseries_resampling(data)
if needs_interpolation(data):
process_timeseries_interpolation(data)
else:
if needs_resampling(data):
process_tabular_resampling(data)
if needs_interpolation(data):
process_tabular_interpolation(data)
# Using factory function and polymorphic object
def get_processors(data):
processors = {
'resample': None,
'interpolate': None
}
if is_timeseries(data):
processors['resample'] = lambda x: x.resample('1D').mean()
processors['interpolate'] = lambda x: x.interpolate(method='time')
else:
processors['resample'] = lambda x: x.sample(frac=0.8, random_state=42)
processors['interpolate'] = lambda x: x.interpolate(method='linear')
return processors
def process_dataset(data):
processors = get_processors(data)
if needs_resampling(data):
data = processors['resample'](data)
if needs_interpolation(data):
data = processors['interpolate'](data)
return data
# Helper functions
def is_timeseries(data):
return isinstance(data.index, pd.DatetimeIndex)
def needs_resampling(data):
return data.isnull().sum().sum() > len(data) * 0.1
def needs_interpolation(data):
return data.isnull().any().any()
d. Replace If Checks with Error Handling
Following the principle that a function should do exactly one thing, error handling should typically be separated from the main function logic. When a function can’t complete its one job, it should construct and throw an error rather than handling it internally.
# Poor approach: Function doing multiple things
# 1. Validates data
# 2. Creates error codes/messages
# 3. Handles errors (logging)
def validate_dataset(df):
validity = check_data_quality(df)
if validity['code'] in [1, 2]:
print(f"Data validation failed: {validity['message']}")
return
if validity['code'] == 3:
print("Warning: Data contains outliers")
return df
# Better approach: Each function has one responsibility
def validate_dataframe(df):
"""Validates DataFrame structure and content"""
if not isinstance(df, pd.DataFrame):
raise TypeError("Input must be a pandas DataFrame")
if df.empty:
raise ValueError("DataFrame is empty")
required_columns = ['timestamp', 'value', 'category']
missing_cols = set(required_columns) - set(df.columns)
if missing_cols:
raise ValueError(f"Missing required columns: {missing_cols}")
if df['value'].isnull().sum() / len(df) > 0.2:
raise ValueError("Too many missing values (>20%) in 'value' column")
def validate_data_types(df):
"""Validates data types and formats"""
if not pd.api.types.is_datetime64_any_dtype(df['timestamp']):
raise TypeError("'timestamp' column must be datetime")
if not pd.api.types.is_numeric_dtype(df['value']):
raise TypeError("'value' column must be numeric")
if not df['category'].isin(['A', 'B', 'C']).all():
raise ValueError("'category' must only contain values: A, B, C")
# Separate function for handling the data processing pipeline
def process_dataset(df):
try:
# Structural validation
validate_dataframe(df)
# Data type validation
validate_data_types(df)
# Continue with data processing
return prepare_data(df)
except (TypeError, ValueError) as e:
logging.error(f"Data validation failed: {str(e)}")
raise
except Exception as e:
logging.error(f"Unexpected error during data processing: {str(e)}")
raise
def prepare_data(df):
"""Handles the actual data preparation after validation"""
df = df.copy()
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['value'] = pd.to_numeric(df['value'], errors='coerce')
df = df.sort_values('timestamp')
return df
Best Practices
- Choose Positive Checks When:
- The positive case is more common
- The condition is simple and doesn’t involve multiple states
- It makes the code more readable
- Apply Factory Functions When:
- You have similar but varying behavior
- You’re repeating checks in multiple places
- You need to create objects with different implementations but same interface
- Use Error Handling When:
- Validation is a separate concern
- You want to construct and bubble up errors to appropriate handlers
- You want to avoid nested if checks for error conditions
Classes and Objects in Data Science: Organizing Complex Pipelines
While this guide doesn’t exclusively focus on object-oriented programming, classes and objects are crucial aspects of programming in general, even if you don’t follow a purely object-oriented style. When working with complex data science pipelines, understanding how to effectively use classes and objects can significantly improve code organization and maintainability.
Understanding Objects vs Data Containers
A fundamental distinction in data science code is between “real objects” and “data containers”. Let’s understand this key difference:
Data Containers
A data container is exactly what it sounds like - an object that holds data. For example:
class ExperimentMetrics:
def __init__(self, accuracy: float, f1_score: float):
self.accuracy = accuracy # Public property
self.f1_score = f1_score # Public property
# Usage
experiment_metrics = ExperimentMetrics(0.95, 0.93)
print(experiment_metrics.accuracy) # Direct access to properties
This class has no methods and both properties are exposed publicly. It’s perfectly valid for storing and transferring data.
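In modern Python, the same container can also be written as a dataclass (the same construct used for the configuration objects later in this guide), which keeps the intent of a pure data holder explicit:
from dataclasses import dataclass

@dataclass
class ExperimentMetrics:
    accuracy: float
    f1_score: float

# Usage stays the same: plain, publicly accessible data
experiment_metrics = ExperimentMetrics(accuracy=0.95, f1_score=0.93)
print(experiment_metrics.accuracy)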
Objects with Behavior
In contrast, a proper object hides its data from the public and exposes a public API through methods:
class DataPreprocessor:
def __init__(self, dataframe: pd.DataFrame):
self._data = dataframe # Private property
self._scaler = None # Private property
def fit_transform(self) -> pd.DataFrame:
"""Encapsulates the preprocessing logic"""
self._scaler = StandardScaler()
normalized = self._scaler.fit_transform(self._data)
return pd.DataFrame(normalized, columns=self._data.columns)
# Usage
preprocessor = DataPreprocessor(raw_data)
clean_data = preprocessor.fit_transform() # Interact through methods
When to Use Each Type
Both types have their place in data science:
- Use Data Containers When:
  - You need to group related data together
  - The data structure is simple and doesn’t need behavior
  - You’re passing data between functions

@dataclass
class ModelMetrics:
    accuracy: float
    precision: float
    recall: float
    f1_score: float
- Use Objects When:
  - You need to encapsulate complex behavior
  - You want to hide implementation details
  - You need to maintain state

class ModelEvaluator:
    def __init__(self, model, test_data):
        self._model = model
        self._data = test_data
        self._predictions = None

    def evaluate(self) -> ModelMetrics:
        self._predictions = self._model.predict(self._data.X)
        return self._calculate_metrics()
Key Rules for Clean Classes
When designing classes for data science workflows, consider these important principles:
- Differentiate Between Objects and Data Containers: Don’t mix the two styles - either make a pure data container or a proper object with behavior.
- Keep Classes Small and Focused with High Cohesion: A class should have a single responsibility, and all its methods should actually use its properties. Let’s look at a common example of a class that’s grown too large and lost cohesion:
# Poor example: Large class with low cohesion
class DataProcessor:
def __init__(self):
self.raw_data = None # Used by data methods
self.cleaned_data = None # Used by data methods
self.model = RandomForestClassifier() # Used by model methods
self.validation_results = {} # Used by validation methods
self.feature_columns = [] # Used by feature methods
self.target_column = None # Used by multiple methods
# Data loading methods - only use raw_data
def load_data(self, filepath: str) -> None:
self.raw_data = pd.read_csv(filepath)
def split_data(self) -> Tuple[pd.DataFrame, pd.DataFrame]:
return train_test_split(self.raw_data)
# Feature methods - only use feature columns
def select_features(self, correlation_threshold: float = 0.1) -> None:
correlations = self.raw_data.corr()
self.feature_columns = correlations[correlations > correlation_threshold].index.tolist()
def engineer_features(self) -> None:
# Complex feature engineering using feature_columns
# but not using other properties
pass
# Model methods - only use model property
def train_model(self) -> None:
self.model.fit(
self.cleaned_data[self.feature_columns],
self.cleaned_data[self.target_column]
)
def predict(self, X: pd.DataFrame) -> np.ndarray:
return self.model.predict(X)
# Validation methods - only use validation_results
def calculate_metrics(self) -> None:
predictions = self.model.predict(self.cleaned_data[self.feature_columns])
self.validation_results['accuracy'] = accuracy_score(
self.cleaned_data[self.target_column],
predictions
)
def generate_validation_report(self) -> Dict:
# Only uses validation_results, not other properties
return {
'model_performance': self.validation_results,
'timestamp': datetime.now()
}
# Better approach: Split into focused classes with high cohesion
class DataLoader:
def __init__(self, filepath: str):
self.data = self.load_data(filepath)
def load_data(self, filepath: str) -> pd.DataFrame:
return pd.read_csv(filepath)
def split_data(self) -> Tuple[pd.DataFrame, pd.DataFrame]:
return train_test_split(self.data)
class FeatureEngineer:
def __init__(self, data: pd.DataFrame):
self.data = data
self.feature_columns = []
def select_features(self, correlation_threshold: float = 0.1) -> List[str]:
correlations = self.data.corr()
self.feature_columns = (correlations[correlations > correlation_threshold]
.index.tolist())
return self.feature_columns
def engineer_features(self) -> pd.DataFrame:
# All methods work with the same properties
engineered_data = self.data.copy()
# Feature engineering logic
return engineered_data
class ModelManager:
def __init__(self, model: Optional[RandomForestClassifier] = None):
self.model = model or RandomForestClassifier()
self.metrics = {}
def train(self, features: pd.DataFrame, target: pd.Series) -> None:
self.model.fit(features, target)
def predict(self, features: pd.DataFrame) -> np.ndarray:
return self.model.predict(features)
def evaluate(self, features: pd.DataFrame, target: pd.Series) -> Dict:
predictions = self.predict(features)
self.metrics = {
'accuracy': accuracy_score(target, predictions),
'precision': precision_score(target, predictions, average='weighted'),
'recall': recall_score(target, predictions, average='weighted')
}
return self.metrics
# Usage example showing better organization and cohesion:
loader = DataLoader('data.csv')
train_data, test_data = loader.split_data()
engineer = FeatureEngineer(train_data)
feature_cols = engineer.select_features(correlation_threshold=0.2)
processed_train = engineer.engineer_features()
processed_test = FeatureEngineer(test_data).engineer_features()
model_manager = ModelManager()
model_manager.train(
processed_train[feature_cols],
train_data['target']
)
performance = model_manager.evaluate(
processed_test[feature_cols],
test_data['target']
)
- Follow the Law of Demeter: The Law of Demeter (also known as the principle of least knowledge) is like the “don’t talk to strangers” rule in programming. Imagine you’re working with a complex machine learning pipeline: if you want to know the accuracy of your model, you shouldn’t have to know that it’s stored inside a metrics object, which is inside a validation results object, which is inside your experiment object. Just as you wouldn’t ask your colleague’s manager’s supervisor about your colleague’s schedule – you’d ask your colleague directly.
In data science, we often deal with nested objects (experiments containing models containing metrics containing values). When we chain these objects together (like experiment.model.metrics.accuracy.value), we create brittle code that breaks when internal structures change. For instance, if someone decides to move accuracy metrics into a different object structure, every piece of code that relied on this exact chain would break.
# Example 1: Data Processing Pipeline that Violates Law of Demeter
class DataProcessor:
def __init__(self, dataset):
self.dataset = dataset
# Violates Law of Demeter - reaches through multiple objects
def get_prediction_accuracy(self):
return self.dataset.validation_results.model_metrics.accuracy_score.value
# Violates Law of Demeter - reaches into nested data structures
def get_feature_importance(self, feature_name):
return self.dataset.model.feature_importances_.coefficients[feature_name].weight
# Example 2: Better Design Following Law of Demeter
class Dataset:
def __init__(self):
self._validation_results = ValidationResults()
self._model = Model()
def get_accuracy(self):
"""Encapsulates access to nested accuracy value"""
return self._validation_results.get_accuracy()
def get_feature_importance(self, feature_name):
"""Delegates to model without exposing its internals"""
return self._model.get_feature_importance(feature_name)
class ValidationResults:
def __init__(self):
self._metrics = ModelMetrics()
def get_accuracy(self):
"""Provides clean interface to access accuracy"""
return self._metrics.get_accuracy()
class ModelMetrics:
def __init__(self):
self._accuracy = AccuracyMetric()
def get_accuracy(self):
return self._accuracy.get_value()
# Usage that follows Law of Demeter
class DataProcessor:
def __init__(self, dataset: Dataset):
self.dataset = dataset
def get_prediction_accuracy(self):
# Only talks to immediate friend (dataset)
return self.dataset.get_accuracy()
def get_feature_importance(self, feature_name):
# Delegates responsibility without knowing internals
return self.dataset.get_feature_importance(feature_name)
Note: The Law of Demeter specifically applies to property/attribute chaining, not method calls. This is important in data science because while model.fit(X).predict(X) is fine (these are method calls), model.hyperparameters.optimizer.learning_rate violates the law by reaching through multiple object properties.
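As a tiny runnable illustration of that distinction (a toy scikit-learn example rather than the objects from the prose above):
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Method chaining: acceptable - fit() returns the estimator itself,
# so each call still talks to the same immediate object
predictions = LogisticRegression().fit(X, y).predict(X)

# Property chaining such as model.hyperparameters.optimizer.learning_rate
# would reach through several internal objects and is the pattern to avoid.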
Using Polymorphism in Data Pipelines
Polymorphism is a powerful concept that helps avoid code duplication and create flexible data processing pipelines. Instead of using long if/else chains or switch statements, we can use polymorphic classes to handle different types of data processing elegantly.
Real-World Scenario
Imagine you’re building a data pipeline that needs to:
- Process data from different sources (CSV, JSON, databases)
- Apply different preprocessing strategies based on data types
- Handle multiple model training approaches
- Generate various types of reports
- Export results in different formats
Without polymorphism, you might end up with code full of conditional logic. Here’s what that looks like:
# Bad Example - No Polymorphism
class DataProcessor:
def process_data(self, data_type: str, data: Any) -> pd.DataFrame:
if data_type == 'csv':
# Process CSV data
return pd.read_csv(data)
elif data_type == 'json':
# Process JSON data
return pd.read_json(data)
elif data_type == 'sql':
# Process SQL data
return pd.read_sql(data, self.connection)
else:
raise ValueError(f"Unknown data type: {data_type}")
def preprocess_features(self, feature_type: str, data: pd.DataFrame) -> pd.DataFrame:
if feature_type == 'numeric':
# Scale numeric features
return self._scale_numeric(data)
elif feature_type == 'categorical':
# Encode categorical features
return self._encode_categorical(data)
elif feature_type == 'text':
# Process text features
return self._process_text(data)
else:
raise ValueError(f"Unknown feature type: {feature_type}")
# Problems with this approach:
# 1. Lots of conditional logic
# 2. Need to modify code to add new types
# 3. Hard to maintain and test
# 4. Code duplication likely
Better Approach Using Polymorphism
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Any
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
class DataLoader(ABC):
"""Abstract base class for data loading."""
@abstractmethod
def load_data(self) -> pd.DataFrame:
"""Load data from source."""
pass
@abstractmethod
def validate_schema(self, data: pd.DataFrame) -> bool:
"""Validate data schema."""
pass
class CSVLoader(DataLoader):
"""Handles loading from CSV files."""
def __init__(self, filepath: str, expected_columns: List[str]):
self.filepath = filepath
self.expected_columns = expected_columns
def load_data(self) -> pd.DataFrame:
"""Load data from CSV file."""
data = pd.read_csv(self.filepath)
if not self.validate_schema(data):
raise ValueError("Invalid CSV schema")
return data
def validate_schema(self, data: pd.DataFrame) -> bool:
"""Check if all expected columns are present."""
return all(col in data.columns for col in self.expected_columns)
class JSONLoader(DataLoader):
"""Handles loading from JSON files."""
def __init__(self, filepath: str, expected_keys: List[str]):
self.filepath = filepath
self.expected_keys = expected_keys
def load_data(self) -> pd.DataFrame:
"""Load data from JSON file."""
data = pd.read_json(self.filepath)
if not self.validate_schema(data):
raise ValueError("Invalid JSON schema")
return data
def validate_schema(self, data: pd.DataFrame) -> bool:
"""Check if all expected keys are present."""
return all(key in data.columns for key in self.expected_keys)
class SQLLoader(DataLoader):
"""Handles loading from SQL database."""
def __init__(self, connection, query: str, expected_columns: List[str]):
self.connection = connection
self.query = query
self.expected_columns = expected_columns
def load_data(self) -> pd.DataFrame:
"""Load data from SQL database."""
data = pd.read_sql(self.query, self.connection)
if not self.validate_schema(data):
raise ValueError("Invalid SQL result schema")
return data
def validate_schema(self, data: pd.DataFrame) -> bool:
"""Check if all expected columns are present."""
return all(col in data.columns for col in self.expected_columns)
class FeatureProcessor(ABC):
"""Abstract base class for feature processing."""
@abstractmethod
def fit(self, data: pd.DataFrame) -> 'FeatureProcessor':
"""Fit processor to data."""
pass
@abstractmethod
def transform(self, data: pd.DataFrame) -> pd.DataFrame:
"""Transform the data."""
pass
def fit_transform(self, data: pd.DataFrame) -> pd.DataFrame:
"""Fit and transform data."""
return self.fit(data).transform(data)
class NumericFeatureProcessor(FeatureProcessor):
"""Handles numeric feature processing."""
def __init__(self, columns: List[str]):
self.columns = columns
self.scaler = StandardScaler()
def fit(self, data: pd.DataFrame) -> 'NumericFeatureProcessor':
"""Fit scaler to numeric data."""
self.scaler.fit(data[self.columns])
return self
def transform(self, data: pd.DataFrame) -> pd.DataFrame:
"""Scale numeric features."""
data = data.copy()
data[self.columns] = self.scaler.transform(data[self.columns])
return data
class CategoricalFeatureProcessor(FeatureProcessor):
"""Handles categorical feature processing."""
def __init__(self, columns: List[str]):
self.columns = columns
self.encoders: Dict[str, LabelEncoder] = {}
def fit(self, data: pd.DataFrame) -> 'CategoricalFeatureProcessor':
"""Fit encoders to categorical data."""
for column in self.columns:
self.encoders[column] = LabelEncoder()
self.encoders[column].fit(data[column])
return self
def transform(self, data: pd.DataFrame) -> pd.DataFrame:
"""Encode categorical features."""
data = data.copy()
for column in self.columns:
data[column] = self.encoders[column].transform(data[column])
return data
class TextFeatureProcessor(FeatureProcessor):
"""Handles text feature processing."""
def __init__(self, columns: List[str]):
self.columns = columns
self.vectorizers: Dict[str, TfidfVectorizer] = {
col: TfidfVectorizer() for col in columns
}
def fit(self, data: pd.DataFrame) -> 'TextFeatureProcessor':
"""Fit vectorizers to text data."""
for column in self.columns:
self.vectorizers[column].fit(data[column].fillna(''))
return self
def transform(self, data: pd.DataFrame) -> pd.DataFrame:
"""Vectorize text features."""
data = data.copy()
for column in self.columns:
# Convert sparse matrix to dense and create new columns
vectorized = self.vectorizers[column].transform(data[column].fillna(''))
feature_names = self.vectorizers[column].get_feature_names_out()
# Create new column names
vector_cols = [f"{column}_{feat}" for feat in feature_names]
# Add vectorized features to dataframe
vector_df = pd.DataFrame(
vectorized.toarray(),
columns=vector_cols,
index=data.index
)
# Drop original column and add vectorized features
data = data.drop(column, axis=1)
data = pd.concat([data, vector_df], axis=1)
return data
class DataPipeline:
"""Manages the complete data processing pipeline."""
def __init__(self, loader: DataLoader):
self.loader = loader
self.processors: List[FeatureProcessor] = []
def add_processor(self, processor: FeatureProcessor) -> None:
"""Add a feature processor to the pipeline."""
self.processors.append(processor)
def process(self) -> pd.DataFrame:
"""Run the complete pipeline."""
# Load data
data = self.loader.load_data()
# Apply all processors in sequence
for processor in self.processors:
data = processor.fit_transform(data)
return data
# Usage example showing polymorphism in action
def process_customer_data(
data_source: str,
numeric_cols: List[str],
categorical_cols: List[str],
text_cols: List[str]
) -> pd.DataFrame:
"""Process customer data from various sources."""
# Choose appropriate loader based on data source
if data_source.endswith('.csv'):
loader = CSVLoader(
data_source,
expected_columns=numeric_cols + categorical_cols + text_cols
)
elif data_source.endswith('.json'):
loader = JSONLoader(
data_source,
expected_keys=numeric_cols + categorical_cols + text_cols
)
else:
raise ValueError(f"Unsupported data source: {data_source}")
# Create pipeline
pipeline = DataPipeline(loader)
# Add appropriate processors for each feature type
if numeric_cols:
pipeline.add_processor(NumericFeatureProcessor(numeric_cols))
if categorical_cols:
pipeline.add_processor(CategoricalFeatureProcessor(categorical_cols))
if text_cols:
pipeline.add_processor(TextFeatureProcessor(text_cols))
# Process the data
return pipeline.process()
# Example usage
customer_data = process_customer_data(
'customer_data.csv',
numeric_cols=['age', 'income', 'tenure'],
categorical_cols=['gender', 'location', 'segment'],
text_cols=['comments', 'feedback']
)
Remember: The key to effective polymorphism is designing clean, consistent interfaces that allow different implementations to be used interchangeably. This is especially important in data science pipelines where requirements and data sources often change.
SOLID Principles in Data Science
Single Responsibility Principle (SRP)
Think of SRP like different roles in a data science team. Just as you wouldn’t want one person to handle data cleaning, model training, deployment, AND business presentations, you shouldn’t have one class doing all these things.
Real-World Scenario
Imagine you’re building a customer churn prediction system. At first, it seems convenient to create one class that does everything:
- Loads customer data from various sources (CSV, databases, APIs)
- Handles missing values and outliers
- Engineers features
- Trains and validates models
- Generates PDF reports for stakeholders
- Saves models and results to different formats
The problems start emerging when:
- Your data engineer wants to modify how data is loaded from the database
- The business team requests changes to the PDF report format
- You need to add new feature engineering steps
- A team member wants to experiment with different model architectures
Each change requires modifying the same class, leading to:
- Merge conflicts when multiple team members work simultaneously
- Higher risk of breaking existing functionality
- Difficulty in testing individual components
- Code that’s hard to understand and maintain
Example Implementation
# Bad Example - Violating SRP
class ChurnPredictor:
def __init__(self, db_connection, model_path=None):
self.db = db_connection
self.data = None
self.model = self.load_model(model_path) if model_path else None
self.predictions = None
self.report_path = None
def load_data(self):
"""Loads data from multiple sources."""
sql_data = self.db.query("SELECT * FROM customers")
csv_data = pd.read_csv("additional_features.csv")
self.data = pd.merge(sql_data, csv_data, on="customer_id")
def preprocess_data(self):
"""Handles all preprocessing."""
# Fill missing values
self.data = self.data.fillna(self.data.mean())
# Handle outliers
z_scores = stats.zscore(self.data.select_dtypes(np.number))
self.data = self.data[(z_scores < 3).all(axis=1)]
# Feature engineering
self.data['account_age'] = (pd.Timestamp.now() -
pd.to_datetime(self.data['signup_date'])).dt.days
self.data['total_spend'] = self.data['monthly_spend'] * self.data['tenure']
def train_model(self):
"""Handles model training and validation."""
X = self.data.drop(['churned', 'customer_id'], axis=1)
y = self.data['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y)
self.model = RandomForestClassifier()
self.model.fit(X_train, y_train)
# Calculate and store metrics
self.metrics = {
'accuracy': accuracy_score(y_test, self.model.predict(X_test)),
'precision': precision_score(y_test, self.model.predict(X_test)),
'recall': recall_score(y_test, self.model.predict(X_test))
}
def generate_report(self):
"""Creates PDF report."""
plt.figure(figsize=(10, 6))
feature_imp = pd.DataFrame({
'feature': self.data.columns,
'importance': self.model.feature_importances_
}).sort_values('importance', ascending=False)
plt.bar(feature_imp['feature'][:10], feature_imp['importance'][:10])
plt.xticks(rotation=45)
plt.title("Top 10 Important Features")
plt.tight_layout()
# Save to PDF
plt.savefig("feature_importance.pdf")
def save_results(self):
"""Saves everything to different formats."""
# Save model
joblib.dump(self.model, 'model.joblib')
# Save predictions to CSV
pd.DataFrame({
'customer_id': self.data['customer_id'],
'churn_probability': self.predictions
}).to_csv('predictions.csv', index=False)
# Save metrics to JSON
with open('metrics.json', 'w') as f:
json.dump(self.metrics, f)
# Better Example - Following SRP
class DataLoader:
"""Responsible only for loading and combining data."""
def __init__(self, db_connection):
self.db = db_connection
def load_customer_data(self) -> pd.DataFrame:
"""Loads and combines data from all sources."""
sql_data = self.db.query("SELECT * FROM customers")
csv_data = pd.read_csv("additional_features.csv")
return pd.merge(sql_data, csv_data, on="customer_id")
class DataPreprocessor:
"""Handles all data preprocessing steps."""
def clean_data(self, data: pd.DataFrame) -> pd.DataFrame:
"""Handles missing values and outliers."""
# Create a copy to avoid modifying input
cleaned = data.copy()
# Fill missing values
numeric_cols = cleaned.select_dtypes(np.number).columns
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].mean())
# Remove outliers
z_scores = stats.zscore(cleaned[numeric_cols])
cleaned = cleaned[(z_scores < 3).all(axis=1)]
return cleaned
def engineer_features(self, data: pd.DataFrame) -> pd.DataFrame:
"""Creates new features."""
# Create a copy to avoid modifying input
featured = data.copy()
# Add new features
featured['account_age'] = (pd.Timestamp.now() -
pd.to_datetime(featured['signup_date'])).dt.days
featured['total_spend'] = featured['monthly_spend'] * featured['tenure']
return featured
class ModelTrainer:
"""Handles model training and evaluation."""
def __init__(self, model_type: str = 'random_forest'):
self.model_type = model_type
self.model = None
self.metrics = {}
def train(self, X: pd.DataFrame, y: pd.Series) -> None:
"""Trains the model."""
X_train, X_test, y_train, y_test = train_test_split(X, y)
if self.model_type == 'random_forest':
self.model = RandomForestClassifier()
else:
raise ValueError(f"Unknown model type: {self.model_type}")
self.model.fit(X_train, y_train)
self._calculate_metrics(X_test, y_test)
def _calculate_metrics(self, X_test: pd.DataFrame, y_test: pd.Series) -> None:
"""Calculates and stores performance metrics."""
predictions = self.model.predict(X_test)
self.metrics = {
'accuracy': accuracy_score(y_test, predictions),
'precision': precision_score(y_test, predictions),
'recall': recall_score(y_test, predictions)
}
def get_feature_importance(self, feature_names: List[str]) -> pd.DataFrame:
"""Returns feature importance data."""
if not self.model:
raise ValueError("Model not trained yet")
return pd.DataFrame({
'feature': feature_names,
'importance': self.model.feature_importances_
}).sort_values('importance', ascending=False)
class ReportGenerator:
"""Handles all reporting functionality."""
def generate_pdf_report(
self,
feature_importance: pd.DataFrame,
metrics: Dict[str, float],
output_path: str
) -> None:
"""Generates PDF report with visualizations."""
plt.figure(figsize=(10, 6))
plt.bar(feature_importance['feature'][:10],
feature_importance['importance'][:10])
plt.xticks(rotation=45)
plt.title("Top 10 Important Features")
plt.tight_layout()
plt.savefig(output_path)
# Could add more visualizations and metrics
class ResultSaver:
"""Handles saving all outputs."""
def save_model(self, model: BaseEstimator, path: str) -> None:
"""Saves the trained model."""
joblib.dump(model, path)
def save_predictions(
self,
customer_ids: np.ndarray,
predictions: np.ndarray,
path: str
) -> None:
"""Saves predictions to CSV."""
pd.DataFrame({
'customer_id': customer_ids,
'churn_probability': predictions
}).to_csv(path, index=False)
def save_metrics(self, metrics: Dict[str, float], path: str) -> None:
"""Saves metrics to JSON."""
with open(path, 'w') as f:
json.dump(metrics, f)
# Usage showing how the classes work together
def run_churn_prediction_pipeline(db_connection: DBConnection) -> None:
"""Orchestrates the churn prediction pipeline."""
# Load data
loader = DataLoader(db_connection)
raw_data = loader.load_customer_data()
# Preprocess data
preprocessor = DataPreprocessor()
cleaned_data = preprocessor.clean_data(raw_data)
featured_data = preprocessor.engineer_features(cleaned_data)
# Prepare features and target
X = featured_data.drop(['churned', 'customer_id'], axis=1)
y = featured_data['churned']
# Train model
trainer = ModelTrainer('random_forest')
trainer.train(X, y)
# Generate reports
feature_importance = trainer.get_feature_importance(X.columns)
report_gen = ReportGenerator()
report_gen.generate_pdf_report(
feature_importance,
trainer.metrics,
'churn_report.pdf'
)
# Save results
saver = ResultSaver()
saver.save_model(trainer.model, 'churn_model.joblib')
saver.save_metrics(trainer.metrics, 'metrics.json')
saver.save_predictions(
featured_data['customer_id'],
trainer.model.predict_proba(X)[:, 1],
'predictions.csv'
)
This refactored version demonstrates several benefits:
- Each class has a single, clear responsibility
- Changes to one aspect (e.g., reporting) don’t affect others
- Easy to test each component independently
- Easy to modify or extend individual components
- Clear dependencies between components
- Code is more organized and maintainable
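To make the "easy to test" benefit concrete, here is a minimal pytest-style sketch that exercises the DataPreprocessor class on its own - no database, model, or report involved. The churn.preprocessing import path is hypothetical; adjust it to wherever the class lives in your project.
import numpy as np
import pandas as pd
from churn.preprocessing import DataPreprocessor  # hypothetical module path

def test_clean_data_fills_missing_numeric_values():
    raw = pd.DataFrame({
        'monthly_spend': [10.0, np.nan, 30.0],
        'tenure': [1, 2, 3],
    })
    cleaned = DataPreprocessor().clean_data(raw)
    # Missing values are replaced with the column mean (20.0 here)
    assert cleaned['monthly_spend'].isna().sum() == 0
    assert cleaned.loc[1, 'monthly_spend'] == 20.0
    # The original input is left untouched because clean_data works on a copy
    assert raw['monthly_spend'].isna().sum() == 1
Because the class has no other responsibilities, the test needs nothing but a tiny in-memory DataFrame.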
Open/Closed Principle (OCP)
Think of OCP like building a feature engineering pipeline that you want to extend without modifying existing code. Imagine you’re maintaining a data transformation system for your team - when someone wants to add a new type of transformation, they should be able to do so without changing the code that’s already tested and working in production.
Real-World Scenario
You’re working on a large data science project where the feature engineering pipeline currently handles:
- Numeric feature scaling
- Missing value imputation
- Outlier handling
- Categorical encoding
Then new requirements start coming in:
- A colleague wants to add polynomial features for certain variables
- Another team member needs to add custom domain-specific transformations
- The business team requests special handling for time-based features
- You need to add feature selection based on correlation
Without OCP, you’d need to:
- Modify the existing transformation code each time
- Risk breaking the working pipeline
- Retest everything after each change
- Deal with an increasingly complex codebase
Implementation Example
# Bad Example - Violating OCP
class FeatureTransformer:
"""This violates OCP - need to modify code for each new transformation."""
def __init__(self):
self.scaler = StandardScaler()
self.imputer = SimpleImputer()
def transform_features(self, data: pd.DataFrame, transform_type: str) -> pd.DataFrame:
if transform_type == 'scale':
return pd.DataFrame(
self.scaler.fit_transform(data),
columns=data.columns
)
elif transform_type == 'impute':
return pd.DataFrame(
self.imputer.fit_transform(data),
columns=data.columns
)
elif transform_type == 'polynomial':
# Need to modify this file to add new transformations!
poly = PolynomialFeatures(degree=2)
return pd.DataFrame(
poly.fit_transform(data),
columns=[f"poly_{i}" for i in range(poly.n_output_features_)]
)
else:
raise ValueError(f"Unknown transformation: {transform_type}")
# Better Example - Following OCP
from abc import ABC, abstractmethod
from typing import List, Dict, Optional
class FeatureTransformation(ABC):
"""Abstract base class for all transformations."""
@abstractmethod
def fit(self, X: pd.DataFrame) -> 'FeatureTransformation':
"""Fit the transformation to the data."""
pass
@abstractmethod
def transform(self, X: pd.DataFrame) -> pd.DataFrame:
"""Apply the transformation to the data."""
pass
@abstractmethod
def get_feature_names(self, input_features: List[str]) -> List[str]:
"""Get names of transformed features."""
pass
class StandardScalerTransformation(FeatureTransformation):
"""Standardizes numeric features."""
def __init__(self, columns: Optional[List[str]] = None):
self.columns = columns
self.scaler = StandardScaler()
self._fitted_columns = None
def fit(self, X: pd.DataFrame) -> 'StandardScalerTransformation':
"""Fit scaler to data."""
self._fitted_columns = self.columns or X.select_dtypes(include=[np.number]).columns
self.scaler.fit(X[self._fitted_columns])
return self
def transform(self, X: pd.DataFrame) -> pd.DataFrame:
"""Apply scaling transformation."""
X_copy = X.copy()
X_copy[self._fitted_columns] = self.scaler.transform(X[self._fitted_columns])
return X_copy
def get_feature_names(self, input_features: List[str]) -> List[str]:
"""Return scaled feature names."""
return [f"scaled_{col}" for col in self._fitted_columns]
class PolynomialTransformation(FeatureTransformation):
"""Creates polynomial features."""
def __init__(self, degree: int = 2, columns: Optional[List[str]] = None):
self.degree = degree
self.columns = columns
self.poly = PolynomialFeatures(degree=degree)
self._fitted_columns = None
self._feature_names = None
def fit(self, X: pd.DataFrame) -> 'PolynomialTransformation':
"""Fit polynomial features."""
self._fitted_columns = self.columns or X.select_dtypes(include=[np.number]).columns
self.poly.fit(X[self._fitted_columns])
return self
def transform(self, X: pd.DataFrame) -> pd.DataFrame:
"""Generate polynomial features."""
poly_features = self.poly.transform(X[self._fitted_columns])
feature_names = self.get_feature_names(self._fitted_columns)
# Create new dataframe with original and polynomial features
X_copy = X.copy()
poly_df = pd.DataFrame(poly_features, columns=feature_names, index=X.index)
return pd.concat([X_copy, poly_df], axis=1)
def get_feature_names(self, input_features: List[str]) -> List[str]:
"""Get polynomial feature names."""
return [f"poly_{i}" for i in range(self.poly.n_output_features_)]
class OutlierTransformation(FeatureTransformation):
"""Handles outliers using IQR method."""
def __init__(self, columns: Optional[List[str]] = None, threshold: float = 1.5):
self.columns = columns
self.threshold = threshold
self.bounds = {}
def fit(self, X: pd.DataFrame) -> 'OutlierTransformation':
"""Calculate outlier bounds."""
self._fitted_columns = self.columns or X.select_dtypes(include=[np.number]).columns
for column in self._fitted_columns:
Q1 = X[column].quantile(0.25)
Q3 = X[column].quantile(0.75)
IQR = Q3 - Q1
self.bounds[column] = {
'lower': Q1 - self.threshold * IQR,
'upper': Q3 + self.threshold * IQR
}
return self
def transform(self, X: pd.DataFrame) -> pd.DataFrame:
"""Cap outliers to bounds."""
X_copy = X.copy()
for column in self._fitted_columns:
bounds = self.bounds[column]
X_copy[column] = X_copy[column].clip(bounds['lower'], bounds['upper'])
return X_copy
def get_feature_names(self, input_features: List[str]) -> List[str]:
"""Return feature names."""
return [f"outlier_handled_{col}" for col in self._fitted_columns]
class FeatureTransformationPipeline:
"""Pipeline that can accommodate any number of transformations."""
def __init__(self):
self.transformations: List[FeatureTransformation] = []
self.feature_names: List[str] = []
def add_transformation(self, transformation: FeatureTransformation) -> None:
"""Add a new transformation to the pipeline."""
self.transformations.append(transformation)
def fit_transform(self, X: pd.DataFrame) -> pd.DataFrame:
"""Apply all transformations in sequence."""
result = X.copy()
self.feature_names = list(X.columns)
for transformation in self.transformations:
transformation.fit(result)
result = transformation.transform(result)
self.feature_names.extend(
transformation.get_feature_names(self.feature_names)
)
return result
# Usage showing extensibility
def prepare_features(data: pd.DataFrame) -> pd.DataFrame:
"""Prepare features using various transformations."""
pipeline = FeatureTransformationPipeline()
# Add standard transformations
pipeline.add_transformation(StandardScalerTransformation())
pipeline.add_transformation(OutlierTransformation())
# Easily add new transformation without changing existing code
pipeline.add_transformation(
PolynomialTransformation(degree=2, columns=['age', 'income'])
)
return pipeline.fit_transform(data)
# Adding new functionality is just creating a new transformer
class TimeFeatureTransformation(FeatureTransformation):
"""Extracts features from datetime columns."""
def __init__(self, datetime_columns: List[str]):
self.datetime_columns = datetime_columns
def fit(self, X: pd.DataFrame) -> 'TimeFeatureTransformation':
"""Nothing to fit."""
return self
def transform(self, X: pd.DataFrame) -> pd.DataFrame:
"""Extract time-based features."""
X_copy = X.copy()
for column in self.datetime_columns:
if not pd.api.types.is_datetime64_any_dtype(X_copy[column]):
X_copy[column] = pd.to_datetime(X_copy[column])
X_copy[f"{column}_hour"] = X_copy[column].dt.hour
X_copy[f"{column}_day"] = X_copy[column].dt.day
X_copy[f"{column}_month"] = X_copy[column].dt.month
X_copy[f"{column}_day_of_week"] = X_copy[column].dt.dayofweek
return X_copy
def get_feature_names(self, input_features: List[str]) -> List[str]:
"""Get names of time-based features."""
features = []
for col in self.datetime_columns:
features.extend([
f"{col}_hour",
f"{col}_day",
f"{col}_month",
f"{col}_day_of_week"
])
return features
# Example usage with new transformer
pipeline = FeatureTransformationPipeline()
pipeline.add_transformation(StandardScalerTransformation())
pipeline.add_transformation(OutlierTransformation())
pipeline.add_transformation(TimeFeatureTransformation(['transaction_date']))
# Process features
processed_data = pipeline.fit_transform(raw_data)
Liskov Substitution Principle (LSP)
Imagine you’re building a machine learning system that processes customer data. You have a prediction API that expects all models to behave the same way, regardless of whether they’re simple scikit-learn models or complex deep learning ones.
Real-World Scenario
Your team built an API endpoint /predict that accepts customer features and returns churn predictions. Initially, it worked with a simple logistic regression model:
response = model.predict_proba(customer_features)
churn_risk = response[:, 1] # Get probability of churn
Then things got complicated:
- The deep learning team created a better model, but it returns probabilities differently
- AutoML tools produced models with different prediction methods
- Your custom ensemble model needs special preprocessing
Without LSP:
- Different model types require different handling in your API code
- You need multiple endpoints or complex if/else logic
- Code becomes brittle and hard to maintain
- Testing becomes complicated
- Adding new model types requires modifying existing code
With LSP:
- All models follow the same contract regardless of their internal implementation
- Your API code remains clean and simple
- Models can be swapped without changing the surrounding system
- Testing is straightforward
- Adding new model types is just creating a new class
Why LSP Matters
Here’s a common violation of LSP in data science:
# Violates LSP - different interfaces for different model types
class SklearnModel:
def fit(self, X: pd.DataFrame, y: pd.Series) -> None:
self.model.fit(X, y)
def predict(self, X: pd.DataFrame) -> np.ndarray:
return self.model.predict(X)
class KerasModel:
def train(self, X: pd.DataFrame, y: pd.Series, epochs: int = 10) -> None:
# Different method name and signature
self.model.fit(X.values, y.values, epochs=epochs)
def infer(self, X: pd.DataFrame) -> np.ndarray:
# Different method name
return self.model.predict(X.values)
class OnlineModel:
def update(self, X: pd.DataFrame, y: pd.Series) -> None:
# Completely different interface
for x, y_true in zip(X.values, y.values):
self.model.partial_fit(x.reshape(1, -1), [y_true])
def predict_one(self, x: np.ndarray) -> float:
# Incompatible interface
return self.model.predict(x.reshape(1, -1))[0]
# This leads to complex, conditional code
def train_model(model, X, y):
if isinstance(model, SklearnModel):
model.fit(X, y)
elif isinstance(model, KerasModel):
model.train(X, y)
elif isinstance(model, OnlineModel):
model.update(X, y)
else:
raise ValueError("Unknown model type")
Better Approach - Following LSP
First, one more subtle violation worth recognizing: subclasses that silently change the input and output assumptions of the base class.
# Bad Example - Violates LSP
class MLModel:
def train(self, X, y):
pass
def predict(self, X):
pass
class SklearnModel(MLModel):
def __init__(self, model):
self.model = model
def train(self, X: pd.DataFrame, y: pd.Series):
self.model.fit(X, y)
def predict(self, X: pd.DataFrame):
return self.model.predict(X)
class DeepLearningModel(MLModel):
def __init__(self, model):
self.model = model
def train(self, X: pd.DataFrame, y: pd.Series):
# Violates LSP: Changes input assumptions
X = torch.tensor(X.values).float()
y = torch.tensor(y.values).float()
self.model.fit(X, y) # Expects tensors, not pandas objects
def predict(self, X: pd.DataFrame):
# Violates LSP: Changes output format
X = torch.tensor(X.values).float()
predictions = self.model.predict(X)
return predictions.numpy() # Returns numpy array instead of pandas
Now the LSP-compliant version: every wrapper keeps the same pandas-in, pandas-out contract, so callers never need to know which library sits underneath.
# Better Example - Following LSP
from abc import ABC, abstractmethod
import pandas as pd
import numpy as np
import torch
from sklearn.ensemble import RandomForestClassifier
from typing import Union, Tuple
class MLModel(ABC):
"""Base class defining the contract for all ML models"""
@abstractmethod
def train(self, X: pd.DataFrame, y: pd.Series) -> None:
"""Train the model on pandas DataFrame/Series"""
pass
@abstractmethod
def predict(self, X: pd.DataFrame) -> pd.Series:
"""Make predictions, always return pandas Series"""
pass
@abstractmethod
def predict_proba(self, X: pd.DataFrame) -> pd.DataFrame:
"""Get probabilities, always return pandas DataFrame"""
pass
class SklearnModelWrapper(MLModel):
"""Wrapper for scikit-learn models"""
def __init__(self, model: RandomForestClassifier):
self.model = model
self._feature_names = None
def train(self, X: pd.DataFrame, y: pd.Series) -> None:
"""Maintains pandas interface"""
self._feature_names = X.columns
self.model.fit(X, y)
def predict(self, X: pd.DataFrame) -> pd.Series:
"""Always returns pandas Series with index matching input"""
predictions = self.model.predict(X)
return pd.Series(predictions, index=X.index)
def predict_proba(self, X: pd.DataFrame) -> pd.DataFrame:
"""Always returns pandas DataFrame with probabilities"""
probs = self.model.predict_proba(X)
return pd.DataFrame(
probs,
index=X.index,
columns=self.model.classes_
)
class TorchModelWrapper(MLModel):
"""Wrapper for PyTorch models"""
def __init__(self, model: torch.nn.Module):
self.model = model
self._feature_names = None
self.classes_ = None
def _convert_to_tensor(self, X: pd.DataFrame) -> torch.Tensor:
"""Handle data conversion internally"""
return torch.tensor(X.values).float()
def train(self, X: pd.DataFrame, y: pd.Series) -> None:
"""Maintains same interface as other models"""
self._feature_names = X.columns
self.classes_ = sorted(y.unique())
# Handle conversion internally
X_tensor = self._convert_to_tensor(X)
y_tensor = torch.tensor(y.values).float()
# Training loop (forward pass, loss computation, optimizer steps) would go here;
# a plain torch.nn.Module has no fit() method of its own
def predict(self, X: pd.DataFrame) -> pd.Series:
"""Returns pandas Series like other models"""
X_tensor = self._convert_to_tensor(X)
with torch.no_grad():
predictions = self.model(X_tensor)
pred_labels = predictions.argmax(dim=1).numpy()
return pd.Series(
[self.classes_[i] for i in pred_labels],
index=X.index
)
def predict_proba(self, X: pd.DataFrame) -> pd.DataFrame:
"""Returns probabilities in same format as sklearn"""
X_tensor = self._convert_to_tensor(X)
with torch.no_grad():
probs = torch.softmax(self.model(X_tensor), dim=1).numpy()
return pd.DataFrame(
probs,
index=X.index,
columns=self.classes_
)
# The magic of LSP: All models work the same way in your pipeline
def evaluate_model(model: MLModel, X_test: pd.DataFrame, y_test: pd.Series) -> dict:
"""Works with ANY model that follows the contract"""
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
return {
'accuracy': accuracy_score(y_test, predictions),
'roc_auc': roc_auc_score(y_test, probabilities)
}
# Usage showing true substitutability
models = {
'sklearn': SklearnModelWrapper(RandomForestClassifier()),
'pytorch': TorchModelWrapper(torch.nn.Sequential(...)),
'custom': CustomModelWrapper(MyCustomModel())
}
results = {}
for name, model in models.items():
# Train with same interface
model.train(X_train, y_train)
# Evaluate with same interface
results[name] = evaluate_model(model, X_test, y_test)
# All models can be used interchangeably in production
selected_model = models['sklearn'] # Could be any model type
prediction_service = PredictionService(selected_model) # Works with any model
Remember: The key to following LSP in data science is ensuring that all specialized types (different model implementations) can be used anywhere the base type (generic model interface) is expected, without breaking the application’s behavior.
Interface Segregation Principle (ISP)
Think of ISP like a modular data science toolkit. Not every project needs every tool. Some projects might need data cleaning but not visualization, others might need model training but not deployment capabilities. ISP suggests creating smaller, focused interfaces rather than one giant interface that does everything.
Real-World Scenario
You’re creating a data processing library for your organization. Different teams have different needs:
- The research team only needs data loading and transformation
- The production team needs data validation and database connectivity
- The visualization team needs plotting and reporting capabilities
Without ISP, every team would need to implement all functionality, even parts they don’t use. With ISP, teams only implement what they need – like choosing specific packages from PyPI instead of installing the entire scientific Python stack.
The ISP states that clients should not be forced to depend on interfaces they don’t use. In data science, this often applies to data processing and model interfaces.
Common Violation:
# Violates ISP - forces classes to implement methods they don't need
class Database:
def store_data(self, data: pd.DataFrame) -> None:
pass
def connect(self, uri: str) -> None:
pass
def disconnect(self) -> None:
pass
def validate_schema(self, data: pd.DataFrame) -> bool:
pass
# Classes implementing Database are forced to implement
# all methods, even if they don't need connection handling
class InMemoryDatabase(Database):
def store_data(self, data: pd.DataFrame) -> None:
# Can store data
self.data = data
def connect(self, uri: str) -> None:
# Doesn't need connection but forced to implement
pass
def disconnect(self) -> None:
# Doesn't need disconnection but forced to implement
pass
def validate_schema(self, data: pd.DataFrame) -> bool:
return True # Might need this
Better Approach - Following ISP:
# Split interfaces based on functionality
class DataStorage:
"""Interface for data storage operations."""
def store_data(self, data: pd.DataFrame) -> None:
"""Stores data."""
pass
class DatabaseConnection:
"""Interface for database connection operations."""
def connect(self, uri: str) -> None:
"""Establishes connection."""
pass
def disconnect(self) -> None:
"""Closes connection."""
pass
class SchemaValidator:
"""Interface for schema validation."""
def validate_schema(self, data: pd.DataFrame) -> bool:
"""Validates data schema."""
pass
# Now classes can implement only what they need
class InMemoryStorage(DataStorage):
"""Simple in-memory storage."""
def __init__(self):
self.data = None
def store_data(self, data: pd.DataFrame) -> None:
self.data = data.copy()
class SQLDatabase(DataStorage, DatabaseConnection, SchemaValidator):
"""Full database implementation needing all functionality."""
def __init__(self, expected_columns: Optional[List[str]] = None):
self.connection = None
self.schema = expected_columns or []  # columns required by validate_schema
def connect(self, uri: str) -> None:
self.connection = create_engine(uri)
def disconnect(self) -> None:
if self.connection:
self.connection.dispose()
def store_data(self, data: pd.DataFrame) -> None:
if self.validate_schema(data):
data.to_sql('table_name', self.connection)
def validate_schema(self, data: pd.DataFrame) -> bool:
return all(col in data.columns for col in self.schema)
# Usage showing flexibility of segregated interfaces
def store_training_data(
storage: DataStorage,
data: pd.DataFrame
) -> None:
"""Only needs data storage functionality."""
storage.store_data(data)
def process_database_data(
db: SQLDatabase,  # needs connection handling, schema validation, and storage - not just one of them
uri: str,
data: pd.DataFrame
) -> None:
"""Needs full database functionality."""
db.connect(uri)
if db.validate_schema(data):
db.store_data(data)
db.disconnect()
# Can use either implementation as needed
in_memory = InMemoryStorage()
sql_db = SQLDatabase()
store_training_data(in_memory, data) # Works with simple storage
process_database_data(sql_db, "postgresql://...", data) # Works with full database
Dependency Inversion Principle (DIP)
Think of DIP like scikit-learn’s Pipeline class. It doesn’t care about the specific preprocessors or models you use – it works with any estimator that follows the right interface. High-level components (like Pipeline) depend on abstractions, not concrete implementations.
Real-World Scenario
You’re building an automated machine learning system that needs to:
- Try different preprocessors (StandardScaler, RobustScaler, etc.)
- Test various models (RandomForest, XGBoost, etc.)
- Use different validation strategies (cross-validation, holdout, etc.)
Without DIP, your system would be tightly coupled to specific implementations. With DIP, it works with any components that follow the right interfaces – just like how scikit-learn’s GridSearchCV works with any estimator. This makes it easy to:
- Add new preprocessing methods
- Try new models
- Implement custom validation strategies
The DIP states that high-level modules should not depend on low-level modules; both should depend on abstractions. Let’s see how this applies in data science pipelines.
Common Violation:
# Violates DIP - high-level pipeline depends on concrete implementations
class ModelPipeline:
def __init__(self):
# Direct dependencies on concrete classes
self.preprocessor = StandardScaler()
self.model = RandomForestClassifier()
self.validator = CrossValidator()
def run_pipeline(self, data: pd.DataFrame, target: pd.Series) -> Dict[str, float]:
# Tightly coupled to specific implementations
scaled_data = self.preprocessor.fit_transform(data)
self.model.fit(scaled_data, target)
return self.validator.validate(self.model, scaled_data, target)
Better Approach - Following DIP:
# Define abstractions
from typing import Protocol
class Preprocessor(Protocol):
def fit_transform(self, data: pd.DataFrame) -> pd.DataFrame:
...
class Model(Protocol):
def fit(self, X: pd.DataFrame, y: pd.Series) -> None:
...
def predict(self, X: pd.DataFrame) -> np.ndarray:
...
class Validator(Protocol):
def validate(
self,
model: Model,
X: pd.DataFrame,
y: pd.Series
) -> Dict[str, float]:
...
# Concrete implementations depend on abstractions
class StandardScalerPreprocessor:
"""Concrete preprocessor implementation."""
def __init__(self):
self.scaler = StandardScaler()
def fit_transform(self, data: pd.DataFrame) -> pd.DataFrame:
return pd.DataFrame(
self.scaler.fit_transform(data),
columns=data.columns
)
class RandomForestModel:
"""Concrete model implementation."""
def __init__(self):
self.model = RandomForestClassifier()
def fit(self, X: pd.DataFrame, y: pd.Series) -> None:
self.model.fit(X, y)
def predict(self, X: pd.DataFrame) -> np.ndarray:
return self.model.predict(X)
class CrossValidationValidator:
"""Concrete validator implementation."""
def validate(
self,
model: Model,
X: pd.DataFrame,
y: pd.Series
) -> Dict[str, float]:
cv_scores = cross_val_score(model, X, y, cv=5)
return {
'mean_cv_score': cv_scores.mean(),
'std_cv_score': cv_scores.std()
}
# High-level module depends on abstractions
class ModelPipeline:
"""Pipeline depending on abstractions, not concrete implementations."""
def __init__(
self,
preprocessor: Preprocessor,
model: Model,
validator: Validator
):
self.preprocessor = preprocessor
self.model = model
self.validator = validator
def run_pipeline(self, data: pd.DataFrame, target: pd.Series) -> Dict[str, float]:
"""Runs the pipeline using abstractions."""
processed_data = self.preprocessor.fit_transform(data)
self.model.fit(processed_data, target)
return self.validator.validate(self.model, processed_data, target)
# Usage showing flexibility and loose coupling
# Can easily swap implementations without changing pipeline
standard_pipeline = ModelPipeline(
preprocessor=StandardScalerPreprocessor(),
model=RandomForestModel(),
validator=CrossValidationValidator()
)
# Could easily create alternative pipeline with different implementations
robust_pipeline = ModelPipeline(
preprocessor=RobustScalerPreprocessor(), # Different preprocessor
model=XGBoostModel(), # Different model
validator=BootstrapValidator() # Different validator
)
Clean Code Checklist for Data Scientists
Using the Clean Code Checklist: An Iterative Approach
This checklist is not meant to be a strict set of requirements that must all be met before committing code. Instead, it serves as a guide for progressive improvement after you’ve got your code working.
When to Use This Checklist
- ✅ After your initial code is functioning correctly
- ✅ During code review sessions
- ✅ When revisiting older notebooks or scripts
- ✅ Before sharing code with teammates
How to Use It
1. Don't Try to Perfect Everything at Once
- Pick 1-2 items to focus on each time you revisit your code
- Start with the most impactful improvements for your specific situation
2. Progressive Enhancement
- Each time you interact with your code, make it a bit cleaner
- Focus on areas you're actively modifying
- Gradually improve naming, documentation, and structure
3. Practical Approach
# Initial working version
def p(d):
return d.fillna(0)
# First improvement: Better naming
def process_missing_values(data):
return data.fillna(0)
# Later improvement: Add type hints and documentation
def process_missing_values(data: pd.DataFrame) -> pd.DataFrame:
"""Fill missing values with zeros in the dataset."""
return data.fillna(0)
Remember: The goal is continuous improvement, not perfection. Use this checklist as a reference for making incremental enhancements to your code over time.
1. Naming 🏷️
Variables and DataFrames:
- Uses descriptive nouns (e.g., customer_data instead of df)
- Boolean variables start with is_, has_, or similar (e.g., is_outlier)
- DataFrame names indicate their content (e.g., raw_sales_data, cleaned_features)
- Follows Python naming convention (snake_case)
- Avoids abbreviations (e.g., customer_count instead of cust_cnt)
Functions:
- Uses verbs that describe the action (e.g., calculate_mean_return instead of mean_ret)
- Name reflects the level of abstraction (e.g., train_model vs fit_random_forest)
- Clearly indicates any data modifications (e.g., normalize_features vs process_features)
Classes:
- Uses nouns describing the entity (e.g., DataCleaner, ModelEvaluator)
- Names reflect single responsibility (e.g., OutlierDetector instead of DataHandler)
2. Function Design 🔧
Structure:
- Each function does one thing
- Function length is reasonable (typically < 50 lines)
- Returns clear, consistent data types
- Uses type hints for parameters and return values
- Includes docstrings with examples for complex functions
Parameters:
- Limits number of parameters (ideally ≤ 3)
- Uses dataclasses or configuration objects for multiple parameters
- Provides default values where appropriate
- Validates input parameters
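For the dataclass/configuration item above, a small sketch (all parameter names are illustrative): grouping related settings into one typed object keeps the signature short and self-documenting.
from dataclasses import dataclass
from typing import Optional

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

@dataclass
class TrainingConfig:
    """Groups related training parameters instead of passing them one by one."""
    target_column: str = 'churned'
    n_estimators: int = 100
    max_depth: Optional[int] = None
    random_state: int = 42

def train_churn_model(data: pd.DataFrame, config: TrainingConfig) -> RandomForestClassifier:
    """Trains a classifier from a single, explicit configuration object."""
    features = data.drop(columns=[config.target_column])
    target = data[config.target_column]
    model = RandomForestClassifier(
        n_estimators=config.n_estimators,
        max_depth=config.max_depth,
        random_state=config.random_state,
    )
    model.fit(features, target)
    return model
A call like train_churn_model(customer_data, TrainingConfig(n_estimators=300)) then reads almost like documentation.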
3. Code Organization 📁
Notebook Structure:
- Separates imports, configurations, and main code
- Groups related cells together
- Includes markdown documentation between logical sections
- Moves reusable functions to separate modules
Script Structure:
- Uses clear section separation
- Follows a logical flow (e.g., data loading → preprocessing → modeling)
- Places utility functions in separate modules
- Uses if __name__ == '__main__' for script execution
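As a sketch of that script structure - imports at the top, a main function, and the __main__ guard - reusing the hypothetical churn package and the illustrative 'churned' target column:
"""Train the churn model from the command line."""
import logging

import pandas as pd

from churn.preprocessing import DataPreprocessor  # hypothetical project modules
from churn.training import ModelTrainer

logger = logging.getLogger(__name__)

def main(data_path: str = 'customer_data.csv') -> None:
    """Load data, preprocess it, and train the model."""
    raw_data = pd.read_csv(data_path)
    cleaned_data = DataPreprocessor().clean_data(raw_data)
    trainer = ModelTrainer('random_forest')
    trainer.train(cleaned_data.drop(columns=['churned']), cleaned_data['churned'])
    logger.info("Training metrics: %s", trainer.metrics)

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    main()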
4. Error Handling and Data Validation ⚠️
- Validates input data early
- Uses appropriate error types
- Includes informative error messages
- Handles missing values explicitly
- Checks for data leakage in preprocessing steps
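A small sketch of what validating early can look like; the required columns are of course project-specific:
import pandas as pd

REQUIRED_COLUMNS = ['customer_id', 'monthly_spend', 'tenure']  # illustrative schema

def validate_customer_data(data: pd.DataFrame) -> None:
    """Fail fast with a clear message instead of deep inside the pipeline."""
    if data.empty:
        raise ValueError("Input DataFrame is empty - check the upstream data load")
    missing = [col for col in REQUIRED_COLUMNS if col not in data.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")
    if data['customer_id'].duplicated().any():
        raise ValueError("Duplicate customer_id values found - deduplicate before preprocessing")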
5. Comments and Documentation 📝
- Includes docstrings for functions and classes
- Documents complex algorithms or business logic
- Explains the ‘why’ not the ‘what’
- Removes commented-out code
- Uses TODO comments sparingly and meaningfully
6. Code Style and Formatting 🎨
- Follows PEP 8 guidelines
- Uses consistent indentation
- Keeps lines at reasonable length (≤ 88 characters)
- Uses blank lines to separate logical sections
- Aligns related code elements
7. Data Science Specific 🔬
Feature Engineering:
- Uses descriptive feature names
- Documents feature transformations
- Maintains feature creation reproducibility
- Tracks feature dependencies
Model Development:
- Sets random seeds for reproducibility
- Separates model training from evaluation
- Documents model parameters and reasoning
- Implements cross-validation appropriately
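A minimal sketch of the reproducibility and separation items, assuming a 'churned' target column: seeds are defined in one place and passed explicitly, and training and evaluation live in separate functions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)  # covers plain numpy sampling elsewhere in the workflow

def train_model(features: pd.DataFrame, target: pd.Series) -> RandomForestClassifier:
    """Training only - evaluation lives in its own function."""
    model = RandomForestClassifier(random_state=RANDOM_SEED)
    model.fit(features, target)
    return model

def evaluate_model(model: RandomForestClassifier, features: pd.DataFrame, target: pd.Series) -> float:
    """Evaluation only, so it can be reused for any trained model."""
    return accuracy_score(target, model.predict(features))

def run_experiment(data: pd.DataFrame) -> float:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(columns=['churned']), data['churned'],
        test_size=0.2, random_state=RANDOM_SEED,  # reproducible split
    )
    model = train_model(X_train, y_train)
    return evaluate_model(model, X_test, y_test)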
Pipeline Design:
- Creates modular transformation steps
- Handles categorical and numerical features separately
- Prevents data leakage
- Makes pipelines serializable
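These items map almost directly onto scikit-learn's Pipeline and ColumnTransformer: numeric and categorical columns get their own steps, fitting only on the training split keeps test-set statistics out of the transformers, and the fitted object serializes as one unit. A sketch with illustrative column names:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_columns = ['age', 'income', 'tenure']            # illustrative columns
categorical_columns = ['gender', 'location', 'segment']

preprocessing = ColumnTransformer(transformers=[
    ('numeric', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), numeric_columns),
    ('categorical', OneHotEncoder(handle_unknown='ignore'), categorical_columns),
])

churn_pipeline = Pipeline([
    ('preprocess', preprocessing),
    ('model', RandomForestClassifier(random_state=42)),
])

# Fit on the training split only, then persist the whole fitted pipeline:
# churn_pipeline.fit(X_train, y_train)
# joblib.dump(churn_pipeline, 'churn_pipeline.joblib')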
8. Testing and Validation 🧪
- Includes basic unit tests for critical functions
- Validates data preprocessing steps
- Checks model performance metrics
- Tests edge cases in data transformations
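Edge cases in transformations are where silent bugs hide, so even a couple of targeted tests pay off. A pytest sketch around a small illustrative helper:
import pandas as pd
import pytest

def clip_outliers(values: pd.Series, lower: float, upper: float) -> pd.Series:
    """Small example transformation: cap values to a fixed range."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    return values.clip(lower, upper)

def test_clip_outliers_caps_extreme_values():
    result = clip_outliers(pd.Series([-10.0, 0.5, 99.0]), lower=0.0, upper=1.0)
    assert result.tolist() == [0.0, 0.5, 1.0]

def test_clip_outliers_handles_empty_series():
    result = clip_outliers(pd.Series([], dtype=float), lower=0.0, upper=1.0)
    assert result.empty

def test_clip_outliers_rejects_inverted_bounds():
    with pytest.raises(ValueError):
        clip_outliers(pd.Series([1.0]), lower=1.0, upper=0.0)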
9. Version Control 🔄
- Uses clear, descriptive commit messages
- Separates model iterations in version control
- Tracks dependencies (e.g., requirements.txt or environment.yml)
- Documents environment setup
10. Performance and Efficiency ⚡
- Avoids unnecessary data copies
- Uses appropriate data types
- Implements efficient data transformations
- Considers memory usage for large datasets
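A quick sketch of the usual memory wins - categorical dtypes for low-cardinality strings and downcast numerics:
import pandas as pd

def optimize_dtypes(data: pd.DataFrame) -> pd.DataFrame:
    """Reduce memory usage by choosing tighter dtypes; returns a new DataFrame."""
    optimized = data.copy()
    for column in optimized.columns:
        col = optimized[column]
        # Low-cardinality strings are far smaller as categoricals
        if col.dtype == object and col.nunique() < 0.5 * len(col):
            optimized[column] = col.astype('category')
        # Downcast numerics to the smallest type that still fits the values
        elif pd.api.types.is_integer_dtype(col):
            optimized[column] = pd.to_numeric(col, downcast='integer')
        elif pd.api.types.is_float_dtype(col):
            optimized[column] = pd.to_numeric(col, downcast='float')
    return optimized

# Compare data.memory_usage(deep=True).sum() before and after to see the effect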
Use this checklist before committing code or when reviewing existing code. Not every item will apply to every situation, but it provides a framework for writing cleaner, more maintainable data science code.
Conclusion: The Journey to Cleaner Code
Writing clean code in data science is a journey, not a destination. Much like how we iteratively improve our models, we should approach code quality as a continuous improvement process. Let’s reflect on why this matters and how to move forward.
The Reality of Data Science Code
Most data scientists don’t write clean code from the start - and that’s perfectly fine. When exploring data or prototyping models, our first priority is often to get something working. We might start with a messy Jupyter notebook full of quick experiments and abbreviated variable names. This is a natural part of the data science workflow.
# A typical exploratory analysis might start like this
df = pd.read_csv('data.csv')
x = df.drop('target', axis=1)
y = df['target']
rf = RandomForestClassifier()
rf.fit(x, y)
print(rf.score(x, y))
The problems arise when this exploratory code makes its way into production systems or when we need to revisit our analysis months later. That’s where clean code principles become invaluable.
The Benefits of Clean Code in Data Science
Clean code isn’t just about aesthetics - it delivers tangible benefits:
- Reproducibility: Well-organized code makes it easier to reproduce results, a fundamental requirement in data science.
- Collaboration: Clean code enables team members to understand and contribute to each other's work effectively.
- Maintenance: When you need to update models or modify preprocessing steps, clean code makes these changes safer and easier.
- Debugging: When issues arise, clean code makes it easier to isolate and fix problems.
The Path Forward
Remember these key points as you develop your clean coding practices:
- Start Simple: You don’t need to implement every clean code principle at once. Begin with basic improvements like better naming and function organization.
- Iterate: Just as you iterate on your models, iterate on your code quality. Each time you revisit a script or notebook, try to make it a little cleaner.
- Review: Use the checklist provided in this guide to review your code periodically. Make it part of your workflow, just like model validation.
- Learn from Others: Study well-maintained open-source data science projects. Pay attention to how they structure their code and handle common challenges.
Final Thoughts
Clean code in data science is about finding the right balance. We need to maintain the flexibility and experimentation that makes data science exciting while building maintainable, professional-grade software. The principles and practices we’ve covered in this guide aren’t rigid rules but rather tools to help you find that balance.
Remember: The goal isn’t perfection, but progress. Each step toward cleaner code is a step toward more robust, reliable, and reproducible data science.
As you apply these principles in your work, you’ll likely find that writing clean code becomes second nature. The extra time invested in writing clear, well-organized code pays dividends in the long run through easier maintenance, better collaboration, and fewer bugs.
Keep the provided checklist handy, but don’t let it paralyze you. Use it as a guide to gradually improve your code quality, one commit at a time. After all, the best code is not just technically correct - it’s code that tells a clear story about your data science journey.