How to Write Clean Code: A Data Scientist’s Guide
Introduction
Picture this: You’re opening a Jupyter notebook you created three months ago. As you scroll through cells filled with data preprocessing, model training, and evaluation code, you realize you can barely understand what you wrote. Sound familiar?
This scenario plays out daily in data science teams worldwide. While we focus intensely on model accuracy, feature engineering, and hyperparameter tuning, we often overlook a crucial aspect of our work: writing clean, maintainable code.
Let’s look at a typical example of code you might find in a data science project:
def p(d, ft):
# preprocess data
d = d.fillna(0)
# extract features
f = pd.get_dummies(d[ft])
# train model
X = f.values
y = d['target'].values
m = RandomForestClassifier()
m.fit(X, y)
return m
While this code works, it’s challenging to understand its purpose, maintain it, or collaborate on it. Here’s the same functionality written with clean code principles:
def train_classifier(data: pd.DataFrame, feature_columns: List[str]) -> RandomForestClassifier:
"""
Trains a Random Forest classifier on the given data.
Args:
data: Input DataFrame containing features and target
feature_columns: List of column names to use as features
Returns:
Trained RandomForestClassifier model
"""
preprocessed_data = preprocess_features(data)
feature_matrix = engineer_features(preprocessed_data, feature_columns)
model = train_model(feature_matrix, preprocessed_data['target'])
return model
def preprocess_features(data: pd.DataFrame) -> pd.DataFrame:
"""Handles missing values in the dataset."""
return data.fillna(0)
def engineer_features(data: pd.DataFrame, feature_columns: List[str]) -> np.ndarray:
"""Converts categorical variables into dummy variables."""
return pd.get_dummies(data[feature_columns]).values
def train_model(features: np.ndarray, target: np.ndarray) -> RandomForestClassifier:
"""Trains a Random Forest classifier."""
model = RandomForestClassifier()
model.fit(features, target)
return model
The difference is striking. The second version is not just more readable—it’s also easier to test, modify, and share with teammates. This transformation illustrates the core principle of clean code: writing code that is easy to understand and maintain.
While software engineers have long emphasized clean code practices, these principles haven’t received as much attention in data science, perhaps because we often work in notebooks and focus on experimentation and results rather than code quality. However, as data science projects grow in complexity and team size, the ability to write clean code becomes increasingly crucial.
This guide will walk you through essential principles of clean code, tailored specifically for data scientists. We’ll cover naming conventions, code organization, function design, and more—all with practical examples from data science workflows. Whether you’re building machine learning pipelines, performing statistical analyses, or creating data visualizations, these principles will help you write code that’s not just functional, but also maintainable and scalable.
The Art of Naming: Making Code Self-Explanatory
Imagine you’re a data scientist opening a Jupyter notebook you created a few months ago. You scroll through cells filled with variables like df, x, and clf. Would you immediately remember what these represent? This scenario highlights why naming is such an important part of writing clean code: if poor names are chosen, most other clean code practices won’t help much.
The Core Purpose of Naming
Names have one simple purpose: They should describe what’s stored in a variable or property, what a function or method does, or what kind of object will be created when instantiating a class. With this principle in mind, coming up with good names becomes more straightforward, though finding the best name will often require multiple iterations.
Let’s look at a typical example of code you might find in a data science project:
# Poor naming example
def proc(df, cols, tgt):
X = df[cols].copy()
y = df[tgt]
X = X.fillna(X.mean())
X_std = StandardScaler().fit_transform(X)
clf = LogisticRegression()
clf.fit(X_std, y)
return clf
# Usage
cols = ['dur', 'amnt', 'freq']
model = proc(customer_data, cols, 'churned')
Let’s rewrite this with clear, descriptive names:
def train_churn_classifier(
customer_data: pd.DataFrame,
feature_columns: List[str],
target_column: str
) -> LogisticRegression:
"""
Trains a logistic regression model for customer churn prediction.
Args:
customer_data: DataFrame containing customer information
feature_columns: Columns to use as predictors
target_column: Column containing churn information (1 for churned, 0 for active)
Returns:
Trained logistic regression model
"""
features = customer_data[feature_columns].copy()
target = customer_data[target_column]
cleaned_features = handle_missing_values(features)
scaled_features = scale_features(cleaned_features)
churn_classifier = LogisticRegression()
churn_classifier.fit(scaled_features, target)
return churn_classifier
# Usage
feature_columns = [
'subscription_duration',
'transaction_amount',
'purchase_frequency'
]
churn_model = train_churn_classifier(
customer_data,
feature_columns,
'churned'
)
Naming Guidelines for Data Scientists
Variables and Properties
Variables and properties hold data - numbers, text (strings), boolean values, objects, lists, arrays, maps, etc. Hence, the name should imply which kind of data is being stored. Names should typically be nouns or short phrases with adjectives, especially for boolean values.
# Poor naming
X = np.array([[1, 2], [3, 4]])
y = np.array([0, 1])
df1 = pd.read_csv('data.csv')
flag = True
# Better naming
feature_matrix = np.array([[1, 2], [3, 4]])
target_labels = np.array([0, 1])
raw_sales_data = pd.read_csv('data.csv')
is_valid_input = True
# For DataFrames, describe what they contain
preprocessed_customer_data = raw_sales_data[raw_sales_data['revenue'] > 0]
# Boolean variables should use is_, has_, did_ prefixes
is_outlier = np.abs(z_score) > 3
has_missing_values = data.isnull().any()
did_converge = model.n_iter_ < model.max_iter
Functions and Methods
Functions and methods execute code - they perform tasks and operations. Their names should be verbs that describe the action being performed.
# Poor naming - sounds like properties
def email(text_data):
return process(text_data)
def user(id):
return find(id)
# Better naming - clear actions
def extract_text_features(text_data: List[str]) -> sparse.csr_matrix:
"""Converts text data into bag-of-words representation."""
return CountVectorizer().fit_transform(text_data)
def get_user_by_id(user_id: int) -> User:
"""Retrieves user information from database."""
return database.query(User).filter_by(id=user_id).first()
Classes
Classes are used to create objects (unless it’s a static class). The class name should describe the kind of object it will create. Even for static classes, the name should describe what kind of container it represents. Class names should be nouns.
# Poor naming
class ML:
def __init__(self, data):
self.data = data
# Better naming
class TextPreprocessor:
def __init__(self, text_data: List[str]):
self.text_data = text_data
def remove_stopwords(self) -> List[str]:
pass
def stem_words(self) -> List[str]:
pass
Avoid Generic Names
In most situations, you should avoid generic names like handle(), process(), data, or item. While there can be situations where these make sense, you should typically either make these names more specific or choose a different kind of name:
# Too generic
def process(data):
return data.transform()
# More specific
def normalize_feature_values(feature_data: pd.DataFrame) -> pd.DataFrame:
return feature_data.transform()
Be Consistent
An important part of using proper names is consistency. If you use fetch_users() in one part of your code, you should use fetch_products() - not get_products() - in another part of that same codebase. While it generally doesn’t matter whether you prefer fetch_, get_, or retrieve_, you should stick to one term throughout your codebase.
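As a quick sketch (the data-access helpers here are hypothetical), a consistent prefix keeps the whole API predictable:
from typing import Dict, List

# Hypothetical data-access helpers - the point is the consistent fetch_ prefix
def fetch_users() -> List[Dict]:
    return [{'id': 1, 'name': 'Ada'}]

def fetch_products() -> List[Dict]:
    return [{'id': 10, 'name': 'Keyboard'}]

# Avoid mixing get_products() or retrieve_products() in alongside fetch_users()
users = fetch_users()
products = fetch_products()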
Practical Tips for Data Science Projects
- Dataset Naming: When working with multiple versions of your dataset, use names that reflect the processing stage:
raw_data = pd.read_csv('sales.csv')
cleaned_data = remove_outliers(raw_data)
feature_engineered_data = add_time_features(cleaned_data)
- Model Naming: When experimenting with multiple models, use names that reflect their purpose and configuration:
baseline_model = RandomForestClassifier()
tuned_model = RandomForestClassifier(**optimal_parameters)
production_model = load_model('prod_v2.pkl')
- Feature Names: Use descriptive names in feature engineering:
# Poor naming
df['t_diff'] = df['t2'] - df['t1']
# Better naming
df['time_since_last_purchase'] = (
df['current_transaction_date'] -
df['last_transaction_date']
).dt.days
Remember: The few extra characters you type in descriptive names will save hours of confusion later. Think of variable and function names as documentation - when you return to your code months later, you shouldn’t need to decipher what proc or df1 means; the names should tell the story of what your code does.
Comments and Formatting: When Less Is More
In data science, we often work with complex transformations, mathematical formulas, and multi-step processes. It might be tempting to explain everything with comments, but let’s see why this isn’t always the best approach.
Comments
Comments in code can be both helpful and harmful. Let’s explore when to use them and when to avoid them with data science examples.
When Comments Are Useful
- Legal Information:
# Copyright (c) 2024 Research Institute
# Licensed under the MIT License
# This code implements the algorithm described in:
# Smith et al. (2023) "Novel Approach to Time Series Forecasting"
# Journal of Machine Learning, Vol 42, pp. 101-120
import numpy as np
import pandas as pd
- Complex Mathematical Transformations:
def calculate_mahalanobis_distance(data: pd.DataFrame) -> np.ndarray:
"""Calculates the Mahalanobis distance for multivariate outlier detection."""
# Mahalanobis distance is defined as: sqrt((x-μ)ᵀ Σ⁻¹ (x-μ))
# where μ is the mean vector and Σ is the covariance matrix
mean = np.mean(data, axis=0)
covariance_matrix = np.cov(data, rowvar=False)
inv_covmat = np.linalg.inv(covariance_matrix)
return np.sqrt(np.sum(np.dot(data - mean, inv_covmat) * (data - mean), axis=1))
- Regular Expressions:
def extract_timestamp(text: str) -> str:
# Matches timestamps in format: YYYY-MM-DD HH:MM:SS.mmm
timestamp_pattern = r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\.\d{3})'
return re.search(timestamp_pattern, text).group()
- Important Warnings:
def split_time_series(data: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
# WARNING: Data must be sorted chronologically
# Shuffling would cause future data leakage
split_idx = int(len(data) * 0.8)
return data[:split_idx], data[split_idx:]
- TODO Notes (Use Sparingly):
def train_model(features: pd.DataFrame, target: pd.Series) -> RandomForestClassifier:
# TODO: Add cross-validation when more data is available
# TODO: Implement early stopping based on validation loss
model = RandomForestClassifier()
return model.fit(features, target)
When to Avoid Comments
- Commented-Out Code:
# Bad Example
def preprocess_features(data: pd.DataFrame) -> pd.DataFrame:
# Remove outliers using IQR method
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
# # Old method using z-score
# # z_scores = np.abs(stats.zscore(data))
# # data = data[(z_scores < 3).all(axis=1)]
return data[~((data < (Q1 - 1.5 * IQR)) | (data > (Q3 + 1.5 * IQR))).any(axis=1)]
# Better: Delete unused code and use version control to track changes
- Misleading Comments:
# Bad Example
def process_features(df: pd.DataFrame) -> pd.DataFrame:
# Normalize the data
standardized = StandardScaler().fit_transform(df) # Actually standardizing!
return standardized
# Better Example
def standardize_features(df: pd.DataFrame) -> pd.DataFrame:
"""Standardizes features to zero mean and unit variance."""
return StandardScaler().fit_transform(df)
- Redundant Comments:
# Bad Example
def calculate_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> Dict:
# Calculate accuracy
acc = accuracy_score(y_true, y_pred) # Don't state the obvious
# Calculate precision
prec = precision_score(y_true, y_pred)
# Return metrics dictionary
return {'accuracy': acc, 'precision': prec}
# Better Example
def calculate_classification_metrics(
y_true: np.ndarray,
y_pred: np.ndarray
) -> Dict[str, float]:
"""Returns a dictionary of classification performance metrics."""
return {
'accuracy': accuracy_score(y_true, y_pred),
'precision': precision_score(y_true, y_pred)
}
Key Principles for Comments
- Comments should explain the “why”, not the “what”:
# Bad: Explains what (obvious from code)
x = x + 1 # Increment x
# Good: Explains why
n_iterations += 1 # Compensate for warm-up period in MCMC
- Let code be self-documenting when possible:
# Bad: Relies on comments for clarity
def p(d): # Process data
r = d.sum() # Calculate sum
return r # Return result
# Good: Self-documenting code
def calculate_total_sales(daily_sales: pd.Series) -> float:
return daily_sales.sum()
- Use docstrings for documentation:
def detect_anomalies(
time_series: pd.Series,
window_size: int = 30
) -> pd.Series:
"""
Detects anomalies using rolling statistics.
Args:
time_series: Time series data
window_size: Size of rolling window
Returns:
Boolean series where True indicates anomalies
"""
rolling_mean = time_series.rolling(window_size).mean()
rolling_std = time_series.rolling(window_size).std()
z_scores = np.abs((time_series - rolling_mean) / rolling_std)
return z_scores > 3
Remember: The goal is to write code that is so clear that it doesn’t need comments to be understood. Comments should be used only when they add value that can’t be conveyed through better code structure and naming.
Code Formatting: Making Your Data Science Code More Readable
Vertical Formatting: The Art of Spacing
Think of your code like a well-written research paper - it should have clear paragraphs, logical sections, and a natural flow. In code, we achieve this through vertical formatting.
Bad Example - No Vertical Spacing
def preprocess_dataset(data: pd.DataFrame) -> pd.DataFrame:
numeric_columns = data.select_dtypes(include=[np.number]).columns
data[numeric_columns] = data[numeric_columns].fillna(data[numeric_columns].mean())
categorical_columns = data.select_dtypes(include=['object']).columns
data[categorical_columns] = data[categorical_columns].fillna(data[categorical_columns].mode().iloc[0])
for column in categorical_columns:
data[column] = LabelEncoder().fit_transform(data[column])
scaled_features = StandardScaler().fit_transform(data[numeric_columns])
data[numeric_columns] = scaled_features
return data
Good Example - With Proper Vertical Spacing
def preprocess_dataset(data: pd.DataFrame) -> pd.DataFrame:
"""Preprocesses dataset by handling missing values and encoding categories."""
# Handle numeric features
numeric_columns = data.select_dtypes(include=[np.number]).columns
data[numeric_columns] = data[numeric_columns].fillna(
data[numeric_columns].mean()
)
# Handle categorical features
categorical_columns = data.select_dtypes(include=['object']).columns
data[categorical_columns] = data[categorical_columns].fillna(
data[categorical_columns].mode().iloc[0]
)
# Encode categorical variables
for column in categorical_columns:
data[column] = LabelEncoder().fit_transform(data[column])
# Scale numeric features
scaled_features = StandardScaler().fit_transform(data[numeric_columns])
data[numeric_columns] = scaled_features
return data
Key Principles of Vertical Formatting:
- Vertical Density: Keep related concepts together
# Good: Related operations are grouped
def calculate_feature_importance(model, X: pd.DataFrame) -> pd.DataFrame:
# Get importance scores
importance = model.feature_importances_
feature_names = X.columns
# Create and sort importance DataFrame
importance_df = pd.DataFrame({
'feature': feature_names,
'importance': importance
})
return importance_df.sort_values('importance', ascending=False)
- Vertical Distance: Separate distinct concepts with blank lines
class ModelTrainer:
def __init__(self, model_params: Dict):
self.model_params = model_params
self.model = None
def prepare_data(self, X: pd.DataFrame, y: pd.Series) -> Tuple:
"""Prepares training and validation data."""
return train_test_split(X, y, test_size=0.2, random_state=42)
def train_model(self, X_train: pd.DataFrame, y_train: pd.Series) -> None:
"""Trains the model on provided data."""
self.model = RandomForestClassifier(**self.model_params)
self.model.fit(X_train, y_train)
- File Organization: Follow the “stepdown rule” - called functions should appear below their callers
def train_and_evaluate_model(data: pd.DataFrame) -> Dict[str, float]:
"""Main function that orchestrates model training and evaluation."""
features, target = prepare_features_and_target(data)
model = train_model(features, target)
return evaluate_model(model, features, target)
def prepare_features_and_target(data: pd.DataFrame) -> Tuple:
"""Prepares features and target for modeling."""
return data.drop('target', axis=1), data['target']
def train_model(features: pd.DataFrame, target: pd.Series) -> RandomForestClassifier:
"""Trains the model."""
return RandomForestClassifier().fit(features, target)
def evaluate_model(
model: RandomForestClassifier,
features: pd.DataFrame,
target: pd.Series
) -> Dict[str, float]:
"""Evaluates model performance."""
return {
'accuracy': accuracy_score(target, model.predict(features)),
'f1': f1_score(target, model.predict(features))
}
Horizontal Formatting: Managing Line Length and Readability
Just as vertical spacing helps organize code sections, horizontal formatting makes individual lines more readable.
Bad Example - Long Lines
result = pd.DataFrame({'predicted_values': model.predict(X_test), 'actual_values': y_test, 'feature_1': X_test['feature_1'], 'feature_2': X_test['feature_2'], 'probability_class_1': model.predict_proba(X_test)[:, 1]})
Good Example - Breaking Long Lines
result = pd.DataFrame({
'predicted_values': model.predict(X_test),
'actual_values': y_test,
'feature_1': X_test['feature_1'],
'feature_2': X_test['feature_2'],
'probability_class_1': model.predict_proba(X_test)[:, 1]
})
Key Principles of Horizontal Formatting:
- Line Length: Keep lines under 88 characters (Python black formatter standard)
# Bad
correlation_matrix = data[['feature1', 'feature2', 'feature3', 'feature4', 'feature5']].corr()
# Good
correlation_matrix = data[[
'feature1', 'feature2', 'feature3',
'feature4', 'feature5'
]].corr()
- Parameter Lists: Break long parameter lists into multiple lines
# Bad
def train_complex_model(X_train, y_train, n_estimators, max_depth, min_samples_split, min_samples_leaf, max_features, random_state):
# Good
def train_complex_model(
X_train: pd.DataFrame,
y_train: pd.Series,
n_estimators: int = 100,
max_depth: int = None,
min_samples_split: int = 2,
min_samples_leaf: int = 1,
max_features: str = 'auto',
random_state: int = 42
) -> RandomForestClassifier:
- Method Chaining: Break long method chains into multiple lines
# Bad
cleaned_data = data.dropna().reset_index(drop=True).drop_duplicates().sort_values('date').reset_index(drop=True)
# Good
cleaned_data = (
data
.dropna()
.reset_index(drop=True)
.drop_duplicates()
.sort_values('date')
.reset_index(drop=True)
)
As your code gets bigger, consider splitting it across multiple files and using modules and import statements to connect the pieces.
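As a rough sketch (the module and function names below are hypothetical), such a split might look like this:
# features.py - hypothetical module holding feature-engineering helpers
import pandas as pd

def add_time_features(data: pd.DataFrame) -> pd.DataFrame:
    data = data.copy()
    data['day_of_week'] = data['transaction_date'].dt.dayofweek
    return data

# train.py - hypothetical entry point that imports the helper above
import pandas as pd
from features import add_time_features

def build_training_data(path: str) -> pd.DataFrame:
    raw_data = pd.read_csv(path, parse_dates=['transaction_date'])
    return add_time_features(raw_data)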
Remember: The goal of formatting is to make your code tell a clear story. Well-formatted code helps others (and your future self) understand your data science workflow more easily.
Writing Clean Functions: The Building Blocks of Data Science Code
Functions are the fundamental building blocks of data science code. They help us organize code, make it reusable, and maintain clarity in our data processing pipelines. Let’s explore the key concepts that make functions effective.
Function Size and Responsibility
Functions should:
- Do exactly one thing
- Be small and focused
- Operate at a single level of abstraction
- Have a clear and specific purpose
# Too many responsibilities
def process_data(df):
# Handle missing values
df = df.fillna(df.mean())
# Remove outliers
z_scores = stats.zscore(df)
df = df[(z_scores < 3).all(axis=1)]
# Scale features
return StandardScaler().fit_transform(df)
# Clear, single responsibilities
def handle_missing_values(data: pd.DataFrame) -> pd.DataFrame:
"""Fills missing values with column means."""
return data.fillna(data.mean())
def remove_outliers(
data: pd.DataFrame,
threshold: float = 3.0
) -> pd.DataFrame:
"""Removes outliers based on z-score threshold."""
z_scores = stats.zscore(data)
return data[(np.abs(z_scores) < threshold).all(axis=1)]
def scale_features(data: pd.DataFrame) -> pd.DataFrame:
"""Scales features using StandardScaler."""
return StandardScaler().fit_transform(data)
Pure Functions
A pure function is a function that:
- Always produces the same output for the same input (deterministic behavior)
- Has no side effects (doesn’t modify external state, print to console, or write to files)
- Relies only on its input parameters
Note: Reading module-level constants is fine inside a pure function; what breaks purity is relying on or modifying mutable global state.
# Not a pure function - modifies input and has side effects
global_scaler = StandardScaler() # Global mutable state
def preprocess_features(df):
# Modifies global state
global_scaler.fit(df)
# Modifies input df in place
df['normalized'] = df['value'] / 100
# Side effect: prints to console
print("Processing complete")
return df
# Pure function - no side effects, doesn't modify inputs
def scale_features(
data: pd.DataFrame,
columns: List[str],
scaler: Optional[StandardScaler] = None
) -> Tuple[pd.DataFrame, StandardScaler]:
"""
Scales specified columns without modifying the input DataFrame.
Args:
data: Input DataFrame
columns: Columns to scale
scaler: Optional pre-fitted scaler
Returns:
Tuple containing:
- DataFrame with scaled features
- Fitted scaler for reuse
"""
result = data.copy()
scaler = scaler or StandardScaler()
result[columns] = scaler.fit_transform(data[columns])
return result, scaler
While pure functions are ideal for readability and testing, they’re not always possible in data science due to necessary side effects:
# Bad: Side effect not clear from name
def process_data(data):
data.to_csv('processed.csv') # Unexpected file I/O
return data
# Good: Name indicates side effect
def save_data_to_csv(data, filepath):
data.to_csv(filepath)
return data
# Bad: Hidden database interaction
def get_user(user_id):
db.connect() # Unexpected database connection
return db.query(f"SELECT * FROM users WHERE id={user_id}")
# Good: Clear about database interaction
def fetch_user_from_database(user_id):
db.connect()
return db.query(f"SELECT * FROM users WHERE id={user_id}")
# Bad: Hidden state modification
def train(model, data):
model.random_state = 42 # Unexpected state change
model.fit(data)
return model
# Good: Name indicates state modification
def initialize_and_train_model(model, data, random_state=42):
model.random_state = random_state
model.fit(data)
return model
Common situations requiring impure functions in data science:
- File operations (use verbs like ‘save’, ‘write’, ‘load’, ‘read’)
- Database interactions (use verbs like ‘fetch’, ‘store’, ‘query’)
- Model training (use verbs like ‘train’, ‘fit’, ‘initialize’)
- Logging/printing (use verbs like ‘log’, ‘print’, ‘display’)
- Random operations (include ‘random’ in name if result varies)
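For the last two situations in this list, here is a brief hypothetical sketch of how those verbs can surface in function names:
import logging
import pandas as pd

def log_validation_metrics(metrics: dict) -> None:
    """The 'log' verb makes the side effect (writing to the log) explicit."""
    for name, value in metrics.items():
        logging.info('%s: %.4f', name, value)

def sample_random_rows(data: pd.DataFrame, n_rows: int, random_state: int = 42) -> pd.DataFrame:
    """'random' in the name signals that the output depends on the random state."""
    return data.sample(n=n_rows, random_state=random_state)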
Levels of Abstraction
Functions should operate at consistent levels of abstraction. Each function should have operations that are at the same conceptual level, which should be one level below what the function name implies.
High-Level vs Low-Level Operations
# Mixed levels of abstraction - Hard to understand
def train_model(data):
X = data.drop('target', axis=1) # Low-level operation
model = RandomForestClassifier() # High-level operation
model.fit(X, data['target']) # Mid-level operation
print('Training complete') # Low-level operation
return model
# Consistent level of abstraction - Clear and maintainable
def train_model(data: pd.DataFrame) -> RandomForestClassifier:
"""Trains a random forest model on the provided data."""
features = prepare_features(data)
target = extract_target(data)
model = create_model()
fit_model(model, features, target)
return model
def prepare_features(data: pd.DataFrame) -> pd.DataFrame:
"""Extracts and preprocesses feature columns."""
return data.drop('target', axis=1)
def extract_target(data: pd.DataFrame) -> pd.Series:
"""Extracts the target variable."""
return data['target']
When to Split Functions
Follow these guidelines to decide when to split functions:
- Extract code that works on the same functionality

# Before
def update_user(user_data):
    validate_user_data(user_data)
    user = find_user_by_id(user_data.id)
    user.set_age(user_data.age)
    user.set_name(user_data.name)
    user.save()

# After
def update_user(user_data):
    validate_user_data(user_data)
    apply_update(user_data)

def apply_update(user_data):
    user = find_user_by_id(user_data.id)
    update_user_fields(user, user_data)
    user.save()

- Extract code that requires more interpretation

# Before
def process_transaction(transaction):
    if transaction.type == 'UNKNOWN':
        raise ValueError('Invalid transaction type.')
    if transaction.type == 'PAYMENT':
        process_payment(transaction)

# After
def process_transaction(transaction):
    validate_transaction(transaction)
    if is_payment(transaction):
        process_payment(transaction)
Common Pitfalls to Avoid
- Over-splitting functions (see the sketch after this list)
  - Don’t create functions just for the sake of extraction
  - Ensure each function adds meaningful abstraction
- Inconsistent abstraction levels
  - Keep operations within a function at the same conceptual level
  - Don’t mix high-level business logic with low-level implementation details
- Hidden side effects
  - Make side effects obvious through function names
  - Document any state changes in function documentation
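To make the first pitfall concrete, here is a small hypothetical sketch contrasting a wrapper that adds no abstraction with an extraction that captures a real decision:
# Over-split: the wrapper merely renames an existing pandas call
def get_column_mean(data, column):
    return data[column].mean()

# Meaningful abstraction: the function encapsulates an agreed-upon strategy
def impute_missing_with_column_means(data):
    """Applies the team's chosen imputation strategy to numeric columns."""
    return data.fillna(data.mean(numeric_only=True))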
Minimizing Function Parameters
The fewer parameters a function has, the easier it is to read, understand, and call. Here’s a guide to parameter counts:
No Parameters
Functions without parameters are very easy to read and digest:
create_session()
user.save()
However, “no parameters” isn’t always an option - parameters make functions dynamic and flexible.
One Parameter
Functions with one parameter are typically straightforward:
is_valid(email)
file.write(data)
Two Parameters
Two parameters can be okay, but context matters:
Good examples (clear and intuitive):
login('[email protected]', 'testers')
create_product('Carpet', 12.99)
Confusing examples (parameter order not obvious):
create_session('abc', 'temp')
sort_users('email', 'asc')
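One way to soften this in Python (a suggestion beyond the original examples, using a hypothetical create_session helper) is to require keyword arguments, so every call site spells out what each value means:
def create_session(*, session_id: str, session_type: str) -> dict:
    # The bare * forces callers to name both arguments
    return {'id': session_id, 'type': session_type}

# The call is now self-explanatory
session = create_session(session_id='abc', session_type='temp')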
More than Two Parameters
Should generally be avoided - they become hard to read and use:
# Hard to understand
create_rectangle(10, 9, 30, 12)
create_user('[email protected]', 31, 'max')
Solutions for Many Parameters
When working with data science projects, you’ll often find yourself dealing with functions and classes that require many parameters. Whether it’s configuring a machine learning model, setting up data preprocessing steps, or defining experiment parameters, managing these parameters effectively is crucial for maintaining clean and maintainable code.
Let’s explore three powerful patterns that can make your data science code more organized and easier to understand.
Using Objects/Maps Instead of Multiple Parameters
The Problem
Consider a typical data preprocessing function:
def preprocess_data(
data,
fill_missing_strategy="mean",
scaling_method="standard",
categorical_encoding="one-hot",
drop_columns=None,
handle_outliers=True,
outlier_threshold=3,
create_polynomial_features=False,
polynomial_degree=2,
feature_selection_method="mutual_info",
n_features_to_select=10
):
# Implementation here
pass
This function is difficult to use and maintain because:
- It’s hard to remember the order of parameters
- Default values are scattered throughout the signature
- Adding new parameters requires changing function calls everywhere
The Solution
Instead, use a dictionary or object with named parameters:
def preprocess_data(config):
"""
Preprocess data according to the configuration.
Args:
config: dict with preprocessing parameters
"""
scaling_method = config.get("scaling_method", "standard")
categorical_encoding = config.get("categorical_encoding", "one-hot")
# Rest of implementation
# Usage
preprocessing_config = {
"scaling_method": "minmax",
"categorical_encoding": "label",
"handle_outliers": True,
"outlier_threshold": 2.5
}
preprocessed_data = preprocess_data(preprocessing_config)
Configuration Objects: Type-Safe Parameter Management
Configuration objects take the previous concept further by adding type safety and validation. This is especially valuable in data science where incorrect parameter types can cause subtle bugs.
Real-World Example: Training Configuration
from dataclasses import dataclass
from typing import List, Optional, Union
from pathlib import Path
@dataclass
class TrainingConfig:
# Model parameters
model_type: str
hidden_layers: List[int]
activation: str
dropout_rate: float
# Training parameters
batch_size: int
learning_rate: float
n_epochs: int
# Data parameters
train_data_path: Path
validation_split: float
target_column: str
feature_columns: List[str]
# Optional parameters
early_stopping_patience: Optional[int] = None
model_checkpoint_path: Optional[Path] = None
def validate(self):
"""Validate configuration parameters."""
assert 0 < self.dropout_rate < 1, "Dropout rate must be between 0 and 1"
assert self.batch_size > 0, "Batch size must be positive"
assert 0 < self.validation_split < 1, "Validation split must be between 0 and 1"
assert self.train_data_path.exists(), "Training data path must exist"
# Usage example
config = TrainingConfig(
model_type="feedforward",
hidden_layers=[128, 64, 32],
activation="relu",
dropout_rate=0.3,
batch_size=64,
learning_rate=0.001,
n_epochs=100,
train_data_path=Path("data/train.csv"),
validation_split=0.2,
target_column="target",
feature_columns=["feature1", "feature2", "feature3"]
)
def train_model(config: TrainingConfig):
config.validate()
# Training implementation here
Benefits:
- Type hints provide IDE support and catch errors early
- Validation ensures parameters are correct before training starts
- Documentation is built into the structure
- Easy to serialize/deserialize for experiment tracking
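For instance, a minimal sketch of that last point: serializing the TrainingConfig defined above so it can be stored next to experiment results (the output path is hypothetical):
import json
from dataclasses import asdict
from pathlib import Path

# Convert the dataclass into a plain dict; Path objects become strings
# so the result is JSON-serializable
config_dict = {
    key: str(value) if isinstance(value, Path) else value
    for key, value in asdict(config).items()
}

with open('experiments/run_001_config.json', 'w') as f:
    json.dump(config_dict, f, indent=2)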
Builder Pattern: Complex Object Construction Made Clear
The Builder pattern is particularly useful for experiment configuration where you might want to modify only certain parameters while keeping others at their default values.
Real-World Example: Experiment Configuration Builder
class ExperimentBuilder:
"""Builder for machine learning experiment configuration."""
def __init__(self):
self._config = {
"model_params": {},
"training_params": {},
"data_params": {}
}
def with_model(self, model_type: str, **kwargs):
"""Configure model architecture."""
self._config["model_params"] = {
"type": model_type,
**kwargs
}
return self
def with_training_params(self, **kwargs):
"""Configure training parameters."""
self._config["training_params"].update(kwargs)
return self
def with_data_preprocessing(self, **kwargs):
"""Configure data preprocessing steps."""
self._config["data_params"].update(kwargs)
return self
def build(self) -> dict:
"""Validate and return the final configuration."""
self._validate_config()
return self._config
def _validate_config(self):
"""Ensure all required parameters are set."""
required_model_params = ["type"]
required_training_params = ["learning_rate", "n_epochs"]
missing_model = [param for param in required_model_params
if param not in self._config["model_params"]]
if missing_model:
raise ValueError(f"Missing required model parameters: {missing_model}")
missing_training = [param for param in required_training_params
if param not in self._config["training_params"]]
if missing_training:
raise ValueError(f"Missing required training parameters: {missing_training}")
# Usage example for a deep learning experiment
experiment_config = (
ExperimentBuilder()
.with_model(
model_type="transformer",
n_layers=6,
n_heads=8,
d_model=512
)
.with_training_params(
learning_rate=0.0001,
n_epochs=50,
batch_size=32,
gradient_clip=1.0
)
.with_data_preprocessing(
sequence_length=128,
vocab_size=10000,
padding="post",
truncating="post"
)
.build()
)
This pattern is particularly valuable when:
- You’re running multiple experiments with different configurations
- Some parameters are interdependent
- You want to ensure all required parameters are set before running
- You need to maintain default configurations while allowing customization
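For instance, one way to cover the last two points (a sketch building on the ExperimentBuilder above, with illustrative defaults) is a small helper that returns a pre-loaded builder, so similar experiments differ only in the parameters you override:
def default_experiment_builder() -> ExperimentBuilder:
    """Hypothetical helper returning a builder pre-loaded with team defaults."""
    return (
        ExperimentBuilder()
        .with_model(model_type='transformer', n_layers=6, n_heads=8, d_model=512)
        .with_training_params(learning_rate=0.0001, n_epochs=50, batch_size=32)
    )

# Two similar experiments that differ only in the learning rate
baseline_config = default_experiment_builder().build()
high_lr_config = (
    default_experiment_builder()
    .with_training_params(learning_rate=0.001)
    .build()
)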
Best Practices Summary
- Use simple parameter objects for functions with more than 2 parameters
- Use configuration objects with type hints for complex settings that need validation
- Use the builder pattern when you need to:
- Create complex configurations step by step
- Maintain multiple similar configurations
- Ensure parameter validity before execution
- Make configuration creation more readable and maintainable
Remember: The goal is to make your code more maintainable and less error-prone. Choose the pattern that best fits your specific use case and team’s needs.
Control Structures: Taming Complexity in Data Pipelines
Control structures (if statements, for loops, while loops, switch-case statements) are fundamental for coordinating code flow. While essential, they can lead to suboptimal or hard-to-maintain code if not used carefully. Here are three key areas for improvement:
1. Prefer Positive Checks
Using positive wording in if checks can make code more readable.
# Less clear - requires more mental processing
if not has_content(blog_content):
    raise ValueError('Invalid input')

# More clear - instantly understandable
if is_empty(blog_content):
    raise ValueError('Invalid input')
2. Avoid Deep Nesting
Deep nesting makes code hard to read and maintain. Here are four techniques to avoid it:
a. Use Guards and Fail Fast
# Deeply nested - hard to follow
def message_user(user, message):
    if user:
        if message:
            if user.accepts_messages:
                success = user.send_message(message)
                if success:
                    print('Message sent!')

# Using guards - clear and flat
def message_user(user, message):
    if not user or not message or not user.accepts_messages:
        return
    user.send_message(message)
    print('Message sent!')
b. Extract Control Structures into Functions
# Complex nested logic
def load_dataset(path):
if not path:
raise ValueError('Path to dataset is required!')
if path.endswith('.csv'):
df = pd.read_csv(path)
if df.empty:
if os.path.exists(path + '.parquet'):
return pd.read_parquet(path + '.parquet')
else:
raise ValueError('Dataset is empty!')
elif path.endswith('.parquet'):
df = pd.read_parquet(path)
if df.empty:
raise ValueError('Dataset is empty!')
else:
raise ValueError('Unsupported file format!')
return df
# Extracted into focused functions
def load_dataset(path):
validate_path(path)
file_type = get_file_type(path)
return read_data(path, file_type)
def validate_path(path):
if not path:
raise ValueError('Path to dataset is required!')
if not os.path.exists(path):
raise FileNotFoundError(f'Dataset not found at {path}')
def get_file_type(path):
supported_formats = {'.csv': 'csv', '.parquet': 'parquet'}
file_ext = os.path.splitext(path)[1].lower()
if file_ext not in supported_formats:
raise ValueError(f'Unsupported format. Use: {list(supported_formats.keys())}')
return supported_formats[file_ext]
def read_data(path, file_type):
readers = {
'csv': pd.read_csv,
'parquet': pd.read_parquet
}
df = readers[file_type](path)
validate_dataframe(df)
return df
def validate_dataframe(df):
if df.empty:
raise ValueError('Dataset is empty!')
c. Use Factory Functions & Polymorphism
# Repetitive checks and nested logic
def process_dataset(data):
if is_timeseries(data):
if needs_resampling(data):
process_timeseries_resampling(data)
if needs_interpolation(data):
process_timeseries_interpolation(data)
else:
if needs_resampling(data):
process_tabular_resampling(data)
if needs_interpolation(data):
process_tabular_interpolation(data)
# Using factory function and polymorphic object
def get_processors(data):
processors = {
'resample': None,
'interpolate': None
}
if is_timeseries(data):
processors['resample'] = lambda x: x.resample('1D').mean()
processors['interpolate'] = lambda x: x.interpolate(method='time')
else:
processors['resample'] = lambda x: x.sample(frac=0.8, random_state=42)
processors['interpolate'] = lambda x: x.interpolate(method='linear')
return processors
def process_dataset(data):
processors = get_processors(data)
if needs_resampling(data):
data = processors['resample'](data)
if needs_interpolation(data):
data = processors['interpolate'](data)
return data
# Helper functions
def is_timeseries(data):
return isinstance(data.index, pd.DatetimeIndex)
def needs_resampling(data):
return data.isnull().sum().sum() > len(data) * 0.1
def needs_interpolation(data):
return data.isnull().any().any()
d. Replace If Checks with Error Handling
Following the principle that a function should do exactly one thing, error handling should typically be separated from the main function logic. When a function can’t complete its one job, it should construct and throw an error rather than handling it internally.
# Poor approach: Function doing multiple things
# 1. Validates data
# 2. Creates error codes/messages
# 3. Handles errors (logging)
def validate_dataset(df):
validity = check_data_quality(df)
if validity['code'] in [1, 2]:
print(f"Data validation failed: {validity['message']}")
return
if validity['code'] == 3:
print("Warning: Data contains outliers")
return df
# Better approach: Each function has one responsibility
def validate_dataframe(df):
"""Validates DataFrame structure and content"""
if not isinstance(df, pd.DataFrame):
raise TypeError("Input must be a pandas DataFrame")
if df.empty:
raise ValueError("DataFrame is empty")
required_columns = ['timestamp', 'value', 'category']
missing_cols = set(required_columns) - set(df.columns)
if missing_cols:
raise ValueError(f"Missing required columns: {missing_cols}")
if df['value'].isnull().sum() / len(df) > 0.2:
raise ValueError("Too many missing values (>20%) in 'value' column")
def validate_data_types(df):
"""Validates data types and formats"""
if not pd.api.types.is_datetime64_any_dtype(df['timestamp']):
raise TypeError("'timestamp' column must be datetime")
if not pd.api.types.is_numeric_dtype(df['value']):
raise TypeError("'value' column must be numeric")
if not df['category'].isin(['A', 'B', 'C']).all():
raise ValueError("'category' must only contain values: A, B, C")
# Separate function for handling the data processing pipeline
def process_dataset(df):
try:
# Structural validation
validate_dataframe(df)
# Data type validation
validate_data_types(df)
# Continue with data processing
return prepare_data(df)
except (TypeError, ValueError) as e:
logging.error(f"Data validation failed: {str(e)}")
raise
except Exception as e:
logging.error(f"Unexpected error during data processing: {str(e)}")
raise
def prepare_data(df):
"""Handles the actual data preparation after validation"""
df = df.copy()
df['timestamp'] = pd.to_datetime(df['timestamp'])
df['value'] = pd.to_numeric(df['value'], errors='coerce')
df = df.sort_values('timestamp')
return df
Best Practices
- Choose Positive Checks When:
- The positive case is more common
- The condition is simple and doesn’t involve multiple states
- It makes the code more readable
- Apply Factory Functions When:
- You have similar but varying behavior
- You’re repeating checks in multiple places
- You need to create objects with different implementations but same interface
- Use Error Handling When:
- Validation is a separate concern
- You want to construct and bubble up errors to appropriate handlers
- You want to avoid nested if checks for error conditions
Classes and Objects in Data Science: Organizing Complex Pipelines
While this guide doesn’t exclusively focus on object-oriented programming, classes and objects are crucial aspects of programming in general, even if you don’t follow a purely object-oriented style. When working with complex data science pipelines, understanding how to effectively use classes and objects can significantly improve code organization and maintainability.
Understanding Objects vs Data Containers
A fundamental distinction in data science code is between “real objects” and “data containers”. Let’s understand this key difference:
Data Containers
A data container is exactly what it sounds like - an object that holds data. For example:
class ExperimentMetrics:
def __init__(self, accuracy: float, f1_score: float):
self.accuracy = accuracy # Public property
self.f1_score = f1_score # Public property
# Usage
experiment_metrics = ExperimentMetrics(0.95, 0.93)
print(experiment_metrics.accuracy) # Direct access to properties
This class has no methods and both properties are exposed publicly. It’s perfectly valid for storing and transferring data.
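In modern Python, the same container can also be written as a dataclass (the same construct used for the configuration objects later in this guide), which keeps the intent of a pure data holder explicit:
from dataclasses import dataclass

@dataclass
class ExperimentMetrics:
    accuracy: float
    f1_score: float

# Usage stays the same: plain, publicly accessible data
experiment_metrics = ExperimentMetrics(accuracy=0.95, f1_score=0.93)
print(experiment_metrics.accuracy)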
Objects with Behavior
In contrast, a proper object hides its data from the public and exposes a public API through methods:
class DataPreprocessor:
def __init__(self, dataframe: pd.DataFrame):
self._data = dataframe # Private property
self._scaler = None # Private property
def fit_transform(self) -> pd.DataFrame:
"""Encapsulates the preprocessing logic"""
self._scaler = StandardScaler()
normalized = self._scaler.fit_transform(self._data)
return pd.DataFrame(normalized, columns=self._data.columns)
# Usage
preprocessor = DataPreprocessor(raw_data)
clean_data = preprocessor.fit_transform() # Interact through methods
When to Use Each Type
Both types have their place in data science:
- Use Data Containers When:
  - You need to group related data together
  - The data structure is simple and doesn’t need behavior
  - You’re passing data between functions

@dataclass
class ModelMetrics:
    accuracy: float
    precision: float
    recall: float
    f1_score: float
- Use Objects When:
  - You need to encapsulate complex behavior
  - You want to hide implementation details
  - You need to maintain state

class ModelEvaluator:
    def __init__(self, model, test_data):
        self._model = model
        self._data = test_data
        self._predictions = None

    def evaluate(self) -> ModelMetrics:
        self._predictions = self._model.predict(self._data.X)
        return self._calculate_metrics()
Key Rules for Clean Classes
When designing classes for data science workflows, consider these important principles:
- Differentiate Between Objects and Data Containers: Don’t mix the two styles - either make a pure data container or a proper object with behavior.
- Keep Classes Small and Focused with High Cohesion: A class should have a single responsibility, and all its methods should actually use its properties. Let’s look at a common example of a class that’s grown too large and lost cohesion:
# Poor example: Large class with low cohesion
class DataProcessor:
def __init__(self):
self.raw_data = None # Used by data methods
self.cleaned_data = None # Used by data methods
self.model = RandomForestClassifier() # Used by model methods
self.validation_results = {} # Used by validation methods
self.feature_columns = [] # Used by feature methods
self.target_column = None # Used by multiple methods
# Data loading methods - only use raw_data
def load_data(self, filepath: str) -> None:
self.raw_data = pd.read_csv(filepath)
def split_data(self) -> Tuple[pd.DataFrame, pd.DataFrame]:
return train_test_split(self.raw_data)
# Feature methods - only use feature columns
def select_features(self, correlation_threshold: float = 0.1) -> None:
correlations = self.raw_data.corr()
self.feature_columns = correlations[correlations > correlation_threshold].index.tolist()
def engineer_features(self) -> None:
# Complex feature engineering using feature_columns
# but not using other properties
pass
# Model methods - only use model property
def train_model(self) -> None:
self.model.fit(
self.cleaned_data[self.feature_columns],
self.cleaned_data[self.target_column]
)
def predict(self, X: pd.DataFrame) -> np.ndarray:
return self.model.predict(X)
# Validation methods - only use validation_results
def calculate_metrics(self) -> None:
predictions = self.model.predict(self.cleaned_data[self.feature_columns])
self.validation_results['accuracy'] = accuracy_score(
self.cleaned_data[self.target_column],
predictions
)
def generate_validation_report(self) -> Dict:
# Only uses validation_results, not other properties
return {
'model_performance': self.validation_results,
'timestamp': datetime.now()
}
# Better approach: Split into focused classes with high cohesion
class DataLoader:
def __init__(self, filepath: str):
self.data = self.load_data(filepath)
def load_data(self, filepath: str) -> pd.DataFrame:
return pd.read_csv(filepath)
def split_data(self) -> Tuple[pd.DataFrame, pd.DataFrame]:
return train_test_split(self.data)
class FeatureEngineer:
def __init__(self, data: pd.DataFrame):
self.data = data
self.feature_columns = []
def select_features(self, correlation_threshold: float = 0.1) -> List[str]:
correlations = self.data.corr()
self.feature_columns = (correlations[correlations > correlation_threshold]
.index.tolist())
return self.feature_columns
def engineer_features(self) -> pd.DataFrame:
# All methods work with the same properties
engineered_data = self.data.copy()
# Feature engineering logic
return engineered_data
class ModelManager:
def __init__(self, model: Optional[RandomForestClassifier] = None):
self.model = model or RandomForestClassifier()
self.metrics = {}
def train(self, features: pd.DataFrame, target: pd.Series) -> None:
self.model.fit(features, target)
def predict(self, features: pd.DataFrame) -> np.ndarray:
return self.model.predict(features)
def evaluate(self, features: pd.DataFrame, target: pd.Series) -> Dict:
predictions = self.predict(features)
self.metrics = {
'accuracy': accuracy_score(target, predictions),
'precision': precision_score(target, predictions, average='weighted'),
'recall': recall_score(target, predictions, average='weighted')
}
return self.metrics
# Usage example showing better organization and cohesion:
loader = DataLoader('data.csv')
train_data, test_data = loader.split_data()
engineer = FeatureEngineer(train_data)
feature_cols = engineer.select_features(correlation_threshold=0.2)
processed_train = engineer.engineer_features()
processed_test = FeatureEngineer(test_data).engineer_features()
model_manager = ModelManager()
model_manager.train(
processed_train[feature_cols],
train_data['target']
)
performance = model_manager.evaluate(
processed_test[feature_cols],
test_data['target']
)
- Follow the Law of Demeter: The Law of Demeter (also known as the principle of least knowledge) is like the “don’t talk to strangers” rule in programming. Imagine you’re working with a complex machine learning pipeline: if you want to know the accuracy of your model, you shouldn’t have to know that it’s stored inside a metrics object, which is inside a validation results object, which is inside your experiment object. Just as you wouldn’t ask your colleague’s manager’s supervisor about your colleague’s schedule – you’d ask your colleague directly.
In data science, we often deal with nested objects (experiments containing models containing metrics containing values). When we chain these objects together (like experiment.model.metrics.accuracy.value), we create brittle code that breaks when internal structures change. For instance, if someone decides to move accuracy metrics into a different object structure, every piece of code that relied on this exact chain would break.
# Example 1: Data Processing Pipeline that Violates Law of Demeter
class DataProcessor:
def __init__(self, dataset):
self.dataset = dataset
# Violates Law of Demeter - reaches through multiple objects
def get_prediction_accuracy(self):
return self.dataset.validation_results.model_metrics.accuracy_score.value
# Violates Law of Demeter - reaches into nested data structures
def get_feature_importance(self, feature_name):
return self.dataset.model.feature_importances_.coefficients[feature_name].weight
# Example 2: Better Design Following Law of Demeter
class Dataset:
def __init__(self):
self._validation_results = ValidationResults()
self._model = Model()
def get_accuracy(self):
"""Encapsulates access to nested accuracy value"""
return self._validation_results.get_accuracy()
def get_feature_importance(self, feature_name):
"""Delegates to model without exposing its internals"""
return self._model.get_feature_importance(feature_name)
class ValidationResults:
def __init__(self):
self._metrics = ModelMetrics()
def get_accuracy(self):
"""Provides clean interface to access accuracy"""
return self._metrics.get_accuracy()
class ModelMetrics:
def __init__(self):
self._accuracy = AccuracyMetric()
def get_accuracy(self):
return self._accuracy.get_value()
# Usage that follows Law of Demeter
class DataProcessor:
def __init__(self, dataset: Dataset):
self.dataset = dataset
def get_prediction_accuracy(self):
# Only talks to immediate friend (dataset)
return self.dataset.get_accuracy()
def get_feature_importance(self, feature_name):
# Delegates responsibility without knowing internals
return self.dataset.get_feature_importance(feature_name)
Note: The Law of Demeter specifically applies to property/attribute chaining, not method calls. This is important in data science because while model.fit(X).predict(X) is fine (these are method calls), model.hyperparameters.optimizer.learning_rate violates the law by reaching through multiple object properties.
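As a tiny runnable illustration of that distinction (a toy scikit-learn example rather than the objects from the prose above):
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Method chaining: acceptable - fit() returns the estimator itself,
# so each call still talks to the same immediate object
predictions = LogisticRegression().fit(X, y).predict(X)

# Property chaining such as model.hyperparameters.optimizer.learning_rate
# would reach through several internal objects and is the pattern to avoid.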
Using Polymorphism in Data Pipelines
Polymorphism is a powerful concept that helps avoid code duplication and create flexible data processing pipelines. Instead of using long if/else chains or switch statements, we can use polymorphic classes to handle different types of data processing elegantly.
Real-World Scenario
Imagine you’re building a data pipeline that needs to:
- Process data from different sources (CSV, JSON, databases)
- Apply different preprocessing strategies based on data types
- Handle multiple model training approaches
- Generate various types of reports
- Export results in different formats
Without polymorphism, you might end up with code full of conditional logic. Here’s what that looks like:
# Bad Example - No Polymorphism
class DataProcessor:
def process_data(self, data_type: str, data: Any) -> pd.DataFrame:
if data_type == 'csv':
# Process CSV data
return pd.read_csv(data)
elif data_type == 'json':
# Process JSON data
return pd.read_json(data)
elif data_type == 'sql':
# Process SQL data
return pd.read_sql(data, self.connection)
else:
raise ValueError(f"Unknown data type: {data_type}")
def preprocess_features(self, feature_type: str, data: pd.DataFrame) -> pd.DataFrame:
if feature_type == 'numeric':
# Scale numeric features
return self._scale_numeric(data)
elif feature_type == 'categorical':
# Encode categorical features
return self._encode_categorical(data)
elif feature_type == 'text':
# Process text features
return self._process_text(data)
else:
raise ValueError(f"Unknown feature type: {feature_type}")
# Problems with this approach:
# 1. Lots of conditional logic
# 2. Need to modify code to add new types
# 3. Hard to maintain and test
# 4. Code duplication likely
Better Approach Using Polymorphism
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Any
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
class DataLoader(ABC):
"""Abstract base class for data loading."""
@abstractmethod
def load_data(self) -> pd.DataFrame:
"""Load data from source."""
pass
@abstractmethod
def validate_schema(self, data: pd.DataFrame) -> bool:
"""Validate data schema."""
pass
class CSVLoader(DataLoader):
"""Handles loading from CSV files."""
def __init__(self, filepath: str, expected_columns: List[str]):
self.filepath = filepath
self.expected_columns = expected_columns
def load_data(self) -> pd.DataFrame:
"""Load data from CSV file."""
data = pd.read_csv(self.filepath)
if not self.validate_schema(data):
raise ValueError("Invalid CSV schema")
return data
def validate_schema(self, data: pd.DataFrame) -> bool:
"""Check if all expected columns are present."""
return all(col in data.columns for col in self.expected_columns)
class JSONLoader(DataLoader):
"""Handles loading from JSON files."""
def __init__(self, filepath: str, expected_keys: List[str]):
self.filepath = filepath
self.expected_keys = expected_keys
def load_data(self) -> pd.DataFrame:
"""Load data from JSON file."""
data = pd.read_json(self.filepath)
if not self.validate_schema(data):
raise ValueError("Invalid JSON schema")
return data
def validate_schema(self, data: pd.DataFrame) -> bool:
"""Check if all expected keys are present."""
return all(key in data.columns for key in self.expected_keys)
class SQLLoader(DataLoader):
"""Handles loading from SQL database."""
def __init__(self, connection, query: str, expected_columns: List[str]):
self.connection = connection
self.query = query
self.expected_columns = expected_columns
def load_data(self) -> pd.DataFrame:
"""Load data from SQL database."""
data = pd.read_sql(self.query, self.connection)
if not self.validate_schema(data):
raise ValueError("Invalid SQL result schema")
return data
def validate_schema(self, data: pd.DataFrame) -> bool:
"""Check if all expected columns are present."""
return all(col in data.columns for col in self.expected_columns)
class FeatureProcessor(ABC):
"""Abstract base class for feature processing."""
@abstractmethod
def fit(self, data: pd.DataFrame) -> 'FeatureProcessor':
"""Fit processor to data."""
pass
@abstractmethod
def transform(self, data: pd.DataFrame) -> pd.DataFrame:
"""Transform the data."""
pass
def fit_transform(self, data: pd.DataFrame) -> pd.DataFrame:
"""Fit and transform data."""
return self.fit(data).transform(data)
class NumericFeatureProcessor(FeatureProcessor):
"""Handles numeric feature processing."""
def __init__(self, columns: List[str]):
self.columns = columns
self.scaler = StandardScaler()
def fit(self, data: pd.DataFrame) -> 'NumericFeatureProcessor':
"""Fit scaler to numeric data."""
self.scaler.fit(data[self.columns])
return self
def transform(self, data: pd.DataFrame) -> pd.DataFrame:
"""Scale numeric features."""
data = data.copy()
data[self.columns] = self.scaler.transform(data[self.columns])
return data
class CategoricalFeatureProcessor(FeatureProcessor):
"""Handles categorical feature processing."""
def __init__(self, columns: List[str]):
self.columns = columns
self.encoders: Dict[str, LabelEncoder] = {}
def fit(self, data: pd.DataFrame) -> 'CategoricalFeatureProcessor':
"""Fit encoders to categorical data."""
for column in self.columns:
self.encoders[column] = LabelEncoder()
self.encoders[column].fit(data[column])
return self
def transform(self, data: pd.DataFrame) -> pd.DataFrame:
"""Encode categorical features."""
data = data.copy()
for column in self.columns:
data[column] = self.encoders[column].transform(data[column])
return data
class TextFeatureProcessor(FeatureProcessor):
"""Handles text feature processing."""
def __init__(self, columns: List[str]):
self.columns = columns
self.vectorizers: Dict[str, TfidfVectorizer] = {
col: TfidfVectorizer() for col in columns
}
def fit(self, data: pd.DataFrame) -> 'TextFeatureProcessor':
"""Fit vectorizers to text data."""
for column in self.columns:
self.vectorizers[column].fit(data[column].fillna(''))
return self
def transform(self, data: pd.DataFrame) -> pd.DataFrame:
"""Vectorize text features."""
data = data.copy()
for column in self.columns:
# Convert sparse matrix to dense and create new columns
vectorized = self.vectorizers[column].transform(data[column].fillna(''))
feature_names = self.vectorizers[column].get_feature_names_out()
# Create new column names
vector_cols = [f"{column}_{feat}" for feat in feature_names]
# Add vectorized features to dataframe
vector_df = pd.DataFrame(
vectorized.toarray(),
columns=vector_cols,
index=data.index
)
# Drop original column and add vectorized features
data = data.drop(column, axis=1)
data = pd.concat([data, vector_df], axis=1)
return data
class DataPipeline:
"""Manages the complete data processing pipeline."""
def __init__(self, loader: DataLoader):
self.loader = loader
self.processors: List[FeatureProcessor] = []
def add_processor(self, processor: FeatureProcessor) -> None:
"""Add a feature processor to the pipeline."""
self.processors.append(processor)
def process(self) -> pd.DataFrame:
"""Run the complete pipeline."""
# Load data
data = self.loader.load_data()
# Apply all processors in sequence
for processor in self.processors:
data = processor.fit_transform(data)
return data
# Usage example showing polymorphism in action
def process_customer_data(
data_source: str,
numeric_cols: List[str],
categorical_cols: List[str],
text_cols: List[str]
) -> pd.DataFrame:
"""Process customer data from various sources."""
# Choose appropriate loader based on data source
if data_source.endswith('.csv'):
loader = CSVLoader(
data_source,
expected_columns=numeric_cols + categorical_cols + text_cols
)
elif data_source.endswith('.json'):
loader = JSONLoader(
data_source,
expected_keys=numeric_cols + categorical_cols + text_cols
)
else:
raise ValueError(f"Unsupported data source: {data_source}")
# Create pipeline
pipeline = DataPipeline(loader)
# Add appropriate processors for each feature type
if numeric_cols:
pipeline.add_processor(NumericFeatureProcessor(numeric_cols))
if categorical_cols:
pipeline.add_processor(CategoricalFeatureProcessor(categorical_cols))
if text_cols:
pipeline.add_processor(TextFeatureProcessor(text_cols))
# Process the data
return pipeline.process()
# Example usage
customer_data = process_customer_data(
'customer_data.csv',
numeric_cols=['age', 'income', 'tenure'],
categorical_cols=['gender', 'location', 'segment'],
text_cols=['comments', 'feedback']
)
Remember: The key to effective polymorphism is designing clean, consistent interfaces that allow different implementations to be used interchangeably. This is especially important in data science pipelines where requirements and data sources often change.
SOLID Principles in Data Science
Single Responsibility Principle (SRP)
Think of SRP like different roles in a data science team. Just as you wouldn’t want one person to handle data cleaning, model training, deployment, AND business presentations, you shouldn’t have one class doing all these things.
Real-World Scenario
Imagine you’re building a customer churn prediction system. At first, it seems convenient to create one class that does everything:
- Loads customer data from various sources (CSV, databases, APIs)
- Handles missing values and outliers
- Engineers features
- Trains and validates models
- Generates PDF reports for stakeholders
- Saves models and results to different formats
The problems start emerging when:
- Your data engineer wants to modify how data is loaded from the database
- The business team requests changes to the PDF report format
- You need to add new feature engineering steps
- A team member wants to experiment with different model architectures
Each change requires modifying the same class, leading to:
- Merge conflicts when multiple team members work simultaneously
- Higher risk of breaking existing functionality
- Difficulty in testing individual components
- Code that’s hard to understand and maintain
Example Implementation
# Bad Example - Violating SRP
class ChurnPredictor:
def __init__(self, db_connection, model_path=None):
self.db = db_connection
self.data = None
self.model = self.load_model(model_path) if model_path else None
self.predictions = None
self.report_path = None
def load_data(self):
"""Loads data from multiple sources."""
sql_data = self.db.query("SELECT * FROM customers")
csv_data = pd.read_csv("additional_features.csv")
self.data = pd.merge(sql_data, csv_data, on="customer_id")
def preprocess_data(self):
"""Handles all preprocessing."""
# Fill missing values
self.data = self.data.fillna(self.data.mean())
# Handle outliers
z_scores = stats.zscore(self.data.select_dtypes(np.number))
self.data = self.data[(z_scores < 3).all(axis=1)]
# Feature engineering
self.data['account_age'] = (pd.Timestamp.now() -
pd.to_datetime(self.data['signup_date'])).dt.days
self.data['total_spend'] = self.data['monthly_spend'] * self.data['tenure']
def train_model(self):
"""Handles model training and validation."""
X = self.data.drop(['churned', 'customer_id'], axis=1)
y = self.data['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y)
self.model = RandomForestClassifier()
self.model.fit(X_train, y_train)
# Calculate and store metrics
self.metrics = {
'accuracy': accuracy_score(y_test, self.model.predict(X_test)),
'precision': precision_score(y_test, self.model.predict(X_test)),
'recall': recall_score(y_test, self.model.predict(X_test))
}
def generate_report(self):
"""Creates PDF report."""
plt.figure(figsize=(10, 6))
feature_imp = pd.DataFrame({
'feature': self.data.columns,
'importance': self.model.feature_importances_
}).sort_values('importance', ascending=False)
plt.bar(feature_imp['feature'][:10], feature_imp['importance'][:10])
plt.xticks(rotation=45)
plt.title("Top 10 Important Features")
plt.tight_layout()
# Save to PDF
plt.savefig("feature_importance.pdf")
def save_results(self):
"""Saves everything to different formats."""
# Save model
joblib.dump(self.model, 'model.joblib')
# Save predictions to CSV
pd.DataFrame({
'customer_id': self.data['customer_id'],
'churn_probability': self.predictions
}).to_csv('predictions.csv', index=False)
# Save metrics to JSON
with open('metrics.json', 'w') as f:
json.dump(self.metrics, f)
# Better Example - Following SRP
class DataLoader:
"""Responsible only for loading and combining data."""
def __init__(self, db_connection):
self.db = db_connection
def load_customer_data(self) -> pd.DataFrame:
"""Loads and combines data from all sources."""
sql_data = self.db.query("SELECT * FROM customers")
csv_data = pd.read_csv("additional_features.csv")
return pd.merge(sql_data, csv_data, on="customer_id")
class DataPreprocessor:
"""Handles all data preprocessing steps."""
def clean_data(self, data: pd.DataFrame) -> pd.DataFrame:
"""Handles missing values and outliers."""
# Create a copy to avoid modifying input
cleaned = data.copy()
# Fill missing values
numeric_cols = cleaned.select_dtypes(np.number).columns
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].mean())
# Remove outliers
z_scores = stats.zscore(cleaned[numeric_cols])
cleaned = cleaned[(z_scores < 3).all(axis=1)]
return cleaned
def engineer_features(self, data: pd.DataFrame) -> pd.DataFrame:
"""Creates new features."""
# Create a copy to avoid modifying input
featured = data.copy()
# Add new features
featured['account_age'] = (pd.Timestamp.now() -
pd.to_datetime(featured['signup_date'])).dt.days
featured['total_spend'] = featured['monthly_spend'] * featured['tenure']
return featured
class ModelTrainer:
"""Handles model training and evaluation."""
def __init__(self, model_type: str = 'random_forest'):
self.model_type = model_type
self.model = None
self.metrics = {}
def train(self, X: pd.DataFrame, y: pd.Series) -> None:
"""Trains the model."""
X_train, X_test, y_train, y_test = train_test_split(X, y)
if self.model_type == 'random_forest':
self.model = RandomForestClassifier()
else:
raise ValueError(f"Unknown model type: {self.model_type}")
self.model.fit(X_train, y_train)
self._calculate_metrics(X_test, y_test)
def _calculate_metrics(self, X_test: pd.DataFrame, y_test: pd.Series) -> None:
"""Calculates and stores performance metrics."""
predictions = self.model.predict(X_test)
self.metrics = {
'accuracy': accuracy_score(y_test, predictions),
'precision': precision_score(y_test, predictions),
'recall': recall_score(y_test, predictions)
}
def get_feature_importance(self, feature_names: List[str]) -> pd.DataFrame:
"""Returns feature importance data."""
if not self.model:
raise ValueError("Model not trained yet")
return pd.DataFrame({
'feature': feature_names,
'importance': self.model.feature_importances_
}).sort_values('importance', ascending=False)
class ReportGenerator:
"""Handles all reporting functionality."""
def generate_pdf_report(
self,
feature_importance: pd.DataFrame,
metrics: Dict[str, float],
output_path: str
) -> None:
"""Generates PDF report with visualizations."""
plt.figure(figsize=(10, 6))
plt.bar(feature_importance['feature'][:10],
feature_importance['importance'][:10])
plt.xticks(rotation=45)
plt.title("Top 10 Important Features")
plt.tight_layout()
plt.savefig(output_path)
# Could add more visualizations and metrics
class ResultSaver:
"""Handles saving all outputs."""
def save_model(self, model: BaseEstimator, path: str) -> None:
"""Saves the trained model."""
joblib.dump(model, path)
def save_predictions(
self,
customer_ids: np.ndarray,
predictions: np.ndarray,
path: str
) -> None:
"""Saves predictions to CSV."""
pd.DataFrame({
'customer_id': customer_ids,
'churn_probability': predictions
}).to_csv(path, index=False)
def save_metrics(self, metrics: Dict[str, float], path: str) -> None:
"""Saves metrics to JSON."""
with open(path, 'w') as f:
json.dump(metrics, f)
# Usage showing how the classes work together
def run_churn_prediction_pipeline(db_connection: DBConnection) -> None:
"""Orchestrates the churn prediction pipeline."""
# Load data
loader = DataLoader(db_connection)
raw_data = loader.load_customer_data()
# Preprocess data
preprocessor = DataPreprocessor()
cleaned_data = preprocessor.clean_data(raw_data)
featured_data = preprocessor.engineer_features(cleaned_data)
# Prepare features and target
X = featured_data.drop(['churned', 'customer_id'], axis=1)
y = featured_data['churned']
# Train model
trainer = ModelTrainer('random_forest')
trainer.train(X, y)
# Generate reports
feature_importance = trainer.get_feature_importance(X.columns)
report_gen = ReportGenerator()
report_gen.generate_pdf_report(
feature_importance,
trainer.metrics,
'churn_report.pdf'
)
# Save results
saver = ResultSaver()
saver.save_model(trainer.model, 'churn_model.joblib')
saver.save_metrics(trainer.metrics, 'metrics.json')
saver.save_predictions(
featured_data['customer_id'],
trainer.model.predict_proba(X)[:, 1],
'predictions.csv'
)
This refactored version demonstrates several benefits:
- Each class has a single, clear responsibility
- Changes to one aspect (e.g., reporting) don’t affect others
- Easy to test each component independently
- Easy to modify or extend individual components
- Clear dependencies between components
- Code is more organized and maintainable
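To make the "easy to test" benefit concrete, here is a minimal pytest-style sketch that exercises the DataPreprocessor class on its own - no database, model, or report involved. The churn.preprocessing import path is hypothetical; adjust it to wherever the class lives in your project.
import numpy as np
import pandas as pd
from churn.preprocessing import DataPreprocessor  # hypothetical module path

def test_clean_data_fills_missing_numeric_values():
    raw = pd.DataFrame({
        'monthly_spend': [10.0, np.nan, 30.0],
        'tenure': [1, 2, 3],
    })
    cleaned = DataPreprocessor().clean_data(raw)
    # Missing values are replaced with the column mean (20.0 here)
    assert cleaned['monthly_spend'].isna().sum() == 0
    assert cleaned.loc[1, 'monthly_spend'] == 20.0
    # The original input is left untouched because clean_data works on a copy
    assert raw['monthly_spend'].isna().sum() == 1
Because the class has no other responsibilities, the test needs nothing but a tiny in-memory DataFrame.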
Open/Closed Principle (OCP)
Think of OCP like building a feature engineering pipeline that you want to extend without modifying existing code. Imagine you’re maintaining a data transformation system for your team - when someone wants to add a new type of transformation, they should be able to do so without changing the code that’s already tested and working in production.
Real-World Scenario
You’re working on a large data science project where the feature engineering pipeline currently handles:
- Numeric feature scaling
- Missing value imputation
- Outlier handling
- Categorical encoding
Then new requirements start coming in:
- A colleague wants to add polynomial features for certain variables
- Another team member needs to add custom domain-specific transformations
- The business team requests special handling for time-based features
- You need to add feature selection based on correlation
Without OCP, you’d need to:
- Modify the existing transformation code each time
- Risk breaking the working pipeline
- Retest everything after each change
- Deal with an increasingly complex codebase
Implementation Example
# Bad Example - Violating OCP
class FeatureTransformer:
"""This violates OCP - need to modify code for each new transformation."""
def __init__(self):
self.scaler = StandardScaler()
self.imputer = SimpleImputer()
def transform_features(self, data: pd.DataFrame, transform_type: str) -> pd.DataFrame:
if transform_type == 'scale':
return pd.DataFrame(
self.scaler.fit_transform(data),
columns=data.columns
)
elif transform_type == 'impute':
return pd.DataFrame(
self.imputer.fit_transform(data),
columns=data.columns
)
elif transform_type == 'polynomial':
# Need to modify this file to add new transformations!
poly = PolynomialFeatures(degree=2)
return pd.DataFrame(
poly.fit_transform(data),
columns=[f"poly_{i}" for i in range(poly.n_output_features_)]
)
else:
raise ValueError(f"Unknown transformation: {transform_type}")
# Better Example - Following OCP
from abc import ABC, abstractmethod
from typing import List, Dict, Optional
class FeatureTransformation(ABC):
"""Abstract base class for all transformations."""
@abstractmethod
def fit(self, X: pd.DataFrame) -> 'FeatureTransformation':
"""Fit the transformation to the data."""
pass
@abstractmethod
def transform(self, X: pd.DataFrame) -> pd.DataFrame:
"""Apply the transformation to the data."""
pass
@abstractmethod
def get_feature_names(self, input_features: List[str]) -> List[str]:
"""Get names of transformed features."""
pass
class StandardScalerTransformation(FeatureTransformation):
"""Standardizes numeric features."""
def __init__(self, columns: Optional[List[str]] = None):
self.columns = columns
self.scaler = StandardScaler()
self._fitted_columns = None
def fit(self, X: pd.DataFrame) -> 'StandardScalerTransformation':
"""Fit scaler to data."""
self._fitted_columns = self.columns or X.select_dtypes(include=[np.number]).columns
self.scaler.fit(X[self._fitted_columns])
return self
def transform(self, X: pd.DataFrame) -> pd.DataFrame:
"""Apply scaling transformation."""
X_copy = X.copy()
X_copy[self._fitted_columns] = self.scaler.transform(X[self._fitted_columns])
return X_copy
def get_feature_names(self, input_features: List[str]) -> List[str]:
"""Return scaled feature names."""
return [f"scaled_{col}" for col in self._fitted_columns]
class PolynomialTransformation(FeatureTransformation):
"""Creates polynomial features."""
def __init__(self, degree: int = 2, columns: Optional[List[str]] = None):
self.degree = degree
self.columns = columns
self.poly = PolynomialFeatures(degree=degree)
self._fitted_columns = None
self._feature_names = None
def fit(self, X: pd.DataFrame) -> 'PolynomialTransformation':
"""Fit polynomial features."""
self._fitted_columns = self.columns or X.select_dtypes(include=[np.number]).columns
self.poly.fit(X[self._fitted_columns])
return self
def transform(self, X: pd.DataFrame) -> pd.DataFrame:
"""Generate polynomial features."""
poly_features = self.poly.transform(X[self._fitted_columns])
feature_names = self.get_feature_names(self._fitted_columns)
# Create new dataframe with original and polynomial features
X_copy = X.copy()
poly_df = pd.DataFrame(poly_features, columns=feature_names, index=X.index)
return pd.concat([X_copy, poly_df], axis=1)
def get_feature_names(self, input_features: List[str]) -> List[str]:
"""Get polynomial feature names."""
return [f"poly_{i}" for i in range(self.poly.n_output_features_)]
class OutlierTransformation(FeatureTransformation):
"""Handles outliers using IQR method."""
def __init__(self, columns: Optional[List[str]] = None, threshold: float = 1.5):
self.columns = columns
self.threshold = threshold
self.bounds = {}
def fit(self, X: pd.DataFrame) -> 'OutlierTransformation':
"""Calculate outlier bounds."""
self._fitted_columns = self.columns or X.select_dtypes(include=[np.number]).columns
for column in self._fitted_columns:
Q1 = X[column].quantile(0.25)
Q3 = X[column].quantile(0.75)
IQR = Q3 - Q1
self.bounds[column] = {
'lower': Q1 - self.threshold * IQR,
'upper': Q3 + self.threshold * IQR
}
return self
def transform(self, X: pd.DataFrame) -> pd.DataFrame:
"""Cap outliers to bounds."""
X_copy = X.copy()
for column in self._fitted_columns:
bounds = self.bounds[column]
X_copy[column] = X_copy[column].clip(bounds['lower'], bounds['upper'])
return X_copy
def get_feature_names(self, input_features: List[str]) -> List[str]:
"""Return feature names."""
return [f"outlier_handled_{col}" for col in self._fitted_columns]
class FeatureTransformationPipeline:
"""Pipeline that can accommodate any number of transformations."""
def __init__(self):
self.transformations: List[FeatureTransformation] = []
self.feature_names: List[str] = []
def add_transformation(self, transformation: FeatureTransformation) -> None:
"""Add a new transformation to the pipeline."""
self.transformations.append(transformation)
def fit_transform(self, X: pd.DataFrame) -> pd.DataFrame:
"""Apply all transformations in sequence."""
result = X.copy()
self.feature_names = list(X.columns)
for transformation in self.transformations:
transformation.fit(result)
result = transformation.transform(result)
self.feature_names.extend(
transformation.get_feature_names(self.feature_names)
)
return result
# Usage showing extensibility
def prepare_features(data: pd.DataFrame) -> pd.DataFrame:
"""Prepare features using various transformations."""
pipeline = FeatureTransformationPipeline()
# Add standard transformations
pipeline.add_transformation(StandardScalerTransformation())
pipeline.add_transformation(OutlierTransformation())
# Easily add new transformation without changing existing code
pipeline.add_transformation(
PolynomialTransformation(degree=2, columns=['age', 'income'])
)
return pipeline.fit_transform(data)
# Adding new functionality is just creating a new transformer
class TimeFeatureTransformation(FeatureTransformation):
"""Extracts features from datetime columns."""
def __init__(self, datetime_columns: List[str]):
self.datetime_columns = datetime_columns
def fit(self, X: pd.DataFrame) -> 'TimeFeatureTransformation':
"""Nothing to fit."""
return self
def transform(self, X: pd.DataFrame) -> pd.DataFrame:
"""Extract time-based features."""
X_copy = X.copy()
for column in self.datetime_columns:
if not pd.api.types.is_datetime64_any_dtype(X_copy[column]):
X_copy[column] = pd.to_datetime(X_copy[column])
X_copy[f"{column}_hour"] = X_copy[column].dt.hour
X_copy[f"{column}_day"] = X_copy[column].dt.day
X_copy[f"{column}_month"] = X_copy[column].dt.month
X_copy[f"{column}_day_of_week"] = X_copy[column].dt.dayofweek
return X_copy
def get_feature_names(self, input_features: List[str]) -> List[str]:
"""Get names of time-based features."""
features = []
for col in self.datetime_columns:
features.extend([
f"{col}_hour",
f"{col}_day",
f"{col}_month",
f"{col}_day_of_week"
])
return features
# Example usage with new transformer
pipeline = FeatureTransformationPipeline()
pipeline.add_transformation(StandardScalerTransformation())
pipeline.add_transformation(OutlierTransformation())
pipeline.add_transformation(TimeFeatureTransformation(['transaction_date']))
# Process features
processed_data = pipeline.fit_transform(raw_data)
Liskov Substitution Principle (LSP)
Imagine you’re building a machine learning system that processes customer data. You have a prediction API that expects all models to behave the same way, regardless of whether they’re simple scikit-learn models or complex deep learning ones.
Real-World Scenario
Your team built an API endpoint /predict that accepts customer features and returns churn predictions. Initially, it worked with a simple logistic regression model:
response = model.predict_proba(customer_features)
churn_risk = response[:, 1] # Get probability of churn
Then things got complicated:
- The deep learning team created a better model, but it returns probabilities differently
- AutoML tools produced models with different prediction methods
- Your custom ensemble model needs special preprocessing
Without LSP:
- Different model types require different handling in your API code
- You need multiple endpoints or complex if/else logic
- Code becomes brittle and hard to maintain
- Testing becomes complicated
- Adding new model types requires modifying existing code
With LSP:
- All models follow the same contract regardless of their internal implementation
- Your API code remains clean and simple
- Models can be swapped without changing the surrounding system
- Testing is straightforward
- Adding new model types is just creating a new class
Why LSP Matters
Here’s a common violation of LSP in data science:
# Violates LSP - different interfaces for different model types
class SklearnModel:
def fit(self, X: pd.DataFrame, y: pd.Series) -> None:
self.model.fit(X, y)
def predict(self, X: pd.DataFrame) -> np.ndarray:
return self.model.predict(X)
class KerasModel:
def train(self, X: pd.DataFrame, y: pd.Series, epochs: int = 10) -> None:
# Different method name and signature
self.model.fit(X.values, y.values, epochs=epochs)
def infer(self, X: pd.DataFrame) -> np.ndarray:
# Different method name
return self.model.predict(X.values)
class OnlineModel:
def update(self, X: pd.DataFrame, y: pd.Series) -> None:
# Completely different interface
for x, y_true in zip(X.values, y.values):
self.model.partial_fit(x.reshape(1, -1), [y_true])
def predict_one(self, x: np.ndarray) -> float:
# Incompatible interface
return self.model.predict(x.reshape(1, -1))[0]
# This leads to complex, conditional code
def train_model(model, X, y):
if isinstance(model, SklearnModel):
model.fit(X, y)
elif isinstance(model, KerasModel):
model.train(X, y)
elif isinstance(model, OnlineModel):
model.update(X, y)
else:
raise ValueError("Unknown model type")
Better Approach - Following LSP
First, one more subtle violation worth recognizing: subclasses that silently change the input and output assumptions of the base class.
# Bad Example - Violates LSP
class MLModel:
def train(self, X, y):
pass
def predict(self, X):
pass
class SklearnModel(MLModel):
def __init__(self, model):
self.model = model
def train(self, X: pd.DataFrame, y: pd.Series):
self.model.fit(X, y)
def predict(self, X: pd.DataFrame):
return self.model.predict(X)
class DeepLearningModel(MLModel):
def __init__(self, model):
self.model = model
def train(self, X: pd.DataFrame, y: pd.Series):
# Violates LSP: Changes input assumptions
X = torch.tensor(X.values).float()
y = torch.tensor(y.values).float()
self.model.fit(X, y) # Expects tensors, not pandas objects
def predict(self, X: pd.DataFrame):
# Violates LSP: Changes output format
X = torch.tensor(X.values).float()
predictions = self.model.predict(X)
return predictions.numpy() # Returns numpy array instead of pandas
Now the LSP-compliant version: every wrapper keeps the same pandas-in, pandas-out contract, so callers never need to know which library sits underneath.
# Better Example - Following LSP
from abc import ABC, abstractmethod
import pandas as pd
import numpy as np
import torch
from sklearn.ensemble import RandomForestClassifier
from typing import Union, Tuple
class MLModel(ABC):
"""Base class defining the contract for all ML models"""
@abstractmethod
def train(self, X: pd.DataFrame, y: pd.Series) -> None:
"""Train the model on pandas DataFrame/Series"""
pass
@abstractmethod
def predict(self, X: pd.DataFrame) -> pd.Series:
"""Make predictions, always return pandas Series"""
pass
@abstractmethod
def predict_proba(self, X: pd.DataFrame) -> pd.DataFrame:
"""Get probabilities, always return pandas DataFrame"""
pass
class SklearnModelWrapper(MLModel):
"""Wrapper for scikit-learn models"""
def __init__(self, model: RandomForestClassifier):
self.model = model
self._feature_names = None
def train(self, X: pd.DataFrame, y: pd.Series) -> None:
"""Maintains pandas interface"""
self._feature_names = X.columns
self.model.fit(X, y)
def predict(self, X: pd.DataFrame) -> pd.Series:
"""Always returns pandas Series with index matching input"""
predictions = self.model.predict(X)
return pd.Series(predictions, index=X.index)
def predict_proba(self, X: pd.DataFrame) -> pd.DataFrame:
"""Always returns pandas DataFrame with probabilities"""
probs = self.model.predict_proba(X)
return pd.DataFrame(
probs,
index=X.index,
columns=self.model.classes_
)
class TorchModelWrapper(MLModel):
"""Wrapper for PyTorch models"""
def __init__(self, model: torch.nn.Module):
self.model = model
self._feature_names = None
self.classes_ = None
def _convert_to_tensor(self, X: pd.DataFrame) -> torch.Tensor:
"""Handle data conversion internally"""
return torch.tensor(X.values).float()
def train(self, X: pd.DataFrame, y: pd.Series) -> None:
"""Maintains same interface as other models"""
self._feature_names = X.columns
self.classes_ = sorted(y.unique())
# Handle conversion internally
X_tensor = self._convert_to_tensor(X)
y_tensor = torch.tensor(y.values).float()
# Training loop (forward pass, loss computation, optimizer steps) would go here;
# a plain torch.nn.Module has no fit() method of its own
def predict(self, X: pd.DataFrame) -> pd.Series:
"""Returns pandas Series like other models"""
X_tensor = self._convert_to_tensor(X)
with torch.no_grad():
predictions = self.model(X_tensor)
pred_labels = predictions.argmax(dim=1).numpy()
return pd.Series(
[self.classes_[i] for i in pred_labels],
index=X.index
)
def predict_proba(self, X: pd.DataFrame) -> pd.DataFrame:
"""Returns probabilities in same format as sklearn"""
X_tensor = self._convert_to_tensor(X)
with torch.no_grad():
probs = torch.softmax(self.model(X_tensor), dim=1).numpy()
return pd.DataFrame(
probs,
index=X.index,
columns=self.classes_
)
# The magic of LSP: All models work the same way in your pipeline
def evaluate_model(model: MLModel, X_test: pd.DataFrame, y_test: pd.Series) -> dict:
"""Works with ANY model that follows the contract"""
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
return {
'accuracy': accuracy_score(y_test, predictions),
'roc_auc': roc_auc_score(y_test, probabilities)
}
# Usage showing true substitutability
models = {
'sklearn': SklearnModelWrapper(RandomForestClassifier()),
'pytorch': TorchModelWrapper(torch.nn.Sequential(...)),
'custom': CustomModelWrapper(MyCustomModel())
}
results = {}
for name, model in models.items():
# Train with same interface
model.train(X_train, y_train)
# Evaluate with same interface
results[name] = evaluate_model(model, X_test, y_test)
# All models can be used interchangeably in production
selected_model = models['sklearn'] # Could be any model type
prediction_service = PredictionService(selected_model) # Works with any model
Remember: The key to following LSP in data science is ensuring that all specialized types (different model implementations) can be used anywhere the base type (generic model interface) is expected, without breaking the application’s behavior.
Interface Segregation Principle (ISP)
Think of ISP like a modular data science toolkit. Not every project needs every tool. Some projects might need data cleaning but not visualization, others might need model training but not deployment capabilities. ISP suggests creating smaller, focused interfaces rather than one giant interface that does everything.
Real-World Scenario
You’re creating a data processing library for your organization. Different teams have different needs:
- The research team only needs data loading and transformation
- The production team needs data validation and database connectivity
- The visualization team needs plotting and reporting capabilities
Without ISP, every team would need to implement all functionality, even parts they don’t use. With ISP, teams only implement what they need – like choosing specific packages from PyPI instead of installing the entire scientific Python stack.
The ISP states that clients should not be forced to depend on interfaces they don’t use. In data science, this often applies to data processing and model interfaces.
Common Violation:
# Violates ISP - forces classes to implement methods they don't need
class Database:
def store_data(self, data: pd.DataFrame) -> None:
pass
def connect(self, uri: str) -> None:
pass
def disconnect(self) -> None:
pass
def validate_schema(self, data: pd.DataFrame) -> bool:
pass
# Classes implementing Database are forced to implement
# all methods, even if they don't need connection handling
class InMemoryDatabase(Database):
def store_data(self, data: pd.DataFrame) -> None:
# Can store data
self.data = data
def connect(self, uri: str) -> None:
# Doesn't need connection but forced to implement
pass
def disconnect(self) -> None:
# Doesn't need disconnection but forced to implement
pass
def validate_schema(self, data: pd.DataFrame) -> bool:
return True # Might need this
Better Approach - Following ISP:
# Split interfaces based on functionality
class DataStorage:
"""Interface for data storage operations."""
def store_data(self, data: pd.DataFrame) -> None:
"""Stores data."""
pass
class DatabaseConnection:
"""Interface for database connection operations."""
def connect(self, uri: str) -> None:
"""Establishes connection."""
pass
def disconnect(self) -> None:
"""Closes connection."""
pass
class SchemaValidator:
"""Interface for schema validation."""
def validate_schema(self, data: pd.DataFrame) -> bool:
"""Validates data schema."""
pass
# Now classes can implement only what they need
class InMemoryStorage(DataStorage):
"""Simple in-memory storage."""
def __init__(self):
self.data = None
def store_data(self, data: pd.DataFrame) -> None:
self.data = data.copy()
class SQLDatabase(DataStorage, DatabaseConnection, SchemaValidator):
"""Full database implementation needing all functionality."""
def __init__(self, expected_columns: Optional[List[str]] = None):
self.connection = None
self.schema = expected_columns or []  # columns required by validate_schema
def connect(self, uri: str) -> None:
self.connection = create_engine(uri)
def disconnect(self) -> None:
if self.connection:
self.connection.dispose()
def store_data(self, data: pd.DataFrame) -> None:
if self.validate_schema(data):
data.to_sql('table_name', self.connection)
def validate_schema(self, data: pd.DataFrame) -> bool:
return all(col in data.columns for col in self.schema)
# Usage showing flexibility of segregated interfaces
def store_training_data(
storage: DataStorage,
data: pd.DataFrame
) -> None:
"""Only needs data storage functionality."""
storage.store_data(data)
def process_database_data(
db: SQLDatabase,  # needs connection handling, schema validation, and storage - not just one of them
uri: str,
data: pd.DataFrame
) -> None:
"""Needs full database functionality."""
db.connect(uri)
if db.validate_schema(data):
db.store_data(data)
db.disconnect()
# Can use either implementation as needed
in_memory = InMemoryStorage()
sql_db = SQLDatabase()
store_training_data(in_memory, data) # Works with simple storage
process_database_data(sql_db, "postgresql://...", data) # Works with full database
Dependency Inversion Principle (DIP)
Think of DIP like scikit-learn’s Pipeline class. It doesn’t care about the specific preprocessors or models you use – it works with any estimator that follows the right interface. High-level components (like Pipeline) depend on abstractions, not concrete implementations.
Real-World Scenario
You’re building an automated machine learning system that needs to:
- Try different preprocessors (StandardScaler, RobustScaler, etc.)
- Test various models (RandomForest, XGBoost, etc.)
- Use different validation strategies (cross-validation, holdout, etc.)
Without DIP, your system would be tightly coupled to specific implementations. With DIP, it works with any components that follow the right interfaces – just like how scikit-learn’s GridSearchCV works with any estimator. This makes it easy to:
- Add new preprocessing methods
- Try new models
- Implement custom validation strategies
The DIP states that high-level modules should not depend on low-level modules; both should depend on abstractions. Let’s see how this applies in data science pipelines.
Common Violation:
# Violates DIP - high-level pipeline depends on concrete implementations
class ModelPipeline:
def __init__(self):
# Direct dependencies on concrete classes
self.preprocessor = StandardScaler()
self.model = RandomForestClassifier()
self.validator = CrossValidator()
def run_pipeline(self, data: pd.DataFrame, target: pd.Series) -> Dict[str, float]:
# Tightly coupled to specific implementations
scaled_data = self.preprocessor.fit_transform(data)
self.model.fit(scaled_data, target)
return self.validator.validate(self.model, scaled_data, target)
Better Approach - Following DIP:
# Define abstractions
from typing import Protocol
class Preprocessor(Protocol):
def fit_transform(self, data: pd.DataFrame) -> pd.DataFrame:
...
class Model(Protocol):
def fit(self, X: pd.DataFrame, y: pd.Series) -> None:
...
def predict(self, X: pd.DataFrame) -> np.ndarray:
...
class Validator(Protocol):
def validate(
self,
model: Model,
X: pd.DataFrame,
y: pd.Series
) -> Dict[str, float]:
...
# Concrete implementations depend on abstractions
class StandardScalerPreprocessor:
"""Concrete preprocessor implementation."""
def __init__(self):
self.scaler = StandardScaler()
def fit_transform(self, data: pd.DataFrame) -> pd.DataFrame:
return pd.DataFrame(
self.scaler.fit_transform(data),
columns=data.columns
)
class RandomForestModel:
"""Concrete model implementation."""
def __init__(self):
self.model = RandomForestClassifier()
def fit(self, X: pd.DataFrame, y: pd.Series) -> None:
self.model.fit(X, y)
def predict(self, X: pd.DataFrame) -> np.ndarray:
return self.model.predict(X)
class CrossValidationValidator:
"""Concrete validator implementation."""
def validate(
self,
model: Model,
X: pd.DataFrame,
y: pd.Series
) -> Dict[str, float]:
cv_scores = cross_val_score(model, X, y, cv=5)
return {
'mean_cv_score': cv_scores.mean(),
'std_cv_score': cv_scores.std()
}
# High-level module depends on abstractions
class ModelPipeline:
"""Pipeline depending on abstractions, not concrete implementations."""
def __init__(
self,
preprocessor: Preprocessor,
model: Model,
validator: Validator
):
self.preprocessor = preprocessor
self.model = model
self.validator = validator
def run_pipeline(self, data: pd.DataFrame, target: pd.Series) -> Dict[str, float]:
"""Runs the pipeline using abstractions."""
processed_data = self.preprocessor.fit_transform(data)
self.model.fit(processed_data, target)
return self.validator.validate(self.model, processed_data, target)
# Usage showing flexibility and loose coupling
# Can easily swap implementations without changing pipeline
standard_pipeline = ModelPipeline(
preprocessor=StandardScalerPreprocessor(),
model=RandomForestModel(),
validator=CrossValidationValidator()
)
# Could easily create alternative pipeline with different implementations
robust_pipeline = ModelPipeline(
preprocessor=RobustScalerPreprocessor(), # Different preprocessor
model=XGBoostModel(), # Different model
validator=BootstrapValidator() # Different validator
)
Clean Code Checklist for Data Scientists
Using the Clean Code Checklist: An Iterative Approach
This checklist is not meant to be a strict set of requirements that must all be met before committing code. Instead, it serves as a guide for progressive improvement after you’ve got your code working.
When to Use This Checklist
- ✅ After your initial code is functioning correctly
- ✅ During code review sessions
- ✅ When revisiting older notebooks or scripts
- ✅ Before sharing code with teammates
How to Use It
1. Don't Try to Perfect Everything at Once
- Pick 1-2 items to focus on each time you revisit your code
- Start with the most impactful improvements for your specific situation
2. Progressive Enhancement
- Each time you interact with your code, make it a bit cleaner
- Focus on areas you're actively modifying
- Gradually improve naming, documentation, and structure
3. Practical Approach
# Initial working version
def p(d):
return d.fillna(0)
# First improvement: Better naming
def process_missing_values(data):
return data.fillna(0)
# Later improvement: Add type hints and documentation
def process_missing_values(data: pd.DataFrame) -> pd.DataFrame:
"""Fill missing values with zeros in the dataset."""
return data.fillna(0)
Remember: The goal is continuous improvement, not perfection. Use this checklist as a reference for making incremental enhancements to your code over time.
1. Naming 🏷️
Variables and DataFrames:
- Uses descriptive nouns (e.g., customer_data instead of df)
- Boolean variables start with is_, has_, or similar (e.g., is_outlier)
- DataFrame names indicate their content (e.g., raw_sales_data, cleaned_features)
- Follows Python naming convention (snake_case)
- Avoids abbreviations (e.g., customer_count instead of cust_cnt)
Functions:
- Uses verbs that describe the action (e.g., calculate_mean_return instead of mean_ret)
- Name reflects the level of abstraction (e.g., train_model vs fit_random_forest)
- Clearly indicates any data modifications (e.g., normalize_features vs process_features)
Classes:
- Uses nouns describing the entity (e.g., DataCleaner, ModelEvaluator)
- Names reflect single responsibility (e.g., OutlierDetector instead of DataHandler)
2. Function Design 🔧
Structure:
- Each function does one thing
- Function length is reasonable (typically < 50 lines)
- Returns clear, consistent data types
- Uses type hints for parameters and return values
- Includes docstrings with examples for complex functions
Parameters:
- Limits number of parameters (ideally ≤ 3)
- Uses dataclasses or configuration objects for multiple parameters
- Provides default values where appropriate
- Validates input parameters
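For the dataclass/configuration item above, a small sketch (all parameter names are illustrative): grouping related settings into one typed object keeps the signature short and self-documenting.
from dataclasses import dataclass
from typing import Optional

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

@dataclass
class TrainingConfig:
    """Groups related training parameters instead of passing them one by one."""
    target_column: str = 'churned'
    n_estimators: int = 100
    max_depth: Optional[int] = None
    random_state: int = 42

def train_churn_model(data: pd.DataFrame, config: TrainingConfig) -> RandomForestClassifier:
    """Trains a classifier from a single, explicit configuration object."""
    features = data.drop(columns=[config.target_column])
    target = data[config.target_column]
    model = RandomForestClassifier(
        n_estimators=config.n_estimators,
        max_depth=config.max_depth,
        random_state=config.random_state,
    )
    model.fit(features, target)
    return model
A call like train_churn_model(customer_data, TrainingConfig(n_estimators=300)) then reads almost like documentation.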
3. Code Organization 📁
Notebook Structure:
- Separates imports, configurations, and main code
- Groups related cells together
- Includes markdown documentation between logical sections
- Moves reusable functions to separate modules
Script Structure:
- Uses clear section separation
- Follows a logical flow (e.g., data loading → preprocessing → modeling)
- Places utility functions in separate modules
- Uses if __name__ == '__main__' for script execution
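As a sketch of that script structure - imports at the top, a main function, and the __main__ guard - reusing the hypothetical churn package and the illustrative 'churned' target column:
"""Train the churn model from the command line."""
import logging

import pandas as pd

from churn.preprocessing import DataPreprocessor  # hypothetical project modules
from churn.training import ModelTrainer

logger = logging.getLogger(__name__)

def main(data_path: str = 'customer_data.csv') -> None:
    """Load data, preprocess it, and train the model."""
    raw_data = pd.read_csv(data_path)
    cleaned_data = DataPreprocessor().clean_data(raw_data)
    trainer = ModelTrainer('random_forest')
    trainer.train(cleaned_data.drop(columns=['churned']), cleaned_data['churned'])
    logger.info("Training metrics: %s", trainer.metrics)

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    main()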
4. Error Handling and Data Validation ⚠️
- Validates input data early
- Uses appropriate error types
- Includes informative error messages
- Handles missing values explicitly
- Checks for data leakage in preprocessing steps
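A small sketch of what validating early can look like; the required columns are of course project-specific:
import pandas as pd

REQUIRED_COLUMNS = ['customer_id', 'monthly_spend', 'tenure']  # illustrative schema

def validate_customer_data(data: pd.DataFrame) -> None:
    """Fail fast with a clear message instead of deep inside the pipeline."""
    if data.empty:
        raise ValueError("Input DataFrame is empty - check the upstream data load")
    missing = [col for col in REQUIRED_COLUMNS if col not in data.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")
    if data['customer_id'].duplicated().any():
        raise ValueError("Duplicate customer_id values found - deduplicate before preprocessing")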
5. Comments and Documentation 📝
- Includes docstrings for functions and classes
- Documents complex algorithms or business logic
- Explains the ‘why’ not the ‘what’
- Removes commented-out code
- Uses TODO comments sparingly and meaningfully
6. Code Style and Formatting 🎨
- Follows PEP 8 guidelines
- Uses consistent indentation
- Keeps lines at reasonable length (≤ 88 characters)
- Uses blank lines to separate logical sections
- Aligns related code elements
7. Data Science Specific 🔬
Feature Engineering:
- Uses descriptive feature names
- Documents feature transformations
- Maintains feature creation reproducibility
- Tracks feature dependencies
Model Development:
- Sets random seeds for reproducibility
- Separates model training from evaluation
- Documents model parameters and reasoning
- Implements cross-validation appropriately
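A minimal sketch of the reproducibility and separation items, assuming a 'churned' target column: seeds are defined in one place and passed explicitly, and training and evaluation live in separate functions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)  # covers plain numpy sampling elsewhere in the workflow

def train_model(features: pd.DataFrame, target: pd.Series) -> RandomForestClassifier:
    """Training only - evaluation lives in its own function."""
    model = RandomForestClassifier(random_state=RANDOM_SEED)
    model.fit(features, target)
    return model

def evaluate_model(model: RandomForestClassifier, features: pd.DataFrame, target: pd.Series) -> float:
    """Evaluation only, so it can be reused for any trained model."""
    return accuracy_score(target, model.predict(features))

def run_experiment(data: pd.DataFrame) -> float:
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(columns=['churned']), data['churned'],
        test_size=0.2, random_state=RANDOM_SEED,  # reproducible split
    )
    model = train_model(X_train, y_train)
    return evaluate_model(model, X_test, y_test)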
Pipeline Design:
- Creates modular transformation steps
- Handles categorical and numerical features separately
- Prevents data leakage
- Makes pipelines serializable
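These items map almost directly onto scikit-learn's Pipeline and ColumnTransformer: numeric and categorical columns get their own steps, fitting only on the training split keeps test-set statistics out of the transformers, and the fitted object serializes as one unit. A sketch with illustrative column names:
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_columns = ['age', 'income', 'tenure']            # illustrative columns
categorical_columns = ['gender', 'location', 'segment']

preprocessing = ColumnTransformer(transformers=[
    ('numeric', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), numeric_columns),
    ('categorical', OneHotEncoder(handle_unknown='ignore'), categorical_columns),
])

churn_pipeline = Pipeline([
    ('preprocess', preprocessing),
    ('model', RandomForestClassifier(random_state=42)),
])

# Fit on the training split only, then persist the whole fitted pipeline:
# churn_pipeline.fit(X_train, y_train)
# joblib.dump(churn_pipeline, 'churn_pipeline.joblib')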
8. Testing and Validation 🧪
- Includes basic unit tests for critical functions
- Validates data preprocessing steps
- Checks model performance metrics
- Tests edge cases in data transformations
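Edge cases in transformations are where silent bugs hide, so even a couple of targeted tests pay off. A pytest sketch around a small illustrative helper:
import pandas as pd
import pytest

def clip_outliers(values: pd.Series, lower: float, upper: float) -> pd.Series:
    """Small example transformation: cap values to a fixed range."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    return values.clip(lower, upper)

def test_clip_outliers_caps_extreme_values():
    result = clip_outliers(pd.Series([-10.0, 0.5, 99.0]), lower=0.0, upper=1.0)
    assert result.tolist() == [0.0, 0.5, 1.0]

def test_clip_outliers_handles_empty_series():
    result = clip_outliers(pd.Series([], dtype=float), lower=0.0, upper=1.0)
    assert result.empty

def test_clip_outliers_rejects_inverted_bounds():
    with pytest.raises(ValueError):
        clip_outliers(pd.Series([1.0]), lower=1.0, upper=0.0)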
9. Version Control 🔄
- Uses clear, descriptive commit messages
- Separates model iterations in version control
- Tracks dependencies (e.g., requirements.txt or environment.yml)
- Documents environment setup
10. Performance and Efficiency ⚡
- Avoids unnecessary data copies
- Uses appropriate data types
- Implements efficient data transformations
- Considers memory usage for large datasets
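A quick sketch of the usual memory wins - categorical dtypes for low-cardinality strings and downcast numerics:
import pandas as pd

def optimize_dtypes(data: pd.DataFrame) -> pd.DataFrame:
    """Reduce memory usage by choosing tighter dtypes; returns a new DataFrame."""
    optimized = data.copy()
    for column in optimized.columns:
        col = optimized[column]
        # Low-cardinality strings are far smaller as categoricals
        if col.dtype == object and col.nunique() < 0.5 * len(col):
            optimized[column] = col.astype('category')
        # Downcast numerics to the smallest type that still fits the values
        elif pd.api.types.is_integer_dtype(col):
            optimized[column] = pd.to_numeric(col, downcast='integer')
        elif pd.api.types.is_float_dtype(col):
            optimized[column] = pd.to_numeric(col, downcast='float')
    return optimized

# Compare data.memory_usage(deep=True).sum() before and after to see the effect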
Use this checklist before committing code or when reviewing existing code. Not every item will apply to every situation, but it provides a framework for writing cleaner, more maintainable data science code.
Conclusion: The Journey to Cleaner Code
Writing clean code in data science is a journey, not a destination. Much like how we iteratively improve our models, we should approach code quality as a continuous improvement process. Let’s reflect on why this matters and how to move forward.
The Reality of Data Science Code
Most data scientists don’t write clean code from the start - and that’s perfectly fine. When exploring data or prototyping models, our first priority is often to get something working. We might start with a messy Jupyter notebook full of quick experiments and abbreviated variable names. This is a natural part of the data science workflow.
# A typical exploratory analysis might start like this
df = pd.read_csv('data.csv')
x = df.drop('target', axis=1)
y = df['target']
rf = RandomForestClassifier()
rf.fit(x, y)
print(rf.score(x, y))
The problems arise when this exploratory code makes its way into production systems or when we need to revisit our analysis months later. That’s where clean code principles become invaluable.
The Benefits of Clean Code in Data Science
Clean code isn’t just about aesthetics - it delivers tangible benefits:
- Reproducibility: Well-organized code makes it easier to reproduce results, a fundamental requirement in data science.
- Collaboration: Clean code enables team members to understand and contribute to each other's work effectively.
- Maintenance: When you need to update models or modify preprocessing steps, clean code makes these changes safer and easier.
- Debugging: When issues arise, clean code makes it easier to isolate and fix problems.
The Path Forward
Remember these key points as you develop your clean coding practices:
- Start Simple: You don’t need to implement every clean code principle at once. Begin with basic improvements like better naming and function organization.
- Iterate: Just as you iterate on your models, iterate on your code quality. Each time you revisit a script or notebook, try to make it a little cleaner.
- Review: Use the checklist provided in this guide to review your code periodically. Make it part of your workflow, just like model validation.
- Learn from Others: Study well-maintained open-source data science projects. Pay attention to how they structure their code and handle common challenges.
Final Thoughts
Clean code in data science is about finding the right balance. We need to maintain the flexibility and experimentation that makes data science exciting while building maintainable, professional-grade software. The principles and practices we’ve covered in this guide aren’t rigid rules but rather tools to help you find that balance.
Remember: The goal isn’t perfection, but progress. Each step toward cleaner code is a step toward more robust, reliable, and reproducible data science.
As you apply these principles in your work, you’ll likely find that writing clean code becomes second nature. The extra time invested in writing clear, well-organized code pays dividends in the long run through easier maintenance, better collaboration, and fewer bugs.
Keep the provided checklist handy, but don’t let it paralyze you. Use it as a guide to gradually improve your code quality, one commit at a time. After all, the best code is not just technically correct - it’s code that tells a clear story about your data science journey.