Statistical Models

Introduction
Statistical models are mathematical representations of relationships between variables in a dataset. These models are used to make predictions, infer causal relationships, and understand patterns in data. This module focuses on two fundamental models: linear regression and logistic regression.
Learning Objectives
By the end of this module, you will be able to:
- Understand when to use linear vs logistic regression
- Build linear and logistic regression models using Python
- Check and validate model assumptions
- Interpret model coefficients and outputs
- Evaluate model performance using appropriate metrics
- Diagnose and address common model problems
Prerequisites
Before starting this module, you should be familiar with:
- Descriptive Statistics
- Probability Foundations
- Inferential Statistics
- Basic Python programming with NumPy and Pandas
## Uncomment and run this cell to install the packages
# !pip install pandas numpy statsmodels matplotlib seaborn scikit-learn
import statsmodels.api as sm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix, classification_report
1. Linear Regression
Linear regression models the relationship between one or more independent variables (predictors) and a continuous dependent variable (outcome).
Model Equation
For simple linear regression (one predictor):

\[ y = \beta_0 + \beta_1 x + \epsilon \]

For multiple linear regression:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \]

where:
- \(y\) is the dependent variable
- \(x_i\) are the independent variables
- \(\beta_0\) is the intercept
- \(\beta_i\) are the coefficients (slopes)
- \(\epsilon\) is the error term
Assumptions of Linear Regression
- Linearity: The relationship between X and Y is linear
- Independence: Observations are independent
- Homoscedasticity: Constant variance of errors
- Normality: Errors are normally distributed
- No multicollinearity: Predictors are not highly correlated (for multiple regression)
Building a Linear Regression Model
# Generate example data
np.random.seed(42)
n = 100
# Create features
X1 = np.random.randn(n)
X2 = np.random.randn(n)
# True relationship: y = 2 + 3*X1 + 1.5*X2 + noise
y = 2 + 3*X1 + 1.5*X2 + np.random.randn(n) * 0.5
# Prepare data for statsmodels (needs constant term)
X = pd.DataFrame({'X1': X1, 'X2': X2})
X = sm.add_constant(X) # Adds intercept term
# Fit the model
model = sm.OLS(y, X).fit()
# Print summary
print(model.summary())
Interpreting Linear Regression Output
# Key metrics from summary:
print(f"\nR-squared: {model.rsquared:.4f}")
print(f"Adjusted R-squared: {model.rsquared_adj:.4f}")
print(f"F-statistic: {model.fvalue:.4f}")
print(f"F-statistic p-value: {model.f_pvalue:.4f}")
# Coefficients
print("\nCoefficients:")
print(model.params)
# Confidence intervals
print("\n95% Confidence Intervals:")
print(model.conf_int())
Key Interpretations:
- R-squared: proportion of variance in y explained by the model (0 to 1, higher is better)
- Adjusted R-squared: R-squared adjusted for the number of predictors (penalizes overfitting)
- F-statistic: tests whether the model is significantly better than an intercept-only model
- Coefficients: change in y for a one-unit change in a predictor, holding the others constant
- p-values: significance of each coefficient

For example, the X1 coefficient here should be close to the true value of 3 used to simulate the data: a one-unit increase in X1 raises the expected value of y by about 3, holding X2 constant.
Model Diagnostics
# 1. Residuals vs Fitted Values (check for homoscedasticity)
fitted = model.fittedvalues
residuals = model.resid
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.scatter(fitted, residuals, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted Values')
plt.grid(True, alpha=0.3)
# 2. Q-Q Plot (check for normality of residuals)
from scipy import stats
plt.subplot(1, 3, 2)
stats.probplot(residuals, dist="norm", plot=plt)
plt.title('Q-Q Plot (Normality Check)')
plt.grid(True, alpha=0.3)
# 3. Scale-Location Plot (check for homoscedasticity)
plt.subplot(1, 3, 3)
standardized_residuals = residuals / np.std(residuals)  # standardize before plotting
sqrt_abs_residuals = np.sqrt(np.abs(standardized_residuals))
plt.scatter(fitted, sqrt_abs_residuals, alpha=0.6)
plt.xlabel('Fitted Values')
plt.ylabel('√|Standardized Residuals|')
plt.title('Scale-Location Plot')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
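The visual checks above can be complemented with formal tests. A minimal sketch using statsmodels' diagnostic functions on the fitted model (the choice of tests is illustrative, not exhaustive):
# Formal assumption tests
from statsmodels.stats.stattools import durbin_watson, jarque_bera
from statsmodels.stats.diagnostic import het_breuschpagan
# Durbin-Watson: values near 2 suggest uncorrelated errors (independence)
print(f"Durbin-Watson: {durbin_watson(residuals):.4f}")
# Jarque-Bera: tests normality of residuals (large p-value = no evidence against normality)
jb_stat, jb_pvalue, jb_skew, jb_kurtosis = jarque_bera(residuals)
print(f"Jarque-Bera p-value: {jb_pvalue:.4f}")
# Breusch-Pagan: tests for heteroscedasticity (large p-value = no evidence against constant variance)
bp_lm, bp_lm_pvalue, bp_f, bp_f_pvalue = het_breuschpagan(residuals, model.model.exog)
print(f"Breusch-Pagan p-value: {bp_lm_pvalue:.4f}")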
Evaluating Model Performance
# Split data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X.drop('const', axis=1), y, test_size=0.2, random_state=42
)
# Fit on training data
X_train_sm = sm.add_constant(X_train)
model_train = sm.OLS(y_train, X_train_sm).fit()
# Predict on test data
X_test_sm = sm.add_constant(X_test)
y_pred = model_train.predict(X_test_sm)
# Evaluation metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse:.4f}")
print(f"Root Mean Squared Error: {rmse:.4f}")
print(f"R-squared: {r2:.4f}")
# Visualize predictions
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs Predicted Values')
plt.grid(True, alpha=0.3)
plt.show()
Common Issues and Solutions
1. Multicollinearity
# Check for multicollinearity using correlation matrix
correlation_matrix = X.drop('const', axis=1).corr()
print("Correlation Matrix:")
print(correlation_matrix)
# High correlation (>0.8) indicates multicollinearity
# Solution: Remove one of the highly correlated variables or use regularization
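A complementary check is the variance inflation factor (VIF); a minimal sketch using statsmodels (a common rule of thumb flags VIF values above roughly 5-10):
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Compute VIF for each predictor (constant column kept in the design matrix)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=X.columns[1:]
)
print("\nVariance Inflation Factors:")
print(vif)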
2. Non-linear Relationships
# If relationship is non-linear, consider:
# - Polynomial features
# - Transformations (log, sqrt)
# - Non-linear models
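As an illustration of the first option, a minimal sketch that adds a squared term and refits (purely hypothetical here, since the simulated data is linear by construction):
X_poly = X.copy()
X_poly['X1_sq'] = X_poly['X1'] ** 2  # polynomial feature
model_poly = sm.OLS(y, X_poly).fit()
print(f"Linear AIC: {model.aic:.2f}, with X1^2 term: {model_poly.aic:.2f}")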
3. Outliers
# Detect outliers using residuals
standardized_residuals = residuals / np.std(residuals)
outliers = np.abs(standardized_residuals) > 3
print(f"Number of outliers: {outliers.sum()}")
# Solution: Investigate outliers, consider robust regression
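A minimal sketch of the robust-regression option, using statsmodels' RLM with a Huber loss (the norm choice is illustrative):
# Robust regression downweights observations with large residuals
rlm_model = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()
print("Robust (Huber) coefficients:")
print(rlm_model.params)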
2. Logistic Regression
Logistic regression is used when the dependent variable is binary (0/1, Yes/No, Success/Failure). It models the probability of the outcome.
Model Equation
The logistic function (sigmoid) transforms the linear combination of predictors into a probability:

\[ p = P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}} \]

Or in log-odds form:

\[ \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p \]
Assumptions of Logistic Regression
- Binary outcome: Dependent variable is binary
- Independence: Observations are independent
- Linearity: Log-odds are linear in predictors
- No multicollinearity: Predictors are not highly correlated
- Large sample size: a common rule of thumb is at least 10 events (observations in the rarer outcome class) per predictor
Building a Logistic Regression Model
# Generate example data for logistic regression
np.random.seed(42)
n = 200
# Create features
X1 = np.random.randn(n)
X2 = np.random.randn(n)
# True relationship: log-odds = -1 + 2*X1 + 1.5*X2
log_odds = -1 + 2*X1 + 1.5*X2
prob = 1 / (1 + np.exp(-log_odds)) # Sigmoid function
# Generate binary outcome
y = (np.random.rand(n) < prob).astype(int)
# Prepare data
X = pd.DataFrame({'X1': X1, 'X2': X2})
X = sm.add_constant(X)
# Fit logistic regression model
logit_model = sm.Logit(y, X).fit()
# Print summary
print(logit_model.summary())
Interpreting Logistic Regression Output
# Coefficients are in log-odds scale
print("Coefficients (log-odds):")
print(logit_model.params)
# Convert to odds ratios
odds_ratios = np.exp(logit_model.params)
print("\nOdds Ratios:")
print(odds_ratios)
# Interpretation example
print(f"\nFor X1:")
print(f" Coefficient: {logit_model.params['X1']:.4f}")
print(f" Odds Ratio: {odds_ratios['X1']:.4f}")
print(f" Interpretation: One unit increase in X1 multiplies odds by {odds_ratios['X1']:.4f}")
Key Interpretations:
- Coefficients: change in log-odds for a one-unit change in a predictor
- Odds Ratio: multiplicative change in odds (e^coefficient)
- Pseudo R-squared: measures model fit (e.g., McFadden's, Nagelkerke's)
- Log-likelihood: measure of model fit (higher is better)
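Confidence intervals on the odds-ratio scale can be obtained by exponentiating the log-odds intervals; a minimal sketch:
# 95% CI for odds ratios (exponentiate the log-odds confidence intervals)
or_ci = np.exp(logit_model.conf_int())
or_ci.columns = ['2.5%', '97.5%']
print("95% CI for odds ratios:")
print(or_ci)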
Model Evaluation for Classification
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X.drop('const', axis=1), y, test_size=0.2, random_state=42, stratify=y
)
# Fit on training data
X_train_sm = sm.add_constant(X_train)
logit_train = sm.Logit(y_train, X_train_sm).fit()
# Predict probabilities
X_test_sm = sm.add_constant(X_test)
y_pred_proba = logit_train.predict(X_test_sm)
# Convert to binary predictions (using 0.5 threshold)
y_pred = (y_pred_proba >= 0.5).astype(int)
# Evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)
# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# ROC Curve
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc="lower right")
plt.grid(True, alpha=0.3)
plt.show()
Model Diagnostics for Logistic Regression
# 1. Check for separation (perfect prediction)
# If model has very large coefficients, may indicate separation
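# A rough heuristic (the threshold of 10 is illustrative, not a standard):
# very large |coefficients|, especially with large standard errors, can signal separation
suspicious = logit_model.params.abs() > 10
print("Possibly separated predictors:", list(logit_model.params.index[suspicious]))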
# 2. Residuals analysis (deviance residuals)
deviance_residuals = logit_model.resid_dev
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.scatter(logit_model.fittedvalues, deviance_residuals, alpha=0.6)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Fitted Values')
plt.ylabel('Deviance Residuals')
plt.title('Deviance Residuals vs Fitted Values')
plt.grid(True, alpha=0.3)
# 3. Influence measures
influence = logit_model.get_influence()
cooks_d = influence.cooks_distance[0]
plt.subplot(1, 2, 2)
plt.scatter(range(len(cooks_d)), cooks_d, alpha=0.6)
plt.axhline(y=4/len(cooks_d), color='r', linestyle='--', label='Threshold')
plt.xlabel('Observation')
plt.ylabel("Cook's Distance")
plt.title("Cook's Distance (Influence)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
3. Model Comparison and Selection
Information Criteria
# AIC (Akaike Information Criterion) - lower is better
# BIC (Bayesian Information Criterion) - lower is better
print("Linear Regression:")
print(f" AIC: {model.aic:.2f}")
print(f" BIC: {model.bic:.2f}")
print("\nLogistic Regression:")
print(f" AIC: {logit_model.aic:.2f}")
print(f" BIC: {logit_model.bic:.2f}")
# Note: AIC/BIC comparisons are only valid between models fitted to the same
# outcome data; the linear and logistic values above are shown side by side
# for illustration, not as a head-to-head comparison.
Cross-Validation
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
# Using scikit-learn for easier cross-validation
lr_sklearn = LogisticRegression()
X_features = X.drop('const', axis=1)
# 5-fold cross-validation
cv_scores = cross_val_score(lr_sklearn, X_features, y, cv=5, scoring='accuracy')
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
Key Takeaways
- Linear regression is for continuous outcomes; logistic regression is for binary outcomes.
- Always check assumptions before interpreting results; violations can lead to incorrect conclusions.
- Model diagnostics help identify problems like non-linearity, heteroscedasticity, and outliers.
- Interpret coefficients carefully:
  - Linear regression: change in outcome per unit change in predictor
  - Logistic regression: change in log-odds (or odds ratio) per unit change in predictor
- Use appropriate evaluation metrics:
  - Regression: MSE, RMSE, R²
  - Classification: Accuracy, Precision, Recall, F1-score, ROC-AUC
- Cross-validation helps assess model generalizability to new data.
Applications in Data Science
- Predictive modeling: Forecast continuous or categorical outcomes
- Feature importance: Identify which predictors matter most
- Causal inference: Understand relationships (with caution)
- Risk assessment: Predict probabilities of events
- A/B testing: Compare treatment effects
Next Steps
Now that you understand statistical models, you can explore:
- Advanced Linear Algebra for dimensionality reduction techniques
- Machine Learning models for more complex relationships
- Time series models for temporal data
Conclusion
Statistical models provide powerful tools for understanding relationships in data. This module covered:
- Building and interpreting linear regression models
- Building and interpreting logistic regression models
- Model diagnostics and assumption checking
- Model evaluation and comparison
- Common issues and solutions
Remember: A good model is not just about high R² or accuracy—it should be interpretable, generalizable, and based on sound statistical principles.