Machine Learning Documentation
A practical guide to understanding and implementing machine learning systems.
ML has become the backbone of modern software — from recommendation engines and fraud detection to medical imaging and autonomous vehicles. This documentation covers the fundamentals, walks through practical code, and provides guidance for building production-ready ML systems.
Quick Start
Get a model trained and predicting in under 10 lines of Python.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
print(f"Accuracy: {model.score(X_test, y_test):.2%}")
```
Run `pip install scikit-learn numpy pandas` before running the examples.
Types of Machine Learning
ML approaches are categorized by how the model learns from data.
| Type | Label? | Use Case | Example |
|---|---|---|---|
| Supervised | Yes | Classification, Regression | Spam detection, price prediction |
| Unsupervised | No | Clustering, Dimensionality Reduction | Customer segmentation, anomaly detection |
| Semi-supervised | Partial | Text classification with few labels | Document categorization |
| Reinforcement | Reward signal | Decision making, Control | Game AI, robotics, trading |
Supervised Learning — Deep Dive
Supervised learning uses labeled training data — each input has a corresponding known output. The model learns a mapping function from inputs to outputs.
Classification predicts discrete categories (e.g., spam vs. not spam). Regression predicts continuous values (e.g., house prices).
Common algorithms: Linear Regression, Logistic Regression, Decision Trees, SVMs, Neural Networks.
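The regression side of supervised learning can be sketched in a few lines — here using scikit-learn's built-in diabetes dataset as a stand-in for any continuous-target problem:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Regression: each labeled example pairs an input vector with a continuous target
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

reg = LinearRegression()
reg.fit(X_train, y_train)  # learn the input -> output mapping
print(f"R^2 on held-out data: {reg.score(X_test, y_test):.2f}")
```

For classification, the Quick Start example above follows the same fit/score pattern with discrete labels.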
Unsupervised Learning — Deep Dive
The model finds hidden patterns in data without labels. This is useful for exploratory data analysis and feature engineering.
Clustering groups similar data points together. Dimensionality reduction compresses features while preserving structure (e.g., PCA, t-SNE).
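Both ideas fit in a short sketch — clustering and PCA applied to synthetic, unlabeled data (the two-blob dataset is made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Two synthetic blobs in 5 dimensions -- no labels provided
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(5, 1, (100, 5))])

# Clustering: group similar points together
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: compress 5 features to 2 while preserving variance
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (200, 2)
```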
Reinforcement Learning — Deep Dive
An agent interacts with an environment, receiving rewards or penalties for actions. It learns a policy that maximizes cumulative reward over time.
Key concepts: State, Action, Reward, Policy, Q-value. Notable successes include AlphaGo and robotic control systems.
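These concepts come together in tabular Q-learning. The sketch below uses a made-up 5-state chain environment (an assumption for illustration): the agent starts at state 0 and earns a reward of 1 for reaching state 4.

```python
import numpy as np

# Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions)) # Q-value table
alpha, gamma, epsilon = 0.5, 0.9, 0.1
rng = np.random.default_rng(0)

for episode in range(200):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy policy: mostly exploit, occasionally explore
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0  # reward only at the goal
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))  # learned policy favors moving right toward the reward
```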
The ML Pipeline
1. Collection — gather raw data
2. Feature Engineering — extract signals
3. Training — fit to data
Common Algorithms
Linear Regression
Fits a line (or hyperplane) to minimize the sum of squared residuals between predicted and actual values.
y = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
Strengths: interpretable, fast, good baseline. Weaknesses: assumes linear relationships, sensitive to outliers.
Logistic Regression
Despite the name, it's a classification algorithm. Applies the sigmoid function to linear output to produce a probability between 0 and 1.
P(y=1|x) = σ(w·x + b) = 1 / (1 + e^−(w·x+b))
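A quick numeric check of the formula, with hypothetical weights and a single input vector (values chosen only for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights, bias, and one input vector
w, b = np.array([0.8, -0.4]), 0.1
x = np.array([2.0, 1.0])

p = sigmoid(w @ x + b)  # P(y=1 | x), squashed from the linear score w.x + b
print(f"{p:.3f}")       # a probability strictly between 0 and 1
```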
Decision Trees
Recursively splits the feature space based on thresholds that maximize information gain (or minimize impurity).
Strengths: interpretable, handles non-linear relationships, no feature scaling needed.
Random Forest
An ensemble of decision trees trained on random subsets of data and features. Reduces variance through averaging.
Gradient Boosting (XGBoost / LightGBM)
Sequentially trains weak learners, each correcting errors of the previous one. Often the top performer in structured/tabular data competitions.
Tune `n_estimators`, `learning_rate`, and `max_depth` with cross-validation.
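One way to do that tuning is a grid search with built-in cross-validation; the grid and dataset below are small illustrative choices, not recommended defaults:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Small illustrative grid; real searches typically cover wider ranges
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 4],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X, y)  # cross-validates every combination in the grid
print(search.best_params_, f"{search.best_score_:.3f}")
```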
Feedforward Neural Networks
Layers of interconnected neurons with non-linear activation functions. Universal function approximators — given enough neurons, they can model any relationship.
Convolutional Neural Networks (CNNs)
Specialized for grid-like data (images). Use convolutional filters to detect local patterns, pooling layers to reduce spatial dimensions.
Transformers
Attention-based architecture that processes sequences in parallel. Foundation of modern NLP (BERT, GPT) and increasingly used in vision and multimodal tasks.
```python
# Simple neural network in PyTorch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),  # 784 inputs (e.g. a flattened 28x28 image) -> 256 hidden units
    nn.ReLU(),            # non-linear activation
    nn.Dropout(0.2),      # regularization: randomly zero 20% of activations
    nn.Linear(256, 10),   # 10 output classes
)
```
Code Examples
Data Preprocessing
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelEncoder

df = pd.read_csv("data.csv")

# Handle missing values
df.fillna(df.median(numeric_only=True), inplace=True)

# Encode categorical features
le = LabelEncoder()
df["category"] = le.fit_transform(df["category"])

# Scale numerical features
numeric_cols = df.select_dtypes(include="number").columns
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
```
Training with Cross-Validation
```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=4,
)

# X, y: feature matrix and labels, as in the Quick Start
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```
Model Evaluation
| Metric | Task | Formula | When to Use |
|---|---|---|---|
| Accuracy | Classification | correct / total | Balanced classes |
| Precision | Classification | TP / (TP + FP) | Minimizing false positives matters |
| Recall | Classification | TP / (TP + FN) | Minimizing false negatives matters |
| F1 Score | Classification | 2 · P·R / (P+R) | Imbalanced classes |
| RMSE | Regression | √(mean(errors²)) | Penalize large errors |
| AUC-ROC | Classification | Area under ROC curve | Probability-based ranking |
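The classification metrics above are all available in `sklearn.metrics`. A small worked example on hypothetical predictions for an imbalanced binary task:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true labels and predictions (values invented for illustration)
y_true = [1, 0, 0, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1]
# TP = 3, FP = 1, FN = 1, TN = 5

print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 8/10 = 0.80
print(f"precision: {precision_score(y_true, y_pred):.2f}")  # 3/4  = 0.75
print(f"recall:    {recall_score(y_true, y_pred):.2f}")     # 3/4  = 0.75
print(f"f1:        {f1_score(y_true, y_pred):.2f}")         # 0.75
```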
Best Practices
1. Start Simple
Always establish a baseline with a simple model (logistic regression, decision tree) before trying complex approaches. You need a benchmark to know if complexity is worth it.
2. Avoid Data Leakage
Never let information from your test set influence training. Fit preprocessing (scaling, encoding) on training data only, then transform test data.
```python
# WRONG - fits on all data
scaler.fit(X)

# RIGHT - fits only on training data
scaler.fit(X_train)
X_test_scaled = scaler.transform(X_test)
```
3. Use Cross-Validation
A single train/test split can be misleading. Use k-fold cross-validation (typically k=5 or k=10) for a more reliable estimate of model performance.
4. Monitor for Overfitting
If training accuracy is much higher than validation accuracy, the model is memorizing noise. Use regularization, dropout, early stopping, or gather more data.
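The train/validation gap is easy to observe directly. A sketch using an unconstrained decision tree (which memorizes the training set) versus a depth-limited one; the dataset and depth are illustrative choices:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained tree fits the training data perfectly -- a large gap signals overfitting
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(f"train: {tree.score(X_train, y_train):.2f}  val: {tree.score(X_val, y_val):.2f}")

# Regularizing (here, limiting depth) trades training accuracy for a smaller gap
pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(f"train: {pruned.score(X_train, y_train):.2f}  val: {pruned.score(X_val, y_val):.2f}")
```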
5. Version Everything
Track your data, code, hyperparameters, and model artifacts. Tools like MLflow, DVC, and Weights & Biases help with experiment tracking.
Glossary
| Term | Definition |
|---|---|
| Epoch | One full pass through the entire training dataset. |
| Batch Size | Number of samples processed before the model updates weights. |
| Learning Rate | Step size for weight updates during optimization. Too high = diverge, too low = slow convergence. |
| Regularization | Techniques (L1, L2, dropout) that penalize complexity to prevent overfitting. |
| Feature | An individual measurable property of the data used as input to a model. |
| Hyperparameter | A parameter set before training (not learned from data), e.g., learning rate, number of layers. |
| Gradient Descent | Optimization algorithm that iteratively adjusts weights in the direction that reduces loss. |
| Overfitting | Model performs well on training data but poorly on unseen data. |
| Underfitting | Model is too simple to capture the underlying patterns in the data. |
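Several glossary entries (epoch, learning rate, gradient descent) can be seen working together in a minimal sketch — fitting a single weight to synthetic data whose true weight is 3.0 (the problem is made up for illustration):

```python
import numpy as np

# Gradient descent on a 1-D least-squares problem: minimize mean((w*x - y)^2)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)  # true weight is 3.0

w, lr = 0.0, 0.1           # initial weight, learning rate (step size)
for epoch in range(100):   # each pass over the data is one epoch
    grad = 2 * np.mean((w * x - y) * x)  # dL/dw
    w -= lr * grad         # step against the gradient to reduce the loss
print(f"{w:.2f}")          # close to the true weight, 3.00
```

With a learning rate that is too large the updates would diverge; too small, and 100 epochs would not be enough to converge.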
Resources
Books
- Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow — Aurélien Géron
- The Elements of Statistical Learning — Hastie, Tibshirani, Friedman
- Deep Learning — Goodfellow, Bengio, Courville
Libraries
| Library | Purpose |
|---|---|
| scikit-learn | Classical ML algorithms, preprocessing, evaluation |
| PyTorch | Deep learning with dynamic computation graphs |
| TensorFlow | Deep learning with production deployment tooling |
| XGBoost | Optimized gradient boosting |
| pandas | Data manipulation and analysis |
| MLflow | Experiment tracking, model registry |