Building on the supervised learning concepts from Post 5, we now look at how to move from raw data to actionable predictions. Supervised learning is the most widely used form of machine learning, and the basis for many real-world systems.
Introduction
Supervised learning excels when we have labeled data and clear objectives. It underpins systems like fraud detection, recommendation engines, and medical diagnosis tools.
The recommendation systems we discussed in Post 3 use these techniques to personalize suggestions. As we defined in Post 1, training data is the historical dataset used to teach a model how to predict outcomes.
1. Classification vs. Regression
Classification:
Output is a discrete label (e.g., Yes/No, Cat/Dog)
Example: Spam detection, image classification
Regression:
Output is a continuous value
Example: Predicting house prices or blood pressure levels
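The distinction can be sketched in a few lines of scikit-learn. The numbers below are made up purely for illustration: suspicious-link counts as a spam signal, and square metres as a price signal.

```python
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: the output is a discrete label (spam = 1, not spam = 0)
X_cls = [[1], [2], [8], [9]]   # e.g. number of suspicious links in an email
y_cls = [0, 0, 1, 1]
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[7]]))      # predicts a class label, not a number

# Regression: the output is a continuous value (price in thousands)
X_reg = [[50], [80], [120]]    # e.g. square metres
y_reg = [150.0, 240.0, 360.0]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[100]]))    # predicts a continuous quantity
```

Note that the models differ, but the workflow is identical: fit on labeled examples, then predict on new inputs.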
2. Feature Selection and Engineering
Features are the inputs that help the model make predictions.
Selection: Choosing the most relevant variables
Engineering: Creating new variables or transforming existing ones
Example: In house price prediction:
Features might include square footage, number of rooms, and location
Engineered feature: price per square meter (price divided by area)
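Feature engineering is often a one-liner in pandas. Here is a minimal sketch of the house price example; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical listings (prices in euros, area in square metres)
houses = pd.DataFrame({
    'price': [300000, 450000, 200000],
    'sqm':   [100, 150, 80],
    'rooms': [3, 4, 2],
})

# Engineered feature: price per square metre
houses['price_per_sqm'] = houses['price'] / houses['sqm']
print(houses)
```

The new column carries information the model would otherwise have to learn from the raw `price` and `sqm` columns on its own.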
3. Model Training Process (Step-by-Step)
Collect and clean data
Split into training and testing sets
Select a model (e.g., decision tree)
Train on the training set
Evaluate on the held-out testing set
Tune hyperparameters if needed (ideally with a separate validation set or cross-validation, so the test set stays untouched)
Deploy the final model
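The steps above (minus deployment) can be sketched end to end with scikit-learn's built-in Iris dataset and the decision tree mentioned in step 3. The `max_depth` and `random_state` values here are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Steps 1-2: load an already-clean dataset and split it
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Steps 3-4: select a model and train it on the training set
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Step 5: evaluate on data the model has never seen
print(f"Test accuracy: {model.score(X_test, y_test):.2f}")
```

In practice steps 6 and 7 (tuning and deployment) would follow, but the train/split/fit/evaluate loop is the core of every supervised project.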
4. Cross-Validation and Performance Metrics
Cross-validation checks that the model performs consistently across different slices of the data, rather than just on one lucky split
Metrics:
Accuracy
Precision & Recall
F1-score
Confusion matrix
Example: A cancer screening model must balance recall (catching true cases) with precision (avoiding false alarms).
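A minimal sketch of both ideas, using scikit-learn's built-in breast cancer dataset as a stand-in for a screening problem (the scaler and iteration count are incidental choices to make logistic regression converge cleanly):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation: accuracy on five different train/validation slices
scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy:", scores.mean())

# Fit once on all training data, then compute the metrics on the test set
model.fit(X_train, y_train)
pred = model.predict(X_test)
print("Precision:", precision_score(y_test, pred))
print("Recall:   ", recall_score(y_test, pred))
print("F1:       ", f1_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```

The confusion matrix shows exactly where precision and recall come from: false positives hurt precision, false negatives hurt recall.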
5. Practical Example: Predicting Student Exam Scores
Let’s build a simple regression model:
```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Sample dataset: hours studied vs. exam score
data = pd.DataFrame({
    'study_hours': [1, 2, 3, 4, 5, 6],
    'score': [50, 55, 65, 70, 75, 85]
})

X = data[['study_hours']]   # features must be 2-D
y = data['score']           # target is 1-D

# Hold out 30% for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(predictions)
```
Key Takeaways
Supervised learning is widely used because its objectives are explicit and its performance is directly measurable
Good features often matter more than complex models
Evaluating and validating models helps ensure real-world performance
Discussion Questions
Why is feature selection so important in model performance?
How can regression be used in climate change modeling?
What are the trade-offs between precision and recall in health applications?
How do you know when a model is “good enough”?
When should you retrain a deployed model?
Hands-On Exercises
Modify the Study Hours Model: Add a second feature (e.g., sleep hours) and retrain.
Evaluate Precision/Recall: Use a classification dataset (like breast cancer) and compute precision/recall.
Build a Decision Tree: Use Scikit-learn to train a decision tree classifier and visualize it.
Coming Next: Unsupervised Learning – Discovering Patterns Without Labels (by Google Gemini)