Interview
Interviewer: "Can you explain the purpose of a train/test split in machine learning?"
The train/test split is a fundamental technique used in machine learning to evaluate how well a model generalizes to unseen data. The idea is to divide the available dataset into two parts:
The training set, which is used to train the model, meaning the model learns the patterns, features, and relationships from this data.
The test set, which is used only after the training is complete, to test how well the model performs on new, unseen data.
This helps in identifying whether the model is overfitting, that is, performing well on training data but poorly on new data, or whether it's generalizing well to different scenarios.
A common practice is to split the data into 80% for training and 20% for testing, but this can vary depending on the size and nature of the dataset.
Code Walkthrough (Interview Explanation)
Interviewer: "Can you show a simple example?"
Sure. Here's a quick example using the Iris dataset with a Random Forest classifier:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split dataset: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Predict on test data
y_pred = model.predict(X_test)
# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy)
In this example, we first load the dataset and split it using train_test_split. Then we train a RandomForestClassifier on the training set. Finally, we test the model on the test set and measure accuracy using accuracy_score.
This way, we get a clear idea of how well our model might perform in a real-world scenario.
Great! Let's do a mock interview focused on model evaluation, starting from the train/test split, and moving into related topics like overfitting, cross-validation, and evaluation metrics.
MOCK INTERVIEW SESSION: MODEL EVALUATION
Interviewer
"Can you explain the purpose of train/test split in machine learning?"
Your Answer
Yes, the train/test split is a fundamental technique used to evaluate the performance of a machine learning model.
We divide the dataset into two parts: the training set is used to train the model, and the test set is used to evaluate how well it generalizes to unseen data.
This helps ensure that the model doesn't just memorize the training data (i.e., overfit), but can also make accurate predictions on new, real-world data.
Interviewer
"You mentioned overfitting. Can you explain what overfitting is and how to detect it?"
Your Answer
Overfitting occurs when a model learns not only the underlying patterns in the training data but also the noise and random fluctuations.
As a result, it performs very well on the training set but poorly on the test set.
We can detect overfitting by comparing training and test accuracy:
- If training accuracy is high and test accuracy is low, it's a sign of overfitting (a quick code sketch follows this answer).
To prevent it, we can:
- Use simpler models
- Apply regularization
- Get more training data
- Use techniques like cross-validation for more robust evaluation
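For illustration, here is a minimal sketch of that check. The dataset (scikit-learn's built-in breast cancer data) and the decision-tree models are arbitrary choices of mine, just to make the train/test accuracy gap visible:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative dataset and models (not from the interview itself)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

deep_tree = DecisionTreeClassifier(random_state=42)                  # unconstrained, prone to overfitting
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42)  # simpler model

for name, model in [("deep tree", deep_tree), ("shallow tree", shallow_tree)]:
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}")

A large gap between the two accuracies for the unconstrained tree is the overfitting signal described above.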
Interviewer
"Interesting. What is cross-validation, and how is it better than a simple train/test split?"
Your Answer
Cross-validation is a more reliable method of evaluating a model, especially when the dataset is small. The most common type is k-fold cross-validation, where the data is split into k parts. The model is trained on k-1 parts and tested on the remaining part. This process repeats k times, and the results are averaged. (A short code sketch follows the list below.)
Compared to a simple train/test split, cross-validation:
- Reduces variance in the evaluation
- Provides a more stable and unbiased estimate of model performance
- Helps ensure that every data point is used for both training and testing
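As a quick illustration (my own sketch, reusing the Iris/Random Forest setup from earlier), scikit-learn's cross_val_score runs exactly this train-and-evaluate loop:

from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Same Iris data and model family as the earlier example
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=42)

# 5-fold CV: train on 4 folds, evaluate on the held-out fold, repeat 5 times
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())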
Interviewer
"What metrics would you use to evaluate a classification model?"
Your Answer
It depends on the problem, but commonly used metrics for classification include:
- Accuracy: Ratio of correctly predicted instances. Good when classes are balanced.
- Precision: How many predicted positives are actually positive. Useful when false positives are costly.
- Recall: How many actual positives were correctly predicted. Important when missing positives is costly.
- F1-score: Harmonic mean of precision and recall; a good balance when data is imbalanced.
- Confusion Matrix: Gives a detailed breakdown of true/false positives and negatives.
- For probabilistic models, I might also look at ROC-AUC to evaluate performance at different thresholds (a short sketch computing these metrics follows this list).
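For reference, a small sketch (with made-up labels and probabilities, purely illustrative) of how these metrics are computed with scikit-learn:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Hypothetical true labels, hard predictions, and predicted probabilities for class 1
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.95, 0.3, 0.85, 0.05]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_prob))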
Interviewer
"Let's say you get 95% accuracy on your model. Is that enough to consider your model good?"
Your Answer
Not necessarily. Accuracy alone can be misleading, especially in imbalanced datasets.
For example, if 95% of the data belongs to class A and only 5% to class B, a model that predicts only class A will still get 95% accuracy, but it fails at detecting class B entirely.
In such cases, I would use metrics like precision, recall, and F1-score, and examine the confusion matrix to get a complete picture of the model's performance.
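A tiny sketch of exactly this trap, on synthetic 95/5 data (illustrative numbers only): a baseline that always predicts the majority class reaches 95% accuracy yet never detects class B.

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, illustrative data: 95 samples of class 0 (class A) and 5 of class 1 (class B)
X = np.zeros((100, 1))
y = np.array([0] * 95 + [1] * 5)

# Baseline that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

print("Accuracy:", accuracy_score(y, y_pred))        # 0.95, looks impressive
print("Recall (class B):", recall_score(y, y_pred))  # 0.0, class B is never detected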
That's totally okay; this is exactly what practice is for!
Let me give you a solid interview-style answer you can learn and rephrase in your own words later.
Model Interview Answer:
A confusion matrix is a performance evaluation tool used for classification problems. It is a 2x2 table for binary classification that compares the actual labels with the predicted labels made by the model.
The matrix includes:
- True Positives (TP): The model correctly predicted the positive class.
- True Negatives (TN): The model correctly predicted the negative class.
- False Positives (FP): The model predicted positive when it was actually negative. (Type I error)
- False Negatives (FN): The model predicted negative when it was actually positive. (Type II error)
From the confusion matrix, we can calculate several key metrics:
- Accuracy = (TP + TN) / (Total)
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
It helps us understand where the model is making mistakes, for example, whether it's misclassifying more positives or negatives, which is especially important when the classes are imbalanced. A short sketch of these calculations follows.
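Here is a minimal sketch (made-up binary predictions) that pulls TP, TN, FP, and FN out of scikit-learn's confusion_matrix and applies the formulas above:

from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for a binary problem
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, scikit-learn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * (precision * recall) / (precision + recall)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, Recall={recall:.2f}, F1={f1:.2f}")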
Pro Tip
In interviews, it's okay to say:
"I'm not sure, but I'd like to learn about it." That shows humility and a willingness to grow, qualities interviewers respect.
"Can you explain the difference between precision and recall? When would you prefer one over the other?"
Precision and recall are two important metrics used to evaluate the performance of classification models, especially when dealing with imbalanced datasets.
Precision is the ratio of true positives to all predicted positives. It answers: "Of all the instances the model predicted as positive, how many were actually correct?" Formula: Precision = TP / (TP + FP)
Recall is the ratio of true positives to all actual positives. It answers: "Of all the actual positive cases, how many did the model correctly identify?" Formula: Recall = TP / (TP + FN)
When to prefer one over the other: Use precision when false positives are costly. For example, in spam detection, you don't want to wrongly mark a real email as spam.
Use recall when false negatives are more serious. For example, in disease detection, you want to catch as many actual positive cases as possible β missing one could be dangerous.
In many cases, we balance both using the F1-score, which is the harmonic mean of precision and recall.
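To make the trade-off concrete, here is a sketch using scikit-learn's breast cancer dataset and logistic regression (both arbitrary, illustrative choices): raising the decision threshold generally increases precision at the cost of recall.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Illustrative binary dataset and probabilistic model
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]  # predicted probability of the positive class

# Turn probabilities into labels at different thresholds and compare the metrics
for threshold in (0.3, 0.5, 0.7):
    preds = (probs >= threshold).astype(int)
    p = precision_score(y_test, preds)
    r = recall_score(y_test, preds)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")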
What is the F1-score? The F1-score is a metric used to evaluate the performance of a classification model, especially when dealing with imbalanced datasets.
It is the harmonic mean of precision and recall, and provides a single score that balances both concerns:
\[ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
If a model has high precision but low recall, or vice versa, the F1-score will be lower, so it rewards models that have a good balance of both.
It's especially useful when false positives and false negatives are both important to consider.
Example Use Case: In medical diagnosis or fraud detection, where missing a positive case or raising too many false alarms both have serious consequences, the F1-score gives a better picture of model performance than accuracy alone.
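A quick worked number with hypothetical scores shows how an imbalance between precision and recall pulls the F1-score down:

# Hypothetical scores: strong precision, weak recall
precision, recall = 0.90, 0.30
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 2))  # 0.45, pulled toward the weaker of the two values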
Excellent. This is a classic data analysis interview question, often asked to test your analytical thinking, problem-solving approach, and business understanding.
Interview Question:
"Revenue dropped last month. What would you check?"
Model Interview Answer (Detailed & Structured):
If revenue dropped last month, I would follow a structured approach to identify the root cause. Here's how I would break it down:
1. Verify the Drop Is Real (Data Sanity Check)
First, I would ensure the data is correct and up-to-date:
- Are we comparing the same time frames (e.g., full month vs. partial month)?
- Were there any issues in data pipelines, delays, or missing entries?
- Are currency conversions or reporting time zones consistent?
2. Break Revenue into Components
Revenue is typically calculated as:
Revenue = Number of Orders × Average Order Value (AOV), minus refunds and cancellations
So, I would check:
- Transaction Count: Did fewer customers buy something?
- AOV: Are people spending less per order?
- Refunds or Cancellations: Did returns increase?
3. Segment the Drop
Break down revenue by key dimensions to localize the issue:
- By Product or Category: Did a specific product or service decline?
- By Customer Segment: Are certain customer groups (new vs. returning) behaving differently?
- By Geography: Did the drop happen in a specific region?
- By Channel: Did revenue drop in paid ads, organic search, referrals, etc.?
4. Compare Trends Over Time
How does this month compare with:
- Last month?
- The same month last year (to account for seasonality)?
- Are there any long-term trends or one-off events?
5. Look for External Factors
- Market changes, economic conditions, competitor actions.
- Did any promotions end?
- Were there product stockouts or supply chain issues?
- Did any website or app outages occur that may have affected sales?
6. Run Diagnostics with KPIs
I would check supporting KPIs such as:
- Website traffic & conversion rate
- Cart abandonment rate
- Marketing campaign performance (CTR, ROI, etc.)
- Customer acquisition and retention rates
7. Talk to Stakeholders
- Collaborate with Sales, Marketing, and Product teams.
- Any known pricing changes, feature rollouts, or customer feedback trends?
Conclusion
By systematically checking these areas, I can isolate where the revenue drop came from and then provide data-backed recommendations to address the issue, whether that's boosting traffic, improving conversion rates, adjusting pricing, or fixing operational problems.
Awesome! Here's a hands-on example of how to investigate a monthly revenue drop using Python (Pandas). This is something you might do in a Jupyter Notebook or an interview take-home test.
Revenue Drop Investigation: Python Notebook Walkthrough
Let's assume you have an e-commerce transaction dataset like this:
| order_id | customer_id | order_date | amount | product_category | region | channel |
|---|---|---|---|---|---|---|
| 101 | 1 | 2024-03-01 | 200 | Electronics | North | SEO |
| 102 | 2 | 2024-04-01 | 150 | Fashion | South | Paid Ads |
| ... | ... | ... | ... | ... | ... | ... |
Step-by-Step Revenue Drop Analysis in Python
import pandas as pd
# Load data
df = pd.read_csv("ecommerce_transactions.csv", parse_dates=["order_date"])
# Add 'month' column for grouping
df["month"] = df["order_date"].dt.to_period("M")
# Step 1: Monthly revenue trend
monthly_revenue = df.groupby("month")["amount"].sum()
print(monthly_revenue)
monthly_revenue.plot(kind="bar", title="Monthly Revenue")
Step 2: Drill down and check key components
# Number of orders per month
monthly_orders = df.groupby("month")["order_id"].nunique()
# Average Order Value (AOV)
monthly_aov = monthly_revenue / monthly_orders
print("Orders:\n", monthly_orders)
print("AOV:\n", monthly_aov)
Step 3: Segment by category, region, or channel
# Compare last 2 months by category
df_last_2_months = df[df["month"].isin(df["month"].unique()[-2:])]
category_revenue = df_last_2_months.groupby(["month", "product_category"])["amount"].sum().unstack()
category_revenue.plot(kind="bar", title="Revenue by Category")
# Segment by region
region_revenue = df_last_2_months.groupby(["month", "region"])["amount"].sum().unstack()
region_revenue.plot(kind="bar", title="Revenue by Region")
Step 4: Channel performance check
channel_revenue = df_last_2_months.groupby(["month", "channel"])["amount"].sum().unstack()
channel_revenue.plot(kind="bar", title="Revenue by Channel")
Final Step: Summary Table
summary = pd.DataFrame({
"Revenue": monthly_revenue,
"Orders": monthly_orders,
"AOV": monthly_aov
})
print(summary.tail(2)) # Last 2 months
Interpretation:
From the plots and tables, you can:
- Spot where the revenue dropped (channel, category, region, etc.)
- See if it's due to fewer orders, lower AOV, or both
- Understand which segment contributed most to the change (a small sketch for this follows)
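As a rough sketch of that last point (continuing with the same hypothetical df built in the steps above), you could compute the month-over-month revenue change per category and see which one fell the most:

# Month-over-month revenue change per product category
# (assumes rows are in chronological order, as in the earlier steps)
last_two_months = df["month"].unique()[-2:]
by_category = (
    df[df["month"].isin(last_two_months)]
    .groupby(["month", "product_category"])["amount"]
    .sum()
    .unstack(fill_value=0)
)
# Most negative values are the categories that contributed most to the drop
category_change = (by_category.loc[last_two_months[1]] - by_category.loc[last_two_months[0]]).sort_values()
print(category_change)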
Great! Here's the SQL version of how you'd investigate a drop in monthly revenue, using queries written in PostgreSQL syntax that can be adapted easily to MySQL, BigQuery, and other relational databases.
Assumptions:
You're working with a table named orders that contains the following fields:
orders(
order_id INT,
customer_id INT,
order_date DATE,
amount DECIMAL,
product_category VARCHAR,
region VARCHAR,
channel VARCHAR
)
Step-by-Step SQL Analysis of Revenue Drop
1. Monthly Revenue Trend
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS total_revenue
FROM orders
GROUP BY month
ORDER BY month;
2. Orders and Average Order Value (AOV)
SELECT
DATE_TRUNC('month', order_date) AS month,
COUNT(DISTINCT order_id) AS total_orders,
SUM(amount) AS total_revenue,
ROUND(SUM(amount) * 1.0 / COUNT(DISTINCT order_id), 2) AS avg_order_value
FROM orders
GROUP BY month
ORDER BY month;
3. Breakdown by Product Category
SELECT
DATE_TRUNC('month', order_date) AS month,
product_category,
SUM(amount) AS category_revenue
FROM orders
WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '2 months'
GROUP BY month, product_category
ORDER BY month, category_revenue DESC;
4. Breakdown by Region
SELECT
DATE_TRUNC('month', order_date) AS month,
region,
SUM(amount) AS regional_revenue
FROM orders
WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '2 months'
GROUP BY month, region
ORDER BY month, regional_revenue DESC;
5. Channel-wise Revenue
SELECT
DATE_TRUNC('month', order_date) AS month,
channel,
SUM(amount) AS channel_revenue
FROM orders
WHERE order_date >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '2 months'
GROUP BY month, channel
ORDER BY month, channel_revenue DESC;
6. Compare Two Specific Months
SELECT
TO_CHAR(order_date, 'YYYY-MM') AS month,
SUM(amount) AS total_revenue,
COUNT(DISTINCT order_id) AS total_orders,
ROUND(SUM(amount) * 1.0 / COUNT(DISTINCT order_id), 2) AS avg_order_value
FROM orders
WHERE order_date BETWEEN '2024-03-01' AND '2024-04-30'
GROUP BY month
ORDER BY month;
Optional Insights
You can also analyze:
- High-return products (WHERE amount < 0)
- Revenue from new vs. returning customers
- Stock availability logs (if available)
Got it! Here's a sample dataset schema and some example rows matching the SQL queries above, so you can load it into your SQL environment and try the queries yourself.
Sample Data for the orders Table
| order_id | customer_id | order_date | amount | product_category | region | channel |
|---|---|---|---|---|---|---|
| 101 | 1 | 2024-03-10 | 200 | Electronics | North | SEO |
| 102 | 2 | 2024-03-15 | 150 | Fashion | South | Paid Ads |
| 103 | 3 | 2024-04-01 | 300 | Electronics | East | Organic |
| 104 | 1 | 2024-04-05 | 100 | Home & Garden | West | Referral |
| 105 | 4 | 2024-04-20 | 250 | Fashion | North | Paid Ads |
| 106 | 5 | 2024-05-02 | 350 | Electronics | South | SEO |
| 107 | 6 | 2024-05-15 | 400 | Sports | East | Organic |
| 108 | 7 | 2024-05-20 | 150 | Fashion | West | Paid Ads |
How to Load This Sample Data into SQL (Example for PostgreSQL)
CREATE TABLE orders (
order_id INT PRIMARY KEY,
customer_id INT,
order_date DATE,
amount NUMERIC(10,2),
product_category VARCHAR(50),
region VARCHAR(50),
channel VARCHAR(50)
);
INSERT INTO orders VALUES
(101, 1, '2024-03-10', 200, 'Electronics', 'North', 'SEO'),
(102, 2, '2024-03-15', 150, 'Fashion', 'South', 'Paid Ads'),
(103, 3, '2024-04-01', 300, 'Electronics', 'East', 'Organic'),
(104, 1, '2024-04-05', 100, 'Home & Garden', 'West', 'Referral'),
(105, 4, '2024-04-20', 250, 'Fashion', 'North', 'Paid Ads'),
(106, 5, '2024-05-02', 350, 'Electronics', 'South', 'SEO'),
(107, 6, '2024-05-15', 400, 'Sports', 'East', 'Organic'),
(108, 7, '2024-05-20', 150, 'Fashion', 'West', 'Paid Ads');
Next Steps
- Run the SQL queries I shared earlier on this sample data.
- You'll be able to see revenue trends by month, category, region, and channel.
- This will help you understand how to identify drops and segment revenue changes.
Got it! Here's a clear step-by-step guide to load your sample data into Power BI and create key visuals to analyze the revenue drop, just like in your SQL/Python analysis.
How to Analyze Revenue Drop in Power BI
Step 1: Prepare Your Sample Data
Create a CSV file named orders.csv with these columns and sample rows:
order_id,customer_id,order_date,amount,product_category,region,channel
101,1,2024-03-10,200,Electronics,North,SEO
102,2,2024-03-15,150,Fashion,South,Paid Ads
103,3,2024-04-01,300,Electronics,East,Organic
104,1,2024-04-05,100,Home & Garden,West,Referral
105,4,2024-04-20,250,Fashion,North,Paid Ads
106,5,2024-05-02,350,Electronics,South,SEO
107,6,2024-05-15,400,Sports,East,Organic
108,7,2024-05-20,150,Fashion,West,Paid Ads
Step 2: Load Data into Power BI
- Open Power BI Desktop.
- Click Get Data > Text/CSV and select your orders.csv.
- Click Load to import the data.
Step 3: Create Date and Month Columns
- Go to the Data view.
- Select the order_date column.
- In the ribbon, use Column Tools > Date > Month to extract the month, or create a new calculated column with a DAX formula.
Step 4: Build Key Visuals
1. Monthly Revenue Trend
- Create a Line Chart or Column Chart.
- Axis: Month
- Values: Sum of amount
- This shows how total revenue changes month to month.
2. Number of Orders and Average Order Value
- Create a Card visual for Total Orders.
- Create another Card visual for Average Order Value (AOV).
- Or create a table visual with these measures grouped by Month.
3. Revenue by Product Category
- Use a Stacked Column Chart.
- Axis: Month
- Legend: product_category
- Values: Sum of amount
4. Revenue by Region
- Use another Stacked Column Chart.
- Axis: Month
- Legend: region
- Values: Sum of amount
5. Revenue by Channel
- Repeat similar steps with a Stacked Column Chart for channel.
Step 5: Add Filters and Slicers
- Add Slicers for filtering by Month, Region, Product Category, or Channel to drill down interactively.
Step 6: Interpret Your Report
- Compare revenue across months and segments.
- Spot the month with revenue drop.
- Identify whether fewer orders or lower AOV caused it.
- Look for which categories, regions, or channels contributed most to the drop.