Mini Project Challenge
1. Clean the Customer Data
Scenario: You work as a Data Analyst Intern at a retail company. You've received messy customer data with missing values. Your job is to clean the data so the marketing team can use it for campaign targeting.
Sample Data:
import pandas as pd
data = {
'CustomerID': [1001, 1002, 1003, 1004, 1005],
'Name': ['Alice', None, 'Charlie', 'David', None],
'Age': [25, None, 29, None, 35],
'Email': ['alice@gmail.com', 'bob@gmail.com', None, 'david@gmail.com', 'eve@gmail.com'],
'City': ['New York', 'London', 'Paris', None, 'Berlin']
}
df = pd.DataFrame(data)
print("Original Data:\n", df)
Your Tasks:
- Identify how many missing values are in each column.
- Drop rows where both Name and Email are missing.
- Fill:
  - Name with "Unknown"
  - City with "Not Provided"
  - Age with the median age
  - Email with "noemail@company.com"
- Show the final cleaned DataFrame.
Bonus Task (Optional): Add a new column "IsAdult":
- True if Age >= 18
- False otherwise
Answer
1. Identify Missing Values
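One way to count the missing values per column is to chain `isnull()` with `sum()`. The sketch below rebuilds the sample DataFrame so it runs on its own; the name `missing_counts` is only illustrative.

```python
import pandas as pd

data = {
    'CustomerID': [1001, 1002, 1003, 1004, 1005],
    'Name': ['Alice', None, 'Charlie', 'David', None],
    'Age': [25, None, 29, None, 35],
    'Email': ['alice@gmail.com', 'bob@gmail.com', None, 'david@gmail.com', 'eve@gmail.com'],
    'City': ['New York', 'London', 'Paris', None, 'Berlin']
}
df = pd.DataFrame(data)

# isnull() marks every missing cell as True; sum() then counts the True values per column
missing_counts = df.isnull().sum()
print(missing_counts)
```

With the sample data this reports two missing values each for Name and Age, one each for Email and City, and none for CustomerID.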
2. Drop rows where both Name and Email are missing
In pandas, the ~ operator (Python's bitwise NOT) performs element-wise negation on a boolean Series.
It is commonly used to invert a boolean mask: it flips True to False and False to True.
Use Case in Pandas: Filtering Rows
When filtering rows in a DataFrame, you often write `df[condition]`, which keeps the rows where the condition is True.
If you want the opposite (rows where the condition is not true), you use `df[~condition]`:
```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', None, 'Bob'],
    'Email': ['alice@gmail.com', None, 'bob@gmail.com']
})
# Condition: rows where both Name and Email are missing
condition = df['Name'].isnull() & df['Email'].isnull()
# Invert it: keep rows where NOT both Name and Email are missing
filtered_df = df[~condition]
```
~condition means:
> "Select rows where Name and Email are not both null"
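Applied to the customer data from the challenge, the same mask might look like the sketch below. The names `both_missing`, `cleaned`, and `cleaned_alt` are illustrative, and the `dropna` call is shown as an alternative that should behave the same way here, not as the required solution.

```python
import pandas as pd

data = {
    'CustomerID': [1001, 1002, 1003, 1004, 1005],
    'Name': ['Alice', None, 'Charlie', 'David', None],
    'Age': [25, None, 29, None, 35],
    'Email': ['alice@gmail.com', 'bob@gmail.com', None, 'david@gmail.com', 'eve@gmail.com'],
    'City': ['New York', 'London', 'Paris', None, 'Berlin']
}
df = pd.DataFrame(data)

# True only for rows where Name AND Email are both missing
both_missing = df['Name'].isnull() & df['Email'].isnull()

# Keep every row that is not flagged by the mask
cleaned = df[~both_missing]

# Alternative: drop a row only when all of the listed columns are missing
cleaned_alt = df.dropna(subset=['Name', 'Email'], how='all')

print(len(df), len(cleaned))
```

In the sample data no customer is missing both Name and Email, so all five rows are kept. The step still matters for real data, where such rows can appear.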
3. Fill Missing Values
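```python
# Assign each result back to the column: fillna(..., inplace=True) on a selected
# column is deprecated chained assignment and may leave df unchanged in newer pandas

# Fill Name with "Unknown"
df['Name'] = df['Name'].fillna('Unknown')
# Fill City with "Not Provided"
df['City'] = df['City'].fillna('Not Provided')
# Fill Email with generic email
df['Email'] = df['Email'].fillna('noemail@company.com')
# Fill Age with median (median() skips missing values by default)
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)
```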
4. Add "IsAdult" Column
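A minimal sketch of the bonus task, using only the CustomerID and Age columns from the sample data. It assumes Age is filled before the comparison, because a missing age compared with 18 evaluates to False and would silently mark that customer as not adult.

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [1001, 1002, 1003, 1004, 1005],
    'Age': [25, None, 29, None, 35]
})

# Fill missing ages first so the comparison below sees real numbers
df['Age'] = df['Age'].fillna(df['Age'].median())

# The comparison is vectorized and yields a boolean Series (True if Age >= 18)
df['IsAdult'] = df['Age'] >= 18
print(df)
```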
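Final Cleaned DataFrame:
Output: with the sample data, no rows are dropped (no customer is missing both Name and Email), the two missing ages are filled with the median 29.0, and IsAdult is True for all five customers.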
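For reference, the four steps can be combined into one script. This is a sketch under the assumptions of the challenge, not the only valid solution; it passes a dictionary to `DataFrame.fillna` to fill several columns in a single call.

```python
import pandas as pd

data = {
    'CustomerID': [1001, 1002, 1003, 1004, 1005],
    'Name': ['Alice', None, 'Charlie', 'David', None],
    'Age': [25, None, 29, None, 35],
    'Email': ['alice@gmail.com', 'bob@gmail.com', None, 'david@gmail.com', 'eve@gmail.com'],
    'City': ['New York', 'London', 'Paris', None, 'Berlin']
}
df = pd.DataFrame(data)

# 1. Count missing values per column
print(df.isnull().sum())

# 2. Drop rows where both Name and Email are missing
df = df[~(df['Name'].isnull() & df['Email'].isnull())]

# 3. Fill the remaining gaps, one fill value per column
df = df.fillna({
    'Name': 'Unknown',
    'City': 'Not Provided',
    'Email': 'noemail@company.com',
    'Age': df['Age'].median(),
})

# 4. Bonus: flag adult customers
df['IsAdult'] = df['Age'] >= 18

print("Cleaned Data:\n", df)
```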