2026-03-18 14:25 Tags:

📌 Core Concept

Many machine learning models cannot handle categorical data as strings.

Example:

  • Linear regression cannot assign coefficients to values like "red" or "blue"

👉 Therefore, we must convert categorical variables into numeric form


🔄 Solution: Dummy Variables (One-Hot Encoding)

Convert categories into binary columns:

pd.get_dummies(data)

👉 Also known as:

  • Dummy variables

  • One-hot encoding
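A minimal runnable sketch of what the encoding produces (toy colors data, not the Ames dataset):

```python
import pandas as pd

# Toy categorical column
colors = pd.Series(['red', 'blue', 'red'], name='color')

# One binary column per category (columns come out in sorted order: blue, red)
dummies = pd.get_dummies(colors)
print(dummies)
```

Each row has exactly one "hot" entry, marking which category it belongs to.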


📦 Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

📂 Load Data

df = pd.read_csv("../DATA/Ames_NO_Missing_Data.csv")
df.head()

📖 Data Description

with open('../DATA/Ames_Housing_Feature_Description.txt','r') as f: 
    print(f.read())

⚠️ Numerical Column to Categorical

Some columns look numeric but are actually categorical codes

Example: MS SubClass

Values:

20 → 1-STORY 1946 & NEWER  
30 → 1-STORY 1945 & OLDER  
...

👉 These numbers do NOT represent magnitude or order

💡 Although 30 > 20, it does NOT mean “better” or “larger”
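A toy sketch of why the conversion matters: `get_dummies` leaves numeric columns untouched, so a numeric-coded categorical slips through unencoded (column names and values here are illustrative, not the real Ames columns):

```python
import pandas as pd

df = pd.DataFrame({'MSSubClass': [20, 30, 20], 'Style': ['A', 'B', 'A']})

# Numeric column passes through unchanged; only 'Style' is encoded
before = pd.get_dummies(df).columns.tolist()
print(before)

# After converting the codes to strings, MSSubClass is encoded too
df['MSSubClass'] = df['MSSubClass'].astype(str)
after = pd.get_dummies(df).columns.tolist()
print(after)
```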


✅ Convert to String (Categorical)

df['MS SubClass'] = df['MS SubClass'].apply(str)

⚠️ Dummy Variable Trap (Multicollinearity)

Example:

person_state = pd.Series(['Dead','Alive','Dead','Alive','Dead','Dead'])
pd.get_dummies(person_state)

👉 This creates redundant columns


✅ Solution: Drop First Column

pd.get_dummies(person_state, drop_first=True)

💡 Removes one category to avoid multicollinearity
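Comparing shapes on the same toy Series makes the effect visible:

```python
import pandas as pd

person_state = pd.Series(['Dead', 'Alive', 'Dead', 'Alive', 'Dead', 'Dead'])

full = pd.get_dummies(person_state)                      # both columns: Alive, Dead
reduced = pd.get_dummies(person_state, drop_first=True)  # 'Alive' (first alphabetically) dropped

print(full.shape, reduced.shape)
```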


🔍 Select Column Types

Separate numeric and categorical features

df.select_dtypes(include='object')
df_nums = df.select_dtypes(exclude='object')
df_objs = df.select_dtypes(include='object')

Inspect

df_nums.info()
df_objs.info()
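A toy sketch of the split (hypothetical column names):

```python
import pandas as pd

df = pd.DataFrame({'price': [100, 200], 'color': ['red', 'blue']})

# String columns have dtype 'object'; numbers are int64/float64
nums = df.select_dtypes(exclude='object')
objs = df.select_dtypes(include='object')
print(nums.columns.tolist(), objs.columns.tolist())
```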

🔄 Convert Categorical Variables

df_objs = pd.get_dummies(df_objs, drop_first=True)

🔗 Combine Back

final_df = pd.concat([df_nums, df_objs], axis=1)
final_df

⚠️ Final Thoughts

  • The dataset now has many more columns (e.g. 274)

  • More features does NOT guarantee better performance

💡 May lead to:

  • Overfitting

  • Worse model generalization


🔍 Feature Correlation

final_df.corr()['SalePrice'].sort_values()

📌 Example Feature: OverallQual

10 → Very Excellent  
1 → Very Poor  

👉 This feature is likely human-rated

💡 Implication:

  • It may already summarize other features

  • Future predictions may require human input


💾 Save Final Dataset

final_df.to_csv('../DATA/AMES_Final_DF.csv', index=False)  # index=False avoids an extra "Unnamed: 0" column on reload

🧠 Dummy Variable Trap

📌 Core Idea (one sentence)

The dummy variable trap happens when your encoded features are perfectly predictable from each other, causing multicollinearity.


🧩 Start with an Example

Original categorical variable:

State = ['Alive', 'Dead', 'Alive']

One-hot encoding:

Alive   Dead
1       0
0       1
1       0

🚨 Where is the problem?

Look closely:

Dead = 1 - Alive

👉 That means:

  • If you know Alive, you automatically know Dead

  • One column is redundant


🔥 Why is this bad?

Think like a regression model:

The model (with an intercept term) tries to learn:

y = β0 + β1 * Alive + β2 * Dead

But since:

Dead = 1 - Alive

Substitute:

y = β0 + β1 * Alive + β2 * (1 - Alive)
  = (β0 + β2) + (β1 - β2) * Alive

👉 The data only determine (β0 + β2) and (β1 - β2):

  • Infinitely many combinations of (β0, β1, β2) give the SAME predictions (add any constant to β1 and β2 and subtract it from β0)

💥 Result

  • Coefficients become unstable

  • Model cannot uniquely determine weights

  • Interpretability becomes meaningless

This is called:

👉 Perfect multicollinearity
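A small NumPy sketch of the trap: with an intercept plus both dummy columns, different coefficient vectors give identical predictions, and the design matrix loses rank:

```python
import numpy as np

# Dummy columns with Dead = 1 - Alive, plus an intercept column
alive = np.array([1, 0, 1, 0])
dead = 1 - alive
X = np.column_stack([np.ones(4), alive, dead])

# Two different coefficient vectors (second = first with +3 on the
# intercept and -3 on each dummy coefficient)...
beta_a = np.array([0.0, 2.0, 5.0])
beta_b = np.array([3.0, -1.0, 2.0])

# ...yield identical predictions: the weights are not uniquely determined
print(X @ beta_a)
print(X @ beta_b)
print(np.linalg.matrix_rank(X))  # rank 2 < 3 columns: perfect multicollinearity
```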


✅ Solution: Drop One Column

pd.get_dummies(data, drop_first=True)

Now:

Dead
0
1
0

👉 Interpretation:

  • Dead = 1 → Dead

  • Dead = 0 → Alive (baseline)


🧠 Intuition (this is the key)

Think of it like:

You don’t need both:

  • “Is Alive?”

  • “Is Dead?”

👉 One is enough.


🎯 General Rule

If a categorical variable has:

  • k categories

👉 You only need:

  • k - 1 dummy variables
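A quick check of the k − 1 rule with a toy 3-category Series:

```python
import pandas as pd

sizes = pd.Series(['S', 'M', 'L', 'M', 'S'])  # k = 3 categories

# drop_first=True keeps k - 1 = 2 columns; 'L' (first alphabetically) becomes the baseline
dummies = pd.get_dummies(sizes, drop_first=True)
print(dummies.columns.tolist())
```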

What is pd.concat()?

Core idea

pd.concat() combines multiple pandas objects (DataFrames or Series) along a specified axis.

You can think of it as stacking or aligning tables together.


Two main modes

1. Concatenate rows (axis=0)

pd.concat([df1, df2], axis=0)
  • Stacks DataFrames vertically

  • Adds more rows

  • Columns must match (otherwise NaN appears)
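A runnable sketch with toy frames:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [5, 6], 'b': [7, 8]})

# Vertical stack: 2 + 2 = 4 rows, same 2 columns
# (ignore_index=True renumbers the rows 0..3 instead of repeating 0,1,0,1)
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)
print(stacked.shape)
```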


2. Concatenate columns (axis=1) ← your case

pd.concat([df1, df2], axis=1)
  • Combines DataFrames side by side

  • Adds more columns

  • Rows are aligned by index


Your specific usage

final_df = pd.concat([df_nums, df_objs], axis=1)

What’s happening:

  • df_nums: numeric features

  • df_objs: categorical features after one-hot encoding

  • axis=1: combine columns

Result:

  • A single DataFrame with all features

  • All columns are numeric → ready for ML models


Important detail: index alignment

pd.concat() aligns data by index.

Example:

df1.index = [0,1]
df2.index = [1,2]
 
pd.concat([df1, df2], axis=1)

Result:

index   df1     df2
0       value   NaN
1       value   value
2       NaN     value
So if indices don’t match, you will introduce missing values.
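A runnable version of this example, plus one common fix (using `reset_index` assumes you want positional alignment and don't need the original labels):

```python
import pandas as pd

df1 = pd.DataFrame({'x': [10, 20]}, index=[0, 1])
df2 = pd.DataFrame({'y': [30, 40]}, index=[1, 2])

# Misaligned indices -> NaN holes at index 0 and index 2
combined = pd.concat([df1, df2], axis=1)
print(combined)

# Fix: reset both indices before concatenating
aligned = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
print(aligned)
```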


Common mistake

pd.concat([df_nums, df_objs], axis=0)

This will:

  • Stack rows instead of columns

  • Create many NaNs because columns differ


When to use concat

Use pd.concat() when:

  • You already have separate DataFrames

  • You just want to combine them

  • No key-based matching is needed


Mental model

  • axis=0: add more observations

  • axis=1: add more features

