Representations of Missing Data
Pandas uses the following values to represent missing data:
- np.nan: For float data.
- None: For object-dtype data.
- pd.NaT: For datetime-like data.
- pd.NA: A new scalar value introduced for consistent representation across all data types.
Example
import numpy as np
import pandas as pd
print(np.nan) # Float missing value
print(pd.NA) # General missing value
print(pd.NaT) # Datetime missing value
Important Notes on Comparisons
- Comparisons with missing values can lead to unexpected results:
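np.nan == np.nan → False
np.nan in [np.nan] → True (list membership checks identity before equality)
pd.NA == pd.NA → <NA> (neither True nor False)
- This happens because missing values are treated as “unknown,” so equality cannot be determined. To test for missing values, use pd.isna() rather than ==.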
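A minimal sketch of the comparison behavior described above, and of using pd.isna() as the reliable check. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Equality cannot be used to detect missing values.
nan_equal = np.nan == np.nan   # False
na_equal = pd.NA == pd.NA      # <NA>, neither True nor False

# pd.isna() recognizes every missing-value marker listed earlier.
markers = [np.nan, None, pd.NaT, pd.NA]
detected = [pd.isna(m) for m in markers]

print(nan_equal)
print(na_equal)
print(detected)
```

Because pd.NA == pd.NA evaluates to <NA>, using it directly in an if statement raises a TypeError, which is one more reason to prefer pd.isna().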
Example Dataset
Imagine a dataset of people’s ratings of actors before and after watching a movie, but with some missing data:
df = pd.read_csv('movie_scores.csv')
print(df)
Identifying Missing Values
- isnull(): Detects missing values.
- notnull(): Detects non-missing values.
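Since movie_scores.csv is not included here, the following sketch uses a small made-up DataFrame to show what these two methods return. The column names come from the examples in this section; the values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movie_scores.csv (values are made up).
sample = pd.DataFrame({
    "first_name": ["Tom", None, "Hugh"],
    "sex": ["m", "f", None],
    "pre_movie_score": [8.0, np.nan, np.nan],
})

missing_mask = sample.isnull()           # True wherever a value is missing
missing_per_column = missing_mask.sum()  # number of missing values per column
named_rows = sample[sample["first_name"].notnull()]  # rows that have a first name

print(missing_per_column)
print(named_rows)
```

Summing the boolean mask is a quick way to see how many values are missing in each column before deciding whether to drop or fill them.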
Examples
# Check for missing values
df.isnull()
# Select rows where 'first_name' is not null
df[df['first_name'].notnull()]
# Filter rows with specific conditions
df[(df['pre_movie_score'].isnull()) & (df['sex'].notnull())]
Dropping Missing Data
dropna(): Removes rows or columns with missing data.
Options
- thresh: Minimum number of non-NA values required to retain the row/column.
- axis: Drop rows (axis=0) or columns (axis=1).
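To make the interaction between thresh and axis concrete, here is a small sketch on a made-up DataFrame (the columns a, b, and c are hypothetical and unrelated to the movie dataset).

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, np.nan],
})

# Keep rows that have at least 2 non-missing values (axis=0 is the default).
rows_kept = sample.dropna(thresh=2)

# Keep columns that have at least 3 non-missing values.
cols_kept = sample.dropna(thresh=3, axis=1)

print(rows_kept)
print(cols_kept)
```

Here the last row is dropped because it has no non-missing values, and only column c survives the column-wise threshold. Both calls return new DataFrames and leave sample unchanged.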
Examples
# Drop rows with any missing values
df.dropna()
# Drop columns with fewer than 4 non-NA values
df.dropna(thresh=4, axis=1)
Filling Missing Data
fillna(): Replace missing values with a specific value or computation.
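As a sketch of filling different columns with different values in a single call, fillna() also accepts a dict that maps column names to fill values. The DataFrame below is a made-up stand-in for the movie dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (values are made up for illustration).
sample = pd.DataFrame({
    "first_name": ["Tom", None, "Hugh"],
    "pre_movie_score": [8.0, np.nan, 6.0],
})

# Each column gets its own fill value.
fill_values = {
    "first_name": "Unknown",
    "pre_movie_score": sample["pre_movie_score"].mean(),
}
filled = sample.fillna(fill_values)  # returns a new DataFrame; sample is unchanged

print(filled)
```

Assigning the result to a new name keeps the original data available, which makes it easier to compare the effect of different fill strategies.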
Examples
# Fill missing values with a string
df.fillna("NEW VALUE!")
# Fill a specific column with a default value
# (assign the result back; inplace=True on a selected column may not update df)
df['first_name'] = df['first_name'].fillna("Empty")
# Fill with the column mean
df['pre_movie_score'] = df['pre_movie_score'].fillna(df['pre_movie_score'].mean())
# Fill all numeric columns with their mean
df = df.fillna(df.mean(numeric_only=True))
Interpolation
Interpolation is a method of estimating missing values based on existing data.
- Default method: Linear interpolation.
- Other methods: spline, polynomial, etc.
Example
# Create a sample series
airline_tix = {'first': 100, 'business': np.nan, 'economy-plus': 50, 'economy': 30}
ser = pd.Series(airline_tix)
# Linear interpolation
ser.interpolate()
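# Spline interpolation (requires SciPy and a numeric index,
# so the string index is replaced with 0, 1, 2, ... first)
ser.reset_index(drop=True).interpolate(method='spline', order=2)
Docs: Pandas Interpolate Documentation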
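To show what the default linear method actually computes, this sketch rebuilds the airline ticket series from the example and checks the filled value. With the default method the index labels are ignored and values are treated as equally spaced.

```python
import numpy as np
import pandas as pd

airline_tix = {'first': 100, 'business': np.nan, 'economy-plus': 50, 'economy': 30}
ser = pd.Series(airline_tix)

# Default method is linear: the gap sits halfway between 100 and 50.
linear = ser.interpolate()
business_price = linear['business']

print(linear)
print(business_price)  # 75.0
```

The estimate is only as good as the assumption behind it: linear interpolation assumes prices change evenly from one position to the next, which may not hold for real data.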
Summary
- Identify missing data using isnull() and notnull().
- Drop rows or columns using dropna().
- Fill missing values with default values or computed statistics using fillna().
- Use interpolation to estimate missing values carefully.