Resample unbalanced (panel) datasets in Pandas fast
Aug 6, 2024
Working with panel data can be challenging, particularly when the data is unbalanced, meaning the observations do not occur at regular intervals. This irregularity poses a problem when applying operations like .diff()
or .shift()
, as Pandas assumes consecutive observations correspond to consecutive time periods. Without a fixed periodicity, calculating differences or shifts correctly becomes difficult. To address this, we can use a series of custom functions to balance the dataset by generating the missing intermediate values for every column if required. These new resampled values by default are set to NaN.
In this blog post, we’ll explore a set of custom functions designed to handle unbalanced panel data in Pandas. These functions will allow you to add a pseudodate, resample missing dates, fill missing values, and ensure that your data is correctly aligned for time-based operations.
The add_pseudodate
function adds a pseudodate column to the dataframe, aligning irregular time periods to a continuous timeline. This is crucial for operations that require a consistent time index.
import pandas as pd
def add_pseudodate(df, start_pseudodate=pd.Timestamp("1900-01-01")):
# offset such that the first monthly date aligns
# with the first daily date
offset = df["date"].min().year * 12 + df["date"].min().month
# map
df["pseudodate"] = start_pseudodate + pd.to_timedelta(
df["date"].dt.year * 12 + df["date"].dt.month - offset, unit="D"
)
return df
This function calculates an offset based on the minimum date in your dataset and maps each date to a corresponding pseudodate, starting from a specified start_pseudodate
. The result is a new column in your dataframe that maintains temporal continuity.
Once the pseudodate is added, we need to fill in the missing dates. The resample_missing_pseudodates
function offers two methods (fast
and slow
) to resample the data. The slow
method is the straight forward pandas approach. This can be slow for panel datasets with high N (~500’000) and comparatively small T (~100). The fast
method manually constructs a new MultiIndex
with all the required observations and is about 4 times faster than the slow method.
import itertools
import pandas as pd
def resample_missing_pseudodates(
df, pseudodate="pseudodate", group_id="group", method="fast"
):
if method == "fast":
return _resample_fast(df, pseudodate=pseudodate, group_id=group_id)
if method == "slow":
return _resample_slow(df, pseudodate=pseudodate, group_id=group_id)
return df
def _resample_slow(df, pseudodate, group_id):
return (
df.set_index(pseudodate)
.groupby(group_id)
.resample("1D")
.asfreq()
.drop(columns=[group_id])
.reset_index()
)
def _resample_fast(df, pseudodate, group_id):
# determine the first and last observation of each household
date_ranges = df.groupby(group_id, as_index=False)[pseudodate].agg(["min", "max"])
# construct all dates between the first and last date for each households
combinations = [
([row[group_id]], list(pd.date_range(row["min"], row["max"], freq="D")))
for _, row in date_ranges.iterrows()
]
# combine alln products of household and dates in a single list
result = [
product
for combination in combinations
for product in itertools.product(*combination)
]
mindex = pd.MultiIndex.from_tuples(result).set_names([group_id, pseudodate])
# reindex as a faster alternative for .resample().asfreq()
return df.set_index([group_id, pseudodate]).reindex(mindex).reset_index()
Once missing dates are resampled, the next step is to fill in the missing values in the resampled columns.
def fill_resampled_columns(
df,
id_columns=["date", "resampled"],
group_id="group",
fill_value=0,
):
id_columns = [group_id] + id_columns
if isinstance(fill_value, (int, float)):
columns = list(set(df.columns).difference(set(id_columns)))
df[columns] = df[columns].fillna(fill_value)
return df
This function fills the missing values with a specified fill_value
, ensuring that your dataset is complete and ready for further analysis. These two functions add labels to a new column resampled
to differentiate between original and resampled data points.
def label_original_observations(df):
df["resampled"] = False
return df
def label_resampled_observations(df):
df["resampled"] = df["resampled"].astype(bool).fillna(True)
return df
The impute_resampled_dates
function adjusts the pseudodates back to actual dates, maintaining the temporal alignment. Currently (pandas v2.2.2) doesn’t support vectorized additions of DateOffsets, i.e. adding a column of DateOffsets to a datetime column. Pandas does allow fast addition of a single DateOffset to a datetime column. The function below partially vectorizes the addition by looping over all unique DateOffsets values. If T is small compared to N, this is much faster than other approaches.
import pandas as pd
from pandas.tseries.offsets import DateOffset
from tqdm.auto import tqdm
def impute_resampled_dates(
df,
pseudodate="pseudodate",
start_pseudodate=pd.Timestamp("1900-01-01"),
progress_bar=True,
group_id="group",
imputed_date="date",
):
df["offset_days"] = (df[pseudodate] - start_pseudodate).dt.days
# adding a column of DateOffsets isn't vectorized
# adding a single DateOffset IS vectorized
# making this much faster than non vectorized approaches
min_date = df["date"].min()
for offset in tqdm(df["offset_days"].unique(), disable=not progress_bar):
condition = df["offset_days"] == offset
df.loc[condition, imputed_date] = min_date + DateOffset(months=offset)
return df.drop(columns=["offset_days"]).sort_values(by=[group_id, "date"])
This function ensures that the imputed dates align with the original date values, keeping the data consistent.
Finally, the remove_pseudodate
function cleans up the dataframe by removing the pseudodate column, leaving you with a balanced dataset ready for analysis.
def remove_pseudodate(df):
return df.drop(columns=["pseudodate"])
To use these functions, you can chain them together using Pandas’ pipe
function:
df = (
df.pipe(add_pseudodate)
.pipe(label_original_observations)
.pipe(resample_missing_pseudodates)
.pipe(fill_resampled_columns)
.pipe(label_resampled_observations)
.pipe(impute_resampled_dates)
.pipe(remove_pseudodate)
)
This pipeline will take your unbalanced panel data, fill in the missing values, and prepare it for time series operations like .diff()
or .shift()
.