Johannes Weytjens

Resample unbalanced (panel) datasets in Pandas fast

Aug 6, 2024

Working with panel data can be challenging, particularly when the data is unbalanced, meaning the observations do not occur at regular intervals. This irregularity poses a problem when applying operations like .diff() or .shift(), as Pandas assumes consecutive observations correspond to consecutive time periods. Without a fixed periodicity, calculating differences or shifts correctly becomes difficult. To address this, we can use a series of custom functions to balance the dataset by generating the missing intermediate values for every column if required. These new resampled values by default are set to NaN.

In this blog post, we’ll explore a set of custom functions designed to handle unbalanced panel data in Pandas. These functions will allow you to add a pseudodate, resample missing dates, fill missing values, and ensure that your data is correctly aligned for time-based operations.

The add_pseudodate function adds a pseudodate column to the dataframe, aligning irregular time periods to a continuous timeline. This is crucial for operations that require a consistent time index.

import pandas as pd


def add_pseudodate(df, start_pseudodate=pd.Timestamp("1900-01-01")):
    # offset such that the first monthly date aligns
    # with the first daily date
    offset = df["date"].min().year * 12 + df["date"].min().month

    # map
    df["pseudodate"] = start_pseudodate + pd.to_timedelta(
        df["date"].dt.year * 12 + df["date"].dt.month - offset, unit="D"
    )

    return df

This function calculates an offset based on the minimum date in your dataset and maps each date to a corresponding pseudodate, starting from a specified start_pseudodate. The result is a new column in your dataframe that maintains temporal continuity.

Once the pseudodate is added, we need to fill in the missing dates. The resample_missing_pseudodates function offers two methods (fast and slow) to resample the data. The slow method is the straight forward pandas approach. This can be slow for panel datasets with high N (~500’000) and comparatively small T (~100). The fast method manually constructs a new MultiIndex with all the required observations and is about 4 times faster than the slow method.

import itertools

import pandas as pd


def resample_missing_pseudodates(
    df, pseudodate="pseudodate", group_id="group", method="fast"
):
    if method == "fast":
        return _resample_fast(df, pseudodate=pseudodate, group_id=group_id)

    if method == "slow":
        return _resample_slow(df, pseudodate=pseudodate, group_id=group_id)

    return df


def _resample_slow(df, pseudodate, group_id):
    return (
        df.set_index(pseudodate)
        .groupby(group_id)
        .resample("1D")
        .asfreq()
        .drop(columns=[group_id])
        .reset_index()
    )


def _resample_fast(df, pseudodate, group_id):
    # determine the first and last observation of each household
    date_ranges = df.groupby(group_id, as_index=False)[pseudodate].agg(["min", "max"])

    # construct all dates between the first and last date for each households
    combinations = [
        ([row[group_id]], list(pd.date_range(row["min"], row["max"], freq="D")))
        for _, row in date_ranges.iterrows()
    ]

    # combine alln products of household and dates in a single list
    result = [
        product
        for combination in combinations
        for product in itertools.product(*combination)
    ]
    mindex = pd.MultiIndex.from_tuples(result).set_names([group_id, pseudodate])

    # reindex as a faster alternative for .resample().asfreq()
    return df.set_index([group_id, pseudodate]).reindex(mindex).reset_index()

Once missing dates are resampled, the next step is to fill in the missing values in the resampled columns.

def fill_resampled_columns(
    df,
    id_columns=["date", "resampled"],
    group_id="group",
    fill_value=0,
):
    id_columns = [group_id] + id_columns

    if isinstance(fill_value, (int, float)):
        columns = list(set(df.columns).difference(set(id_columns)))
        df[columns] = df[columns].fillna(fill_value)

    return df

This function fills the missing values with a specified fill_value, ensuring that your dataset is complete and ready for further analysis. These two functions add labels to a new column resampled to differentiate between original and resampled data points.

def label_original_observations(df):
    df["resampled"] = False

    return df


def label_resampled_observations(df):
    df["resampled"] = df["resampled"].astype(bool).fillna(True)

    return df

The impute_resampled_dates function adjusts the pseudodates back to actual dates, maintaining the temporal alignment. Currently (pandas v2.2.2) doesn’t support vectorized additions of DateOffsets, i.e. adding a column of DateOffsets to a datetime column. Pandas does allow fast addition of a single DateOffset to a datetime column. The function below partially vectorizes the addition by looping over all unique DateOffsets values. If T is small compared to N, this is much faster than other approaches.

import pandas as pd
from pandas.tseries.offsets import DateOffset
from tqdm.auto import tqdm


def impute_resampled_dates(
    df,
    pseudodate="pseudodate",
    start_pseudodate=pd.Timestamp("1900-01-01"),
    progress_bar=True,
    group_id="group",
    imputed_date="date",
):
    df["offset_days"] = (df[pseudodate] - start_pseudodate).dt.days

    # adding a column of DateOffsets isn't vectorized
    # adding a single DateOffset IS vectorized
    # making this much faster than non vectorized approaches
    min_date = df["date"].min()
    for offset in tqdm(df["offset_days"].unique(), disable=not progress_bar):
        condition = df["offset_days"] == offset
        df.loc[condition, imputed_date] = min_date + DateOffset(months=offset)

    return df.drop(columns=["offset_days"]).sort_values(by=[group_id, "date"])

This function ensures that the imputed dates align with the original date values, keeping the data consistent.

Finally, the remove_pseudodate function cleans up the dataframe by removing the pseudodate column, leaving you with a balanced dataset ready for analysis.

def remove_pseudodate(df):
    return df.drop(columns=["pseudodate"])

To use these functions, you can chain them together using Pandas’ pipe function:

df = (
    df.pipe(add_pseudodate)
    .pipe(label_original_observations)
    .pipe(resample_missing_pseudodates)
    .pipe(fill_resampled_columns)
    .pipe(label_resampled_observations)
    .pipe(impute_resampled_dates)
    .pipe(remove_pseudodate)
)

This pipeline will take your unbalanced panel data, fill in the missing values, and prepare it for time series operations like .diff() or .shift().