Pandas is an excellent choice for handling datasets that meet the following conditions:
The dataset is stored in a single file.
The dataset fits within the available memory.
If either condition is not met, additional packages may be needed to process the data efficiently. For datasets that do not fit in memory, libraries such as Dask and Modin provide out-of-core processing and parallel loading behind a pandas-inspired API.
These frameworks are well suited to processing terabytes of data on large clusters, but they may be excessive for datasets that fit in memory yet take a long time to load. Slow loading often occurs when a dataset is spread across multiple files that must be read one by one and then concatenated. Moreover, these frameworks do not necessarily implement the entire pandas API, which can be a limitation depending on what the analysis requires.
To accelerate pandas operations with larger datasets that do not fully benefit from Dask or Modin, consider using pandarallel for parallelizing both apply and groupby.apply. Additionally, parmap, a convenient wrapper around multiprocessing’s Pool, provides a parallel map function for tasks that can be divided into independent parts.
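To make the idea of a parallel map over independent parts concrete, here is a minimal sketch that uses only the standard library rather than parmap itself. The names parallel_map and word_count and the use_threads flag are hypothetical, chosen for illustration only.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def word_count(text):
    """Hypothetical per-item task; each call is independent of the others."""
    return len(text.split())


def parallel_map(func, items, max_workers=None, use_threads=False):
    """Apply func to every item concurrently and return results in input order.

    Worker processes are used by default; use_threads=True switches to
    threads, which avoids pickling func and the items.
    """
    executor_cls = ThreadPoolExecutor if use_threads else ProcessPoolExecutor
    with executor_cls(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as the inputs.
        return list(executor.map(func, items))
```

With worker processes, func must be defined at module level so it can be pickled, and on platforms that spawn new interpreters the call should sit under an if __name__ == "__main__": guard.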
However, a parallel method for reading multiple files with pandas, regardless of file type, is still needed. The following function demonstrates how to read a dataset split across multiple parquet.gz files by loading the individual files in parallel and concatenating them afterward. This approach can be adapted to any other file type supported by pandas.
The only requirements for this function are pandas, tqdm, and a multicore processor. The code uses Python’s built-in concurrent.futures module and incorporates an optional tqdm progress bar, along with minor optimizations inspired by Stack Overflow discussions to further improve performance.
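As a rough illustration of the approach described above, here is a minimal sketch under stated assumptions; it is not necessarily the original implementation. The function name read_files_parallel and its parameters (reader, use_threads, progress) are hypothetical. It also assumes every file can be read with a single pandas reader call and that all files share the same columns.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed

import pandas as pd


def read_files_parallel(paths, reader=pd.read_parquet, max_workers=None,
                        use_threads=False, progress=False, **reader_kwargs):
    """Read several files concurrently and concatenate them in input order.

    reader is any pandas reader function (pd.read_parquet, pd.read_csv, ...);
    extra keyword arguments are forwarded to it unchanged.
    """
    paths = list(paths)
    if not paths:
        return pd.DataFrame()

    executor_cls = ThreadPoolExecutor if use_threads else ProcessPoolExecutor
    frames = [None] * len(paths)
    with executor_cls(max_workers=max_workers) as executor:
        # Map each future to its position so the input order can be restored.
        futures = {
            executor.submit(reader, path, **reader_kwargs): position
            for position, path in enumerate(paths)
        }
        completed = as_completed(futures)
        if progress:
            # tqdm is imported lazily so it is only needed when requested.
            from tqdm import tqdm
            completed = tqdm(completed, total=len(futures))
        for future in completed:
            # result() re-raises any exception raised while reading the file.
            frames[futures[future]] = future.result()

    # A single concat at the end avoids repeatedly copying a growing frame.
    return pd.concat(frames, ignore_index=True)
```

A call might look like read_files_parallel(sorted(glob.glob("data/*.parquet.gz")), progress=True), where the data directory is a placeholder. With worker processes each DataFrame has to be pickled back to the parent, so for some workloads use_threads=True may be just as fast. On platforms that spawn new interpreters, the process-based call should sit under an if __name__ == "__main__": guard.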