Pandas is a great tool if your dataset satisfies two criteria: it lives in a single file, and that file fits in memory. If either of these is not the case, you need to rely on external packages. If your dataset does not fit in memory, for example, libraries such as Dask and Modin provide both out-of-memory processing and parallel loading of large datasets behind a pandas-inspired API.
These frameworks are great for analyzing terabytes of data on large clusters, but they are overkill when your dataset fits in memory and is merely slow to load. Loading can be slow when, for example, your dataset is spread across multiple files that need to be concatenated. Moreover, depending on the exact needs of your analysis, these frameworks may fall short: they do not currently support the entire pandas API.
So, what is the best way to speed up pandas with datasets that are large enough to be slow, but too small to fully benefit from frameworks such as Dask or Modin? Once the data is loaded into memory, you can parallelize common operations such as apply and groupby.apply with pandarallel. Other tasks that split easily into independent parts can be parallelized with parmap, a convenient wrapper around multiprocessing's Pool that provides a parallel map function. What is still missing is a parallel method to read multiple files with pandas, regardless of the filetype. The code below provides such a function for parquet files, but the general idea can be applied to any filetype supported by pandas.
The function below reads a dataset that is split across multiple parquet.gz files by loading the individual files in parallel and concatenating them afterwards. The code can easily be adapted to load other filetypes. The only requirements are pandas, tqdm, and a multicore processor. It uses the built-in concurrent.futures module and adds an optional tqdm progress bar, plus some minor optimizations inspired by StackOverflow threads, to further increase speed.
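A sketch of such a reader follows. The function name `read_parallel` and its parameters are my own; the `read_func` argument is an assumption I added so the same function can load other formats (any `pd.read_*` function that accepts a path). Reading is largely I/O-bound, so a thread pool is used here:

```python
import concurrent.futures

import pandas as pd

try:
    from tqdm import tqdm  # optional progress bar
except ImportError:
    tqdm = None


def read_parallel(paths, read_func=pd.read_parquet, max_workers=None, progress=True):
    """Read many files in parallel with read_func and concatenate the results.

    Threads work well here because reading files is I/O-bound and pandas
    releases the GIL for much of the parsing work.
    """
    paths = list(paths)
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves the input order of the files.
        futures = executor.map(read_func, paths)
        if progress and tqdm is not None:
            futures = tqdm(futures, total=len(paths))
        frames = list(futures)
    # Concatenating once at the end is much faster than appending in a loop.
    return pd.concat(frames, ignore_index=True)
```

For example, `read_parallel(Path("data").glob("*.parquet.gz"))` would load every compressed parquet file in a hypothetical `data/` directory into one DataFrame.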