The library allows you not to think about the number of workers, creating processes and provides an interactive interface for monitoring progress.
pip install pandas jupyter pandarallel requests tqdm
As you can see, I also install tqdm. With it, I will clearly demonstrate the difference in the speed of code execution in a sequential and parallel approach.
import pandas as pd import requests from tqdm import tqdm tqdm.pandas() from pandarallel import pandarallel pandarallel.initialize(progress_bar=True)
You can find the complete list of settings in the pandarallel documentation.
For experiments, let's create a simple data frame - 100 rows, 1 column.
df = pd.DataFrame( [i for i in range(100)], columns=["sample_column"] )
As we know, the solution of not all problems can be paralleled. A simple example of a suitable task is to call some external source, such as an API or database. In the function below, I call an API that returns one random word. My goal is to add a column with words derived from this API to the data frame.
def function_to_apply(i): r = requests.get(f'https://random-word-api.herokuapp.com/word').json() return r
The classic way to solve the problem
df["sample-word"] = df.sample_column.progress_apply(function_to_apply)
For people not familiar with the tqdm library, I’ll explain that the progress_apply function acts similarly to the usual apply. And in addition to this, we get an interactive progress bar.
Such a “naive” solution took 35 seconds.
Parallel way to solve the problem
To use all available cores, just use the parallel_apply function:
df["sample-word"] = df.sample_column.parallel_apply(function_to_apply)
Solving the problem using all available cores took only 5 seconds.
A full list of pandas features supported by pandarallel is listed on the GitHub project.
Thanks for attention!