Thursday, September 17, 2020

Using joblib to speed up your Python pipelines

Why joblib?

There are several reasons to integrate joblib into an ML pipeline. Two major ones are mentioned on the joblib website; rephrased here:

Caching of results, which avoids recomputing steps whose inputs have not changed

Parallel execution, to fully utilize all the cores of the CPU.


Beyond this, there are several other reasons why I would recommend joblib:

Can be easily integrated

No specific dependencies

Saves cost and time

Easy to learn


1. Using cached results

joblib.Memory caches a function's output on disk: calls with previously seen arguments return the stored result instead of recomputing it.

import time
import numpy as np
from joblib import Memory

# Define a location to store the cache
location = '~/Desktop/temp/cache_dir'
memory = Memory(location, verbose=0)


def square_number(no):
    """Return the square of a number."""
    return no ** 2


# Function to compute the square of each number in a range:
def get_square_range(start_no, end_no):
    result = []
    for i in np.arange(start_no, end_no):
        time.sleep(1)  # simulate an expensive step
        result.append(square_number(i))
    return result


# Wrap the function so its results are cached on disk
get_square_range_cached = memory.cache(get_square_range)

start = time.time()
# Getting the squares of 1 to 20:
final_result = get_square_range_cached(1, 21)
end = time.time()

# Total time to compute
print('\nThe function took {:.2f} s to compute.'.format(end - start))
print(final_result)
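
Because the results are cached, calling the function again with the same arguments skips the computation entirely. A quick check, reusing the names defined above:


# A second call with the same arguments is served from the disk cache
start = time.time()
final_result = get_square_range_cached(1, 21)
end = time.time()
print('The cached call took {:.2f} s.'.format(end - start))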



To clear the cached results, just call:


memory.clear(warn=False)



2. Parallelization


As the name suggests, joblib.Parallel can run any specified function in parallel, even one that takes multiple arguments. Behind the scenes, when multiple jobs are specified, each call does not wait for the previous one to complete and can run on a different processor core. A basic sketch of the pattern follows; after that, the main example shows how parallel jobs can be run on top of caching.
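
A minimal sketch of Parallel on its own (the add function here is just a placeholder):


from joblib import Parallel, delayed

def add(a, b):
    return a + b

# n_jobs=2 dispatches the delayed calls across two worker processes
sums = Parallel(n_jobs=2)(delayed(add)(i, i + 1) for i in range(10))
print(sums)  # [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]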


# Import packages
import time
import numpy as np
from joblib import Parallel, delayed
from joblib import Memory

location = 'C:/Users/pg021/Desktop/temp/cache_dir'
memory = Memory(location, verbose=0)

# Example dataset and costly function, defined here so the
# snippet runs end to end: 10,000 rows x 4 random columns
data = np.random.randn(int(1e4), 4)


def costly_compute(data, column):
    """Emulate an expensive computation: sleep, then return one column."""
    time.sleep(2)
    return data[:, column]


costly_compute_cached = memory.cache(costly_compute)


def data_processing_mean_using_cache(data, column):
    """Compute the mean of a column."""
    return costly_compute_cached(data, column).mean()


start = time.time()
results = Parallel(n_jobs=2)(
    delayed(data_processing_mean_using_cache)(data, col)
    for col in range(data.shape[1]))
stop = time.time()

print('Elapsed time for the entire processing: {:.2f} s'
      .format(stop - start))



Here, with n_jobs=2, two columns are processed at a time, so the elapsed time is roughly halved compared to running the same loop sequentially.
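
Note that passing n_jobs=-1 tells joblib to use all available CPU cores; n_jobs=2 was chosen here just for illustration.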



3. Dump and Load


We often need to store and load datasets, models, computed results, etc. to and from disk. Joblib provides dump and load functions that make this easy, as the sketch below shows.
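
A minimal sketch (the array and file name here are only for illustration):


import numpy as np
from joblib import dump, load

data = np.random.randn(1000, 4)

# Persist the object to disk
dump(data, 'data.joblib')

# Read it back later
restored = load('data.joblib')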


4. Compression methods


joblib.dump can compress the data it writes, controlled through its compress argument. The supported methods covered here are:


a. Simple compression (an integer level for the default zlib-based method)

b. Zlib compression

c. lz4 compression (requires the lz4 package)
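

A minimal sketch of each, reusing the data array from the dump example above (the file names are illustrative):


from joblib import dump

# a. Simple compression: an integer (0-9) sets the compression
#    level of joblib's default zlib-based method
dump(data, 'data_simple.joblib', compress=3)

# b. Zlib compression: pass an explicit ('method', level) tuple
dump(data, 'data_zlib.joblib', compress=('zlib', 3))

# c. lz4 compression: typically the fastest, but requires the
#    lz4 package to be installed (`pip install lz4`)
dump(data, 'data_lz4.joblib', compress=('lz4', 3))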




I find joblib to be a really useful library. I have started integrating it into many of my machine learning pipelines and am definitely seeing improvements.


References:

https://towardsdatascience.com/using-joblib-to-speed-up-your-python-pipelines-dd97440c653d



