Why joblib?
There are several reasons to integrate joblib into an ML pipeline. Its website highlights two major ones, which I will rephrase here:
Transparent caching of results, which avoids recomputing steps that have already run
Simple parallelization, to fully utilize all the cores of the CPU
Beyond this, there are several other reasons why I would recommend joblib:
Can be easily integrated
No specific dependencies
Saves cost and time
Easy to learn
1. Using Cached results
The Memory class caches function results on disk, so a call repeated with the same arguments returns the stored result instead of recomputing it:
import time
import numpy as np
from joblib import Memory

# Define a location to store the cache (on disk)
location = '~/Desktop/temp/cache_dir'
memory = Memory(location, verbose=0)

def square_number(no):
    return no ** 2

# Function to compute the squares of a range of numbers:
def get_square_range(start_no, end_no):
    result = []
    for i in np.arange(start_no, end_no):
        time.sleep(1)  # simulate an expensive computation
        result.append(square_number(i))
    return result

get_square_range_cached = memory.cache(get_square_range)

start = time.time()
# Getting squares of 1 to 20:
final_result = get_square_range_cached(1, 21)
end = time.time()

# Total time to compute
print('\nThe function took {:.2f} s to compute.'.format(end - start))
print(final_result)
To clear the cached results:
memory.clear(warn=False)
2. Parallelization
As the name suggests, joblib.Parallel lets us run any function, even one with multiple arguments, over many inputs in parallel. Behind the scenes, when multiple jobs are specified, each call does not wait for the previous one to complete and can run on a different processor. To illustrate, here is how parallel jobs can be combined with caching.
# Import packages
import time
import numpy as np
from joblib import Parallel, delayed
from joblib import Memory

location = 'C:/Users/pg021/Desktop/temp/cache_dir'
memory = Memory(location, verbose=0)

def costly_compute(data, column):
    """Simulate an expensive extraction of one column."""
    time.sleep(2)
    return data[:, column]

costly_compute_cached = memory.cache(costly_compute)

def data_processing_mean_using_cache(data, column):
    """Compute the mean of a column."""
    return costly_compute_cached(data, column).mean()

data = np.random.randn(int(1e4), 4)

start = time.time()
results = Parallel(n_jobs=2)(
    delayed(data_processing_mean_using_cache)(data, col)
    for col in range(data.shape[1]))
stop = time.time()

print('Elapsed time for the entire processing: {:.2f} s'
      .format(stop - start))
With n_jobs=2, the four column computations are split across two workers, so the elapsed time is roughly halved compared to a sequential run; repeated runs return almost instantly thanks to the cache.
3. Dump and Load
We often need to store and load the datasets, models, computed results, etc. to and from a location on the computer. Joblib provides functions that can be used to dump and load easily:
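A minimal sketch of the two calls (the file name and the NumPy array here are just placeholders; any picklable Python object, such as a trained model, works the same way):

```python
import numpy as np
from joblib import dump, load

# Any picklable Python object works; here we persist a NumPy array
data = np.arange(10)

# Dump the object to a file on disk
dump(data, 'data.joblib')

# Load it back later (e.g. in another session)
restored = load('data.joblib')
print(restored)
```

For large NumPy arrays, joblib's dump/load is typically faster and more space-efficient than plain pickle, which is why it is the persistence method recommended for scikit-learn models.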
4. Compression methods
joblib.dump supports several compressors, including zlib, gzip, bz2, lzma, and xz; lz4 also works if the lz4 package is installed. Supported ones are:
a. Simple Compression:
b. Using Zlib compression:
c. Using lz4 compression:
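The three options above can be sketched as follows (file names are placeholders; the lz4 variant additionally requires the optional lz4 package, so it is left commented out):

```python
import numpy as np
from joblib import dump, load

data = np.arange(1000)

# a. Simple compression: compress=<int 0-9> uses the default
#    zlib-based compressor at that compression level
dump(data, 'data_simple.joblib', compress=3)

# b. Explicit zlib compression: pass a (method, level) tuple
dump(data, 'data_zlib.joblib', compress=('zlib', 3))

# c. lz4 compression (requires: pip install lz4)
# dump(data, 'data_lz4.joblib', compress=('lz4', 3))

# Compressed files load back with the same load() call
restored = load('data_zlib.joblib')
print((restored == data).all())
```

Higher compression levels shrink the file further at the cost of slower dump/load, so a mid-range level like 3 is a common trade-off.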
I find joblib to be a really useful library. I have started integrating it into many of my machine learning pipelines and am definitely seeing improvements.