Batch processing - an engineering approach to expedite analyses

4.19.22

By Ryan Gaspar

About the author

Ryan Gaspar is the Senior Machine Learning Engineer (MLE) at Rune Labs. He provides machine learning services both internally and externally to engineers and data scientists. Previously, he worked as an Electrical Engineer before moving fully into software and AI as a Data Scientist/MLE. His experience spans large and small datasets, as well as machine learning and deep learning applications for Natural Language Processing (NLP) and Computer Vision (CV).

Introduction

Managing and working with large datasets is a challenge researchers often face when analyzing data. Currently, those working locally are confined to a couple of options:

Attempt to download the entire dataset locally
Work on a small subset to reflect the entire dataset

Other options include cloud services such as Amazon Web Services (AWS). At Rune Labs, we have built our infrastructure to support batch processing as a way to parallelize analyses efficiently across an entire dataset. This lets researchers use the full dataset and draw better-supported conclusions from the data.

Attempt to Download the Entire Dataset Locally

Downloading the entire dataset locally allows researchers to work with the full set of data to ensure analyses are as accurate as possible, but it comes with its challenges. Downloading can take anywhere from several hours to days, which causes delays in project plans. Not only does downloading take time, but each analysis step in the workflow can take several hours as well.

Work on a Small Subset to Reflect the Entire Dataset

Working on a small subset of data also comes with its challenges. Though you can get your results a lot faster than working with the entire dataset, it is imperative to have a deep understanding of the complete dataset to make informed decisions about which subset to use. Even then, the complete analysis of a subset of data may not fully reflect the entire dataset, especially when one is building models.

Solution: Batch Processing

Batch processing allows data scientists and researchers to perform their data analyses at scale. For our use case, we decided to use AWS Batch, which enables analyses to run in parallel. Instead of a single step in a researcher’s workflow running over the entire dataset sequentially, the dataset can be broken up into subsets (or batches), and the same analysis runs on each subset at the same time.
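As a minimal sketch of the splitting step, the function below divides a list of records into a chosen number of near-equal batches. The function name and the contiguous splitting strategy are assumptions made for illustration; the post does not describe how Rune Labs partitions its data.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(records: Sequence[T], n_batches: int) -> List[List[T]]:
    """Split records into n_batches contiguous batches of near-equal size."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    # Every batch gets `base` records; the first `extra` batches get one more.
    base, extra = divmod(len(records), n_batches)
    batches = []
    start = 0
    for i in range(n_batches):
        size = base + (1 if i < extra else 0)
        batches.append(list(records[start:start + size]))
        start += size
    return batches


print(split_into_batches(list(range(10)), 3))
# [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Each batch can then be handed to its own worker, and the union of the batches is exactly the original dataset.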

In addition to parallel processing, we take advantage of AWS cloud services to improve compute power, compute time, and security without sacrificing anything on our local machines.

What does this look like?

Imagine a step in a researcher’s workflow that requires computing the power spectral density (PSD). What does this process look like with and without batch processing?

Without batch processing, we have to download the entire dataset locally and run the analysis on each row, one at a time. With batch processing, the dataset is broken up into subsets and the PSD computations run on every subset simultaneously, which shortens the processing time.
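The contrast can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Rune Labs pipeline: the PSD estimate is a plain one-sided periodogram written with the standard library only, and a local thread pool stands in for the separate machines that AWS Batch would provide. In CPython, threads will not actually speed up this CPU-bound loop; the point is the shape of the workflow, in which both paths produce the same results.

```python
import cmath
import math
from concurrent.futures import ThreadPoolExecutor


def periodogram(signal, sample_rate):
    """One-sided periodogram PSD estimate using a naive DFT (O(n^2))."""
    n = len(signal)
    psd = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)
        )
        power = abs(coeff) ** 2 / (sample_rate * n)
        # Fold negative frequencies into every bin except DC and Nyquist.
        if 0 < k < n / 2:
            power *= 2
        psd.append(power)
    return psd


def psd_for_batch(batch, sample_rate):
    """Run the same analysis on every recording in one batch."""
    return [periodogram(recording, sample_rate) for recording in batch]


def psd_sequential(recordings, sample_rate):
    """No batching: one worker walks the whole dataset, row by row."""
    return psd_for_batch(recordings, sample_rate)


def psd_batched(batches, sample_rate):
    """Batching: one worker per batch, results gathered in batch order."""
    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        per_batch = pool.map(lambda b: psd_for_batch(b, sample_rate), batches)
        return [psd for batch_result in per_batch for psd in batch_result]


def make_tone(freq_hz, sample_rate=16, n_samples=16):
    """Hypothetical test signal: a pure sine wave."""
    return [
        math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n_samples)
    ]


recordings = [make_tone(f) for f in (1, 2, 3, 4)]
batches = [recordings[:2], recordings[2:]]
assert psd_batched(batches, 16) == psd_sequential(recordings, 16)
```

With 16 samples at 16 Hz, bin k corresponds to k Hz, so each tone's PSD peaks at the bin matching its frequency.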

In the image above, 4 of the 5 parallel child jobs have completed successfully. The fifth child job has started and is still in progress.

Why is this Useful?

There are a few reasons why batch processing is an effective part of a researcher’s workflow for data exploration and analysis:

Helps manage large datasets
Parallelizing analyses across data subsets reduces the time to compute PSD and lets the researcher analyze data more efficiently
Automation: the computations and analyses happen in the background, allowing the user to move on to the next step of the analysis

Practical Case

This case study walks you through an example of computing PSD.

1. A user creates a function that computes PSD for a set of data
2. The user starts a parent batch job in AWS Batch
3. AWS Batch creates several parallel child jobs of this parent job that all run the same computation, in this case computing PSD
4. As each child job finishes, it pushes results to AWS S3
5. Upon completion of the entire job, the user can pull the computed results from S3 to see the updated DataFrame where each record has PSD computed

Note: Steps 3 and 4 are done passively in the background and completely automated—they do not require the user to monitor activities.
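One way to express this flow with the AWS SDK for Python (boto3) is an AWS Batch array job, where a single `submit_job` call with `arrayProperties` fans out into child jobs and each child reads its index from the `AWS_BATCH_JOB_ARRAY_INDEX` environment variable. The sketch below assumes that setup; the post does not say whether array jobs are what Rune Labs uses, and the helper names, queue, job definition, bucket, and key are all hypothetical. The boto3 imports are deferred so the pure helpers can be tried without AWS credentials.

```python
import os


def build_array_job_request(job_name, job_queue, job_definition, n_children):
    """Build keyword arguments for the Batch submit_job call (step 2)."""
    # AWS Batch accepts array sizes from 2 to 10,000.
    if not 2 <= n_children <= 10000:
        raise ValueError("n_children must be between 2 and 10,000")
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        "arrayProperties": {"size": n_children},
    }


def submit_parent_job(request):
    """Submit the parent job; AWS Batch then creates the child jobs (step 3)."""
    import boto3  # third-party; needs AWS credentials at call time

    response = boto3.client("batch").submit_job(**request)
    return response["jobId"]


def records_for_this_child(records, n_children, env=os.environ):
    """Inside a child job: pick the records this child is responsible for."""
    index = int(env.get("AWS_BATCH_JOB_ARRAY_INDEX", "0"))
    # Strided split (an assumption): child i takes records i, i + n, i + 2n, ...
    return records[index::n_children]


def upload_results(bucket, key, body):
    """Inside a child job: push serialized results to S3 (step 4)."""
    import boto3  # third-party; needs AWS credentials at call time

    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)


# Hypothetical names, for illustration only.
request = build_array_job_request("psd-job", "example-queue", "example-psd-def", 5)
print(request["arrayProperties"])  # {'size': 5}
```

Because every child runs the same container and command, the array index is the only thing that tells one child's slice of the data apart from another's.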

In addition, we have added version control to the workflow by taking advantage of the JobID provided by AWS Batch. Each job’s results can be sorted by this ID to keep a history of results.
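A simple way to get this kind of history is to put the job ID into the S3 object key, so each run's results live under their own prefix. The sketch below assumes a hypothetical key layout (`psd-results/<job id>/child-<index>.parquet`); the actual layout used at Rune Labs is not described in this post.

```python
from collections import defaultdict


def result_key(job_id, child_index, prefix="psd-results"):
    """Build an S3 key that records which job and child produced a result."""
    return f"{prefix}/{job_id}/child-{child_index:05d}.parquet"


def group_keys_by_job(keys):
    """Group result keys by the job ID embedded in each key."""
    history = defaultdict(list)
    for key in keys:
        job_id = key.split("/")[-2]  # second-to-last path segment
        history[job_id].append(key)
    return {job_id: sorted(job_keys) for job_id, job_keys in history.items()}


keys = [result_key("job-b", 1), result_key("job-a", 0), result_key("job-b", 0)]
print(group_keys_by_job(keys)["job-b"])
# ['psd-results/job-b/child-00000.parquet', 'psd-results/job-b/child-00001.parquet']
```

Rerunning the analysis then never overwrites earlier results, because a new job ID produces a new prefix.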

Conclusion

At Rune Labs, our goal is to improve treatment options and outcomes for people with Parkinson’s disease. We enable clinicians and researchers to develop and deliver precision neurology by making brain data useful at scale. Our team uses engineering approaches like batch processing not only to shorten analysis time, but also to help us discover and push results forward for our partners.

