Shashank Chandavarkar
Software Development Engineer I
Shashank is a software developer from India living in Bangalore. He earned his Bachelor's degree in Computer Science and Engineering from JSS University. He enjoys writing code, learning new technologies, solving problems, playing sports, trekking, and exploring new places.
Shashank has worked at Sixt since 2022. He started as an intern on the BranchOps team and continued as a developer on the BranchOperations Checkin team.
Posts by Shashank Chandavarkar:
Dask: A parallel data processing python library for large datasets
While conducting data analytics, we often use Pandas to perform specific operations on the data in order to extract valuable insights. Initially, when working on data manipulation, I approached it as a data structure problem and did not make use of any built-in Pandas functions. Later, as I delved deeper into Pandas and explored its functions, I discovered that they were significantly faster than manually iterating over the DataFrame (or using the Pandas apply function, which essentially iterates over an axis and applies a function) and performing operations on individual rows. Curious about why these built-in functions were faster, I did some research and found that Pandas uses NumPy under the hood, which contributes to its speed. We can convert our DataFrame columns to NumPy vectors and perform mathematical operations on those vectors if we want our code to be fast. However, writing code for these vector calculations becomes significantly harder when many operations are involved, and sometimes plain Python functions are easier and faster to implement.
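As a minimal sketch of the difference described above (the column names and data here are made up for illustration), the same computation can be written as a row-wise apply or as a vectorized NumPy-backed expression:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a table of prices and quantities.
df = pd.DataFrame({
    "price": np.random.rand(100_000) * 100,
    "quantity": np.random.randint(1, 10, size=100_000),
})

# Row-wise apply: calls a Python function once per row, so it is slow.
total_apply = df.apply(lambda row: row["price"] * row["quantity"], axis=1)

# Vectorized: operates on whole NumPy arrays at once and is typically
# orders of magnitude faster for large frames.
total_vectorized = df["price"] * df["quantity"]

# Both produce the same result.
assert np.allclose(total_apply.to_numpy(), total_vectorized.to_numpy())
```

The vectorized form is faster here because the multiplication runs in compiled NumPy code rather than in a per-row Python loop.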
In a specific use case involving a large DataFrame, I had to iterate over it and perform operations, which significantly slowed down my code. Recognizing the need for optimization, I began exploring ways to make the iteration (or the apply function) faster. Numerous alternatives were available, but one was notably simple and easy to use and understand: a library called Dask. Dask parallelizes the process by breaking the DataFrame into multiple partitions and performing operations on them concurrently.