Scaling Compute-Intensive Tasks using Apache Spark
Apache Spark is a popular distributed computing platform used to process large datasets in parallel across a cluster.
This project evaluated PySpark's performance on distributed image processing tasks, including image transformations, data augmentation, and Convolutional Neural Network (CNN)-based feature extraction. Through experiments, we measured how the number of workers, the batch size, and the number of RDD partitions affect execution time, highlighting the importance of selecting these parameters carefully to get the best performance. A minimal sketch of this kind of experiment appears below.
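The following is an illustrative sketch, not the project's actual code: it distributes an image-transformation step over a PySpark RDD with a configurable number of partitions, then runs a batched, per-partition feature-extraction step via mapPartitions. The paths, the transform, and the feature extractor are hypothetical placeholders standing in for real image loading and a real CNN.

```python
import time
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("image-processing-benchmark").getOrCreate()
sc = spark.sparkContext

NUM_PARTITIONS = 16   # assumed tuning knob: number of RDD partitions
BATCH_SIZE = 32       # assumed tuning knob: images per inference batch

# Placeholder image paths; the real pipeline would list files from storage.
image_paths = [f"/data/images/img_{i}.jpg" for i in range(10_000)]

def load_and_transform(path):
    """Load one image and apply a simple augmentation (placeholder)."""
    # In the real pipeline this would use PIL/OpenCV to read and resize the file.
    img = np.zeros((224, 224, 3), dtype=np.float32)
    return np.flip(img, axis=1)  # e.g. horizontal flip as augmentation

def extract_features(partition):
    """Run feature extraction over one partition in fixed-size batches."""
    # In the real pipeline a pretrained CNN would be loaded once per partition.
    batch = []
    for img in partition:
        batch.append(img)
        if len(batch) == BATCH_SIZE:
            yield np.stack(batch).mean(axis=(1, 2, 3))  # stand-in for model(batch)
            batch = []
    if batch:
        yield np.stack(batch).mean(axis=(1, 2, 3))

# numSlices controls how many partitions (and hence parallel tasks) are used.
rdd = sc.parallelize(image_paths, numSlices=NUM_PARTITIONS)
features = rdd.map(load_and_transform).mapPartitions(extract_features)

# Time the full job for a given (workers, partitions, batch size) configuration.
start = time.time()
num_batches = features.count()
print(f"processed {num_batches} batches in {time.time() - start:.1f}s")

spark.stop()
```

Varying NUM_PARTITIONS, BATCH_SIZE, and the number of Spark workers while timing the job in this way is the general shape of the experiments summarized above.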
Our report and code are available here.
