Optimizing ETL Pipelines at Scale: Lessons from PySpark and Airflow Integration


Sruthi Erra Hareram

Abstract

This article examines a scalable extract, transform, load (ETL) pipeline architecture built on the integration of PySpark and Apache Airflow. The system addresses the challenges of processing petabyte-scale datasets while maintaining reliability, observability, and performance. The integration creates a powerful abstraction layer that separates orchestration from execution, increasing stability through modular dependency management. Performance optimization techniques, including strategic caching, checkpointing, and dynamic resource allocation, substantially improve processing efficiency and fault tolerance. Idempotent task design and multi-level error handling allow the pipeline to recover from failures without manual intervention. Cloud-native integration, particularly with Google Cloud Composer, provides scalability and observability through ephemeral cluster patterns and comprehensive monitoring capabilities. The architectural patterns and optimization techniques presented enable data pipelines to scale effectively in the face of the challenges of distributed data processing.
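For concreteness, the sketch below shows one common way the orchestration/execution split and the ephemeral-cluster pattern described above can be expressed as an Airflow DAG running on Google Cloud Composer: Airflow only creates a short-lived Dataproc cluster, submits the PySpark job to it, and tears it down, while all data processing happens in the job script. The project ID, bucket paths, cluster sizing, and job script name are illustrative assumptions, not details taken from the article.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.utils.trigger_rule import TriggerRule

PROJECT_ID = "example-project"          # assumed project
REGION = "us-central1"                  # assumed region
CLUSTER_NAME = "etl-{{ ds_nodash }}"    # one short-lived cluster per daily run

# The PySpark job itself lives in a separate script; Airflow only orchestrates it.
PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": "gs://example-bucket/jobs/transform_events.py",  # assumed script
        # Passing the run date lets the job overwrite only its own output
        # partition, which keeps reruns of the same date idempotent.
        "args": ["--run-date", "{{ ds }}"],
    },
}

with DAG(
    dag_id="daily_events_etl",          # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    )

    run_transform = DataprocSubmitJobOperator(
        task_id="run_transform",
        project_id=PROJECT_ID,
        region=REGION,
        job=PYSPARK_JOB,
    )

    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        # Tear the cluster down even if the PySpark job fails.
        trigger_rule=TriggerRule.ALL_DONE,
    )

    create_cluster >> run_transform >> delete_cluster
```

In this pattern the delete step runs under TriggerRule.ALL_DONE so the ephemeral cluster is removed regardless of job outcome, and a failed run can simply be retried or cleared because the date-partitioned overwrite leaves no partial state behind.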
