Automated Data Quality Validation Frameworks for ETL Pipelines

Subash Yadav

Abstract

In the modern data-driven environment, ensuring the quality of data handled by ETL (Extract, Transform, Load) pipelines is essential to informed decision-making across industries. This paper introduces a framework for automating data quality validation in ETL pipelines using anomaly detection and machine learning models. Specifically, it combines Isolation Forest (real-time anomaly detection) with Random Forest (supervised validation) to detect quality issues such as missing data, outliers, and schema violations. The framework was applied to World Bank data and achieved notable improvements in data accuracy, processing speed, and error detection compared with conventional manual validation techniques. The Isolation Forest model attained a precision of 0.92, a recall of 0.88, and an AUC-ROC of 0.94, while the Random Forest model attained a recall of 0.89 and an accuracy of 92%. The framework reduces manual intervention and processing time by more than 60 percent and improves data accuracy by 7 percent. These results highlight the framework's potential in real-time data settings, where high-quality data is essential to operational and strategic decision-making in fields such as finance, healthcare, and e-commerce.
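To make the two-stage idea concrete, the following is a minimal sketch of how Isolation Forest (unsupervised anomaly flagging) and Random Forest (supervised validation) could be combined on a batch of records. The synthetic data, column count, contamination rate, and quarantine rule are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of a two-stage validation pass, assuming scikit-learn.
# All data and thresholds below are made up for illustration.
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic "ETL batch": mostly well-behaved numeric records plus a few
# injected outliers shifted far from the bulk of the distribution.
clean = rng.normal(loc=100.0, scale=5.0, size=(200, 3))
outliers = rng.normal(loc=100.0, scale=5.0, size=(5, 3)) + 60.0
batch = np.vstack([clean, outliers])

# Stage 1: Isolation Forest flags anomalous rows in the incoming batch
# without labels; predict() returns 1 for normal, -1 for anomaly.
iso = IsolationForest(contamination=0.03, random_state=0).fit(batch)
iso_flags = iso.predict(batch)

# Stage 2: Random Forest validates records against labeled examples.
# Here the labels come from the known construction of the synthetic batch;
# in practice they would come from historically validated records.
labels = np.array([0] * len(clean) + [1] * len(outliers))  # 1 = bad record
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(batch, labels)
rf_flags = rf.predict(batch)

# A record is quarantined for review if either model flags it.
quarantine = (iso_flags == -1) | (rf_flags == 1)
print(f"flagged {quarantine.sum()} of {len(batch)} records")
```

In a production pipeline the two models would typically be trained on historical data and applied to each new batch, with quarantined records routed to a review queue rather than loaded downstream.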
