AI/ML-Based Data Sensitivity Classification: A Technical Framework
Abstract
Increasingly complex information systems make it harder for organizations to identify and protect sensitive data. Conventional rule-based classification copes poorly with the varied schemas, ambiguous metadata, and dynamic data structures that typify modern enterprise environments. This article presents an automated framework that uses artificial intelligence and machine learning techniques to classify data sensitivity at scale. The framework extracts features along multiple dimensions: column names, descriptions, table context, data types, and semantic embeddings. Transformer-based models such as BERT and classical techniques such as TF-IDF convert textual metadata into representations that downstream models, including ensemble classifiers, use to make predictions. A multi-class architecture distinguishes Personally Identifiable Information (PII), Protected Health Information (PHI), and general sensitivity categories, and maps to regulatory requirements under GDPR, HIPAA, and CCPA. Rigorous evaluation using precision, recall, F1-score, and confusion matrix analysis verifies production-quality performance on the imbalanced datasets characteristic of enterprise data catalogs. The framework substantially reduces manual classification effort while maintaining high accuracy, which enables automated policy enforcement, faster compliance initiatives, and more mature data governance. Applications in the healthcare, financial services, and technology sectors have shown significant benefits: lower compliance risk, lower operational overhead, and improved data protection capabilities in support of digital transformation goals.
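As an illustration of the evaluation step named in the abstract, the following minimal Python sketch computes a confusion matrix and per-class precision, recall, and F1-score for a three-class sensitivity problem. The label names (`PII`, `PHI`, `GENERAL`), the helper functions, and the toy predictions are all hypothetical and chosen only for illustration; they are not taken from the framework's own implementation or results.

```python
from collections import Counter

# Hypothetical label set mirroring the categories named in the abstract.
LABELS = ["PII", "PHI", "GENERAL"]


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_metrics(matrix, labels):
    """Derive precision, recall, and F1 for each class from the matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fp = sum(row[i] for row in matrix) - tp  # predicted as label, but wrong
        fn = sum(matrix[i]) - tp                 # true label, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def macro_f1(metrics):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    return sum(m["f1"] for m in metrics.values()) / len(metrics)


# Toy, deliberately imbalanced example: 6 GENERAL, 3 PII, 1 PHI columns.
y_true = ["GENERAL"] * 6 + ["PII"] * 3 + ["PHI"]
y_pred = ["GENERAL"] * 5 + ["PII"] + ["PII"] * 2 + ["GENERAL"] + ["PHI"]

matrix = confusion_matrix(y_true, y_pred, LABELS)
metrics = per_class_metrics(matrix, LABELS)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print("confusion matrix:", matrix)
print("accuracy:", accuracy)
print("macro F1:", round(macro_f1(metrics), 4))
```

Reporting per-class and macro-averaged scores alongside accuracy is one common way to keep a dominant class (here `GENERAL`) from hiding weak performance on rarer sensitive classes.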