Topic Driven Text Extraction for Kannada Document Summarization Using LDA
Abstract
Automatic Text Summarization (ATS) compacts source content into a concise format while preserving core information. While extensively studied for resource-rich languages, ATS remains challenging for low-resource languages like Kannada due to limited corpora and NLP tools. This work introduces an extractive, topic-driven method for summarizing Kannada news articles from multiple documents. We developed a custom dataset of 100 Kannada news story sets (3 articles per set) to address the lack of standardized benchmarks. The proposed approach leverages Latent Dirichlet Allocation (LDA) to identify latent themes across documents, followed by sentence selection using vector-space modeling. Sentences are scored by their relevance to the identified topics (via cosine similarity) and prioritized to maximize informational value while minimizing redundancy through Maximum Marginal Relevance (MMR). Evaluations using ROUGE metrics demonstrate that the LDA-based method outperforms existing summarization algorithms, producing summaries closer to human-generated references. The system achieves higher F-scores (e.g., 0.68 at 40% compression) than baseline models such as TextRank and than approaches reported for other Indian languages, validating its efficacy for low-resource linguistic contexts.
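To make the pipeline concrete, the sketch below illustrates the general flow the abstract describes: infer topic distributions with LDA, score sentences by cosine similarity to the document-level topic profile, and select them greedily with MMR. It is a minimal illustration using scikit-learn; the function name summarize and the parameters n_topics, compression, and lambda_ are illustrative assumptions, not the paper's actual components or settings, and a proper Kannada tokenizer and stop-word list would normally replace the default tokenization.

```python
# Minimal sketch of a topic-driven extractive summarizer (assumed scikit-learn
# implementation; topic count, compression ratio, and MMR trade-off lambda_
# are illustrative, not the paper's reported settings).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity


def summarize(sentences, n_topics=5, compression=0.4, lambda_=0.7):
    """Select sentences whose topic profiles best cover the document set."""
    # Bag-of-words over sentences (a Kannada tokenizer / stop-word list
    # would normally be plugged in here).
    vec = CountVectorizer()
    counts = vec.fit_transform(sentences)

    # LDA infers latent topics; each sentence gets a topic distribution.
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    sent_topics = lda.fit_transform(counts)               # (n_sent, n_topics)
    doc_topics = sent_topics.mean(axis=0, keepdims=True)  # document-level profile

    # Relevance: cosine similarity of each sentence to the document profile.
    relevance = cosine_similarity(sent_topics, doc_topics).ravel()
    pairwise = cosine_similarity(sent_topics)              # for redundancy penalty

    # MMR: greedily pick sentences that are relevant but not redundant.
    k = max(1, int(compression * len(sentences)))
    selected, candidates = [], list(range(len(sentences)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max(pairwise[i][j] for j in selected) if selected else 0.0
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)

    return [sentences[i] for i in sorted(selected)]        # keep source order
```

In this reading, the compression parameter controls summary length (e.g., 0.4 for a 40% compression ratio) and lambda_ balances topic relevance against redundancy during MMR selection.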