Analyzing Performance and Accuracy in Complaint Classification using Spark MLlib
Abstract
The rapid growth of customer complaint data creates a need for scalable, efficient complaint classification methods to support Customer Relationship Management (CRM). This work addresses the problem of correctly classifying unstructured customer complaints using Apache Spark and its machine learning library, MLlib. A multi-stage PySpark pipeline classified the text of Amazon product reviews into two classes, "Highly Dissatisfied" and "Mildly Dissatisfied". Three widely used classification algorithms (Naive Bayes, Logistic Regression, and RandomForestClassifier) were evaluated on a set of metrics comprising accuracy, macro F1-score, weighted recall, and weighted precision. Our experiments show that, while RandomForestClassifier produced the highest accuracy under otherwise equal conditions, Naive Bayes delivered the most balanced performance, with the best macro F1-score (0.6884) and the highest weighted precision (0.7022). This trade-off makes Naive Bayes the best suited of the three models for practical deployment. We find that, for this specific classification task, Naive Bayes is the most efficient of the evaluated algorithms for consistently and correctly classifying customer complaints at large scale.
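To make the reported metrics concrete, the following minimal sketch computes accuracy, macro F1-score, weighted precision, and weighted recall for a two-class problem in plain Python. It is an illustration of the standard definitions only, not the authors' PySpark code: the function names (`per_class_counts`, `evaluate`) and the ten toy labels are assumptions made up for this example and do not come from the study's data.

```python
from collections import Counter


def per_class_counts(y_true, y_pred, labels):
    """Return {label: (tp, fp, fn)} counted one class at a time."""
    counts = {}
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        counts[label] = (tp, fp, fn)
    return counts


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(y_true, y_pred):
    """Compute accuracy, macro F1, weighted precision and weighted recall."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)

    precision, recall, f1 = {}, {}, {}
    for label, (tp, fp, fn) in per_class_counts(y_true, y_pred, labels).items():
        precision[label] = safe_div(tp, tp + fp)
        recall[label] = safe_div(tp, tp + fn)
        f1[label] = safe_div(
            2 * precision[label] * recall[label],
            precision[label] + recall[label],
        )

    return {
        "accuracy": sum(1 for t, p in zip(y_true, y_pred) if t == p) / total,
        # Macro average: every class counts equally, regardless of its size.
        "macro_f1": sum(f1.values()) / len(labels),
        # Weighted average: each class is weighted by its share of true examples.
        "weighted_precision": sum(precision[l] * support[l] for l in labels) / total,
        "weighted_recall": sum(recall[l] * support[l] for l in labels) / total,
    }


# Toy labels (hypothetical, for illustration only).
H, M = "Highly Dissatisfied", "Mildly Dissatisfied"
y_true = [H, H, H, H, M, M, M, M, M, M]
y_pred = [H, H, H, M, M, M, M, M, H, H]

metrics = evaluate(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the macro F1-score averages per-class F1 values without regard to class size, a model that does well on the minority class can lead on macro F1 while another model leads on accuracy, which is the kind of trade-off described above.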