An Ensemble Predictive Model Based Prototype for Student Drop-out in Secondary Schools


INTRODUCTION
Student drop-out continues to be a serious problem despite the fact that education has been a national priority for successive Tanzanian governments since independence (Wizara ya Elimu na Mafunzo ya Ufundi, 2014). This problem affects the progress of both individuals and society (Kim and Kim, 2018). A total of 5.1 million children between the ages of 7 and 17 are estimated to be out of school at the lower secondary level (Human Rights Watch, 2017). For many children, education ends after primary school; only three out of five Tanzanian adolescents or 52% of the eligible school population have been enrolled in lower-secondary education, and fewer complete secondary education (Human Rights Watch, 2017). The implications of finding and implementing solutions to the drop-out problem go beyond the individual benefits for students. Furthermore, investing in future progress and better standards of living with multiplier effects requires enabling students to complete their education. Therefore, efforts to improve this situation demand a thorough understanding of the extent, reasons, circumstances and the response to policies that led to the student drop-out problem.
In response to the drop-out problem and other challenges that secondary schools are facing, the government of Tanzania introduced an Education Training Policy (ETP) and Education Sector Development Plan (ESDP) (TAMISEMI, 2004). The aim was to place emphasis on the quality of education and improve access to secondary education. These goals are in line with the Sustainable Development Goals (SDGs), a United Nations initiative that sets a target for all countries to offer free, equitable and quality primary and secondary education to children by 2030 (Truta et al., 2018). The goals are also in line with Tanzania's international and regional human rights obligations to realize the right to primary and secondary education for all (Wizara ya Elimu na Mafunzo ya Ufundi, 2014). Despite combined efforts to improve the status of secondary school education through better access, capacity development, quality and direct funding of secondary schools, the student drop-out problem still persists.
Recently, machine learning technologies have gained much attention in the fight against the school drop-out problem (Elbadrawy et al., 2016; Xu et al., 2017). The use of these advanced technologies can potentially facilitate the identification of at-risk students and enable timely planning for interventions (Fei and Yeung, 2015). However, most of the existing studies have focused only on developing predictive models without including a mechanism to assist interpretation of machine learning results (Aulck et al., 2016; Hung et al., 2017; Liang et al., 2016; Santana et al., 2015). Taking advantage of an increase in the number of Internet users, which is about 23 million people, almost 45% of the Tanzanian population (Maginga et al., 2018), this study intends to develop an ensemble predictive model based prototype to enable authorities to identify at-risk students and schools for early intervention. The study uses both student and school level datasets from a developing country to address the problem with consideration of the local context. The prototype requires an Internet connection to support the flow of information between the interface and the server side. The specific focus was to come up with a prototype that allows a user to input student features with high contribution to the drop-out prediction, based on the feature engineering experiment conducted. The developed web-based prototype, which can automatically recognize students with a high probability of dropping out, has been constructed by implementing an ensemble algorithm (Mduma et al., 2019b). Furthermore, the system was integrated with a visualization module to highlight schools with high drop-out rates in order to help the authorities focus on school needs during planning and budgeting processes. The prototype was intended to support interpretability of machine learning results using a simple approach that can be understood by users with no knowledge of machine learning.

RELATED WORK
Machine learning approaches have been used for educational purposes, including developing a system for the early identification of students at risk of dropping out (Berens et al., 2018). An Early Detection System (EDS) for predicting student success in tertiary education as a basis for targeted intervention was developed. Regression analysis, neural networks, decision trees and the AdaBoost algorithm were used to identify student characteristics that distinguish potential dropouts from graduates. The developed method was then implemented in every German university. This method uses student demographic and performance data which are collected and maintained by legal mandate.
Similarly, a mobile academic performance prediction system was developed with the aim of predicting students that require early intervention (Mgala, 2016). The study used datasets of primary schools collected in Kenya. Logistic regression, Multilayer perceptron, Sequential minimal optimization algorithm (SMO), Bayesian network classifiers, Naive Bayes classifier, Lazy learners, Random forest classifier and J48 algorithm were used to build the model. However, a simple Logistic regression classifier achieved the best results. Therefore, it was used in the implementation of this mobile system.
Another study outlined an extensive framework that uses machine learning approaches to identify students who are at risk of not finishing high school on time (Lakkaraju et al., 2015). The study was done in the United States and it aimed at giving both students and schools hands-on tools based on their needs, and to assist schools in identifying and prioritizing students that are at risk of adverse academic outcomes.
In another study, a survival analysis based framework was developed to identify at-risk students (Ameri et al., 2016). A Time-dependent Cox (TD-Cox) model was applied to capture time-varying factors and to leverage this information to provide more accurate prediction of student drop-out. The framework was proposed to predict which students were likely to drop out, including the semester when the drop-out was expected to occur. This method was evaluated on real student data collected at Wayne State University.
Another example involved the use of student data gathered from the University of Barcelona (UB) to implement visualization tools for predicting academic grades and student drop-out (Rovira et al., 2017). The developed tools allowed interpretation of drop-out prediction errors based on the grade distribution.
Furthermore, one study developed a deep learning based prototype system for automated eye gaze following, which estimated where each person in a classroom was looking (Aung et al., 2018). The study aimed at helping teachers to pay attention to the right thing or to the right students within classrooms. Since the focus was on classroom observation videos, a dataset of publicly available classroom sessions from YouTube videos was collected.
Our study is based on the earlier works done in the educational field as presented in this section. However, in this study both student and school-level datasets from a developing country were used to reflect the local context. As has been observed, many existing studies focused only on student-level datasets and did not consider school-level datasets for addressing this problem (Mduma et al., 2019a). Logistic Regression, Multilayer Perceptron, Random Forest and K-Nearest-Neighbors were used to build the model. The results showed that Logistic Regression and Multi-Layer Perceptron achieved the highest performance. Furthermore, hyper-parameter tuning was performed to improve the predictive power of the well-performing models, and an ensemble classifier developed by soft combining the best performing models attained the best performance (Mduma et al., 2019b). We therefore implemented an ensemble predictive model for this prototype.

MATERIALS AND METHODS
The development of this system followed a prototyping software development approach. This approach was created to receive feedback from users for refining the final product (Nacheva, 2017; Yu, 2018). It covers the analysis, design and implementation phases so as to develop a simplified version of the system and give it to users for evaluation and feedback (Iqbal, 2017). The prototype was then improved following feedback from the users. The improved prototype was given back to the users for further evaluation, and the cycle continued until the users were satisfied with the final prototype, as shown in Figure 1.
Since the system was designed primarily to help educational stakeholders in identifying at-risk students and schools, education officers, parents, teachers and information systems development experts were involved in the process of prototype development. Education stakeholders from five selected districts were involved in the focus group discussion during data collection. The technical feedback from information systems development experts was used to improve the prototype.
The study includes both functional and non-functional requirements. The functional requirements indicate what a user needs from the system, while the non-functional requirements refer to the system architecture (Alsaleh and Haron, 2016). The functional requirements for developing this prototype cover:
• The issue of predicting whether a student will drop out or not.
• The use of features with high contribution to the drop-out prediction.
• The use of the best classifier: an ensemble algorithm developed by soft combining the tuned Logistic Regression and Multi-Layer Perceptron models.
• The issue of visualizing school drop-out.
The non-functional requirements of the system cover the issues of:

Datasets Description
There is a dearth of studies focused on addressing student drop-out using machine learning in developing countries (Mduma et al., 2019a), and publicly available datasets addressing this problem are difficult to find. This study used datasets from Tanzania, which reflect the context of a specific developing country. We used the Uwezo data 1 collected in 2015 at the country level to develop an ensemble predictive model. This student-level dataset, collected by Twaweza, was assembled with the aim of evaluating children's learning levels across hundreds of thousands of households in East Africa. The dataset consists of 61,340 samples of student records and 18 features: •  (Basu et al., 2019). Furthermore, several approaches for handling numeric values, missing values, and outliers were identified (Shahul et al., 2016). In this study, Principal Component Analysis (PCA) was performed with the purpose of reducing the number of dimensions without losing too much information (Jiang et al., 2016).
A school-level dataset collected by the President's Office, Regional Administration and Local Government (PORALG) in Tanzania was integrated with publicly available data accessed through the Government Open Data Portal 2 to support visualization. The dataset consists of 11 features: • Region

Model Development and Proposed Solution
The model was formulated after a comprehensive analysis of widely used machine learning algorithms representing linear, neural network, ensemble and instance-based models. Since data imbalance was observed during the pre-processing stage, the Synthetic Minority Oversampling Technique combined with Edited Nearest Neighbor (SMOTE-ENN) was applied to handle the problem. The dataset was split into training (60%), validation (20%) and testing (20%) sets. The sampling approach was applied only to the training set. The model was built using the training and validation sets, and evaluation was done using an unseen test set in order to observe model behavior in a real environment, which is imbalanced. The overall experimental procedure is summarized in Figure 3. As shown in the architecture diagram in Figure 4, the prototype interface was linked to the server via the Internet. The developed prototype on the client side allows input of the student's information, comprising features with high contribution to the drop-out prediction. The features were selected using a feature engineering experiment. In data preparation for machine learning, this approach is conducted to construct suitable features for improving predictive performance (Nargesian et al., 2017; Naz et al., 2019). This was attained by evaluating the permutation feature importance score. The score measures the impact of an individual feature on model performance by permuting the values of that feature and evaluating how much the permutation decreases the model performance. The server contained an ensemble algorithm which was developed by soft combining the tuned Logistic Regression and Multi-Layer Perceptron models, earlier recognized as the best model. This model was then implemented in Python using Scikit-learn (Mitchell, 2015).
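The data split and the soft combination described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function names, the plain random shuffle (the paper does not say whether the split was stratified), the 0.5 decision threshold and the two stub models are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]
ProbaFn = Callable[[Row], float]  # returns P(drop-out) for one student record


def split_60_20_20(rows: List, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle records and split them into train (60%), validation (20%), test (rest)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def soft_vote(models: Sequence[ProbaFn], row: Row,
              threshold: float = 0.5) -> Tuple[float, int]:
    """Soft voting: average the models' drop-out probabilities, then threshold."""
    mean_proba = sum(model(row) for model in models) / len(models)
    return mean_proba, int(mean_proba >= threshold)


# Hypothetical stand-ins for the tuned Logistic Regression and MLP models;
# real models would compute these probabilities from the student features.
def logistic_stub(row: Row) -> float:
    return 0.70


def mlp_stub(row: Row) -> float:
    return 0.40


if __name__ == "__main__":
    train, val, test = split_60_20_20(list(range(100)))
    print(len(train), len(val), len(test))
    print(soft_vote([logistic_stub, mlp_stub], [1.0, 0.0]))
```

In scikit-learn, the same averaging of `predict_proba` outputs is available through `VotingClassifier` with `voting="soft"`. Any resampling step such as SMOTE-ENN would be applied to the training portion only, as the text states.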
The server interface used the Flask framework. Flask was preferred in this study due to its popularity and its ability to keep the core functionality simple but extensible in terms of development. It also saves the time needed to build web applications (Armash et al., 2015). The developed system transferred a student's information entered through the system interface via the Internet to an ensemble algorithm on the server. On the server, the deployed model predicts the result for this new entry. The result is then transferred via the Internet to the prototype interface. The Flask web server facilitated the transfer of the record to the server and of the result from the server to the system interface. For this prototype, a Heroku server was used as the hosting platform to support deployment of the developed system.
Figure 4. Diagram of the system's architecture
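The request and response flow between the interface and the server can be sketched as a plain function that a web route (for example a Flask view) might delegate to. This is only an illustration under stated assumptions: the function name, the JSON field encoding (all six inputs sent as numbers), the reply keys and the 0.5 threshold are hypothetical and are not taken from the prototype.

```python
import json
from typing import Callable, Dict, Tuple

# Feature codes as reported in the paper's feature engineering results.
REQUIRED_FIELDS = ("Sex", "PCCB", "MLPD", "SPB", "PTD", "Age")


def handle_prediction_request(
    body: str,
    predict_proba: Callable[[Dict[str, float]], float],
) -> Tuple[int, str]:
    """Validate a JSON request body, score it, and return (HTTP status, JSON reply)."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "body is not valid JSON"})
    if not isinstance(payload, dict):
        return 400, json.dumps({"error": "expected a JSON object"})

    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        return 400, json.dumps({"error": "missing fields", "fields": missing})

    try:
        features = {name: float(payload[name]) for name in REQUIRED_FIELDS}
    except (TypeError, ValueError):
        return 400, json.dumps({"error": "all fields must be numeric"})

    # predict_proba stands in for the deployed ensemble model on the server.
    proba = predict_proba(features)
    return 200, json.dumps({"dropout_probability": proba,
                            "at_risk": proba >= 0.5})
```

Keeping validation and scoring in a framework-independent function makes the logic testable without starting a web server; the route handler then only needs to pass the request body in and send the status and reply back.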

Feature Engineering Results
The results shown in Figure 5 indicate that Student gender (Sex), Parent who checks his/her child's exercise book once a week (PCCB), Household meals per day (MLPD), Student who read a book with his/her parent in the last week (SPB), Parent who discussed his/her child's progress with the teacher last term (PTD) and Student age (Age) contribute highly to the drop-out prediction performance. These features were included in the developed prototype to serve as inputs for student information.
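The permutation-based score used to rank these features can be sketched as follows. This is a simplified illustration, not the experiment reported here: accuracy is assumed as the performance metric, and the function names, number of repeats and seed are hypothetical choices.

```python
import random
from typing import Callable, List, Sequence

Row = Sequence[float]


def accuracy(predict: Callable[[Row], int], X: List[Row], y: List[int]) -> float:
    """Fraction of records whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict: Callable[[Row], int], X: List[Row],
                           y: List[int], n_repeats: int = 10,
                           seed: int = 0) -> List[float]:
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [list(row[:j]) + [value] + list(row[j + 1:])
                      for row, value in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature the model ignores yields a score of zero, because shuffling it leaves every prediction unchanged; a feature the model relies on yields a positive score. scikit-learn provides a comparable routine as `sklearn.inspection.permutation_importance`.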

Drop-out Prediction Interface
The interface allows the system to connect and exchange information by acting as the bridge between a user and the system (Iftikhar et al., 2018). The drop-out prediction module allows users to input student information as shown in Figure 6, and a prediction is given based on the provided information. The system then provides prediction results indicating whether a given student will drop out or not. This module was developed to assist parents and teachers in identifying at-risk students who are most in need of help.

Drop-out Visualization Interface
Visualization has been recognized as an important approach to understanding data (Xin et al., 2018). This technique has been used to support interpretation of machine learning results. The school-level dataset was visualized to highlight school drop-out within selected districts as shown in Figure 7. The intention was to assist education stakeholders in identifying at-risk schools in order to provide resources based on each school's needs.
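One way to prepare school-level records for such a view is to compute a drop-out rate per school and flag the schools above a chosen level. The sketch below is purely illustrative: the record fields (`district`, `school`, `enrolled`, `dropped_out`), the 10% threshold and the function names are assumptions, since the full feature list of the school-level dataset is not given here.

```python
from typing import Dict, List, NamedTuple, Tuple


class SchoolRecord(NamedTuple):
    district: str
    school: str
    enrolled: int      # hypothetical field: students enrolled
    dropped_out: int   # hypothetical field: students who dropped out


def dropout_rates(records: List[SchoolRecord]) -> Dict[Tuple[str, str], float]:
    """Return {(district, school): drop-out rate}, skipping schools with no enrolment."""
    return {(r.district, r.school): r.dropped_out / r.enrolled
            for r in records if r.enrolled > 0}


def flag_high_dropout(records: List[SchoolRecord],
                      threshold: float = 0.10) -> List[Tuple[Tuple[str, str], float]]:
    """List schools whose rate meets the threshold, highest rate first."""
    flagged = [(key, rate) for key, rate in dropout_rates(records).items()
               if rate >= threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

The sorted output can be passed directly to a charting or mapping layer, so that the schools with the highest rates appear first in the visualization.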

DISCUSSION AND CONCLUSION
An ensemble predictive model based prototype has been developed to predict student drop-out, as described in this study. The developed system, whose requirement specifications were described in this paper, helps in the identification of at-risk students for early intervention. By taking advantage of Internet penetration within the country and the use of machine learning technology, this system will directly benefit education stakeholders in identifying at-risk students and schools. Authorities will be able to use the developed system to facilitate the planning and budgeting process in order to meet school needs based on the requirements. The development of this system considered users both with and without basic computer skills.
Several studies in developed countries have applied machine learning techniques to tackle student drop-out (Elbadrawy et al., 2016; Fei and Yeung, 2015; Xu et al., 2017). However, few studies focused on developing prototypes to assist education stakeholders in the interpretation of machine learning results (Aung et al., 2018; Rovira et al., 2017; Berens et al., 2018). A mobile-based tool was developed in a developing country to fight the drop-out problem; however, the study focused only on a student-level dataset which is not publicly available (Mgala, 2016). Furthermore, the feature engineering results, which show that student gender contributes highly to the drop-out prediction, support researchers' findings on the association between drop-out rate and gender (Kim and Kim, 2018). Therefore, focus should be directed not only at developing predictive models to address the problem but also at providing room for intended users to interact with the developed approach. This can be achieved by implementing the developed models in systems that are easy to understand. Furthermore, evaluation of the developed systems must be taken into consideration to ensure that the systems address the users' needs. Additionally, cost and time limitations should be considered when generating new datasets to be used in addressing this problem. This can be achieved by emphasizing the identification of available datasets, as done in this study, in order to attract other researchers in the education field to provide solutions needed to address the student drop-out problem.
This paper presents an ensemble predictive model based prototype to help education stakeholders in the early detection of student drop-out in Tanzania. An ensemble classifier obtained by soft combining the tuned Logistic Regression and Multi-Layer Perceptron was implemented in this prototype. Six features with significant contributions to the drop-out prediction were used as inputs for student information. Furthermore, the prototype was integrated with a visualization module to facilitate interpretation of machine learning results. In particular, the developed system predicted whether a given student will drop out or not and visualized schools with high drop-out risks. Therefore, this study is limited to identifying at-risk students and schools using a web-based approach. Inclusion of other components such as ranking and forecasting mechanisms will be an added advantage in facilitating a more robust and comprehensive early warning system for student drop-out.
Publicly available datasets have been identified to provide room for other researchers in the field of education to apply different approaches to solve the problem of student drop-out. This work demonstrates the value of machine learning approaches in addressing drop-out prediction. The study complements previous research done in developed countries using datasets from those countries. Regarding educational implications, the developed system can be extremely useful for education stakeholders, who will be able to recognize earlier which students and schools need help. This information will assist them in providing early intervention. Future work will focus on evaluating the performance of the developed system and developing a mobile application based on the developed prototype.