Abstract(s)
Despite the proven superiority of data-driven approaches based on machine learning
techniques over survey-based methods, the use of machine learning in the
field of education is still in its infancy. To address this situation, this thesis aims to
predict academic achievement by presenting a machine learning framework that is
specifically designed to analyze the enormous amount of data provided by the public
administration. In detail, the key goals are: (i) apply data science and machine learning
methods in the context of academic achievement; (ii) conduct a study of academic
achievement that captures virtually the entire universe of public high school students, i.e., does not
rely on sample data; (iii) use predictive models to actively “flag” those students with a
greater likelihood of underperforming academically, thereby enabling an
appropriate educational response; and (iv) contribute to the development of the domain
by focusing on hypothetical novel quantitative approaches. In pursuing the above
objectives, the research uses a pioneering public initiative, MISI, that holds data
on high school systems and students’ academic achievement at the country level.
Sections 1 and 2, the Introduction and the Literature review, shape the research
framework accordingly.
The section 3 study uses an anonymized dataset for the 2014-15 school year from the
Directorate-General for Statistics of Education and Science of the Portuguese Ministry
of Education to compare the predictive power of the classic
multiple linear regression model with that of a chosen set of machine learning algorithms. The
multiple linear regression model is run in parallel with random forest, support vector
machine, artificial neural network, extreme gradient boosting machine, and stacking
ensemble implementations. The analysis is designed as a hybrid in which classical
statistical analysis and artificial intelligence algorithms are blended to strengthen the
ability to reach valuable conclusions and well-supported results. The machine learning
algorithms attain a higher level of predictive ability. In addition, the
appropriateness of stacking increases as the determinant of the correlation matrix of
the base learner outputs increases, and the empirical distributions of the random forest
feature importances are correlated with the structure of p-values and the statistical
significance tests of the multiple linear model. An information system that supports the nationwide
education system should be designed and further structured to collect meaningful and
precise data about the full range of academic achievement antecedents. The study
finds no evidence in favour of smaller classes.
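The stacking criterion stated in this paragraph can be illustrated with a small sketch. The determinant of a correlation matrix equals 1 when the base learner outputs are mutually uncorrelated and falls toward 0 as the outputs become redundant. The Python below is only an illustration under stated assumptions, not the thesis implementation: the function names and the toy prediction vectors are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of predictions."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_matrix(outputs):
    """outputs[k] holds the predictions of base learner k (hypothetical layout)."""
    return [[pearson(a, b) for b in outputs] for a in outputs]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # numerically singular: outputs are redundant
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


# Toy predictions (made up): two uncorrelated learners versus two redundant ones.
diverse = [[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]]
redundant = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]

det_diverse = determinant(correlation_matrix(diverse))
det_redundant = determinant(correlation_matrix(redundant))
```

In this toy case the diverse pair yields a determinant near 1 and the redundant pair a determinant near 0, which matches the direction of the relationship reported above.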
The section 4 study focuses on the machine learning bias when predicting teacher
grades. The experimental phase consists of predicting the grades of 11th and
12th grade Portuguese high school students and computing the bias and variance
decomposition. In the base implementation, only the academic achievement critical
factors are considered. In the second implementation, the preceding year’s grade is
appended as an input variable. The machine learning algorithms in use are random
forest, support vector machine, and extreme gradient boosting machine. The poor
performance of the machine learning algorithms is attributed either to the poor
precision of the input space or to the lack of a sound record of student performance.
We introduce the new concept of knowledge bias and a new classification of predictive
models. Precision education would reduce bias by providing low-bias intensive-knowledge
models. To avoid bias, it is not necessary to add knowledge to the input space.
Low-bias extensive-knowledge models are achievable simply by appending the student’s earlier
performance record to the model. The low-bias intensive-knowledge learning models
promoted by precision education are suited to designing new policies and actions
toward academic attainment. If the aim is solely prediction, opting for a low-bias
extensive-knowledge model can be appropriate and correct.
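The bias and variance decomposition referred to in this paragraph can be sketched for squared error as follows. The sketch assumes that several models of the same kind have been fitted on different resamples and that the observed grades are treated as noise-free, so any irreducible noise is absorbed into the bias term. The function name and the toy numbers are hypothetical and are not taken from the thesis.

```python
def bias_variance(predictions, targets):
    """Squared-error decomposition averaged over test points.

    predictions[m][i] is the prediction of model m (fitted on resample m)
    for test point i; targets[i] is the observed grade for that point.
    Returns (bias squared, variance, mean squared error).
    """
    n_models = len(predictions)
    n_points = len(targets)
    bias_sq = variance = error = 0.0
    for i, y in enumerate(targets):
        preds = [predictions[m][i] for m in range(n_models)]
        mean_pred = sum(preds) / n_models
        # Squared gap between the average prediction and the target.
        bias_sq += (mean_pred - y) ** 2
        # Spread of the individual models around their average prediction.
        variance += sum((p - mean_pred) ** 2 for p in preds) / n_models
        # Average squared error of the individual models.
        error += sum((p - y) ** 2 for p in preds) / n_models
    return bias_sq / n_points, variance / n_points, error / n_points


# Toy example (made-up numbers): three models, two students.
targets = [12.0, 15.0]
predictions = [
    [10.0, 14.0],
    [12.0, 16.0],
    [14.0, 18.0],
]
bias_sq, variance, error = bias_variance(predictions, targets)
```

Under these assumptions the mean squared error equals the sum of the squared bias and the variance, so a model family whose average prediction stays far from the observed grades shows up as high bias regardless of how stable its individual fits are.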
The section 5 study applies deep learning to the prediction of Portuguese high school
grades. Two implementations are undertaken in the experimental phase, one of a deep
multilayer perceptron and the other of multiple linear regression. The architecture,
topology, regularization, initialization, and optimization algorithms are fine-tuned in the
deep learning hyperparameter tuning phase. The results encompass point predictions, prediction
intervals, variable gradients, and the impact of an increase in class size on grades.
The deep learning generalization error is smaller in the prediction of student grades,
and its prediction intervals are more accurate. The empirical distributions of the deep
multilayer perceptron gradients largely align with the regression coefficient estimates, indicating
a satisfactory regression fit. Based on gradient discrepancies, a student's mother being
an employer does not seem to be a positive factor. A benign paradigm change in the
balance between home and career affairs for both genders should be reinforced. The
deep multilayer perceptron broadens the spectrum of possibilities and embraces each
specificity as a core element of the analysis by providing a quantum solution hinged on a
universal approximator. In the case of an academic achievement critical factor such as
class size, where the literature is unanimous on neither its importance nor its direction,
the multilayer perceptron formed three distinct clusters per the individual gradient
signals.
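The idea of reading a per-student gradient for one input, such as class size, can be sketched as below. The sketch rests on loud assumptions: it approximates the gradient by central finite differences, it groups students by the sign of that gradient (one possible way to obtain three groups, not necessarily the clustering used in the thesis), and `toy_predict` is a hand-written stand-in rather than a trained multilayer perceptron. All names and numbers are hypothetical.

```python
from math import tanh


def input_gradient(predict, x, index, step=1e-4):
    """Central finite-difference estimate of d predict / d x[index]."""
    up, down = list(x), list(x)
    up[index] += step
    down[index] -= step
    return (predict(up) - predict(down)) / (2 * step)


def group_by_gradient_sign(predict, rows, index, tol=1e-3):
    """Split row indices into negative, near-zero and positive gradient groups."""
    groups = {"negative": [], "near_zero": [], "positive": []}
    for i, row in enumerate(rows):
        g = input_gradient(predict, row, index)
        if g > tol:
            groups["positive"].append(i)
        elif g < -tol:
            groups["negative"].append(i)
        else:
            groups["near_zero"].append(i)
    return groups


def toy_predict(x):
    """Stand-in predictor (NOT a trained network): x = [class_size, prior_grade].

    The effect of class size changes sign with the prior grade, so different
    students receive gradients of different sign.
    """
    class_size, prior_grade = x
    return 10.0 + tanh(0.1 * (prior_grade - 12.0)) * tanh(0.1 * (class_size - 20.0))


# Three made-up students in a class of 25 with different prior grades.
students = [[25.0, 16.0], [25.0, 12.0], [25.0, 8.0]]
groups = group_by_gradient_sign(toy_predict, students, index=0)
```

With a real network the finite-difference step would normally be replaced by automatic differentiation, and the tolerance `tol` that defines the near-zero group would need to be chosen with the scale of the grades in mind.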
Finally, section 6 recaps the findings and conclusions of the thesis.
Keywords: Academic Achievement, Machine Learning, Deep Learning, Support Vector
Regression, Random Forest, Stacking, Boosting, Bias and Variance Decomposition,
Quantitative Political Analysis
Description
A thesis submitted in partial fulfillment of the requirements for the degree of Doctor in Information Management, specialization in Information and Decision Systems
Keywords
Academic Achievement, Machine Learning, Deep Learning, Support Vector Regression, Random Forest, Stacking, Boosting, Bias and Variance Decomposition, Quantitative Political Analysis, SDG 4 - Quality education
