Teaching students how to write well is essential: it enables them to convey their ideas effectively and to succeed in their personal and academic endeavors (McNamara, 2015). Writing exercises such as essays are therefore powerful tools for evaluating and preparing students in this area. However, grading essays manually is time-consuming and expensive for states, so educational institutions often opt for multiple-choice standardized tests instead (Ref). Developing high-fidelity automated essay scoring tools is thus of utmost importance.
The dataset comes from "The Hewlett Foundation: Automated Essay Scoring" competition on Kaggle (Link). It contains manually scored essays organized into 8 essay sets, written by students in grades 7 through 10; essays average roughly 150 to 550 words in length. Each essay set has its own prompt, rubric, and score range. Every essay was scored by two graders, and when the two scores differed sharply, a third expert grader resolved the score. Except for essay set #2, all essays have a final score field that can be used as the target variable in this study. Essay set #2 was scored in two different domains (the original competition asks for two separate scores for this set); for simplicity, we use only the domain-1 scores for this set.
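As a minimal, hedged sketch of getting started with this data, the competition file ships as a tab-separated file (training_set_rel3.tsv in this repository). The column names below are taken from the competition's data description and assumed here, and the latin-1 encoding is an assumption about the raw TSV:

```python
import pandas as pd

# Load the ASAP competition data; the Kaggle TSV is typically not UTF-8,
# so an explicit latin-1 encoding is assumed here.
df = pd.read_csv("training_set_rel3.tsv", sep="\t", encoding="latin-1")

# Keep only the columns used in this study (names assumed from the
# competition's data description): the essay text, its set, the two raw
# rater scores, and the resolved domain-1 score used as the target.
df = df[
    ["essay_id", "essay_set", "essay", "rater1_domain1", "rater2_domain1", "domain1_score"]
].dropna(subset=["domain1_score"])

print(df["essay_set"].value_counts().sort_index())  # essay counts per set (cf. Figure 1)
```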
Figure 1. Essay counts per set.
Since the score range differs across essay sets, all scores are normalized to a 0 to 10 range so that the sets can be used together to develop a single model. The following figure shows the distribution of normalized scores for the two human raters who graded the essays by hand.
Figure 2. Distribution of hand-graded scores (normalized) for all essays.
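The notebook's exact normalization routine is not reproduced here; the following is a minimal sketch (continuing the loading sketch above) of one way to min-max scale each set's scores to the common 0 to 10 range, keeping the per-set ranges so predictions can later be denormalized:

```python
# Min-max scale each essay set's scores to a common 0-10 range, and keep the
# per-set minima/maxima so model predictions can later be denormalized.
score_ranges = df.groupby("essay_set")["domain1_score"].agg(["min", "max"])

def normalize(score, essay_set):
    lo, hi = score_ranges.loc[essay_set, "min"], score_ranges.loc[essay_set, "max"]
    return 10 * (score - lo) / (hi - lo)

def denormalize(norm_score, essay_set):
    lo, hi = score_ranges.loc[essay_set, "min"], score_ranges.loc[essay_set, "max"]
    return norm_score / 10 * (hi - lo) + lo

df["norm_score"] = [
    normalize(s, es) for s, es in zip(df["domain1_score"], df["essay_set"])
]
```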
Cohen's kappa with quadratic weighting can be used to measure how well the two human raters agree, providing a benchmark for the model's performance.
Figure 3. Agreement of human raters for each essay set.
The aggregated Cohen's kappa across all essays is approximately 0.75.
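As a sketch (not the project's own code), this agreement can be computed with scikit-learn's cohen_kappa_score using quadratic weights, again assuming the competition's rater column names:

```python
from sklearn.metrics import cohen_kappa_score

# Quadratic-weighted Cohen's kappa between the two human raters, pooled over
# all essays; per-set values (as in Figure 3) follow from grouping first.
human_qwk = cohen_kappa_score(
    df["rater1_domain1"], df["rater2_domain1"], weights="quadratic"
)
print(f"Aggregated human-human QWK: {human_qwk:.2f}")

# Agreement within each essay set (cf. Figure 3)
per_set_qwk = df.groupby("essay_set").apply(
    lambda g: cohen_kappa_score(g["rater1_domain1"], g["rater2_domain1"], weights="quadratic")
)
print(per_set_qwk)
```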
To build an automated essay scorer, relevant features must be extracted from the essays. Coh-Metrix, a program that analyzes discourse using NLP (McNamara et al., 2014), is commonly used for this purpose. Since Coh-Metrix lacks a Python implementation for English, this study computes similar features using the extract_features_bulk function. These features capture essay characteristics such as text easability, cohesion, lexical diversity, connectivity, syntactic complexity, and readability. The following table outlines these features; a rough sketch of computing a few of them appears below the table.
Feature Name | Explanation |
---|---|
num_words | The total number of words in the essay. It provides an overall measure of essay length. |
num_sentences | The total number of sentences in the essay. This is an indicator of the essay's structure. |
avg_sentence_length | The average number of words per sentence. It gives an indication of sentence complexity. |
std_sentence_length | The standard deviation of sentence lengths. This shows the variability in sentence structure. |
avg_syllable_count | The average number of syllables per word. It reflects the complexity and sophistication of the words. |
std_syllable_count | The standard deviation of syllable counts across words. This shows the variation in word complexity. |
avg_letter_count | The average number of letters per word. It gives an idea of word length and lexical richness. |
std_letter_count | The standard deviation of letter counts per word. This measures the variability in word length. |
flesch_reading_ease | A readability score indicating how easy the text is to read. Higher scores indicate easier readability. |
flesh_kincaid_grade | A grade-level readability score. This indicates the U.S. school grade level required to understand the text. |
smog_index | A readability measure based on sentence length and complex words. It provides a gauge of readability difficulty. |
lexical_diversity | The ratio of unique words to total words. Higher values suggest a richer vocabulary. |
num_stopwords | The proportion of stopwords in the text. A high proportion may indicate simpler language use. |
num_pronouns | The proportion of pronouns in the text. This can give insights into the personal style of the writer. |
num_verbs | The proportion of verbs in the text. This can indicate the action-oriented nature of the text. |
avg_tree_depth | The average syntactic depth of words in the sentence. A measure of sentence complexity from a syntactic perspective. |
avg_concreteness | The average concreteness of the words. Concreteness correlates with the level of abstraction in the text. |
num_misspelled | The proportion of misspelled words. This may indicate the writer's attention to spelling or errors. |
lsa_overlap_avg | The average overlap in meaning between adjacent sentences, measured using Latent Semantic Analysis. |
The final dataset has roughly 13000 entries and 20 features (including the essay set).
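The project's extract_features_bulk implementation (presumably in src/feature_engineering.py) is not reproduced here; the following is only a rough illustration, assuming nltk and textstat are available, of how a few of the tabulated features could be computed:

```python
import pandas as pd
import nltk
import textstat
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models used below

def extract_basic_features(essay: str) -> dict:
    """Compute a handful of the features listed above for a single essay."""
    sentences = sent_tokenize(essay)
    words = [w for w in word_tokenize(essay) if w.isalpha()]
    return {
        "num_words": len(words),
        "num_sentences": len(sentences),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "lexical_diversity": len(set(w.lower() for w in words)) / max(len(words), 1),
        "flesch_reading_ease": textstat.flesch_reading_ease(essay),
        "smog_index": textstat.smog_index(essay),
    }

features = pd.DataFrame([extract_basic_features(e) for e in df["essay"]])
```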
Various algorithms have been used for automated essay scoring (AES), including Linear Regression, Random Forest Regressor, and Linear Support Vector Regressor. While hierarchical classification and transformer models have shown promise, this project uses a Random Forest Regressor due to its strong performance and ability to handle both numerical and categorical features. Although fine-tuning a transformer model was considered, it was deemed time-prohibitive for this project.
A pipeline was developed for 3-fold cross-validation and hyperparameter tuning, with mean squared error as the evaluation metric. Since scores are ordinal in their original ranges but were normalized to a continuous scale, a denormalization step maps predictions back to each set's original scale before evaluation. Model performance was assessed using the Quadratic Weighted Kappa (QWK), achieving a substantial agreement of 0.69 with human raters (for the interpretation of QWK values, see Doewes et al., 2023).
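A hedged sketch of such a pipeline with scikit-learn, continuing the earlier sketches (features, df, and denormalize); the hyperparameter grid below is illustrative, not the project's exact configuration:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error, cohen_kappa_score

X = features.assign(essay_set=df["essay_set"].values)  # essay set kept as a feature
y = df["norm_score"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3-fold cross-validated hyperparameter search, scored by (negative) MSE.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [200, 500], "max_depth": [None, 20]},
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_train, y_train)

pred = search.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, pred))

# Denormalize back to each set's original scale and round to integer scores
# before computing the quadratic weighted kappa against the human scores.
pred_orig = [int(round(denormalize(p, es))) for p, es in zip(pred, X_test["essay_set"])]
true_orig = [int(round(denormalize(t, es))) for t, es in zip(y_test, X_test["essay_set"])]
print("QWK:", cohen_kappa_score(true_orig, pred_orig, weights="quadratic"))
```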
Figure 4. Feature importances in the final model. Note the moderate reliance of the model on length-based features.
- The cross-validated and test mean squared error of the model are both close to 2 (on the normalized score scale).
- The best-fit model has an $R^2$ of roughly 0.6, meaning it explains only about 60% of the variability in the target variable (normalized scores). Although $R^2$ matters more for inferential modeling, a moderate $R^2$ can still yield a strong predictive model.
- The random forest regressor achieved a quadratic weighted kappa score of 0.69. This barely misses the minimum requirement for an acceptable AES model (0.7 per Doewes et al., 2023). However, since 0.69 is within 0.1 of the human raters' agreement (0.75), it can still be deemed acceptable (Doewes et al., 2021).
- The model's reliance on length-based features highlights a known bias in AES systems, where longer essays tend to receive higher scores. Enriching the feature set with features that capture the context and meaning of the text, such as sentence embeddings from a transformer model, has been shown to mitigate this behavior (Doewes et al., 2021); a rough sketch of this idea follows the list.
- Training one model per essay set is expected to improve performance, since each set has a different prompt and a different score range.
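As a rough illustration of the embedding idea mentioned above, the sentence-transformers library could be used to enrich the handcrafted features; the model name below is an assumption, not something evaluated in this project:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Encode each essay into a dense vector; these embeddings can be concatenated
# with the handcrafted features above to give the model access to semantic
# content rather than surface statistics such as length.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed, commonly used model
embeddings = encoder.encode(df["essay"].tolist(), show_progress_bar=True)

X_enriched = np.hstack([features.to_numpy(), embeddings])
```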
- Ghanta, H. (2019). Automated Essay Evaluation Using Natural Language Processing and Machine Learning. Master of Science thesis. Link
- Ludwig, S., Mayer, C., Hanse, C., Eilers, K., & Brandt, S. Automated Essay Scoring Using Transformer Models. Link
- McNamara, D., Crossley, S., Roscoe, R., Allen, L., & Dai, J. (2015). A Hierarchical Classification Approach to Automated Essay Scoring. Assessing Writing. Link
- McNamara, D. S., Graesser, A. C., McCarthy, P., & Cai, Z. (2014). Automated Evaluation of Text and Discourse with Coh-Metrix. Cambridge: Cambridge University Press. Link
- Doewes, A., & Pechenizkiy, M. (2021). On the Limitations of Human-Computer Agreement in Automated Essay Scoring. International Educational Data Mining Society. Link
- Doewes, A., Kurdhi, N., & Saxena, A. (2023). Evaluating Quadratic Weighted Kappa as the Standard Performance Metric for Automated Essay Scoring. Educational Data Mining. Link
├───.gitignore
├───feature_engineered_data.csv
├───training_set_rel3.tsv
├───main.ipynb
├───README.md
├───images
├───src
│   ├───feature_engineering.py
│   ├───post_processing.py
│   ├───preprocessing.py
│   └───visualizations.py
└───presentation.pdf