Online automated essay scoring

Is punctuation not important, a little important, or super-important? You'll have to tell the program which judgment to follow, and the moment you do, you've embedded one of your personal biases into the machine. Fans of robo-graders like the one in the NPR piece talk about how the AI can "learn" what a good essay looks like by being fed a hundred or so "good" essays. But there are two problems here. The first is that somebody has to pick the exemplars, so hello again, human bias. The second is that this narrows the AI's view by saying that a good essay is one that looks a lot like these other essays.

So much for open-ended questions and divergent thinking. But the biggest problem with robo-grading continues to be the algorithm's inability to distinguish between quality and drivel. Les Perelman once demonstrated this by pulling up a letter of recommendation he had written, replacing the student's name with words from a Criterion writing prompt, and replacing the word "the" with "chimpanzee."


The king of robo-score debunking is Les Perelman. More recently, the former MIT professor teamed up with some students to create BABEL, a computer program that generates gibberish essays that other computer programs score as outstanding pieces of writing. Robo-scoring fans like to cite a study by Mark Shermis of the University of Akron and Ben Hamner, in which computers and human scorers produced near-identical scores for a batch of essays. Perelman tore the study apart pretty thoroughly.

The full dismantling is here, but the basic problem, beyond the methodology itself, was that the testing industry has its own definition of the task of writing, one that is more about a performance task than an actual expression of thought and meaning. The secret of all studies of this type is simple: make the humans follow the same algorithm used by the computer rather than the kind of scoring an actual English teacher would use. The unhappy lesson is that robo-graders merely exacerbate the problems created by standardized writing tests.

The point is not that robo-graders can't recognize gibberish. The point is that their inability to distinguish between good writing and baloney makes them easy to game.


Use some big words. Repeat words from the prompt. Fill up lots of space. Students quickly learn to perform for an audience of software. And the people selling this baloney can't tell the difference themselves.

AI In Education — Automatic Essay Scoring

A CS109a Final Project by Anmol Gupta, Annie Hwang, Paul Lisker, and Kevin Loughlin.

The tf-idf (term frequency-inverse document frequency) measure quantifies the number of times an n-gram appears in an essay while weighting it by how frequently that n-gram appears in a general corpus of text. In other words, tf-idf provides a way of standardizing n-gram counts against the number of times they would be expected to appear in an essay in the first place. As a result, even if the raw count of a particular n-gram is large, the weighting offsets it when the n-gram is one that already appears frequently across essays.
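
For reference, one standard formulation of the tf-idf weight (a textbook version; the project may well use a different variant) for an n-gram $t$ in essay $d$, over a corpus of $N$ documents, is

$$\operatorname{tfidf}(t, d) = \operatorname{tf}(t, d) \cdot \log\frac{N}{\operatorname{df}(t)},$$

where $\operatorname{tf}(t, d)$ is the number of times $t$ appears in $d$ and $\operatorname{df}(t)$ is the number of documents containing $t$, so common n-grams are down-weighted exactly as described above.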

As such, given the benefits of n-grams and their quantification via the tf-idf method, we created a baseline model using unigrams with tf-idf as the predictive features. For this baseline, we used a simple linear regression model to predict a set of standardized scores for our training essays, and we evaluated it with the Spearman rank correlation; in other words, we measured how well the ranking of the features corresponds with the ranking of the scores.
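
As a minimal sketch of such a baseline, assuming scikit-learn and hypothetical arrays `essays` (raw essay text) and `scores` (the human grades) for one essay set; this is an illustration of the approach rather than the project's actual code:

```python
# Baseline: unigram tf-idf features -> linear regression on standardized scores.
# `essays` and `scores` are hypothetical stand-ins for one essay set's data.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def fit_baseline(essays, scores):
    # Unigram tf-idf features, as described above.
    vectorizer = TfidfVectorizer(ngram_range=(1, 1), lowercase=True)
    X = vectorizer.fit_transform(essays)

    # Standardize the scores so essay sets with different scales are comparable.
    y = (np.asarray(scores, dtype=float) - np.mean(scores)) / np.std(scores)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    model = LinearRegression()
    model.fit(X_train, y_train)
    return vectorizer, model, X_test, y_test
```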

The benefit of this approach is that Spearman correlation is a useful measure for grading essays, since we are interested in how directly a feature predicts the relative score of an essay. Ultimately, it is a better metric than raw accuracy, since it gives direct insight into the influence of a feature on the score, and because relative accuracy may well matter more than absolute accuracy here.

A Spearman test yields a score ranging from -1 to 1, where the closer the score is to an absolute value of 1, the stronger the monotonic association; positive values imply a positive monotonic association, negative values a negative one, and values near 0 a weak association. By the usual interpretation of Spearman correlation strength, the baseline model's correlations, shown in Figure 2, ranged from very weak to moderate, all with p-values several orders of magnitude below the significance threshold. However, even with this statistical significance, such weak Spearman correlations are ultimately far too low for this baseline model to provide a trustworthy system.
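
Continuing that hypothetical sketch, the Spearman evaluation could be computed with scipy as follows; the variable names carry over from the baseline snippet above:

```python
# Evaluate the baseline by rank correlation rather than raw accuracy.
from scipy.stats import spearmanr

def evaluate(model, X_test, y_test):
    preds = model.predict(X_test)
    # rho in [-1, 1]: strength and direction of the monotonic association.
    rho, p_value = spearmanr(preds, y_test)
    return rho, p_value
```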

As such, we clearly needed a stronger model with a more robust selection of features, as expected. Our early data exploration pointed to word count and vocabulary size as useful features. Other trivial features we opted to include were the number of sentences, the percentage of misspellings, and the percentage of each part of speech. We believed these features would be valuable additions to our existing baseline model, as they provide greater insight into the overall structure of each essay and thus could plausibly be correlated with score.
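
One plausible way to extract these trivial features with NLTK is sketched below; the helper name `trivial_features` and the word-list spell check are our own illustrative choices, not the project's implementation:

```python
# Simple structural features for one essay: counts, misspelling rate, POS percentages.
# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
#           nltk.download("words")
from collections import Counter

import nltk
from nltk.corpus import words

ENGLISH_WORDS = set(w.lower() for w in words.words())  # crude dictionary for spell-check

def trivial_features(essay):
    sentences = nltk.sent_tokenize(essay)
    tokens = [t for t in nltk.word_tokenize(essay) if t.isalpha()]
    tags = [tag for _, tag in nltk.pos_tag(tokens)]

    n_tokens = max(len(tokens), 1)
    misspelled = sum(1 for t in tokens if t.lower() not in ENGLISH_WORDS)
    pos_counts = Counter(tag[:2] for tag in tags)  # collapse detailed Penn tags (NN, VB, JJ, ...)

    features = {
        "word_count": len(tokens),
        "unique_word_count": len(set(t.lower() for t in tokens)),
        "sentence_count": len(sentences),
        "misspelling_pct": misspelled / n_tokens,
    }
    features.update({f"pos_pct_{tag}": c / n_tokens for tag, c in pos_counts.items()})
    return features
```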

However, we also wanted to include at least one nontrivial feature, operating under the belief that essay grading depends on the actual content of the essay, that is, on an aspect of the writing that is not captured by trivial statistics. Based on a recommendation from our Teaching Fellow, Yoon, we decided to implement perplexity as our nontrivial feature. Perplexity measures the likelihood of a sequence of words appearing, given a training set of text. Somewhat confusingly, a low perplexity score corresponds to a high likelihood of appearing.
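
Concretely, the standard definition of perplexity for a unigram model over a token sequence $w_1, \dots, w_N$ (our paraphrase of the usual formula, not one quoted from the report) is

$$\mathrm{PP}(w_1 \dots w_N) = P(w_1 \dots w_N)^{-1/N} = \left(\prod_{i=1}^{N}\frac{1}{P(w_i)}\right)^{1/N},$$

so the more probable the model finds the sequence, the lower the perplexity, which is why a low score corresponds to a high likelihood.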


One would logically conclude that good essays on a certain topic would have similar ideas and thus similar vocabulary. As such, it follows that, given a sufficient training set, perplexity may well provide a valid measure of the content of the essays [4]. Using perplexity proved to be much more of a challenge than anticipated. While the NLTK module provides a method that builds a language model and can subsequently calculate the perplexity of a string based on this model, the method has currently been removed from NLTK due to several existing bugs [5].

As such, we concluded that the most appealing option was to implement a basic version of the perplexity library ourselves.


We therefore constructed a unigram language model and perplexity function. Ideally, we will be able to expand this functionality to n-grams in the future, but due to time constraints, complexity, code efficiency, and the need to test code we write ourselves, we have only implemented perplexity on a unigram model for now. The relationship of each feature to the score can be seen in Figure 3. Unique word count, word count, and sentence count all appear clearly correlated with score, while perplexity shows a possible trend.
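
A minimal sketch of such a unigram language model and perplexity function, using add-one (Laplace) smoothing so unseen words do not produce zero probabilities; this is an illustrative reconstruction, not the project's code:

```python
# Unigram language model with add-one smoothing, plus a perplexity function.
import math
from collections import Counter

class UnigramModel:
    def __init__(self, training_tokens):
        self.counts = Counter(training_tokens)
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts)

    def prob(self, token):
        # Add-one smoothing, reserving one extra "unseen" slot in the denominator.
        return (self.counts[token] + 1) / (self.total + self.vocab_size + 1)

    def perplexity(self, tokens):
        # Lower perplexity means the sequence is more likely under the model.
        if not tokens:
            return float("inf")
        log_prob = sum(math.log(self.prob(t)) for t in tokens)
        return math.exp(-log_prob / len(tokens))
```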

It is our belief that with a more advanced perplexity implementation, perhaps one based on n-grams rather than unigrams, this relationship would be strengthened. Indeed, this is a point of discussion later in this report. With these additional features in place, we moved on to selecting the actual model to predict our response variable. In the end, we decided to continue using linear regression, as we saw no reason to stray from this approach, and also because such a model was recommended to us.

However, we decided that it was important to include a regularization component in order to limit the influence of any collinear relationships among our thousands of features. We experimented with both Lasso and Ridge regularization, tuning for the optimal alpha across a range of values. As learned in class, Lasso performs both parameter shrinkage and variable selection, automatically removing predictors that are collinear with other predictors. Ridge regression, on the other hand, does not zero out coefficients for the predictors, but does shrink them, limiting their influence on the fitted model.
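
A sketch of that comparison with scikit-learn, scoring each candidate alpha by the Spearman correlation on a held-out split; the alpha grid and function name here are illustrative assumptions, not the values the project actually searched:

```python
# Compare Lasso and Ridge over a grid of alphas, scoring by Spearman correlation.
from scipy.stats import spearmanr
from sklearn.linear_model import Lasso, Ridge

def tune_regularization(X_train, y_train, X_val, y_val,
                        alphas=(0.0001, 0.001, 0.01, 0.1, 1.0)):
    best = None
    for name, cls in (("lasso", Lasso), ("ridge", Ridge)):
        for alpha in alphas:
            model = cls(alpha=alpha, max_iter=10000)
            model.fit(X_train, y_train)
            rho, _ = spearmanr(model.predict(X_val), y_val)
            if best is None or rho > best[0]:
                best = (rho, name, alpha, model)
    return best  # (spearman, model name, alpha, fitted model)
```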

With this improved model, we see that the Spearman rank correlations have significantly improved from the baseline model. In Figure 5, we highlight the scores of the models that yielded the highest Spearman correlations for each of the essay sets. Our highest Spearman correlation was achieved on Essay Set 1, at approximately 0.



We discuss ways to improve this in the following section. Figures 4 and 5 also show that Lasso regularization generally performed better than Ridge regularization, exhibiting better Spearman scores in six of the eight essay sets; in fact, the average score of Lasso was also slightly higher.


While this difference is not large, we would nonetheless opt for the Lasso model. Given that we have thousands of features with the inclusion of tf-idf, it is likely that many of these features are not statistically significant in our linear model. Hence, completely eliminating those features, as Lasso does, rather than merely shrinking their coefficients gives us a more interpretable, computationally efficient, and simpler model. Ultimately, our Lasso linear regression yielded the greatest overall Spearman correlation and is intuitively justifiable as a model. With proper tuning of the regularization alpha, the p-values, importantly, remained well below the significance threshold.
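
To make that interpretability argument concrete, the fitted Lasso's coefficient vector can be inspected directly; the sketch below (reusing the hypothetical fitted model from the earlier snippets) simply counts how many tf-idf features survive:

```python
# Count how many features Lasso actually keeps (non-zero coefficients).
import numpy as np

def surviving_features(lasso_model):
    coefs = np.asarray(lasso_model.coef_).ravel()
    kept = np.count_nonzero(coefs)
    return kept, coefs.size  # features kept vs. total number of features
```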