A colleague and I entered this competition. We came 9th.
We were doing it for fun and aren't experts in the domain, but I think our score was within 'diminishing returns' of the better teams.
There are a couple of things to realize about this challenge:
I wouldn't conceptualise the challenge as trying to find features of good essays. It is more about trying to find features that are predictive of the essay being good.
This is a subtle but important distinction. One example is that the length of the essay was hugely predictive of the score it would get - longer meant better.
Is a longer essay really a better one? No. But, at the level the students were at, it just so happened that the students who were able to write better also tended to write longer essays.
While you get good accuracy using techniques like this, it's debatable how useful or robust this general approach is - because you aren't really measuring the quality of the essay so much as finding features that just happen to be predictive of it.
Certainly, it would seem fairly easy for future students to game such a system, if it were deployed.
This isn't a general attack on machine learning competitions - but I wonder whether, for situations that are in some sense adversarial like this (in that future students would have an incentive to game the system), some sort of iterated challenge would be better? After a couple of rounds of grading and attempting to game the grading, we'd probably have a more accurate assessment of how a system would work in practice.
There is another important feature of this essay grading challenge that should be taken into consideration. There were 8 sets of essays, each on a different topic. So, for example, essay set 4 might have had the topic 'Write an essay on what you feel about technology in schools'. To improve accuracy, competitors could (and I would guess most of the better teams did) build separate models for each individual essay-set/topic. This then increased the accuracy of, say, a bag-of-words approach - if an essay-set-1 essay mentioned the word 'Internet', then maybe that was predictive of a good essay-set-1 essay, even though the inclusion of 'Internet' would not be predictive of essay quality across all student essays.
It's important to remember this when thinking about the success of the algorithms. The essay grading algorithms were not necessarily general purpose, and could be fitted to each individual essay topic.
Which is fine, as long as we realize it. The fact that it was so easy to surpass inter-annotator agreement (how predictive one human grader's scoring was of the other's) was interesting. It's just important to realize the limits of the machine learning contest setup.
I would guess that accuracy would go down on essays by older, more advanced students, or in an adversarial situation where there was an incentive to game the system.
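The per-topic modelling described above can be sketched roughly as follows. This is an illustrative toy, not the actual contest code: the data shape (a list of set/text/score tuples) and the scoring scheme (score each word by the mean grade of essays containing it, then average) are assumptions, and a real entry would use a proper learner on top of the per-set features.

```python
# Toy sketch of per-essay-set modelling: fit a separate word-score table per
# topic, so a word like 'Internet' can be predictive within one essay set
# without being predictive across all sets.
from collections import defaultdict

def fit_per_topic_word_scores(essays):
    """essays: list of (essay_set_id, text, score) tuples.
    Returns, per essay set, the mean score of essays containing each word."""
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for essay_set, text, score in essays:
        for word in set(text.lower().split()):
            totals[essay_set][word] += score
            counts[essay_set][word] += 1
    return {s: {w: totals[s][w] / counts[s][w] for w in totals[s]}
            for s in totals}

def predict(models, essay_set, text):
    """Score a new essay as the mean per-topic score of its known words."""
    word_scores = models[essay_set]
    hits = [word_scores[w] for w in set(text.lower().split())
            if w in word_scores]
    return sum(hits) / len(hits) if hits else None
```

The point of the sketch is only the structure: everything is keyed by essay set, so nothing learned for set 1 leaks into (or has to generalize to) set 4.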
Is a longer essay really a better one? No. But, at the level the students were at, it just so happened that the students who were able to write better also tended to write longer essays.
It could also be the case that length is one of the features your human graders are using to grade essays. I.e., it might really be causal, rather than merely correlated.
In my (anecdotal) experience, teachers certainly do this. While in college I developed the skill of utilizing excessively long and verbose language while elucidating simple points simply to incrementally increase the length of essays [1].
Luckily a great prof in grad school (thanks Joel) beat this bad habit out of me.
[1] In college I learned to pad my essays with verbose language.
It's possible.
More generally, it's also possible the human graders were doing a bad job; the ML system can only learn 'essay quality' to the extent that the training data reflects it.
However, the Kaggle-supplied 'straw-man' benchmark, which worked solely from the count of characters and words in the essay, had a score of .647 on the training data. (The score metric used isn't trivial to interpret - it was 'Weighted Mean Quadratic Weighted Kappa' - but for reference the best entries had a score of ~.8 at the end.)
The score of .647, just using length, is quite high.
For length to have this powerful a causal effect, the human graders would have to be weighting length, as a feature, very heavily.
I can't rule that out; but I think it's highly likely that a major component of the predictive effect of length was correlative, rather than causal.
While you get good accuracy using techniques like this, it's debatable how useful or robust this general approach is - because you aren't really measuring the quality of the essay so much as finding features that just happen to be predictive of it. Certainly, it would seem fairly easy for future students to game such a system, if it were deployed.
I'm not able to dig up the name, but there's a named effect in statistics (especially social-science statistics) describing exactly that. When you find a correlate of a desired outcome that has predictive value, and then set that correlate as a metric, a substantial part of the correlation and predictive value quickly disappears, because you've now given people an incentive to arbitrage the proxy measure. You've said: I'm going to treat easy-to-measure property A as a proxy for what-I-really-want property B. Now there is a market incentive to find the cheapest possible way to maximize property A, which often ends up being via loopholes that do not maximize property B. A heuristic explanation is that proxies that are easier to measure than the "real" thing are also easier to optimize than the real thing. At the very least, your original statistics are no longer valid, because they were measured in a context where people were not explicitly trying to optimize for A; once they are, you need to re-measure to check whether this changed the data.
Aha, almost it; I was thinking of the very similar Campbell's law, which your mention of Goodhart's law led me to. Somehow no combination of search terms got me to either of those when I was trying to come up with the name, though...