A colleague and I entered this competition. We came 9th.
We were doing it for fun and aren't experts in the domain, but I think our score was within 'diminishing returns' of the better teams.
There are a couple of things to realize about this challenge:
I wouldn't conceptualise the challenge as trying to find features of good essays. It is more about trying to find features that are predictive of the essay being good.
This is a subtle but important distinction. One example is that the length of the essay was hugely predictive of the score it would get - longer meant better.
Is a longer essay really a better one? No. But, at the level the students were at, it just so happened that the students who were able to write better also tended to write longer essays.
While you get good accuracy using techniques like this, it's debatable how useful or robust this general approach is - because you aren't really measuring the quality of the essay so much as finding features that just happen to be predictive of it.
Certainly, it would seem fairly easy for future students to game such a system, if it were deployed.
This isn't a general attack on machine learning competitions - but I wonder whether, for situations that are in some sense adversarial like this (in that future students would have an incentive to game the system), some sort of iterated challenge would be better? After a couple of rounds of grading and attempting to game the grading, we'd probably have a more accurate assessment of how a system would work in practice.
There is another important feature of this essay grading challenge that should be taken into consideration. There were 8 sets of essays, each on a different topic. So, for example, essay set 4 might have had the topic 'Write an essay on what you feel about technology in schools'. To improve accuracy, competitors could (and I would guess most of the better teams did) build separate models for each individual essay-set/topic. This then increased the accuracy of, say, a bag-of-words approach - if an essay-set-1 essay mentioned the word 'Internet', then maybe that was predictive of a good essay-set-1 essay, even though the inclusion of 'Internet' would not be predictive of essay quality across all student essays.
It's important to remember this when thinking about the success of the algorithms. The essay grading algorithms were not necessarily general purpose, and could be fitted to each individual essay topic.
Which is fine, as long as we realize it. The fact that it was so easy to surpass inter-annotator agreement (how predictive one human grader's scoring was of the other's) was interesting. It's just important to realize the limits of the machine learning contest setup.
I would guess that accuracy would go down on essays by older, more advanced students, or in an adversarial situation where there was an incentive to game the system.
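The per-topic modelling described above can be sketched roughly as follows. This is an illustrative toy, not the actual contest code: the data shape (a list of set/text/score tuples) and the scoring scheme (score each word by the mean grade of essays containing it, then average) are assumptions, and a real entry would use a proper learner on top of the per-set features.

```python
# Toy sketch of per-essay-set modelling: fit a separate word-score table per
# topic, so a word like 'Internet' can be predictive within one essay set
# without being predictive across all sets.
from collections import defaultdict

def fit_per_topic_word_scores(essays):
    """essays: list of (essay_set_id, text, score) tuples.
    Returns, per essay set, the mean score of essays containing each word."""
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for essay_set, text, score in essays:
        for word in set(text.lower().split()):
            totals[essay_set][word] += score
            counts[essay_set][word] += 1
    return {s: {w: totals[s][w] / counts[s][w] for w in totals[s]}
            for s in totals}

def predict(models, essay_set, text):
    """Score a new essay as the mean per-topic score of its known words."""
    word_scores = models[essay_set]
    hits = [word_scores[w] for w in set(text.lower().split())
            if w in word_scores]
    return sum(hits) / len(hits) if hits else None
```

The point of the sketch is only the structure: everything is keyed by essay set, so nothing learned for set 1 leaks into (or has to generalize to) set 4.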
Is a longer essay really a better one? No. But, at the level the students were at, it just so happened that the students who were able to write better also tended to write longer essays.
It could also be the case that length is one of the features your human graders are using to grade essays. I.e., it might really be causal, rather than merely correlated.
In my (anecdotal) experience, teachers certainly do this. While in college I developed the skill of utilizing excessively long and verbose language while elucidating simple points simply to incrementally increase the length of essays [1].
Luckily a great prof in grad school (thanks Joel) beat this bad habit out of me.
[1] In college I learned to pad my essays with verbose language.
It's possible.
More generally, it's also possible the human graders were doing a bad job; the ML system can only learn 'essay quality' to the extent that the training data reflects it.
However, the Kaggle-supplied 'straw-man' benchmark, which worked solely from the count of characters and words in the essay, had a score of .647 on the training data. (The score metric used isn't trivial to interpret - it was 'Weighted Mean Quadratic Weighted Kappa' - but for reference the best entries had a score of ~.8 at the end.)
The score of .647, just using length, is quite high.
For length to have this powerful a causal effect, the human graders would have to be weighting length, as a feature, very heavily.
I can't rule that out; but I think it's highly likely that a major component of the predictive effect of length was correlative, rather than causal.
While you get good accuracy using techniques like this, it's debatable how useful or robust this general approach is - because you aren't really measuring the quality of the essay so much as finding features that just happen to be predictive of it. Certainly, it would seem fairly easy for future students to game such a system, if it were deployed.
I'm not able to dig up the name, but there's a named effect in statistics (especially social-science statistics) describing exactly that. When you find a correlate of a desired outcome that has predictive value, and then set that correlate as a metric, a substantial part of the correlation and predictive value quickly disappears, because you've now given people an incentive to arbitrage the proxy measure. You've said: I'm going to treat easy-to-measure property A as a proxy for what-I-really-want property B. Now there is a market incentive to find the cheapest possible way to maximize property A, which often ends up being via loopholes that do not maximize property B. A heuristic explanation is that proxies that are easier to measure than the "real" thing are also easier to optimize than the real thing. At the very least, your original statistics are no longer valid, because they were measured in a context where people were not explicitly trying to optimize for A; once they are, you need to re-measure to check whether this changed the data.
Aha, almost it; I was thinking of the very similar Campbell's law, which your mention of Goodhart's law led me to. Somehow no combination of search terms got me to either of those when I was trying to come up with the name, though...