You may be overestimating the sophistication of these algorithms.
At least for this training set, my algorithm rewarded the length of the essay most of all (something like 65% of the total prediction). The only other significant factors were misspellings and prevalence of certain parts of speech.
That model matched the accuracy of human graders and several commercial essay grading packages.
Students reverse-engineering comparable algorithms won't necessarily have to write well to score well.
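To make that concrete, here's a sketch of the kind of model being described: a handful of surface features and a linear fit. The dictionary, the exact feature set, and the toy training pairs below are illustrative stand-ins, not the actual system or its data.

```python
# Sketch of a surface-feature essay grader: length, misspelling rate, and a
# crude part-of-speech proxy, fit by least squares. The dictionary, features,
# and training pairs are all illustrative stand-ins.
import re
import numpy as np

WORD_RE = re.compile(r"[a-zA-Z'-]+")
DICTIONARY = {"the", "a", "cat", "sat", "on", "mat", "and", "is", "my", "pet"}

def features(essay: str) -> list[float]:
    words = [w.lower() for w in WORD_RE.findall(essay)]
    n = len(words) or 1
    misspelled = sum(w not in DICTIONARY for w in words)
    # crude stand-in for "prevalence of certain parts of speech":
    # the fraction of common function words
    function_words = sum(w in {"the", "a", "and", "is", "on"} for w in words)
    return [float(len(words)), misspelled / n, function_words / n]

# toy (essay, human grade) pairs -- invented purely for this example
corpus = [
    ("My pet cat sat on the mat.", 2.0),
    ("My pet cat sat on the mat, and the cat is my pet.", 3.0),
    ("My pett katt satt on teh matt.", 1.0),
]
X = np.array([features(e) + [1.0] for e, _ in corpus])  # bias column
y = np.array([grade for _, grade in corpus])
weights, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(essay: str) -> float:
    return float(np.dot(features(essay) + [1.0], weights))
```

A model like this never reads the essay for meaning, which is consistent with length carrying most of the prediction and with how easy it is to game.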
Currently, if a student writes nonsense, there's a fairly significant chance that they will be caught and penalised. A human can detect nonsense in three minutes.
In contrast, I suspect algorithmic approaches can be gamed more easily because they don't adapt in the same way. They're not solving the hard AI problem; they're grading essays (currently) written for a human reviewer.
For example, what happens if a child learns an existing text by heart and then substitutes appropriate nouns and verbs to suit the context? Say they learn "We hold these truths to be self-evident, that all men are created equal" and then, for an essay on their favourite pet, they hand in "We hold these kittens to be furry, that all kittens are created hungry". That's good grammar; it's got suitable references to the subject; it's clearly nonsense.
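As code, the trick is nothing more than template substitution. This toy sketch just mechanises the example above; the template and word choices are the ones from the example:

```python
# The memorise-and-substitute trick from the example, as literal code.
from string import Template

MEMORISED = Template("We hold these $noun to be $adj1, that all $noun are created $adj2")

# For an essay on a favourite pet, swap in subject-appropriate words:
sentence = MEMORISED.substitute(noun="kittens", adj1="furry", adj2="hungry")
print(sentence)
# We hold these kittens to be furry, that all kittens are created hungry
```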
No, it's not even good grammar. Compare "these truths... that all men are created equal, (...)" with "these kittens... that all kittens are created hungry". In the original, the "that" clause restates the truths being held self-evident; "kittens" isn't a claim, so the "that" in the second sentence has nothing to attach to.
In all probability, anyone setting out to actually defeat these algorithms could easily do it with a couple of hundred repetitions of the same sentence.
Or, if the algorithm is slightly cleverer than that, you could certainly defeat it by producing a single perfect-length, stylistically fantastic essay... which would be regurgitated word for word regardless of the subject matter.
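Both attacks are a few lines of code apiece. In this sketch the filler sentence and the memorised essay are placeholders invented for the illustration, and the scorer being attacked is assumed to reward only length and surface polish:

```python
# Two of the gaming strategies above, spelled out. The filler sentence and
# the memorised essay are placeholders invented for the illustration.
FILLER = "This carefully constructed sentence demonstrates a sophisticated command of written English. "

def padding_attack(target_words: int = 2000) -> str:
    # a couple of hundred repetitions of the same well-formed sentence
    return FILLER * (target_words // len(FILLER.split()))

PERFECT_ESSAY = "..."  # one polished, ideal-length essay, learned by heart

def regurgitation_attack(prompt: str) -> str:
    return PERFECT_ESSAY  # returned word for word, whatever the subject
```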
I think, mind you, that software does have a place in analyzing student essays. If I could scan in an essay and have it spit out a word count, highlight any spelling errors or potentially problematic turns of phrase, and [most importantly] analyze for plagiarism, that would be valuable.
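That kind of assistive tooling is also the easy part to build. Here's a minimal sketch; the dictionary and the reference text are placeholders, and a real plagiarism checker would compare against a large source corpus rather than a single string:

```python
# Sketch of the assistive tooling described above: word count, spelling flags,
# and a crude shingle-overlap plagiarism check. DICTIONARY and the reference
# text are placeholders for a real word list and source corpus.
import re

WORD_RE = re.compile(r"[a-zA-Z'-]+")
DICTIONARY = {"we", "hold", "these", "truths", "to", "be", "that", "all",
              "men", "are", "created", "equal"}

def word_count(text: str) -> int:
    return len(WORD_RE.findall(text))

def spelling_flags(text: str) -> list[str]:
    return [w for w in (w.lower() for w in WORD_RE.findall(text))
            if w not in DICTIONARY]

def shingles(text: str, k: int = 5) -> set[tuple[str, ...]]:
    words = [w.lower() for w in WORD_RE.findall(text)]
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def plagiarism_score(essay: str, source: str) -> float:
    # fraction of the essay's 5-word shingles that appear verbatim in the source
    own = shingles(essay)
    return len(own & shingles(source)) / len(own) if own else 0.0
```

None of this grades the essay; it just surfaces the mechanical checks so a human reviewer can spend their three minutes on the actual content.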