
He's missing the most important part: ensuring that your application code is treating the text as text, not as an octet stream. This varies by language, but typically the code is something like "text = decode('utf8', binary)" when your application first sees data from the wire (or files, or a URI string, etc.), and "binary = encode('utf8', text)" when the data leaves your program, like to a log file or the terminal or a socket.
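The boundary pattern described above can be sketched in Python (an illustrative sketch; the function names are made up for the example):

```python
# Sketch of the decode-at-the-boundary pattern: octets in, octets out,
# real text everywhere in between.

def read_message(wire_bytes: bytes) -> str:
    # Data arriving from a socket/file is an octet stream; decode it
    # exactly once, at the edge of the program.
    return wire_bytes.decode("utf-8")

def write_message(text: str) -> bytes:
    # Encode exactly once, when the text leaves the program
    # (log file, terminal, socket).
    return text.encode("utf-8")

wire = "héllo".encode("utf-8")   # pretend this came off the wire
text = read_message(wire)        # inside the program it is always str
assert text == "héllo"
assert write_message(text) == wire
```

The point of the pattern is that encoding decisions happen only at the edges, so the rest of the code never has to ask "is this bytes or text?".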

I say "binary" and "text" because the Internet cannot transmit text, it can only transmit "binary" octet streams. (Similarly, UNIX files can only store octets, and UNIX file names can only store octets other than / and NUL.) But, your programming language supports both text manipulation and binary manipulation, so you have to tell it how you want to treat the data. Each language is different; Perl treats everything as Latin-1 text by default (which happens to work nicely for binary, as well, but not so nicely for UTF-8-encoded text).

Often, libraries will handle this for you, since they have access to out-of-band information. If your locale is en_US.UTF-8, filenames can be assumed to be UTF-8-encoded. If the HTTP response's content-type says "charset=utf-8", your HTTP library will know to decode the octet stream into text for you. But it's important that you both test this and find the code that does it for you, because sometimes library authors forget or libraries have bugs, and one bug will ruin your whole operation.
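As a small sketch of what such a library does under the hood, here is one way to pull the charset out of a Content-Type header using the stdlib (the UTF-8 fallback is an assumption for the example; real HTTP default-charset rules are more subtle and vary by media type):

```python
from email.message import Message

def charset_from_content_type(header: str) -> str:
    # Reuse the stdlib's MIME header parser to extract the charset
    # parameter from a Content-Type value.
    msg = Message()
    msg["Content-Type"] = header
    # get_content_charset() returns the charset lowercased, or None
    # if the parameter is absent.
    return msg.get_content_charset() or "utf-8"

assert charset_from_content_type("text/html; charset=ISO-8859-1") == "iso-8859-1"
assert charset_from_content_type("text/html") == "utf-8"
```

Testing this yourself, as the comment suggests, is cheap insurance against a library silently handing you undecoded bytes.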

Handling Unicode text is hard because it's a rare case where you have to get everything right or the results of your program will be undefined. And, there are no "reasonable defaults", so you have to be explicit about everything. Finally, you can't guess about what encoding your data is; all binary data must come with an encoding out-of-band, or your program will break horribly. Proper text manipulation is the ultimate test of "can I write correct software", and it isn't easy.



I agree with most of your points, but disagree that guessing the encoding should never be done. I think that conflicts with the basic robustness principle: "be conservative in what you do, be liberal in what you accept from others".


I personally think being liberal in what you accept from others is the second worst evil in computer science. The worst being null, of course.


I agree. It allows sloppy developers to be liberal in what they do, and leads to increasingly complex (and incompatible) implementations necessary to be compatible with all the edge cases.

HTML is a good example. Browsers are very tolerant of malformed HTML, which is nice for beginners who don't want to worry too much about perfect syntax.

The problem is each browser handles the unspecified cases differently, which leads to differences in the way pages are rendered, security issues like XSS, etc.

Robustness should just be built into the protocol/format/spec, if necessary. HTML5 gets this right by specifying an algorithm that all parsers should use to get consistent behavior, while still being tolerant of imperfect syntax: http://en.wikipedia.org/wiki/Tag_soup#HTML5
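That tolerance is easy to see with Python's stdlib `html.parser` (which is lenient, though it does not implement the full HTML5 tree-construction algorithm); this is just an illustration of a parser accepting tag soup without complaint:

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    # Collect start tags from (possibly malformed) HTML input.
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

p = TagCollector()
p.feed("<b><i>tag soup, never closed")  # malformed, but parsed anyway
assert p.tags == ["b", "i"]
```

Nothing raises, no error is reported; the only question is whether every parser recovers the same way, which is exactly what the HTML5 spec pins down.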


Hey now. If software started validating its input, what would virus writers do for a living?


Then you will also personally produce programs that are broken for the 5/6ths of the world population who happen to use letters outside Latin-1.

There's no way to avoid it unless you wrap it up and add some explicit checks and guesses.
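One way to "wrap it up" with an explicit check and a labeled guess (a sketch; the fallback choice of Latin-1 is an assumption for the example, convenient because it accepts any byte sequence):

```python
def decode_with_fallback(data: bytes) -> str:
    # Try strict UTF-8 first; if that fails, fall back to Latin-1.
    # The fallback is an explicit, documented guess, not silent magic.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")

assert decode_with_fallback("é".encode("utf-8")) == "é"
assert decode_with_fallback(b"\xe9") == "é"  # not valid UTF-8; Latin-1 guess
```

The guess is isolated in one function, so a caller who needs strictness can skip it, and a caller who needs leniency knows exactly what trade-off was made.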


Won't all modern browsers include the encoding in the Content-Type header?

They should. If so there's no need to guess.


It's not just browsers. Browsers are pretty sane when it comes to charsets, because they had the time to get it right and the pressure to do so. (It wasn't like that in the days of NN4/IE4, which would interpret your text as whatever they wanted and wouldn't even let you override it.)

When facing something less rigorously specified (like the dreaded ID3 tags), no such luck.



