Hacker News

Ignore all character support in the standard library and handle UTF-8 as opaque binary buffers. If you need complex string algorithms, decode into UCS-4 (UTF-32) first. You'll find short encoding and decoding functions on Stack Overflow. For case-insensitive comparisons and sorting, use an external library that tracks the latest Unicode standard.
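For illustration, here is a minimal sketch of the kind of short decoding function meant above, written in Python over raw bytes. The name `decode_utf8` is hypothetical, and the strictness rules (rejecting overlong forms, surrogates, and values above U+10FFFF) are my assumption about what a careful decoder should do, not something taken from a particular Stack Overflow answer.

```python
def decode_utf8(buf: bytes) -> list[int]:
    """Decode UTF-8 bytes into a list of code points (UTF-32 values).

    Raises ValueError on any ill-formed sequence.
    """
    out = []
    i, n = 0, len(buf)
    while i < n:
        b0 = buf[i]
        # Lead byte decides: payload bits, number of continuation
        # bytes, and the smallest code point this length may encode.
        if b0 < 0x80:
            cp, need, lowest = b0, 0, 0
        elif 0xC2 <= b0 <= 0xDF:
            cp, need, lowest = b0 & 0x1F, 1, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            cp, need, lowest = b0 & 0x0F, 2, 0x800
        elif 0xF0 <= b0 <= 0xF4:
            cp, need, lowest = b0 & 0x07, 3, 0x10000
        else:
            raise ValueError(f"invalid lead byte 0x{b0:02X} at offset {i}")

        if i + need >= n:
            raise ValueError(f"truncated sequence at offset {i}")

        for k in range(1, need + 1):
            c = buf[i + k]
            if (c & 0xC0) != 0x80:
                raise ValueError(f"invalid continuation byte at offset {i + k}")
            cp = (cp << 6) | (c & 0x3F)

        # Reject overlong encodings, UTF-16 surrogates, and out-of-range values.
        if cp < lowest or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"ill-formed sequence at offset {i}")

        out.append(cp)
        i += need + 1
    return out
```

Because it raises on any ill-formed input, the same function can double as a validity check: wrap the call in `try`/`except ValueError` and treat an exception as "not valid UTF-8".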


Except that not all binary data is valid UTF-8, so you also need functions that check whether a binary buffer is valid UTF-8.


The decoding phase will do that, if needed. Also note that in many cases you must process the data as opaque binary even though it should be valid UTF-8. This applies in particular to filenames on POSIX systems, because otherwise you could not access any files whose names happen to contain invalid UTF-8.
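To make the filename point concrete, here is a small sketch in Python, whose `os` functions return `bytes` names when given a `bytes` path. The helper name `read_all` is hypothetical. It never decodes a name, so a file called `caf\xe9.txt` (Latin-1 bytes, not valid UTF-8) stays reachable on a filesystem that allows such names.

```python
import os


def read_all(directory: bytes) -> dict[bytes, bytes]:
    """Map each regular file's raw name to its contents.

    Names are kept as bytes end to end; nothing is decoded.
    """
    contents = {}
    for name in os.listdir(directory):  # bytes in, bytes out
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, "rb") as f:
                contents[name] = f.read()
    return contents
```

Decoding would only be needed at the point where a name is shown to a user, and even there a lossy or escaped rendering is enough, as long as the original bytes are what gets passed back to the OS.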



