Currently known UTF-8 problems in Ubuntu
A lot of web pages, like http://se.php.net/manual/sv/introduction.php, use the ISO-8859-1 charset for international/special characters without specifying any charset (which can be done in the <meta http-equiv="content-type" ... > tag). Maybe ISO-8859-1 should be the default here?
UTFEightByDefault does not mean using it everywhere. In places where the encoding is specified unambiguously, other encodings are OK. RFC 2616 defines ISO-8859-1 as the default encoding for text content served over HTTP, so the current situation appears OK.
- The HTML4 spec, however, recommends not relying on this default, so adding a meta tag would be nice.
- If HTML pages are translated to UTF-8 a meta tag would be REQUIRED.
- Another alternative is to use numeric entity references for all non-ASCII characters.
HTMLTidy is your friend! http://www.w3.org/People/Raggett/tidy/
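The entity-reference alternative above can be sketched in a few lines. This is a minimal Python illustration, not a full HTML escaper — note it deliberately leaves &, < and > alone:

```python
# Minimal sketch: replace every non-ASCII character with a numeric
# character reference, so the page displays correctly whatever charset
# the server declares (or fails to declare). Does NOT escape &, < or >.
def to_numeric_entities(text):
    return "".join(c if ord(c) < 128 else "&#%d;" % ord(c) for c in text)

print(to_numeric_entities("förstå"))  # f&#246;rst&#229;
```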
- Midnight Commander doesn't work well in UTF-8 locales. The development team's roadmap considers fixing this in the upcoming version 1.7.0, which will be developed after the 1.6.1 bugfix release. Patches exist already, so a 1.7.0-pre1 could be out soon after the 1.6.1 release.
- LaTeX is capable of handling UTF-8 via \usepackage[utf8]{inputenc}; however, this is apparently not enough for languages like Greek and Japanese. Solutions are welcome.
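For reference, a minimal document using the inputenc mechanism looks like this (a sketch for Western European text; Greek and CJK additionally need suitable language/font packages such as babel or CJK, as noted above):

```latex
\documentclass{article}
\usepackage[utf8]{inputenc} % interpret the source file as UTF-8
\usepackage[T1]{fontenc}    % font encoding covering Western European glyphs
\begin{document}
Svenska tecken: åäö
\end{document}
```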
GnuCash does not support UTF-8.
- GTK+ 1.x Unicode capabilities are very limited; GTK+ 2 is a better choice. However, several applications still use GTK+ 1, so these will have to be ported or replaced by a better solution.
XMMS can be replaced by BMPx (http://beep-media-player.org/index.php/BMPx_Homepage). Though still in development, it makes use of GStreamer/Xine, so it supports many media formats. Beep Media Player itself is superseded by Audacious (http://audacious-media-player.org/).
GnuCash (even though in universe) does have a development branch for Gnome 2.x compatibility, but it's nowhere near ready. The project needs manpower for that!
- No available MP3 tagging utility handles Unicode in ID3v2 tags correctly (ID3v1 has no Unicode capabilities at all). Many of them accept UTF-8 text, but the encoding byte is not set. The problem seems to originate in id3lib's incomplete Unicode support; the solution is to port the tools from id3lib to mplib or taglib. As an interim measure before porting is done, one can consider setting the GST_ID3_TAG_ENCODING environment variable for non-ISO-8859-1 locales. This setting, though hackish, does help a lot of people for now.
eyeD3 (http://eyed3.nicfit.net/) can work with Unicode in tags. It's a command-line tool written in Python.
- Please note that to use UTF-8 in MP3 tags, you need ID3v2.4. Earlier versions, like ID3v2.3, can only use UTF-16 as a Unicode-based encoding.
I've just written a patch to let LAME handle Unicode in tags. You can find it under "Patches" here: http://sourceforge.net/projects/lame. The patch doesn't use UTF-8 in the tags; it uses UCS-2 (not UTF-16), which is supported at least as far back as ID3v2.2.
- What's wrong with UTF-16? What's really important is which encoding is used in MP3s "out in the wild". If they use UTF-16, then that's the encoding that should be used, not UTF-8. Does anyone know the most common *unambiguous* encoding? (The most common encoding is probably just the codepage of the machine the files were created on, without any declaration of what it is...)
- No way should UTF-16 be used. ID3v2.3 is now an informal standard (and is about to be superseded), while ID3v2.4 is the current one. By the way, the most important point is that UCS-2, the encoding ID3v2.3 actually specifies, can't represent any characters beyond the Basic Multilingual Plane.
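To make the "encoding byte is not set" problem above concrete, here is a hedged sketch of how an ID3v2 text frame body marks its encoding (frame headers and the overall tag layout are omitted; the byte values follow the informal ID3v2 specs):

```python
# The first byte of an ID3v2 text frame body selects the text encoding:
#   0x00 = ISO-8859-1, 0x01 = UTF-16 with BOM (UCS-2 in v2.2/v2.3),
#   0x03 = UTF-8 (valid in ID3v2.4 only).
# The bug discussed above is writing UTF-8 bytes while leaving this
# byte at 0x00, so readers decode the text as ISO-8859-1 mojibake.
ENCODINGS = {0x00: "latin-1", 0x01: "utf-16", 0x03: "utf-8"}

def text_frame_body(text, encoding_byte):
    return bytes([encoding_byte]) + text.encode(ENCODINGS[encoding_byte])

body = text_frame_body("Björk", 0x03)
print(body)  # b'\x03Bj\xc3\xb6rk'
```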
- ispell is not locale-aware, so, for example, I cannot use ilithuanian from Hoary universe (which contains a dictionary in ISO-8859-13) to check the spelling of Lithuanian texts in UTF-8.
aspell 0.50 does not support UTF-8 (quote: "aspell 0.50 completely fails to even check the non-UTF-8 parts"). aspell 0.60 does support it. The status of aspell 0.60 in Debian is described at http://bugs.debian.org/274514.
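Until the spell checkers handle UTF-8 natively, a word list can be recoded up front. This is only a hedged sketch — the file names are hypothetical, and whether ispell accepts the recoded result depends on its affix files:

```python
# Recode a word list from ISO-8859-13 (the encoding of the Lithuanian
# dictionary mentioned above) to UTF-8. Roughly equivalent to:
#   iconv -f ISO-8859-13 -t UTF-8 words.txt > words.utf8.txt
def recode(src, dst, from_enc="iso8859-13", to_enc="utf-8"):
    with open(src, encoding=from_enc) as f_in, \
         open(dst, "w", encoding=to_enc) as f_out:
        f_out.write(f_in.read())
```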
I don't know if it's come up at all, but... UTF-8 kind of sucks if you're using a non-Latin-1 script. It handles Unicode characters less than 255 as single characters, at the expense of all the other characters above that. So speakers of Greek, Russian, Chinese, Korean, Japanese, most other Asian languages, Arabic, Hebrew, etc. etc. all get screwX0r'd when they get forced to use UTF-8 instead of Unicode. IWBNI there was support for regular-old uncompressed unicode for those languages that don't use Latin-1 as their default charset. --EvanProdromou
- Not quite. UTF-8 handles Unicode characters less than 128 as single bytes, and Unicode characters less than 65536 as no more than 3 bytes. So Greek, Russian, Arabic and Hebrew are slightly shorter in UTF-8 than in UCS-2 or UTF-16, and Chinese, Korean and Japanese are less than 50% longer (characters above 65535 are rare even in Chinese).
This is groundbreaking news. I haven't heard any claim that Unicode is better than UTF-8 from Asian-language users. If there is any complaint, it is the growing pain: the switch from legacy encodings to UTF-8 is troublesome for some people. --AbelCheung4
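The byte counts in the correction above are easy to verify directly (a small Python illustration; utf-16-be is used only so the BOM doesn't inflate the count):

```python
# Compare encoded sizes for one character from each script: ASCII is
# 1 byte in UTF-8, Greek/Cyrillic are 2, CJK ideographs are 3, while
# UTF-16 spends 2 bytes on every BMP character.
for ch, script in [("a", "ASCII"), ("Ω", "Greek"), ("Я", "Cyrillic"), ("中", "CJK")]:
    print(script, len(ch.encode("utf-8")), len(ch.encode("utf-16-be")))
```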