Been hard at it. But I hit a snag when encountering a plain, innocuous-looking sinogram: 內. Pretty harmless, right? D’uh. Big time. Because in Japanese it’s 内, not 內. Friggin’ variants. 0x5167 vs 0x5185 [if you don't know what I am talking about, it's okay. In this case, ignorance is bliss!]. Can I slap someone, preferably not me? Say hello to hasVariant(X), who just joined us. Nyessss…. Anyway, all fixed now. Or rather, for now!
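For the curious, the idea behind a variant check can be sketched roughly like this. It is an illustration in Python, not the actual Erlang code: the `VARIANTS` table and the names `has_variant`, `can_encode` and `encode_with_variants` are all made up for the example, and the table holds only the one pair mentioned above.

```python
# Hypothetical variant table: U+5167 and U+5185 are variants of each other.
# A real table would hold many more pairs; this one is for illustration only.
VARIANTS = {
    "\u5167": "\u5185",
    "\u5185": "\u5167",
}


def has_variant(ch):
    """Return True if the character has a known variant form."""
    return ch in VARIANTS


def can_encode(ch, encoding):
    """Return True if the target encoding has a mapping for this character."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def encode_with_variants(text, encoding):
    """Encode text, swapping in a variant when the original has no mapping."""
    chars = []
    for ch in text:
        if (not can_encode(ch, encoding)
                and has_variant(ch)
                and can_encode(VARIANTS[ch], encoding)):
            ch = VARIANTS[ch]
        chars.append(ch)
    # Still raises UnicodeEncodeError if a character has no usable variant.
    return "".join(chars).encode(encoding)
```

The substitution only kicks in when the target encoding has no mapping for the original character, so text that already converts cleanly is left alone.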
A couple of screenshots are worth 1,000 words!
Here’s a screenshot of the source of KanjiTest.html, in SEE. It’s a cross-table of a few sinograms against a slew of encodings, showing each character’s code point in every encoding. The page itself is in UTF-8.
Here’s a screenshot of the source of KanjiTestUTF8.html. An extract of the former table, if you will.
Look, Mom, with only one hand: a UTF-16 encoded page showing a table of sinograms, in UTF-16 of course.
Ooooh, lookit, no hands! Same sinograms, in Shift-JIS. Damn, I rock!
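If you want to play along at home, a cross-table like the one in KanjiTest.html can be approximated with a few lines of Python. This is only a sketch using Python's built-in codecs, not the Erlang code behind the screenshots. The names `ENCODINGS`, `hex_bytes` and `cross_table` are invented for the example, and CCCII is left out because Python ships no codec for it.

```python
# Encodings to tabulate, using Python's standard codec names.
ENCODINGS = ["utf-8", "utf-16-be", "shift_jis", "big5", "euc_kr"]


def hex_bytes(ch, encoding):
    """Return the character's bytes in the encoding as hex, or 'n/a'."""
    try:
        return ch.encode(encoding).hex().upper()
    except UnicodeEncodeError:
        return "n/a"  # the encoding has no mapping for this character


def cross_table(chars, encodings=ENCODINGS):
    """Map each character to a row of {encoding: hex bytes}."""
    return {ch: {enc: hex_bytes(ch, enc) for enc in encodings} for ch in chars}


if __name__ == "__main__":
    # U+5167 and U+5185 are the two variants discussed in this post.
    for ch, row in cross_table("\u5167\u5185").items():
        cells = "  ".join(f"{enc}={val}" for enc, val in row.items())
        print(f"U+{ord(ch):04X}  {cells}")
```

Cells that come out as `n/a` are exactly the spots where a variant lookup earns its keep.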
It’s not exactly rosy. The code’s a deluxe candidate for refactoring (read: it’s a mess), but it is written in a way that can easily handle as many new encodings as you can throw at it, provided you give me a cross-ref file from UTF-8/16 to said encoding. The whole yahzoo [case folding data, Big5, CCCII, Shift-JIS, EUC-KR] is stored in dets tables. UTF-16⇔UTF-8 is an algorithm rather than a table lookup, thus a tad faster. Because this thing is nice to have, but not exactly a Ferrari. The test that produces these tables runs in, ahem, er…, 14 seconds? Don’t reach for your gun right now though, because:
A. I am going to work on speed once all the functionality is in and tested.
B. Try to do that right now in Erlang. Good luck!
So I guess slow is better than nothing. But we’ll work on speed.
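As for why UTF-16⇔UTF-8 needs no lookup table: both are just different ways of packing the same code point, so the conversion is pure bit shuffling. Here is a minimal sketch of the UTF-16 to UTF-8 direction, in Python rather than Erlang. The function names are hypothetical and this is the textbook algorithm, not a copy of the code in the library.

```python
def utf16_units_to_code_points(units):
    """Turn a sequence of 16-bit code units into Unicode code points."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High surrogate followed by low surrogate: one code point above U+FFFF.
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            raise ValueError(f"unpaired surrogate 0x{u:04X} at index {i}")
        else:
            yield u
            i += 1


def code_point_to_utf8(cp):
    """Pack one code point into 1 to 4 UTF-8 bytes."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_to_utf8(units):
    """Convert UTF-16 code units straight to UTF-8 bytes, no tables involved."""
    return b"".join(code_point_to_utf8(cp)
                    for cp in utf16_units_to_code_points(units))
```

For example, `utf16_to_utf8([0x5185])` gives the three bytes `E5 86 85`. The legacy encodings (Big5, Shift-JIS and friends) have no such arithmetic relationship to Unicode, which is why they need a cross-ref table.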