Aka, science-fiction. In a language where strings are lists of integers, what can you expect in terms of multibyte strings?
|
|
|___> zilch
That’s okay, it gives me something to do on my idle weekends. I have thus started this project, mb [the shortest meaningful name I could come up with], aimed at providing Unicode support for Erlang, and possibly some sort of support for other encodings. One-byte encodings are actually easy to support; without too much pain I managed to add support for Latin-1 to Latin-10, MacRoman, Codepage 1252 [that’s Windows] and Codepage 437 [that’s DOS]. You can create MBStrings [it’s a tuple, really, but don’t tell anyone] and convert to and from these encodings + UTF-8. This module also digs UTF-16, albeit partially [I did get the surrogates part right, I think].
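To show why one-byte encodings are the easy case, here is a minimal sketch of a Latin-1 to UTF-8 round trip on plain lists of integers. It is not the actual mb code: the module name onebyte_sketch and both function names are made up for illustration. Latin-1 is the simplest of the lot because its byte values are the Unicode code points 0 to 255; the other one-byte encodings would presumably need a lookup table for the bytes that differ.

```erlang
-module(onebyte_sketch).
-export([latin1_to_utf8/1, utf8_to_latin1/1]).

%% Hypothetical helper, not part of mb.
%% Each Latin-1 byte is its own code point, so we only need to
%% UTF-8 encode values in the range 0..255.
latin1_to_utf8(Bytes) ->
    lists:flatmap(fun encode/1, Bytes).

%% Below 128: one byte, unchanged. 128..255: a two-byte sequence.
encode(C) when C >= 0, C < 16#80 ->
    [C];
encode(C) when C >= 16#80, C =< 16#FF ->
    [16#C0 bor (C bsr 6), 16#80 bor (C band 16#3F)].

%% The way back only works for code points up to 255, which in UTF-8
%% means a lead byte of 16#C2 or 16#C3. Anything else crashes with a
%% function_clause error, since it has no Latin-1 equivalent.
utf8_to_latin1([]) ->
    [];
utf8_to_latin1([C | T]) when C < 16#80 ->
    [C | utf8_to_latin1(T)];
utf8_to_latin1([B1, B2 | T])
  when (B1 =:= 16#C2 orelse B1 =:= 16#C3),
       (B2 band 16#C0) =:= 16#80 ->
    [((B1 band 16#1F) bsl 6) bor (B2 band 16#3F) | utf8_to_latin1(T)].
```

For instance, onebyte_sketch:latin1_to_utf8([233]) turns é (byte 233 in Latin-1) into the two bytes [195,169], and utf8_to_latin1 takes them back to [233].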
The -export() attribute is already three lines long; the main features are:
- new
- convertEncoding / oneByte_to_utf8 / utf8_to_oneByte
- split / splitB
- left / leftB ; right / rightB ; mid / midB
- reverse / reverseB
- lowercase / uppercase
- isASCII
- getNextChar
Note the ~B functions, which work at the byte level: they return “strings” [i.e. lists] and not MBString objects. Yes, this is an influence from RB, the only language I know that gives you ZERO pain in handling encodings. And I *do* mean zero. When you work at the byte level [there may be a few good reasons to do that, including speed, if you know you are manipulating an ASCII (7-bit) or one-byte (8-bit) encoded string], whatever comes out can’t be guaranteed multibyte safe. Not 100%. So I treat the output as unsafe and hand over a list of integers. You’re then free to try and convert that, again, to an MBString. Plug and pray.
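The difference between the character-level and byte-level flavours can be sketched like this. Again, this is an illustration under assumptions and not the mb implementation: the names next_char, left and left_b are hypothetical stand-ins, the input is assumed to be a well-formed UTF-8 byte list, and the real functions work on MBString tuples whose layout isn’t shown here.

```erlang
-module(mbleft_sketch).
-export([next_char/1, left/2, left_b/2]).

%% Split one UTF-8 sequence off the front of a byte list.
%% Returns {CharBytes, Rest}. Assumes well-formed UTF-8.
next_char([B | T]) when B < 16#80 ->
    {[B], T};
next_char([B1, B2 | T]) when (B1 band 16#E0) =:= 16#C0 ->
    {[B1, B2], T};
next_char([B1, B2, B3 | T]) when (B1 band 16#F0) =:= 16#E0 ->
    {[B1, B2, B3], T};
next_char([B1, B2, B3, B4 | T]) when (B1 band 16#F8) =:= 16#F0 ->
    {[B1, B2, B3, B4], T}.

%% Character-level left: the bytes of the first N characters.
left(_Bytes, 0) ->
    [];
left([], _N) ->
    [];
left(Bytes, N) when N > 0 ->
    {Char, Rest} = next_char(Bytes),
    Char ++ left(Rest, N - 1).

%% Byte-level left: the first N bytes, whatever they happen to be.
left_b(Bytes, N) ->
    lists:sublist(Bytes, N).
```

With the two Cyrillic letters "\320\237\320\236" (that is, [208,159,208,158]), left(S, 1) gives the complete first letter [208,159], while left_b(S, 1) gives [208], half a character. That is exactly why the byte-level output can’t be trusted to be a valid multibyte string.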
My interest in mb strings is of course more CJKV than KOI-whatever (Russian), Arabic or anything else. So I am first adding the functionalities that interest me [what’s the radical of 寒? how many strokes are there in 龍? and possibly encoding conversions between the big standards, that is, if I find enough info on them]. Some of the functionalities are, of course, cross-language, inasmuch as the concept applies, like lowercase and uppercase. CJK languages have a ‘full-width’ alphabet that is *not* in the ASCII range. Thus, the ordinary and crude algorithm of my youth, back when ASCII rocked, will not work…
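To make the problem concrete, here is the crude algorithm in question, written as a small sketch [the module and function names are made up for the example]. It adds 32 to every byte between $A and $Z and leaves everything else alone.

```erlang
-module(crude_case_sketch).
-export([ascii_lowercase/1]).

%% The classic ASCII-only trick: shift bytes in $A..$Z by 32,
%% pass every other byte through untouched.
ascii_lowercase(Bytes) ->
    [lower_byte(B) || B <- Bytes].

lower_byte(B) when B >= $A, B =< $Z -> B + 32;
lower_byte(B) -> B.
```

Feed it an ASCII A followed by a full-width Ａ (U+FF21, bytes [239,188,161] in UTF-8) and you get [97,239,188,161]: the ASCII letter is lowercased, the full-width one comes out untouched. A correct result would end in [239,189,129], the bytes of the full-width ａ (U+FF41).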
Fortunately, the Unicode project has a lot of info, and the UniHan file has it all, or almost. What I did was extract the relevant case folding data and build a dets database with it. Whenever I need to convert between lower and upper case, I ask the database. Easy as pie. Maybe not as fast as I’d like, but slow is better than zilch.
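The lookup idea can be sketched in a few lines. This is only an illustration under assumptions: the table name, the file name and the handful of pairs below are hand-picked for the example, the sketch works on code points rather than on UTF-8 bytes, and the real database is built from the Unicode data files rather than typed in.

```erlang
-module(casefold_sketch).
-export([demo/0, to_lower/2]).

%% Hypothetical table name and file, not the ones mb uses.
-define(TABLE, casefold_sketch).
-define(DB_FILE, "casefold_sketch.dets").

demo() ->
    {ok, T} = dets:open_file(?TABLE, [{file, ?DB_FILE}]),
    %% {UppercaseCodePoint, LowercaseCodePoint} pairs, a tiny sample.
    ok = dets:insert(T, [{16#0041, 16#0061},    % Latin A
                         {16#041F, 16#043F},    % Cyrillic PE
                         {16#0395, 16#03B5},    % Greek EPSILON
                         {16#24B8, 16#24D2},    % circled C
                         {16#FF21, 16#FF41}]),  % full-width A
    Input = [16#0041, 16#041F, $-, 16#0395, 16#24B8, 16#FF21],
    Result = [to_lower(T, C) || C <- Input],
    ok = dets:close(T),
    ok = file:delete(?DB_FILE),
    Result.

%% Ask the table; code points without an entry are returned unchanged.
to_lower(Table, CodePoint) ->
    case dets:lookup(Table, CodePoint) of
        [{CodePoint, Lower}] -> Lower;
        [] -> CodePoint
    end.
```

Calling casefold_sketch:demo() returns the lowercase code points, with the hyphen passed through as-is since it has no entry in the table. One disk lookup per character is the price, which fits the “slow is better than zilch” trade-off.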
Let’s take this string:
(ascii)ABCDEFG (russian)ПО-РУСКИЙ (greek)ΕΛΛΑΣ (circles)ⒸⒾⓇⒸⓁⒺⓈ
Which in Erlang “translates” into:
U=mb:new("(ascii)ABCDEFG (russian)\320\237\320\236-\320\240\320\243\320\241\320\232\320\230\320\231 (greek)\316\225\316\233\316\233\316\221\316\243 (circles)\342\222\270\342\222\276\342\223\207\342\222\270\342\223\201\342\222\272\342\223\210").
Ugly, I know, but it’s precisely because Erlang is not too good at multibyte strings that I am working on this…
Now,
UL=mb:lowercase(U).
will give back, trust me, the following:
(ascii)abcdefg (russian)\320\277\320\276-\321\200\321\203\321\201\320\272\320\270\320\271 (greek)\316\265\316\273\316\273\316\261\317\203 (circles)\342\223\222\342\223\230\342\223\241\342\223\222\342\223\233\342\223\224\342\223\242
Translated into ‘real’ UTF-8:
(ascii)abcdefg (russian)по-руский (greek)ελλασ (circles)ⓒⓘⓡⓒⓛⓔⓢ
i.e. a properly lowercased string.
More later…