Archive for the 'mb' Category

08/21 mb update

I decided that the only way to make mb faster was to change the underlying structure from {Encoding::atom(), String::list()} to {Encoding::atom(), String::binary()}, which implied a thorough overhaul of the code. I am almost done, although still fighting with some issues. Still, preliminary tests tend to show that the change was worthwhile: mb:reset() – which loads the encodings-related text files into dets tables – is easily faster by half. Only one test in the test suite passes so far, and it too runs noticeably faster than the original.

As a side note, I have discovered something puzzling. Say you have a variable Code1 which contains the integer value 0x2121. I was expecting that doing <<Code1>> would yield <<33,33>>. Nopesky. It yields <<"!">>, because a segment without an explicit size defaults to 8 bits and only the lowest byte is kept. You have to do <<Code1:16>> – and hope the integer is not greater than 65535: if your integer was, say, 0x012345, <<Code1:16>> would yield <<35,69>>, silently dropping the top byte. Ah well…
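To make the truncation concrete, here is a minimal sketch. The module and function names are made up for illustration and are not part of mb; binary:encode_unsigned/1 is a standard library call in recent OTP releases and was not around in the Erlang version used in these posts.

```erlang
-module(int_to_bin_sketch).
-export([default_size/1, two_bytes/1, all_bytes/1]).

%% Hypothetical helpers, for illustration only (not part of mb).

%% A segment without a size is 8 bits wide: only the lowest byte survives.
default_size(Code) -> <<Code>>.

%% An explicit 16-bit segment: anything above 16#FFFF is silently truncated.
two_bytes(Code) -> <<Code:16>>.

%% As many bytes as the integer needs (big-endian), so nothing is lost.
all_bytes(Code) when is_integer(Code), Code >= 0 ->
    binary:encode_unsigned(Code).
```

With 16#2121, default_size/1 keeps a single byte while two_bytes/1 keeps both; with 16#012345, only all_bytes/1 keeps all three.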

I am maintaining the source code with git. It is quite pleasant to use, although I have yet to manage a clean push to the public repository. So nothing is published yet, except partial docs. MNK, who gave me a FreeBSD jail to play with, and is hosting the whole thing, is playing with a git to mercurial bridge. One day we’ll probably have a <your_scm> to mb repository bridge. Maybe.

Erlang

07/10 X-Encodings in Erlang/mb

Been hard at it. But I hit a snag when encountering a plain, innocuous-looking sinogram: 內. Pretty harmless, right? D’uh. Big time. Because in Japanese it’s 内, not 內. Friggin’ variants. 0x5167 vs 0x5185 [If you don’t know what I am talking about, it’s okay. In this case, ignorance is bliss!]. Can I slap someone – preferably not me? Say hello to hasVariant(X) who just joined us. Nyessss…. Anyway, all fixed now – or rather, for now!
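For the curious, a variant check could look something like the sketch below. This is not the actual hasVariant(X) from mb, whose internals are not shown here; the module, the function names and the hard-coded one-entry table are assumptions, and the only data in it is the pair of code points mentioned above.

```erlang
-module(variant_sketch).
-export([has_variant/1, variant_of/1]).

%% Hypothetical variant table; a real one would be loaded from a data file.
variants() -> [{16#5167, 16#5185}].

%% Look the code point up on either side of a pair.
variant_of(Code) ->
    case lists:keyfind(Code, 1, variants()) of
        {Code, Right} ->
            {ok, Right};
        false ->
            case lists:keyfind(Code, 2, variants()) of
                {Left, Code} -> {ok, Left};
                false -> none
            end
    end.

has_variant(Code) -> variant_of(Code) =/= none.
```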

A couple of screenshots is worth 1,000 words!

Here’s a screenshot of the source of KanjiTest.html, in SEE. It’s a cross-table of a few sinograms for a slew of encodings, showing the respective code points for each character in all the encodings. The page itself is in utf-8.
[Screenshot: KanjiTest.html]

Here’s a screenshot of the source of KanjiTestUTF8.html. An extract of the former table, if you will.
[Screenshot: KanjiTestUTF8.html]

Look, Mom, with only one hand: a UTF-16 encoded page showing a table of sinograms, in UTF-16 of course.
[Screenshot: KanjiTestUTF16.html]

Ooooh, lookit, no hands! Same sinograms, in Shift-JIS. Damn, I rock!
[Screenshot: KanjiTestSJ.html]

It’s not exactly rosy. The code’s a deluxe candidate for refactoring – read, it’s a mess – but it is written in a way that can easily handle as many new encodings as you can throw at me, provided you give me a UTF-8/16 to said encoding cross-ref file. The whole yahzoo [case folding data, Big5, CCCII, Shift-JIS, EUC-KR] is stored in dets tables – UTF-16⇔UTF-8 is an algorithm, thus a tad faster. Because this thing, it’s nice to have, but not exactly a Ferrari. The test that produces these tables, it runs in, ahem, er…, 14 seconds? Don’t reach for your gun right now though, because:
A. I am going to work on speed when functionalities are all in and tested
B. Try to do that right now in Erlang :D – good luck!

So I guess slow is better than nothing. But we’ll work on speed.
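Why UTF-16⇔UTF-8 needs no table: both are just different byte layouts of the same code points. Here is a minimal sketch using the utf8/utf16 segment types of the bit syntax, which exist in recent Erlang releases (they were not available in the version used in these posts, so this is not how mb does it). The module and function names are made up, and big-endian UTF-16 without a BOM is assumed.

```erlang
-module(utf_sketch).
-export([utf16be_to_utf8/1, utf8_to_utf16be/1]).

%% Hypothetical helpers, for illustration only (not the mb functions).
%% Decode each code point from one encoding and re-encode it in the other.
%% Surrogate pairs are handled by the utf16 segment type itself.
%% Note: decoding silently stops at the first malformed sequence.
utf16be_to_utf8(Bin) when is_binary(Bin) ->
    << <<C/utf8>> || <<C/utf16>> <= Bin >>.

utf8_to_utf16be(Bin) when is_binary(Bin) ->
    << <<C/utf16>> || <<C/utf8>> <= Bin >>.
```

If you would rather get an error than a truncated result on bad input, unicode:characters_to_binary/3 from the standard library does the same conversion and reports where it failed.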

Erlang

07/08 Erlang multi-byte module, continued

There are still many dusty corners, and some major refactoring to do, but it is going well, very well indeed.

Erlang

07/06 Erlang Multibyte Module

900 lines of code later, I got this:

Eshell V5.4.3  (abort with ^G)
1> c("src/mb.erl",[{outdir,"ebin/"},nowarn_unused_function, nowarn_unused_vars]).
{ok,mb}
2> mb:
bocu/1              charToInt/1         convert10/1
convert16/1         convert2/1          convertEncoding/2
filter/2            filterB/2           format/1
getNextChar/1       getNextCharAsInt/1  hasProcess/1
hasTable/1          inStr/2             init/0
isASCII/1           join/2              kangxi/1
left/2              leftB/2             len/1
lenB/1              longSplit/2         lowercase/1
mid/2               mid/3               midB/2
midB/3              module_info/0       module_info/1
new/0               new/1               new/2
new/3               oneByte_to_utf8/1   print/2
replace/3           replaceAll/3        reset/0
reverse/1           reverseB/1          right/2
split/2             surrogate/1         uppercase/1
utf16_to_utf8/1     utf8_to_oneByte/2   utf8_to_utf16/1

There’s actually a bit more, but not yet activated/complete. Been doing quite a bit of bug fixing too… The next big chunk will be CJK encodings ⇔ UTF. I foresee lots of fun here… But at least thanks to the mass of info in the Unihan database, I got my work all lined up: extract BigFive and CCCII data into a file, have the init() script parse it and store that into a dets database. Like I did for case folding – except that the case folding data comes from the aptly named CaseFolding.txt file. Very handy nonetheless.
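As an illustration of the parsing step, here is a rough sketch that turns one line of CaseFolding.txt (fields look like "0041; C; 0061; # LATIN CAPITAL LETTER A") into a tuple ready to be stored. It is not the parser from mb: the module and function names are invented, and it leans on string:split/3, string:trim/1 and string:lexemes/2, which only exist in much more recent OTP releases than the one shown above.

```erlang
-module(casefold_sketch).
-export([parse_line/1]).

%% Hypothetical parser for one line of CaseFolding.txt.
%% Returns {ok, {Code, Status, Mapping}} or skip for comments/blank lines.
parse_line(Line) ->
    %% Drop the trailing "# ..." comment, then split the ';'-separated fields.
    [Data | _] = string:split(Line, "#"),
    case [string:trim(F) || F <- string:split(Data, ";", all)] of
        [Code, Status, Mapping | _] when Code =/= "" ->
            {ok, {list_to_integer(Code, 16),
                  Status,
                  %% Full foldings ("F") map to several code points.
                  [list_to_integer(M, 16) || M <- string:lexemes(Mapping, " ")]}};
        _ ->
            skip
    end.
```

Each {Code, Status, Mapping} tuple could then go straight into a dets table keyed on Code.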

More later.

Erlang

07/03 Multibyte strings in Erlang

Aka, science-fiction. In a language where strings are lists of integers, what can you expect in terms of multibyte strings?
|
|
|___> zilch

That’s okay, gives me something to do for my idle weekends. I have thus started this project, mb [the shortest while meaningful name I could come up with], aimed at providing Unicode support to Erlang – and possibly some sort of support for other encodings. One-byte encodings are actually easy to support; without too much pain I managed to add support for latin-1 to latin-10, MacRoman, Codepage 1252 [that’s Windows] and Codepage 437 [that’s DOS]. You can create MBStrings [it’s a tuple, really, but don’t tell anyone] and convert to and from these encodings + utf-8. This module also digs utf-16, albeit partially [I did get the surrogates part right, I think].
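To show why one-byte encodings are the easy case, here is a sketch for latin-1 only, where each byte value is already the Unicode code point, so the conversion to utf-8 is pure arithmetic. The module and function names are hypothetical and this is not the oneByte_to_utf8 code from mb; for the other one-byte encodings you would first map each byte to its code point through a 256-entry table.

```erlang
-module(onebyte_sketch).
-export([latin1_to_utf8/1]).

%% Hypothetical converter: a latin-1 string (list of bytes) to utf-8 bytes.
latin1_to_utf8(Bytes) when is_list(Bytes) ->
    lists:flatmap(fun encode/1, Bytes).

%% ASCII stays as one byte; 16#80..16#FF becomes a two-byte sequence.
encode(C) when C >= 0, C < 16#80 ->
    [C];
encode(C) when C >= 16#80, C < 16#100 ->
    [16#C0 bor (C bsr 6), 16#80 bor (C band 16#3F)].
```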

The -export() attribute is already three lines long. The main features are:

  • new
  • convertEncoding / oneByte_to_utf8 / utf8_to_oneByte
  • split / splitB
  • left / leftB ; right / rightB ; mid / midB
  • reverse / reverseB
  • lowercase / uppercase
  • isASCII
  • getNextChar

Note the ~B commands that work on the byte-level – they return “strings” [ie lists] and not MBString objects. Yes, this is an influence from RB, the only language I know that gives you ZERO pain in handling encodings. And I *do* mean zero. When you work at the byte level [there may be a few good reasons to do that, including speed, if you know you are manipulating an ASCII (7-bit) or one-byte (8-bit) encoded string], whatever comes out can’t be multibyte safe. Not 100%. So I reject the output as not safe and hand over a list of integers. You’re then free to try and convert that – again – to an MBString. Plug and pray :D
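A quick sketch of why byte-level output can’t be trusted as multibyte text, using reverse as the example. The names are made up and this is not the mb implementation; it uses the utf8 segment type from recent Erlang releases and takes a utf-8 binary as input.

```erlang
-module(reverse_sketch).
-export([reverse_chars/1, reverse_bytes/1]).

%% Character-level reverse: decode code points, reverse them, re-encode.
%% The result is still valid utf-8.
reverse_chars(Utf8) when is_binary(Utf8) ->
    Chars = [C || <<C/utf8>> <= Utf8],
    << <<C/utf8>> || C <- lists:reverse(Chars) >>.

%% Byte-level reverse: fast, but multibyte sequences come out backwards,
%% so the result is handed back as a plain list of integers.
reverse_bytes(Utf8) when is_binary(Utf8) ->
    lists:reverse(binary_to_list(Utf8)).
```

Feed both the bytes for "aП" (97, 208, 159): the character version gives the bytes of "Пa", while the byte version gives 159, 208, 97, which is no longer valid utf-8.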

My interest in mb strings is of course more CJKV than koi-whatever (Russian) or Arabic or anything else. So I am first adding the functionalities that interest me [what’s the radical of a given sinogram? how many strokes does it have? and possibly encoding conversions between the big standards – that is, if I find enough info on them]. Some of the functionalities are – of course – cross-language, inasmuch as the concept applies, like lowercase and uppercase. CJK languages have a ‘full-width’ alphabet that is *not* in the ASCII range. Thus, the ordinary and crude algorithm of my youth, back when ASCII rocked, will not work…
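Here is the crude algorithm next to a full-width-aware one, as a small sketch working on code points. The names are hypothetical, and the hard-coded full-width range (U+FF21 to U+FF3A for the capitals) is only a stand-in for a proper case folding lookup.

```erlang
-module(fullwidth_sketch).
-export([ascii_lower/1, lower/1]).

%% The crude ASCII-only algorithm: add 32 to anything in A..Z.
ascii_lower(C) when C >= $A, C =< $Z -> C + 32;
ascii_lower(C) -> C.

%% Also handles the full-width capitals, which sit 16#20 below
%% their lowercase counterparts (U+FF41 to U+FF5A).
lower(C) when C >= 16#FF21, C =< 16#FF3A -> C + 16#20;
lower(C) -> ascii_lower(C).
```

ascii_lower/1 leaves the full-width A (16#FF21) untouched, while lower/1 maps it to 16#FF41.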

Fortunately, the Unicode project has a lot of info, and the Unihan file has it all – almost. What I did is extract the relevant case folding data, and I built a dets database with it. Whenever I need to convert between lower and upper case, I ask the database. Easy as pie. Maybe not as fast as I’d like, but slow is better than zilch.
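The "ask the database" idea can be sketched as below. The table name, the file name and the two sample mappings are assumptions made for this example; the real tables in mb are built by its init/reset code from the Unicode data files.

```erlang
-module(dets_sketch).
-export([demo/0]).

%% Hypothetical demo: store upper -> lower code point pairs in a dets
%% table, then lowercase a few code points by looking them up.
demo() ->
    {ok, T} = dets:open_file(casefold_demo,
                             [{file, "casefold_demo.dets"}, {type, set}]),
    %% Two sample entries: Cyrillic PE and Greek EPSILON.
    ok = dets:insert(T, [{16#41F, 16#43F}, {16#395, 16#3B5}]),
    Lower = fun(C) ->
                    case dets:lookup(T, C) of
                        [{C, L}] -> L;   % mapping found
                        []       -> C    % no entry: leave unchanged
                    end
            end,
    Result = [Lower(C) || C <- [16#41F, 16#395, $-]],
    ok = dets:close(T),
    Result.
```

Every lookup goes to the table on disk, which is part of why this approach is convenient but not fast.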

Let’s take this string:
(ascii)ABCDEFG (russian)ПО-РУСКИЙ (greek)ΕΛΛΑΣ (circles)ⒸⒾⓇⒸⓁⒺⓈ
Which in Erlang “translates” into:
U=mb:new("(ascii)ABCDEFG (russian)\320\237\320\236-\320\240\320\243\320\241\320\232\320\230\320\231 (greek)\316\225\316\233\316\233\316\221\316\243 (circles)\342\222\270\342\222\276\342\223\207\342\222\270\342\223\201\342\222\272\342\223\210").

Ugly, I know, but it’s precisely because Erlang is not too good at multibyte strings that I am working on this…

Now,

UL=mb:lowercase(U).
will give back – trust me – the following:

(ascii)abcdefg (russian)\320\277\320\276-\321\200\321\203\321\201\320\272\320\270\320\271 (greek)\316\265\316\273\316\273\316\261\317\203 (circles)\342\223\222\342\223\230\342\223\241\342\223\222\342\223\233\342\223\224\342\223\242

Translated into ‘real’ UTF-8:

(ascii)abcdefg (russian)по-руский (greek)ελλασ (circles)ⓒⓘⓡⓒⓛⓔⓢ

i.e. a properly lowercased string.

More later…

Erlang