Archive for March, 2006

03/31 Tomabaem online light

On the sidebar you’ll find a small form that accepts one Chinese character, aka sinogram, and returns, in the box below it, the readings [Cantonese, Mandarin, Japanese, Korean and Vietnamese] and meanings for that sinogram. This stuff was pulled from the UniHan database, as explained on the Tomabaem main page. Tomabaem Online is a derivative of the desktop app, and is built on indexes of the database that are made offline. Then, a little Python [thanks to 2.4’s unicode codecs, conversion from UTF-16 to UTF-8 is seamless] and grep later, and we have a winner!

Which reminds me, I haven’t updated the indexes for a while; I shall do that now. I wrote an RB app that runs 11 threads [must. resist. temptation. to. rewrite. it. in. Erlang.]; the main routine reads the UniHan db and sends each line to a thread depending on its content. Here’s a sample:

U+3400 kCantonese jau1
U+3400 kDefinition (same as U+4E18 丘) hillock or mound
U+3400 kHanYu 10015.030
U+3400 kIRGHanyuDaZidian 10015.030
U+3400 kIRGKangXi 0078.010
U+3400 kIRG_GSource KX
U+3400 kIRG_JSource A-2121
U+3400 kIRG_TSource 6-222C
U+3400 kMandarin QIU1
U+3400 kRSUnicode 1.4
U+3400 kSemanticVariant U+4E18
U+3400 kTotalStrokes 5
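
For illustration only, here is how one of these lines could be pulled apart in Python. This is a sketch, not the RB code: the names `UnihanEntry`, `parse_line` and `key_for` are made up for the example, and it assumes the three fields are separated by whitespace exactly as shown in the sample above.

```python
from typing import NamedTuple


class UnihanEntry(NamedTuple):
    codepoint: int   # e.g. 0x3400
    tag: str         # e.g. "kCantonese"
    value: str       # e.g. "jau1"


def parse_line(line: str) -> UnihanEntry:
    """Split 'U+3400 kCantonese jau1' into codepoint, tag and value."""
    # maxsplit=2 keeps spaces inside the value (kDefinition has plenty).
    key, tag, value = line.strip().split(None, 2)
    if not key.startswith("U+"):
        raise ValueError(f"not a codepoint key: {key!r}")
    return UnihanEntry(int(key[2:], 16), tag, value)


def key_for(char: str) -> str:
    """Build the 'U+XXXX' key for one character, as a lookup form might."""
    return f"U+{ord(char):04X}"


entry = parse_line("U+3400 kCantonese jau1")
print(hex(entry.codepoint), entry.tag, entry.value)  # 0x3400 kCantonese jau1
print(key_for("丘"))                                  # U+4E18
```

Going from a typed character to its `U+XXXX` key is all a lookup needs before searching an index for matching lines.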

U+3400 gives me the sinogram’s codepoint, and depending on the kXXXXX tag, I pass each line to a thread storing the data in a separate file. The whole process takes a little over two and a half minutes, and I guess I could make it faster. But 160 seconds for 25MB of data is fast enough. So now the indexes are up to date :-) I need to update Tomabaem’s own db, will do that later.
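
The dispatch pattern itself is simple enough to sketch in Python (again, not the RB app, just the same idea: one queue and one thread per tag). Assumptions, loudly: the `TAGS` list is a hypothetical subset, lines whose tag has no worker are simply skipped, and the workers collect rows in memory where the real thing writes an .idx file.

```python
import queue
import threading

# Hypothetical subset of the tags that get their own index.
TAGS = ["kCantonese", "kMandarin", "kDefinition", "kTotalStrokes"]


def worker(tag, inbox, results):
    """Collect 'codepoint<TAB>value' rows for one tag until None arrives."""
    rows = []
    while True:
        item = inbox.get()
        if item is None:          # sentinel: the dispatcher is done
            break
        rows.append("\t".join(item))
    results[tag] = rows           # a real indexer would save a file here


def dispatch(lines, tags=TAGS):
    """Send each line to the worker thread that owns its tag."""
    inboxes = {tag: queue.Queue() for tag in tags}
    results = {}
    threads = [threading.Thread(target=worker, args=(t, inboxes[t], results))
               for t in tags]
    for th in threads:
        th.start()
    for line in lines:
        parts = line.split(None, 2)
        # Assumption: lines with an unknown tag or a missing value are skipped.
        if len(parts) == 3 and parts[1] in inboxes:
            inboxes[parts[1]].put((parts[0], parts[2]))
    for box in inboxes.values():
        box.put(None)             # tell every worker to finish
    for th in threads:
        th.join()
    return results


sample = [
    "U+3400 kCantonese jau1",
    "U+3400 kHanYu 10015.030",
    "U+3400 kMandarin QIU1",
    "U+3400 kTotalStrokes 5",
]
index = dispatch(sample)
print(index["kMandarin"])  # ['U+3400\tQIU1']
```

The main loop does all the reading and splitting, and the workers only append, so a design like this tends to be bound by the dispatcher rather than by the threads.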

dda> time ./threadIndexer/threadIndexer
Thread kSimplifiedVariant starting...
Thread kTraditionalVariant starting...
Thread kCantonese starting...
Thread kJapaneseOn starting...
Thread kJapaneseKun starting...
Thread kKorean starting...
Thread kMandarin starting...
Thread kVietnamese starting...
Thread kRSKangXi starting...
Thread kTotalStrokes starting...
Thread kDefinition starting...
Done dispatching in 150,204,884µs
Saving kSimplifiedVariant.idx
Saving [...]
Thread kSimplifiedVariant finished...
Thread [...] finished...

real    2m39.733s
user    1m54.760s
sys     0m6.600s

So. Okay, most of the time [150 of the 160 seconds] is spent dispatching. I could improve that…

03/26 Stanford CS Education Library

Not everyone is lucky enough to enjoy a posh education at Stanford, but that shouldn’t prevent you from going to their Stanford CS Education Library and working through the material there. Pretty basic but good stuff. I am currently dealing with Binary Trees (with a focus on C), and I am using this as Erlang tutorials, the DIY variant… There’s probably a good implementation of a binary tree [in the Chapter about Tuples of the Erlang doc], but I didn’t cheat and did it all on my own :D


Binary Search Tree

I now have a semi-functioning implementation of a binary search tree [a last minute “upgrade” of the code borked the lookup function, sigh]. It can, so far: create nodes and add them to a parent, left or right; look up a node (well, when I fix the code, that is…); calculate the maximum depth, the minimum and maximum values, and the overall count of nodes; and mirror the tree – for some reason that one took me more than 5 minutes, must be tired…
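
For reference, the operations listed above can be sketched in a few lines of Python. This is not my Erlang code, just the same set of functions in a tuple-based style: a node is `(value, left, right)` and an empty tree is `None`. Sending duplicates to the left is an assumption of the sketch, and all function names are made up for the example.

```python
def insert(tree, value):
    """Return a new tree with value added (duplicates go left)."""
    if tree is None:
        return (value, None, None)
    v, left, right = tree
    if value <= v:
        return (v, insert(left, value), right)
    return (v, left, insert(right, value))


def lookup(tree, value):
    """True if value is stored in the tree."""
    if tree is None:
        return False
    v, left, right = tree
    if value == v:
        return True
    return lookup(left, value) if value < v else lookup(right, value)


def max_depth(tree):
    """Number of nodes on the longest root-to-leaf path."""
    if tree is None:
        return 0
    return 1 + max(max_depth(tree[1]), max_depth(tree[2]))


def count(tree):
    """Overall number of nodes."""
    if tree is None:
        return 0
    return 1 + count(tree[1]) + count(tree[2])


def min_value(tree):
    """Smallest value: keep following left children (tree must not be empty)."""
    while tree[1] is not None:
        tree = tree[1]
    return tree[0]


def max_value(tree):
    """Largest value: keep following right children (tree must not be empty)."""
    while tree[2] is not None:
        tree = tree[2]
    return tree[0]


def mirror(tree):
    """Swap left and right at every node (the result is no longer a valid BST)."""
    if tree is None:
        return None
    v, left, right = tree
    return (v, mirror(right), mirror(left))


tree = None
for n in [4, 2, 5, 1, 3]:
    tree = insert(tree, n)

print(lookup(tree, 3), lookup(tree, 6))       # True False
print(max_depth(tree), count(tree))           # 3 5
print(min_value(tree), max_value(tree))       # 1 5
```

Since the tuples are never modified, `insert` and `mirror` return new trees and leave the old one alone.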

Note to self: unit tests are fine, as long as they don’t have bugs… Trying to debug a perfectly functioning piece of code because the unit test is bugged is a perfectly rotten way of spending time in front of the ‘puter…

I will probably open a code repository soon for all my Erlang efforts – learning CVS along the way, it cannot hurt. If and when, I’ll post the url.

I wonder how fast my bst would be if I dumped 22,000+ XML nodes, parsed to strings… Not that it would make sense, but it would sure be fun to try!

Erlang

03/24 Collateral damage

Mar 23 10:51:10 <malware> dda: I know nothing about Erlang. Can you give me a capsule description? What kind of stuff it is commonly used for or optimal for?
Mar 23 10:54:36 <dda_> malware: concurrency and distributed computing
Mar 23 10:54:53 <dda_> that capsule enough? ;-)
Mar 23 10:54:59 <MNK2_> dda: I thought it was to drive French hackers mad with wacky UTF-8 and string semantics :)
Mar 23 10:55:07 <dda_> that too
Mar 23 10:55:14 <dda_> but that’s collateral damage

Erlang

03/23 Success!

I have brought all the pieces together today in yet another IronCoding session. I discovered two things:

  1. Distribution doesn’t mean raw speed improvement. At least not on the scale of my test files. Running my parser over a couple of machines is slower – so far, but I am waiting to see what’ll happen with the monster files – than on just one.
    Which probably means I’d better get a dual core, very very fast machine. nnnyessss… /me looks at wallet… nnnnnoooooo….
  2. Erlang *is* powerful. 110 lines of code are enough to set everything [concurrency, distribution, tokenization of strings, centralized sql output to a file] up.

On my TiBook, it is a tad slow – besides, it is a busy machine – but on the dedicated server a friend and I share at OVH, which is pretty much underused, a puny Celeron @2.6GHz with half a gig of RAM, it parsed 1,010 complex records [30,000 lines] in a little over 1 second. I also tried running it there distributed over two local nodes: no dice, 10 seconds or so. sigh…

Now, I need to dump that into an sqlite database. I do. Don’t ask. Any ideas? Google and #erlang weren’t helpful…

Erlang

03/23 P/F same-same, eh

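First, the Frenchman “P씨” [Mr. P], whom the prosecution arrested and indicted, …

Leak of secrets from the Agency for Defense Development [국과연]: branch manager of a French company arrested

Er… That would be “F씨” [Mr. F]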