I have a very specific need, and I have this wild idea I can solve it with Erlang. Grosso modo, the need is that I have to parse a huge XML file, and I mean something fiercely huge. I haven’t seen it yet – my client has it – but even the parser I wrote for this client and this file format takes forever on it, and that parser is very, very specific: it reads the file as plain text and can break if you move a comma. Attempts to read that file, or similar, albeit a tad smaller, ones with expat et al. were entertaining but fruitless.
So I broke out the big guns. I have three computers at home – well four, but my wife’s old TiBook 15″ is sick – and will prolly have one or two more soon, thanks to recent/current contract work. And one at least will be dual core, or better* so it should be interesting.
* I haven’t decided yet whether to buy a dual-core or better G5 – since most of my commercial apps are PPC, and OSS stuff can be recompiled and sped up with -mcpu=g5 – or to buy an Intel. We’ll see.
Those two of you who read my posts on Erlang and survived know that I have been playing with concurrency, distributed computing [ridiculously easy to set up in Erlang, but I suspect there are pitfalls lurking right around the corner to bite me…], looping through lists and tuples [of distributed nodes, wink wink], and complaining about string manipulations [for all I know Erlang has a superb set of string manips, but people on #erlang agreed that it’s so-so…]. Now, I *need* decent string manips if I am to fake-parse XML. So far, I can start nodes on remote machines, read lines from a file and send these lines to the nodes in a round-robin way, to be processed. Processed meaning: obtain the content of the XML – in this case the attributes – and store it somewhere. The “store somewhere” part, I’ll leave for later; we can’t win all battles in one day. Just the word mnesia makes me shiver… For the time being, the parsed output will go to the process dictionary on each node. Good enough.
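The round-robin step might look something like the sketch below. It is not the code used here: the module name rr_sketch, the message shapes {line, Line} and {dump, From}, and the use of local processes as stand-ins for processes on remote nodes are all assumptions made for illustration.

```erlang
-module(rr_sketch).
-export([start_workers/1, dispatch/2, collect/1]).

%% Spawn N local worker processes; each one remembers the lines it was sent.
%% (Hypothetical stand-ins for processes running on remote nodes.)
start_workers(N) ->
    [spawn(fun() -> worker([]) end) || _ <- lists:seq(1, N)].

worker(Seen) ->
    receive
        {line, Line} -> worker([Line | Seen]);
        {dump, From} -> From ! {self(), lists:reverse(Seen)}
    end.

%% Send each line to the next worker in turn, wrapping around at the end.
%% Crashes with function_clause if there are lines but no workers.
dispatch(Lines, Workers) ->
    dispatch(Lines, Workers, Workers).

dispatch([], _Queue, _All) ->
    ok;
dispatch(Lines, [], [_ | _] = All) ->
    dispatch(Lines, All, All);
dispatch([Line | Rest], [W | Ws], All) ->
    W ! {line, Line},
    dispatch(Rest, Ws, All).

%% Ask every worker what it received, in worker order.
%% Each worker exits after answering.
collect(Workers) ->
    [begin
         W ! {dump, self()},
         receive {W, Lines} -> Lines end
     end || W <- Workers].
```

With two workers and the lines "a" to "e", the first worker ends up with "a", "c", "e" and the second with "b", "d". On real remote nodes the workers would presumably be started with spawn(Node, Fun) rather than spawn(Fun), with the module loaded on each node; the dispatch loop itself should not need to change, since sending to a remote pid looks the same as sending to a local one.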
Now, I started working on this with another XML file that has a similar format: the opml file produced by Technorati. It looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!-- OPML generated by Technorati Blog Finder -->
<opml version="1.0">
<head>
<ownerName>Technorati</ownerName>
<title>5 blogs tagged Erlang sorted by authority</title>
</head>
<body>
<outline text="3pBlog Mickaël Rémond Performance, Process, P… " description="Mickael Remond’s thoughts on cluster, robust, scalable and distributed computing" htmlUrl="http://www.3pblog.net" xmlUrl="http://www.3pblog.net/rss.php" type="rss"/>
<outline text="Life’s too short to brag about microcosm a… " htmlUrl="http://sungnyemun.org/wordpress" xmlUrl="http://sungnyemun.org/wordpress/wp-rss2.php" type="rss"/>
[…]
</body>
</opml>
The idea is to extract the 5 attributes – not all five may be present, so it’s actually a good idea to preset the dictionary with the five keys and empty values – and return a tuple, list, dictionary, whatever. Read the file until you hit <body> and then start reading lines until </body>. Ok, can do. Then, for each line, I read the “header” [<outline ], and then hope for the best and pray that the tag/attribute pairs are properly formatted – no reason to doubt, they’re machine-produced, right? So I need a function to retrieve the tag [which I called token; in XML terms it is the attribute name], and then the attribute [in XML terms, its quoted value]. This goes on until I hit “/>”, the end of the line, as far as the “parser” is concerned. I know, many things could break, but if the format is consistent… we should be safe!
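The "read until <body>, then read lines until </body>" part can be sketched on its own, assuming the file has already been read into a list of line strings. The module name body_sketch and the function body_lines/1 are made up for this illustration, and string:trim/1 needs a reasonably recent OTP release (20 or later).

```erlang
-module(body_sketch).
-export([body_lines/1]).

%% Keep only the lines strictly between <body> and </body>.
%% Assumes each of those two tags sits alone on its own line,
%% as in the OPML sample above.
body_lines(Lines) ->
    %% Drop surrounding whitespace, including the trailing newline.
    Trimmed = [string:trim(L) || L <- Lines],
    %% Skip everything up to the opening <body> line.
    AfterOpen = lists:dropwhile(fun(L) -> L =/= "<body>" end, Trimmed),
    case AfterOpen of
        [] ->
            [];   % no <body> found at all
        [_Open | Rest] ->
            lists:takewhile(fun(L) -> L =/= "</body>" end, Rest)
    end.
```

Each line that comes out is one <outline … /> element, ready to be handed to a per-line extractor. For a fiercely huge file, building the whole list in memory is probably not what you want; the same two-phase logic could be applied while reading line by line instead.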
Here goes:
-module(extract).
-compile(export_all).
doit(S) -> % that's the function we'll call
    % reset the process dictionary entries for the five attributes
    put(text,""),
    put(type,""),
    put(htmlUrl,""),
    put(description,""),
    put(xmlUrl,""),
    proceed(S).

proceed("/>") -> % This is the closing tag, we're done.
    {{text, get(text)}, {type, get(type)}, {htmlUrl, get(htmlUrl)}, {description, get(description)}, {xmlUrl, get(xmlUrl)}};
proceed([$< |T]) -> % opening tag, read the header
    {S1,_Head}=extract_header(T), % the header itself ("outline") is not used
    proceed(S1);
proceed([32|T]) -> % skip spaces between attributes
    proceed(T);
proceed(S1) -> % any other possibility
    {S2,Tk}=extract_token(S1),
    {S3,Attr}=extract_attribute(S2),
    Atom=list_to_atom(Tk),
    put(Atom,Attr),
    proceed(S3).
extract_header(L) ->
    extract_header(L,[],0).

% the last argument, 0 and then 1, is just a state marker:
% 0 = reading the header name, 1 = skipping ahead to the first attribute
extract_header([H|T],Acc,0) ->
    case H of
        32 -> extract_header(T,Acc,1);
        _ -> extract_header(T,[H|Acc],0)
    end;
extract_header([H|T],Acc,1) ->
    case lists:member(H, "qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM1234567890 -_") of
        true -> {[H|T],lists:reverse(Acc)}; % Return remainder + header
        false -> extract_header(T,Acc,1)
    end.
extract_token(L) ->
    extract_token(L,[]).

extract_token([H|T],Acc) ->
    case lists:member(H, "qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM1234567890 -_") of
        true -> extract_token(T,[H|Acc]); % still inside the name
        false -> {[H|T],lists:reverse(Acc)} % stop at the first other character, e.g. =
    end;
extract_token([],Acc) -> % input ended while reading a name
    {">",lists:reverse(Acc)}.
extract_attribute(L) ->
    extract_attribute(L,[],0).

extract_attribute([H|T],Acc,0) -> % state 0: looking for the opening "
    case H of
        34 -> extract_attribute(T,Acc,1); % Open "
        _ -> extract_attribute(T,Acc,0)
    end;
extract_attribute([H|T],Acc,1) -> % state 1: " open, collecting the value
    case H of
        34 -> {T,lists:reverse(Acc)}; % Close ". Return remainder + value
        _ -> extract_attribute(T,[H|Acc],1)
    end.
70> S2="<outline text=\"3pBlog Micka\303\253l R\303\251mond Performance, Process, P… \" description=\"Mickael Remond’s thoughts on cluster, robust, scalable and distributed computing\" htmlUrl=\"http://www.3pblog.net\" xmlUrl=\"http://www.3pblog.net/rss.php\" type=\"rss\"/>".
71> extract:doit(S2).
{{text,"3pBlog Micka\303\253l R\303\251mond Performance, Process, P… "},
{type,"rss"},
{htmlUrl,"http://www.3pblog.net"},
{description,"Mickael Remond’s thoughts on cluster, robust, scalable and distributed computing"},
{xmlUrl,"http://www.3pblog.net/rss.php"}}
Sweet, no?
I am sure this is clumsy. But.It.Works. Now on to bring all the pieces together… And learn about mnesia and stuff.
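On the "preset the five keys with empty values" idea: a possible alternative to the process dictionary, sketched here under the assumption that the parsed name/value pairs are collected into a plain list of {Key, Value} tuples first. The module name defaults_sketch and the function with_defaults/1 are hypothetical; this is not what the code above does.

```erlang
-module(defaults_sketch).
-export([with_defaults/1]).

%% The five attribute names expected on an <outline> element.
-define(KEYS, [text, type, htmlUrl, description, xmlUrl]).

%% Return one {Key, Value} pair per expected key, in a fixed order,
%% using "" for any key that is missing from Pairs.
%% Keys in Pairs that are not in ?KEYS are silently dropped.
with_defaults(Pairs) ->
    [{K, proplists:get_value(K, Pairs, "")} || K <- ?KEYS].
```

Calling with_defaults([{xmlUrl, "u"}, {type, "rss"}]) gives back all five keys, with text, htmlUrl and description set to "". Nothing is stored between calls, so there is nothing to reset before the next line.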