XML Parsing
Below are a few of my news postings that describe a simple approach to XML parsing, using a small C wrapper around the expat library and the Common Lisp reader. The approach is trivial, robust (since it uses a high-quality XML parser), mostly portable, and for many uses acceptably fast: reading in and post-processing a 3.5MB XML file with 89301 elements, for example, takes around 7s using CMU CL on a lowly AMD K6-2/550. The source code to elements.c can also be downloaded here.
Newsgroups: comp.lang.lisp
Subject: Re: Seeking a *trivial* XML parser
From: "Pierre R. Mai" <pmai@acm.org>
Date: 05 Feb 2002 22:20:07 +0100
Message-ID: <87adunk4mw.fsf@orion.bln.pmsf.de>
tfb+google@tfeb.org (Tim Bradshaw) writes:
> So I need an XML parser. I have spent some time looking around for
> one, and there are obviously a lot of offerings out there, many of
> which are probably very good. All the ones I've looked at look rather
> more complex than I need though. What I want is the absolute minimal
> possible thing: since I get to write the DTDs I can completely control
> what it will come across, and I don't need namespaces or any of the
> other enormous complexities that encrust these things. All I need,
> really, is a politically-acceptable syntax for SEXPRs.
>
> The parser can be external - I can run a program and snarf the output
> if need be.
FWIW, I've tended to use the expat parser library by James Clark,
which comes dual-licenced under the MPL/GPL. For simple uses, I've
written the following wrapper for expat, that parses standard input,
and outputs a simple, lisp-readable representation of the file, with
the following features:
- Translates UTF-8 to ISO Latin-1, so that the resulting output can be
  used as-is with normal 8bit Unix lisps.  Stuff outside of ISO
  Latin-1 is silently elided (can easily be changed to dump core
  instead ;).
- PCDATA is mapped to CL strings, where as many PCDATA segments as
  possible are merged into one string.
- Elements are mapped to lists, with the first item being the start
  tag, which is mapped to a nested list, i.e.

    <element attr="value">...</element>

  is mapped to

    (("element" "attr" "value") ...)

- Processing Instructions (PIs) are mapped to a single cons, i.e.
  <?foo ...?> is mapped to ("foo" . "...")
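
To make the first point concrete, here is a minimal sketch of the two-byte UTF-8 to ISO Latin-1 arithmetic. The helper name `latin1_from_utf8_pair` is hypothetical and is not part of the wrapper; it is also slightly stricter than the posted code, in that it rejects the overlong lead bytes 0xC0 and 0xC1.

```c
#include <assert.h>

/* Hypothetical helper, not part of elements.c: decode one two-byte
 * UTF-8 sequence into an ISO Latin-1 byte value.  Returns -1 if the
 * sequence is malformed or encodes a character above U+00FF. */
int latin1_from_utf8_pair(unsigned char lead, unsigned char cont)
{
    /* Lead byte must look like 110xxxxx, continuation like 10xxxxxx. */
    if ((lead & 0xE0) != 0xC0 || (cont & 0xC0) != 0x80)
        return -1;
    /* 0xC0 and 0xC1 are overlong encodings; 0xC4 and above encode
     * characters beyond U+00FF, which ISO Latin-1 cannot represent. */
    if (lead < 0xC2 || lead > 0xC3)
        return -1;
    /* Two payload bits from the lead byte, six from the continuation. */
    return ((lead & 0x03) << 6) | (cont & 0x3F);
}
```

For example, the UTF-8 bytes 0xC3 0xA9 (LATIN SMALL LETTER E WITH ACUTE) give ((0xC3 & 0x03) << 6) | (0xA9 & 0x3F) = 0xC0 | 0x29 = 0xE9, which is the ISO Latin-1 code for the same character.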
The appended file is hereby placed into the public domain.
Regs, Pierre.
/* This is an interface program that uses expat to parse XML and
 * output a Lispified representation that can be easily parsed by the
 * normal Common Lisp reader.
 */
#include <stdio.h>
#include <string.h>   /* for strlen() */
#include "xmlparse.h"

/* Since XML PCDATA elements can be returned in multiple chunks by
 * expat, and we want this merged into one string for the Lisp side of
 * things, we keep track of the current inText state, i.e. whether the
 * last thing we output was text, or not.  We only write opening
 * double quotes on !inText -> inText transitions, and closing double
 * quotes on inText -> !inText transitions.
 */
void finishText(int *inText)
{
  if (*inText)
    {
      putchar('"');
      putchar('\n');
      *inText=0;
    }
}

void startText(int *inText)
{
  if (!(*inText))
    {
      putchar('"');
      *inText=1;
    }
}
/* Handle conversion from UTF-8 to ISO Latin-1, to which we restrict
 * our Lisp side support for the moment.  Characters outside of the
 * ISO Latin-1 8bit range will be SILENTLY elided. */
void outputText(const unsigned char* text,int len)
{
  int pos=0;
  while (pos<len)
    {
      if (text[pos] < 0x80)
        {
          /* ASCII: Output verbatim, except for escape-chars */
          if (text[pos] == '\\' || text[pos] == '"')
            putchar('\\');
          putchar(text[pos++]);
        }
      else if (text[pos] < 0xC0)
        {
          /* We are in the middle of a multi-byte sequence!
           * This should never happen, so we skip it. */
          pos++;
        }
      else if (text[pos] < 0xE0)
        {
          /* Two-byte sequence.  Skip if follow on char is not a
           * valid continuation byte (10xxxxxx): */
          if ((pos+1>=len) || ((text[pos+1] & 0xC0) != 0x80))
            {
              pos++;
              continue;
            }
          /* Check whether we have a valid ISO Latin-1 character: */
          if (text[pos] < 0xC4)
            {
              /* Valid, output the character decoded from both bytes */
              putchar(((text[pos] & 0x03) << 6) | (text[pos+1] & 0x3f));
            }
          pos+=2;
        }
      else if (text[pos] < 0xF0)
        {
          /* Three-byte sequence.  Skip it. */
          if ((pos+1>=len) || ((text[pos+1] & 0xC0) != 0x80))
            {
              pos++;
              continue;
            }
          if ((pos+2>=len) || ((text[pos+2] & 0xC0) != 0x80))
            {
              pos+=2;
              continue;
            }
          pos+=3;
        }
      else
        {
          /* Sequences of four or more bytes encode characters outside
           * the BMP, which have no ISO Latin-1 equivalent.  We skip
           * until the next non continuation character. */
          do
            {
              pos++;
            }
          while ((pos<len) && ((text[pos] & 0xC0) == 0x80));
        }
    }
}
/* Convenience wrapper for NUL-terminated strings (names, attributes) */
void outputString(const char* text)
{
  outputText((const unsigned char*)text,(int)strlen(text));
}
/* Handle Element start and stop tags */
void startElement(void *userData, const char *name, const char **atts)
{
  const char** att;
  finishText((int*)userData);
  fputs("((\"",stdout);
  outputString(name);
  putchar('"');
  for (att=atts;*att;att+=2)
    {
      fputs(" \"",stdout);
      outputString(*att);
      fputs("\" \"",stdout);
      outputString(*(att+1));
      fputs("\"",stdout);
    }
  fputs(")\n",stdout);
}

void endElement(void *userData, const char *name)
{
  finishText((int*)userData);
  fputs(")\n",stdout);
}
/* Handle PCDATA */
void charData(void* userData, const XML_Char *s,int len)
{
  startText((int*)userData);
  outputText((const unsigned char*)s,len);
}
/* Handle PIs */
void processingInstruction(void* userData,const XML_Char *target,
                           const XML_Char *data)
{
  finishText((int*)userData);
  fputs("(\"",stdout);
  outputString(target);
  fputs("\" . \"",stdout);
  outputString(data);
  fputs("\")",stdout);
}
/* Main program */
int main()
{
  char buf[BUFSIZ];
#ifndef CL_NS_SEP
  XML_Parser parser = XML_ParserCreate(NULL);
#else
  XML_Parser parser = XML_ParserCreateNS(NULL,CL_NS_SEP);
#endif
  int done;
  int inText = 0;

  XML_SetUserData(parser, &inText);
  XML_SetElementHandler(parser, startElement, endElement);
  XML_SetCharacterDataHandler(parser, charData);
  XML_SetProcessingInstructionHandler(parser,processingInstruction);
  do {
    size_t len = fread(buf, 1, sizeof(buf), stdin);
    done = len < sizeof(buf);
    if (!XML_Parse(parser, buf, len, done)) {
      fprintf(stderr,
              "%s at line %d\n",
              XML_ErrorString(XML_GetErrorCode(parser)),
              XML_GetCurrentLineNumber(parser));
      return 1;
    }
  } while (!done);
  XML_ParserFree(parser);
  return 0;
}
--
Pierre R. Mai <pmai@acm.org> http://www.pmsf.de/pmai/
The most likely way for the world to be destroyed, most experts agree,
is by accident. That's where we come in; we're computer professionals.
We cause accidents. -- Nathaniel Borenstein
Newsgroups: comp.lang.lisp
Subject: Re: Seeking a *trivial* XML parser
From: "Pierre R. Mai" <pmai@acm.org>
Date: 06 Feb 2002 23:12:46 +0100
Message-ID: <87g04ew97l.fsf@orion.bln.pmsf.de>
"Pierre R. Mai" <pmai@acm.org> writes:
> FWIW, I've tended to use the expat parser library by James Clark,
> which comes dual-licenced under the MPL/GPL. For simple uses, I've
> written the following wrapper for expat, that parses standard input,
> and outputs a simple, lisp-readable representation of the file, with
> the following features:
A couple of points that have cropped up in private email:
- If you don't need support for processing instructions, just comment
  out the following line in main():

  > XML_SetProcessingInstructionHandler(parser,processingInstruction);

  This will give you a simpler format on the lisp side, since you can
  now treat any cons as an element spec.
- Namespaces can be supported by defining the pre-processor symbol
  CL_NS_SEP to a character that is then used to separate the
  namespace from the identifier in relevant things like GIs...
- In XML (like in mixed content model SGML), whitespace _is_
  significant.  Only the application can decide whether to elide it in
  certain cases (e.g. elements that only contain other elements).
  Since the elements.c wrapper guarantees that PCDATA segments are
  merged as much as possible, it is easy to trim whitespace from
  PCDATA, and elide PCDATA segments which are all whitespace in a
  simple post-processing stage, e.g.

    (defconstant +ws-char-bag+ '(#\Space #\Tab #\Newline)
      "Or whatever else you like to call whitespace...")

    (defun post-process-xml (list)
      (assert (consp list))
      (if (and (stringp (car list)) (stringp (cdr list)))
          ;; PIs get passed-through
          list
          (list* (first list)
                 (mapcan #'(lambda (elem)
                             (if (stringp elem)
                                 (let ((result
                                        (string-trim +ws-char-bag+ elem)))
                                   (if (zerop (length result))
                                       nil
                                       (list result)))
                                 (list (post-process-xml elem))))
                         (rest list)))))
  With such a post-processing stage, the following XML fragment

    <config>
      <attrib name="foo">value</attrib>
      <attrib name="bar">another value</attrib>
    </config>

  will result in this list:

    (("config")
     (("attrib" "name" "foo") "value")
     (("attrib" "name" "bar") "another value"))
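
Returning to the namespace point above: when the wrapper is built with the separator symbol defined, expat reports names that belong to a namespace as the namespace URI, then the separator character, then the local name, so that is what ends up in the start-tag strings. The following is a minimal sketch of how such a name could be split; the helper name `local_part` and the use of '|' as separator are assumptions made for illustration only.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of elements.c: return the local part
 * of a name as reported by a namespace-aware expat parser, i.e. the
 * text after the last occurrence of the separator.  Names without a
 * namespace contain no separator and are returned unchanged. */
const char *local_part(const char *name, char sep)
{
    const char *p;
    assert(name != NULL);
    p = strrchr(name, sep);
    return p ? p + 1 : name;
}
```

With '|' as the separator, `local_part("http://www.w3.org/1999/xhtml|body", '|')` yields "body". Searching from the end means the split still works if the chosen separator also occurs inside the URI, as long as it cannot occur in a local name.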
Regs, Pierre.
Newsgroups: comp.lang.lisp
Subject: Re: The horror that is XML
From: "Pierre R. Mai" <pmai@acm.org>
Date: 09 Mar 2002 21:58:06 +0100
Message-ID: <876645h31d.fsf@orion.bln.pmsf.de>
Frederic Brunel <frederic.brunel@in-fusio.com> writes:
> > I'll suggest that people head to Google Groups, and search for:
> > "pierre mai xml trivial expat"
>
> Thanx, I've got the code to work but I've faced a strange
> problem... Which Common Lisp implementation did you use?
>
> I have modified Pierre's code to take a file as input, but when I run it
> in CMUCL with (ext:run-program ""), it freezes and I'm unable to
> get a stream from it, whereas the program runs perfectly from bash (and
> pipes)! :(
Personally, I wouldn't modify the C code to take a file name, but
rather use the ability of either your favourite shell or CMU CL's
run-program to redirect standard input from a file, which is more
flexible.
In CMU CL, something like this should work:
(defun parse-xml-from-file (file)
  (let* ((process (ext:run-program *xml-expat-parser-path* nil
                                   :input file :output :stream :wait nil))
         (output (ext:process-output process)))
    (unwind-protect
        (read output)
      (ext:process-wait process)
      (ext:process-close process)
      (unless (zerop (ext:process-exit-code process))
        (error "Error parsing XML file ~A." file)))))
If you modified elements.c to take a filename argument, you'd do
  (let* ((process (ext:run-program *xml-expat-parser-path* (list file)
                                   :input nil :output :stream :wait nil))
instead.
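
For completeness, a filename-taking variant of the wrapper could select its input stream along the following lines. This is only a sketch under stated assumptions: the helper name `open_input` is hypothetical, and main() would have to be changed to accept argc and argv and to read from the returned stream instead of stdin.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not part of elements.c: choose the input
 * stream.  Without an argument it returns stdin, which matches the
 * behaviour of the posted code; with one argument it tries to open
 * that file.  Returns NULL if the file cannot be opened. */
FILE *open_input(int argc, char **argv)
{
    assert(argv != NULL);
    if (argc < 2)
        return stdin;
    return fopen(argv[1], "rb");
}
```

The caller would check for a NULL result, report the error on stderr, and pass the stream to fread() in the parsing loop. Keeping stdin as the default preserves the redirection-based usage recommended above.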
Regs, Pierre.