DBLP FAQ: How to parse dblp.xml?

The DBLP data are available from http://dblp.uni-trier.de/xml/.

The encoding used for the XML file is plain ASCII. To represent characters outside of the 7-bit range we use symbolic or numeric entities. All symbolic entities are defined in the DTD. At the moment most parts of DBLP are restricted to ISO-8859-1 (Latin-1) characters, i.e. the first 256 Unicode code points. Characters outside of this range occur only inside <note> elements, for example some Chinese names in their original spelling.
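To illustrate how a parser turns such entities back into characters, here is a minimal sketch. It assumes a tiny in-memory document whose internal DTD subset stands in for dblp.dtd; the class name EntityDemo, the single entity declaration and the sample name are only illustrations and are not taken from the DBLP files.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class EntityDemo {
    // Parses the given XML and returns all character data,
    // with symbolic and numeric entities already expanded by the parser.
    static String parseText(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the inline <!ENTITY ...> line mimics one declaration
        // of the kind found in a DTD; the real dblp.dtd is not used here.
        String xml =
            "<?xml version=\"1.0\"?>\n"
          + "<!DOCTYPE dblp [ <!ENTITY uuml \"&#252;\"> ]>\n"
          + "<dblp><author>J&uuml;rgen M&#252;ller</author></dblp>";
        // Both the symbolic (&uuml;) and the numeric (&#252;) form
        // arrive in characters() as the same Latin-1 character.
        System.out.println(parseText(xml));
    }
}
```

The file itself stays pure ASCII; the application only sees the expanded characters.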

Our small example program for processing the DBLP data is written in Java. Please download the source files (Parser.java, Publication.java and Person.java, described below) into a directory and compile them:

javac Parser.java

The dblp.xml and dblp.dtd files should be stored in the same directory. You may start the program with the command

java -mx900M -DentityExpansionLimit=2500000 Parser dblp.xml > out.txt

This works with the Java virtual machine 1.5.* but not with 1.6.*. We do not yet understand the problem with Java VM 1.6, but it has also been reported by others. The machine should have more than 1.5 GB of main memory; the option -mx900M sets the heap space to 900 MB. The option -DentityExpansionLimit is necessary to resolve the symbolic entities used in the large XML file. Depending on your machine, the program should run for a few minutes. The result is stored in 'out.txt' ...

If you want to use Java 1.6, you should download the Apache Xerces XML parser. It does not have the problem reported above, and the -DentityExpansionLimit option is not required. You only have to copy the file xercesImpl.jar from the Xerces distribution to a location covered by your classpath.

The first part of out.txt contains some simple statistics about the DBLP data:

The main part of out.txt shows how we try to locate variations of name spellings:

Hongli Deng: Linda Shapiro - Linda G. Shapiro

There is a person named 'Hongli Deng' who has coauthors 'Linda Shapiro' and 'Linda G. Shapiro'.
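How two coauthor names are judged to be similar is not spelled out in this text. As a rough sketch of the idea only, the following code assumes a hypothetical rule: two different names count as possible variants if their first and last tokens agree. The class NameVariants and its methods are illustrations and not part of the example program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NameVariants {
    // Hypothetical heuristic (an assumption, not the rule used for out.txt):
    // names differ as strings but share the first and the last token.
    static boolean looksLikeVariant(String a, String b) {
        if (a.equals(b)) return false;
        String[] x = a.trim().split("\\s+");
        String[] y = b.trim().split("\\s+");
        return x[0].equalsIgnoreCase(y[0])
            && x[x.length - 1].equalsIgnoreCase(y[y.length - 1]);
    }

    // Produces one report line per suspicious pair among a person's coauthors,
    // in the same "person: name1 - name2" layout as the example line above.
    static List<String> report(String person, List<String> coauthors) {
        List<String> lines = new ArrayList<String>();
        for (int i = 0; i < coauthors.size(); i++) {
            for (int j = i + 1; j < coauthors.size(); j++) {
                if (looksLikeVariant(coauthors.get(i), coauthors.get(j))) {
                    lines.add(person + ": " + coauthors.get(i)
                              + " - " + coauthors.get(j));
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> coauthors = Arrays.asList("Linda Shapiro", "Linda G. Shapiro");
        for (String line : report("Hongli Deng", coauthors)) {
            System.out.println(line);  // Hongli Deng: Linda Shapiro - Linda G. Shapiro
        }
    }
}
```

Comparing only the coauthors of one person keeps the number of pairs small, which matters for a collection of this size.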

Parser.java

This class contains the static main method and the methods necessary to use the XML SAX parser shipped with the standard Java distribution. It produces the first part of the statistics.

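The two main approaches to parsing XML are DOM and SAX. A DOM parser builds the complete document tree in memory, whereas a SAX parser reads the input sequentially and reports events (start of an element, character data, end of an element) to callback methods.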

In our application we are only interested in person names and not in titles, conference names, page numbers, publication years etc. We view a publication as a list of author (or editor) fields; any other information is skipped. The 'startElement' method recognizes two situations:

The 'characters' method simply appends the input text to the 'Value' string. This should only happen if we are inside an author or editor element. Without the test 'if (insidePerson)' the program remains correct, but it becomes very slow because it produces several million garbage objects.

The method 'endElement' works similarly to 'startElement':
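As a rough illustration of the handler logic described in this section (it is not the original Parser.java), the following sketch collects the author and editor names of each publication. It assumes that every direct child of the root element is one publication record; the class name PersonHandlerSketch, the depth counter and the sample record with its title are our own placeholders.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PersonHandlerSketch extends DefaultHandler {
    // One inner list of person names per publication record.
    final List<List<String>> publications = new ArrayList<List<String>>();
    private List<String> current;             // persons of the record being read
    private boolean insidePerson = false;     // true between <author>/<editor> tags
    private final StringBuilder value = new StringBuilder();
    private int depth = 0;                    // assumption: depth 2 = publication

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        depth++;
        if (depth == 2) {
            current = new ArrayList<String>();        // a new publication starts
        } else if (qName.equals("author") || qName.equals("editor")) {
            insidePerson = true;                      // a person name starts
            value.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text outside of author/editor elements is skipped.
        if (insidePerson) value.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (insidePerson) {
            current.add(value.toString());            // person name is complete
            insidePerson = false;
        } else if (depth == 2) {
            publications.add(current);                // publication is complete
        }
        depth--;
    }

    static List<List<String>> parse(String xml) throws Exception {
        PersonHandlerSketch handler = new PersonHandlerSketch();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.publications;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical record; the title is a placeholder.
        String xml = "<dblp><article><author>Hongli Deng</author>"
                   + "<author>Linda G. Shapiro</author>"
                   + "<title>Example title</title></article></dblp>";
        System.out.println(parse(xml));  // [[Hongli Deng, Linda G. Shapiro]]
    }
}
```

Reusing one StringBuilder and appending only while insidePerson is set avoids creating a new object for every piece of text the parser reports.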

Publication.java

...

Person.java

...