Simple Java XML Parser (SJXP)

Android Emulator – for real world performance, see Performance section below.

Sections

  1. Description
  2. License
  3. What Problem Does this Solve?
  4. Benefits
  5. Performance
  6. Usage
  7. Android Example
  8. Intended Audience
  9. Download
  10. Javadoc & Source
  11. Maven
  12. Bug Reports
  13. Feature Requests, Comments & Feedback

Description

Simple Java XML Parser (SJXP) is a very small and fast library (4 classes) that sits on top of any spec-compliant Java XML pull parser (Android included). It lets you specify “rules” in simple XPath-like terms, which the parser uses to deliver parsed data right at your feet with next to no performance overhead (CPU or memory) over and above default XML pull parsing; it just makes it a hell of a lot simpler to work with.

This project was born out of months of work using a variety of Java XML pull-parsers to process feed documents (RDF, RSS, ATOM) and realizing through writing and re-writing simplifications to the pull parsing process that almost every standard parsing scenario can be distilled down to a few very basic elements.

SJXP is the final result of all that work.

License

This software is licensed under the Apache 2 License.

What Problem Does this Solve?

“Pull Parsing” has become very popular in Java as it provides better-than-SAX performance with a slightly easier-to-use API (and some might say event-based paradigm). The low memory overhead and fast performance are great, but when you code against a pull parser it still feels very much like other XML coding; lots of exception handlers, manually keeping track of parser state, lots of switch and if-else statements and hard-coded logic buried deep inside to store/process/retrieve values.

Pull parsing is nice and fast, but the code ends up being a pain to manage, or at the very least you’d rather not manage it.

Wouldn’t it be nice if you could just tell the parser “Hey buddy, give me all the /rss/channel/item/link values, and make it snappy!” and the parser only pestered you when it had a value ready for you, but done in a really performant way that doesn’t add overhead to the parsing process?

That is exactly what SJXP does.
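
For comparison, here is a rough sketch of the hand-written bookkeeping described above. It does not use SJXP at all; it uses the StAX pull parser that ships with the JDK (javax.xml.stream) as a stand-in for XPP3 or the Android parser, and the class and method names (ManualPullExample, readLinks) are made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustration only: hand-rolled pull parsing with the JDK's StAX API, no SJXP involved.
final class ManualPullExample {

	// Collects the text of every /rss/channel/item/link element.
	static List<String> readLinks(String xml) {
		List<String> links = new ArrayList<>();
		StringBuilder path = new StringBuilder(); // where we are, e.g. "/rss/channel"
		StringBuilder text = new StringBuilder(); // character data of the current element
		try {
			XMLStreamReader reader = XMLInputFactory.newInstance()
					.createXMLStreamReader(new StringReader(xml));
			while (reader.hasNext()) {
				switch (reader.next()) {
				case XMLStreamConstants.START_ELEMENT:
					path.append('/').append(reader.getLocalName());
					text.setLength(0);
					break;
				case XMLStreamConstants.CHARACTERS:
				case XMLStreamConstants.CDATA:
					text.append(reader.getText());
					break;
				case XMLStreamConstants.END_ELEMENT:
					if (path.toString().equals("/rss/channel/item/link")) {
						links.add(text.toString().trim());
					}
					path.setLength(path.lastIndexOf("/")); // pop the element we just left
					text.setLength(0);
					break;
				default:
					break; // comments, processing instructions, etc. are ignored
				}
			}
			reader.close();
		} catch (XMLStreamException e) {
			throw new IllegalStateException("Could not parse XML", e);
		}
		return links;
	}

	public static void main(String[] args) {
		String xml = "<rss><channel><title>Feed</title>"
				+ "<item><title>A</title><link>http://example.com/a</link></item>"
				+ "<item><title>B</title><link>http://example.com/b</link></item>"
				+ "</channel></rss>";
		System.out.println(readLinks(xml));
	}
}
```

Even for a single value, the loop has to track the current path, buffer text, and handle the checked parser exception; every additional value adds another branch to the same loop.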

Benefits

  • 100% Java, runs on any Java platform.
  • If deployed to Android 1.5+, you only need the sjxp JAR by itself because the Android platform provides a pull parser impl already. Simple!
  • Super fast! As fast as straight XML pull parsing, with little to no CPU or memory overhead.
  • Uses XPP3 as the pull parser impl on non-Android platforms; one of the fastest/smallest parsers out there.
  • Makes using XML pull parsing simple: define rules, implement callback handlers for the inbound data.
  • Simplifies exception-handling. Just catch the unchecked XMLParserException which will explain (in added detail) exactly what went wrong if anything broke during parsing.
  • Pedantic API ensures tight/clean values are used and accidental mistakes (like a trailing slash) won’t bite you down the road when you mysteriously get no values from the parser.
  • Conceptually simple. There was no need to make this library a convoluted mess of Lexer-esque parse trees. Only simple, performant constructs are used which makes the API really easy to use and source code a breeze to walk through if you are interested.
  • Source code is easy to understand and all code is documented almost line-by-line.

Performance

Below are the results of a benchmark that is available in the /src/test/main source tree for SJXP.

As of SJXP 2.0, memory usage and performance have been improved drastically by utilizing a new hash code based method for matching IRules. Memory and CPU usage in SJXP 2.0+ has been cut down to a small fraction of what it was in the 1.x series.
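
To give a feel for why hash-based matching is cheap, here is a minimal sketch of the general idea. It is not taken from the SJXP source; the class HashRuleIndex and its methods are hypothetical, and rules are reduced to plain location-path strings to keep the example small.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of hash-code-based rule lookup; not SJXP's actual implementation.
final class HashRuleIndex {

	// Rules grouped by the hash code of their location path.
	private final Map<Integer, List<String>> rulesByHash = new HashMap<>();

	void addRule(String locationPath) {
		rulesByHash.computeIfAbsent(locationPath.hashCode(), k -> new ArrayList<>())
				.add(locationPath);
	}

	// One map lookup per parse event, however many rules are registered.
	List<String> match(String currentPath) {
		List<String> candidates = rulesByHash.get(currentPath.hashCode());
		if (candidates == null) {
			return Collections.emptyList();
		}
		List<String> matches = new ArrayList<>();
		for (String candidate : candidates) {
			if (candidate.equals(currentPath)) { // guards against hash collisions
				matches.add(candidate);
			}
		}
		return matches;
	}

	public static void main(String[] args) {
		HashRuleIndex index = new HashRuleIndex();
		index.addRule("/rss/channel/item/title");
		index.addRule("/rss/channel/item/link");
		System.out.println(index.match("/rss/channel/item/link"));
		System.out.println(index.match("/rss/channel/title"));
	}
}
```

The point of the design is that the cost of checking the parser's current location does not grow with the number of registered rules, as it would if every rule were compared against the location on every event.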

Hardware

  • Java 1.6.0_24 on Windows 7 64-bit
  • Dual Core Intel E6850 processor
  • 8 GB of RAM

Benchmark Files

  1. Hacker News Feed – 10 KB, 30 stories
  2. Bugzilla Bug Feed – 132 KB, 128 comments
  3. New York Craigslist Feed – 278 KB, 100 listings
  4. TechCrunch Feed – 300 KB, 25 stories
  5. Samsung News Feed – 500 KB, 100 stories
  6. Eclipse XML Editor Stress Test File – 1.63 MB, 1054 additionallineitem entries
  7. Example Dictionary XML – 10 MB, 41,427 <w> entries

Results

1. Processed 11061 bytes, parsed 60 XML elements in 7ms
2. Processed 132687 bytes, parsed 258 XML elements in 33ms
3. Processed 278771 bytes, parsed 200 XML elements in 143ms
4. Processed 303726 bytes, parsed 50 XML elements in 9ms
5. Processed 724031 bytes, parsed 300 XML elements in 19ms
6. Processed 1633334 bytes, parsed 2108 XML elements in 120ms
7. Processed 10625983 bytes, parsed 41427 XML elements in 241ms

Usage

The basis of SJXP is defining rules via the IRule interface or, more specifically, creating DefaultRule instances and overriding the no-op handlers for whichever values you want.

Let’s say we wanted to parse RSS2-compliant feeds by reading out every story’s title and link for our Android app, web service, web app… whatever; we just want that information.

We would define two rules that look like this:

IRule titleRule = new DefaultRule(Type.CHARACTER, "/rss/channel/item/title") {
	@Override
	public void handleParsedCharacters(XMLParser parser, String text, Object userObject) {
		// Store the title in a DB or something fancy
	}
};
IRule linkRule = new DefaultRule(Type.CHARACTER, "/rss/channel/item/link") {
	@Override
	public void handleParsedCharacters(XMLParser parser, String text, Object userObject) {
		// Also store the link, or something equivalently fancy
	}
};

then we would take those two rules and give them to an instance of XMLParser to use:

XMLParser parser = new XMLParser(titleRule, linkRule);

and now we can take that parser instance and run it against any and all RSS2 feeds that we want to parse… that’s it. We are done. No while-loops, no exception handlers, no parser position calculations, no namespace querying. Cool, huh?

NOTE: The userObject argument is just a simple pass-through mechanism you can use if you wish to pass an object to XMLParser and have it give it back to you inside of the handler methods (e.g. like a DAO or cache object you use to store the parsed values).

TIP: If you only want to parse a limited number of items from your file, you can tell the XMLParser to stop at any time by calling the stop() method from one of the handlers. Stopping the parser is a safe operation as it flips a flag internally that tells the parser “When you are done with the current event, don’t move on to the next one, just return from parse(…)”.
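
The rule, pass-through object and stop-flag ideas can be sketched in a few lines. The code below is NOT SJXP and does not reproduce its API: MiniRuleParser, TextHandler and firstTitles are hypothetical names, and the JDK's StAX parser is assumed as the underlying pull parser. It only shows the shape of the pattern: handlers keyed by location path, a user object handed back to each handler, and a stop() that merely flips a flag checked between events.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Toy rule-driven parser for illustration; not SJXP and not its API.
final class MiniRuleParser<T> {

	// Hypothetical callback type standing in for a rule's character handler.
	interface TextHandler<U> {
		void handle(MiniRuleParser<U> parser, String text, U userObject);
	}

	private final Map<String, TextHandler<T>> handlers = new HashMap<>();
	private boolean stopped;

	void addRule(String locationPath, TextHandler<T> handler) {
		handlers.put(locationPath, handler);
	}

	// Only flips a flag; the loop in parse() notices it once the current event is done.
	void stop() {
		stopped = true;
	}

	void parse(String xml, T userObject) {
		stopped = false;
		StringBuilder path = new StringBuilder();
		StringBuilder text = new StringBuilder();
		try {
			XMLStreamReader reader = XMLInputFactory.newInstance()
					.createXMLStreamReader(new StringReader(xml));
			while (!stopped && reader.hasNext()) {
				switch (reader.next()) {
				case XMLStreamConstants.START_ELEMENT:
					path.append('/').append(reader.getLocalName());
					text.setLength(0);
					break;
				case XMLStreamConstants.CHARACTERS:
				case XMLStreamConstants.CDATA:
					text.append(reader.getText());
					break;
				case XMLStreamConstants.END_ELEMENT:
					TextHandler<T> handler = handlers.get(path.toString());
					if (handler != null) {
						handler.handle(this, text.toString().trim(), userObject);
					}
					path.setLength(path.lastIndexOf("/"));
					text.setLength(0);
					break;
				default:
					break;
				}
			}
			reader.close();
		} catch (XMLStreamException e) {
			// Unchecked, so callers need no try/catch of their own.
			throw new IllegalStateException("Could not parse XML", e);
		}
	}

	// Collects item titles into the pass-through list and stops after 'limit' (>= 1) of them.
	static List<String> firstTitles(String xml, int limit) {
		MiniRuleParser<List<String>> parser = new MiniRuleParser<>();
		parser.addRule("/rss/channel/item/title", (p, text, out) -> {
			out.add(text);
			if (out.size() >= limit) {
				p.stop();
			}
		});
		List<String> titles = new ArrayList<>();
		parser.parse(xml, titles);
		return titles;
	}

	public static void main(String[] args) {
		String xml = "<rss><channel>"
				+ "<item><title>A</title></item>"
				+ "<item><title>B</title></item>"
				+ "<item><title>C</title></item>"
				+ "</channel></rss>";
		System.out.println(firstTitles(xml, 2));
	}
}
```

Because stop() never interrupts the parser mid-event, a handler can call it at any point without leaving the parser in a half-finished state.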

The XML-versed among you have probably had the following question on your mind: “What about namespace-qualified elements or attributes?”

Great question, that was one of the core design elements of SJXP.

You actually specify a namespace-qualified element using simple bracket-notation prefixed to the name of the element or attribute you want to parse. Enough talk, here is a real-world example.

Let’s say I am parsing an RSS2 feed that uses the Dublin Core metadata spec (it is very popular; many feeds do). Now let’s say we wanted to parse the <dc:subject> elements (0 or more) out of the root <channel> element to store as a categorization of the site’s content.

The rule to do that would look like this:

IRule channelSubjectRule = new DefaultRule(Type.CHARACTER, "/rss/channel/[http://purl.org/dc/elements/1.1/]subject") {
	@Override
	public void handleParsedCharacters(XMLParser parser, String text, Object userObject) {
		// Got the Channel's dc:subject value! I win!
	}
};

You do the same for attributes that are namespace-qualified as well, like <item rdf:about="…">.

The reason the namespace URI is used instead of the namespace prefix is that namespace prefixes are document-specific; typically they follow a well-known format (e.g. “dc” for “Dublin Core”) but not always. By using the entire namespace URI, ambiguity is completely removed from the resolution process and errors or mistypes can be more easily caught and avoided.

NOTE: Rule-matching is exact in SJXP. If you are looking for an element or attribute value that may or may not be namespace qualified, you need to create two rules; one for each scenario.
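
To illustrate why URI-based paths are prefix-independent, and why a qualified and an unqualified element need separate rules, here is a small sketch that builds bracket-notation paths from a namespace-aware pull parser. It again uses the JDK's StAX parser rather than SJXP, and QualifiedPathExample, segment and elementPaths are hypothetical names invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustration only: builds location paths in the bracket notation described above.
final class QualifiedPathExample {

	// Formats one path segment, using bracket notation when a namespace URI is present.
	static String segment(String namespaceUri, String localName) {
		if (namespaceUri == null || namespaceUri.isEmpty()) {
			return "/" + localName;
		}
		return "/[" + namespaceUri + "]" + localName;
	}

	// Returns the location path of every element, in document order.
	static List<String> elementPaths(String xml) {
		List<String> paths = new ArrayList<>();
		StringBuilder path = new StringBuilder();
		Deque<Integer> parentLengths = new ArrayDeque<>(); // path length before each open element
		try {
			XMLStreamReader reader = XMLInputFactory.newInstance()
					.createXMLStreamReader(new StringReader(xml));
			while (reader.hasNext()) {
				int event = reader.next();
				if (event == XMLStreamConstants.START_ELEMENT) {
					parentLengths.push(path.length());
					path.append(segment(reader.getNamespaceURI(), reader.getLocalName()));
					paths.add(path.toString());
				} else if (event == XMLStreamConstants.END_ELEMENT) {
					path.setLength(parentLengths.pop()); // back to the parent's path
				}
			}
			reader.close();
		} catch (XMLStreamException e) {
			throw new IllegalStateException("Could not parse XML", e);
		}
		return paths;
	}

	public static void main(String[] args) {
		String uri = "http://purl.org/dc/elements/1.1/";
		String withDc = "<rss xmlns:dc='" + uri + "'><channel>"
				+ "<dc:subject>x</dc:subject></channel></rss>";
		String withOtherPrefix = "<rss xmlns:meta='" + uri + "'><channel>"
				+ "<meta:subject>x</meta:subject></channel></rss>";
		String unqualified = "<rss><channel><subject>x</subject></channel></rss>";

		// Different prefixes, same URI: the paths are identical.
		System.out.println(elementPaths(withDc).equals(elementPaths(withOtherPrefix)));
		// No namespace: a different path, so it would need its own rule.
		System.out.println(elementPaths(unqualified).get(2));
	}
}
```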

Android Example

If you are looking for examples in straight Java code, there are a lot of easy-to-read examples in the test cases that you can peek through.

If you want to see an example of how to use/deploy SJXP inside of an Android app, you can download this Example SJXP Android App that I put together.

This project has been updated for SJXP 2.0+.

NOTE: It is nothing more complicated than creating a /libs dir at the root of your app, and dropping the sole SJXP JAR file into it (no other dependencies required)!

You will need to have the Eclipse ADT (Android Dev Tools) plugin installed and will need to edit the project properties to set the AVD (Android Virtual Device) you want to run the project with because the ones I used and the ones you have are probably named differently.

Intended Audience

Anyone working on a Java platform (Android, J2ME, client Java, web application, web service… anything) that wants a simple and very fast way to parse XML.

SJXP is not a binding framework. It also doesn’t support writing XML.

Download

All download bundles include the library JAR, source code and Javadoc.

Javadoc & Source

Maven

The Buzz Media provides access to its libraries via its own Maven Repository. Sources and Javadoc are available from the repository.

Edit your pom.xml and add the following to your <repositories> section (be sure to add one if you don’t have one already):

<repository>
	<id>The Buzz Media Maven Repository</id>
	<url>http://maven.thebuzzmedia.com</url>
</repository>

Then edit your pom.xml and add the following dependency to your <dependencies> section (be sure to add one if you don’t have one already):

<dependency>
	<groupId>com.thebuzzmedia</groupId>
	<artifactId>sjxp</artifactId>
	<version>2.2</version>
	<type>jar</type>
	<scope>compile</scope>
</dependency>

Be sure to adjust the version value to whichever version of SJXP you want to grab.

NOTE: At this time we are not providing checksums for the files on our repository, so you will see “[WARNING] Checksum validation failed” messages from Maven, but they can be safely ignored. Writing a release POM for Maven is beyond my capabilities as a human at this time.

Bug Reports

If you find a bug, please create a new issue with a description of the problem and how to reproduce it. Example files help a lot!

Feature Requests, Comments & Feedback

Please email me at software@thebuzzmedia.com; I would love to hear from you.