common-parser-lib (Common Parser Java Library)

Source and Downloads @ GitHub

Description

“Parsing” is a broad topic with many complex details woven into it. Beyond the different types of parsers, there are a handful of common approaches to parsing, namely:

  • Tokenization
  • Scanning
  • Callback-based Parsing
  • Event-based Parsing

This library provides a foundation of general patterns for all of these approaches, along with base implementations that cover the details most parsers share, such as stream reading and buffer (re)filling. Every implementation of these parser types needs those details, but they are a pain to write and rewrite.

Great care was taken to ensure that this library provides “levels” of Abstract base implementations to be extended, with no single layer assuming so much about your design that it is effectively useless if you ever want to do something different with it.

For example, if you need a custom parser that parses java.io.File objects directly, you might extend AbstractParser. If you want to write a tokenizer that processes streams, you could extend AbstractStreamTokenizer. Each of these base implementations is meant to give you an increasing amount of written and tested boilerplate that you don’t have to write yourself.

The goal of the library is to provide Abstract base implementations specific enough that people using the library can extend any base class and simply fill in one or two method hooks that mark the bounds of tokens in the data being parsed; the base implementation should take care of the rest of the details.

For example, if you want to parse the content of a java.io.File delimited by the string “<chapter>”, you should only need to extend a base implementation and then write a few lines that scan a byte[] or char[] buffer for the data you are looking for.
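To give a feel for what those “few lines” might look like, here is a minimal sketch of the buffer-scanning step alone. The class DelimiterScan and its indexOf method are hypothetical names made up for illustration; they are not part of this library, and the hook you would actually override in a base class may look different.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not a class from common-parser-lib.
final class DelimiterScan {

    /**
     * Returns the index of the first occurrence of delimiter inside
     * buffer[from, limit), or -1 if it does not occur in that range.
     */
    static int indexOf(byte[] buffer, int from, int limit, byte[] delimiter) {
        int last = limit - delimiter.length; // last index where a full match still fits
        outer:
        for (int i = from; i <= last; i++) {
            for (int j = 0; j < delimiter.length; j++) {
                if (buffer[i + j] != delimiter[j]) {
                    continue outer; // mismatch, try the next start position
                }
            }
            return i; // every byte of the delimiter matched
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = "intro<chapter>one<chapter>two".getBytes(StandardCharsets.US_ASCII);
        byte[] delim = "<chapter>".getBytes(StandardCharsets.US_ASCII);

        int first = indexOf(data, 0, data.length, delim);
        int second = indexOf(data, first + delim.length, data.length, delim);
        System.out.println(first);  // prints 5
        System.out.println(second); // prints 17
    }
}
```

This sketch only covers a delimiter that lies entirely inside one buffer. A delimiter that straddles a buffer refill needs extra handling, which is the kind of detail the base implementations are described as taking care of.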

Concrete implementations for more common parser types are included when applicable. Extension of classes in this library is meant to make writing your own parser as painless as possible.

Design

All parsers in this library share two common designs:

  1. They will throw a detailed ParseException if something goes wrong.
  2. They produce instances of IToken when they successfully mark data.

There are also a few different approaches to parsing that can be used:

  • ITokenizer - A stateful style of parsing that starts by wrapping existing content in a tokenizer, then loops calling nextToken() until the tokenizer reports that it can find no more tokens in the input you gave it.
  • IScanner - A stateless style of parsing whereby a scanner (typically a static utility class) is given an input source that it scans completely and returns a List of ITokens found in the input.
  • ICallbackParser - A stateful style of parsing where a parser is passed both an input source and an ICallback to be invoked every time an IToken is found in the input. Parsers of this type are typically “stoppable”.
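
The difference between the three styles is easiest to see side by side. The sketch below splits a String on spaces in each style. It deliberately does not use the library’s interfaces: WordTokenizer, scan and parse are hypothetical stand-ins, tokens are plain Strings rather than ITokens, and modelling “stoppable” as a callback that returns false is an assumption made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative stand-ins only; these are not the library's ITokenizer,
// IScanner or ICallbackParser types.
final class ParsingStyles {

    /** Tokenizer style: wraps the input and hands out one token per call. */
    static final class WordTokenizer {
        private final String input;
        private int pos = 0; // state kept between nextToken() calls

        WordTokenizer(String input) {
            this.input = input;
        }

        /** Returns the next space-delimited word, or null when none are left. */
        String nextToken() {
            while (pos < input.length() && input.charAt(pos) == ' ') {
                pos++; // skip leading spaces
            }
            if (pos >= input.length()) {
                return null;
            }
            int start = pos;
            while (pos < input.length() && input.charAt(pos) != ' ') {
                pos++;
            }
            return input.substring(start, pos);
        }
    }

    /** Scanner style: scans the whole input in one call and returns every token. */
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        WordTokenizer tokenizer = new WordTokenizer(input);
        for (String t = tokenizer.nextToken(); t != null; t = tokenizer.nextToken()) {
            tokens.add(t);
        }
        return tokens;
    }

    /**
     * Callback style: invokes the callback for each token found. The callback
     * returns false to stop parsing early. Returns the number of tokens delivered.
     */
    static int parse(String input, Predicate<String> callback) {
        int delivered = 0;
        WordTokenizer tokenizer = new WordTokenizer(input);
        for (String t = tokenizer.nextToken(); t != null; t = tokenizer.nextToken()) {
            delivered++;
            if (!callback.test(t)) {
                break; // the callback asked the parser to stop
            }
        }
        return delivered;
    }
}
```

With this sketch, ParsingStyles.scan("a bb ccc") returns the three words as a List, while ParsingStyles.parse("a bb ccc", t -> !t.equals("bb")) stops after delivering the second word.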

The base interfaces for all these parser types are defined using a generous set of generics, allowing for very flexible-but-compatible implementations to follow. A series of increasingly specific interfaces and Abstract base implementations is provided to offer more focused extension points; for example, IStreamTokenizer for folks parsing InputStreams.

The generic types on all base interfaces are defined as follows:

  • <IT> “input type”, the type of the input the parser is consuming.
  • <TT> “token type”, a type defined to uniquely identify the token. This is not required (pass <Void> as the type if you don’t need it), but more complex parsing scenarios require a way to identify the different types of tokens being generated, so support for that is built in directly.
  • <VT> “value type”, the type of the token’s value returned by IToken.getValue().
  • <ST> “source type”, the type of the token’s source returned by IToken.getSource().

In the original versions of this library, <ST>, <VT> and <IT> were all the same type (e.g. byte[]), but it quickly became apparent that you could just as easily have a parser processing InputStreams (IT type) that utilizes ByteBuffers internally as the token’s source (ST type), while the generated ITokens return char[]s (VT type) from calls to getValue(). Because of this, separate types for each piece of the parser puzzle were introduced.
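
As a rough illustration of how those type parameters can differ, the sketch below declares a token whose source is a ByteBuffer (ST) and whose value is a char[] (VT). SketchToken is a hypothetical stand-in for IToken: only getValue() and getSource() are named in this document, so getType(), the Kind enum, the wordToken factory and the ordering of the type parameters are all assumptions made for the example.

```java
import java.nio.ByteBuffer;

// Hypothetical types for illustration; not the library's actual declarations.
final class GenericTypesSketch {

    /** Stand-in for IToken with separate token, value and source types. */
    interface SketchToken<TT, VT, ST> {
        TT getType();   // assumed accessor for the "token type"
        VT getValue();  // "value type", e.g. char[]
        ST getSource(); // "source type", e.g. ByteBuffer
    }

    /** Example "token type" used to tell kinds of tokens apart. */
    enum Kind { WORD }

    /**
     * Creates a token over source[offset, offset + length). The value is decoded
     * on demand, treating each byte as one ASCII/Latin-1 character.
     */
    static SketchToken<Kind, char[], ByteBuffer> wordToken(
            final ByteBuffer source, final int offset, final int length) {
        return new SketchToken<Kind, char[], ByteBuffer>() {
            @Override
            public Kind getType() {
                return Kind.WORD;
            }

            @Override
            public char[] getValue() {
                char[] value = new char[length];
                for (int i = 0; i < length; i++) {
                    // absolute get: does not move the buffer's position
                    value[i] = (char) (source.get(offset + i) & 0xFF);
                }
                return value;
            }

            @Override
            public ByteBuffer getSource() {
                return source;
            }
        };
    }
}
```

A parser that fed this token from an InputStream would use InputStream as its input type (IT), giving four distinct types in total.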

If you are implementing a very simple parser that does operate on all the same types, it is easy enough to define: just pass the same type (e.g. String) for all the generic type parameters.