Amazon CloudFront Log Parser

Sections

  1. Description
  2. License
  3. What Problem Does this Solve?
  4. Performance
  5. Design
  6. Usage
  7. Intended Audience
  8. Download
  9. Javadoc & Source
  10. Maven
  11. Bug Reports
  12. Feature Requests, Comments & Feedback

Description

CloudFront Log Parser is a high-performance Java library offering a simple and adaptive parser for both the Download and Streaming CloudFront log file formats.

The simplicity and adaptivity of the CloudFront Log Parser come from the following two features of the library:

  1. Auto-detection of the log format at parse-time.
  2. Ability to detect and ignore unknown or corrupt field names.

#1 makes the library easy to use (a single LogParser class). #2 makes the library safe to deploy in long-running environments (e.g. a log-processing server): if Amazon one day rolls out a change to the log format, your log-processing server will not suddenly throw so many exceptions that the VM halts.

The CloudFront Log Parser API was designed to be simple to use and flexible enough to plug into any use-case. The LogParser class itself consumes raw InputStreams from the .gz log files. No need to wrap the streams with buffers or worry about them being local or remote resources; the parser doesn’t care.

Because of this design, the CloudFront Log Parser integrates easily with Amazon’s existing AWS Java SDK: the library can consume the InputStream returned directly by the S3 API’s S3Object.getObjectContent() method.

The CloudFront Log Parser is intended to be used in any deployment scenario.

License

This software is licensed under the Apache 2 License.

What Problem Does this Solve?

Amazon promotes the use of their Elastic MapReduce product for processing CloudFront logs, which is a good suggestion if you don’t mind incurring the additional cost of using more AWS services.

For folks wanting to parse the logs themselves (e.g. they already have a server available) there are few, if any, Java-based solutions online. Hand-writing a log parser that is efficient and robust enough to run unattended in a headless environment without stopping is not a simple thing to knock out in an afternoon.

The CloudFront Log Parser Java Library solves that problem by providing an extremely fast and simple-to-integrate library meant to make parsing CloudFront log files easy.

Performance

Benchmarks can be found in the /src/test/java folder of the source tree and can be run directly from the command line.

Hardware

  • Java 1.6.0_24 on Windows 7 64-bit
  • Dual Core Intel E6850 processor
  • 8 GB of RAM

Benchmark Results

  1. Parsed 100 Log Entries in 27ms (3,703 entries / sec)
  2. Parsed 100,000 Log Entries in 864ms (115,740 entries / sec)
  3. Parsed 174,200 Log Entries in 1341ms (129,903 entries / sec)
  4. Parsed 1,000,000 Log Entries in 7520ms (132,978 entries / sec)

The Amazon CloudFront docs say log files are truncated at a maximum size of 50MB (uncompressed) before they are written out to the log directory. The 3rd test, parsing 174,200 log entries, is exactly 50MB uncompressed and matches this worst-case scenario. That means that with the CloudFront Log Parser Java Library you can parse a 50MB log file in a little over a second on equivalent hardware.
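The rates in the results list are simply entries divided by elapsed time, truncated to a whole number. A small sketch that reproduces the arithmetic (the class and method names are illustrative and not part of the library):

```java
// Illustrative helper for checking the benchmark figures; not part of the library.
final class ThroughputMath {
    // Entries per second, truncated to a whole number as in the results list.
    static long entriesPerSecond(long entries, long elapsedMillis) {
        return entries * 1000L / elapsedMillis;
    }

    public static void main(String[] args) {
        // The four (entries, milliseconds) pairs reported in the benchmark results.
        long[][] runs = { {100, 27}, {100000, 864}, {174200, 1341}, {1000000, 7520} };
        for (long[] run : runs) {
            System.out.println(run[0] + " entries in " + run[1] + "ms = "
                    + entriesPerSecond(run[0], run[1]) + " entries / sec");
        }
    }
}
```

For the 50MB case this works out to roughly 37MB of uncompressed log data per second (50MB / 1.341s).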

If you are running on faster server hardware your single-threaded rate will be higher than what you see here. If you decide to parse logs in a heavily-threaded environment (1 LogParser instance per Thread) then your aggregate parse rate could be many times higher.

The CloudFront Log Parser Java Library was engineered to be very efficient.

Design

CloudFront Log Parser was designed, first and foremost, to execute as fast as possible with as little memory allocation as possible to avoid thrashing the host VM running it. It was designed from the ground up to be effective at parsing huge volumes of log entries.

Object creation is kept to a minimum during parsing (no matter how big or long the parse operation) by re-using a single ILogEntry instance (either DownloadLogEntry or StreamingLogEntry depending on the log type), per LogParser instance. The ILogEntry instance is re-used every time a log entry line is parsed and delivered to the caller-provided ILogParserCallback instead of creating a new ILogEntry instance every single time.

NOTE: LogParser instances are safe to re-use, but are not Thread-safe. I recommend a 1-Parser-per-Thread approach for multi-threaded environments.
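One way to arrange a 1-Parser-per-Thread setup is a ThreadLocal, so each worker thread lazily creates its own parser and re-uses it across jobs. The sketch below is only an illustration: StandInParser is a hypothetical placeholder for a stateful, non-thread-safe parser such as LogParser, and countAll is an invented helper name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParserPerThread {
    // Hypothetical stand-in for a re-usable but non-thread-safe parser.
    static final class StandInParser {
        private int linesSeen; // mutable state: unsafe to share across threads

        int countLines(String logText) {
            linesSeen = 0;
            for (String line : logText.split("\n")) {
                if (!line.isEmpty()) linesSeen++;
            }
            return linesSeen;
        }
    }

    // Each worker thread gets its own parser on first use and keeps re-using it.
    private static final ThreadLocal<StandInParser> PARSER =
            ThreadLocal.withInitial(StandInParser::new);

    static int countAll(List<String> logs, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (final String log : logs) {
                Callable<Integer> task = () -> PARSER.get().countLines(log);
                results.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> result : results) total += result.get();
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("parse job failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With the real library, the ThreadLocal would presumably hold a LogParser and each task would call parse(...) with its own stream and callback.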

The trade-off for this performance optimization is that the ILogEntry instances are volatile. The instance is valid for the scope of the call to the callback, but once that callback method returns, the ILogEntry is re-used by the parser and its values are reset and re-populated.

ILogParserCallback implementations should never hold onto ILogEntry instances. Pull the values out of them, store/process/manipulate those instead.

Even though the ILogEntry instances themselves are volatile, the values returned by the getFieldValues methods are safe to store or keep references to.
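To make the pitfall concrete, here is a minimal sketch of what goes wrong when a callback keeps the entry itself rather than its values. ReusedEntry is a hypothetical stand-in that mimics the re-use behavior described above; it is not the library's ILogEntry.

```java
import java.util.ArrayList;
import java.util.List;

final class EntryReuseSketch {
    // Hypothetical stand-in for a re-used entry: one instance, overwritten per line.
    static final class ReusedEntry {
        private char[] ip;
        void reset(String newIp) { ip = newIp.toCharArray(); }
        char[] getIp() { return ip; }
    }

    // WRONG: stores the entry, so every list element is the same overwritten object.
    static List<ReusedEntry> holdEntries(String[] ips) {
        ReusedEntry shared = new ReusedEntry();
        List<ReusedEntry> kept = new ArrayList<>();
        for (String ip : ips) {
            shared.reset(ip);   // the "parser" re-populates its single instance
            kept.add(shared);   // the "callback" holds onto the volatile entry
        }
        return kept;
    }

    // RIGHT: pulls the value out during the callback and stores that instead.
    static List<String> copyValues(String[] ips) {
        ReusedEntry shared = new ReusedEntry();
        List<String> kept = new ArrayList<>();
        for (String ip : ips) {
            shared.reset(ip);
            kept.add(new String(shared.getIp()));
        }
        return kept;
    }
}
```

In the first method every stored element ends up reporting the last IP parsed; in the second, each stored String keeps the value it had when the callback ran.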

This design was chosen to ensure that long-running, high-volume log processing jobs avoid creating millions of garbage objects that would cause the host VM to thrash on long GC cycles, slowing the log processing and increasing volatility in the VM (e.g. starving heap space from other processes running inside the same VM).

Usage

The CloudFront Log Parser Java Library is built on a callback-based model of execution. You create a LogParser instance and an implementation of ILogParserCallback to handle each parsed log line, then pass the callback and a raw InputStream pointing at the log file to LogParser.parse(…). From there you can sit back and receive parsed log lines in your handler method.

Execution would look like this:

LogParser parser = new LogParser();
InputStream stream = null; // placeholder: replace with a raw stream for a .gz log file
 
parser.parse(stream, new ILogParserCallback() {
  public void logEntryParsed(ILogEntry entry) {
    String ip = new String(entry.getFieldValue("c-ip"));
    String file = new String(entry.getFieldValue("cs-uri-stem"));
    String bytes = new String(entry.getFieldValue("sc-bytes"));
 
    System.out.println("Client from " + ip + " downloaded " + file
                            + " using " + bytes + " bytes.");
  }
});

NOTE: To make the parser as efficient as possible, field values are always returned as char[] to avoid unnecessary String allocations. If you need Strings, they are easy to create.
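If you need a number rather than a String (for example to total up sc-bytes), you can parse the char[] directly and skip the intermediate String altogether. A minimal sketch, where parseUnsignedLong is a hypothetical helper of my own naming and not a library method:

```java
// Illustrative helper for working with char[] field values; not part of the library.
final class CharFields {
    // Parses an unsigned decimal value straight from a char[] without allocating
    // a String. Overflow is not checked, so it assumes the value fits in a long.
    static long parseUnsignedLong(char[] digits) {
        if (digits == null || digits.length == 0) {
            throw new NumberFormatException("empty field");
        }
        long value = 0;
        for (char c : digits) {
            if (c < '0' || c > '9') {
                throw new NumberFormatException("not a digit: " + c);
            }
            value = value * 10 + (c - '0');
        }
        return value;
    }
}
```

Inside a callback this would be used along the lines of CharFields.parseUnsignedLong(entry.getFieldValue("sc-bytes")). Any non-digit content is rejected with a NumberFormatException, so callers should decide how to treat fields that do not hold a plain number.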

There are other methods defined on the ILogEntry interface that allow for ultra-efficient retrieval of the values contained; you don’t have to query by the field name if you don’t want to.

For example you can query by index (starts at 0, use getFieldCount() to get the total) or you can get the entire backing 2D array of values (getFieldValues()) where each field value is represented by its own char[] (0-indexed). This isn’t a copy of the array, but the raw backing array, so be careful with it; sticking to read-only behavior with it is the best approach.

The library was intentionally designed with increasing layers of accessibility; if you are just getting started you can query by name. It uses a few extra cycles to do the lookup that maps names to indices, but it’s still fast and it’s very easy to use.

On the other hand if you are integrating the library into a system that will be processing billions of records and needs to run as fast as possible, just pull the backing 2D array and cut through it manually (the fields it contains are ordered per Amazon’s specification).
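As a rough sketch of the index-based approach, the helper below reads two fields from a char[][] of the shape getFieldValues() is described as returning. The index constants are assumptions based on the usual field order of the Download format; check them against the #Fields header of your own logs and the library's Javadoc before relying on them.

```java
// Illustrative only; the class, method and index constants are not part of the library.
final class BackingArrayWalk {
    // ASSUMED positions of two Download-format fields; verify before use.
    static final int SC_BYTES = 3;
    static final int C_IP = 4;

    // Reads straight from the backing array by index. Access is strictly
    // read-only: the array belongs to the parser and will be re-used.
    static String describe(char[][] fields) {
        if (fields == null || fields.length <= C_IP) {
            return "short entry";
        }
        return new String(fields[C_IP]) + " " + new String(fields[SC_BYTES]);
    }
}
```

Here the caller pays for the name-to-index mapping once, at development time, instead of on every parsed line.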

Intended Audience

Anyone wanting to parse their CloudFront logs (from a Download or Streaming distribution) quickly without the use of a separate service.

Download

All download bundles include the library JAR, source code and Javadoc.

Javadoc & Source

Maven

The Buzz Media provides access to its libraries via its own Maven Repository. Sources and Javadoc are available from the repository.

Edit your pom.xml and add the following to your <repositories> section (be sure to add one if you don’t have one already):

<repository>
	<id>The Buzz Media Maven Repository</id>
	<url>http://maven.thebuzzmedia.com</url>
</repository>

Then edit your pom.xml and add the following dependency to your <dependencies> section (be sure to add one if you don’t have one already):

<dependency>
	<groupId>com.thebuzzmedia</groupId>
	<artifactId>cloudfront-log-parser</artifactId>
	<version>1.4</version>
	<type>jar</type>
	<scope>compile</scope>
</dependency>

Be sure to adjust the version value to whichever version of CloudFront Log Parser you want to grab.

NOTE: At this time we are not providing checksums for the files on our repository, so you will see “[WARNING] Checksum validation failed” messages from Maven, but they can be safely ignored. Writing a release POM for Maven is beyond my capabilities as a human at this time.

Bug Reports

If you find a bug, please create a new issue with a description of the problem and how to reproduce it. Example files help a lot!

Feature Requests, Comments & Feedback

Please email me at software@thebuzzmedia.com; I would love to hear from you.