Java IO Faster Than NIO – Old is New Again!

Alex Blewitt tweeted an article by Paul Tyma titled “Thousands of Threads and Blocking I/O: The old way to write Java servers is new again”. Paul is the Founder/CEO of ManyBrain and the creator of Mailinator.

Paul’s 65-slide presentation is a fast read for anyone interested in Java I/O, especially in a client/server setup. What makes the presentation interesting is that Paul began his research of IO vs NIO with the presumption that all Java developers are running around with: NIO is faster than IO because it’s asynchronous and non-blocking.

The more research he did, the more he found everyone repeating that claim, but a complete lack of benchmarks and research to back it up. Paul sat down and wrote a quick “blast the server with data” benchmark and found that in every case the NIO-based server was 25% slower than the blocking, thread-based IO server.
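
For context, the blocking, thread-per-connection model in that comparison boils down to something like the following sketch (the port and buffer size here are arbitrary; this is not Paul’s benchmark code):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.ServerSocket;
    import java.net.Socket;

    public class BlockingEchoServer {
        public static void main(String[] args) throws IOException {
            ServerSocket server = new ServerSocket(9000);
            while (true) {
                // accept() blocks until a client connects...
                final Socket client = server.accept();
                // ...then a dedicated thread services that connection with plain blocking reads/writes.
                new Thread(new Runnable() {
                    public void run() {
                        try {
                            InputStream in = client.getInputStream();
                            OutputStream out = client.getOutputStream();
                            byte[] buf = new byte[8192];
                            int read;
                            while ((read = in.read(buf)) != -1) {
                                out.write(buf, 0, read); // echo the bytes straight back
                            }
                        } catch (IOException ignored) {
                        } finally {
                            try { client.close(); } catch (IOException ignored) { }
                        }
                    }
                }).start();
            }
        }
    }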

Researching further, Paul found others online who had come across the exact same performance discrepancies. Here is a relevant quote from Rahul Bhargava, CTO of Rascal Systems:


The blocking model was consistently 25-35% faster than using NIO selectors. A lot of the techniques suggested by the EmberIO folks were employed – using multiple selectors, doing multiple (2) reads if the first read returned the EAGAIN equivalent in Java. Yet we couldn’t beat the plain thread-per-connection model with Linux NPTL.

To work around the not-so-performant/scalable poll() implementation on Linux, we tried using epoll with the Blackwidow JVM on a 2.6.5 kernel. While epoll improved the overall scalability, the performance still remained 25% below the vanilla thread-per-connection model. With epoll we needed a lot fewer threads to get to the best performance mark that we could get out of NIO.

Paul goes on to make the case that the reason blocking I/O is now the new (old) way to write servers is the extremely low cost of thread synchronization in modern operating systems, combined with multi-core systems becoming the norm. Paul’s benchmark of thread contention, whether with 2 threads or 1000 threads, shows that the work the OS needs to do to keep all those threads actively engaged is essentially constant, with the cost of an idle thread damn near zero.
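
That claim is easy to poke at yourself. A rough sketch of this kind of micro-benchmark (not Paul’s actual harness; the thread counts and iteration count are arbitrary) might look like:

    import java.util.concurrent.CountDownLatch;

    public class IdleThreadCostSketch {

        private static final Object LOCK = new Object();
        private static long counter;

        public static void main(String[] args) throws InterruptedException {
            final int idleThreads = 1000;   // threads that just sleep, standing in for idle connections
            final int busyThreads = 8;      // threads actually contending on the shared lock
            final long iterations = 1000000L;

            // Park a large number of idle threads; if the argument holds, their cost should be near zero.
            for (int i = 0; i < idleThreads; i++) {
                Thread t = new Thread(new Runnable() {
                    public void run() {
                        try { Thread.sleep(60000); } catch (InterruptedException ignored) { }
                    }
                });
                t.setDaemon(true);
                t.start();
            }

            // Time the contended work; re-run with idleThreads = 0 and compare.
            final CountDownLatch done = new CountDownLatch(busyThreads);
            long start = System.nanoTime();
            for (int i = 0; i < busyThreads; i++) {
                new Thread(new Runnable() {
                    public void run() {
                        for (long j = 0; j < iterations; j++) {
                            synchronized (LOCK) { counter++; }
                        }
                        done.countDown();
                    }
                }).start();
            }
            done.await();
            long elapsedMs = (System.nanoTime() - start) / 1000000L;
            System.out.println(busyThreads + " threads x " + iterations + " locked increments: "
                    + elapsedMs + " ms (with " + idleThreads + " idle threads alive)");
        }
    }

If Paul’s argument holds, the timing should barely move as the idle thread count grows.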

For folks that remember the days of Java 1.1 and 1.2, chat clients, and giant application servers, you’ll remember that blocking I/O was the bane of Java’s high-performance-server existence. It seems that while the software community solved that problem with Java’s NIO libraries, the OS and hardware communities solved the original problem of expensive threads with advanced OS threading libraries like NPTL and multi-core machines.

Naturally the union of both approaches is the “perfect mix”, and that is what Paul has touched upon in his presentation: software is advanced enough to give us what it previously could not, and hardware has improved enough to magnify that benefit many times over.

Paul’s presentation goes on to discuss developing high-performance services and the implementation details of things like work queues, blocking vs non-blocking work queues, waking sleeping threads, and how to scale the different approaches out to huge problem sets like the one behind his own Mailinator service (which handles millions of emails a day).

It’s an excellent read if you get a chance to scroll through it.


About Riyad Kalla

Software development, video games, writing, reading and anything shiny. I ultimately just want to provide a resource that helps people and if I can't do that, then at least make them laugh.


25 Responses to “Java IO Faster Than NIO – Old is New Again!”

  1. brian bulkowski July 27, 2010 at 11:37 am #

    Testing on a 2.6.5 kernel is unreasonable. The standard data center deployments are on 2.6.18 (CentOS 5), and massive improvements have been made in modern kernels (2.6.32+), especially regarding interrupt routing. Finally, there are a few JVMs that I think are in contention for testing: IBM’s, IcedTea, and Sun’s 1.6 series, but the Blackwidow JVM doesn’t make my list.

    This kind of test is critical, but without testing a reasonable deployment configuration, this isn’t enough information to even consider changing one’s interface choices.

    One of the best benefits of NIO is efficient binary packing of data and dealing with signed/unsigned problems, which is difficult to get right and inefficient at run-time with the plain IO streams.

    That having been said, I have two versions of the Citrusleaf library, one using NIO and one using thread-per-request, and thread-per-request wins in most situations. The object creation and destruction overhead of NIO loses when the core count is high.
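
    For reference, the binary-packing convenience being referred to is essentially ByteBuffer’s typed, byte-order-aware accessors; a small sketch (the field layout is made up for illustration):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class PackingSketch {

        // Pack a little-endian record without hand-rolling shifts and sign masks.
        static byte[] pack(int id, short flags, long timestamp) {
            ByteBuffer buf = ByteBuffer.allocate(14).order(ByteOrder.LITTLE_ENDIAN);
            buf.putInt(id).putShort(flags).putLong(timestamp);
            return buf.array();
        }

        // The stream/byte[] equivalent needs manual byte swapping and & 0xFF masking to stay unsigned-safe.
        static int readUnsignedShort(byte[] data, int offset) {
            return (data[offset] & 0xFF) | ((data[offset + 1] & 0xFF) << 8);
        }
    }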

    • Riyad Kalla July 27, 2010 at 11:46 am #

      Brian,

      I’m guessing your comment about Linux kernel versions was from testing Citrusleaf’s thread/NIO implementations on the different versions? Do you have a benchmark and/or numbers from that testing that you have published for these platforms and different VMs?

      Interesting data point about the different versions of your library. I wasn’t aware that the NIO/selector approach was object-creation heavy.

      • brian bulkowski July 27, 2010 at 3:14 pm #

        I don’t have specific benchmark numbers, and my implementation is different from what you’ve been discussing here. But a benchmark based on 2.6.5 at this point is a non-issue.

        My experience with the NIO/streams interface was running on a Niagara T1000. In that system, and with the Sun 1.5 JVM we were using, the system bottlenecked on object creation and destruction. However, the 1.5 JVM is far in the rearview mirror, as is the Niagara T1000 multicore/multithread architecture. So my experience there is also a non-issue, just like the benchmarks you’ve quoted.

        In my code – your mileage may vary – the number of objects I had to create and destroy with the NIO interface was higher than using read() and write() with byte[]. There is a methodology where one codes to byte[] and uses that 100% with NIO, but I didn’t try that – one of the benefits of NIO/streams should be the ability to use the integer byte-swapping methods.

        If I have a gripe here, it’s that there are two interfaces with vastly different performance, and one ends up with a fairly large implementation decision. What happened to write once run anywhere?

        • Richard July 28, 2010 at 5:17 pm #

          I wouldn’t be so dismissive. It’s inevitable that >= 2.6.32 will be the norm. It also suggests that if you’re developing a system today, it makes sense to look into this further.

          At any rate, I find it really interesting that we’ve had this meme floating around for so long and no one has stopped to ask if it’s still true. I wonder how many of those are out there ;-)

          - Richard

  2. Mikael Grev July 27, 2010 at 4:04 pm #

    NIO is only faster when you need really high throughput (GB/s+) and you are using native (direct) buffers (or -server with small loops). When speeds go below that, the implementation matters more.
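
    For reference, the direct-buffer case looks roughly like this (a sketch; the channel and buffer size are hypothetical):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    public class DirectBufferPump {

        // Echo everything readable from the channel back out, reusing one direct buffer.
        // Direct buffers live outside the Java heap, so the JVM can hand them to the OS without an extra copy.
        static void pump(SocketChannel channel) throws IOException {
            ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024);
            while (channel.read(buf) != -1) {
                buf.flip();
                while (buf.hasRemaining()) {
                    channel.write(buf);
                }
                buf.clear();
            }
        }
    }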

  3. Ivan July 27, 2010 at 6:44 pm #

    Sounds about right. I knew this from experience but never had the time to quantify the exact difference. In fact, my experience is that this has been the case ever since NIO was introduced.

    Well done Paul for quantifying this.

  4. Shivam Kumar July 27, 2010 at 10:18 pm #

    I think Paul has explained all the reasons for the good performance of blocking IO. But what remains unexplained is why NIO performs 25% worse than blocking IO. Is it the multiplexer that makes NIO slow, or the channels?

    Actually, I created a NIO client where one thread was always dedicated to the multiplexer (so that events are served as quickly as they arrive) and all the IO activities were off-loaded to worker threads. I tested keeping the thread-pool size (of worker threads) the same as the thread-pool size in the blocking IO version. Still, NIO was not as good as IO.

    In which scenarios should we move to NIO?
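
    For reference, that kind of setup – one thread parked on the selector, reads handed off to a worker pool – looks roughly like the sketch below (the port, pool size and buffer size are made up; this is not the code that was benchmarked):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.ServerSocketChannel;
    import java.nio.channels.SocketChannel;
    import java.util.Iterator;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class SelectorWithWorkers {
        public static void main(String[] args) throws IOException {
            ExecutorService workers = Executors.newFixedThreadPool(16); // sized like the blocking pool being compared
            Selector selector = Selector.open();
            ServerSocketChannel server = ServerSocketChannel.open();
            server.bind(new InetSocketAddress(9000));
            server.configureBlocking(false);
            server.register(selector, SelectionKey.OP_ACCEPT);

            // Dedicated multiplexer thread: this loop only accepts connections and dispatches work.
            while (true) {
                selector.select();
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        SocketChannel client = server.accept();
                        client.configureBlocking(false);
                        client.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        final SocketChannel client = (SocketChannel) key.channel();
                        key.interestOps(0); // stop selecting this channel until the worker is done with it
                        workers.submit(new Runnable() {
                            public void run() {
                                ByteBuffer buf = ByteBuffer.allocate(8192);
                                try {
                                    if (client.read(buf) == -1) {
                                        client.close();
                                        return;
                                    }
                                    // ... process buf, then re-register interest in OP_READ ...
                                } catch (IOException e) {
                                    try { client.close(); } catch (IOException ignored) { }
                                }
                            }
                        });
                    }
                }
            }
        }
    }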

    • Richard July 28, 2010 at 5:21 pm #

      That’s exactly what I did; I wonder how that setup would compare…

  5. Ivan July 28, 2010 at 2:24 am #

    The key to performance is – allocation, allocation, allocation (guys from the UK will get the pun).

    Creating objects is not just a load on the garbage collector. A much larger, fixed cost is the memory allocation and zeroing.

    Slow:

    for ( someloop ) {
        byte[] buf = new byte[2048];
        // populate buffer
        output.write(buf, 0, numbytes);
    }

    Much faster, due to no alloc/zeroing:

    byte[] buf = new byte[2048];
    for ( someloop ) {
        // populate buffer
        output.write(buf, 0, numbytes);
    }

    The main reason for zeroing is security – you don’t want your buffers leaking information. However, it is probably a good idea to separate performance-critical code into a secure context and reuse buffers. With fast synchronisation, buffer pools are also a huge boon – getting a buffer from a thread-safe pool is way faster than allocating a new one.

    I have very often seen religious reliance on GC efficiency, and pretty much all Java platform code is written without regard for the zeroing cost. This is fair enough, as it makes code cleaner. However, it does mean we have to do a bit extra to reach full performance potential.
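
    A minimal sketch of that kind of thread-safe buffer pool (the class and sizes are made up for illustration):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Hands out pre-allocated byte[] buffers and takes them back for reuse,
    // so the allocate-and-zero cost is paid once instead of on every request.
    public class BufferPool {

        private final BlockingQueue<byte[]> pool;
        private final int bufferSize;

        public BufferPool(int buffers, int bufferSize) {
            this.bufferSize = bufferSize;
            this.pool = new ArrayBlockingQueue<byte[]>(buffers);
            for (int i = 0; i < buffers; i++) {
                pool.offer(new byte[bufferSize]);
            }
        }

        public byte[] acquire() {
            byte[] buf = pool.poll();                        // non-blocking grab from the pool
            return buf != null ? buf : new byte[bufferSize]; // fall back to allocating if the pool is drained
        }

        public void release(byte[] buf) {
            pool.offer(buf); // silently dropped if the pool is already full
        }
    }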

    • Riyad Kalla July 28, 2010 at 7:55 am #

      Ivan,

      Good example – allocating *inside* loops (rather than outside them) is one of the more common mini-performance quirks I see in code, and like you said it creates a lot more work for the VM.

      I also like this trick/tip for for-loops:

      for(int i=0, size=list.size(); i < size; i++) { … }

      that way you can do the comparison against the size int instead of calling size() each time.

      I don’t know the VM’s optimization behavior in every case, i.e. whether it can optimize out the size() call for each loop iteration or not… my guess would be not, depending on the scope of the list, in case it determines that the list *could* be modified outside of that loop (e.g. have an element added to it).

      Regardless, I’ve found just sticking to the strategy you mention is a much safer alternative to guessing what the VM will eventually do with the code.

      • Ivan July 29, 2010 at 1:50 am #

        Allocating outside loops and static loop limits really do make a difference.

        However, my comment was more about the extraordinary performance hit we suffer because of zeroing. Perhaps my example didn’t emphasize that enough.

        The way I see server optimisations is that all code can be considered critical – just as if it were running in a tight loop. So all the rules you’d use for a tight loop should apply in the service/processor code.

        • Riyad Kalla July 29, 2010 at 4:54 am #

          Ivan,

          I re-read your example, and I’m not clear where the zeroing comes in – do you literally mean:

          for () {
              Person p = dao.getPerson(id);
              // stuff
              p = null;
          }

          or something else?

          • Ivan July 29, 2010 at 7:57 am #

            Zeroing affects every allocation; however, arrays are probably the most critical and are particularly suitable for optimisation.

            Take a pooled server thread using a bunch of byte buffers. You can experience significantly better performance if you allocate the buffers once and keep them in a thread-local than if you allocate them every time they are needed. The performance improvement is in fact surprisingly large. As it turns out, this is 99% down to avoiding allocation/zeroing and 1% or less down to GC savings.

            I stumbled upon this while developing a huge tree-like structure (>100M entries = 1B nodes) which required real-time queries with under 10ms latency. The tree was partitioned, represented as buffers and kept on disk (with lots of caching). Anyway, I had lots of arrays, and it became apparent that recycling them produces a huge performance boost as opposed to allocating new ones.

            When applied in our server context, this produced a performance bump of around 10% on already heavily profiled, scrutinised and optimised services.

            BTW, NIO couldn’t even come close to the performance we had with the good old MT server model with pooled threads.

            • Riyad Kalla July 29, 2010 at 7:59 am #

              Ivan, damn interesting followup and data from your testing.

              I would never have guessed that, at scale, it would have made such a difference.

  6. Henning July 28, 2010 at 2:53 am #

    Please notice that the linked PDF is from the beginning of 2008 (discussed on TSS: http://www.theserverside.com/news/thread.tss?thread_id=48449) and the benchmarks are from 2004 (discussed on TSS: http://www.theserverside.com/discussions/thread.tss?thread_id=26700).

    That’s the reason why Kernel 2.6.5 and Windows XP were used. This is just old news.

    • Riyad Kalla July 28, 2010 at 7:58 am #

      Henning,

      Thanks for the clarification. Do you happen to know if the information no longer applies, though? It seems to me that, if anything, the thread performance in more modern OSes (as mentioned by Brian) is even better/faster now, and the issues that caused NIO at the time to deliver slightly slower performance are still standing.

      I would love to see Brian or Paul re-run comparable benchmarks with modern hardware/software on the latest Sun JVM for a more concrete comparison, but I think these findings are still applicable.

  7. JCoder August 25, 2010 at 6:23 am #

    Throughput is not the point of NIO. The point of non-blocking IO is to be able to handle millions of simultaneous slow connections with fast response times – a case where threads would just blow the stack and drop connections after accepting a few thousand.

    • Ivan August 25, 2010 at 6:39 am #

      I suppose you have a point, but then there are very few uses for NIO, especially as most people use UDP (rather than TCP) for such applications.

      • JCoder August 25, 2010 at 8:23 am #

        You cannot use UDP for the standard HTTP protocol. NIO allows you to set the maximum number of simultaneous connections very high, so when you have a fast server and a wide Internet connection, NIO allows connections not to wait for one another.

        • Ivan August 26, 2010 at 2:19 am #

          Hmmm… I thought you’d specifically excluded HTTP in your previous comment by saying throughput is not the point. There are not many scenarios where throughput is not the point for HTTP.

          Also, there are not many scenarios where I’d want a single server handling millions of HTTP connections, regardless of what kind of server I’m running. In fact, there is scarcely a way to have millions of connections on 65535 ports unless you do a lot of gymnastics with network interface aliasing.

          • JCoder August 26, 2010 at 3:59 am #

            You are right, but without NIO you would probably not be able to handle more than 1024 connections at once, because of the memory address space consumed by thread stacks. On a 64-bit machine this is probably not a problem, but I haven’t found a benchmark telling how good modern 64-bit OSes are at handling >> 1000 threads.

            And for HTTP, response time becomes much more important under heavy load – response times over 1s are easy to get, but users are not happy with that.

  8. JCoder August 25, 2010 at 8:26 am #

    Anyway, I suspect the benchmark methodology is not quite right. Here you can read a report from a user claiming NIO gave a 250% performance boost on her production system.

Trackbacks/Pingbacks

  1. links for 2010-07-27 « Daniel Harrison's Personal Blog - July 27, 2010

    [...] Java IO is faster than NIO: Old is New Again (tags: java development scalability) [...]

  2. Networking - Page 2 - May 5, 2012

    [...] Today, 07:59 PM Use blocking IO, and the thread-per-client model if you're planning to host on a core-happy machine. Blocking IO has better performance under less stress. You'll only need some implementation of NIO when the amount of actual connections exceeds the 100K mark. Which, considering a RS world is made for 2000 players (at maximum), will never happen. Creating a server design around blocking IO is very simple. The code in blocking IO applications are almost always far more legible. Here's a short article that summarizes a very large article, which backs up my points: http://www.thebuzzmedia.com/java-io-…-is-new-again/ [...]
