The Java not really faster than C++ Benchmark

When I first read the "Java faster than C++ benchmark", I was sure that there was something
wrong with it. After all, Java couldn't be faster than C++, right? What would be next? C++
faster than C? C faster than Assembler? After quite a while, I've found some time to update
the results using brand new JVMs and GCC versions.

Benchmark environment
All new tests were performed on Linux 2.6, Intel® Core™ Quad Q6600 2.40GHz. Older tests
(marked as such) were done on AMD Athlon™ XP 2500+ (Barton). No other load was
placed on the machine during tests; they were run in run level 1 (single user mode). The GNU C
Library had optimizations for i686 (standard Debian package, this should be of benefit for
both Java and C++). When a tested GCC version wasn't available with my Linux distro, I
compiled GCC myself from the standard release. Please note that the result of this benchmark
(I am talking about the "C++ faster than Java" result) is valid for GCC 3.4.4, 4.0.x, 4.1.x and
4.2.x as well, but the GCC 4.3 series is faster than earlier GCCs. If truth be told, each new major
GCC release performs better than the previous one - as should be expected, given the amount
of work the GCC guys put into their product.

Testing conditions
I've kept the original number of repetitions in all tests.
Time was measured in the same way as in the original results, with one exception: the time
used as benchmark result was elapsed real (wall clock) time used by the process. If we stick
to the original method of measurement, the results will be incorrect and biased towards Java -
because of threads used by the JVM. Wall clock time is better and more accurate if the
machine operates under no load - as was the case here.
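
The difference between the two kinds of measurement can be sketched in C++ as follows. This is
only an illustration, not the harness used for this benchmark: Timing, measure and busy_sum are
hypothetical names, and the workload is a placeholder. std::clock reports processor time used by
the process, while std::chrono::steady_clock reports elapsed wall clock time; for a process that
runs several threads, such as a JVM, the two numbers can differ considerably.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <ctime>

// Hypothetical result type: both measurements in seconds.
struct Timing {
    double wall_seconds;  // elapsed real (wall clock) time
    double cpu_seconds;   // processor time used by the process
};

// Runs `work` once and reports both wall clock time and CPU time.
template <typename Work>
Timing measure(Work work) {
    const std::chrono::steady_clock::time_point wall_start =
        std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    // std::clock returns (clock_t)(-1) if processor time is unavailable.
    assert(cpu_start != static_cast<std::clock_t>(-1));

    work();

    const std::clock_t cpu_end = std::clock();
    const std::chrono::steady_clock::time_point wall_end =
        std::chrono::steady_clock::now();

    Timing result;
    result.wall_seconds =
        std::chrono::duration<double>(wall_end - wall_start).count();
    result.cpu_seconds =
        static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    return result;
}

// Placeholder workload: sums 0 .. n-1. The volatile accumulator keeps the
// compiler from removing the loop entirely.
std::uint64_t busy_sum(std::uint64_t n) {
    volatile std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        sum = sum + i;
    }
    return sum;
}
```

A call such as measure([] { busy_sum(100000000); }) returns both values, so they can be compared
side by side on an otherwise idle machine.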


Since the Java compiler does not have any settings that could affect performance, the Java code was
compiled with standard settings, using the Java compiler from Sun in the same version as the
JVM that was used later to do the actual benchmark.
To run the hash, heapsort and strcat tests I had to increase the heap size. It's pretty much standard
practice for enterprise software, so I do not see it as a drawback of Java.


When looking at the C++ code, I noticed many performance problems, which may well
go unnoticed by people with a strong Java background. These problems were not present in
the Java code.
In other words, to make this benchmark fair, I had to make some modifications to the original
code. In doing so, I tried to keep the code as close to the original as possible, even if my
personal coding style is completely different. If anyone feels that some further modifications
could/should be made to the Java or C++ code, do not hesitate to contact me.
For C++ compilation, I've used the following options:
-O2 -fomit-frame-pointer -finline-functions -march=core2
Note: older results were produced with -march=athlon-xp architecture.
I consider the first two options standard for compilation (all major Linux distributions, Linux
kernel and many others use these two for compilation).
-finline-functions is used to tell the compiler to guess which functions should be inlined (Java
also does that).
Core2 - well, that's my machine. I could use pentium or i686 instead (which could be
considered more standard), but it would not change the overall result - that is, C++ being the
clear winner. So, let's get down to business, shall we?

Changes to the original benchmark
Here's the list of changes that I've made to the C++ programs. I did not list the trivial changes
needed for porting to GCC 4.3.x (like adding an additional header file). You can download
the modified code here.
      • ackermann.cpp
         No changes to the code here. The clue to improving the performance of C++ lies in the
         Java results. The Client VM does not finish the test - a stack overflow occurs. The
         Server VM passes the test and the result is much better than C++. The easiest answer
         is that the Server VM does not recurse as much as the Client VM, because of inlining.
         In GCC you can enable automated inlining by adding the -finline-functions option.
         Since the recursion is very deep here, this program will benefit a lot. This option was
         used in all tests, not only this one.
         Note that it is possible to make this program even faster (up to twice as fast) if you
         enable more aggressive inlining in GCC. I decided against doing it, but in real life, if
         you know that your program does a lot of recursion it may be worth a try.
      • fibo.cpp
         No changes here.
      • hash.cpp
      • hash2.cpp
         Both of these have the same problem. If you use map in C++, you use a TreeMap
         equivalent, not a HashMap equivalent. TreeMaps are (of course) much worse when it
         comes to performing huge numbers of insertions. In C++, if you want hash
         maps, you have to use the hash_map class (obsolete) or unordered_map (soon to be part
         of the C++ standard, already available in GCC 4.x). I've rewritten both of these to use
         unordered_map just to stay on the bleeding edge ;-) . Performance results are similar
         with both hash_map and unordered_map. If you look closely at the C++ code you will
         notice that a benchmark called "Java prettier than C++" would make an interesting
      • heapsort.cpp
         No changes here.
      • matrix.cpp
         No changes here.
      • methcall.cpp
         OK. So what this test actually does is it calls two methods 1 000 000 000 times each.
         Java uses references to access objects, so I've changed C++ code to use references
    I also inverted the loop direction, since comparison against 0 is faster - normally I
    would leave it as it was, but since this was the only benchmark in which Java Server
    came close to C++, I had to do something ;-) . Oh, and inverting the loop direction
    has no effect whatsoever on Java, so I left the code as it was.
    If you look at the results, you will notice that only the poor Java Client VM actually
    tried to invoke the methods - both Java Server and GCC inlined the methods, which
    resulted in much better performance.
•   nestedloop.cpp
    No changes here.
•   objinst.cpp
    Before I explain the changes, a word of explanation about differences in memory
    management between Java and C++. (What follows is all very simplified). Java
    allocates new memory blocks on its internal heap (which is allocated in huge chunks
    from the OS). In this way, in most cases it bypasses the memory allocation
    mechanisms of the underlying OS and is very fast. If you allocate memory on the heap in
    C++, each allocation request will be sent to the operating system, which is slow.
    That's why Java won in the original benchmark.
    The thing is, in C++ you don't have to use the default allocation algorithm - you have
    other options. You can either change the allocation behaviour in C++ to allocate
    memory exactly in the way Java does (which will be very fast), or allocate memory
    on the stack, not heap (which is more or less similar to the way memory management
    in Java works, if you grossly oversimplify the problem).
    If you look at the code which allocates objects on the stack, you will notice that it is much
    simpler than the original version. And if you look at the results you will notice that
    it's much faster, too. Actually, so fast that it didn't even register on the graph ;-)
•   random.cpp
    No changes here.
•   sieve.cpp
    Rewritten. I think that the original code was very unfair to C++:
          1. C++ used a vector container and Java used an array of primitive types.
          2. The C++ version counted the items one extra time at the end ( count() call ).
          3. C++ used iterators and Java used indexing (actually it does not matter all that
              much for performance, but makes the code look ugly).
    I've replaced the vector with an array (like in Java), removed the iterators (obviously)
    and removed the final count() call. The code is now more or less equivalent to the Java version.
•   strcat.cpp
    I simplified the code, because it basically did what the string implementation already
    does internally. I also removed the reserve() call from the code - although keeping it
    makes C++ faster, I think it's cheating. If you actually reserve the string size (the way
    the original code does it), you will get a 20% performance increase. You cannot do this in
    Java. See, I am fair.
•   sumcol.cpp
    I added an ios_base::sync_with_stdio(false) call to make sure that C++ does not keep the
    stream in sync with the C stdio functions all the time - there is no reason to do it. The code
    can be made faster by 50% if C routines (fgets) are used to access stdin instead of C++
    streams. I decided not to do it to keep the code as close as possible to the original.
•   wc.cpp
        Only sync_with_stdio was added. It does improve the results a bit, but C++ was
        already fast anyway.
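
The container change described for hash.cpp and hash2.cpp can be sketched as follows. This is not
the benchmark code: insert_and_lookup, run_ordered and run_hashed are hypothetical names, and the
sketch assumes std::unordered_map from the C++11 standard header, which may differ from the header
and namespace available in the GCC versions tested here. The same routine runs against std::map
(a tree, like TreeMap) and std::unordered_map (a hash table, like HashMap), so the two can be
timed against each other.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>

// Inserts the keys "0" .. "n-1" into `table`, then counts how many of them
// can be found again. Works with any map-like container.
template <typename Map>
std::size_t insert_and_lookup(Map& table, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        table[std::to_string(i)] = i;
    }
    std::size_t found = 0;
    for (int i = 0; i < n; ++i) {
        if (table.find(std::to_string(i)) != table.end()) {
            ++found;
        }
    }
    return found;
}

// Tree-based container: keys are kept sorted, operations are O(log n).
std::size_t run_ordered(int n) {
    std::map<std::string, int> table;
    return insert_and_lookup(table, n);
}

// Hash-based container: no ordering, operations are O(1) on average.
std::size_t run_hashed(int n) {
    std::unordered_map<std::string, int> table;
    return insert_and_lookup(table, n);
}
```

Both functions return the same count; only the time they take differs as n grows.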
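
The stack allocation mentioned for objinst.cpp can be illustrated with a minimal sketch. The
Toggle class below is a hypothetical stand-in written for this example, not the class from the
benchmark, and run_heap and run_stack are made-up names. The first loop creates each object on
the heap, the second creates it on the stack; the work done with the object is identical.

```cpp
#include <cassert>
#include <memory>

// Hypothetical small class used only for this illustration.
class Toggle {
public:
    explicit Toggle(bool start) : state_(start) {}
    Toggle& activate() {
        state_ = !state_;
        return *this;
    }
    bool value() const { return state_; }

private:
    bool state_;
};

// Heap version: every iteration performs one allocation and one deallocation.
bool run_heap(int n) {
    assert(n >= 0);
    bool last = false;
    for (int i = 0; i < n; ++i) {
        std::unique_ptr<Toggle> toggle(new Toggle(true));
        last = toggle->activate().value();
    }
    return last;
}

// Stack version: the object lives in the loop body's stack frame, so no
// allocator call is needed and the code is shorter.
bool run_stack(int n) {
    assert(n >= 0);
    bool last = false;
    for (int i = 0; i < n; ++i) {
        Toggle toggle(true);
        last = toggle.activate().value();
    }
    return last;
}
```

With optimization enabled the compiler may remove most of the stack version entirely, which would
be consistent with a result too small to register on a graph.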
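
The stream change applied to sumcol.cpp and wc.cpp can be sketched like this. It is an
illustration under assumptions, not the benchmark source: sum_column, sum_stdin and self_check
are hypothetical names. The relevant call is std::ios_base::sync_with_stdio(false), which should
be made before any input or output is performed on the standard streams.

```cpp
#include <cassert>
#include <iostream>
#include <istream>
#include <sstream>

// Sums whitespace-separated integers (for example one per line) from `in`.
long sum_column(std::istream& in) {
    long sum = 0;
    long value = 0;
    while (in >> value) {
        sum += value;
    }
    return sum;
}

// Hypothetical entry point: turns off synchronization with C stdio first,
// then reads the column from standard input.
long sum_stdin() {
    std::ios_base::sync_with_stdio(false);
    return sum_column(std::cin);
}

// Small self-check on an in-memory stream instead of stdin.
void self_check() {
    std::istringstream in("1\n2\n3\n");
    assert(sum_column(in) == 6);
}
```

Once synchronization is off, mixing C++ streams with C routines such as fgets on the same input
is no longer safe, so a program should pick one of the two.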

Results / And the winner is...
On Intel, all tests but two ended with C++ being better. Overall, C++ was twice as fast as Java.
On Athlon, every individual test ended with C++ performing better than Java. Overall, C++
was three times faster than Java.
Results for tests performed on Intel Quad Core:
      • Individual results for up to date Java 6 vs GCC 4.3 Final Release
Results for tests performed on Athlon:
      • Individual results for up to date Java 6 vs GCC 4.3 Prerelease
      • Individual results for Java 5 vs GCC 4.1
      • Individual results for Java 6 vs GCC 4.1
      • Individual results for Java 6u1 vs GCC 4.3 snapshot
      • Individual results for Java 6 Beta vs GCC 4.1.1 (Beta release of JVM, performance
        of the final release differs!)
      • Individual results for Java 6 Beta 2 vs GCC 4.1.1 (Beta release of JVM,
        performance of the final release differs!)
      • Individual results for Java 6 RC vs GCC 4.1 (Release Candidate release of JVM,
        performance of the final release differs!)

The most important conclusion is obvious. (For this set of benchmarks,) C++ is clearly the winner.
Second conclusion - don't use Client VM in Java.
Unfortunately, there's also a third conclusion. It seems that it's much, much easier to create a
well performing program in Java. So, please consider it for a moment before you start
recoding your Java program in C++ just to make it faster...

Java 6 vs Java 5

Client vs Server VM

On older processors and Java versions, it seems that the client VM in Java 6 is tuned a little bit
better than in Java 5. Still, the server VM performs significantly better than the client VM. On newer
processors and Java versions, the client JVM performed better. It does not make sense to draw
any conclusions from that for such small programs though.

Java progress

It seems that on new hardware the gap between Java and C++ has narrowed. Going from 3x
slower to 2x slower is quite an impressive feat on Java's side.

Performance issues

If we don't take into account the Ackermann test, Java 6 Server VM performs better than Java 5
Server VM. The most likely reason for the problems in Ackermann is limited inlining in Java 6 -
at least in comparison to Java 5. If not for this problem, Java 6 would actually be
significantly better than Java 5 in overall results (~15 seconds).
Please note that I had to increase the stack size in Ackermann test for Java 6 - otherwise
neither of the JVM versions would finish the test. This further supports the theory about
changes in the inlining heuristics.
