Posted by : Unknown Friday, July 26, 2013

Java Performance in the Real World

The growing excitement about Java in corporate IT departments has been closely followed by a growing concern about its performance. While numerous trade seminars, presentations and articles have explored Java performance, very few have focused on real world IT systems. This article discusses the performance of Java-based applications that solve real IT problems: enforcing business rules, accessing disparate data, and presenting information graphically.

To determine Java's performance impact in solving each of the above problems, we developed a variety of stand-alone and distributed object applications in both Java and C++, and compared their throughput under various load conditions. We also applied the latest performance-enhancing techniques available in both languages, such as threads and Just-In-Time compilation, and measured their impact. Our findings are both surprising and contrary to some widely held notions about Java's performance.

Standalone applications

Since a real world IT system may consist of several standalone components, it is instructive to begin by analyzing the main tasks of a standalone application. According to Carmine Mangione in a Feb’98 JavaWorld article, these are:

1.       Loading program executables: either from local storage (hard disk) or remotely over the network
2.       Running program instructions, including math operations, method calls and other business logic.
3.       Allocating and freeing memory used by the application.
4.       Accessing system resources such as I/O, file handling, and printing.

To the above, we would add the following tasks for a typical GUI-based IT application:
5.       Rendering and managing a Graphical User Interface (GUI)
6.       Handling user events (mouse clicks, input, drag-and-drop, etc).

The diagrams below illustrate how a Win32 and a Java application handle these functions.

Architecture of a standalone C++ application on NT
  
Architecture of a standalone Java application

From the above diagram it can be seen that, in addition to the six sets of functions described for a C++ application, a Java application also:

7.       Runs a byte-code verifier and Security Manager during program loading to prevent illegal stack overflow and data conversion and to restrict access to resources such as network sockets. This would seem to indicate that a Java application would load more slowly than an equivalent C++ one, but in practice there are two reasons why Java applications often load faster, especially across a network:
·         Java executables are significantly smaller in size than their C++ counterparts.
·         The class files which make up a Java application can be loaded dynamically as needed rather than being loaded all at once, as is often the case with C++ libraries.
8.       Uses either a Byte-code interpreter or a Just-In-Time (JIT) compiler to translate byte-code to machine instructions before executing them. If each program instruction is interpreted and then run, the application will perform 3-10 times slower than a compiled C++ version. However, JIT compilers significantly reduce the performance lag by compiling often-used instructions into machine code on the fly. They also perform some primary and secondary optimizations on the code, similar to, but not as extensive as the optimization done by a good C++ compiler. We would therefore expect program instructions to be executed fastest by a C++ program, with a JIT-enabled Java VM close behind and a Java interpreter performing much slower.
9.       Performs garbage collection by identifying and releasing memory that is no longer in use and moving memory around to prevent fragmentation. Garbage collection can reduce one of the leading causes of bugs in IT systems, memory leaks. However, because it entails using handles for object references and requires the garbage collector to be constantly running in the background, it produces a performance penalty.
10.    Uses peer interfaces for accessing system resources, rather than directly calling the underlying operating system. Where these peer interfaces map directly to the underlying subsystem (print, I/O, graphics), this technique adds minimal overhead while maintaining a consistent interface to system resources across platforms. However, where the peer interfaces don’t directly use the underlying subsystem, execution time greatly increases.

Real-world results

Having looked at the reasoning behind potential differences in C++ and Java application performance, the following section details the actual results of benchmarking both kinds of applications while performing the tasks outlined above.

All the tests were carried out on a Pentium II 266 MHz machine with 128 MB RAM, running NT 4.0 Workstation. C++ applications were developed using Microsoft Visual C++ 5.0 and Java applications using the Java Developer’s Kit 1.1.6; Visual Basic 5.0 was used to develop the non-Java GUI. Three different applications were used to isolate and time specific tasks:
1.       A Program Execution App was developed in C++ and Java to measure execution of program instructions, loops, and method calls.
2.       A Memory Analyzer was developed in C++ and Java to measure memory allocation and deallocation.
3.       A GUI Plus application was developed in Visual Basic 5.0 (compiled executable) and Java to measure program loading, graphics, and event handling.

 

Test | Description | Time (s) C++ | Time (s) JIT | Time (s) Interpreter
---- | ----------- | ------------ | ------------ | --------------------
Integer Division | This test loops 10 million times on an integer division. | 1.3 | 1.4 | 3.8
Member Method | This test loops 10 million times calling a member method, which contains an integer division. | 1.3 | 1.5 | 9
First Million Primes | This test calculates the first million prime numbers. It exercises variable access, array access, and function-call invocation. | 400 | 420 | 1800
Memory Allocation | This test allocates and frees 10 million 32-bit integers. | 0.7 | 1.6 | 1.6
Program Load | This test loads the VB and Java GUI using an executable to tabulate time. | 0.6 | 0.3 | 0.3
Render GUI | This test measures the time needed to render a complex screen with buttons, fields, list boxes, etc. | 0.02 | 0.3 | 0.3
Perform Events | This test performs 1 million button clicks on the GUI. | 0.5 | 0.5 | 2
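
The first two rows of the table can be approximated with a small timing harness. The following is a minimal sketch in modern Java (8 or later), not the code used for the original tests: the class name `DivisionBenchmark`, its method names, and the checksum scheme are all hypothetical, and a single timed pass like this is only a rough indicator.

```java
// Hypothetical harness; names are illustrative and not taken from the original tests.
class DivisionBenchmark {

    // Member method containing an integer division, as in the "Member Method" test.
    int divide(int numerator, int denominator) {
        return numerator / denominator;
    }

    // Integer division performed inline in the loop body.
    // The checksum is returned so the work cannot be optimized away.
    static long inlineLoop(int iterations) {
        long checksum = 0;
        for (int i = 1; i <= iterations; i++) {
            checksum += 1_000_000 / i;
        }
        return checksum;
    }

    // The same division routed through a member method call.
    static long methodLoop(int iterations) {
        DivisionBenchmark bench = new DivisionBenchmark();
        long checksum = 0;
        for (int i = 1; i <= iterations; i++) {
            checksum += bench.divide(1_000_000, i);
        }
        return checksum;
    }

    public static void main(String[] args) {
        int iterations = 10_000_000;

        long start = System.nanoTime();
        long inlineSum = inlineLoop(iterations);
        long inlineMs = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        long methodSum = methodLoop(iterations);
        long methodMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("inline division: " + inlineMs + " ms (checksum " + inlineSum + ")");
        System.out.println("member method:   " + methodMs + " ms (checksum " + methodSum + ")");
    }
}
```

On HotSpot-based VMs, running the same class with `java -Xint DivisionBenchmark` forces interpreted execution, which gives a rough modern analogue of the JIT and Interpreter columns.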

Analysis

The following observations can be made from the test results:

·         For the first three tests, the JIT-enabled Java application is only slightly (5-15%) slower than the C++ version, while the interpreted code is roughly 3 to 7 times slower. The large performance penalty for interpreted code is due to repeated interpretation of the same code and the lack of any optimization. The JIT code is slightly slower because the compiler performs fewer code optimizations and virtually no global optimization, and also because of Java’s use of handles for object references.
·         Memory allocation and deallocation is slightly more than twice as slow for Java applications. This is because of the working of the garbage collector discussed earlier.
·         Program loading is actually faster in Java, as anticipated. This is primarily due to the difference in executable size.
·         While the JIT-enabled Java GUI handles events just as fast as its VB counterpart, it is 15 times slower in rendering the GUI. This large difference can be attributed to the fact that while the VB application calls the Win32 graphics subsystem directly, the Java GUI uses the Java Foundation Classes (JFC) framework which has its own built-in graphics engine.

Distributed applications

Architecture of a Distributed Application

While there exist numerous real-world IT systems that are standalone, a majority of business-critical applications developed in the last few years have been distributed, using either a client-server architecture or, more recently, distributed objects. Thus, a true measure of Java’s performance in real world IT systems can only be gained by comparing the performance of business-critical, distributed Java and C++ applications. The architecture of a typical distributed object application is shown above. It consists of a GUI client that communicates via an object protocol with a set of application server objects that encapsulate the business logic. These server objects provide persistence by interfacing with object and relational databases via data access objects. They can also participate in transactions by using transaction managers such as Tuxedo and Encina, and they frequently access legacy data and applications on the mainframe.

In addition to performing all the tasks of a standalone application (program loading and execution, memory management, accessing system resources and graphics processing), distributed applications also:

1.       Make distributed object requests between client and server objects, among server objects and between server and data access, transactional or legacy objects. Depending on the object middleware used, these requests can be in DCOM, CORBA IIOP or in the case of Java applications, RMI.
2.       Access disparate data (relational and non-relational), using a variety of protocols including ODBC, OLE DB, Embedded SQL and JDBC for Java applications.
3.       Interface with legacy systems using middleware such as CORBA or Microsoft Transaction Server, or even the JDK running on the mainframe or AS/400.
4.       Make use of threads. Multithreaded clients allow the GUI to quickly return control to the end-user while processing a request in a separate thread. Multithreaded servers can handle simultaneous requests from multiple clients and can scale more easily.
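
The multithreaded client described in item 4 can be sketched as follows. This is an illustration in modern Java rather than the JDK 1.1.6 code used in the tests: `AsyncClient` and `slowRemoteCall` are hypothetical names, and the 200 ms sleep merely stands in for a distributed object request.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical client; slowRemoteCall stands in for a distributed object request.
class AsyncClient {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Simulates a slow server-side operation with a fixed delay.
    static String slowRemoteCall(String request) {
        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "reply to " + request;
    }

    // Returns immediately; the request itself runs on the worker thread.
    CompletableFuture<String> submit(String request) {
        return CompletableFuture.supplyAsync(() -> slowRemoteCall(request), worker);
    }

    // Stops the worker thread so the JVM can exit.
    void shutdown() {
        worker.shutdown();
    }

    public static void main(String[] args) {
        AsyncClient client = new AsyncClient();
        CompletableFuture<String> pending = client.submit("getAccount");
        System.out.println("caller keeps control while the request is in flight");
        System.out.println(pending.join()); // block only when the result is needed
        client.shutdown();
    }
}
```

In a GUI client, the caller would typically attach a callback to the returned future instead of calling `join()`, so the event-handling thread never blocks.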

Real-world results 

Having described the additional tasks required of distributed applications, the following section compares the performance of C++ and Java implementations of such applications performing the above tasks.

As before, all the tests were carried out on a Pentium II 266 MHz machine with 128 MB RAM, running NT 4.0 Workstation. In this case two workstations were used to distribute the application, with many clients and servers running on each. C++ applications were developed using Microsoft Visual C++ 5.0 and Java applications using the Java Developer’s Kit 1.1.6. VisiBroker 3.2 (C++ and Java) was used as the CORBA object middleware. For data access, the C++ application used Microsoft’s ODBC driver to access data in MS SQL Server 6.5, while the Java application used a JDBC Type 3 driver from Intersolv to access the same data. Three different components were used to isolate and time specific tasks:

1.       An Object Request component was developed in C++ and Java to measure distributed object requests.
2.       A Data Access component was developed in C++ and Java to measure data access from SQLServer.
3.       A MultiThread component was developed in C++ and Java to measure synchronized thread calls.


 

Test | Description | Time (s) C++ | Time (s) JIT | Time (s) Interpreter
---- | ----------- | ------------ | ------------ | --------------------
ORB init and bind | This test measures the time needed to initialize the CORBA client and bind to a remote application server. | 1 | 0.9 | 0.9
Single object invocation | This test instantiates a remote object which performs 1 million operations. | 0.02 | 0.03 | 0.7
Multiple object invocation | This test instantiates 3000 remote objects in a loop, each of which performs 1 operation. | 36 | 22 | 22
Database connection | This test connects to a remote SQL Server 6.5 database. | 0.3 | 1 | 0.7
Select | This test loops 100 times and retrieves 10 rows from the database. | 27 | 12 | 12
Synchronized Method | This test measures the time needed to access a synchronized method 20000 times. | 10 | 17 | 18

Analysis
The following observations can be made from the test results:

·         A JIT compiler provides limited performance improvement for distributed applications. This can be surmised from the fact that results for most tests are almost identical between JIT-enabled and interpreted Java applications. In general, the network hop is the gating factor for distributed object requests while the database driver is the gating factor for data access. The latter conclusion is drawn from the fact that the Java application accesses data more than twice as fast as the C++ application, primarily because it uses a Type 3 JDBC driver with server-side SQL execution, which is much more efficient than the client-side execution provided by the C++ ODBC driver.
·         Remote object instantiation is faster with Java. This could be attributed to a better implementation of the CORBA Basic Object Adapter in VisiBroker for Java vs. VisiBroker for C++.
·         As expected, synchronized methods are slower in Java than C++. This is because such methods keep both a C and Java stack in memory and also execute a significant amount of additional code to provide thread-safety.
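
The cost of synchronized methods can be probed with a sketch like the one below. It is an assumption-laden illustration in modern Java, not the MultiThread component from the tests: `SyncCost` and its methods are hypothetical names, the loop is single-threaded and uncontended, and there is no warm-up, so timings on a current VM may look very different from the 1998 figures.

```java
// Hypothetical probe comparing a plain and a synchronized method; not a rigorous benchmark.
class SyncCost {
    private long plainCount;
    private long syncCount;

    // Unsynchronized increment: no monitor is acquired.
    void incrementPlain() {
        plainCount++;
    }

    // Synchronized increment: acquires this object's monitor on every call.
    synchronized void incrementSync() {
        syncCount++;
    }

    long plain() {
        return plainCount;
    }

    synchronized long synced() {
        return syncCount;
    }

    // Runs the given call the requested number of times and returns elapsed nanoseconds.
    static long time(Runnable call, int calls) {
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            call.run();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        SyncCost counter = new SyncCost();
        int calls = 20_000; // same call count as the "Synchronized Method" test
        long plainNs = time(counter::incrementPlain, calls);
        long syncNs = time(counter::incrementSync, calls);
        System.out.println("plain:        " + plainNs + " ns");
        System.out.println("synchronized: " + syncNs + " ns");
    }
}
```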

Performance enhancing techniques

While distributed applications in general are not overly affected by Java’s performance limitations, there are two important reasons for trying to enhance their performance:
A.      If a distributed application performs computationally intensive work, or its GUI is fairly complex, then some of the limitations of the standalone JIT code, as seen in the GUI and method-call tests, can become more pronounced. A hint of this can be seen in the “single object invocation” test above, where the C++ version is slightly faster than the Java JIT code because the remote operation is performing some computational work.
B.      While the relative performance of C++ and Java distributed applications may be similar, there is definitely an advantage in increasing the absolute performance of a Java distributed application, so that it provides higher throughput, increased transactions/sec, and greater scalability.

Performance can be improved at several levels. At the lowest level, a faster Virtual Machine and better JIT compiler can produce better optimized and faster executing code. A level above this are Java performance tools and libraries, such as specialized libraries for I/O and faster data access drivers. Finally, some of the biggest performance gains can be obtained by profiling the application code and then optimizing it using proven techniques. Each of these methods is explored in greater detail below:

Using a faster Virtual Machine and JIT compiler

There are numerous Java VMs and JIT compilers available, especially on popular platforms such as Win95, NT and Solaris. The speed of VMs is usually rated using one of two popular benchmarks: Jmark 2.0 and CaffeineMark 3.0. Each runs a variety of tests, including processor-intensive tasks, GUI and thread calls, on a particular VM and combines the results into a composite Jmark or CaffeineMark score which can be used by an evaluator (but more often by the vendor’s marketing folks) to make (or push) a VM selection. Some of the faster VMs and JIT compilers on the market today include:
1.       Supercede 2.0 Pro compiler and VM with native code generation.
2.       TowerJ compiler and VM with native code generation.
3.       Microsoft VM 3.1 with generational garbage collector. This VM is available as part of the Visual J++ product or as a free download from Microsoft’s Java website at http://www.microsoft.com/java/. In tests performed by Sun engineers at the JavaOne conference, the MS VM executed 20-45% faster than the JDK 1.2 beta3 and the JDK 1.1.6 VMs with JIT.
4.       Kaffe, available free on 30 operating systems, includes JIT conversion from byte to native code.
5.       Symantec VM and JIT, available as part of Symantec Visual Café and also bundled with Netscape Navigator and Sun’s JDK 1.1.
6.       Inprise VM and JIT, available as part of JBuilder 2.0.

Using Java performance tools and libraries

Tools available for optimizing Java applications include:
1.       The javac compiler itself with the -O option for optimization. Using this compiler option provides some primary optimization and dead code elimination.
2.       JAX from IBM can reduce the size of a Java application (by up to 50%) and make it more efficient by removing dead code, inlining method calls, etc. It is available for free download at http://www.alphaworks.ibm.com/formula/JAX.

Java class libraries available for improving performance include:
1.       The Windows Foundation Classes (WFC) and the J/Direct API from Microsoft, which allow Java applications to call the Win32 subsystem directly and thus greatly improve graphics handling and other system tasks. The tradeoff is application portability, because the APIs of these libraries are used as an alternative to standard Java AWT/JFC calls. These libraries are available with Visual J++.
2.       Perflib provides a set of Java classes for high performance sorting, searching, I/O, etc. The routines claim to be up to 5 times as fast as standard JDK implementations.
3.       A variety of Type 3 and 4 JDBC drivers are available for fast, native access to most relational databases. Some of the popular vendors include Inprise with their Data Gateway and Microfocus/Intersolv’s DataDirect product suite.

Profiling and Optimizing Application code

While the above techniques can yield significant performance improvements, tuning application code can potentially provide the greatest “bang for the buck”. This is especially true if a major inefficiency can be identified and eliminated in the 20% of code that is executed 80% of the time. A good way to discover programming inefficiencies is to run the Java code through a profiler. Profiling allows the detection of performance bottlenecks, identification of CPU and memory intensive code and collection of function and even line-level timing data. Some of the Java profilers in the market include:
1.       Visual Quantify from Rational Software.
2.       OptimizeIt from Intuitive Systems.
3.       JProbe from KLGroup.
4.       Jinsight, a freeware profiler and memory analyzer from IBM.

Once the problem areas of an application are identified, there are a number of steps that can be taken to improve overall performance. At the component level, there are numerous coding techniques that can be used to increase code efficiency and avoid problem APIs. The “Java performance tuning tips 1.0” article from IBM and the “Java performance and optimization” article from Inside Java, (available in the Resources section) both discuss some of these techniques in detail. For improving the performance of business-critical distributed applications, the following techniques are available:

·         Avoid synchronized methods in multithreaded applications, if possible. As seen from the tests above, they are fairly slow and resource intensive. In some cases, it might be better to create two versions of a method, one synchronized and one non-synchronized, and only use the former when absolutely necessary.
·         Pass objects by value when appropriate. Some middleware environments such as CORBA make it especially convenient to pass objects by reference to a remote module. The problem with this approach is that anytime the remote module needs access to the object, it needs to make a remote call back to the passing object. So in cases where a passed object needs to be accessed frequently, it makes more sense to take the initial hit of passing it by value.
·         Use JDBC with precompiled SQL, rather than dynamic SQL, for oft-repeated queries. Precompiled SQL is stored in the database server and repeatedly executed with new inputs, while dynamic SQL is recompiled every time it is run. Using this technique in our tests reduced the access time for repeated queries by a factor of 2 to 8.
·         Multiplex database connections across several clients and maintain them rather than creating and destroying a connection for each client. This has the dual advantages of reduced connection time and improved scalability of the application.
·         Many real-world applications require several layers of security, including authentication, authorization, data encryption and non-repudiation. Since security adds significant overhead to a distributed request and encryption algorithms are computationally intensive, use the minimum security level possible for a given operation and user.
·         A common technique for traversing firewalls is to tunnel the object request through HTTP. While this approach is the most flexible, it comes with a significant performance penalty. A better approach, when possible, is to open a minimum set of firewall ports or to use an object-protocol friendly firewall proxy (such as Wonderwall from Iona).
·         Finally, if delays can’t be avoided due to a large number of system users or slow-running server objects, their impact on the client can be minimized by creating a separate thread to handle the object request and returning control of the GUI back to the user.
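
The connection-multiplexing advice above can be illustrated with a small fixed-size pool. This is a minimal sketch under stated assumptions: `SimplePool` is a hypothetical class, and it pools a generic type `T` so the example runs without a database; in a real application `T` would be `java.sql.Connection` and the factory would open connections through a JDBC driver.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Hypothetical fixed-size pool; T would be java.sql.Connection in a real application.
class SimplePool<T> {
    private final BlockingQueue<T> idle;

    // Creates every pooled resource up front, so the connection cost is paid once.
    SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    // Blocks until a pooled resource is free, then hands it to the caller.
    T acquire() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a pooled resource", e);
        }
    }

    // Returns a resource so another client can reuse it.
    // Throws IllegalStateException if more resources are released than were acquired.
    void release(T resource) {
        idle.add(resource);
    }

    // Number of resources currently idle in the pool.
    int available() {
        return idle.size();
    }

    public static void main(String[] args) {
        int[] created = {0};
        SimplePool<String> pool = new SimplePool<>(2, () -> "connection-" + (++created[0]));
        String first = pool.acquire();
        pool.release(first);
        String again = pool.acquire(); // served from the pool, no new resource is created
        System.out.println(again + " reused; resources created: " + created[0]);
    }
}
```

Because clients borrow and return a bounded set of resources, the number of open connections stays fixed no matter how many clients are served, which is where the scalability benefit comes from.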

Future Improvements

The great interest that Java is receiving from corporate IT departments has caused system vendors to continue the rapid pace of advancement in this technology. High on their priority list are further improvements in the speed of Java applications, both standalone and distributed.

For standalone applications, Sun is delivering JDK 1.2 later this year, which promises performance improvements in strings, vectors, dates and the JIT compiler. Q1 ’99 heralds the availability of the revolutionary HotSpot compiler from Sun which in preliminary tests at JavaOne ran applications faster than C++! HotSpot is a cross between a JIT compiler and interpreter and provides its dramatic performance improvements with a much-improved generational garbage collector, fast thread synchronization and “adaptive” compilation.

While improved object middleware such as CORBA 2.2 and RMI 2.0 promises to increase the speed of distributed Java applications, the most significant development in this area is the rapid advancement of Java Application Servers. These application servers host the server-side business objects and automatically provide them with multithreading, database and resource pooling, and load-balancing, all of which promise to make real-world Java applications the fastest and most scalable kind of distributed applications available.

