
Realistically real-time

Javolution creator Jean-Marie Dautelle discusses different methods to reduce the worst-case execution time of Java applications by leveraging the extra processing power of multicore systems. The methods described here do not require specialized real-time virtual machines and are suitable for many types of applications for which unexpected delays in the tens or hundreds of milliseconds are not acceptable.

Real-time response is a requirement in Java application domains such as banking, online collaboration, and game development, as well as in safety-critical applications used in hospitals, manufacturing, aviation, and emergency response systems. Unfortunately, the Java platform has long suffered from erratic response times. Java applications are known to freeze because the garbage collector has "stopped the world." Incremental garbage collection (-Xincgc) improves this situation, but it does not completely eliminate the nasty "full GC" pauses associated with the Java platform.

To further complicate things, Java program execution can also suffer from unexpected delays caused by Just-In-Time (JIT) compilation, class initialization, or the internal resizing of standard utility collections.

This dire situation has led to the development of proprietary real-time virtual machines such as IBM WebSphere Real Time and Sun Mackinac. It also led me to create the real-time Javolution library. While real-time virtual machines are quite expensive, the Javolution library is free. As you'll learn in this article, it is also possible to exploit multicore processing power to increase real-time application responsiveness and reduce unexpected delays.

Our real-time benchmark

In order to measure the effects of the real-time solutions discussed in this article, I will run a simple program computing an FFT (Fast Fourier Transform). The test program, rt-test1.java, uses complex numbers that have to be allocated and garbage collected. I'll first show what I can do to improve this program's benchmarks by optimizing the Java HotSpot virtual machine for real-time GC and compilation. In rt-test2.java I'll rewrite the program using specialized classes from my Javolution library. The goal in both cases is to minimize the worst-case execution time. If you would like to run your own tests, you may substitute any code that generates many temporary objects.
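If you do write your own test, the measurement loop only needs to track the maximum and average time per iteration. Below is a minimal sketch of such a harness; the work() method is a hypothetical stand-in for the FFT calculation, not the actual rt-test1.java code:

```java
// Minimal harness tracking worst-case and average execution time.
public class TimingHarness {

    static double[] data = new double[1 << 12];

    static void work() { // hypothetical placeholder workload
        for (int i = 0; i < data.length; i++) {
            data[i] = Math.sin(i) * Math.cos(i);
        }
    }

    // Returns { maxNanos, totalNanos } over the given iterations.
    static long[] measure(int iterations) {
        long maxNanos = 0;
        long totalNanos = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            work();
            long elapsed = System.nanoTime() - start;
            maxNanos = Math.max(maxNanos, elapsed);
            totalNanos += elapsed;
        }
        return new long[] { maxNanos, totalNanos };
    }

    public static void main(String[] args) {
        int iterations = 1000;
        long[] r = measure(iterations);
        System.out.printf("Maximum execution time: %.6f ms%n", r[0] / 1e6);
        System.out.printf("Average execution time: %.6f ms%n",
                r[1] / 1e6 / iterations);
    }
}
```

The gap between the two numbers it prints is what the rest of this article tries to close.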

I first run the benchmark on an Intel Core Duo Laptop 2GHz, Java 1.6 with the following runtime arguments (for 256 MB fixed-size memory):

-Xms256M
-Xmx256M

The resulting execution time and memory usage are shown in Figure 1.

A graph showing the initial benchmark results.

Figure 1. Initial results:
Maximum execution time: 814.439545 ms
Average execution time: 10.863037 ms

Figure 2 shows the memory usage for this benchmark. As you can see, "full GC" took place and lasted almost one full second!

A graph showing memory usage for the first benchmark.

Figure 2. Memory usage for this first benchmark


Concurrent, low pause collector

In November 2006, Sun released the source code for the Java HotSpot virtual machine under the GPL v2 and initiated the OpenJDK project.

The generational collector used by HotSpot is very efficient for collecting short-lived objects and can be tuned to collect young generation (objects created since the last incremental garbage collection) quickly. For example, the setting -XX:NewRatio=3 means that the ratio between the young and tenured generation is 1:3.

The larger the ratio, the shorter the minor collection time, but (unfortunately) the more frequent the major collections have to be! Java SE 6 HotSpot has some advanced options for working around this problem. Two of these are of particular interest to us:

-XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode

This first option activates the concurrent mark sweep (CMS) collector (also known as the concurrent low pause collector). This collector attempts to minimize the pauses due to garbage collection by doing most of the garbage collection work concurrently with the application threads. In other words, the VM can take advantage of idle cores to perform garbage collection concurrently. It should be noted that this option is different from -XX:+UseParallelGC, which uses multiple threads to perform the collection faster but still pauses the application threads while it runs. (Furthermore, the parallel collector cannot be used with the concurrent low pause collector.)

The second option allows CMS to run even if you are not reaching the limits of memory. It prevents CMS from kicking in too late and being unable to finish its collection before available memory is exhausted. Figure 3 is the execution graph of our real-time benchmark with the CMS options activated.

Benchmark with HotSpot CMS options activated.

Figure 3. Benchmark with HotSpot CMS options activated:
Maximum execution time: 56.689048 ms
Average execution time: 19.683719 ms

Figure 4 shows the memory usage for this test.

HotSpot memory usage with concurrent mark sweep.

Figure 4. HotSpot memory usage with concurrent mark sweep

As you see, the worst-case execution time has been reduced by an order of magnitude, but the average execution time has increased.

Ahead-of-time compilation and alternatives

HotSpot does not provide an option to compile classes before execution. It does provide an option to specify the compilation threshold (the number of times a method has to be executed before being compiled). Setting the compilation threshold to one (-XX:CompileThreshold=1) forces the code to be compiled at first execution. This option is not as good as ahead-of-time compilation (supported by virtual machines such as Excelsior JET) or static native compilation (supported by the Gnu Compiler for Java) due to the performance hit the first time the code is executed. But it is often possible to execute critical code at startup and prevent JIT compilation from occurring at an inappropriate time. Figure 5 shows the benchmark result from using this option in conjunction with the CMS collector.
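"Executing critical code at startup" can be as simple as calling it a few times before the time-critical phase begins, so that (with -XX:CompileThreshold=1) it is already compiled when it matters. A minimal sketch, with a hypothetical critical() routine standing in for your hot path:

```java
public class Warmup {

    // Hypothetical critical routine whose first-execution cost we
    // want to pay at startup rather than during time-critical work.
    static double critical(double x) {
        return Math.sqrt(x * x + 1.0);
    }

    // Run the critical code at startup so that JIT compilation
    // happens before the real-time phase begins.
    static void warmUp() {
        double sink = 0;
        for (int i = 0; i < 10; i++) {
            sink += critical(i);
        }
        if (sink == Double.MIN_VALUE) System.out.println(sink); // keep result alive
    }

    public static void main(String[] args) {
        warmUp();
        // ... time-critical processing starts here ...
    }
}
```

The loop count and the dummy use of the result are illustrative; the point is simply that the critical path has been executed (and thus compiled) before it is needed.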

Benchmark with CompileThreshold=1.

Figure 5. Benchmark with CompileThreshold 1:
Maximum execution time: 49.830635 ms
Average execution time: 19.737168 ms

Because the benchmark code is so small, compiling the code at start-up does not result in a significant reduction in the worst-case execution time. However, JIT compilation for large applications may delay code execution by hundreds of milliseconds.


Stack allocation and the Javolution library

A major problem with concurrent collection is that the collector cannot always keep up if too much garbage is generated too fast. This situation results in GC's famous "stop the world" collection. An obvious solution is to limit garbage-generation throughput. Advanced compilers may avoid allocating new objects if these objects are de-referenced quickly (basically, if the compiler can determine that the object will quickly become a candidate for garbage collection).

For example during the FFT calculation, the compiler could avoid allocating intermediate complex numbers and store their real/imaginary values directly in registers. But this optimization, called "escape analysis," is still a work in progress and only applicable to simple cases. Another solution would be to declare some classes as ValueType and reference instances of them only by copy. ValueType objects could then be allocated on the stack with no adverse effect on the collector (the same as for primitive types such as int/double). (I submitted a request for enhancement for this stack allocation feature in 2005.)
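For a concrete picture of what escape analysis targets, consider a method whose temporary object never leaves its scope; a JIT performing escape analysis may then replace the allocation with plain locals or registers (scalar replacement). The Complex class below is a simplified illustration, not the benchmark's actual class:

```java
public class EscapeDemo {

    static final class Complex {
        final double re, im;
        Complex(double re, double im) { this.re = re; this.im = im; }
    }

    // The temporary Complex never escapes this method (it is not
    // returned, stored in a field, or passed elsewhere), so an
    // escape-analysis-capable JIT may eliminate the heap allocation
    // and keep re/im in registers.
    static double magnitudeSquared(double re, double im) {
        Complex c = new Complex(re, im);
        return c.re * c.re + c.im * c.im;
    }
}
```

Whether the optimization actually fires depends on the VM; the semantics of the method are unchanged either way.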

Alternatively, the Javolution library supports custom object allocation policies. But it requires the use of static factory methods instead of constructors (the behavior of the new keyword cannot be programmatically changed).

In rt-test2.java I have re-written the Complex class from rt-test1.java to support stack allocation. I've also re-written the FFT code to use the stack (the actual stack implementation can be "scoped memory" on a real-time VM, or thread-local object pools on a standard VM). I have also removed the Thread.sleep(5) statement, which was originally placed in the program's main loop to reduce the garbage flow and avoid overloading the CMS collector. As you see in Figure 6, stack allocation, by reducing garbage generation, allows the concurrent collector to complete its work before the system runs out of memory.
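On a standard VM, one way to emulate this kind of stack allocation is a thread-local object pool behind a static factory method: instances are recycled rather than garbage collected. The sketch below illustrates the idea only; it is not Javolution's actual implementation:

```java
import java.util.ArrayDeque;

public class PooledComplex {

    public double re, im;

    // Thread-local pool: each thread recycles its own instances, so
    // "allocation" produces no garbage once the pool is warm.
    private static final ThreadLocal<ArrayDeque<PooledComplex>> POOL =
            ThreadLocal.withInitial(ArrayDeque::new);

    // Static factory instead of a constructor (the behavior of the
    // 'new' keyword cannot be changed programmatically).
    public static PooledComplex valueOf(double re, double im) {
        ArrayDeque<PooledComplex> pool = POOL.get();
        PooledComplex c = pool.isEmpty() ? new PooledComplex() : pool.pop();
        c.re = re;
        c.im = im;
        return c;
    }

    // Return an instance to the pool when its "stack frame" is left.
    public static void recycle(PooledComplex c) {
        POOL.get().push(c);
    }
}
```

The price of this pattern is discipline: a recycled instance must not be used after recycle() is called, which is exactly the invariant that scoped memory or a real stack enforces automatically.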

Benchmark results of custom stack allocation.

Figure 6. Javolution's custom stack allocation
Maximum execution time: 76.192644 ms
Average execution time: 13.151166 ms

Compared to the previous graph the average execution time has been significantly reduced. On the other hand, the worst-case execution time hasn't improved a bit. Why not?

The reason for the discrepancy is that the minor collections run less frequently but still have to clean up the same memory space. One solution is the option -XX:NewRatio=128 (up from the default of 8 for client VMs), which lets us take advantage of having less garbage by decreasing the size of the young generation space. The VM options should now look as shown here:

-Xms256M -Xmx256M
-XX:+UseConcMarkSweepGC
-XX:+CMSIncrementalMode
-XX:CompileThreshold=1
-XX:NewRatio=128

Results of decreasing the size of young generation.

Figure 7. Decrease the size of young generation
Maximum execution time: 31.635864 ms
Average execution time: 16.256594 ms

Limiting the garbage generated is clearly the most efficient way to reduce GC pauses without overloading the concurrent collector. We have so far reduced the worst-case execution time from 814 ms to 31 ms!

Javolution's concurrent context

It is not rare in real-time systems that some critical processing has to be performed as quickly as possible. Increasing thread priority guarantees that the code will be scheduled for execution quickly, but it does not make the execution any faster. This problem is exacerbated on the latest highly concurrent processors, such as the Sun Niagara 2. This processor is capable of running 32 threads concurrently, but each thread executes at a relatively low clock speed. Obviously it would be helpful, when something urgent has to be done, for all the processors to participate in completing the task quickly.

To this effect, the Javolution library provides a specialized execution context called ConcurrentContext. When a thread enters a concurrent context, it may perform concurrent executions by calling the ConcurrentContext.execute(Runnable) static method. The logic is then executed by a concurrent thread, or by the current thread itself if no concurrent thread is immediately available. (The number of concurrent threads is typically limited to the number of processors.)

Only after all concurrent executions are completed is the current thread allowed to exit the scope of the concurrent context (internal synchronization). In Listing 1 the FFT calculations are rewritten to execute concurrently:

Listing 1. FFT calculations using ConcurrentContext

import static javolution.context.ConcurrentContext.*;
...
enter();
try {
    for (int i = 0; i < n; i++) {
        final int j = i % K;
        execute(new Runnable() {
            public void run() {
                fft(frames[j], results[j]);
            }
        });
    }
} finally {
    exit(); // Waits for concurrent executions to complete.
}

Concurrent contexts ensure the same behavior whether the logic is executed by the current thread or by a concurrent thread. Any exception raised during a concurrent execution is propagated to the current thread.

Concurrent contexts are used with Web application servers in order to even out server response times when some user actions take longer than others. In such a scenario, lengthy operations are performed in a concurrent context and are authorized to use up to half of the available processors. This is done through the simple command:

ConcurrentContext.setConcurrency(
(Runtime.getRuntime().availableProcessors() / 2) - 1);

The other half of the processors can continue servicing simple queries.

Javolution's concurrent contexts have proven to be very efficient. JScience's matrix multiplications, for example, are accelerated by a factor of 1.99x when concurrency is enabled on a dual-core processor. On one government project running on a Sun Fire T2000 Server, the execution of lengthy database operations was accelerated 17 times by using ConcurrentContext and 5 lines of code!
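For readers without the library, the fork-and-join behavior of ConcurrentContext can be approximated with the standard ExecutorService API: submit all tasks, then block until every one has completed before moving on, which is what exiting the concurrent context does. A rough sketch (the fft() routine here is a trivial placeholder, not a real FFT):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelFft {

    // Placeholder for the real FFT routine.
    static void fft(double[] frame, double[] result) {
        for (int i = 0; i < frame.length; i++) result[i] = frame[i] * 2;
    }

    static void runAll(double[][] frames, double[][] results)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            for (int i = 0; i < frames.length; i++) {
                final int j = i;
                tasks.add(() -> { fft(frames[j], results[j]); return null; });
            }
            pool.invokeAll(tasks); // blocks until all tasks have completed
        } finally {
            pool.shutdown();
        }
    }
}
```

Unlike ConcurrentContext, a plain thread pool makes no guarantee about reusing the calling thread when no worker is available, but the join-before-continue semantics are the same.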


Closures and the ConcurrentContext

Currently, several proposals exist for adding closures to Java 7. My preference goes to the Concise Instance Creation Expressions (CICE) proposal championed by Joshua Bloch. Not only is this proposal the simplest, but it would also work quite well with Javolution constructs such as ObjectFactory or ConcurrentContext. This proposal is a natural syntactic simplification of anonymous classes with a single abstract method (such as ObjectFactory.create() and Runnable.run()). Listing 2 shows what the concurrent FFT calculations would look like rewritten using CICE closures.

Listing 2. Concurrent FFT calculations with closures

enter();
try {
    for (int i = 0; i < n; i++) {
        final int j = i % K;
        execute(Runnable() { fft(frames[j], results[j]); });
    }
} finally {
    exit(); // Wait for concurrent execution to complete.
}

I would also like to suggest that the closure class type be omitted when it can be inferred; for instance: execute({ fft(frames[j], results[j]); });

Class initialization

The Java platform specification is very clear about when and how Java classes are initialized. Class initializers are run at first use: when an instance of the class is created, or when a static method (or field) of the class is accessed. In the case of static class circularity, initialization order (and calling order) matters, as it can result in an illegal state (such as a static final member not being set). Listing 3 is an example of illegal state.

Listing 3. An example of illegal state

class A {
    public static final A ONE = B.ONE;
}
class B extends A {
    static final B ONE = new B();
}

If a program uses B before A, then A.ONE is null: initializing B first triggers the initialization of its superclass A, which reads B.ONE before B's own initializer has assigned it.

But the main problem for a real-time system is that the execution of a new conditional branch may cascade into hundreds of classes being initialized at an inappropriate time. There is no easy solution to this problem without changing the Java specification. One possibility, which is being employed on large real-time servers, is to force the initialization of all classes at startup. This can be done through the Javolution command: ClassInitializer.initializeAll(). But user beware: this command may take several minutes as it initializes all the classes in your classpath including the runtime library classes.
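When only a few classes on the critical path matter, the same effect can be had with the standard reflection API: Class.forName with its initialize argument set to true runs a class's static initializer immediately. A small illustration (the Heavy class and its flag are hypothetical):

```java
public class EagerInit {

    static volatile boolean heavyInitialized = false;

    // A class with an expensive static initializer that we want to
    // run at startup rather than at first use (illustrative).
    static class Heavy {
        static {
            heavyInitialized = true; // stands in for costly setup work
        }
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // The 'true' argument forces the static initializer to run
        // now, at startup, instead of during time-critical work.
        Class.forName(EagerInit.class.getName() + "$Heavy",
                true, EagerInit.class.getClassLoader());
        System.out.println("Heavy initialized: " + heavyInitialized);
    }
}
```

Listing every critical class by hand is tedious, which is why a bulk utility such as ClassInitializer.initializeAll() is attractive despite its startup cost.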

Real-time libraries

All the techniques described here, and even the best real-time virtual machine, are immaterial when non-time-deterministic libraries are employed. What is the point of having concurrent garbage collection if a simple call to add one element to a standard Java collection results in delays of hundreds of milliseconds because the collection has resized itself internally?

It is very unlikely that the standard Java libraries will be rewritten to become time-deterministic. A solution is to use external libraries having bounded/predictable response times. The Javolution collections implement the standard collection interfaces and can be used as drop-in replacements of the standard Java utility classes. Time-deterministic behavior is achieved through incremental capacity increases instead of full resizing. In other words, resizing occurs more often but has less impact on execution time or memory fragmentation.
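To see why incremental capacity increase bounds the cost of add(), consider a list backed by fixed-size blocks: growth means allocating one new block, never copying the elements already stored. This is a simplified sketch of the idea, not Javolution's actual data structure:

```java
import java.util.ArrayList;

public class BlockList<E> {

    private static final int BLOCK_SIZE = 256;
    private final ArrayList<Object[]> blocks = new ArrayList<>();
    private int size;

    // Worst-case cost is allocating one fixed-size block; existing
    // elements are never copied, unlike ArrayList's double-and-copy
    // growth, so add() has a bounded execution time.
    public void add(E element) {
        int index = size % BLOCK_SIZE;
        if (index == 0) blocks.add(new Object[BLOCK_SIZE]);
        blocks.get(size / BLOCK_SIZE)[index] = element;
        size++;
    }

    @SuppressWarnings("unchecked")
    public E get(int i) {
        return (E) blocks.get(i / BLOCK_SIZE)[i % BLOCK_SIZE];
    }

    public int size() {
        return size;
    }
}
```

The trade-off is an extra indirection on get(), which is why a careful implementation tunes the block size to keep random access competitive with a flat array.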

Execution time of List.add(Object)

Figure 8. Execution time of List.add(Object)

In many instances, these classes are as fast as their standard library counterparts, proving that you can be both real-time and real fast!

In conclusion

Multicore processors are a godsend to Java developers, having significantly increased the field of applicability of the Java platform. As demonstrated with the FFT case study in this article, using multicore processors makes it possible to achieve a high level of time-determinism on a standard virtual machine. You simply have to know how to configure and optimize the VM and, most importantly, limit the garbage flow of your application.

