Extreme Performance with Java – Charlie Hunt at QCon ’12

Extreme Performance with Java

Charlie Hunt explains what can be done to lower the latency introduced by the Java GC and JIT, offering coding tips and introducing tools for tuning the performance of Java applications.


The ultimate reference book on Java performance is out! ‘The definitive master class in performance tuning Java applications…’, James Gosling.

Java Performance

 

Hot off the press – October 2011! 

http://www.amazon.com/Java-Performance-Charlie-Hunt/dp/0137142528

 

“The definitive master class in performance tuning Java applications…if you love all the gory details, this is the book for you.”
–James Gosling, creator of the Java Programming Language

Improvements in the Java platform and new multicore/multiprocessor hardware have made it possible to dramatically improve the performance and scalability of Java software.

Java™ Performance covers the latest Oracle and third-party tools for monitoring and measuring performance on a wide variety of hardware architectures and operating systems. The authors present dozens of tips and tricks you’ll find nowhere else.

You’ll learn how to construct experiments that identify opportunities for optimization, interpret the results, and take effective action. You’ll also find powerful insights into microbenchmarking, including how to avoid common mistakes that can mislead you into writing poorly performing software. Then, building on this foundation, you’ll walk through optimizing the Java HotSpot VM, standard and multitiered applications, Web applications, and more.

Coverage includes

  • Taking a proactive approach to meeting application performance and scalability goals
  • Monitoring Java performance at the OS level in Windows, Linux, and Oracle Solaris environments
  • Using modern Java Virtual Machine (JVM) and OS observability tools to profile running systems, with almost no performance penalty
  • Gaining “under the hood” knowledge of the Java HotSpot VM that can help you address most Java performance issues
  • Integrating JVM-level and application monitoring
  • Mastering Java method and heap (memory) profiling
  • Tuning the Java HotSpot VM for startup, memory footprint, response time, and latency
  • Determining when Java applications require rework to meet performance goals
  • Systematically profiling and tuning performance in both Java SE and Java EE applications
  • Optimizing the performance of the Java HotSpot VM

Using this book, you can squeeze maximum performance and value from all your Java applications, no matter how complex they are, what platforms they’re running on, or how long you’ve been running them.

About the Authors

Charlie Hunt is the JVM performance lead engineer at Oracle. He is responsible for improving the performance of the HotSpot JVM and the Java SE class libraries. He has also been involved in improving the performance of Oracle GlassFish Server and Oracle WebLogic Server. A regular JavaOne speaker on Java performance, he also coauthored NetBeans™ IDE Field Guide (Prentice Hall, 2005).

Binu John is a senior performance engineer at Ning, Inc., where he focuses on improving the performance and scalability of the Ning platform to support millions of page views per month. Before that, he spent more than a decade working on Java-related performance issues at Sun Microsystems, where he served on Sun’s Enterprise Java Performance team. John has contributed to developing industry-standard benchmarks such as SPECjms2007 and SPECjAppServer2010; published several performance whitepapers; and contributed to java.net’s XMLTest and WSTest benchmark projects.

ParallelGCThreads = (ncpus <= 8) ? ncpus : 3 + ((ncpus * 5) / 8)

On GC Threads (via Hiroshi Yamauchi):

  • Since ParallelCMSThreads is computed based on the value of ParallelGCThreads, overriding ParallelGCThreads when using CMS also affects ParallelCMSThreads and, with it, CMS performance.
  • Knowing how the default values of these flags are computed helps you tune both the parallel GC and the CMS GC more effectively. The Sun JVM engineers presumably determined the defaults empirically in a particular environment, so they may not be the best choice for your environment (see the sketch after this list).
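A rough Java sketch of how those defaults are derived (the exact expressions vary by JVM version, so treat these as assumptions to check against -XX:+PrintFlagsFinal rather than as the authoritative HotSpot source):

// A sketch of the default GC thread sizing described above (assumed formulas;
// verify against your JVM version with -XX:+PrintFlagsFinal).
public class GcThreadDefaults {

    // Default ParallelGCThreads: all CPUs up to 8, then roughly 5/8 of the CPUs beyond that.
    static int defaultParallelGcThreads(int ncpus) {
        return (ncpus <= 8) ? ncpus : 3 + ((ncpus * 5) / 8);
    }

    // ParallelCMSThreads is derived from ParallelGCThreads, which is why
    // overriding the latter also shifts CMS behavior.
    static int defaultParallelCmsThreads(int parallelGcThreads) {
        return (parallelGcThreads + 3) / 4;
    }

    public static void main(String[] args) {
        int ncpus = Runtime.getRuntime().availableProcessors();
        int gc = defaultParallelGcThreads(ncpus);
        System.out.printf("ncpus=%d  ParallelGCThreads=%d  ParallelCMSThreads=%d%n",
                ncpus, gc, defaultParallelCmsThreads(gc));
    }
}

For example, on a 16-CPU machine these formulas give 13 parallel GC threads and 4 CMS threads.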

False sharing: an invisible scalability buster

Watch out for false sharing; it’s an invisible scalability buster. The general case to watch out for is when you have two objects or fields that are frequently accessed (either read or written) by different threads, at least one of the threads is doing writes, and the objects are so close in memory that they’re on the same cache line.

Detecting the problem isn’t always easy. Typical CPU monitors completely mask memory waiting by counting it as busy time, which doesn’t help us here, although the irregular lengths of the individual cores’ busy times give us a clue. Look for code performance analysis tools that let you measure, for each line of your source code, the cycles per instruction (CPI) and/or cache miss rates those source statements actually experience at execution time, so that you can find out which innocuous statements are taking extremely disproportionate amounts of cycles to run and/or spending a lot of time waiting for memory. You should never see high cache miss rates on a variable being updated by one thread in a tight inner loop, because it should just be loaded into cache once and then stay hot; lots of misses mean lots of contention on that variable or on a nearby one.

Resolve false sharing by reducing the frequency of updates to the falsely shared variables, for example by updating local data most of the time instead. Alternatively, you can ensure a variable is completely unshared by using padding, and alignment if available, to ensure that no other data precedes or follows a key object in the same cache line.

Herb Sutter
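A minimal Java sketch of the padding idea, using a hypothetical per-thread counter class. It assumes roughly 64-byte cache lines, and the JVM is free to reorder fields, so take it as an illustration rather than a guaranteed layout (later JDKs also provide a @Contended annotation for this purpose):

// Hypothetical illustration: two hot counters, each written by its own thread.
// Without padding they could land on the same cache line and ping-pong between cores.
final class PaddedCounter {
    // Padding on both sides of the hot field so no other hot data shares its
    // cache line (assumes ~64-byte lines; the JVM may reorder or repack fields).
    long p1, p2, p3, p4, p5, p6, p7;
    volatile long value;
    long q1, q2, q3, q4, q5, q6, q7;
}

public class FalseSharingDemo {
    static final long ITERATIONS = 100_000_000L;

    public static void main(String[] args) throws InterruptedException {
        final PaddedCounter[] counters = { new PaddedCounter(), new PaddedCounter() };
        Thread[] threads = new Thread[counters.length];
        for (int i = 0; i < threads.length; i++) {
            final PaddedCounter c = counters[i];
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    for (long n = 0; n < ITERATIONS; n++) {
                        c.value = n;   // each thread writes only its own counter
                    }
                }
            });
        }
        long start = System.nanoTime();
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
        System.out.printf("elapsed: %d ms%n", (System.nanoTime() - start) / 1_000_000);
    }
}

Removing the padding fields, so the two value fields can end up on the same cache line, will typically make the run noticeably slower on a multicore machine, which is exactly the effect Sutter describes.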

Numbers Everyone Should Know

http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/stanford-295-talk.pdf

L1 cache reference                         0.5 ns
Branch mispredict                          5 ns
L2 cache reference                         7 ns
Mutex lock/unlock                          100 ns
Main memory reference                      100 ns
Compress 1K bytes with Zippy               10,000 ns
Send 2K bytes over 1 Gbps network          20,000 ns
Read 1 MB sequentially from memory         250,000 ns
Round trip within same datacenter          500,000 ns
Disk seek                                  10,000,000 ns
Read 1 MB sequentially from network        10,000,000 ns
Read 1 MB sequentially from disk           30,000,000 ns
Send packet CA->Netherlands->CA            150,000,000 ns
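These figures lend themselves to quick back-of-the-envelope estimates. As an illustration (my own arithmetic, not from the slides): reading 4 MB sequentially from memory costs about 4 × 250,000 ns ≈ 1 ms, while reading the same 4 MB from disk costs a seek plus 4 × 30 ms, roughly 130 ms, a difference of about two orders of magnitude.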

Latency is always present

What Your Computer Does While You Wait: Gustavo Duarte

Many discrete wait periods are required when accessing memory. The electrical protocol for access calls for delays after a memory row is selected, after a column is selected, before data can be read reliably, and so on. The use of capacitors calls for periodic refreshes of the data stored in memory lest some bits get corrupted, which adds further overhead. Certain consecutive memory accesses may happen more quickly but there are still delays, and more so for random access. Latency is always present.
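A rough Java sketch of the effect Duarte describes: chasing indices through a large array in sequential order lets the hardware hide much of the DRAM latency, while the same chase in a random order pays something much closer to the full memory-reference cost on every step. This is a hypothetical microbenchmark, so expect JIT warm-up effects and measurement noise:

import java.util.Random;

// Hypothetical microbenchmark: the same index chase over an array, once in
// sequential order and once in a shuffled order. Not a rigorous benchmark.
public class MemoryLatencyDemo {
    static final int SIZE = 1 << 23;   // 8M ints (~32 MB per array), far beyond the caches

    public static void main(String[] args) {
        // Sequential cycle: element i points to i + 1.
        int[] next = new int[SIZE];
        for (int i = 0; i < SIZE; i++) next[i] = (i + 1) % SIZE;
        System.out.printf("sequential walk: %d ms%n", walk(next));

        // Random cycle: visit the elements in a shuffled order, so every load
        // depends on the previous one and is likely to miss the caches.
        int[] order = new int[SIZE];
        for (int i = 0; i < SIZE; i++) order[i] = i;
        Random rnd = new Random(42);
        for (int i = SIZE - 1; i > 0; i--) {           // Fisher-Yates shuffle
            int j = rnd.nextInt(i + 1);
            int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
        }
        for (int i = 0; i < SIZE; i++) next[order[i]] = order[(i + 1) % SIZE];
        System.out.printf("random walk:     %d ms%n", walk(next));
    }

    // Chase the indices once around the cycle and report elapsed milliseconds.
    static long walk(int[] next) {
        long start = System.nanoTime();
        int idx = 0;
        for (int i = 0; i < SIZE; i++) idx = next[idx];
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        if (idx < 0) System.out.println(idx);          // keep the loop from being optimized away
        return elapsed;
    }
}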