ParallelGCThreads = (ncpus <= 8) ? ncpus : 8 + ((ncpus - 8) * 5) / 8

On GC Threads (via Hiroshi Yamauchi):

  • Since ParallelCMSThreads is computed from the value of ParallelGCThreads, overriding ParallelGCThreads when using CMS also changes ParallelCMSThreads, and with it CMS performance.
  • Knowing how the default values of these flags are computed helps you better tune both the parallel GC and the CMS GC. Since the Sun JVM engineers probably determined the default values empirically in a particular environment, they may not be the best for your environment.
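As a rough sketch of how the defaults interact, the methods below mirror the commonly cited HotSpot heuristics (ParallelGCThreads scales sublinearly past 8 CPUs; ParallelCMSThreads is roughly a quarter of it); the exact formulas vary across JVM versions, so treat these as illustrative:

```java
public class GcThreadDefaults {
    // Commonly cited HotSpot heuristic: use all CPUs up to 8, then
    // add 5/8 of a GC thread per additional CPU.
    static int parallelGCThreads(int ncpus) {
        return (ncpus <= 8) ? ncpus : 8 + ((ncpus - 8) * 5) / 8;
    }

    // ParallelCMSThreads is derived from ParallelGCThreads, which is
    // why overriding the latter shifts the CMS default too.
    static int parallelCMSThreads(int parallelGCThreads) {
        return (parallelGCThreads + 3) / 4;
    }

    public static void main(String[] args) {
        for (int ncpus : new int[] {4, 8, 16, 32}) {
            int pgc = parallelGCThreads(ncpus);
            System.out.println(ncpus + " cpus -> ParallelGCThreads=" + pgc
                    + ", ParallelCMSThreads=" + parallelCMSThreads(pgc));
        }
    }
}
```

On a 16-CPU box this yields 13 parallel GC threads and 4 CMS threads, which shows how an aggressive override of ParallelGCThreads quietly drags the CMS thread count along with it.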

False sharing: an invisible scalability buster

Watch out for false sharing; it’s an invisible scalability buster. The general case to watch out for is when you have two objects or fields that are frequently accessed (either read or written) by different threads, at least one of the threads is doing writes, and the objects are so close in memory that they’re on the same cache line.

Detecting the problem isn’t always easy. Typical CPU monitors completely mask memory waiting by counting it as busy time, which doesn’t help us here, although the irregular lengths of the individual cores’ busy times give us a clue.

Look for code performance analysis tools that let you measure, for each line of your source code, the cycles per instruction (CPI) and/or cache miss rates that those source statements actually experience at execution time, so that you can find out which innocuous-looking statements are consuming disproportionate amounts of cycles and/or spending a lot of time waiting for memory. You should never see high cache miss rates on a variable being updated by one thread in a tight inner loop, because it should be loaded into cache once and then stay hot; lots of misses mean lots of contention on that variable or on a nearby one.

Resolve false sharing by reducing the frequency of updates to the falsely shared variables, for example by updating thread-local data most of the time. Alternatively, you can ensure a variable is completely unshared by using padding, and alignment if available, to guarantee that no other data precedes or follows a key object in the same cache line.
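The padding fix can be sketched as below, assuming 64-byte cache lines: each counter carries seven unused long fields so the two hot fields cannot share a line. This is a manual sketch only; the JVM is free to reorder fields, and the supported mechanism on modern JDKs is the @Contended annotation (enabled with -XX:-RestrictContended).

```java
public class FalseSharingDemo {
    // Manual padding sketch (assumes 64-byte cache lines): the seven
    // long fields push each counter's hot field onto its own line.
    static final class PaddedCounter {
        volatile long value;
        long p1, p2, p3, p4, p5, p6, p7; // padding, never read
    }

    static final PaddedCounter left = new PaddedCounter();
    static final PaddedCounter right = new PaddedCounter();

    public static void main(String[] args) throws InterruptedException {
        final long iterations = 10_000_000L;
        // Two threads, each hammering its own counter. Without the
        // padding, both counters could land on one cache line and the
        // writes would ping-pong the line between cores.
        Thread t1 = new Thread(() -> {
            for (long i = 0; i < iterations; i++) left.value++;
        });
        Thread t2 = new Thread(() -> {
            for (long i = 0; i < iterations; i++) right.value++;
        });
        t1.start(); t2.start();
        t1.join();  t2.join();
        System.out.println(left.value + " " + right.value);
    }
}
```

Timing the loop with and without the padding fields is a quick way to see the effect on your own hardware; the padded version typically scales cleanly across cores while the unpadded one does not.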

Herb Sutter

Numbers Everyone Should Know

L1 cache reference                         0.5 ns
Branch mispredict                          5 ns
L2 cache reference                         7 ns
Mutex lock/unlock                          100 ns
Main memory reference                      100 ns
Compress 1K bytes with Zippy               10,000 ns
Send 2K bytes over 1 Gbps network          20,000 ns
Read 1 MB sequentially from memory         250,000 ns
Round trip within same datacenter          500,000 ns
Disk seek                                  10,000,000 ns
Read 1 MB sequentially from network        10,000,000 ns
Read 1 MB sequentially from disk           30,000,000 ns
Send packet CA->Netherlands->CA            150,000,000 ns
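A quick sketch of how these numbers compose into back-of-envelope estimates, using constants taken straight from the table above:

```java
public class LatencyMath {
    // Constants from the table, in nanoseconds.
    static final double READ_1MB_MEMORY_NS = 250_000;
    static final double READ_1MB_DISK_NS   = 30_000_000;
    static final double DISK_SEEK_NS       = 10_000_000;

    public static void main(String[] args) {
        // Sequentially reading 1 GB (1024 MB) from disk: ~30.7 seconds.
        double gbFromDiskSec = 1024 * READ_1MB_DISK_NS / 1e9;
        // The same read from memory: ~0.26 seconds, i.e. 120x faster.
        double gbFromMemSec = 1024 * READ_1MB_MEMORY_NS / 1e9;
        // One disk seek costs as much as ~340 KB of sequential disk
        // read, which is why batching beats scattered small reads.
        double seekInKb = DISK_SEEK_NS / (READ_1MB_DISK_NS / 1024);
        System.out.printf("1 GB from disk: %.1f s, from memory: %.2f s, "
                + "seek ~ %.0f KB of sequential read%n",
                gbFromDiskSec, gbFromMemSec, seekInKb);
    }
}
```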