On GC Threads (via Hiroshi Yamauchi):
- Since ParallelCMSThreads is computed based on the value of ParallelGCThreads, overriding ParallelGCThreads when using CMS affects ParallelCMSThreads and the CMS performance.
- Knowing how the default values of the flags are computed helps you tune both the parallel GC and the CMS GC. Since the Sun JVM engineers probably determined the default values empirically in a certain environment, they may not necessarily be the best for your environment.
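The dependency between the two flags can be sketched in a few lines of Java. The formulas below are an assumption: they are the defaults commonly reported for HotSpot builds of that era and may differ in your JVM version. The class and method names are hypothetical. To see the values your JVM actually uses, run it with -XX:+PrintFlagsFinal.

```java
public class GcThreadDefaults {

    // Assumed default: one GC thread per CPU up to 8 CPUs,
    // then roughly 5/8 of a thread per additional CPU.
    static int parallelGcThreads(int ncpus) {
        return (ncpus <= 8) ? ncpus : 3 + (ncpus * 5) / 8;
    }

    // Assumed default: about a quarter of ParallelGCThreads, rounded up.
    static int parallelCmsThreads(int parallelGcThreads) {
        return (parallelGcThreads + 3) / 4;
    }

    public static void main(String[] args) {
        int ncpus = Runtime.getRuntime().availableProcessors();
        int gc = parallelGcThreads(ncpus);
        System.out.println("CPUs: " + ncpus);
        System.out.println("ParallelGCThreads (estimated default): " + gc);
        System.out.println("ParallelCMSThreads (estimated default): " + parallelCmsThreads(gc));
    }
}
```

Under these assumed formulas, overriding ParallelGCThreads on a 16-CPU machine from its default down to 4 would also cut the derived number of CMS threads, which is the interaction the first bullet warns about.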
Watch out for false sharing; it’s an invisible scalability buster. The general case to watch out for is when you have two objects or fields that are frequently accessed (either read or written) by different threads, at least one of the threads is doing writes, and the objects are so close in memory that they’re on the same cache line.
Detecting the problem isn’t always easy. Typical CPU monitors completely mask memory waiting by counting it as busy time, which doesn’t help us here, although the irregular lengths of the individual cores’ busy times give us a clue. Look for code performance analysis tools that let you measure, for each line of your source code, the cycles per instruction (CPI) and/or cache miss rates those source statements actually experience at execution time, so that you can find out which innocuous statements are taking extremely disproportionate amounts of cycles to run and/or spending a lot of time waiting for memory. You should never see high cache miss rates on a variable being updated by one thread in a tight inner loop, because it should just be loaded into cache once and then stay hot; lots of misses mean lots of contention on that variable or on a nearby one.
Resolve false sharing by reducing the frequency of updates to the falsely shared variables, such as by updating local data instead most of the time. Alternatively, you can ensure a variable is completely unshared by using padding, and alignment if available, to ensure that no other data precedes or follows a key object in the same cache line.
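The padding approach can be illustrated with a small Java sketch. It assumes 64-byte cache lines (8 longs per line), which is common but not universal. The class name, the run method, and the iteration count are hypothetical choices for illustration. Each thread increments only its own counter; the only difference between the two runs is whether the counters sit next to each other or one assumed cache line apart.

```java
import java.util.concurrent.atomic.AtomicLongArray;

public class FalseSharingSketch {
    // Assumption: 64-byte cache lines, i.e. 8 longs per line.
    static final int LONGS_PER_LINE = 8;

    // Starts `threads` workers; worker t increments only slot t * stride.
    // Returns the elapsed wall-clock time in nanoseconds.
    static long run(AtomicLongArray counters, int threads, int stride, long iterations)
            throws InterruptedException {
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            final int slot = t * stride;
            workers[t] = new Thread(() -> {
                for (long i = 0; i < iterations; i++) {
                    counters.incrementAndGet(slot);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = 2;
        long iterations = 10_000_000L;

        // Adjacent slots: both counters very likely share one cache line.
        AtomicLongArray packed = new AtomicLongArray(threads);
        // Padded slots: counters are 64 bytes apart, so on different 64-byte lines.
        AtomicLongArray padded = new AtomicLongArray(threads * LONGS_PER_LINE);

        long packedNs = run(packed, threads, 1, iterations);
        long paddedNs = run(padded, threads, LONGS_PER_LINE, iterations);

        System.out.println("adjacent slots: " + packedNs / 1_000_000 + " ms");
        System.out.println("padded slots:   " + paddedNs / 1_000_000 + " ms");
    }
}
```

On a multicore machine the padded run is typically faster, but the size of the gap depends on the hardware, the JVM, and warm-up effects, so treat the timings as indicative rather than as a benchmark. Spacing array slots is used here instead of padding fields inside an object because the JVM is free to reorder fields, which makes manual field padding less predictable.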
L1 cache reference                          0.5 ns
Branch mispredict                             5 ns
L2 cache reference                            7 ns
Mutex lock/unlock                           100 ns
Main memory reference                       100 ns
Compress 1K bytes with Zippy             10,000 ns
Send 2K bytes over 1 Gbps network        20,000 ns
Read 1 MB sequentially from memory      250,000 ns
Round trip within same datacenter       500,000 ns
Disk seek                            10,000,000 ns
Read 1 MB sequentially from network  10,000,000 ns
Read 1 MB sequentially from disk     30,000,000 ns
Send packet CA->Netherlands->CA     150,000,000 ns