As explained in "Java theory and practice: Synchronization optimizations in Mustang" by Brian Goetz, lock coarsening is the process of merging adjacent synchronized blocks that lock on the same object. It is one of the optimizations available in the HotSpot VM and is on by default. It can be turned off with the -XX:-EliminateLocks option.
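To make the definition concrete, here is a minimal sketch of the kind of code that is a candidate for coarsening. The class and method names (CoarseningCandidate, build, buildCoarsened) are made up for illustration. StringBuffer.append is synchronized on the buffer, so the three calls in build are three adjacent critical sections on the same lock. Whether the JIT actually merges them depends on inlining and on the VM, so treat this as an illustration of the idea rather than a guarantee.

```java
class CoarseningCandidate {
    // Hypothetical helper: three back-to-back calls to a synchronized method
    // on the same object, with no other work between them.
    static String build(StringBuffer sb) {
        sb.append('a'); // lock sb, append, unlock sb
        sb.append('b'); // lock sb, append, unlock sb
        sb.append('c'); // lock sb, append, unlock sb
        return sb.toString();
    }

    // What coarsening conceptually turns the three appends into: one outer
    // lock/unlock pair. Java monitors are reentrant, so the synchronized
    // appends inside the block remain correct.
    static String buildCoarsened(StringBuffer sb) {
        synchronized (sb) {
            sb.append('a');
            sb.append('b');
            sb.append('c');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(new StringBuffer()));
        System.out.println(buildCoarsened(new StringBuffer()));
    }
}
```

Both methods produce the same string; the only difference is how often the buffer's monitor is entered and exited.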
To demonstrate this feature, I will use 2 simple classes, Driver and FavoriteChars. The myFavorites() method in FavoriteChars invokes the synchronized getVowel(int) method 3 times. We will see that with -XX:+EliminateLocks, instead of generating code to acquire and release the lock 3 times, once for each invocation of getVowel(int), the HotSpot Server compiler (C2) merges the 3 invocations into a single synchronized block.
public class FavoriteChars {
    private final char[] VOWELS = new char[] { 'a', 'e', 'i', 'o', 'u' };

    // Calls the synchronized getVowel(int) 3 times in a row on the same object.
    public char[] myFavorites() {
        char first = getVowel(0);
        char second = getVowel(1);
        char third = getVowel(2);
        return new char[] { first, second, third };
    }

    // The critical section: locks on this, then reads one array element.
    public synchronized char getVowel(int index) {
        return VOWELS[index];
    }
}
Finally, the driver class. Driver calls FavoriteChars.myFavorites() enough times that C2 compiles the method to native code, with getVowel(int) inlined into it. To print this native code, I will use a debug build of the JVM.
public class Driver {
    public static void main(String[] args) {
        FavoriteChars demo = new FavoriteChars();
        for (int i = 0; i < 100000; i++) {
            System.err.println(demo.myFavorites());
        }
    }
}
The main method prints the char[] returned by FavoriteChars.myFavorites() to System.err for two reasons: (1) to ensure the method is not optimized away and (2) so that this output can be redirected to /dev/null and doesn't interfere with the -XX:+PrintOptoAssembly output, which is sent to System.out.
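The stream separation described above can be checked with a small sketch. The class and method names (StreamSeparation, capture) are hypothetical and not part of the original program. It temporarily swaps both standard streams for in-memory buffers, makes the same kind of println(char[]) call as Driver, and returns what each stream received.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

class StreamSeparation {
    // Prints the given chars to System.err while both standard streams are
    // captured, and returns { text written to stdout, text written to stderr }.
    static String[] capture(char[] chars) {
        PrintStream oldOut = System.out;
        PrintStream oldErr = System.err;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ByteArrayOutputStream err = new ByteArrayOutputStream();
        System.setOut(new PrintStream(out, true));
        System.setErr(new PrintStream(err, true));
        try {
            // Same call shape as in Driver: the println(char[]) overload
            // writes the characters themselves, followed by a line separator.
            System.err.println(chars);
        } finally {
            // Always restore the real streams.
            System.setOut(oldOut);
            System.setErr(oldErr);
        }
        return new String[] { out.toString(), err.toString() };
    }

    public static void main(String[] args) {
        String[] result = capture(new char[] { 'a', 'e', 'i' });
        System.out.println("stdout is empty: " + result[0].isEmpty());
        System.out.println("stderr got: " + result[1].trim());
    }
}
```

Nothing from the loop reaches standard output, which is why redirecting stdout to a log file captures only what the VM itself prints there.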
Lock coarsening disabled
First let's see the code with lock coarsening disabled. Here's my platform info:
vkandy@ksi:~/Optimizations$ uname -a
Linux ksi 2.6.31-15-generic #50-Ubuntu SMP Tue Nov 10 14:54:29 UTC 2009 i686 GNU/Linux
vkandy@ksi:~/Optimizations$ $DEBUG_JAVA_HOME/bin/java -server -Xinternalversion
Java HotSpot(TM) Server VM (16.0-b12-fastdebug) for linux-x86 JRE (1.6.0_18-ea-fastdebug-b05), built on Nov 18 2009 02:05:36 by "java_re" with gcc 3.2.1-7a (J2SE release)
vkandy@ksi:~/Optimizations$ $DEBUG_JAVA_HOME/bin/javac -d bin src/*.java
vkandy@ksi:~/Optimizations$ $DEBUG_JAVA_HOME/bin/java -server -XX:-EliminateLocks -XX:CompileCommand=print,*FavoriteChars.myFavorites -cp bin Driver >-el.log 2>/dev/null
I am only interested in the code for FavoriteChars.myFavorites(), so this command prints only the JIT-compiled myFavorites() method and redirects it to -el.log. The following is the fast-path code of the method, that is, the code executed by the bias-holding thread. See -el.log:
000 N660: # B1 <- BLOCK HEAD IS JUNK Freq: 1
000 CMP EAX,[ECX+4] # Inline cache check
    JNE SharedRuntime::handle_ic_miss_stub
    NOP
    NOP
    NOP
000
00c B1: # B24 B2 <- BLOCK HEAD IS JUNK Freq: 1
00c # stack bang
    PUSHL EBP
    SUB ESP,40 # Create frame
01a MOV EBX,ECX
01c MOV EAX,[ECX] # int
01e MOV EBP,EAX
020 AND EBP,#7
023 MOV ECX, Thread::current()
02f CMP EBP,#5
032 Jne B24 P=0.000001 C=-1.000000
032
038 B2: # B27 B3 <- B1 Freq: 0.999999
038 MOV EDI,precise klass FavoriteChars: 0x098f1870:Constant:exact *
03d MOV EBP,[EDI + #104] # int
040 MOV EDX,EBP
042 OR EDX,ECX
044 MOV ESI,EDX
046 XOR ESI,EAX
048 TEST ESI,#-121
04e Jne B27 P=0.000001 C=-1.000000
04e
054 B3: # B60 B4 <- B25 B24 B2 B31 Freq: 1
054 MEMBAR-acquire (prior CMPXCHG in FastLock so empty encoding)
054 MOV EAX,[EBX + #8] ! Field FavoriteChars.VOWELS
057 MOV EBP,[EAX + #8]
05a NullCheck EAX
05a
05a B4: # B26 B5 <- B3 Freq: 0.999999
05a TESTu EBP,EBP
05c Jbe,u B26 P=0.000001 C=-1.000000
05c
062 B5: # B41 B6 <- B4 Freq: 0.999998
062 MOVZX EDI,[EAX + #12] # ushort/char -> int
066 MEMBAR-release ! (empty encoding)
066 MOV EBP,#7
06b AND EBP,[EBX]
06d CMP EBP,#5
070 Jne B41 P=0.000001 C=-1.000000
070
076 B6: # B33 B7 <- B42 B41 B5 Freq: 0.999998
076 MOV EAX,[EBX] # int
078 MOV EDX,EAX
07a AND EDX,#7
07d CMP EDX,#5
080 Jne B33 P=0.000001 C=-1.000000
080
086 B7: # B35 B8 <- B6 Freq: 0.999997
086 MOV EBP,precise klass FavoriteChars: 0x098f1870:Constant:exact *
08b MOV EBP,[EBP + #104] # int
08e MOV EDX,EBP
090 OR EDX,ECX
092 MOV ESI,EDX
094 XOR ESI,EAX
096 TEST ESI,#-121
09c Jne B35 P=0.000001 C=-1.000000
09c
0a2 B8: # B61 B9 <- B40 B33 B7 B38 Freq: 0.999998
0a2 MEMBAR-acquire (prior CMPXCHG in FastLock so empty encoding)
0a2 MOV EBP,[EBX + #8] ! Field FavoriteChars.VOWELS
0a5 MOV EDX,[EBP + #8]
0a8 NullCheck EBP
0a8
0a8 B9: # B43 B10 <- B8 Freq: 0.999997
0a8 CMPu EDX,#1
0ab Jbe,u B43 P=0.000001 C=-1.000000
0ab
0b1 B10: # B47 B11 <- B9 Freq: 0.999996
0b1 MOVZX EBP,[EBP + #14] # ushort/char -> int
0b5 MEMBAR-release ! (empty encoding)
0b5 MOV EDX,#7
0ba AND EDX,[EBX]
0bc CMP EDX,#5
0bf Jne B47 P=0.000001 C=-1.000000
0bf
0c5 B11: # B12 <- B10 Freq: 0.999995
0c5 MOV [ESP + #8],ECX
0c9 MOV [ESP + #12],EDI
0cd MOV [ESP + #16],EBP
0cd
0d1 B12: # B45 B13 <- B58 B48 B11 Freq: 0.999996
0d1 MOV EAX,[EBX] # int
0d3 MOV EDX,EBX
0d5 MOV ECX,EAX
0d7 AND ECX,#7
0da CMP ECX,#5
0dd Jne B45 P=0.000001 C=-1.000000
0dd
0e3 B13: # B50 B14 <- B12 Freq: 0.999995
0e3 MOV EBX,precise klass FavoriteChars: 0x098f1870:Constant:exact *
0e8 MOV EDI,[EBX + #104] # int
0eb MOV ECX,EDI
0ed MOV EBX,[ESP + #8]
0f1 OR ECX,EBX
0f3 MOV EBX,ECX
0f5 XOR EBX,EAX
0f7 TEST EBX,#-121
0fd Jne B50 P=0.000001 C=-1.000000
0fd
103 B14: # B15 <- B13 Freq: 0.999994
103 MOV EBX,EDX
105 MOV EDI,[ESP + #8]
105
109 B15: # B62 B16 <- B55 B46 B14 B53 Freq: 0.999996
109 MEMBAR-acquire (prior CMPXCHG in FastLock so empty encoding)
109 MOV EBP,[EBX + #8] ! Field FavoriteChars.VOWELS
10c MOV EAX,[EBP + #8]
10f NullCheck EBP
10f
10f B16: # B49 B17 <- B15 Freq: 0.999995
10f CMPu EAX,#2
112 Jbe,u B49 P=0.000001 C=-1.000000
112
118 B17: # B56 B18 <- B16 Freq: 0.999994
118 MOVZX EBP,[EBP + #16] # ushort/char -> int
11c MEMBAR-release ! (empty encoding)
11c MOV ECX,#7
121 AND ECX,[EBX]
123 CMP ECX,#5
126 Jne B56 P=0.000001 C=-1.000000
126
12c B18: # B21 B19 <- B57 B56 B17 Freq: 0.999994
12c MOV EAX,[EDI + #68]
12f LEA EBX,[EAX + #24]
132 CMPu EBX,[EDI + #76]
135 Jnb,us B21 P=0.000100 C=-1.000000
135
137 B19: # B20 <- B18 Freq: 0.999894
137 MOV [EDI + #68],EBX
13a PREFETCHNTA [EBX + #256] ! Prefetch into non-temporal cache for write
141 MOV [EAX],0x00000001
147 PREFETCHNTA [EBX + #288] ! Prefetch into non-temporal cache for write
14e MOV [EAX + #4],precise klass [C: 0x096dfa30:Constant:exact *
155 PREFETCHNTA [EBX + #320] ! Prefetch into non-temporal cache for write
15c MOV [EAX + #8],#3
163 MOV [EAX + #12],#0
16a XOR ECX.lo,ECX.lo
    XOR ECX.hi,ECX.hi
16e MOV [EAX + #16],ECX.lo
    MOV [EAX + #16]+4,ECX.hi
16e
174 B20: # N660 <- B22 B19 Freq: 0.999994
174 MOV ECX,[ESP + #12]
178 MOV16 [EAX + #12],ECX
17c
17c #checkcastPP of EAX
17c MOV EBX,[ESP + #16]
180 MOV16 [EAX + #14],EBX
184 MOV16 [EAX + #16],EBP
188 ADD ESP,40 # Destroy frame
    POPL EBP
    TEST PollPage,EAX ! Poll Safepoint
192 RET
Observations
Firstly, you can see that the 3 invocations of getVowel(int), the critical sections for which the lock is needed, are inlined starting at labels B3, B8 and B15. See the instructions in blocks B3-B5, B8-B10 and B15-B17 (lines 76-88, 111-123 and 157-169 of -el.log). There are no calls to the getVowel(int) method. Instead we see 3 MOVZX instructions which get the job done (not counting loading the array): MOVZX EDI,[EAX + #12], MOVZX EBP,[EBP + #14] and MOVZX EBP,[EBP + #16]. Note that when you print the bytecode of FavoriteChars.class, javap may show 3 invokevirtual 3 <getVowel> <(I)C> statements, but at runtime getVowel(int) is compiled to native code and inlined into the FavoriteChars.myFavorites() method.
Secondly, note the conditional jump instructions (Jne) at the end of B1 and B2, just above label B3 (the first critical section). Similarly, there are conditional jump instructions just above B8 and B15, the other 2 critical sections. The instructions in B1 and B2 are the biased-locking check: they test whether the object's header carries the bias pattern and whether it is biased toward the current thread. Updating the header with the biasing thread's information is left to the slow path. Threads other than the bias-holding thread are made to jump to the slow path:
00c B1: # B24 B2 <- BLOCK HEAD IS JUNK Freq: 1
00c # stack bang
    PUSHL EBP
    SUB ESP,40 # Create frame
01a MOV EBX,ECX
01c MOV EAX,[ECX] # int
01e MOV EBP,EAX
020 AND EBP,#7
023 MOV ECX, Thread::current()
02f CMP EBP,#5
032 Jne B24 P=0.000001 C=-1.000000
032
038 B2: # B27 B3 <- B1 Freq: 0.999999
038 MOV EDI,precise klass FavoriteChars: 0x08472d08:Constant:exact *
03d MOV EBP,[EDI + #104] # int
040 MOV EDX,EBP
042 OR EDX,ECX
044 MOV ESI,EDX
046 XOR ESI,EAX
048 TEST ESI,#-121
04e Jne B27 P=0.000001 C=-1.000000
1a6 B23: # B24 <- B27 Freq: 9.99999e-13
1a6 CMPXCHG [EBX],EBP # If EAX==[EBX] Then store EBP into [EBX]
235 B27: # B23 B28 <- B2 Freq: 9.99999e-07
235 TEST ESI,#7
23b Jne B23 P=0.000001 C=-1.000000
The bias-holding thread acquires the lock and goes on to execute the critical section in B3, whereas other threads have to execute CMPXCHG, a CAS operation on the slow path (and probably more), to acquire the lock before entering the critical section in B3.
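The arithmetic that B1 and B2 perform on the object header can be written out as a small sketch. Everything here is an assumption layered on the listing: the class and method names are hypothetical, the header, prototype header and thread pointer are modelled as plain ints, and reading 0x78 as the object-age bits is my interpretation of the 32-bit header layout (the listing itself only shows TEST ...,#-121, and -121 is ~0x78).

```java
class BiasCheckSketch {
    static final int LOCK_BITS_MASK = 0x7;  // AND reg,#7 in B1
    static final int BIASED_PATTERN = 0x5;  // CMP reg,#5 in B1
    static final int IGNORED_BITS   = 0x78; // TEST reg,#-121 in B2 masks with ~0x78

    // B1: does the header carry the biased-lock bit pattern at all?
    static boolean hasBiasPattern(int header) {
        return (header & LOCK_BITS_MASK) == BIASED_PATTERN;
    }

    // B2: ignoring the masked-out bits, is the header equal to
    // (prototype header | current thread)?
    static boolean isBiasedToward(int header, int prototypeHeader, int thread) {
        int expected = prototypeHeader | thread; // OR  EDX,ECX
        int diff = expected ^ header;            // XOR ESI,EAX
        return (diff & ~IGNORED_BITS) == 0;      // TEST ESI,#-121
    }

    // The fast path falls through to B3 only when both checks pass;
    // otherwise the Jne instructions branch to the slow path.
    static boolean takesFastPath(int header, int prototypeHeader, int thread) {
        return hasBiasPattern(header) && isBiasedToward(header, prototypeHeader, thread);
    }

    public static void main(String[] args) {
        int thread = 0x09F00200;              // made-up thread pointer value
        int header = thread | 0x5 | (3 << 3); // biased to thread, some age bits set
        System.out.println(takesFastPath(header, 0x5, thread));     // bias owner
        System.out.println(takesFastPath(header, 0x5, 0x09F00400)); // another thread
    }
}
```

The values in main are invented purely to exercise the two branches; only the masks and the order of operations come from the listing.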
At this point we know that the current thread holds the lock for this object, so we can move on to the critical section in B3. All that the getVowel(int) method does is load the array and read the element at the given index, so that is what is wrapped between the MEMBAR-acquire and MEMBAR-release statements. No instructions are generated for MEMBAR-acquire and MEMBAR-release (the instruction address does not advance, which shows their size is zero) because at this point the thread executing this code is either the bias-holding thread or the thread that won the lock; the listing's own annotation attributes the empty encoding to the prior CMPXCHG in FastLock:
054 B3: # B60 B4 <- B25 B24 B2 B31 Freq: 1
054 MEMBAR-acquire (prior CMPXCHG in FastLock so empty encoding)
054 MOV EAX,[EBX + #8] ! Field FavoriteChars.VOWELS
057 MOV EBP,[EAX + #8]
05a NullCheck EAX
05a
05a B4: # B26 B5 <- B3 Freq: 0.999999
05a TESTu EBP,EBP
05c Jbe,u B26 P=0.000001 C=-1.000000
05c
062 B5: # B41 B6 <- B4 Freq: 0.999998
062 MOVZX EDI,[EAX + #12] # ushort/char -> int
066 MEMBAR-release ! (empty encoding)
066 MOV EBP,#7
06b AND EBP,[EBX]
06d CMP EBP,#5
070 Jne B41 P=0.000001 C=-1.000000
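The offsets in the three MOVZX instructions follow from simple arithmetic, sketched here. The constants are read off this listing rather than taken from any specification: the array length is loaded from offset 8 and element 0 from offset 12, and a char occupies 2 bytes. The class and method names are hypothetical.

```java
class CharArrayOffsets {
    // Assumptions inferred from the listing for this 32-bit VM.
    static final int LENGTH_OFFSET = 8;         // MOV EBP,[EAX + #8]
    static final int FIRST_ELEMENT_OFFSET = 12; // MOVZX EDI,[EAX + #12]
    static final int CHAR_SIZE = 2;             // a Java char is 16 bits

    // Byte offset of element 'index' from the start of the char[] object.
    static int elementOffset(int index) {
        return FIRST_ELEMENT_OFFSET + index * CHAR_SIZE;
    }

    // Mirrors "CMPu length,#index" followed by "Jbe,u": the branch to the
    // out-of-bounds path is taken when length <= index (unsigned compare).
    static boolean inBounds(int length, int index) {
        return Integer.compareUnsigned(length, index) > 0;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            System.out.println("index " + i + " -> offset " + elementOffset(i)
                    + ", in bounds for length 5: " + inBounds(5, i));
        }
    }
}
```

For indexes 0, 1 and 2 this gives offsets 12, 14 and 16, matching the operands of the MOVZX instructions in B5, B10 and B17.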
This is repeated 2 more times for the remaining 2 invocations of getVowel(int), and finally an array is constructed and returned. In summary, threads other than the bias-holding thread acquire and release the same lock 3 times, once for each critical section.
Lock coarsening enabled
Now, let's enable lock coarsening and run the program again.
vkandy@ksi:~/Optimizations$ $DEBUG_JAVA_HOME/bin/java -server -XX:+EliminateLocks -XX:CompileCommand=print,*FavoriteChars.myFavorites -cp bin Driver >+el.log 2>/dev/null
The following is the output of the fast path. The first 2 labels, B1 and B2 (biased locking code), are similar to what we saw before. But notice the instructions between labels B3 and B5 in +el.log:
00c B1: # B12 B2 <- BLOCK HEAD IS JUNK Freq: 1
00c # stack bang
    PUSHL EBP
    SUB ESP,24 # Create frame
01a MOV EBX,ECX
01c MOV EAX,[ECX] # int
01e MOV ECX,EAX
020 AND ECX,#7
023 MOV EDX, Thread::current()
02f CMP ECX,#5
032 Jne B12 P=0.000001 C=-1.000000
032
038 B2: # B15 B3 <- B1 Freq: 0.999999
038 MOV ECX,precise klass FavoriteChars: 0x08e05b88:Constant:exact *
03d MOV EBP,[ECX + #104] # int
040 MOV ECX,EBP
042 OR ECX,EDX
044 MOV EDI,ECX
046 XOR EDI,EAX
048 TEST EDI,#-121
04e Jne B15 P=0.000001 C=-1.000000
04e
054 B3: # B22 B4 <- B13 B12 B2 B19 Freq: 1
054 MEMBAR-acquire (prior CMPXCHG in FastLock so empty encoding)
054 MOV EAX,[EBX + #8] ! Field FavoriteChars.VOWELS
057 MOV EBP,[EAX + #8]
05a NullCheck EAX
05a
05a B4: # B14 B5 <- B3 Freq: 0.999999
05a CMPu EBP,#2
05d Jbe,u B14 P=0.000001 C=-1.000000
05d
063 B5: # B20 B6 <- B4 Freq: 0.999998
063 MOVZX EDI,[EAX + #16] # ushort/char -> int
067 MOVZX ESI,[EAX + #14] # ushort/char -> int
06b MOVZX EBP,[EAX + #12] # ushort/char -> int
06f MEMBAR-release ! (empty encoding)
06f MOV EAX,#7
074 AND EAX,[EBX]
076 CMP EAX,#5
079 Jne B20 P=0.000001 C=-1.000000
The critical sections, the 3 reads from the array FavoriteChars.VOWELS, are now grouped together between a single MEMBAR-acquire and MEMBAR-release pair: see the 3 MOVZX instructions in label B5. Again, there are no instructions for MEMBAR-acquire and MEMBAR-release because the fast-path code is for the bias-holding thread. Threads other than the bias-holding thread, however, have to acquire the lock just once to read the 3 chars from FavoriteChars.VOWELS. That is, they execute the expensive lock acquisition code once to enter and execute all 3 critical sections. In other words, with -XX:+EliminateLocks, C2 merged 3 synchronized blocks that lock on the same object into 1 (relatively) larger block, thereby reducing locking overhead.
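At the source level, the difference between the two listings can be pictured roughly as follows. This is only a sketch: the JIT works on its own intermediate representation, not on Java source, and the class below is hypothetical. It uses an explicit ReentrantLock with a counter in place of the object's monitor purely so that the number of acquisitions can be observed.

```java
import java.util.concurrent.locks.ReentrantLock;

class CoarseningSketch {
    private final char[] vowels = { 'a', 'e', 'i', 'o', 'u' };
    private final ReentrantLock lock = new ReentrantLock();
    int acquisitions = 0; // counts how often the lock was taken

    // One critical section: stands in for the synchronized getVowel(int).
    private char getVowel(int index) {
        lock.lock();
        try {
            acquisitions++;
            return vowels[index];
        } finally {
            lock.unlock();
        }
    }

    // Shape of the code without coarsening: three lock/unlock pairs.
    char[] separate() {
        char first = getVowel(0);
        char second = getVowel(1);
        char third = getVowel(2);
        return new char[] { first, second, third };
    }

    // Shape of the code with coarsening: the three reads share one
    // lock/unlock pair, and the result array is built after the release.
    char[] merged() {
        char first, second, third;
        lock.lock();
        try {
            acquisitions++;
            first = vowels[0];
            second = vowels[1];
            third = vowels[2];
        } finally {
            lock.unlock();
        }
        return new char[] { first, second, third };
    }

    public static void main(String[] args) {
        CoarseningSketch a = new CoarseningSketch();
        a.separate();
        CoarseningSketch b = new CoarseningSketch();
        b.merged();
        System.out.println("separate: " + a.acquisitions + ", merged: " + b.acquisitions);
    }
}
```

Both methods return the same three chars; the merged variant simply pays for lock acquisition once instead of 3 times.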
If you have any comments, suggestions or corrections please feel free to let me know.