Wow, it looks like Oracle is pretty serious about supporting AVX/AVX2 in the JVM. Not sure if that's fully implemented in Java 8, but I don't see why not. It may well be that ChipAbuse options 3 and 4 can trigger AVX2 overvolt. Obviously the older code won't, since it's all volatile.
In any case, the article is an interesting read, especially when they get into rewriting code snippets so the JVM can peel and unroll loops to use SIMD to its fullest.
In my experience, the JVM does better if you explicitly feed it operands of like type with like operators in groupings of 256 bits (or 512 bits such as in the first example below). That's on a Steamroller though, and who knows what kind of optimization the JVM is doing for that thing. Regardless:
Code:
j = 0;
do
{
    float3[j][0] = float1[j][0] + float2[j][0];
    float3[j][1] = float1[j][1] + float2[j][1];
    // indices 2 through 14 here
    float3[j][15] = float1[j][15] + float2[j][15];
    j++;
}
while (j < 8);
seems to work a lot better than:
Code:
i = 0;
do
{
    float3[i] = float1[i] + float2[i];
    i++;
}
while (i < 128);
It may be due to my habit of using do/while loops and declaring the iterator variable outside the loop structure, but whatever. The fastest completion times I've been able to cook up have come from grouping statements into blocks in multiples of 8. What's weird is that the Intel document explicitly mentions arrays of 1024 elements, whereas I've had the best results with 128 elements. Might be a cache-size or cache-performance issue. Again, Steamroller, not Haswell . . .
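For anyone who wants to poke at this themselves, here's a self-contained sketch of the two loop shapes above. The class name and array initialization are mine, and the inner for loop stands in for the 16 hand-unrolled statements; only the loop shapes come from the snippets. Both forms do the same 128 additions, so they should produce identical results:

```java
public class VectorAdd {
    public static void main(String[] args) {
        // Unrolled form: 8 rows of 16 floats = 128 elements, 512 bits per row.
        float[][] a = new float[8][16];
        float[][] b = new float[8][16];
        float[][] c = new float[8][16];
        for (int r = 0; r < 8; r++) {
            for (int k = 0; k < 16; k++) {
                a[r][k] = r * 16 + k;
                b[r][k] = 1.0f;
            }
        }

        int j = 0;
        do {
            for (int k = 0; k < 16; k++) {   // stands in for the 16 unrolled statements
                c[j][k] = a[j][k] + b[j][k];
            }
            j++;
        } while (j < 8);

        // Flat form: the same 128 additions in one plain loop.
        float[] a2 = new float[128], b2 = new float[128], c2 = new float[128];
        for (int k = 0; k < 128; k++) {
            a2[k] = k;
            b2[k] = 1.0f;
        }
        int i = 0;
        do {
            c2[i] = a2[i] + b2[i];
            i++;
        } while (i < 128);

        // Verify the two shapes compute the same thing.
        boolean same = true;
        for (int k = 0; k < 128; k++) {
            if (c[k / 16][k % 16] != c2[k]) same = false;
        }
        System.out.println(same ? "match" : "mismatch");
    }
}
```

To see which shape the JIT actually vectorizes on your hardware, wrap each loop in a proper harness (JMH, or at minimum enough warmup iterations for C2 to kick in) rather than timing a single pass.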
