According to the most recent Steam Hardware Survey, 99% of steam users have SSE2 and SSE3
instruction support. The most immediately obvious thing to do with SSE is to try to optimize
4-vector math. However, a naiive reimplementation of vector methods doesn't come close to the theoretical 4x improvement that one would expect from an SSE implementation
Lets start with the naiive implementation of a vector4, and the naiive implementation of
operator +=:
struct Vector4
{
//Assume that this has constructors
//and other reasonable methods on it,
//like dot, operator[], other operators,
//implemented in the most straightforward way.
Vector4& operator+=(const Vector4& other)
{
for (size_t i = 0; i < 4; ++i)
{
elements[i] += other.elements[i];
}
return this;
}
float elements[4];
};
Wonderful. Next, a test rig just to test this basic functionality:
Vector4 v = Vector4(r[0], r[1], r[2], r[3]);
Vector4 b = Vector4(r[4], r[5], r[6], r[7]);
LARGE_INTEGER begin;
QueryPerformanceCounter(&begin);
Vector4 c = Vector4::Zero;
for (size_t i = 0; i < 100000000; ++i)
{
c += v + b;
c -= b + v;
}
LARGE_INTEGER end;
QueryPerformanceCounter(&end);
LARGE_INTEGER freq;
QueryPerformanceFrequency(&freq);
double time = (double)(end.QuadPart - begin.QuadPart) / (freq.QuadPart);
std::cout << "Took: " << 1000 * time << "ms\n";
The expected output is [0, 0, 0, 0], and r is populated with random data (no denormals).
Using the naiive implementation, with /Ox (Full optimization) and no architecture flags,
the generated code simply loads up every floating point value, converts to double, adds it, and
then converts backwards. The core of the operation is shown below - a bunch of flds, fadds and fstp.
009714AB fadd dword ptr [esp+5Ch]
009714AF fstp dword ptr [esp+98h]
009714B6 fld dword ptr [esp+8Ch]
009714BD fadd dword ptr [esp+7Ch]
009714C1 fstp dword ptr [esp+7Ch]
009714C5 fld dword ptr [esp+90h]
009714CC fadd dword ptr [esp+80h]
009714D3 fstp dword ptr [esp+80h]
009714DA fld dword ptr [esp+94h]
009714E1 fadd dword ptr [esp+84h]
009714E8 fstp dword ptr [esp+84h]
009714EF fld dword ptr [esp+98h]
009714F6 fadd dword ptr [esp+88h]
009714FD fstp dword ptr [esp+88h]
The problem is that the floating point model is set to precise, which forces
80-bit precision on the FPU. Changing the floating point model to fast does
not net any gains in this department - the generated code is functionally
identical between the versions.
The baseline performance on my i7 here is roughly 810 milliseconds with /fp:fast, and 1080
milliseconds with /fp:precise. This means that the target performance for this loop is 202.5
milliseconds, which will represent a 4x speedup.
The least labor intensive way of preforming SSE optimizations is to let the compiler do it for
you. While Visual Studio 2010 doesn't allow targeting SSE3 architectures specifically, the
SSE3 instructions do not really add very much - the only interesting instruction for
short vector operations is hadd, the horizontal add. HADD can be used to implement
SSE dot products (ignoring the SSE4 dot product intrinsic, which is actually pretty slow),
but only shows significant gains if you do 4 dot products at once.
Either way, enabling /arch:SSE2 produces shockingly bad code. Basically, it loads every
element into its own XMM register with movss, and then adds them one by one. Since there
are not enough registers to load all 13 floating point values at one value per register,
the compiler generates a load of movss into memory, which incurs even more slowdown.
;;; This loads up a bunch of elements into XMM registers 0-3.
0024139E movss xmm1,dword ptr [esp+80h]
002413A7 movss xmm2,dword ptr [esp+84h]
002413B0 movss xmm3,dword ptr [esp+88h]
002413B9 addss xmm1,dword ptr [esp+70h]
002413BF addss xmm2,dword ptr [esp+74h]
002413C5 addss xmm3,dword ptr [esp+78h]
002413CB movss dword ptr [esp+30h],xmm0
002413D1 movss xmm0,dword ptr [Math::Vector<float,4>::Zero+4 (24631Ch)]
002413D9 movss dword ptr [esp+34h],xmm0
002413DF movss xmm0,dword ptr [Math::Vector<float,4>::Zero+8 (246320h)]
002413E7 movss dword ptr [esp+38h],xmm0
002413ED movss xmm0,dword ptr [Math::Vector<float,4>::Zero+0Ch (246324h)]
002413F5 movss dword ptr [esp+3Ch],xmm0
002413FB movss xmm0,dword ptr [esp+7Ch]
00241401 addss xmm0,dword ptr [esp+6Ch]
;;; Integer conversion code omitted
00241442 movss xmm4,dword ptr [esp+10h]
00241448 movaps xmm5,xmm4
0024144B addss xmm5,xmm6
0024144F addss xmm5,dword ptr [esp+30h]
00241455 movaps xmm6,xmm4
00241458 addss xmm6,xmm7
0024145C movss dword ptr [esp+54h],xmm6
00241462 movaps xmm6,xmm4
00241465 movaps xmm7,xmm2
00241468 addss xmm6,xmm7
0024146C movss xmm7,dword ptr [esp+78h]
00241472 addss xmm6,dword ptr [esp+38h]
00241478 addss xmm7,xmm4
0024147C addss xmm7,dword ptr [esp+3Ch]
00241482 movss dword ptr [esp+3Ch],xmm7
After all of this, the actual performance is actually a bit better - 700 milliseconds
for the same loop, a 1.15x improvement on the naiive implementation, but still
far from the idealized 210 millisecond runtime that an SSE implementation should aim for.
Before actually implementing the SSE version, certain constraints need to be added here.
SSE performs optimally when loading from 16 byte aligned addresses, which means that, by
default, any heap allocation will likely be incompatible with the movps instructions. One can
specify that any stack allocation should be aligned with a declspec, but this causes
certain problems. In particular, the vector implementation that ships with Visual Studio
contains at least this troublesome line:
void resize(size_type _Newsize, _Ty _Val)
When _Ty is an aligned type, the template will fail to compile, since it is not possible,
in the general case, to push an aligned value onto the stack in preparation for a function call.
Changing the container to use boost::container::vector, which uses a constant reference instead
of a value, allows the container to compile.
However, actually using the container causes memory problems - movps expects an aligned
memory address, but the container doesn't guarantee any such alignment. This will result
in a runtime crash. A solution here is to store pointers to your data, but this design
will invalidate your cache like mad. A proper solution is to use an aligned allocator,
but this causes template type issues. Without support for template template aliases, the
next best solution is to use some type computation to generate a container with the
properly aligned allocator, but this has maintainability problems.
Another problem to consider here is that repeated calls to movps actually will do redundant
loads if you're not careful. The easiest way to make sure that move instructions are generated
only when they are required is to invoke a small bit of undefined behavior, and store both the
__m128 register and the floating point representation together:
__declspec(align(16)) union storage
{
__m128 sse
float fpu[4];
};
The declspec above is not strictly required, since that can be inferred from the __m128 value,
but it doesn't harm the code generation at all, and it is always nice to be explicit. The actual
implementation lives in the rmath repository, and not
be posted here. The generated code here is not quite optimal, but it is reasonable:
00F71406 movaps xmm2,xmmword ptr [esp+40h]
00F7140B movdqa xmm1,xmmword ptr [Math::Vector<float,4>::Zero (0F766C0h)]
00F71413 addps xmm2,xmmword ptr [esp+10h]
;;; Integer division and conversion ommitted ...
; Move (float)(i / 200) into xmm0, and shuffle it into all components, and
; some more math operations
00F71441 movss xmm0,dword ptr [esp+0Ch]
00F71447 fstp dword ptr [esp+10h]
00F7144B shufps xmm0,xmm0,0
00F7144F addps xmm0,xmm2
00F71452 addps xmm0,xmm1
00F71455 movss xmm1,dword ptr [esp+10h]
00F7145B shufps xmm1,xmm1,0
00F7145F addps xmm1,xmm2
00F71462 subps xmm0,xmm1
00F71465 movaps xmm1,xmm0
This code executes in 280 milliseconds, an 2.8x improvement over the original, non-SSE code.
Interestingly, the vast majority of the time is spent within the integer division and subsequent
floating point conversion. Removing the code shaves 200 milliseconds off the naiive implementation,
to 600ms, and improves the sse implementation to 175ms, a 3.4x improvement. This improvement
is significant, but only really affects vector computation heavy code, which should have been
already instruction parallelized. Using an SSE version of a vector also requires special
container considerations, which might require significant code rework. Considering the
actual level of code that would benefit from this optimization, and the amount of code
that would be required to change out all the containers and allocators, it may not even be a good
cost/benefit ratio.