According to the most recent Steam Hardware Survey, 99% of Steam users have SSE2 and SSE3 instruction support. The most immediately obvious thing to do with SSE is to try to optimize 4-vector math. However, a naive reimplementation of vector methods doesn’t come close to the theoretical 4x improvement that one would expect from an SSE implementation.

Let’s start with the naive implementation of a Vector4, and the naive implementation of operator+=:

struct Vector4
{
	//Assume that this has constructors
	//and other reasonable methods on it,
	//like dot, operator[], other operators, 
	//implemented in the most straightforward way.


	Vector4& operator+=(const Vector4& other)
	{
		for (size_t i = 0; i < 4; ++i)
		{
			elements[i] += other.elements[i];
		}

		return *this;
	}
	
	float elements[4];	
};

Wonderful. Next, a test rig to exercise this basic functionality:

Vector4 v = Vector4(r[0], r[1], r[2], r[3]);
Vector4 b = Vector4(r[4], r[5], r[6], r[7]);

LARGE_INTEGER begin;
QueryPerformanceCounter(&begin);

Vector4 c = Vector4::Zero;
for (size_t i = 0; i < 100000000; ++i)
{
	c += v + b;
	c -= b + v;
}

LARGE_INTEGER end;
QueryPerformanceCounter(&end);

LARGE_INTEGER freq;
QueryPerformanceFrequency(&freq);

double time = (double)(end.QuadPart - begin.QuadPart) / (freq.QuadPart);
std::cout << "Took: " << 1000 * time << "ms\n";

The expected output is [0, 0, 0, 0], and r is populated with random data (no denormals).

Using the naive implementation, with /Ox (Full Optimization) and no architecture flags, the generated code simply loads each single-precision value onto the x87 stack (extending it to the FPU's internal precision), adds it, and converts back on store. The core of the operation is shown below - a long run of flds, fadds, and fstps.

009714AB  fadd        dword ptr [esp+5Ch]  
009714AF  fstp        dword ptr [esp+98h]  
009714B6  fld         dword ptr [esp+8Ch]  
009714BD  fadd        dword ptr [esp+7Ch]  
009714C1  fstp        dword ptr [esp+7Ch]  
009714C5  fld         dword ptr [esp+90h]  
009714CC  fadd        dword ptr [esp+80h]  
009714D3  fstp        dword ptr [esp+80h]  
009714DA  fld         dword ptr [esp+94h]  
009714E1  fadd        dword ptr [esp+84h]  
009714E8  fstp        dword ptr [esp+84h]  
009714EF  fld         dword ptr [esp+98h]  
009714F6  fadd        dword ptr [esp+88h]  
009714FD  fstp        dword ptr [esp+88h]  

The problem is that the floating point model is set to precise, which forces 80-bit precision on the FPU. Changing the floating point model to fast nets no gains in this department - the generated code is functionally identical between the two versions - although, as the timings below show, fast mode does run measurably faster.

The baseline performance on my i7 is roughly 810 milliseconds with /fp:fast, and 1080 milliseconds with /fp:precise. The target performance for this loop is therefore 202.5 milliseconds, which would represent a 4x speedup.
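For concreteness, the two configurations correspond to command lines roughly like the following (the source file name is a stand-in):

cl /Ox /fp:precise main.cpp
cl /Ox /fp:fast main.cpp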

The least labor intensive way of performing SSE optimizations is to let the compiler do it for you. While Visual Studio 2010 doesn’t allow targeting SSE3 architectures specifically, the SSE3 instructions do not add very much - the only interesting one for short vector operations is haddps, the horizontal add. HADD can be used to implement SSE dot products (ignoring the SSE4 dot product intrinsic, which is actually pretty slow), but only shows significant gains if you do four dot products at once.
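For illustration, a single dot product built on the horizontal add might look like the sketch below (my own example, not compiler output; _mm_hadd_ps lives in <pmmintrin.h>):

#include <pmmintrin.h>

float dot4(__m128 a, __m128 b)
{
	__m128 m = _mm_mul_ps(a, b); // component-wise products
	m = _mm_hadd_ps(m, m);       // [x+y, z+w, x+y, z+w]
	m = _mm_hadd_ps(m, m);       // the full sum in every lane
	return _mm_cvtss_f32(m);     // extract the low lane
}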

Either way, enabling /arch:SSE2 produces shockingly bad code. Basically, it loads every element into its own XMM register with movss, and then adds them one by one. Since there are not enough registers to hold all 13 floating point values at one value per register, the compiler also generates a pile of movss spills to memory, which incurs even more slowdown.

;;; This loads up a bunch of elements into XMM registers 0-3.
0024139E  movss       xmm1,dword ptr [esp+80h]  
002413A7  movss       xmm2,dword ptr [esp+84h]  
002413B0  movss       xmm3,dword ptr [esp+88h]  
002413B9  addss       xmm1,dword ptr [esp+70h]  
002413BF  addss       xmm2,dword ptr [esp+74h]  
002413C5  addss       xmm3,dword ptr [esp+78h]  
002413CB  movss       dword ptr [esp+30h],xmm0  
002413D1  movss       xmm0,dword ptr [Math::Vector<float,4>::Zero+4 (24631Ch)]  
002413D9  movss       dword ptr [esp+34h],xmm0  
002413DF  movss       xmm0,dword ptr [Math::Vector<float,4>::Zero+8 (246320h)]  
002413E7  movss       dword ptr [esp+38h],xmm0  
002413ED  movss       xmm0,dword ptr [Math::Vector<float,4>::Zero+0Ch (246324h)]  
002413F5  movss       dword ptr [esp+3Ch],xmm0  
002413FB  movss       xmm0,dword ptr [esp+7Ch]  
00241401  addss       xmm0,dword ptr [esp+6Ch]  

;;; Integer conversion code omitted 

00241442  movss       xmm4,dword ptr [esp+10h]  
00241448  movaps      xmm5,xmm4  
0024144B  addss       xmm5,xmm6  
0024144F  addss       xmm5,dword ptr [esp+30h]  
00241455  movaps      xmm6,xmm4  
00241458  addss       xmm6,xmm7  
0024145C  movss       dword ptr [esp+54h],xmm6  
00241462  movaps      xmm6,xmm4  
00241465  movaps      xmm7,xmm2  
00241468  addss       xmm6,xmm7  
0024146C  movss       xmm7,dword ptr [esp+78h]  
00241472  addss       xmm6,dword ptr [esp+38h]  
00241478  addss       xmm7,xmm4  
0024147C  addss       xmm7,dword ptr [esp+3Ch]  
00241482  movss       dword ptr [esp+3Ch],xmm7  

After all of this, the performance actually is a bit better - 700 milliseconds for the same loop, a 1.15x improvement over the naive implementation, but still far from the 202.5 millisecond runtime that an SSE implementation should aim for.

Before actually implementing the SSE version, certain constraints need to be addressed. SSE performs optimally when loading from 16 byte aligned addresses, which means that, by default, any heap allocation will likely be incompatible with the aligned movaps instruction. One can specify that a stack allocation should be aligned with a declspec, but this causes certain problems. In particular, the vector implementation that ships with Visual Studio contains at least this troublesome line:

void resize(size_type _Newsize, _Ty _Val)

When _Ty is an aligned type, the template will fail to compile, since it is not possible, in the general case, to pass an aligned value by value on the stack in preparation for a function call. A minimal reproduction of the failure might look like this (a sketch; the exact diagnostic varies by compiler version):
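__declspec(align(16)) struct Aligned { float f[4]; };

std::vector<Aligned> v;
v.resize(10, Aligned()); // error: a formal parameter with
                         // __declspec(align(16)) won't be aligned

Changing the container to boost::container::vector, whose resize takes a constant reference instead of a value, allows the code to compile.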

However, actually using the container causes memory problems - movaps expects an aligned memory address, but the container doesn’t guarantee any such alignment, which results in a runtime crash. A solution here is to store pointers to your data, but that design destroys your cache locality. A proper solution is to use an aligned allocator, but this causes template type issues. Without support for template aliases, the next best solution is to use some type computation to generate a container with the properly aligned allocator, but this has maintainability problems.
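As a sketch of that type computation (aligned_allocator here is a stand-in for whatever 16-byte-aligned allocator is on hand, e.g. one built on _aligned_malloc and _aligned_free):

template <typename T>
struct AlignedVector
{
	typedef boost::container::vector<T, aligned_allocator<T> > type;
};

// Every use site has to spell out ::type, which is the
// maintainability problem mentioned above.
AlignedVector<Vector4>::type vectors;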

Another problem to consider is that repeated movaps loads will do redundant work if you’re not careful. The easiest way to make sure that move instructions are generated only when they are required is to invoke a small bit of undefined behavior, and store both the __m128 register and the floating point representation together in a union:

__declspec(align(16)) union storage
{
	__m128 sse;    // SIMD view, used by the intrinsics
	float  fpu[4]; // scalar view, for element access
};

The declspec above is not strictly required, since the alignment can be inferred from the __m128 member, but it doesn’t harm the code generation at all, and it is always nice to be explicit. The actual implementation lives in the rmath repository, and will not be posted in full here.
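As a rough sketch only (this shows the shape of the code, not the rmath implementation, and assumes the Vector4 holds a storage member named data):

Vector4& operator+=(const Vector4& other)
{
	// One packed add replaces four scalar adds, and because the
	// value already lives in data.sse, no redundant loads occur.
	data.sse = _mm_add_ps(data.sse, other.data.sse);
	return *this;
}

With the union in place, the generated code is not quite optimal, but it is reasonable: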

00F71406  movaps      xmm2,xmmword ptr [esp+40h]  
00F7140B  movdqa      xmm1,xmmword ptr [Math::Vector<float,4>::Zero (0F766C0h)]  
00F71413  addps       xmm2,xmmword ptr [esp+10h]  

;;; Integer division and conversion omitted ...

; Move (float)(i / 200) into xmm0, and shuffle it into all components, and 
; some more math operations 

00F71441  movss       xmm0,dword ptr [esp+0Ch]  
00F71447  fstp        dword ptr [esp+10h]  
00F7144B  shufps      xmm0,xmm0,0  
00F7144F  addps       xmm0,xmm2  
00F71452  addps       xmm0,xmm1  
00F71455  movss       xmm1,dword ptr [esp+10h]  
00F7145B  shufps      xmm1,xmm1,0  
00F7145F  addps       xmm1,xmm2  
00F71462  subps       xmm0,xmm1  
00F71465  movaps      xmm1,xmm0  

This code executes in 280 milliseconds, a 2.8x improvement over the original, non-SSE code. Interestingly, the vast majority of the time is spent within the integer division and the subsequent floating point conversion. Removing that code shaves 200 milliseconds off the naive implementation, bringing it to roughly 600ms, and improves the SSE implementation to 175ms, a 3.4x improvement. The improvement is significant, but it only really matters in vector-computation-heavy code, which should already have been parallelized at the instruction level. Using an SSE vector also requires the special container considerations described above, which might mean significant code rework. Considering how little code would actually benefit from this optimization, and how much code would have to change to swap out all the containers and allocators, the cost/benefit ratio may well come out against it.