2009/02/13

AA-sort

A co-worker sent me a link to AA-sort some time ago, but I didn't have any reason to use it until just recently. I found myself in a situation where I have 2K~4K 4-byte elements packet into qwords and my initial O(N^2) solution just wasn't working out. So, I decided to try the O(N log N) "in-core" component of AA-sort as a pre-sort.

The paper linked above is relatively straight-forward, but they only provide pseudo-code so there's still a bit of effort required to get something usable. As per the author's goals, it's easily written as highly-vectorized, non-branching code and was rather fun to get working.

The authors were working with 8 16-bit values packed into a qword, and I had to make some simple changes to get it working with vec_uint4s.

First, my bitonic_sort() routine takes care of step one of their in-core algorithm by sorting the 4 uint32_t elements in a qword:

inline vec_uint4 bitonic_sort( vec_uint4 Vec_ )
{
static const vec_uchar16 BADC = {
0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03,
0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b
};
static const vec_uchar16 CDAB = {
0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
};
static const vec_uint4 mask1 = {
0x00000000, 0xffffffff, 0xffffffff, 0x00000000
};
static const vec_uint4 mask2 = {
0x00000000, 0x00000000, 0xffffffff, 0xffffffff
};
static const vec_uint4 mask3 = {
0x00000000, 0xffffffff, 0x00000000, 0xffffffff
};
vec_uint4 temp, cmp, result;

temp = spu_shuffle( Vec_, Vec_, BADC ); // Compare A to B and C to D
cmp = spu_cmpgt( Vec_, temp );
cmp = spu_xor( cmp, mask1 ); // Order A/B increasing, C/D decreasing
result = spu_sel( Vec_, temp, cmp );

temp = spu_shuffle( result, result, CDAB ); // Compare AB to CD
cmp2 = spu_cmpgt( result, temp );
cmp2 = spu_xor( cmp2, mask2 ); // Order AB/CD increasing
result = spu_sel( result, temp, cmp2 );
temp = spu_shuffle( result, result, BADC ); // Compare previous A to B and C to D
cmp = spu_cmpgt( result, temp );
cmp = spu_xor( cmp, mask3 ); // Order A/B and C/D increasing
result = spu_sel( result, temp, cmp );

return result;
}

Next come cmpswap() and cmpswap_skew() from the paper:

inline vec_uint4 vector_cmpswap( vec_uint4* A_, vec_uint4* B_ )
{
const vec_uint4 a = *A_;
const vec_uint4 b = *B_;
const vec_uint4 cmp = spu_cmpgt( a, b );
*A_ = spu_sel( a, b, cmp );
*B_ = spu_sel( b, a, cmp );

return cmp;
}

inline vec_uint4 vector_cmpswap_skew( vec_uint4* A_, vec_uint4* B_ )
{
const vec_uint4 a = *A_;
const vec_uint4 b = *B_;

// The last word compared is 0xffff so that part of the mask should always be 0000
static const vec_uchar16 BCDx = {
0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b,
0x0c, 0x0d, 0x0e, 0x0f, 0xc0, 0xc0, 0xc0, 0xc0
};
const vec_uint4 bShift = spu_shuffle( b, b, BCDx );
const vec_uint4 cmp = spu_cmpgt( a, bShift );
const vec_uint4 swap = spu_sel( a, bShift, cmp );
const vec_uint4 bSwap = spu_sel( bShift, a, cmp );

static const vec_uchar16 abcD = {
0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
0x18, 0x19, 0x1a, 0x1b, 0x0c, 0x0d, 0x0e, 0x0f
};
static const vec_uchar16 Aabc = {
0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13,
0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b
};
*A_ = spu_shuffle( a, swap, abcD );
*B_ = spu_shuffle( b, bSwap, Aabc );

return cmp;
}

In both functions I return the comparison mask because I think it's necessary to implement step 2 of AA-sort's in-core algorithm; the modified combsort:

static const float INV_SHRINK_FACTOR = 1.f/1.3f;
uint32_t gap = uint32_t(float(edgeCount) * INV_SHRINK_FACTOR);
while (gap > 1)
{
// Straight comparisons
const uint32_t maxSwapIndex = edgeCount - gap;
for (uint32_t i = 0; i < maxSwapIndex; ++i)
{
vector_cmpswap( (vec_uint4*)&work[i], (vec_uint4*)&work[i+gap] );
}

// Skewed comparisons for when i+gap exceeds N/4
for (uint32_t i = edgeCount - gap; i < edgeCount; ++i)
{
vector_cmpswap_skew( (vec_uint4*)&work[i], (vec_uint4*)&work[i+gap - edgeCount] );
}

gap = uint32_t((float)gap * INV_SHRINK_FACTOR);
}
vec_uint4 not_sorted;
do
{
not_sorted = spu_splats((uint32_t)0);
for (uint32_t i = 0; i < edgeCount - 1; ++i)
{
not_sorted = spu_or( not_sorted, vector_cmpswap( (vec_uint4*)&work[i], (vec_uint4*)&work[i+1] ) );
}
not_sorted = spu_or( not_sorted, vector_cmpswap_skew( (vec_uint4*)&work[edgeCount - 1], (vec_uint4*)&work[0] ) );
not_sorted = spu_orx( not_sorted );
} while ( spu_extract(not_sorted, 0) );

I use a shrink factor of 1.3 as suggested by the authors as well the combsort page at Wikipedia. Standard combsort stops looping once gap is less-or-equal to 1 and no swaps have been performed in the current iteration. AA-sort splits the gap and "swap" loops so I use the not_sorted value to keep track of when swaps stop occurring (this is the reason that I return the cmp value from each of the swap() functions).

You end up with nicely sorted (albeit transposed) data like:

[0] = 00000003 03910392 04a504a6 05140515
[1] = 0000002b 03910393 04a604a7 05140516
[2] = 0000002b 03910393 04a7049f 05150514
[3] = 0000002c 03910394 04a704a8 05150516
[4] = 0000002c 03920393 04a704a8 05150516
[5] = 0000002d 03930394 04a804a9 05160515
[6] = 0003002b 03b903ba 04a9049f 05170516
[7] = 00200021 03b903bb 04a904aa 05170518
[8] = 00200022 03b903bb 04a904aa 05180517
[9] = 00210022 03b903bc 04aa04ab 05190518
...

AA-sort normally calls for a final pass that transposes the elements across the width of each vector rather than through each vector element but it wasn't necessary in my particular situation.

I took some performance statistics with the decrementer using a few of my data sets and attached the results below:


Time (ms)# Values
0.0366671254
0.008571171
0.017419684
0.1316541902
0.019900786
0.009424360
0.109599954
0.3475312415
0.3717792454


Pretty decent considering the original O(N^2) implementation was taking 3 ms for 2k+ values.

2009/02/02

ATI Open-sourcing Radeon Details

I have a lot of respect and admiration for the lengths that ATI goes to to release information to the public.  Last year they made the Radeon 6xx series ISA available.  In the last couple of weeks they've released details on 3D registers in the 6xx/7xx series, and the (albeit primitive but not simple) r600_demo (via git from freedesktop.org).  The availability of this kind of information is a huge boon to hobbyists, researchers, developers, and pretty much everyone else that cares.

I've given all the above resources a look through and I hope to spend a bit more time pouring over the r600_demo (not for the faint of heart or API-dependent).  Thus far I've only had access to documents related to NVidia hardware from the previous generation (under NDA).  The differences are seemingly non-trivial but I'm not certain if it's a result of genealogical advances or manufacturer architectural differences.

2009/01/27

Insomniac Tech Page

The Insomniac Tech (R&D) Page is a great source for PS3/performance/game-programming related information.  Even if you happen to not like their games or subscribe to their approach to SPU programming (by choice or otherwise), they've got a lot of resources related to different aspects of engine and game development.  And since they are a game studio all the information is very "practical".  I applaud their openness and willing to share information online, at GDC, etc.

Their SPU Instruction Cheat Sheet has a permanent home on my desktop.

2009/01/26

Code Generation in Synthesis

While looking for resources and ideas regarding self-modifying code I came across this Ph.D. dissertation from the early 90's "Synthesis: An Efficient Implimentation of Fundamental Operating System Services".  It actually has some pretty cool stuff in it: self-modifying code for context switches and optimization, lock-free synchronization primitives, and fine-grained (compared to normal CPU quantum) scheduling for real-time processing.  The lock-free implementation seems particularly forward-thinking given the many-core world we live in today.

2008/09/10

CSAA

I was kind of planning on working through NeHe's tutorials but after blazing through the first 10 I realized I hadn't really forgotten that much and actually had a pretty good idea of what I was doing. Still, it was a nostalgic trip.

I decided that I really needed anti-aliasing. And not just some generic MSAA, I needed CSAA ("Coverage Sample Anti-Aliasing") an extension available on NVidia's GeForce 8 cards. NVidia has a page with details (also GL_NV_framebuffer_multisample_coverage), but basically they claim that it "rivals the quality of 8x or 16x MSAA" while having overhead comparable to 4x MSAA. The screenshots at the bottom of their write-up seem realistic basic on my test application so I won't bother (re-)posting them here, but it looks pretty nice.

They also mention that by "decoupling coverage from color/z/stencil" it's extremely efficient in terms of bandwidth and storage costs. They provide some information about the number of samples taken (at 8x CSAA; 4 color/z/stencil and 8 coverage samples), but there isn't a lot of details about storage requirements. I think I've figured out how to measure GPU computational overhead with reasonable accuracy (GL_EXT_timer_query) so I can partially check on their claims of how it compares to 4x MSAA, but I'd like some way to get a better idea about CPU<->GPU bandwidth and VRAM usage. NVidia's PerfHUD may be the answer to my prayers. I'll take a look as soon as I get the chance.

The storage requirements in particular worry me because the storage cost of 4x or even 2x multi-sampling at resolutions around HD is already quite high (despite not looking all that great), and you run the risk of having to move things to main memory and performance suffering.

Back to OpenGL...

Being moved to PS3 stuff at work has really revitalized my interest in OpenGL and programming in general. I figured it was high time to bring myself up to date regarding OpenGL after so many brief flings with DirectX since it was now more relevant (for me, at least). All of this prompted me to brush off this blog since I seem to be able to concentrate my efforts on something a lot better if I write about it.

I had never used OpenGL extensions before, but I found some good info about them at opengl.org. I lifted the isExtensionSupported() function to help me check for capabilities supported by the graphics driver/card and made a short macro to help me get function pointers with wglGetProcAddress(). It seems clunky but with a bit of wrapping and macro/template magic I should be able to get something decent going. I found out about GLEW from NVidia's OpenGL SDK, but I've decided to handle extensions myself for the moment since it will force me to look into them a bit more closely and I figure I won't be using that many anyway. We'll see how my patience holds up.

2006/10/05

Subtleties of Console Development

I've been trying to write my own sprite management code as a sort of exercise in C++ as well as the DS. I was having trouble getting things displayed correctly and thrashed back and forth changing one thing and then another, writing directly to memory rather than using DMA, etc. etc. etc. trying to figure out what was wrong. I'm a pretty careless programmer, so I assumed that it was either my mistake or the result of developing on an only partially documented system. Not having figured out a good way to do debugging obviously doesn't help either.

And, I finally figured it out. It was in code I wasn't even looking at; apparently I was trying to do something the system hadn't been initialized to do.

You can pass a number of flags to videoSetMode() when you are initializing the display, and apparently DISPLAY_SPR_1D_BMP is necessary in order to display 16-bit, non-palettized sprites. It seems obvious given that the ATTR0_BMP flag must be set in the sprite's OAM entry, but I didn't even know I should be looking for it.

Further, apparently you can't mix ATTR0_BMP and non-BMP sprites. At least, I can't get both to display at the same time. Perhaps I need another flag....