Rendered Obsolete: 2009/05

I've always found programming under extreme constraints and limitations to be fascinating; embedded systems development, old-school console programming, assembly language, whatever. For this reason I also found demo-scene projects to be rather interesting. Not so much for the artistic portions but more for the technological challenges. In some way they've inspired me to want to try and motivate people with the same or similar goals, this time using the SPUs. The SPUs relative simplicity and resurrection of low-level programming techniques revitalizes the purer, skill-based challenge that seemed to exist before the advent of massive libraries and hulking tools.

But most of all, it's just for the fun of it.

I've been trying to come up with a loose series of "rules" to base this all on with the primary theme being: you have the SPE and 256KB for code/data and nothing else. Without looking at other competitions, I would imagine something along the lines of:

Each contest is specified as N-way; where N is the number of SPUs you are allowed to use

LS(s) can be initialized with whatever you like, but once an SPU is started NO communication (except for debugging/status feedback purposes) between the PPU and the SPUs (including mailboxes, etc.) or DMA to/from the SPEs and main/non-LS memory is allowed except SPU-to-SPU transfers via memory-mapped LS and other allowed exceptions

A framebuffer for visual output is also provided (my inclination is to make this write-only to keep it from being used as working memory)

Demos, in general, must run at 60 fps

There are probably a lot more things to consider, loop holes to fill, etc. But it captures the essence of my interest in both time and space limitations.

I recently went back through two classic resources programming considerations for processor caches: "The Elements of Cache Programming Style" by Chris Spears, and Christer Ericson's "Memory Optimization" talk from GDC 2003. This was partially for my own amusement but also to see if there were any fun tid-bits of information that I had forgotten about. Both can be found rather easily via Google.

Both have dated examples using processors from yesteryear, but the core concepts are still relevant today as little has changed in the overall design of the memory hierarchy. In fact, they are perhaps even more relevant today since the widening gap between processor performance and memory latency they they spoke of as looming on the horizon is now upon us. While this has been partially addressed by adding an L3 cache, the many-threaded architecture of modern GPUs, to a lesser extent by SMT, and essentially side-stepped by architectures like the Cell's SPEs, the dreaded cache-miss bogeyman is still lurking under every bed and in every closet.

Both discussed watching structure member alignment and packing to reign in bloat and its impact on cache performance, something that I think people are largely already doing to reduce memory usage. Reorganizing them based on member access ordering and splitting into "hot" and "cold" areas, perhaps less so. The one thing in particular I had forgotten about was Ericson's suggestion of accessing all member variables through accessors and then instrumenting them looking for frequency and correlation statistics to determine "hotness" and order relationships, respectively. This could perhaps be done mostly automatically with gcc's -finstrument-functions or the like.

Another interesting (obvious?) suggestion was using a custom slab allocator for dynamic data structures so the various link elements all end up in the same page. Traversing linked-lists and other data structures with indirection have obvious cache miss penalties, but the suggestion is specifically meant to potentially reduce the number of cache misses a little and definitely reduce the number of TLB misses. Further dynamic memory should be freed as soon as possible and re-allocated first-fit in order to try and reuse cachelines (there are other, more global considerations such as fragmentation here but for the sake of argument we're only talking about cache implications). I'm interested to try and instrument our current project to see if this is much of an issue for us.

Finally, using gcc's __attribute__((section("name"))), linking order, and other approaches to try and reduce instruction cache misses by arranging functions in a cache friendly manner. Although I've read this is largely automated by Link Time Code Generation (Whole Program Optimization), it's probably relevant for gcc or whenever you need fine-grained control.

As an aside, Agner Fog (who has some fantastic, interesting resources at http://www.agner.org/optimize/) advocates "reusing the stack". Namely, the stack is "free"d and then it and it's allocated cache lines are reused by consecutive functions. I need a bit more time to think about how this can be used more concretely.

I plan to wrap up this journey glancing over a PowerPC 750 related post from IBM. Again, rather old, but I'm hoping it gives me some ideas for my notes own notes on the 970.

Rendered Obsolete

2009/05/10

SPU Demo Scene?

2009/05/09

Flashbacks to Cache Programming