2016/03/05

The Rising Tide of Alt Languages

We're in some kind of renaissance of programming languages. When I was first starting out I'd used a few languages (Basic, Pascal, C, a porpourri of assembly dialects), but the lack of Internet prevented me from knowing about all the miscellaneous toy "research" languages and I was only vaguely aware of Fortran/Cobol and so on. In Uni I was exposed to Java, VHDL, Scheme, Python, and a few more assembly dialects.

But upon graduating, for years and years I regarded myself as a snooty, elitist c++ programmer. Besides occasional forays into Python or (eventually) c# for doing "scripting" or maybe making tools, I accepted that all real work was done with c++. Other languages were relegated to fooling around, teaching children, or "web programming"- and a select few (Perl) seemed to be specifically designed to torment people. After years of in-house engines, the horrors of UE3/CryEngine, etc. it took a few prototypes with Unity3D to make me see the errors of my ways. My university professors were correct; "software engineering" was about a lot more than banging out code and sweet macro shenanigans.

More recently, JavaScript- which in my mind is still closely associated 90's-era "mouseover" events- came back on the radar on account of the Node.js juggernaut. Only this time it wasn't just for throw away web clients, it was for servers too. And, while I've still yet to look into it, I am aware that it is a thing. And a big one at that.

As part of Microsoft's increasingly virile affair with open source and multi-platform love-ins they released Visual Studio Code. After a bit of rummaging around, I was shocked to find out it too was made in JavaScript with a framework called Electron. Poking around their website uncovered a number of interesting platforms for developing apps/games like pixate, ionic, fireball, and superpowers. All of which I think of as the "next-generation" app/game development tools- similar to MaxPlay (which specifically mentions Electron in a Software Engineer job posting).

I've been infatuated with F# for a while now. It had this sudden-outbreak-of-common-sense freshness about it: options, statically typed with powerful type inference, interoperability with c#, asynchronous computation expressions (monads), and other cornerstones of the functional world. But maybe I'd just grown tired of OO programming; it was getting a bit old and fat and FP was new hotness. Or, the old is new again, rather. Used it for a pair of backend prototypes but nothing in production. Somehow I can't justify subjecting other people to the OCaml/ML universe just because it amuses me.

For a while we were evaluating golang for constructing our new backend. Partially swept up by it's apparently burgeoning popularity in China, we were specifically motivated by the experiences of Qihoo et al. I was also intrigued by the kind of performance numbers they were claiming (TODO: add disclaimer regarding benchmarks and contrived tests and real-world performance blah blah blah). I was pleased with their sensible approach to errors and how easy it was to get a working program, but more than a little disappointed with their insistence on flip-flopping between (IMHO) overly opinionated design decisions and embracing horrific mistakes like void* and the null.

I've been spending some of my free time toying around with Rust. I've probably started reading The Book or futzed with Rust By Example 2 or 3 times now. But, I get busy or distracted and put it aside for a month or two such that I never really get it and am still sufficiently noob that I spend most of my time fighting with the compiler. Maybe it's too strict... or maybe I need to stop trying to write c++ with it.

While we're on the topic of c++ "killers", let's not write out Java just yet. Despite my expectations that Oracle would succeed in smothering it, it's instead staged a comeback and clawed itself to the top of the Tiobe index. It's also being used (along with Akka) for the backend being developed in parallel by one of our other teams.

As a final note, it's difficult to bring up f#, golang, and rust without giving a shoutout to the numerous other languages that are often mentioned in the same breath: julia, kotlin, scala, nimrod, and numerous others. There really isn't enough time in a day to investigate them all to a satisfactory level.

Exciting stuff.

2016/03/02

C# Wrapper for AllJoyn via SWIG

I was original interested in AllJoyn because it looked like a good way to build local multiplayer mobile games. At one point a few years back they even provided a Unity3D library. I got a simple "hello world" app working on PC/iOS/Android, but then got busy with other things and the project was eventually cancelled.

I've been thinking about it again and looking at ways of getting it into .Net applications. Unfortunately they no longer provide the Unity library which would have given me a good start.

Ran into some issues with their Linux build instructions. apt-get didn't want to install ia32-libs since I already had ncurses installed. Next, the src zip directory structure didn't quite match what's in the documentation (I guess they want you to move directories around). But you can just run scons from the root of the decompressed source code. I got a compile error about not being able to find sys/capability.h. apt-get had given up ia32-libs and didn't install libcap-dev. Once I got that everything seemed to work.

Basically, I ended up doing:

sudo apt-get install build-essential libgtk2.0-dev libssl-dev xsltproc libxml2-dev libcap-dev python scons libssl-dev
scons BINDINGS=cpp WS=off BT=off ICE=off SERVICES="about"

You can verify everything works by running the sample app. On a 64-bit system scons seems to build for x86_64 by default.

In one shell:

  1. export LD_LIBRARY_PATH=`pwd`/build/linux/x86_64/debug/dist/cpp/lib:$LD_LIBRARY_PATH
  2. build/linux/x86_64/debug/dist/cpp/bin/samples/basic_service
And in another:
  1. export LD_LIBRARY_PATH=`pwd`/build/linux/x86_64/debug/dist/cpp/lib:$LD_LIBRARY_PATH
  2. build/linux/x86_64/debug/dist/cpp/bin/samples/basic_client
Now I've got the native AllJoyn library, but I still need to get it working with .Net. SWIG is my first thought to do this. PInvoke Interop Assistant is potentially another option, but it looks like it hasn't been developed for some time. Following the SWIG tutorial I created the interface file alljoyn.swig specifying header files included by samples/basic/basic_client.cc:
%module alljoyn
%{
/* Includes the header in the wrapper code */
#include "alljoyn_core/inc/alljoyn/AllJoynStd.h"
#include "alljoyn_core/inc/alljoyn/BusAttachment.h"
#include "build/linux/x86_64/debug/dist/cpp/inc/alljoyn/Status.h"
#include "alljoyn_core/inc/alljoyn/Init.h"
#include "alljoyn_core/inc/alljoyn/version.h"

// Needed for alljoyn_wrap.cxx to compile
using namespace qcc;
using namespace ajn;
%}

/* Parse the header file to generate wrappers */
%include "alljoyn_core/inc/alljoyn/AllJoynStd.h"
%include "alljoyn_core/inc/alljoyn/BusAttachment.h"
%include "build/linux/x86_64/debug/dist/cpp/inc/alljoyn/Status.h"
%include "alljoyn_core/inc/alljoyn/Init.h"
%include "alljoyn_core/inc/alljoyn/version.h"
Then build everything:
mkdir alljoyn_cs
cd alljoyn_cs
# Create alljoyn.swig here

# Generate SWIG wrapper code
swig -c++ -csharp -cpperraswarn alljoyn.swig

# Build shared lib
g++ -fPIC -c -DQCC_OS_GROUP_POSIX -I../build/linux/x86_64/debug/dist/cpp/inc -I../alljoyn_core/inc -I../common/inc alljoyn_wrap.cxx
g++ -shared -L../build/linux/x86_64/debug/dist/cpp/lib -o liballjoyn.so alljoyn_wrap.o -lajrouter -Wl,-Bstatic -lalljoyn -Wl,-Bdynamic -lssl -lcrypto

# Update so mono can find liballjoyn.so
export LD_LIBRARY_PATH=`pwd`:$LD_LIBRARY_PATH
I included all the swig generated csharp source files into a MonoDevelop project, and you can now call simple AllJoyn functions from C#:
if (alljoyn.AllJoynInit () != QStatus.ER_OK)
 return;

Console.WriteLine ("AllJoyn Library version: {0}", alljoyn.GetVersion ());
Console.WriteLine ("AllJoyn Library build info: {0}", alljoyn.GetBuildInfo ());

var status = QStatus.ER_OK;

var msgBus = new BusAttachment ("myApp", true);
if (msgBus == null)
 status = QStatus.ER_OUT_OF_MEMORY;
That's a good start, but I've run into a problem. As expected, swig doesn't always generate "natural" c# code. BusAttachment.CreateInterface() has the following signature:
public QStatus CreateInterface(string name, SWIGTYPE_p_p_InterfaceDescription iface)
Where SWIGTYPE_p_p_InterfaceDescription is some pointer wrapper for the managed/unmanaged magic that swig generated for us rather than something like "out InterfaceDescription". It looks like swig provides typemaps to solve these sorts of problems, but in any case I'm not quite done yet. Misc other docs that were helpful:

2016/02/03

Flashbacks...

I completely forgot this even existed.

I just rummaged through drafts I'd started over half a decade ago. Most only contained a few notes on things I was working on back when "blogging" was still a thing.

Fascinating.

2009/05/10

SPU Demo Scene?

I've always found programming under extreme constraints and limitations to be fascinating; embedded systems development, old-school console programming, assembly language, whatever. For this reason I also found demo-scene projects to be rather interesting. Not so much for the artistic portions but more for the technological challenges. In some way they've inspired me to want to try and motivate people with the same or similar goals, this time using the SPUs. The SPUs relative simplicity and resurrection of low-level programming techniques revitalizes the purer, skill-based challenge that seemed to exist before the advent of massive libraries and hulking tools.

But most of all, it's just for the fun of it.

I've been trying to come up with a loose series of "rules" to base this all on with the primary theme being: you have the SPE and 256KB for code/data and nothing else. Without looking at other competitions, I would imagine something along the lines of:

  • Each contest is specified as N-way; where N is the number of SPUs you are allowed to use

  • LS(s) can be initialized with whatever you like, but once an SPU is started NO communication (except for debugging/status feedback purposes) between the PPU and the SPUs (including mailboxes, etc.) or DMA to/from the SPEs and main/non-LS memory is allowed except SPU-to-SPU transfers via memory-mapped LS and other allowed exceptions

  • A framebuffer for visual output is also provided (my inclination is to make this write-only to keep it from being used as working memory)

  • Demos, in general, must run at 60 fps



There are probably a lot more things to consider, loop holes to fill, etc. But it captures the essence of my interest in both time and space limitations.

2009/05/09

Flashbacks to Cache Programming

I recently went back through two classic resources programming considerations for processor caches: "The Elements of Cache Programming Style" by Chris Spears, and Christer Ericson's "Memory Optimization" talk from GDC 2003. This was partially for my own amusement but also to see if there were any fun tid-bits of information that I had forgotten about. Both can be found rather easily via Google.

Both have dated examples using processors from yesteryear, but the core concepts are still relevant today as little has changed in the overall design of the memory hierarchy. In fact, they are perhaps even more relevant today since the widening gap between processor performance and memory latency they they spoke of as looming on the horizon is now upon us. While this has been partially addressed by adding an L3 cache, the many-threaded architecture of modern GPUs, to a lesser extent by SMT, and essentially side-stepped by architectures like the Cell's SPEs, the dreaded cache-miss bogeyman is still lurking under every bed and in every closet.

Both discussed watching structure member alignment and packing to reign in bloat and its impact on cache performance, something that I think people are largely already doing to reduce memory usage. Reorganizing them based on member access ordering and splitting into "hot" and "cold" areas, perhaps less so. The one thing in particular I had forgotten about was Ericson's suggestion of accessing all member variables through accessors and then instrumenting them looking for frequency and correlation statistics to determine "hotness" and order relationships, respectively. This could perhaps be done mostly automatically with gcc's -finstrument-functions or the like.

Another interesting (obvious?) suggestion was using a custom slab allocator for dynamic data structures so the various link elements all end up in the same page. Traversing linked-lists and other data structures with indirection have obvious cache miss penalties, but the suggestion is specifically meant to potentially reduce the number of cache misses a little and definitely reduce the number of TLB misses. Further dynamic memory should be freed as soon as possible and re-allocated first-fit in order to try and reuse cachelines (there are other, more global considerations such as fragmentation here but for the sake of argument we're only talking about cache implications). I'm interested to try and instrument our current project to see if this is much of an issue for us.

Finally, using gcc's __attribute__((section("name"))), linking order, and other approaches to try and reduce instruction cache misses by arranging functions in a cache friendly manner. Although I've read this is largely automated by Link Time Code Generation (Whole Program Optimization), it's probably relevant for gcc or whenever you need fine-grained control.

As an aside, Agner Fog (who has some fantastic, interesting resources at http://www.agner.org/optimize/) advocates "reusing the stack". Namely, the stack is "free"d and then it and it's allocated cache lines are reused by consecutive functions. I need a bit more time to think about how this can be used more concretely.

I plan to wrap up this journey glancing over a PowerPC 750 related post from IBM. Again, rather old, but I'm hoping it gives me some ideas for my notes own notes on the 970.

2009/02/13

AA-sort

A co-worker sent me a link to AA-sort some time ago, but I didn't have any reason to use it until just recently. I found myself in a situation where I have 2K~4K 4-byte elements packet into qwords and my initial O(N^2) solution just wasn't working out. So, I decided to try the O(N log N) "in-core" component of AA-sort as a pre-sort.

The paper linked above is relatively straight-forward, but they only provide pseudo-code so there's still a bit of effort required to get something usable. As per the author's goals, it's easily written as highly-vectorized, non-branching code and was rather fun to get working.

The authors were working with 8 16-bit values packed into a qword, and I had to make some simple changes to get it working with vec_uint4s.

First, my bitonic_sort() routine takes care of step one of their in-core algorithm by sorting the 4 uint32_t elements in a qword:

inline vec_uint4 bitonic_sort( vec_uint4 Vec_ )
{
static const vec_uchar16 BADC = {
0x04, 0x05, 0x06, 0x07, 0x00, 0x01, 0x02, 0x03,
0x0c, 0x0d, 0x0e, 0x0f, 0x08, 0x09, 0x0a, 0x0b
};
static const vec_uchar16 CDAB = {
0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07
};
static const vec_uint4 mask1 = {
0x00000000, 0xffffffff, 0xffffffff, 0x00000000
};
static const vec_uint4 mask2 = {
0x00000000, 0x00000000, 0xffffffff, 0xffffffff
};
static const vec_uint4 mask3 = {
0x00000000, 0xffffffff, 0x00000000, 0xffffffff
};
vec_uint4 temp, cmp, result;

temp = spu_shuffle( Vec_, Vec_, BADC ); // Compare A to B and C to D
cmp = spu_cmpgt( Vec_, temp );
cmp = spu_xor( cmp, mask1 ); // Order A/B increasing, C/D decreasing
result = spu_sel( Vec_, temp, cmp );

temp = spu_shuffle( result, result, CDAB ); // Compare AB to CD
cmp2 = spu_cmpgt( result, temp );
cmp2 = spu_xor( cmp2, mask2 ); // Order AB/CD increasing
result = spu_sel( result, temp, cmp2 );
temp = spu_shuffle( result, result, BADC ); // Compare previous A to B and C to D
cmp = spu_cmpgt( result, temp );
cmp = spu_xor( cmp, mask3 ); // Order A/B and C/D increasing
result = spu_sel( result, temp, cmp );

return result;
}

Next come cmpswap() and cmpswap_skew() from the paper:

inline vec_uint4 vector_cmpswap( vec_uint4* A_, vec_uint4* B_ )
{
const vec_uint4 a = *A_;
const vec_uint4 b = *B_;
const vec_uint4 cmp = spu_cmpgt( a, b );
*A_ = spu_sel( a, b, cmp );
*B_ = spu_sel( b, a, cmp );

return cmp;
}

inline vec_uint4 vector_cmpswap_skew( vec_uint4* A_, vec_uint4* B_ )
{
const vec_uint4 a = *A_;
const vec_uint4 b = *B_;

// The last word compared is 0xffff so that part of the mask should always be 0000
static const vec_uchar16 BCDx = {
0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0a, 0x0b,
0x0c, 0x0d, 0x0e, 0x0f, 0xc0, 0xc0, 0xc0, 0xc0
};
const vec_uint4 bShift = spu_shuffle( b, b, BCDx );
const vec_uint4 cmp = spu_cmpgt( a, bShift );
const vec_uint4 swap = spu_sel( a, bShift, cmp );
const vec_uint4 bSwap = spu_sel( bShift, a, cmp );

static const vec_uchar16 abcD = {
0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
0x18, 0x19, 0x1a, 0x1b, 0x0c, 0x0d, 0x0e, 0x0f
};
static const vec_uchar16 Aabc = {
0x00, 0x01, 0x02, 0x03, 0x10, 0x11, 0x12, 0x13,
0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b
};
*A_ = spu_shuffle( a, swap, abcD );
*B_ = spu_shuffle( b, bSwap, Aabc );

return cmp;
}

In both functions I return the comparison mask because I think it's necessary to implement step 2 of AA-sort's in-core algorithm; the modified combsort:

static const float INV_SHRINK_FACTOR = 1.f/1.3f;
uint32_t gap = uint32_t(float(edgeCount) * INV_SHRINK_FACTOR);
while (gap > 1)
{
// Straight comparisons
const uint32_t maxSwapIndex = edgeCount - gap;
for (uint32_t i = 0; i < maxSwapIndex; ++i)
{
vector_cmpswap( (vec_uint4*)&work[i], (vec_uint4*)&work[i+gap] );
}

// Skewed comparisons for when i+gap exceeds N/4
for (uint32_t i = edgeCount - gap; i < edgeCount; ++i)
{
vector_cmpswap_skew( (vec_uint4*)&work[i], (vec_uint4*)&work[i+gap - edgeCount] );
}

gap = uint32_t((float)gap * INV_SHRINK_FACTOR);
}
vec_uint4 not_sorted;
do
{
not_sorted = spu_splats((uint32_t)0);
for (uint32_t i = 0; i < edgeCount - 1; ++i)
{
not_sorted = spu_or( not_sorted, vector_cmpswap( (vec_uint4*)&work[i], (vec_uint4*)&work[i+1] ) );
}
not_sorted = spu_or( not_sorted, vector_cmpswap_skew( (vec_uint4*)&work[edgeCount - 1], (vec_uint4*)&work[0] ) );
not_sorted = spu_orx( not_sorted );
} while ( spu_extract(not_sorted, 0) );

I use a shrink factor of 1.3 as suggested by the authors as well the combsort page at Wikipedia. Standard combsort stops looping once gap is less-or-equal to 1 and no swaps have been performed in the current iteration. AA-sort splits the gap and "swap" loops so I use the not_sorted value to keep track of when swaps stop occurring (this is the reason that I return the cmp value from each of the swap() functions).

You end up with nicely sorted (albeit transposed) data like:

[0] = 00000003 03910392 04a504a6 05140515
[1] = 0000002b 03910393 04a604a7 05140516
[2] = 0000002b 03910393 04a7049f 05150514
[3] = 0000002c 03910394 04a704a8 05150516
[4] = 0000002c 03920393 04a704a8 05150516
[5] = 0000002d 03930394 04a804a9 05160515
[6] = 0003002b 03b903ba 04a9049f 05170516
[7] = 00200021 03b903bb 04a904aa 05170518
[8] = 00200022 03b903bb 04a904aa 05180517
[9] = 00210022 03b903bc 04aa04ab 05190518
...

AA-sort normally calls for a final pass that transposes the elements across the width of each vector rather than through each vector element but it wasn't necessary in my particular situation.

I took some performance statistics with the decrementer using a few of my data sets and attached the results below:


Time (ms)# Values
0.0366671254
0.008571171
0.017419684
0.1316541902
0.019900786
0.009424360
0.109599954
0.3475312415
0.3717792454


Pretty decent considering the original O(N^2) implementation was taking 3 ms for 2k+ values.

2009/02/02

ATI Open-sourcing Radeon Details

I have a lot of respect and admiration for the lengths that ATI goes to to release information to the public.  Last year they made the Radeon 6xx series ISA available.  In the last couple of weeks they've released details on 3D registers in the 6xx/7xx series, and the (albeit primitive but not simple) r600_demo (via git from freedesktop.org).  The availability of this kind of information is a huge boon to hobbyists, researchers, developers, and pretty much everyone else that cares.

I've given all the above resources a look through and I hope to spend a bit more time pouring over the r600_demo (not for the faint of heart or API-dependent).  Thus far I've only had access to documents related to NVidia hardware from the previous generation (under NDA).  The differences are seemingly non-trivial but I'm not certain if it's a result of genealogical advances or manufacturer architectural differences.