The originally posted patch works just fine for me (Opteron 2376) and doubled my performance over the stock 0.3.6 client. I was even able to port its minor changes to 0.3.7, with the same results.
Is there a way we can confirm that the variables are being aligned properly? I’m wondering if the Intel procs are less tolerant of misalignment than the AMDs.
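One way to check this at runtime is to print the addresses of the buffers in question and test their low bits. The sketch below assumes the patch needs 16-byte alignment (the requirement for aligned SSE2 loads and stores), which I have not confirmed from the patch itself. The helper names `is_aligned` and `report_alignment` are made up for illustration and are not part of the client or the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// True if p lies on a `boundary`-byte boundary.
// `boundary` must be a non-zero power of two (e.g. 16 for SSE2).
inline bool is_aligned(const void* p, std::size_t boundary) {
    assert(boundary != 0 && (boundary & (boundary - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (boundary - 1)) == 0;
}

// Logs the address of a buffer and whether it sits on a 16-byte boundary.
// The 16 is an assumption about what the patch needs; `name` is only a label.
inline bool report_alignment(const char* name, const void* p) {
    const bool ok = is_aligned(p, 16);
    std::printf("%s at %p: %s\n", name, p,
                ok ? "16-byte aligned" : "NOT 16-byte aligned");
    return ok;
}
```

Calling something like `report_alignment("state", state)` right before the hashing loop, on both an Intel and an AMD box, should show whether the compiler is actually honoring the alignment request on each build.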