@satoshi: Oops, I meant -march=amdfam10. Sorry.
@everyone confused about improvement on Phenoms: I developed the code on a Phenom (940) and verified it (at least in 64bit mode) and the improvement you see is real.
Concerning Hyperthreading: It seems to give a little performance gain, maybe from running load/store instructions in parallel with aritmethic instructions. There’s only a tiny bit of plain x86 instructions for glueing the function into the ABI. They take less than ~2% of the total CPU time (measured with gprof).