tcatm's 4-way SSE2 for Linux 32/64-bit is in 0.3.10

35 messages BitcoinTalk Satoshi Nakamoto, sgtstein, aceat64, tcatm, NewLibertyStandard, Jeff Garzik, lfm, gridecon, Vasiliev, Ground Loop, teknohog, nelisky, vess, ArtForz August 15, 2010 — August 28, 2010

Satoshi Nakamoto August 15, 2010 Source · Permalink

0.3.10 has tcatm’s 4-way SSE2 as an option switch.

Use the switch “-4way” to turn it on. Without the switch you get Crypto++ ASM SHA-256.

I could only get this working with Linux.

Download: Get 0.3.10 from topic 827

Please report back your CPU and results! I think it’s pretty clear that Core 2 and lower are slower, i5 faster. I don’t think we’ve heard any i7 results yet. We need to know about the different models of AMD or other less common CPUs.

Satoshi Nakamoto August 15, 2010 Source · Permalink

I hope someone can test an i5 or AMD to check that I built it right. I don’t have either to test with.

I’m also curious if it performs much worse on 32-bit linux vs 64-bit.

sgtstein August 15, 2010 Source · Permalink

Where is the code for this? I’m on a CentOS 5.5 box and need to build it myself. Once I do that I will report back with linux 32-bit and 1MB cache Xeon.

Satoshi Nakamoto August 15, 2010 Source · Permalink

I just uploaded a quick build so testers can check if I built it right. (I don’t have an i5 or AMD) If it checks out, I’ll put together the full package and do all the release stuff.

aceat64 August 16, 2010 Source · Permalink

I created a wiki page so we can keep track of the results: http://www.bitcoin.org/wiki/doku.php?id=4-way_sse2

tcatm August 16, 2010 Source · Permalink

I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it’ll improve performance by ~9%.

NewLibertyStandard August 16, 2010 Source · Permalink

Quote from: aceat64 on August 16, 2010, 12:37:54 AM

I created a wiki page so we can keep track of the results: http://www.bitcoin.org/wiki/doku.php?id=4-way_sse2

You might want to add columns for whether hyper-threading is enabled, number of physical cores and how many cores Bitcoin is using. Without 4way, I get very slightly better results when I have half of my virtual cores hashing. With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way.

aceat64 August 16, 2010 Source · Permalink

Quote from: NewLibertyStandard on August 16, 2010, 01:49:01 AM

Quote from: aceat64 on August 16, 2010, 12:37:54 AM

I created a wiki page so we can keep track of the results: http://www.bitcoin.org/wiki/doku.php?id=4-way_sse2

You might want to add columns for whether hyper-threading is enabled, number of physical cores and how many cores Bitcoin is using. Without 4way, I get very slightly better results when I have half of my virtual cores hashing. With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way.

I’ve updated the page with your suggestions, I’ve also added footnotes to explain some of the fields.

Jeff Garzik (jgarzik) August 16, 2010 Source · Permalink

My -4way results: slower for two older boxes, faster for newer one.

(“model name” comes from Linux’s /proc/cpuinfo, which reports directly from CPU)

model name : Intel(R) Pentium(R) D CPU 3.00GHz
total cores: 2
without -4way: 0.999 Mhash/sec
with -4way: 0.850 Mhash/sec

model name : Dual Core AMD Opteron(tm) Processor 280
total cores: 4
without -4way: 4.6 Mhash/sec
with -4way: 4.0 Mhash/sec

model name : Genuine Intel(R) CPU 000 @ 3.20GHz
total cores: 4
without -4way: 5.7 Mhash/sec
with -4way: 7.0 Mhash/sec

Satoshi Nakamoto August 16, 2010 Source · Permalink

Quote from: tcatm on August 16, 2010, 12:43:39 AM

I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it’ll improve performance by ~9%.

GCC 4.3.3 doesn’t support -march=amdfamk10. I get: sha256.cpp:1: error: bad value (amdfamk10) for -march= switch

Quote from: NewLibertyStandard on August 16, 2010, 01:49:01 AM

With 4way, I get significantly better performance when I have all my virtual cores enabled. I think I get about the same amount of hashes when hyper threading is turned off with or without 4way.

Hey, you may be onto something!

hyperthreading didn’t help before because all the work was in the arithmetic and logic units, which the hyperthreads share.

tcatm’s SSE2 code must be a mix of normal x86 instructions and SSE2 instructions, so while one is doing x86 code, the other can do SSE2.

How much of an improvement do you get with hyperthreading?

Some numbers? What CPU is that?

lfm August 16, 2010 Source · Permalink

model name : AMD Phenom(tm) II X4 940 Processor at 3.0 ghz linux 64

with -4way “hashespersec” : 11132770

without “hashespersec” : 5877668

gridecon August 16, 2010 Source · Permalink

I have two quadcore Phenom II 64-bit linux machines (ubuntu 9.10 both) and the -4way option increases my hashing speed so much I’m suspicious. I got about 5-6khash/sec on these boxes previously, and still do without the -4way option. With -4way I get over 11khash/sec! In other words, the -4way switch almost DOUBLES the reported hashing speed. This level of improvement seems more than expected and makes me wonder if my boxes are really doing the hashing that much faster, or if there could possibly be an issue where the math operations are actually being skipped over for some reason, causing illusory speed and an inability to actually generate blocks?

Vasiliev August 16, 2010 Source · Permalink

Quote from: satoshi on August 16, 2010, 02:57:57 AM

Quote from: tcatm on August 16, 2010, 12:43:39 AM

I propose to compile sha256.cpp with -O3 -march=amdfamk10 (will work on 32bit and 64bit) as only CPUs supporting this instruction set (AMD Phenom, Intel i5 and newer) benefit from -4way and it’ll improve performance by ~9%.

GCC 4.3.3 doesn’t support -march=amdfamk10. I get: sha256.cpp:1: error: bad value (amdfamk10) for -march= switch

try -march=amdfam10

Satoshi Nakamoto August 16, 2010 Source · Permalink

Quote from: Vasiliev on August 16, 2010, 03:17:07 AM

try -march=amdfam10

That works.

That’s strange… are we sure that’s the same thing? tcatm, try amdfam10 and make sure you get the same speed measurement.

lfm August 16, 2010 Source · Permalink

model name : Intel(R) Core(TM)2 Quad CPU Q9450 @ 2.66GHz, linux 64

no difference at about 4950 khash/s

Jeff Garzik (jgarzik) August 16, 2010 Source · Permalink

Update for

Code:
cpu family : 6
model : 26
model name : Genuine Intel(R) CPU 000 @ 3.20GHz
stepping : 4

Machine has 4 cores, each with 2 hyperthreads. /proc/cpuinfo shows 8 virtual processors.

without -4way, setgen 4: 5.7 Mhash/sec
without -4way, setgen 8: 5.0 Mhash/sec

with -4way, setgen 4: 7.0 Mhash/sec
with -4way, setgen 8: 9.3 Mhash/sec

So, the old wisdom of “hyperthreading slows things down” is now shattered, on this machine.

Ground Loop August 16, 2010 Source · Permalink

No winners for 4way in my other three Intel machines either:

Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz (64-bit Linux) 4way: 1565 std: 3002

Intel(R) Xeon(TM) CPU 3.00GHz (32-bit Linux) 4way: 1243 std: 2048

Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz 4way: 932 std: 1733

(All running 0.3.10, -1 proclimit) Experiments with proclimit weren’t any better.

Satoshi Nakamoto August 16, 2010 Source · Permalink

Quote from: jgarzik on August 16, 2010, 03:35:28 AM

Code:
cpu family : 6
model : 26
model name : Genuine Intel(R) CPU 000 @ 3.20GHz
stepping : 4

cpu family 6 model 26 stepping 4 is an Intel Core i7.

That’s a 23% speedup with -4way, 63% total speedup with -4way + hyperthreading. 33% faster with hyperthreading than without it.
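The three percentages can be checked against jgarzik's figures for this machine (5.7 Mhash/sec without -4way at setgen 4, 7.0 with -4way at setgen 4, 9.3 with -4way at setgen 8). A minimal sketch of the arithmetic, where `percent_gain` is a hypothetical helper name and not something from the Bitcoin source:

```cpp
#include <cassert>

// Hypothetical helper (not from the Bitcoin source): percentage gain of
// `rate` over `baseline`, both given in the same unit (here Mhash/sec).
double percent_gain(double baseline, double rate) {
    assert(baseline > 0.0);  // a zero or negative baseline has no meaningful gain
    return (rate / baseline - 1.0) * 100.0;
}
```

With the numbers above, `percent_gain(5.7, 7.0)`, `percent_gain(5.7, 9.3)` and `percent_gain(7.0, 9.3)` come out at roughly 22.8, 63.2 and 32.9, which round to the 23%, 63% and 33% quoted.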

tcatm August 16, 2010 Source · Permalink

@satoshi: Oops, I meant -march=amdfam10. Sorry.

@everyone confused about improvement on Phenoms: I developed the code on a Phenom (940) and verified it (at least in 64bit mode) and the improvement you see is real.

Concerning Hyperthreading: It seems to give a little performance gain, maybe from running load/store instructions in parallel with arithmetic instructions. There are only a few plain x86 instructions for gluing the function into the ABI. They take less than ~2% of the total CPU time (measured with gprof).

teknohog August 16, 2010 Source · Permalink

On a Core 2 Duo T7200, the default code gives about 1.8 Mhash/s, and 4way is slower at 1.0 Mhash/s. It has 4 MB of L2 cache, so it is probably not a question of cache size, as suggested at some point.

Unfortunately, the code (from svn) no longer compiles on ARM, as it now has SSE intrinsics hardcoded. I have removed the -msse2 and -DFOURWAYSSE2 flags from the makefile, and it still produces errors like this

Code:
sha256.cpp:8:23: error: xmmintrin.h: No such file or directory
sha256.cpp:34: error: '__m128i' does not name a type

but hopefully this is easy to fix.

Satoshi Nakamoto August 16, 2010 Source · Permalink

I wrapped sha256.cpp in
#ifdef FOURWAYSSE2
#endif // FOURWAYSSE2

try it now.
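A minimal sketch of this kind of guard, assuming the SSE2 path should only be compiled when `FOURWAYSSE2` is defined. The macro name comes from the thread; `add4` and its scalar fallback are hypothetical illustrations and not the actual contents of sha256.cpp:

```cpp
#include <cassert>
#include <cstdint>

#ifdef FOURWAYSSE2
#include <emmintrin.h>  // SSE2 intrinsics, only pulled in when the build asks for them
#endif

// Hypothetical example: add four pairs of 32-bit words.
// With FOURWAYSSE2 defined this is a single 128-bit SSE2 add over four lanes.
// Without it, the same result is computed lane by lane, so targets without
// SSE2 headers (such as ARM) never see the intrinsics at all.
void add4(const std::uint32_t a[4], const std::uint32_t b[4], std::uint32_t out[4]) {
    assert(a && b && out);
#ifdef FOURWAYSSE2
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
#else
    for (int i = 0; i < 4; i++) out[i] = a[i] + b[i];
#endif
}
```

Both paths wrap on overflow, so the output is the same whether or not the macro is defined; only the instructions used differ.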

Ground Loop August 18, 2010 Source · Permalink

So is it accurate to say that, so far, only Intel Core i7 processors and certain (Phenom?) AMD processors enjoy a speed bump from -4way?

nelisky August 18, 2010 Source · Permalink

Quote from: Ground Loop on August 18, 2010, 11:00:08 PM

So is it accurate to say that, so far, only Intel Core i7 processors and certain (Phenom?) AMD processors enjoy a speed bump from -4way?

And i5, at least on my macbookpro

Ground Loop August 18, 2010 Source · Permalink

Any non-Mac i5 love? Windows i5 64-bit got slower here. [correction — not true. Windows doesn’t have -4way, and the Linux machines are Xeons.]

vess August 19, 2010 Source · Permalink

My Core i5 laptop (Ubuntu) doubled in speed. Actually, it didn’t double in speed. It stayed the same speed, but only uses half the CPU now. I can’t get it to go back to full CPU usage. That said, my laptop is a lot cooler when generating blocks now. I’ll post back if I see it successfully go up to 100% usage.

Satoshi Nakamoto August 19, 2010 Source · Permalink

Quote from: Ground Loop on August 18, 2010, 11:14:26 PM

Any non-Mac i5 love? Windows i5 64-bit got slower here.

That’s the first I’ve heard anyone say i5 was slower. Everyone else has said 4way was faster on i5. Moreso with hyperthreading enabled.

Quote from: nelisky on August 18, 2010, 11:02:25 PM

And i5, at least on my macbookpro

Good, so I take it that’s a confirmation that it’s working on Mac as well?

Laszlo told me he did compile in the -4way stuff on Mac, so the -4way switch is also available to try on Mac. I don’t think makefile.osx on SVN has it yet, just the built version.

nelisky August 19, 2010 Source · Permalink

Quote from: satoshi on August 19, 2010, 07:07:43 PM

Quote from: nelisky on August 18, 2010, 11:02:25 PM

And i5, at least on my macbookpro

Good, so I take it that’s a confirmation that it’s working on Mac as well?

Laszlo told me he did compile in the -4way stuff on Mac, so the -4way switch is also available to try on Mac. I don’t think makefile.osx on SVN has it yet, just the built version.

Yep, it’s working all right. The numbers I had posted were from an old svn revision patched with tcatm’s changes, but today I compiled trunk, and while I had to once again tweak the makefile, after I did it works great, with the numbers matching what I experienced before.

Changes I did for my system are below, and while some are cosmetic, like removing wx-config from making bitcoind, just to avoid the warnings if you don’t have it installed, others are system specific, like the DEPS dir, and the fact I don’t have 32bit libs which makes the link step fail if -arch i386 is there. The bsddb changes are, I believe, a typo. Includes and Libs point to db46, but then the object list for the linker states db48. Anyway, here’s the diff for what got me going:

Code:
Index: makefile.osx
--- makefile.osx (revision 139)
+++ makefile.osx (working copy)
@@ -6,29 +6,29 @@
 # Laszlo Hanyecz (solar@heliacal.net)

 CXX=llvm-g++
-DEPSDIR=/Users/macosuser/bitcoin/deps
+DEPSDIR=/opt/local

 INCLUDEPATHS= \
- -I"$(DEPSDIR)/include"
+ -I"$(DEPSDIR)/include" -I"$(DEPSDIR)/include/db46"

 LIBPATHS= \
- -L"$(DEPSDIR)/lib"
+ -L"$(DEPSDIR)/lib" -L"$(DEPSDIR)/lib/db46"

-WXLIBS=$(shell $(DEPSDIR)/bin/wx-config --libs --static)
+WXLIBS=

 LIBS= -dead_strip \
- $(DEPSDIR)/lib/libdb_cxx-4.8.a \
- $(DEPSDIR)/lib/libboost_system.a \
- $(DEPSDIR)/lib/libboost_filesystem.a \
- $(DEPSDIR)/lib/libboost_program_options.a \
- $(DEPSDIR)/lib/libboost_thread.a \
+ $(DEPSDIR)/lib/db46/libdb_cxx-4.6.a \
+ $(DEPSDIR)/lib/libboost_system-mt.a \
+ $(DEPSDIR)/lib/libboost_filesystem-mt.a \
+ $(DEPSDIR)/lib/libboost_program_options-mt.a \
+ $(DEPSDIR)/lib/libboost_thread-mt.a \
  $(DEPSDIR)/lib/libcrypto.a

-DEFS=$(shell $(DEPSDIR)/bin/wx-config --cxxflags) -D__WXMAC_OSX__ -DNOPCH -DMSG_NOSIGNAL=0
+DEFS=-D__WXMAC_OSX__ -DNOPCH -DMSG_NOSIGNAL=0 -DFOURWAYSSE2

 DEBUGFLAGS=-g -DwxDEBUG_LEVEL=0
 # ppc doesn't work because we don't support big-endian
-CFLAGS=-mmacosx-version-min=10.5 -arch i386 -arch x86_64 -O3 -Wno-invalid-offsetof -Wformat $(DEBUGFLAGS) $(DEFS) $(INCLUDEPATHS)
+CFLAGS=-mmacosx-version-min=10.5 -arch x86_64 -O3 -Wno-invalid-offsetof -Wformat $(DEBUGFLAGS) $(DEFS) $(INCLUDEPATHS)
 HEADERS=headers.h strlcpy.h serialize.h uint256.h util.h key.h bignum.h base58.h \
  script.h db.h net.h irc.h main.h rpc.h uibase.h ui.h noui.h init.h

@@ -42,6 +42,7 @@
  obj/rpc.o \
  obj/init.o \
  cryptopp/obj/sha.o \
+ obj/sha256.o \
  cryptopp/obj/cpu.o

@@ -55,7 +56,7 @@
  $(CXX) -c $(CFLAGS) -O3 -DCRYPTOPP_DISABLE_ASM -o $@ $<

 bitcoin: $(OBJS) obj/ui.o obj/uibase.o
- $(CXX) $(CFLAGS) -o $@ $(LIBPATHS) $^ $(WXLIBS) $(LIBS)
+ $(CXX) $(shell $(DEPSDIR)/bin/wx-config --cxxflags) $(CFLAGS) -o $@ $(LIBPATHS) $^ $(shell $(DEPSDIR)/bin/wx-config --libs --static) $(LIBS)

 obj/nogui/%.o: %.cpp $(HEADERS)

ArtForz August 21, 2010 Source · Permalink

The difference between new and older CPUs is pretty easy to explain. Older microarchitectures have 64-bit mmx/sse execution units and split 128bit sse ops into 2 64bit microops. Newer archs have 128bit sse units.

AMD K8: 2 64bit units
intel Core/Core2: 3 64bit units
AMD K10: 2 128bit units
intel nehalem: 3 128bit units

K10 = Opterons with 4 or more cores, Phenom, Phenom II, Athlon II
nehalem = xeon 34xx/35xx/36xx/55xx/56xx/65xx/75xx, i3/i5/i7

Satoshi Nakamoto August 22, 2010 Source · Permalink

Thanks for clearing that up. I read the link someone posted about AMD making that change around 2007, but I didn’t know what the story was for Intel.

There’s no hope for Core/Core2 then. They only have half the SSE2 hardware.

Strange that Intel has 3 128bit units, but AMD with 2 128bit units is the faster one.

Ground Loop August 23, 2010 Source · Permalink

Intel Atom 230 @ 1.60GHz. Linux 32-bit. (Acer Aspire Revo)

Stock: 438 khash/sec (1 proc gives 354)
4way: 254 khash/sec

So you can take this one off the powerhouse list.

sgtstein August 24, 2010 Source · Permalink

Anybody catch the new AMD Bulldozer press release? If I understand correctly, it should be capable of processing 8 64bit hashes, per core, at the same time. Would be quite a speed boost using this same code design.

Slashdot has the article. PC Perspective has the details.

Was also covered by AnandTech back in November, 2009.

Satoshi Nakamoto August 24, 2010 Source · Permalink

Quote from: ArtForz on August 21, 2010, 04:56:31 PM

AMD K10: 2 128bit units
intel nehalem: 3 128bit units

This probably explains why hyperthreading increases performance with -4way. If three SSE2 units is excessive, then hyperthreading would help keep them all busy.

tcatm August 28, 2010 Source · Permalink

I just reviewed the sourcecode as I had a few ideas to optimize it further and I noticed that 4way is partly broken:

from main.cpp:

Code:
for (int j = 0; j < NPAR; j++) {
    if (thash[7][j] == 0) {
        for (int i = 0; i < sizeof(hash)/4; i++)
            ((unsigned int*)&hash)[i] = thash[i][j];
        pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
    }
}

The code will only process one hash (the last with thash[7] == 0) out of 32 hashes even when there is more than one hash that might be a correct one.
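To make that behaviour concrete, here is a minimal sketch of the selection logic only. `selected_lane` is a hypothetical helper, not a function from main.cpp, and `NPAR = 32` is assumed from the "32 hashes" mentioned above. The real loop also copies the hash and sets the nonce, which is left out here:

```cpp
#include <cassert>
#include <cstdint>

const int NPAR = 32;  // hashes computed per batch (assumed from the thread)

// Hypothetical stand-in for the quoted loop: returns the index of the lane
// the loop ends up keeping, or -1 if no lane has thash[7] == 0.
int selected_lane(const std::uint32_t thash[8][NPAR]) {
    assert(thash != nullptr);
    int chosen = -1;
    for (int j = 0; j < NPAR; j++) {
        if (thash[7][j] == 0) {
            chosen = j;  // a later match silently overwrites an earlier one
        }
    }
    return chosen;
}
```

If lanes 3 and 20 both have `thash[7] == 0`, the function returns 20, and lane 3 is never looked at again even if its hash is the smaller one.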

Something like this should fix it but it won’t be safe at higher difficulties. Also, I’m not sure whether the byte order should be reversed or not. Could someone review this?

Code:
unsigned int min_hash = ~1;
for (int j = 0; j < NPAR; j++) {
    if (thash[7][j] == 0) {
        if (thash[6][j] < min_hash) {
            min_hash = thash[6][j];
            for (int i = 0; i < sizeof(hash)/4; i++)
                ((unsigned int*)&hash)[i] = thash[i][j];
            pblock->nNonce = ByteReverse(tmp.block.nNonce + j);
        }
    }
}

Satoshi Nakamoto August 28, 2010 Source · Permalink

The simplification is intentional. There will only be more than one thash[7]=0 in one out of 134,217,728 cases. It only makes it 0.0000007% slower.
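Both figures can be reproduced with a short calculation, assuming each hash has thash[7] == 0 with probability 1 in 2^32, independently, and that a batch holds 32 hashes (the count tcatm mentions). Given one match in a batch, the chance of a second is then roughly 32 in 2^32, or 1 in 2^27. The function names below are hypothetical and used for illustration only:

```cpp
#include <cassert>
#include <cstdint>

// Returns N such that a second thash[7] == 0 in the same batch happens about
// once in N cases, for a batch of `npar` hashes. Assumes independent,
// uniformly distributed 32-bit hash words.
std::uint64_t one_in(std::uint64_t npar) {
    assert(npar > 0);
    return (std::uint64_t(1) << 32) / npar;
}

// The same odds expressed as a percentage.
double percent_lost(std::uint64_t npar) {
    return 100.0 / static_cast<double>(one_in(npar));
}
```

`one_in(32)` is 134,217,728, and `percent_lost(32)` is about 0.00000075, consistent with the "0.0000007%" quoted.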