4 hashes parallel on SSE2 CPUs for 0.3.6
This patch will calculate four hashes on one core using vector instructions. There’s a test programm included that validates the new hash function against the old one so it should be correct.
The patch is against 0.3.6. Improves khash/s by roughly 115%.
Performance of stock code (as measured by my test/benchmark program) is about 1500khash/s. My code does 3500khash/s. Both figures are for one core. It scales well because I do 128 hashes at once and keep the datastructures small enough to fit in the CPU cache.
I have two local collision attacks which will squeeze another 300khash/s out, but they are not stable yet.
Tell me if it works Smiley Donations are welcome. 17asVKkzRGTFvvGH9dMGQaHe78xzfvgSSA
That’s amazing…
So are you saying you use 128-bit registers to SIMD four 32-bit data at once? I’ve wondered about that for a long time, but I didn’t think it would be possible due to addition carrying into the neighbour’s value.
Quote from: satoshi on July 31, 2010, 12:29:20 AM
That’s amazing…
So are you saying you use 128-bit registers to SIMD four 32-bit data at once? I’ve wondered about that for a long time, but I didn’t think it would be possible due to addition carrying into the neighbour’s value.
That’s how it works. Four 32 bit values in a 128 bit vector. They’re calculated independently, but at the same time.
Btw. Why are you using this alignup<16> function when attribute ((aligned (16))) will tell the compiler to align at compiletime?
Quote from: knightmb on August 02, 2010, 08:47:04 AM
Is it a AMD only optimization perhaps?
Or a 64-bit only optimization.
With the patch above, I was unable to build the test program. You?
The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client. I was even able to port its minor changes to 0.3.7 successfully, with the same results.
Is there a way we can confirm that the variables are being aligned properly? I’m wondering if the Intel procs are less tolerant of misalignment than the AMD’s.
Quote from: Ground Loop on August 02, 2010, 09:17:07 AM
With the patch above, I was unable to build the test program. You?
Under x86 I had to include cryptopp/obj/cpu.o in the list of object files, otherwise “make test” would fail. Under x86_64 I had no such issue.
Quote from: petree on August 02, 2010, 09:22:29 AM
The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client. I was even able to port its minor changes to 0.3.7 successfully, with the same results.
As I said above I did notice an imporvement in performace too, but I am not sure the patched version works correctly. Have you been able to generate any blocks with the patched version?
Quote from: impossible7 on August 02, 2010, 09:00:55 AM
Quote from: knightmb on August 02, 2010, 08:47:04 AM
Is it a AMD only optimization perhaps?
Or a 64-bit only optimization.
I’m trying on a Q6600 running 64bit linux (ubuntu server) and it makes things slower there, so not 64bit only. And I’m running on my mac laptop which sports an Intel i5 (also 64 bit OSX 10.6), which great speed improvement there, so not AMD only.
Quote from: impossible7 on August 02, 2010, 09:31:44 AM
Quote from: Ground Loop on August 02, 2010, 09:17:07 AM
With the patch above, I was unable to build the test program. You?
Under x86 I had to include cryptopp/obj/cpu.o in the list of object files, otherwise “make test” would fail. Under x86_64 I had no such issue.
Quote from: petree on August 02, 2010, 09:22:29 AM
The original patch posted is working just fine for me (Opteron 2376), and did double my performance over the stock 0.3.6 client. I was even able to port its minor changes to 0.3.7 successfully, with the same results.
As I said above I did notice an imporvement in performace too, but I am not sure the patched version works correctly. Have you been able to generate any blocks with the patched version?
Yes, since applying this patch I’ve generated 2 blocks.
Is it 2x fast on AMD and 1/2 fast on Intel?
Quote from: tcatm on July 31, 2010, 10:12:38 AM
Btw. Why are you using this alignup<16> function when attribute ((aligned (16))) will tell the compiler to align at compiletime?
Tried that, but it doesn’t work for things on the stack. I ran some tests.
It doesn’t even cause an error, it just doesn’t align it.
Oh, that’s a part in the code my patch doesn’t touch. You could try to remove line 2741 (CRITICAL_BLOCK(cs_main)).
CRITICAL_BLOCK is a macro that contains a for loop. The assertion failure indicates that break has been called inside the body of the loop. The only break statement in this block is in line 2762. In the original source file, there is no break statement in this critical block. I think you must remove lines 2759-2762. The is nothing like that in the original main.cpp.
Thanks! That got probably mixed up when I patched the file with an older diff. I fixed the git: http://github.com/tcatm/bitcoin-cruncher
Quote from: impossible7 on August 06, 2010, 11:37:20 AM
CRITICAL_BLOCK is a macro that contains a for loop. The assertion failure indicates that break has been called inside the body of the loop. The only break statement in this block is in line 2762. In the original source file, there is no break statement in this critical block. I think you must remove lines 2759-2762. The is nothing like that in the original main.cpp.
Sorry about that. CRITICAL_BLOCK isn’t perfect. You have to be careful not to break or continue out of it. There’s an assert that catches and warns about break. I can be criticized for using it, but the syntax would be so much more bloated and error prone without it.
Is there a chance the SSE2 code is slow on Intel because of some quirk that could be worked around? For instance, if something works but is slow if it’s not aligned, or thrashing the cache, or one type of instruction that’s really slow? I’m not sure how available it is, but I think Intel used to have a profiler for profiling on a per instruction level. I guess if tcatm doesn’t have a system with the slow processor to test with, there’s not much hope. But it would be really nice if this was working on most CPUs.
It seems to be like this: everything before Core2 will be slower, everything starting with Core2 is faster. Can anyone test the code on an older AMD64? I know there was a change in the way SSE2 instructions are executed in recent architectures.
Can we implement a speed test, so different hashing engines are tried and the fastest is chosen?
Quote from: tcatm on August 08, 2010, 11:52:53 AM
It seems to be like this: everything before Core2 will be slower, everything starting with Core2 is faster. Can anyone test the code on an older AMD64? I know there was a change in the way SSE2 instructions are executed in recent architectures.
My Core2Quad (Q6600) slowed down 50%, my i5 improved ~200%, thus I don’t think what you state is accurate. Maybe starting at some specific Core2?
That big of a difference in speed, by a factor of 4 or 6, feels like it’s likely to be some quirky weak spot or instruction that the old chip is slow with. Unless it’s a touted feature of the i5 that they made SSE2 six times faster.
A quick summary:
Xeon Quad 41% slower
Core 2 Duo 55% slower
Core 2 Duo same (vess)
Core 2 Quad 50% slower
Core i5 200% faster (nelisky)
Core i5 100% faster (vess)
AMD Opteron 105% faster
aceat64:
My system went from ~7100 to ~4200.
This particular system has dual Intel Xeon Quad-Core CPUs (E5335) @ 2.00GHz.
impossible7:
on an Intel Core 2 Duo T7300 running x86_64 linux it was 55% slower compared to the stock version (r121)
nelisky:
My Core2Quad (Q6600) slowed down 50%,
my i5 improved ~200%,
impossible7:
on an AMD Opteron 2374 HE running x86_64 linux I got a 105% improvement (!)
- Does not work on 32-bit (though that’s not a problem with the algorithm).
- Patch is against older SVN. There’s a git repo at http://github.com/tcatm/bitcoin-cruncher
- Compiles on every 64bit Linux.
It’s not intended as a replacement for a standard client but for a dedicated bitcoinminer box. I’m planning a pluggable bitcoinminer someday. But at current difficulty it’s easier to work for bitcoins than finding faster ways for mining.
I would really like to have this feature included in an official build sometimes soon along with an internal speed test to determine which algorithm to use. You can always remove the speed test later once you figure out how to determine whether it will be faster or slower without running the speed test.
Quote from: tcatm on August 13, 2010, 09:27:14 PM
- Does not work on 32-bit (though that’s not a problem with the algorithm).
- Patch is against older SVN. There’s a git repo at http://github.com/tcatm/bitcoin-cruncher
- Compiles on every 64bit Linux.
It’s not intended as a replacement for a standard client but for a dedicated bitcoinminer box. I’m planning a pluggable bitcoinminer someday. But at current difficulty it’s easier to work for bitcoins than finding faster ways for mining.
-
Do we know why it doesn’t work on 32bit? Is is it because it’s using 128bits and if so, would it help if we dropped it to 64?
-
Thank you, I will look into implementing it on my 64bit systems.
-
Excellent to hear. I’m looking forward to using it.
I was planning on using it on a PE2650 dual proc Xeon @3.2GHz w/HT. I would really like to get this figured out to utilize that system. I am planning one as well. At current difficulty I would agree, except when the system needs to be run anyway and latency isn’t an issue.
MinGW on Windows has trouble compiling it:
g++ -c -mthreads -O2 -w -Wno-invalid-offsetof -Wformat -g -D__WXDEBUG__ -DWIN32 -D__WXMSW__ -D_WINDOWS -DNOPCH -I”/boost” -I”/db/build_unix” -I”/openssl/include” -I”/wxwidgets/lib/gcc_lib/mswud” -I”/wxwidgets/include” -msse2 -O3 -o obj/sha256.o sha256.cpp
sha256.cpp: In function `long long int vector Ch(long long int vector, long long int vector, long long int vector)’: sha256.cpp:31: internal compiler error: in perform_integral_promotions, at cp/typeck.c:1454 Please submit a full bug report, with preprocessed source if appropriate. See URL:http://www.mingw.org/bugs.shtml for instructions. make: *** [obj/sha256.o] Error 1
Quote from: sgtstein on August 13, 2010, 11:17:51 PM
- Do we know why it doesn’t work on 32bit? Is is it because it’s using 128bits and if so, would it help if we dropped it to 64?
No idea, maybe some alignment problem. Someone was trying to figure it out on IRC. I don’t have a SSE2 capable 32bit system. The additional registers in 64bit mode are also useful. I don’t know if your PE2650 has a recent enough CPU. You might experience a performance drop of 50% if the CPU is too old.
Btw, did anyone with Intel CPU compare performance with Hyperthreading enabled/disabled? The SSE2 loop keeps the arithmetic units and pipelines pretty busy and I can imagine Hyperthreading might decrease performance.
Quote from: satoshi on August 14, 2010, 12:49:18 AM
MinGW on Windows has trouble compiling it:
g++ -c -mthreads -O2 -w -Wno-invalid-offsetof -Wformat -g -D__WXDEBUG__ -DWIN32 -D__WXMSW__ -D_WINDOWS -DNOPCH -I”/boost” -I”/db/build_unix” -I”/openssl/include” -I”/wxwidgets/lib/gcc_lib/mswud” -I”/wxwidgets/include” -msse2 -O3 -o obj/sha256.o sha256.cpp
sha256.cpp: In function `long long int vector Ch(long long int vector, long long int vector, long long int vector)’: sha256.cpp:31: internal compiler error: in perform_integral_promotions, at cp/typeck.c:1454 Please submit a full bug report, with preprocessed source if appropriate. See for instructions. make: *** [obj/sha256.o] Error 1
Looks like we’re triggering a compiler bug in the tree optimizer. Can you try to compile it -O0?
If you haven’t already, try aligning thash. It might matter. Couldn’t hurt.
Quote from: tcatm on August 14, 2010, 12:53:07 AM
Looks like we’re triggering a compiler bug in the tree optimizer. Can you try to compile it -O0?
No help from -O0, same error.
MinGW is GCC 3.4.5. Probably the problem.
I’ll see if I can get a newer version of MinGW.
Got the test working on 32-bit with MinGW GCC 4.5. Exactly 50% slower than stock with Core 2.
MinGW GCC 4.5.0: Crypto++ doesn’t work, X86_SHA256_HashBlocks() never returns I only got 4-way working with test.cpp but not when called by BitcoinMiner
MinGW GCC 4.4.1:
Crypto++ works
4-way SIGSEGV
GCC is definitely not aligning __m128i.
Even if we align our own __m128i variables, the compiler may decide to use a __m128i behind the scenes as a temporary variable.
By making our __m128i variables aligned and changing these inlines to defines, I was able to get it to work on 4.4.1 with -O0 only: #define Ch(b, c, d) ((b & c) ^ (~b & d)) #define Maj(b, c, d) ((b & c) ^ (b & d) ^ (c & d)) #define ROTR(x, n) (_mm_srli_epi32(x, n) | _mm_slli_epi32(x, 32 - n)) #define SHR(x, n) _mm_srli_epi32(x, n)
But that’s with -O0.
Well, reporting back.
I got it to compile by specifying -msse and -msse2 to gcc when compiling. I first was hashing about 692kh/s (50% of SVN r130[1400kh/s]) but recompiled and am now receiving about ~1120kh/s. This is currently the equivalent of using both of my CPUs without HyperThreading, though I can verify that it IS using HyperThreading. With HyperThreading turned off, I get ~1350kh/s. Pretty close to the stock build.
Also, does the git contain the patched and updated code?
Code:// SVN r130 Using HT. 08/14/10 19:02 hashmeter 4 CPUs 1392 khash/s 08/14/10 19:32 hashmeter 4 CPUs 1387 khash/s 08/14/10 20:02 hashmeter 4 CPUs 1386 khash/s 08/14/10 20:32 hashmeter 4 CPUs 1380 khash/s 08/14/10 21:02 hashmeter 4 CPUs 1363 khash/s // With -msse -msse2, first run. Using HT. 08/14/10 21:32 hashmeter 4 CPUs 692 khash/s 08/14/10 22:06 hashmeter 4 CPUs 1011 khash/s 08/14/10 22:11 hashmeter 4 CPUs 1104 khash/s 08/14/10 22:16 hashmeter 4 CPUs 1120 khash/s // NOT using HT. 08/14/10 22:21 hashmeter 2 CPUs 1359 khash/s 08/14/10 22:26 hashmeter 2 CPUs 1340 khash/s
Just wanted to tell my story and help with whatever information I could.
On both MinGW GCC 4.4.1 and 4.5.0 I have it working with test.cpp but SIGSEGV when called by BitcoinMiner. So now it doesn’t look like it’s the version of GCC, it’s something else, maybe just the luck of how the stack is aligned.
I have it working fine on GCC 4.3.3 on Ubuntu 32-bit.
I found the problem with Crypto++ on MinGW 4.5.0. Here’s the patch for that:
Code:--- \old\sha.cpp Mon Jul 26 13:31:11 2010
+++
ew\sha.cpp Sat Aug 14 20:21:08 2010
@@ -336,7 +336,7 @@
ROUND(14, 0, eax, ecx, edi, edx)
ROUND(15, 0, ecx, eax, edx, edi)
- ASL(1)
+ ASL(label1) // Bitcoin: fix for MinGW GCC 4.5
AS2(add WORD_REG(si), 4*16)
ROUND(0, 1, eax, ecx, edi, edx)
ROUND(1, 1, ecx, eax, edx, edi)
@@ -355,7 +355,7 @@
ROUND(14, 1, eax, ecx, edi, edx)
ROUND(15, 1, ecx, eax, edx, edi)
AS2( cmp WORD_REG(si), K_END)
- ASJ( jne, 1, b)
+ ASJ( jne, label1, ) // Bitcoin: fix for MinGW GCC 4.5
AS2( mov WORD_REG(dx), DATA_SAVE)
AS2( add WORD_REG(dx), 64)