Clarify commentary in sha512-sparcv9.pl.
parent 5f0477f47b
commit 79fe664f19
@@ -17,7 +17,7 @@
 # Performance is >75% better than 64-bit code generated by Sun C and
 # over 2x than 32-bit code. X[16] resides on stack, but access to it
 # is scheduled for L2 latency and staged through 32 least significant
-# bits of %l0-%l7. The latter is done to achieve 32-/64-bit bit ABI
+# bits of %l0-%l7. The latter is done to achieve 32-/64-bit ABI
 # duality. Nevetheless it's ~40% faster than SHA256, which is pretty
 # good [optimal coefficient is 50%].
 #
@@ -25,14 +25,22 @@
 #
 # It's not any faster than 64-bit code generated by Sun C 5.8. This is
 # because 64-bit code generator has the advantage of using 64-bit
-# loads to access X[16], which I consciously traded for 32-/64-bit ABI
-# duality [as per above]. But it surpasses 32-bit Sun C generated code
-# by 60%, not to mention that it doesn't suffer from severe decay when
-# running 4 times physical cores threads and that it leaves gcc [3.4]
-# behind by over 4x factor! If compared to SHA256, single thread
+# loads(*) to access X[16], which I consciously traded for 32-/64-bit
+# ABI duality [as per above]. But it surpasses 32-bit Sun C generated
+# code by 60%, not to mention that it doesn't suffer from severe decay
+# when running 4 times physical cores threads and that it leaves gcc
+# [3.4] behind by over 4x factor! If compared to SHA256, single thread
 # performance is only 10% better, but overall throughput for maximum
 # amount of threads for given CPU exceeds corresponding one of SHA256
 # by 30% [again, optimal coefficient is 50%].
+#
+# (*)	Unlike pre-T1 UltraSPARC loads on T1 are executed strictly
+#	in-order, i.e. load instruction has to complete prior next
+#	instruction in given thread is executed, even if the latter is
+#	not dependent on load result! This means that on T1 two 32-bit
+#	loads are always slower than one 64-bit load. Once again this
+#	is unlike pre-T1 UltraSPARC, where, if scheduled appropriately,
+#	2x32-bit loads can be as fast as 1x64-bit ones.
 
 $bits=32;
 for (@ARGV) { $bits=64 if (/\-m64/ || /\-xarch\=v9/); }
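The trade-off described in the (*) footnote, one 64-bit load versus two 32-bit loads per X[] element, can be pictured in C. This is only an illustrative sketch, not the module's code: the helper names `load64_direct` and `load64_split` are hypothetical, and the high-word-first ordering is an assumption matching SPARC's big-endian layout. The module itself does the equivalent in assembly.

```c
#include <stdint.h>

/* Hypothetical helper: fetch one 64-bit element with a single load,
 * as a 64-bit code generator can do. */
static uint64_t load64_direct(const uint64_t *x, unsigned i)
{
    return x[i];
}

/* Hypothetical helper: fetch the same element as two 32-bit halves and
 * merge them. Assumes the halves are stored high word first. This needs
 * two loads per element, which the footnote says is always slower on T1. */
static uint64_t load64_split(const uint32_t *halves, unsigned i)
{
    uint64_t hi = halves[2 * i];      /* first 32-bit load  */
    uint64_t lo = halves[2 * i + 1];  /* second 32-bit load */
    return (hi << 32) | lo;
}
```

Both helpers return the same value; the difference is only in how many load instructions are issued, which is what the commentary's performance figures turn on.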