OpenSSL is compiled with NO_RSA, no RSA operations can be used: including
key generation storage and display of RSA keys. Since these operations are
not covered by the RSA patent (my understanding is it only covers encrypt,
decrypt, sign and verify) they can be included: this is an often requested
feature, attempts to use the patented operations return an error code.
This is enabled by setting RSA_NULL. This means that if a particular application
has its own legal US RSA implementation then it can use that instead by setting
it as the default RSA method.
Still experimental and needs some fiddling of the other libraries so they have
some options that don't attempt to use RSA if it isn't allowed.
I've chosen to nest two functions in order to save about 4K. As a result
s1-win32.asm doesn't look right (nested PROC/ENDP SEGMENT/ENDS) and it's
probably impossible to compile. I assume I have to reconsider... But not
today...
"Clean-up" stands for the fact that it's using common message digest
template ../md32_common.h and sha[1_]dgst.c are reduced down to
'#define SHA_[01]' and then '#include "sha_locl.h"'. It stands "(LP64)"
there because it's 64 bit platforms which benefit most from the tune-up.
The updated code exhibits 40% performance improvement on IRIX64
(sounds too good, huh? I probably should double check if it's not
some cache trashing that was holding it back before), 28% - on
Alpha Linux and 12% - Solaris 7/64.
Modify obj_dat.pl to take its files from the command line. Usage is now
perl obj_dat.pl objects.h obj_dat.h
this should avoid redirection shell escape problems under Win32.
Prototypes and constant declarations for non-copying reads and writes for
BIO pairs (which is totally untested as of now, so I don't yet commit
the actual source code, but reserve the numbers to avoid conflicts).
the remainder left in %edx. Here is the resulting performance improvement
matrix (improvement as a result of this *and* previous tune-up committed
two days ago). The results were obtained by profiling the "div" part of
the crypto/bn/bnspeed.c.
CPU BN_div bn_div_words overall comment
------------------------------------------------------------------------
PII +16% accumulated by +2-3% PII multiplies damn fast! Taking
inlining multiplication out of the loop
didn't make too much difference.
Eliminating of the multiplication
involved in remainder calculation
is the major factor.
Pentium +45% accumulated by +7-9% mull isn't that fast and replacing
inlining multiplications with additions in
the loop has more visible effect:-)
MIPS +75% +12% +20-25% In addition to the taking mults
R10000 out of the loop (giving 12% in the
asm/mips3.s) three mults were
eliminated in BN_div.
Alpha +30% +50% +10-15% Same as above. But remember that
EV4 bn_div_words is a C implementation.
It takes 4 Alpha mults in C to do
the same thing as 1 MIPS mult in
assembler does. So the effect (50%)
is more impressive. But not the
overall one... Well, if Alpha
bn_mul_add would be implemented
in assembler overall improvement
would be closer to MIPS...