Handbook of Applied Cryptography
Alfred J. Menezes
Paul C. van Oorschot
Scott A. Vanstone
Since RSA public-key encryption depends on modular arithmetic, such as modular multiplication and exponentiation, that arithmetic needs to be fast. The book describes methods such as Barrett reduction for speeding up modular exponentiation, along with other fast methods.
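To give an idea of what Barrett reduction does, here is a minimal sketch in Python. It is a simplified whole-integer variant, not the limb-based algorithm as the book states it, and the names `barrett_setup`, `barrett_reduce` and `mod_exp` are made up for this illustration. The point is that the expensive division by the modulus happens only once, in the setup; every later reduction uses just multiplications, shifts and subtractions.

```python
def barrett_setup(n):
    """Precompute k and mu = floor(2**(2k) / n) for a fixed modulus n."""
    k = n.bit_length()
    mu = (1 << (2 * k)) // n  # the only division by n
    return k, mu


def barrett_reduce(x, n, k, mu):
    """Return x mod n for 0 <= x < n*n without dividing by n."""
    q = (x * mu) >> (2 * k)  # estimate of x // n, never too large
    r = x - q * n
    while r >= n:  # the estimate is slightly low at worst
        r -= n
    return r


def mod_exp(base, exponent, n):
    """Square-and-multiply exponentiation using Barrett reduction."""
    k, mu = barrett_setup(n)
    result = 1 % n
    base %= n
    while exponent > 0:
        if exponent & 1:
            result = barrett_reduce(result * base, n, k, mu)
        base = barrett_reduce(base * base, n, k, mu)
        exponent >>= 1
    return result
```

For example, `mod_exp(4, 13, 497)` gives 445, the same as Python's built-in `pow(4, 13, 497)`. In a real RSA implementation the numbers would be arrays of machine words rather than Python integers, but the structure is the same.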
But to get to first place, one needs:
There are ways to speed up very long multiplications, such as Karatsuba multiplication or fast Fourier transform convolution methods, but 512 bits is a little short for these excellent methods. So what I did was optimize register use: minimizing memory accesses and keeping the processor busy with multiplications instead of wasting time fetching and storing results.
A way of doing this is described in my thesis, paragraph 7.3, "Register use".
The idea is to change the order of the smaller parts of the big
multiplications, that is, the smaller multiplications and additions.
By reordering them, one can keep intermediate results in registers
for as long as possible, which minimizes memory accesses.
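One well-known reordering of this kind is column-wise ("product scanning", or Comba) multiplication. The sketch below is an illustration of that general technique, not necessarily the exact scheme from the thesis. The 32-bit limb size and the names `to_limbs`, `from_limbs` and `multiply_by_columns` are assumptions made for the example. All partial products that belong to one result word are summed in an accumulator, which in assembly would live in a few registers, and each result word is written to memory exactly once.

```python
LIMB_BITS = 32  # assumed word size for this sketch
LIMB_MASK = (1 << LIMB_BITS) - 1


def to_limbs(x, count):
    """Split a non-negative integer into `count` limbs, least significant first."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(count)]


def from_limbs(limbs):
    """Join limbs (least significant first) back into one integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))


def multiply_by_columns(a, b):
    """Multiply two equal-length limb lists, one result column at a time."""
    n = len(a)
    if n == 0:
        return []
    result = [0] * (2 * n)
    acc = 0  # stands in for the accumulator registers
    for col in range(2 * n - 1):
        lo = max(0, col - n + 1)
        hi = min(col, n - 1)
        for i in range(lo, hi + 1):
            acc += a[i] * b[col - i]  # partial products of this column
        result[col] = acc & LIMB_MASK  # single store per result word
        acc >>= LIMB_BITS  # carry into the next column
    result[2 * n - 1] = acc  # the remaining carry is the top word
    return result
```

A 512-bit operand is 16 such limbs, so `from_limbs(multiply_by_columns(to_limbs(x, 16), to_limbs(y, 16)))` should equal `x * y`. The row-by-row schoolbook order computes the same partial products, but it reads and rewrites each result word many times, which is exactly the memory traffic the reordering avoids.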
After a while the phones started using cryptographic chips for this,
and they still do that in 2016 even though the processors are more
than fast enough. The chips have developed into digital fortresses
protecting phones even against the FBI. Apple especially has done this
thoroughly in the iPhone. Perhaps the NSA has a backdoor in the crypto
chips.
But ARM processors are used now more than ever, especially in small
devices on the net such as the Raspberry Pi, and all of them would
benefit from faster crypto, which I could make. But everyone uses slower
open source code there, and making fast crypto again is a lot of work.