A little trick that can be used in programming to reduce CPU load, when the value of a hypotenuse is being compared.

A scenario that often happens in computing is that there exists a quantity, call it (a), which results from squaring the quantities (x), (y) and (z) first, and then computing the square root of the sum. It could then be said that the following explicit function has been defined:

F(x, y, z) := sqrt(x^2 + y^2 + z^2)

Further, the idea exists in computing that, when all one wants is (x^2) for example, it takes fewer CPU cycles to compute (x*x) than to invoke a general power function.

But the object of the exercise may actually be, not to derive (a) from (x), (y) and (z), but rather to compare two instances of F(x, y, z).

The biggest issue with actually computing F(x, y, z) is that computing the square root is even slower than computing (x^2), (y^2) and (z^2). Therefore, if one has the luxury of knowing (a) in advance, what one can do, for real-number comparisons, is simply to square (a), and skip the square root that would exist within the function F(). As long as (a) is non-negative, squaring both sides preserves the order of the comparison. So, when two known quantities are simply being compared, the following way to do it will run slightly faster:

a^2 < (x^2 + y^2 + z^2)
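The comparison above can be sketched in Python as follows. The function name `hypotenuse_exceeds` is my own, hypothetical choice, and the sketch assumes (a) is non-negative, since otherwise squaring would not preserve the order of the comparison:

```python
import math

def hypotenuse_exceeds(x, y, z, a):
    """Test whether sqrt(x^2 + y^2 + z^2) > a, without
    computing a square root. Assumes a >= 0."""
    # Equivalent to a^2 < (x^2 + y^2 + z^2), the faster form.
    return a * a < x * x + y * y + z * z

def hypotenuse_exceeds_slow(x, y, z, a):
    """The direct formulation, shown for comparison; it pays
    for a sqrt call on every invocation."""
    return a < math.sqrt(x * x + y * y + z * z)
```

Both functions return the same boolean for non-negative (a); only the first avoids the square root.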

In modern computing, actual CPU usage is often ignored in order to make the task of writing complex code easier, and the situation may not always be recognizable that the two values about to be compared were each computed as the square root of some other value. So, to avoid having to stare at some code cross-eyed, it can be just as valid to compute two instances of F(x, y, z), square root included in each case, and somewhere later in the code execution simply compare the two resulting values.




The Original RSA Trapdoor Function

I once ran into laypeople who were able to understand what a modulus was – so that the output of a series of computations would never equal or exceed that modulus – and who were able to understand what the exponent function is, but who were incredulous when I told them that it was possible to compute the result of raising a 2048-bit number to a 2048-bit exponent, on the basis that the result only needs to fit inside a 2048-bit modulus.

I believe that the most common way in which this is done is based on the fact that two 2048-bit numbers can be multiplied, and the result reduced back down modulo a 2048-bit modulus. This notion can be extended to mean that a 2048-bit number can also be squared, and the result reduced modulo the same 2048-bit modulus…

Well, to achieve the exponent function, one needs a base register, an accumulator register, and the exponent. The accumulator is initialized to the value (1).

If the exponent has 2048 bits, then the operation can be repeated 2048 times:

  1. Square the value in the accumulator.
  2. Left-Shift the Most-Significant Bit of the Exponent out, into a bit-register that can be examined.
  3. If that bit-register is equal to (1) and not (0), multiply the base-register into the accumulator an extra time.

Because the most-significant bit of the exponent is shifted out first, its being (1) means that the base was multiplied into the accumulator earliest, so that this factor of the accumulator is then squared by the greatest number of remaining iterations, effectively raising the base to the powers 1, 2, 4, 8, 16, 32, … 2^2047, in combinations.
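The three steps above can be sketched in Python, whose native big integers stand in for the 2048-bit registers. The function name `mod_pow` and the `bits` parameter are my own, hypothetical choices:

```python
def mod_pow(base, exponent, modulus, bits=2048):
    """Left-to-right square-and-multiply, following the three
    numbered steps, with a fixed-width exponent register."""
    accumulator = 1
    base %= modulus
    for i in range(bits - 1, -1, -1):
        # Step 1: square the value in the accumulator.
        accumulator = (accumulator * accumulator) % modulus
        # Step 2: shift out the next most-significant bit of the exponent.
        bit = (exponent >> i) & 1
        # Step 3: if that bit is (1), multiply the base in an extra time.
        if bit:
            accumulator = (accumulator * base) % modulus
    return accumulator
```

Every multiplication is reduced modulo the modulus immediately, which is what keeps the intermediate values within the 2048-bit width.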

(Update 09/10/2018, 7h45 : )

I am well aware that when Step 1 above is executed for the first time, it has no effect. In fact, depending on how many most-significant zeroes the exponent has, this can even repeat itself. One way in which this ‘waste’ of CPU cycles can be removed is by changing Step 1 above to read:

  1. If the value in the accumulator is Greater Than 1, square it.

But the problem with such a supposed ‘optimization’, when performed with a 2048-bit exponent, is that the value in the accumulator needs to be compared with (1) 2048 times, and, as long as the accumulator also has 2048 bits, depending on how it’s stored, each comparison may need to traverse the entire accumulator. If we can assume that only a few most-significant bits of the exponent are in fact zeroes, such an ‘optimization’ might actually slow the loop down.
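A cheaper way to skip the leading zeroes is to measure the exponent's bit length once, before the loop, rather than comparing the accumulator against (1) on every iteration. A minimal sketch, again with a hypothetical function name, and using Python's built-in `int.bit_length()` to stand in for the one-time measurement:

```python
def mod_pow_trimmed(base, exponent, modulus):
    """Square-and-multiply that skips the exponent's most-significant
    zeroes by measuring its bit length once, up front, instead of
    testing the accumulator on every pass."""
    accumulator = 1
    base %= modulus
    # Start at the highest bit that is actually set.
    for i in range(exponent.bit_length() - 1, -1, -1):
        accumulator = (accumulator * accumulator) % modulus
        if (exponent >> i) & 1:
            accumulator = (accumulator * base) % modulus
    return accumulator
```

The cost of the one-time bit-length measurement is amortized over the whole loop, instead of being paid 2048 times.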

(Updated 09/15/2018, 12h50 : )

If the reader needs to implement this type of exponentiation on top of an arbitrary-precision integer library, on the assumption that the exponent may have an arbitrary length – i.e. be 17 bits long as easily as 2048 bits long – then the first variable to take into account is whether the library stores the least-significant bits in the first word linked to, such that a linked list leads to the progressively more-significant bits indirectly. This could be seen as a kind of equivalent to Little-Endian representation, except that it exists with linked lists, instead of with contiguous bytes that form a fixed-length field.

If this is the case, the organization of an integer as a linked list is not optimal for exponentiation. In such a case, primitive operations on the linked list would be helpful, that break it down into a least-significant word first, followed by the more-significant words. This operation is mathematically equivalent to dividing by ( 2^(Word-Size) ) and additionally forming the remainder. However, to make this operation speedy, it is best implemented as a C or C++ subroutine, even if the arbitrary-precision integer library exposes purely mathematical operations as an API.

If instead the application programmer is only able to access purely mathematical operations from such a library, then another approach would be first to count how many bits long the exponent is, say by halving it repeatedly until it reaches zero. While doing so, it is also possible to construct a number whose bits are simply the bits of the exponent reversed.
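The counting-and-reversal pass can be sketched using only the arithmetic operations such a library would expose – halving, remainder, multiplication and addition. The function name is hypothetical:

```python
def count_and_reverse_bits(exponent):
    """Count the exponent's bits by repeated halving, while building
    an integer whose bits are the exponent's bits in reverse order,
    using only purely mathematical operations."""
    length = 0
    reversed_bits = 0
    while exponent > 0:
        remainder = exponent % 2        # the least-significant bit
        exponent = exponent // 2        # halve
        # Push that bit onto the low end of the reversed number.
        reversed_bits = reversed_bits * 2 + remainder
        length += 1
    return length, reversed_bits
```

Note that the bit count must be kept alongside the reversed number: if the original exponent's least-significant bits were zeroes, the reversed integer alone would not record them.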

If the library exposes functions that determine the quotient and remainder of dividing by ( 2^(Word-Size) ), then the application programmer can do so first, and devise a subroutine which only elects to exponentiate a base with ( E < 2^(Word-Size) ). In that case, exponentiation by an arbitrary-length exponent can be composited out of numerous invocations of this smaller exponentiation, ‘by chunks’. And again, the exponentiation starts with the most-significant chunk.
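The chunked approach might look as follows. The word size of 32 bits is an assumption, the function name is hypothetical, and Python's built-in three-argument `pow` stands in for the small-exponent subroutine that handles ( E < 2^(Word-Size) ):

```python
def mod_pow_chunked(base, exponent, modulus, word_size=32):
    """Exponentiation 'by chunks': split the exponent into word-size
    pieces via repeated quotient-and-remainder by 2**word_size, then
    consume the chunks most-significant first."""
    radix = 1 << word_size
    # Decompose the exponent; chunks come out least-significant first.
    chunks = []
    while exponent > 0:
        chunks.append(exponent % radix)     # remainder
        exponent //= radix                  # quotient
    accumulator = 1
    base %= modulus
    for chunk in reversed(chunks):          # most-significant chunk first
        # Raise the accumulator to the power 2**word_size ...
        for _ in range(word_size):
            accumulator = (accumulator * accumulator) % modulus
        # ... then fold in base**chunk, the small exponentiation.
        accumulator = (accumulator * pow(base, chunk, modulus)) % modulus
    return accumulator
```

Each chunk contributes base^(chunk · 2^(word_size · position)), so processing the chunks most-significant first, with word_size squarings between them, reproduces the full exponent.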

If we can’t do this, then an arbitrary-precision integer library can also test the least-significant bit repeatedly, as it halves the resulting, derived integer. But there would be a considerable performance penalty in doing that, if the API operation to divide a 2048-bit number by 2 has not also been optimized to arrive solely at a remainder in better than linear time.
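Given the bit-reversed exponent and its bit count described earlier, the main loop could be driven by exactly this halving-and-testing pattern. A sketch under those assumptions, with a hypothetical function name:

```python
def mod_pow_reversed(base, reversed_exponent, bit_count, modulus):
    """Square-and-multiply driven by a pre-reversed exponent: halving
    the reversed value and testing its least-significant bit yields
    the original exponent's bits most-significant first."""
    accumulator = 1
    base %= modulus
    for _ in range(bit_count):
        accumulator = (accumulator * accumulator) % modulus
        if reversed_exponent % 2 == 1:      # least-significant-bit test
            accumulator = (accumulator * base) % modulus
        reversed_exponent //= 2             # halve
    return accumulator
```

For example, the exponent 13 (binary 1101) reverses to 11 (binary 1011) with a bit count of 4, and feeding those in reproduces base^13 modulo the modulus.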