





#### A 3.25 GHz Large-Integer Extended GCD Accelerator in 12 nm

#### <u>Kavya Sreedhar</u><sup>1</sup>, Gedeon Nyengele<sup>1</sup>, Mark Horowitz<sup>1</sup>, Christopher Torng<sup>2</sup>

#### <sup>1</sup>Stanford University, USA <sup>2</sup>University of Southern California, USA

skavya@stanford.edu





Many cryptography algorithms rely on fast extended GCD (XGCD)
 Recent applications are dominated by large-integer XGCD

 $\Box$  XGCD computes Bézout coefficients  $b_a$ ,  $b_b$  satisfying Bézout's Identity

$$b_a, b_b : b_a * a_0 + b_b * b_0 = gcd(a_0, b_0)$$

- □ We present the first chip for XGCD
  - Building from our carry-save-adder subtraction-based algorithm (CHES'22)

# Reducing inputs to find the GCD



- □ Algorithms use GCD-preserving transformations to find the GCD
- $\Box$  Iterations end when one of  $a_i, b_i$  is zero and the other is the GCD
- $\Box$  Every cycle, either  $a_i$  or  $b_i$  is updated
  - When  $a_i$  or  $b_i$  is even, shift by one, two, or three
  - When  $a_i$  and  $b_i$  are odd, compute  $(a_i + b_i)/4$  or  $(a_i b_i)/4$



XGCD:  $b_a * a_0 + b_b * b_0 = gcd(a_0, b_0)$ 

- $\Box$  Need to maintain two equations since either  $a_i$  or  $b_i$  will become zero
  - $u_i * a_0 + m_i * b_0 = a_i$
  - $y_i * a_0 + n_i * b_0 = b_i$
- $\Box$  Every cycle, either  $a_i, u_i, m_i$  or  $b_i, y_i, n_i$  is updated
- *u<sub>i</sub>*, *m<sub>i</sub>*, *y<sub>i</sub>*, *n<sub>i</sub>* are updated each cycle in a way that ensures these equations hold
   Divisibility of these variables may not match the divisibility of *a<sub>i</sub>*, *b<sub>i</sub>*



XGCD: 
$$b_a * a_0 + b_b * b_0 = gcd(a_0, b_0)$$

- $\Box$  At the end: one of  $a_i, b_i$  is zero and the other is the GCD
- $\Box \quad \mathsf{GCD:} \ \mathsf{gcd}(a_0, b_0) = a_i + b_i$
- □ Bézout coefficients:  $b_a = u_f + y_f$ ;  $b_b = m_f + n_f$





- Precomputes constant multiples of a<sub>0</sub>, b<sub>0</sub>
- Given four clock cycles to complete





а

b

.



Reduction  $a_f, b_f$  $a_{i+1}$ update  $a_i$  $\rightarrow a_i \rightarrow$  $u_f, y_f$  $a_0, \dots, 7a_0$  $m_f, n_f$ Pre $b_{i+1}$  $b_0, \dots, 7b_0$ processing update  $b_i$ CLK  $u_{i+1}$  $u_i \rightarrow update u_i$ Precomputes constant CLK multiples of  $a_0, b_0$ Given four clock update  $y_i$ cycles to complete CLK  $m_{i+1}$ update  $m_i$ Iteratively reduces variables  $\rightarrow n_i \rightarrow$  update  $n_i$ Requires hundreds of iterations CLK



 $\begin{array}{c|c} a \\ \hline b \\ \hline b \\ \hline processing \end{array} \begin{array}{c} a_0, \dots, 7a_0 \\ b_0, \dots, 7b_0 \end{array}$ 

- Precomputes constant multiples of  $a_0, b_0$
- Given four clock cycles to complete





- Computes final XGCD results
- Given four clock cycles to complete

Iteratively reduces variables Requires hundreds of iterations



- □ There are 23 possible updates for each of the 4 Bézout variables
- □ Most complicated Bézout update to maintain Bézout equations

 $(u_i \pm y_i \pm k b_0) \gg 2$ 



- □ There are 23 possible updates for each of the 4 Bézout variables
- □ Most complicated Bézout update to maintain Bézout equations

$$(u_i \pm y_i \pm kb_0) \gg 2$$
  
 $1 \quad 1 \quad 1$   
CS form constant

- □ To accelerate this algorithm, we
  - Compute intermediates in carry-save (CS) form: no carry-propagation
  - Use late selects and precompute control signals to hide control delay

#### Critical path







- 1. Cannot quickly compare large-integer values in CS form
- 2. Driving control signals is expensive
- 3. Shifting in CS form requires care



#### **1.** Cannot quickly compare large-integer values in CS form

- 2. Driving control signals is expensive
- 3. Shifting in CS form requires care





 $\Box$  When  $a_i, b_i$  are odd, want to compare  $a_i > b_i$ 





- $\Box$  When  $a_i, b_i$  are odd, want to compare  $a_i > b_i$
- $\Box$  We track the approximate size of  $a_i, b_i$
- **Compute**  $\delta \approx \log_2 a_i \log_2 b_i$ , building from prior work
- $\Box$  sign( $\delta$ ) approximates sign( $a_i b_i$ )



- $\Box$  When  $a_i$  or  $b_i$  is even, divide by a power of two up until eight
  - $\delta$  is correctly updated by the number of bits reduced



- $\Box$  When  $a_i$  or  $b_i$  is even, divide by a power of two up until eight
  - $\delta$  is correctly updated by the number of bits reduced
- □ When  $a_i$  and  $b_i$  are odd, compute  $(a_i + b_i)/4$  or  $(a_i b_i)/4$ 
  - $\delta$  is conservatively updated by one
  - However, there can be more cancellation: > 1 bit could have been reduced
- □ As a result,  $sign(\delta) \neq sign(a_i b_i)$  25% of the time



- □ Wrong variable can be updated, but algorithm still works
- □ Average cycle count increases by 15%
- Critical path delay is 4X shorter compared to using a slow full comparison
- **G** For 512-bit inputs, updating  $\delta$  requires a 10-bit carry-propagation adder



- 1. Cannot quickly compare large-integer values in CS form
- **2.** Driving control signals is expensive
- 3. Shifting in CS form requires care





















#### Using late selects



## Using late selects





## Precomputing control





## Precomputing control





# Precomputing control







- □ We observe that in the XGCD algorithm
  - Different variables must have the same divisibility
  - Some sum variable updates are the same
  - Some difference variable updates only differ by a sign





 $(1.0 = \text{area of baseline design with CSAs and } \delta)$ 



- 1. Cannot quickly compare large-integer values in CS form
- 2. Driving control signals is expensive
- **3.** Shifting in CS form requires care



#### □ The problem

- Our hardware has negative numbers  $\rightarrow$  must preserve sign when shifting
- Shifted result can be off by one when both *carry* and *sum* are odd



#### □ The problem

- Our hardware has negative numbers  $\rightarrow$  must preserve sign when shifting
- Shifted result can be off by one when both *carry* and *sum* are odd

#### The solution

- Build from prior work to specialize logic to minimize delay of signed logic shifting
- Use a half adder to insert shifted out carry
  - $\hfill\square$  Make detecting when to apply this correction cheap









#### Execution modes:

- Fast
- Constant time
- Debugging





Execution modes:

- Fast
- Constant time
- Debugging

|                           | 512-bit XGCD   | 255-bit XGCD |
|---------------------------|----------------|--------------|
| Technology                | GF 12nm FinFET |              |
| Voltage                   | 0.9V           |              |
| Input Bitwidth            | 512            | 255          |
| Max Clock Frequency (GHz) | 3.25           | 3.25         |

#### **Design performance is independent of bitwidth with CSAs**





Execution modes:

- Fast
- Constant time
- Debugging

|                               | 512-bit XGCD   | 255-bit XGCD |
|-------------------------------|----------------|--------------|
| Technology                    | GF 12nm FinFET |              |
| Voltage                       | 0.9V           |              |
| Input Bitwidth                | 512            | 255          |
| Max Clock Frequency (GHz)     | 3.25           | 3.25         |
| Total Area (mm <sup>2</sup> ) | 0.25           | 0.076        |

#### The 255-bit XGCD unit was optimized for constant-time applications





Execution modes:

- Fast
- Constant time
- Debugging

|                                          | 512-bit XGCD   | 255-bit XGCD |
|------------------------------------------|----------------|--------------|
| Technology                               | GF 12nm FinFET |              |
| Voltage                                  | 0.9V           |              |
| Input Bitwidth                           | 512            | 255          |
| Max Clock Frequency (GHz)                | 3.25           | 3.25         |
| Total Area (mm <sup>2</sup> )            | 0.25           | 0.076        |
| Average Execution Time (ns)              |                | 119          |
| <b>Constant-time Execution Time (ns)</b> |                | 119          |





Execution modes:

- Fast
- Constant time
- Debugging

|                                          | 512-bit XGCD   | 255-bit XGCD |
|------------------------------------------|----------------|--------------|
| Technology                               | GF 12nm FinFET |              |
| Voltage                                  | 0.9V           |              |
| Input Bitwidth                           | 512            | 255          |
| Max Clock Frequency (GHz)                | 3.25           | 3.25         |
| Total Area (mm <sup>2</sup> )            | 0.25           | 0.076        |
| Average Execution Time (ns)              | 176            | 119          |
| <b>Constant-time Execution Time (ns)</b> | 239            | 119          |



- Can easily scale performance of our design to the bitwidth in prior work
   All non-constant-time prior work used division algorithms
- □ 18X faster than prior 1024-bit HW simulations (APCCAS'20, ISCAS'21)
- □ 19X faster than a chip with a 4096-bit mod inv unit (ISSCC'23)
- □ Area comparisons with bitwidth and technology scaling
  - 8.5X to 14X smaller than APCCAS'20
  - 2X to 3.4X smaller than ISCAS'21

#### Constant-time XGCD





- □ JSSC'19, FPL'21 use division algos
- □ 255-bit constant-time XGCD speedups
  - 303X faster than JSSC'19
  - 344X faster than FPL'21
  - 23X faster than ePrint'20
- Our fast average execution is 1.4X
   faster than our constant-time XGCD

# Key contributions

- We present the first ASIC for XGCD, with a constant-time config
   Validates the advantage of subtraction-based XGCD algorithms in HW
- □ We address challenges with implementing CSAs in practice
  - Avoid large-integer comparisons for control
  - Use late selects and precompute control to minimize control delay
  - Contribute efficient circuitry for shifting in CS form
- Our chip is 23X faster than SW, 18X faster than HW sims, and 303X faster than prior chips for modular inversion









43

