# Factoring integers with ECM on the Kalray MPPA-256 processor

## Jérémie Detrey

CARAMBA team, LORIA INRIA Nancy — Grand Est Jeremie.Detrey@loria.fr

Joint work with:

Masahiro Ishii (NAIST, Nara, Japan)

Pierrick Gaudry (CARAMBA team, LORIA)

Atsuo Inomata (NAIST, Nara, Japan) Kazutoshi Fujikawa (NAIST, Nara, Japan)











#### **Context: Integer factorization**

- ► Central problem in public-key cryptography:
  - integer factorization is a (supposedly) difficult problem
  - e.g., basis for the security of the RSA public-key cryptosystem:
    - private key: large primes p and q
    - ▶ public key:  $N = p \cdot q$



#### **Context: Integer factorization**

- ► Central problem in public-key cryptography:
  - integer factorization is a (supposedly) difficult problem
  - e.g., basis for the security of the RSA public-key cryptosystem:
    - private key: large primes p and q
    - ▶ public key:  $N = p \cdot q$



- Efficient factorization is important for
  - establishing factorization records
  - recommending secure key lengths
  - breaking weak instances of RSA (short keys)

#### **Context: Integer factorization**

- ► Central problem in public-key cryptography:
  - integer factorization is a (supposedly) difficult problem
  - e.g., basis for the security of the RSA public-key cryptosystem:
    - private key: large primes p and q
    - ▶ public key:  $N = p \cdot q$



- ► Efficient factorization is important for
  - establishing factorization records
  - recommending secure key lengths
  - breaking weak instances of RSA (short keys)
- Current (publicly known) record:
  - factorization of the RSA-768 challenge (768 bits, or 232 digits)
  - ullet the computation took  $\sim$  2000 core-years [Kleinjung *et al.*, 2010]

 $\triangleright$  Find small- to medium-size prime factors p of an integer N:

- $\triangleright$  Find small- to medium-size prime factors p of an integer N:
  - Trial division:  $\tilde{O}(p)$

- $\triangleright$  Find small- to medium-size prime factors p of an integer N:
  - Trial division:  $\tilde{O}(p) = \tilde{O}(\exp(\log p))$

- $\triangleright$  Find small- to medium-size prime factors p of an integer N:
  - Trial division:  $\tilde{O}(p) = \tilde{O}(\exp(\log p))$
  - ECM (Elliptic Curve Method) [Lenstra, 1987]:

$$\exp\left(\left(\sqrt{2} + o(1)\right)\sqrt{\log p \log \log p}\right)$$

- $\triangleright$  Find small- to medium-size prime factors p of an integer N:
  - Trial division:  $\tilde{O}(p) = \tilde{O}(\exp(\log p))$
  - ECM (Elliptic Curve Method) [Lenstra, 1987]:

$$\exp\left(\left(\sqrt{2} + o(1)\right)\sqrt{\log p \log \log p}\right)$$

► Find all prime factors of an integer *N*:

- $\triangleright$  Find small- to medium-size prime factors p of an integer N:
  - Trial division:  $\tilde{O}(p) = \tilde{O}(\exp(\log p))$
  - ECM (Elliptic Curve Method) [Lenstra, 1987]:

$$\exp\left(\left(\sqrt{2} + o(1)\right)\sqrt{\log p \log \log p}\right)$$

- Find all prime factors of an integer N:
  - (G)NFS (General Number Field Sieve) [Buhler et al., 1993]:

$$\exp\left(\left(\sqrt[3]{\frac{64}{9}}+o(1)\right)\left(\log N\right)^{1/3}\left(\log\log N\right)^{2/3}\right)$$

- $\triangleright$  Find small- to medium-size prime factors p of an integer N:
  - Trial division:  $\tilde{O}(p) = \tilde{O}(\exp(\log p))$
  - ECM (Elliptic Curve Method) [Lenstra, 1987]:

$$\exp\left(\left(\sqrt{2} + o(1)\right)\sqrt{\log p \log \log p}\right)$$

- Find all prime factors of an integer N:
  - (G)NFS (General Number Field Sieve) [Buhler et al., 1993]:

$$\exp\left(\left(\sqrt[3]{\frac{64}{9}}+o(1)\right)\left(\log N\right)^{1/3}\left(\log\log N\right)^{2/3}\right)$$

▶ For factoring RSA moduli ( $\sim$  500 bits and above):

- $\triangleright$  Find small- to medium-size prime factors p of an integer N:
  - Trial division:  $\tilde{O}(p) = \tilde{O}(\exp(\log p))$
  - ECM (Elliptic Curve Method) [Lenstra, 1987]:

$$\exp\left(\left(\sqrt{2} + o(1)\right)\sqrt{\log p \log \log p}\right)$$

- Find all prime factors of an integer N:
  - (G)NFS (General Number Field Sieve) [Buhler et al., 1993]:

$$\exp\left(\left(\sqrt[3]{\frac{64}{9}}+o(1)\right)\left(\log N\right)^{1/3}\left(\log\log N\right)^{2/3}\right)$$

- ▶ For factoring RSA moduli ( $\sim$  500 bits and above):
  - use NFS (best asymptotic complexity)

- $\triangleright$  Find small- to medium-size prime factors p of an integer N:
  - Trial division:  $\tilde{O}(p) = \tilde{O}(\exp(\log p))$
  - ECM (Elliptic Curve Method) [Lenstra, 1987]:

$$\exp\left(\left(\sqrt{2} + o(1)\right)\sqrt{\log p \log \log p}\right)$$

- Find all prime factors of an integer N:
  - (G)NFS (General Number Field Sieve) [Buhler et al., 1993]:

$$\exp\left(\left(\sqrt[3]{\frac{64}{9}}+o(1)\right)\left(\log N\right)^{1/3}\left(\log\log N\right)^{2/3}\right)$$

- ▶ For factoring RSA moduli ( $\sim$  500 bits and above):
  - use NFS (best asymptotic complexity)
  - ullet requires factoring a huge quantity of smaller integers ( $\sim$  200 bits)

- $\triangleright$  Find small- to medium-size prime factors p of an integer N:
  - Trial division:  $\tilde{O}(p) = \tilde{O}(\exp(\log p))$
  - ECM (Elliptic Curve Method) [Lenstra, 1987]:

$$\exp\left(\left(\sqrt{2} + o(1)\right)\sqrt{\log p \log \log p}\right)$$

- ► Find all prime factors of an integer *N*:
  - (G)NFS (General Number Field Sieve) [Buhler et al., 1993]:

$$\exp\left(\left(\sqrt[3]{\frac{64}{9}}+o(1)\right)\left(\log N\right)^{1/3}\left(\log\log N\right)^{2/3}\right)$$

- ▶ For factoring RSA moduli ( $\sim$  500 bits and above):
  - use NFS (best asymptotic complexity)
  - ullet requires factoring a huge quantity of smaller integers ( $\sim$  200 bits)
    - $\rightarrow$  use ECM for those

#### Outline of the talk

- ECM in a nutshell
- ► The Kalray MPPA-256 processor
- ► Multiprecision modular arithmetic
- Results and conclusion

#### Outline of the talk

- ► ECM in a nutshell
- ► The Kalray MPPA-256 processor
- ► Multiprecision modular arithmetic
- Results and conclusion

▶ Let *K* be a field

- ▶ Let K be a field
- ► An elliptic curve *E* over *K* is a projective plane curve given by an equation of the form

$$E: y^2 = x^3 + Ax + B$$
,

with parameters  $A, B \in K$  so that E is smooth

- ▶ Let K be a field
- ► An elliptic curve *E* over *K* is a projective plane curve given by an equation of the form

$$E: y^2 = x^3 + Ax + B,$$

with parameters  $A, B \in K$  so that E is smooth

 $\blacktriangleright$  Its set of points on K is given as

$$E(K) = \{(x, y) \in K \times K \mid (x, y) \text{ satisfies } E\}$$

- ▶ Let *K* be a field
- ► An elliptic curve *E* over *K* is a projective plane curve given by an equation of the form

$$E: y^2 = x^3 + Ax + B,$$

with parameters  $A, B \in K$  so that E is smooth

▶ Its set of points on *K* is given as

$$E(K) = \{(x, y) \in K \times K \mid (x, y) \text{ satisfies } E\} \cup \{\mathcal{O}\},$$

where  $\mathcal{O}$  is called the point at infinity

- ▶ Let K be a field
- ► An elliptic curve *E* over *K* is a projective plane curve given by an equation of the form

$$E: y^2 = x^3 + Ax + B,$$

with parameters  $A, B \in K$  so that E is smooth

▶ Its set of points on *K* is given as

$$E(K) = \{(x, y) \in K \times K \mid (x, y) \text{ satisfies } E\} \cup \{\mathcal{O}\},$$

where  $\mathcal{O}$  is called the point at infinity

- $\blacktriangleright$  One can define an commutative addition law on E(K):
  - O is the neutral element
  - E(K) is therefore an abelian group

- ▶ Let *K* be a field
- ► An elliptic curve *E* over *K* is a projective plane curve given by an equation of the form

$$E: y^2 = x^3 + Ax + B,$$

with parameters  $A, B \in K$  so that E is smooth

▶ Its set of points on *K* is given as

$$E(K) = \{(x, y) \in K \times K \mid (x, y) \text{ satisfies } E\} \cup \{\mathcal{O}\},$$

where  $\mathcal{O}$  is called the point at infinity

- ▶ One can define an commutative addition law on E(K):
  - O is the neutral element
  - E(K) is therefore an abelian group
  - if K is finite, then so is E(K)





























#### **Back to integer factorization**

$$E/K: y^2 = x^3 + Ax + B$$

$$E/K: y^2 = x^3 + Ax + B$$

▶ What happens if we take  $K = \mathbb{Z}/N\mathbb{Z}$ , with N a composite integer?

$$E/K: y^2 = x^3 + Ax + B$$

- ▶ What happens if we take  $K = \mathbb{Z}/N\mathbb{Z}$ , with N a composite integer?
  - K is not a field, and  $E(\mathbb{Z}/N\mathbb{Z})$  is **not a group!**

$$E/K: y^2 = x^3 + Ax + B$$

- ▶ What happens if we take  $K = \mathbb{Z}/N\mathbb{Z}$ , with N a composite integer?
  - K is not a field, and  $E(\mathbb{Z}/N\mathbb{Z})$  is **not a group!**
  - however, by Chinese remaindering,  $E(\mathbb{Z}/N\mathbb{Z})$  embeds a copy of  $E(\mathbb{F}_{p_i})$  for each prime  $p_i$  dividing N:

$$E(\mathbb{Z}/N\mathbb{Z})\cong E(\mathbb{F}_{p_1})\times E(\mathbb{F}_{p_2})\times \cdots \times E(\mathbb{F}_{p_r}) \times \text{some other stuff}$$

$$E/K: y^2 = x^3 + Ax + B$$

- ▶ What happens if we take  $K = \mathbb{Z}/N\mathbb{Z}$ , with N a composite integer?
  - K is not a field, and  $E(\mathbb{Z}/N\mathbb{Z})$  is **not** a **group**!
  - however, by Chinese remaindering,  $E(\mathbb{Z}/N\mathbb{Z})$  embeds a copy of  $E(\mathbb{F}_{p_i})$  for each prime  $p_i$  dividing N:

$$E(\mathbb{Z}/N\mathbb{Z})\cong E(\mathbb{F}_{p_1})\times E(\mathbb{F}_{p_2})\times \cdots \times E(\mathbb{F}_{p_r}) \times \text{some other stuff}$$

- ▶ In fact, the addition law usually breaks when the result
  - is  $\mathcal{O}$  modulo some (but not all) of the  $p_i$ 's
  - is not O modulo the other ones

$$E/K: y^2 = x^3 + Ax + B$$

- ▶ What happens if we take  $K = \mathbb{Z}/N\mathbb{Z}$ , with N a composite integer?
  - K is not a field, and  $E(\mathbb{Z}/N\mathbb{Z})$  is **not** a **group**!
  - however, by Chinese remaindering,  $E(\mathbb{Z}/N\mathbb{Z})$  embeds a copy of  $E(\mathbb{F}_{p_i})$  for each prime  $p_i$  dividing N:

$$E(\mathbb{Z}/N\mathbb{Z})\cong E(\mathbb{F}_{p_1})\times E(\mathbb{F}_{p_2})\times \cdots \times E(\mathbb{F}_{p_r}) \times \text{some other stuff}$$

- ▶ In fact, the addition law usually breaks when the result
  - is  $\mathcal{O}$  modulo some (but not all) of the  $p_i$ 's
  - is not O modulo the other ones
- ▶ In that case, a non-zero, non-invertible element  $\xi \in \mathbb{Z}/N\mathbb{Z}$  pops up

$$E/K: y^2 = x^3 + Ax + B$$

- ▶ What happens if we take  $K = \mathbb{Z}/N\mathbb{Z}$ , with N a composite integer?
  - K is not a field, and  $E(\mathbb{Z}/N\mathbb{Z})$  is **not** a **group**!
  - however, by Chinese remaindering,  $E(\mathbb{Z}/N\mathbb{Z})$  embeds a copy of  $E(\mathbb{F}_{p_i})$  for each prime  $p_i$  dividing N:

$$E(\mathbb{Z}/N\mathbb{Z})\cong E(\mathbb{F}_{p_1})\times E(\mathbb{F}_{p_2})\times \cdots \times E(\mathbb{F}_{p_r}) \times \text{some other stuff}$$

- ▶ In fact, the addition law usually breaks when the result
  - is  $\mathcal{O}$  modulo some (but not all) of the  $p_i$ 's
  - is not O modulo the other ones
- In that case, a non-zero, non-invertible element  $\xi \in \mathbb{Z}/N\mathbb{Z}$  pops up
  - $\rightarrow$  Compute  $gcd(\xi, N)$  and collect a non-trivial factor!

▶ Parameters: bounds  $B_1$  and  $B_2$  with  $0 < B_1 < B_2$ 

- ▶ Parameters: bounds  $B_1$  and  $B_2$  with  $0 < B_1 < B_2$
- $\triangleright$  For a given composite integer N:

- ▶ Parameters: bounds  $B_1$  and  $B_2$  with  $0 < B_1 < B_2$
- $\triangleright$  For a given composite integer N:
  - pick a random elliptic curve E over  $\mathbb{Z}/N\mathbb{Z}$

- ▶ Parameters: bounds  $B_1$  and  $B_2$  with  $0 < B_1 < B_2$
- ► For a given composite integer *N*:
  - pick a random elliptic curve E over  $\mathbb{Z}/N\mathbb{Z}$
  - pick a random point  $P \in E(\mathbb{Z}/N\mathbb{Z}) \setminus \{\mathcal{O}\}$

- ▶ Parameters: bounds  $B_1$  and  $B_2$  with  $0 < B_1 < B_2$
- $\triangleright$  For a given composite integer N:
  - pick a random elliptic curve E over  $\mathbb{Z}/N\mathbb{Z}$
  - pick a random point  $P \in E(\mathbb{Z}/N\mathbb{Z}) \setminus \{\mathcal{O}\}$
  - compute  $Q \leftarrow kP$ , with  $k = \prod_{\pi^e < B_1} \pi^e$  (prime powers less than  $B_1$ )

- ▶ Parameters: bounds  $B_1$  and  $B_2$  with  $0 < B_1 < B_2$
- ► For a given composite integer *N*:
  - pick a random elliptic curve E over  $\mathbb{Z}/N\mathbb{Z}$
  - pick a random point  $P \in E(\mathbb{Z}/N\mathbb{Z}) \setminus \{\mathcal{O}\}$
  - compute  $Q \leftarrow kP$ , with  $k = \prod_{\pi^e \leq B_1} \pi^e$  (prime powers less than  $B_1$ ) if the group law fails  $\rightarrow$  get factor!

- ▶ Parameters: bounds  $B_1$  and  $B_2$  with  $0 < B_1 < B_2$
- ► For a given composite integer *N*:
  - pick a random elliptic curve E over  $\mathbb{Z}/N\mathbb{Z}$
  - pick a random point  $P \in E(\mathbb{Z}/N\mathbb{Z}) \setminus \{\mathcal{O}\}$
  - compute  $Q \leftarrow kP$ , with  $k = \prod_{\pi^e \leq B_1} \pi^e$  (prime powers less than  $B_1$ ) if the group law fails  $\rightarrow$  get factor!
  - compute  $\pi Q$  for every prime  $\pi$  with  $B_1 < \pi \leq B_2$

- ▶ Parameters: bounds  $B_1$  and  $B_2$  with  $0 < B_1 < B_2$
- $\triangleright$  For a given composite integer N:
  - pick a random elliptic curve E over  $\mathbb{Z}/N\mathbb{Z}$
  - pick a random point  $P \in E(\mathbb{Z}/N\mathbb{Z}) \setminus \{\mathcal{O}\}$
  - compute  $Q \leftarrow kP$ , with  $k = \prod_{\pi^e \leq B_1} \pi^e$  (prime powers less than  $B_1$ ) if the group law fails  $\rightarrow$  get factor!
  - compute  $\pi Q$  for every prime  $\pi$  with  $B_1 < \pi \le B_2$  if the group law fails  $\rightarrow$  get factor!

- ▶ Parameters: bounds  $B_1$  and  $B_2$  with  $0 < B_1 < B_2$
- ► For a given composite integer *N*:
  - pick a random elliptic curve E over  $\mathbb{Z}/N\mathbb{Z}$
  - pick a random point  $P \in E(\mathbb{Z}/N\mathbb{Z}) \setminus \{\mathcal{O}\}$
  - compute  $Q \leftarrow kP$ , with  $k = \prod_{\pi^e \leq B_1} \pi^e$  (prime powers less than  $B_1$ ) if the group law fails  $\rightarrow$  get factor!
  - compute  $\pi Q$  for every prime  $\pi$  with  $B_1 < \pi \le B_2$  if the group law fails  $\to$  get factor!
  - divide N by all factors found, rinse and repeat

- ▶ Parameters: bounds  $B_1$  and  $B_2$  with  $0 < B_1 < B_2$
- ► For a given composite integer *N*:
  - pick a random elliptic curve E over  $\mathbb{Z}/N\mathbb{Z}$
  - pick a random point  $P \in E(\mathbb{Z}/N\mathbb{Z}) \setminus \{\mathcal{O}\}$
  - compute  $Q \leftarrow kP$ , with  $k = \prod_{\pi^e \leq B_1} \pi^e$  (prime powers less than  $B_1$ ) if the group law fails  $\rightarrow$  get factor!
  - compute  $\pi Q$  for every prime  $\pi$  with  $B_1 < \pi \le B_2$  if the group law fails  $\to$  get factor!
  - divide N by all factors found, rinse and repeat
- ▶ It is a probabilistic algorithm
  - the larger  $B_1$  and  $B_2$ , the higher the probability of success

- ▶ Parameters: bounds  $B_1$  and  $B_2$  with  $0 < B_1 < B_2$
- ► For a given composite integer *N*:
  - pick a random elliptic curve E over  $\mathbb{Z}/N\mathbb{Z}$
  - pick a random point  $P \in E(\mathbb{Z}/N\mathbb{Z}) \setminus \{\mathcal{O}\}$
  - compute  $Q \leftarrow kP$ , with  $k = \prod_{\pi^e \leq B_1} \pi^e$  (prime powers less than  $B_1$ ) if the group law fails  $\rightarrow$  get factor!
  - compute  $\pi Q$  for every prime  $\pi$  with  $B_1 < \pi \le B_2$  if the group law fails  $\to$  get factor!
  - divide N by all factors found, rinse and repeat
- ▶ It is a probabilistic algorithm
  - the larger  $B_1$  and  $B_2$ , the higher the probability of success
  - ... but also the more expensive the computation

- ► Easily parallelizable:
  - try many different curves on a single N
  - try to factor several N's in parallel

- ► Easily parallelizable:
  - try many different curves on a single N
  - try to factor several N's in parallel
- ► State-of-the-art implementations:
  - software: EECM-MPFQ [Bernstein et al., 2010]
  - on GPUs: [Bos & Kleinjung, 2012] and [Miele et al., 2014]

- ► Easily parallelizable:
  - try many different curves on a single N
  - try to factor several N's in parallel
- ► State-of-the-art implementations:
  - software: EECM-MPFQ [Bernstein et al., 2010]
  - on GPUs: [Bos & Kleinjung, 2012] and [Miele et al., 2014]
- Manycore processors are potentially good target architectures for ECM

#### Outline of the talk

- ► ECM in a nutshell
- ► The Kalray MPPA-256 processor
- ► Multiprecision modular arithmetic
- Results and conclusion

► Kalray: French start-up (CEA spin-off), launched in 2008

- ► Kalray: French start-up (CEA spin-off), launched in 2008
- ▶ MPPA (Multi-Purpose Processor Architecture): (co)processor

- ► Kalray: French start-up (CEA spin-off), launched in 2008
- ▶ MPPA (Multi-Purpose Processor Architecture): (co)processor
- ▶ 256: 256 compute cores

- ► Kalray: French start-up (CEA spin-off), launched in 2008
- ▶ MPPA (Multi-Purpose Processor Architecture): (co)processor
- ▶ 256: 256 compute cores
- ► For more info, visit www.kalray.eu (or ask Nicolas Brunie!)

► Global view:

- ► Global view:
  - 4 quad-core processors for I/Os (DDR RAM, PCI-e, etc.)



- ► Global view:
  - 4 quad-core processors for I/Os (DDR RAM, PCI-e, etc.)
  - 4 × 4 compute clusters (CC)



- ► Global view:
  - 4 quad-core processors for I/Os (DDR RAM, PCI-e, etc.)
  - 4 × 4 compute clusters (CC)

toric network-on-chip (NoC)



#### ► Global view:

• 4 quad-core processors for I/Os (DDR RAM, PCI-e, etc.)

4 × 4 compute clusters (CC)

toric network-on-chip (NoC)

frequency: 400 MHz

low power: ≤12-16 W



▶ In each compute cluster:



- ▶ In each compute cluster:
  - 16 compute cores (PE)



- ▶ In each compute cluster:
  - 16 compute cores (PE)
  - + 1 system core (RM)



- ▶ In each compute cluster:
  - 16 compute cores (PE)
  - + 1 system core (RM)
  - 2 MB shared memory



- ► In each compute cluster:
  - 16 compute cores (PE)
  - $\bullet$  + 1 system core (RM)
  - 2 MB shared memory
  - network interfaces



# **Compute clusters**

- ▶ In each compute cluster:
  - 16 compute cores (PE)
  - $\bullet$  + 1 system core (RM)
  - 2 MB shared memory
  - network interfaces
  - NodeOS (POSIX-like) + pthreads



► Kalray-1 (K1) microarchitecture

- ► Kalray-1 (K1) microarchitecture
- ▶ 32-bit processor:
  - 64 registers of 32 bits each
  - 32- and 64-bit loads / stores

- ► Kalray-1 (K1) microarchitecture
- ▶ 32-bit processor:
  - 64 registers of 32 bits each
  - 32- and 64-bit loads / stores
- ▶ 5 computation units:

- ► Kalray-1 (K1) microarchitecture
- ▶ 32-bit processor:
  - 64 registers of 32 bits each
  - 32- and 64-bit loads / stores
- 5 computation units:
  - ALU<sub>0</sub> (32 bits)
  - ALU<sub>1</sub> (32 bits)

- ► Kalray-1 (K1) microarchitecture
- ▶ 32-bit processor:
  - 64 registers of 32 bits each
  - 32- and 64-bit loads / stores
- ▶ 5 computation units:
  - ALU<sub>0</sub> (32 bits)
     ALU<sub>1</sub> (32 bits)

- ► Kalray-1 (K1) microarchitecture
- ▶ 32-bit processor:
  - 64 registers of 32 bits each
  - 32- and 64-bit loads / stores
- ▶ 5 computation units:
  - ALU<sub>0</sub> (32 bits)
     ALU<sub>1</sub> (32 bits)
     64-bit ALU
  - MAU (multiply & add)

- ► Kalray-1 (K1) microarchitecture
- ▶ 32-bit processor:
  - 64 registers of 32 bits each
  - 32- and 64-bit loads / stores
- ▶ 5 computation units:
  - ALU<sub>0</sub> (32 bits)
     ALU<sub>1</sub> (32 bits)
     64-bit ALU
  - MAU (multiply & add) / FPU (floating point)

- ► Kalray-1 (K1) microarchitecture
- ▶ 32-bit processor:
  - 64 registers of 32 bits each
  - 32- and 64-bit loads / stores
- ▶ 5 computation units:
  - ALU<sub>0</sub> (32 bits)
     ALU<sub>1</sub> (32 bits)
     64-bit ALU
  - MAU (multiply & add) / FPU (floating point) / ALU<sub>tiny</sub>

- ► Kalray-1 (K1) microarchitecture
- ▶ 32-bit processor:
  - 64 registers of 32 bits each
  - 32- and 64-bit loads / stores
- ▶ 5 computation units:
  - ALU<sub>0</sub> (32 bits)
     ALU<sub>1</sub> (32 bits)
     64-bit ALU
  - MAU (multiply & add) / FPU (floating point) / ALU<sub>tiny</sub>
  - LSU (load & store)

- ► Kalray-1 (K1) microarchitecture
- ▶ 32-bit processor:
  - 64 registers of 32 bits each
  - 32- and 64-bit loads / stores
- ▶ 5 computation units:
  - ALU<sub>0</sub> (32 bits)
     ALU<sub>1</sub> (32 bits)
     64-bit ALU
  - MAU (multiply & add) / FPU (floating point) / ALU<sub>tiny</sub>
  - LSU (load & store) / ALU<sub>tiny</sub>

- ► Kalray-1 (K1) microarchitecture
- ▶ 32-bit processor:
  - 64 registers of 32 bits each
  - 32- and 64-bit loads / stores
- 5 computation units:
  - ALU<sub>0</sub> (32 bits)
     ALU<sub>1</sub> (32 bits)
     64-bit ALU
  - MAU (multiply & add) / FPU (floating point) / ALU<sub>tiny</sub>
  - LSU (load & store) / ALU<sub>tiny</sub>
  - BCU (branch & control)



- ▶ VLIW : Very Long Instruction Word
  - specify at each clock cycle what each computation unit will do
  - instruction word for a computation unit: 32 or 64 bits
  - one "instruction bundle" issued at each cycle: from 32 to 256 bits

- ▶ VLIW : Very Long Instruction Word
  - specify at each clock cycle what each computation unit will do
  - instruction word for a computation unit: 32 or 64 bits
  - one "instruction bundle" issued at each cycle: from 32 to 256 bits
- ▶ 8-stage pipeline (actually, 5 stages for most instructions)

- ▶ VLIW : Very Long Instruction Word
  - specify at each clock cycle what each computation unit will do
  - instruction word for a computation unit: 32 or 64 bits
  - one "instruction bundle" issued at each cycle: from 32 to 256 bits
- ▶ 8-stage pipeline (actually, 5 stages for most instructions)
- ► Low branching penalty:
  - 1 cycle for unconditional branches
  - 2 cycles for conditional branches

- ► Low-latency instructions, along with write-back bypass:
  - 1 cycle for ALU instructions (e.g., 32 or 64-bit addition)
  - 2 cycles for MAU instructions (e.g., muladd 64  $\leftarrow$  32  $\times$  32 + 64)

- ► Low-latency instructions, along with write-back bypass:
  - 1 cycle for ALU instructions (e.g., 32 or 64-bit addition)
  - 2 cycles for MAU instructions (e.g., muladd 64  $\leftarrow$  32  $\times$  32 + 64)
- Caches:
  - I-cache and D-cache of 8 kB each
  - fast: 32- or 64-bit cached load in 2 cycles

- ► Low-latency instructions, along with write-back bypass:
  - 1 cycle for ALU instructions (e.g., 32 or 64-bit addition)
  - 2 cycles for MAU instructions (e.g., muladd 64  $\leftarrow$  32  $\times$  32 + 64)
- Caches:
  - I-cache and D-cache of 8 kB each
  - fast: 32- or 64-bit cached load in 2 cycles
- ► Lots of useful less conventional instructions:
  - zero-penalty hardware loops
  - multiplication of 8 × 8 matrices over F<sub>2</sub>
  - arbitrary boolean functions  $\{0,1\}^4 \to \{0,1\}^2$ , vectorized on 32 bits
  - etc.

- ► Toolchain: k1-gcc, k1-ld, etc.
  - GCC backend for the K1 microarchitecture
  - LIBC port on NodeOS
  - seamless integration in usual C/ASM development environments

- ► Toolchain: k1-gcc, k1-ld, etc.
  - GCC backend for the K1 microarchitecture
  - LIBC port on NodeOS
  - seamless integration in usual C/ASM development environments
- ► An application = (at least) 3 binaries
  - 1 for the host PC (x86-64)
  - 1 for the PCI-e I/O processor (K1)
  - 1 or more for the compute clusters (K1)

- ► Toolchain: k1-gcc, k1-ld, etc.
  - GCC backend for the K1 microarchitecture
  - LIBC port on NodeOS
  - seamless integration in usual C/ASM development environments
- ► An application = (at least) 3 binaries
  - 1 for the host PC (x86-64)
  - 1 for the PCI-e I/O processor (K1)
  - 1 or more for the compute clusters (K1)
- ► A bit of Makefile magic can take care of everything

- ► Debugging:
  - simulator → execution traces
  - simulator + GDB
  - live debugging with GDB through JTAG port

- Debugging:
  - simulator → execution traces
  - simulator + GDB
  - live debugging with GDB through JTAG port
- Optimizing critical code:
  - extensive use of assembly language
  - execution times are very stable: reproducible benchmarks
  - can predict execution times with 1-cycle accuracy

► A few rules should be followed in order to gain a few extra cycles

- ► A few rules should be followed in order to gain a few extra cycles
- ► The Pre-Fetch Buffer (PFB) can only fetch 128 bits per clock cycle from the I-cache
  - $\Rightarrow$  no more than 128 bits per bundle, so as not to starve the PFB

- ► A few rules should be followed in order to gain a few extra cycles
- ► The Pre-Fetch Buffer (PFB) can only fetch 128 bits per clock cycle from the I-cache
  - $\Rightarrow$  no more than 128 bits per bundle, so as not to starve the PFB
- ▶ 1-cycle penalty for a load accessing two different 64-byte cache lines
  - ⇒ keep data aligned on 8-byte (64-bit) boundaries in memory
  - ⇒ pack instruction bundles with nops to maintain code alignment

- ► A few rules should be followed in order to gain a few extra cycles
- ► The Pre-Fetch Buffer (PFB) can only fetch 128 bits per clock cycle from the I-cache
  - $\Rightarrow$  no more than 128 bits per bundle, so as not to starve the PFB
- ▶ 1-cycle penalty for a load accessing two different 64-byte cache lines
  - ⇒ keep data aligned on 8-byte (64-bit) boundaries in memory
  - ⇒ pack instruction bundles with nops to maintain code alignment
- ▶ MAU: accumulator and result have to be pairs of registers  $r_{2i}$ :  $r_{2i+1}$ 
  - ⇒ if need be, use an explicit 64-bit addition to avoid this constraint

#### Outline of the talk

- ► ECM in a nutshell
- ► The Kalray MPPA-256 processor
- ► Multiprecision modular arithmetic
- Results and conclusion

- ► For a given integer *N* to be factored, ECM requires:
  - additions / subtractions modulo N
  - multiplications / squarings modulo N
  - one inversion modulo N
  - two GCDs

- ► For a given integer *N* to be factored, ECM requires:
  - additions / subtractions modulo N
  - multiplications / squarings modulo N
  - one inversion modulo N
  - two GCDs
- ► Typical size of *N*: from 192 to 512 bits

- ► For a given integer *N* to be factored, ECM requires:
  - additions / subtractions modulo N
  - multiplications / squarings modulo N
  - one inversion modulo N
  - two GCDs
- ► Typical size of *N*: from 192 to 512 bits
- ▶ What we have at our disposal:
  - basic integer arithmetic (addition, multiplication, comparisons)
  - on 32- and 64-bit words

- ► For a given integer *N* to be factored, ECM requires:
  - additions / subtractions modulo N
  - multiplications / squarings modulo N
  - one inversion modulo N
  - two GCDs
- ► Typical size of *N*: from 192 to 512 bits
- ▶ What we have at our disposal:
  - basic integer arithmetic (addition, multiplication, comparisons)
  - on 32- and 64-bit words
- ⇒ Write an optimized library for multiprecision modular arithmetic
  - all low-level functions (add, sub, mul, etc.) in pure ASM
  - higher-level functions (GCD, modular inversion) in C
  - no multi-threading: all computations on a single compute core

# Multiprecision representation

▶ Consider  $X \in \mathbb{Z}/N\mathbb{Z}$ , with N an n-bit integer

# Multiprecision representation

- ▶ Consider  $X \in \mathbb{Z}/N\mathbb{Z}$ , with N an n-bit integer
  - since  $0 \le X < N$ , it also fits into n bits



# Multiprecision representation

- ▶ Consider  $X \in \mathbb{Z}/N\mathbb{Z}$ , with N an n-bit integer
  - since  $0 \le X \le N$ , it also fits into *n* bits
  - split X into  $n_W = \lceil n/32 \rceil$  32-bit words (or limbs):

$$X = x_{n_W-1}2^{32(n_W-1)} + \cdots + x_12^{32} + x_0$$



# Multiprecision representation

- ▶ Consider  $X \in \mathbb{Z}/N\mathbb{Z}$ , with N an n-bit integer
  - since  $0 \le X \le N$ , it also fits into *n* bits
  - split X into  $n_W = \lceil n/32 \rceil$  32-bit words (or limbs):

$$X = x_{n_W-1}2^{32(n_W-1)} + \cdots + x_12^{32} + x_0$$



# Multiprecision representation

- ▶ Consider  $X \in \mathbb{Z}/N\mathbb{Z}$ , with N an n-bit integer
  - since  $0 \le X < N$ , it also fits into *n* bits
  - split X into  $n_W = \lceil n/32 \rceil$  32-bit words (or limbs):

$$X = x_{n_W-1}2^{32(n_W-1)} + \cdots + x_12^{32} + x_0$$

- ▶ In our library,  $n_W$  is fixed at compile-time:
  - uint32\_t *X*[*n*<sub>W</sub>]
  - supported values:  $2 \le n_W \le 16$ , i.e. from 64 to 512 bits
  - write (or generate) code optimized for each value of  $n_W$



ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y



- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - right-to-left word-wise addition



- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - right-to-left word-wise addition
  - need to propagate carry



- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - right-to-left word-wise addition
  - need to propagate carry



- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - right-to-left word-wise addition
  - need to propagate carry



- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - right-to-left word-wise addition
  - need to propagate carry



- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - right-to-left word-wise addition
  - need to propagate carry



- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - right-to-left word-wise addition
  - need to propagate carry



- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - right-to-left word-wise addition
  - need to propagate carry
  - use 64-bit additions to halve the number of operations



- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - ld / sd: 64-bit memory accesses
  - adddc: 64-bit addition with carry (on ALU<sub>0</sub> and ALU<sub>1</sub>)

- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - 1d / sd: 64-bit memory accesses
  - adddc: 64-bit addition with carry (on ALU<sub>0</sub> and ALU<sub>1</sub>)

| C | ycle | BCU | LSU | MAU | $ALU_1$ | ALU <sub>0</sub> |
|---|------|-----|-----|-----|---------|------------------|
|   |      |     |     |     |         |                  |
|   |      |     |     |     |         |                  |
|   |      |     |     |     |         |                  |
|   |      |     |     |     |         |                  |
|   |      |     |     |     |         |                  |
|   |      |     |     |     |         |                  |
|   |      |     |     |     |         |                  |
|   |      |     |     |     |         |                  |
|   |      |     |     |     |         |                  |

- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - 1d / sd: 64-bit memory accesses
  - adddc: 64-bit addition with carry (on ALU<sub>0</sub> and ALU<sub>1</sub>)

| cycle | BCU | LSU               |                | MAU | $ALU_1$ | ALU <sub>0</sub> |
|-------|-----|-------------------|----------------|-----|---------|------------------|
| <br>t |     | $x \leftarrow 1d$ | 8 <i>i</i> [X] |     |         |                  |
|       |     |                   |                |     |         |                  |
|       |     |                   |                |     |         |                  |
|       |     |                   |                |     |         |                  |
|       |     |                   |                |     |         |                  |

- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - 1d / sd: 64-bit memory accesses
  - adddc: 64-bit addition with carry (on ALU<sub>0</sub> and ALU<sub>1</sub>)

| cycle | BCU | LSU                                    | J              | MAU | $ALU_1$ | ALU <sub>0</sub> |
|-------|-----|----------------------------------------|----------------|-----|---------|------------------|
|       |     |                                        | 0: <b>[V</b> ] |     |         |                  |
| t + 1 |     | $x \leftarrow 1d$<br>$y \leftarrow 1d$ | 8i[X]<br>8i[Y] |     |         |                  |
| ·     |     |                                        |                |     |         |                  |
|       |     |                                        |                |     |         |                  |
|       |     |                                        |                |     |         |                  |
|       |     |                                        |                |     |         |                  |
|       |     |                                        |                |     |         |                  |
|       |     |                                        |                |     |         |                  |
|       |     |                                        |                |     |         |                  |

- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - 1d / sd: 64-bit memory accesses
  - adddc: 64-bit addition with carry (on ALU<sub>0</sub> and ALU<sub>1</sub>)

| cycle | BCU | LSU               |                         | MAU | $ALU_1$ | ALU <sub>0</sub> |
|-------|-----|-------------------|-------------------------|-----|---------|------------------|
|       |     |                   |                         |     |         |                  |
| t     |     | $x \leftarrow 1d$ | 8 <i>i</i> [X]          |     |         |                  |
| t+1   |     | $y \leftarrow 1d$ | 8 <i>i</i> [ <i>Y</i> ] |     |         |                  |
| t+2   |     |                   |                         |     |         |                  |
|       |     |                   |                         |     |         |                  |
|       |     |                   |                         |     |         |                  |
|       |     |                   |                         |     |         |                  |
|       |     |                   |                         |     |         |                  |
|       |     |                   |                         |     |         |                  |
|       |     |                   |                         |     |         |                  |
|       |     |                   |                         |     |         |                  |
|       |     |                   |                         |     |         |                  |
|       |     |                   |                         |     |         |                  |

- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - 1d / sd: 64-bit memory accesses
  - adddc: 64-bit addition with carry (on ALU<sub>0</sub> and ALU<sub>1</sub>)

| cycle | BCU | LSU               | J                       | MAU | $ALU_1$                   | ALU <sub>0</sub> |
|-------|-----|-------------------|-------------------------|-----|---------------------------|------------------|
|       |     |                   |                         |     |                           |                  |
| t     |     | $x \leftarrow 1d$ | 8 <i>i</i> [X]          |     |                           |                  |
| t+1   |     | $y \leftarrow 1d$ | 8 <i>i</i> [ <i>Y</i> ] |     |                           |                  |
| t+2   |     |                   |                         |     |                           |                  |
| t+3   |     |                   |                         |     | $r \leftarrow \text{add}$ | dc x, y          |
|       |     |                   |                         |     |                           |                  |
|       |     |                   |                         |     |                           |                  |
|       |     |                   |                         |     |                           |                  |
|       |     |                   |                         |     |                           |                  |
|       |     |                   |                         |     |                           |                  |
|       |     |                   |                         |     |                           |                  |
|       |     |                   |                         |     |                           |                  |

- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - 1d / sd: 64-bit memory accesses
  - adddc: 64-bit addition with carry (on ALU<sub>0</sub> and ALU<sub>1</sub>)

| cycle | BCU | LSU                     |                                       | MAU | $ALU_1$                   | ALU <sub>0</sub> |
|-------|-----|-------------------------|---------------------------------------|-----|---------------------------|------------------|
|       |     |                         |                                       |     |                           |                  |
| t     |     | $x \leftarrow 1d$       | 8 <i>i</i> [X]                        |     |                           |                  |
| t+1   |     | $y \leftarrow 1d$       | 8 <i>i</i> [ <i>Y</i> ]               |     |                           |                  |
| t+2   |     |                         |                                       |     |                           |                  |
| t+3   |     |                         |                                       |     | $r \leftarrow \text{add}$ | dc x, y          |
| t+4   |     | 8 <i>i</i> [ <i>R</i> ] | $\leftarrow \mathtt{sd} \ \mathit{r}$ |     |                           |                  |
|       |     |                         |                                       |     |                           |                  |
|       |     |                         |                                       |     |                           |                  |
|       |     |                         |                                       |     |                           |                  |
|       |     |                         |                                       |     |                           |                  |
|       |     |                         |                                       |     |                           |                  |
|       |     |                         |                                       |     |                           |                  |

- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - 1d / sd: 64-bit memory accesses
  - adddc: 64-bit addition with carry (on ALU<sub>0</sub> and ALU<sub>1</sub>)
  - total latency:  $5\lceil n_W/2\rceil + O(1)$  cycles

| cycle | BCU | LSU                           | MAU | $ALU_1$                   | ALU <sub>0</sub> |
|-------|-----|-------------------------------|-----|---------------------------|------------------|
|       |     |                               |     |                           |                  |
| t     |     | $x \leftarrow 1d$ 8 $i[X]$    |     |                           |                  |
| t+1   |     | $y \leftarrow 1d$ 8 $i[Y]$    |     |                           |                  |
| t+2   |     |                               |     |                           |                  |
| t+3   |     |                               |     | $r \leftarrow \text{add}$ | dc x, y          |
| t+4   |     | $8i[R] \leftarrow sd r$       |     |                           |                  |
| t+5   |     | $x \leftarrow 1d \ 8(i+1)[X]$ |     |                           |                  |
| t+6   |     | $y \leftarrow 1d \ 8(i+1)[Y]$ |     |                           |                  |
| t+7   |     |                               |     |                           |                  |
| t + 8 |     |                               |     | $r \leftarrow \text{add}$ | dc x, y          |
| t+9   |     | $8(i+1)[R] \leftarrow sd r$   |     |                           |                  |
|       |     |                               |     |                           |                  |

- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - ld / sd: 64-bit memory accesses
  - adddc: 64-bit addition with carry (on ALU<sub>0</sub> and ALU<sub>1</sub>)
  - total latency:  $5\lceil n_W/2\rceil + O(1)$  cycles

| cycle | BCU | LSU                                        | MAU | $ALU_1$                   | ALU <sub>0</sub> |
|-------|-----|--------------------------------------------|-----|---------------------------|------------------|
|       |     |                                            |     |                           |                  |
| t     |     | $x \leftarrow 1d$ $8i[X]$                  |     |                           |                  |
| t+1   |     | $y \leftarrow 1d$ $8i[Y]$                  |     |                           |                  |
| t+2   |     | $x \leftarrow 1d \ 8(i+1)[X]$              |     |                           |                  |
| t+3   |     | $y \leftarrow 1d \ 8(i+1)[Y]$              |     | $r \leftarrow \text{add}$ | dc x, y          |
| t+4   |     | $8i[R] \leftarrow sd r$                    |     |                           |                  |
| t+5   |     |                                            |     | $r \leftarrow \text{add}$ | dc x, y          |
| t+6   |     | $8(i+1)[R] \leftarrow \operatorname{sd} r$ |     |                           |                  |
| t+7   |     |                                            |     |                           |                  |
| t + 8 |     |                                            |     |                           |                  |
| t+9   |     |                                            |     |                           |                  |
|       |     |                                            |     |                           |                  |

- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - 1d / sd: 64-bit memory accesses
  - adddc: 64-bit addition with carry (on ALU<sub>0</sub> and ALU<sub>1</sub>)
  - total latency:  $5\lceil n_W/2\rceil + O(1)$  cycles

| cycle | BCU | LSU                                        | MAU | $ALU_1$                   | ALU <sub>0</sub> |
|-------|-----|--------------------------------------------|-----|---------------------------|------------------|
|       |     |                                            |     |                           |                  |
| t     |     | $x \leftarrow 1d$ $8i[X]$                  |     |                           |                  |
| t+1   |     | $y \leftarrow 1d$ $8i[Y]$                  |     |                           |                  |
| t+2   |     |                                            |     |                           |                  |
| t+3   |     | $x \leftarrow 1d \ 8(i+1)[X]$              |     | $r \leftarrow \text{add}$ | dc x, y          |
| t+4   |     | $y \leftarrow 1d \ 8(i+1)[Y]$              |     |                           |                  |
| t+5   |     | $8i[R] \leftarrow sd r$                    |     |                           |                  |
| t+6   |     |                                            |     | $r \leftarrow \text{add}$ | dc x, y          |
| t+7   |     |                                            |     |                           |                  |
| t + 8 |     | $8(i+1)[R] \leftarrow \operatorname{sd} r$ |     |                           |                  |
| t+9   |     |                                            |     |                           |                  |
|       |     |                                            |     |                           |                  |

- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - 1d / sd: 64-bit memory accesses
  - adddc: 64-bit addition with carry (on ALU<sub>0</sub> and ALU<sub>1</sub>)
  - total latency:  $5\lceil n_W/2\rceil + O(1)$  cycles

| cycle | BCU | LSU                                        | MAU | $ALU_1$                     | ALU <sub>0</sub> |
|-------|-----|--------------------------------------------|-----|-----------------------------|------------------|
|       |     |                                            |     |                             |                  |
| t     |     | $x \leftarrow 1d$ $8i[X]$                  |     | $r \leftarrow \mathtt{add}$ | dc x, y          |
| t+1   |     | $y \leftarrow 1d$ $8i[Y]$                  |     |                             |                  |
| t+2   |     | $8(i-1)[R] \leftarrow \text{sd } r$        |     |                             |                  |
| t+3   |     | $x \leftarrow 1d \ 8(i+1)[X]$              |     | $r \leftarrow \mathtt{add}$ | dc x, y          |
| t+4   |     | $y \leftarrow 1d \ 8(i+1)[Y]$              |     |                             |                  |
| t+5   |     | $8i[R] \leftarrow sd r$                    |     |                             |                  |
| t+6   |     |                                            |     | $r \leftarrow \text{add}$   | dc x, y          |
| t+7   |     |                                            |     |                             |                  |
| t + 8 |     | $8(i+1)[R] \leftarrow \operatorname{sd} r$ |     |                             |                  |
| t+9   |     |                                            |     |                             |                  |
|       |     |                                            |     |                             |                  |

- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - 1d / sd: 64-bit memory accesses
  - adddc: 64-bit addition with carry (on ALU<sub>0</sub> and ALU<sub>1</sub>)
  - total latency:  $5\lceil n_W/2\rceil + O(1)$  cycles

| cycle | BCU | LSU                                        | MAU | $ALU_1$                   | ALU <sub>0</sub> |
|-------|-----|--------------------------------------------|-----|---------------------------|------------------|
|       |     |                                            |     |                           |                  |
| t     |     | $x \leftarrow 1d$ $8i[X]$                  |     | $r \leftarrow \text{add}$ | dc x, y          |
| t+1   |     | $y \leftarrow 1d$ $8i[Y]$                  |     |                           |                  |
| t+2   |     | $8(i-1)[R] \leftarrow \text{sd } r$        |     |                           |                  |
| t+3   |     | $x \leftarrow 1d \ 8(i+1)[X]$              |     | $r \leftarrow \text{add}$ | dc x, y          |
| t+4   |     | $y \leftarrow 1d \ 8(i+1)[Y]$              |     |                           |                  |
| t+5   |     | $8i[R] \leftarrow sd r$                    |     |                           |                  |
| t+6   |     | $x \leftarrow 1d \ 8(i+2)[X]$              |     | $r \leftarrow \text{add}$ | dc x, y          |
| t+7   |     | $y \leftarrow 1d \ 8(i+2)[Y]$              |     |                           |                  |
| t + 8 |     | $8(i+1)[R] \leftarrow \operatorname{sd} r$ |     |                           |                  |
| t+9   |     |                                            |     | $r \leftarrow \text{add}$ | dc x, y          |
|       |     |                                            |     |                           |                  |

- ightharpoonup addn(R, X, Y): addition of two  $n_W$ -word integers X and Y
  - 1d / sd: 64-bit memory accesses
  - adddc: 64-bit addition with carry (on ALU<sub>0</sub> and ALU<sub>1</sub>)
  - total latency:  $3\lceil n_W/2\rceil + O(1)$  cycles

| cycle | BCU | LSU                                        | MAU | $ALU_1$                     | ALU <sub>0</sub> |
|-------|-----|--------------------------------------------|-----|-----------------------------|------------------|
|       |     |                                            |     |                             |                  |
| t     |     | $x \leftarrow 1d$ $8i[X]$                  |     | $r \leftarrow \mathtt{add}$ | dc x, y          |
| t+1   |     | $y \leftarrow 1d$ $8i[Y]$                  |     |                             |                  |
| t+2   |     | $8(i-1)[R] \leftarrow \operatorname{sd} r$ |     |                             |                  |
| t+3   |     | $x \leftarrow 1d 8(i+1)[X]$                |     | $r \leftarrow \text{add}$   | dc x, y          |
| t+4   |     | $y \leftarrow 1d 8(i+1)[Y]$                |     |                             |                  |
| t+5   |     | $8i[R] \leftarrow sd r$                    |     |                             |                  |
| t+6   |     | $x \leftarrow 1d 8(i+2)[X]$                |     | $r \leftarrow \text{add}$   | dc x, y          |
| t+7   |     | $y \leftarrow 1d 8(i+2)[Y]$                |     |                             |                  |
| t + 8 |     | $8(i+1)[R] \leftarrow \operatorname{sd} r$ |     |                             |                  |
| t+9   |     | $x \leftarrow 1d 8(i+3)[X]$                |     | $r \leftarrow \text{add}$   | dc x, y          |
|       |     |                                            |     |                             |                  |

ightharpoonup muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y



- $\blacktriangleright$  muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - schoolbook method:  $n_W^2$  32-by-32-bit subproducts



- ightharpoonup muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - schoolbook method:  $n_W^2$  32-by-32-bit subproducts
  - final product fits into 2n<sub>W</sub> words



- ightharpoonup muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - schoolbook method:  $n_W^2$  32-by-32-bit subproducts
  - final product fits into 2n<sub>W</sub> words
  - order for subproducts: operand scanning (simpler loop control)



- ightharpoonup muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - schoolbook method:  $n_W^2$  32-by-32-bit subproducts
  - final product fits into 2n<sub>W</sub> words
  - order for subproducts: operand scanning (simpler loop control)



- $\blacktriangleright$  muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - schoolbook method:  $n_W^2$  32-by-32-bit subproducts
  - final product fits into 2n<sub>W</sub> words
  - order for subproducts: operand scanning (simpler loop control)





- ightharpoonup muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - schoolbook method:  $n_W^2$  32-by-32-bit subproducts
  - final product fits into 2n<sub>W</sub> words
  - order for subproducts: operand scanning (simpler loop control)





- ightharpoonup muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - schoolbook method:  $n_W^2$  32-by-32-bit subproducts
  - final product fits into 2n<sub>W</sub> words
  - order for subproducts: operand scanning (simpler loop control)





- ightharpoonup muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - schoolbook method:  $n_W^2$  32-by-32-bit subproducts
  - final product fits into 2n<sub>W</sub> words
  - order for subproducts: operand scanning (simpler loop control)





- ightharpoonup muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - schoolbook method:  $n_W^2$  32-by-32-bit subproducts
  - final product fits into 2n<sub>W</sub> words
  - order for subproducts: operand scanning (simpler loop control)



- ightharpoonup muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - schoolbook method:  $n_W^2$  32-by-32-bit subproducts
  - final product fits into 2n<sub>W</sub> words
  - order for subproducts: operand scanning (simpler loop control)





- $\blacktriangleright$  muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - schoolbook method:  $n_W^2$  32-by-32-bit subproducts
  - final product fits into 2n<sub>W</sub> words
  - order for subproducts: operand scanning (simpler loop control)



- $\blacktriangleright$  muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - schoolbook method:  $n_W^2$  32-by-32-bit subproducts
  - final product fits into 2n<sub>W</sub> words
  - order for subproducts: operand scanning (simpler loop control)



- $\blacktriangleright$  muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - schoolbook method:  $n_W^2$  32-by-32-bit subproducts
  - final product fits into 2n<sub>W</sub> words
  - order for subproducts: operand scanning (simpler loop control)



- ightharpoonup muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - schoolbook method:  $n_W^2$  32-by-32-bit subproducts
  - final product fits into 2n<sub>W</sub> words
  - order for subproducts: operand scanning (simpler loop control)



- $\blacktriangleright$  muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - schoolbook method:  $n_W^2$  32-by-32-bit subproducts
  - final product fits into 2n<sub>W</sub> words
  - order for subproducts: operand scanning (simpler loop control)



- ightharpoonup muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - schoolbook method:  $n_W^2$  32-by-32-bit subproducts
  - final product fits into 2n<sub>W</sub> words
  - order for subproducts: operand scanning (simpler loop control)



- $\blacktriangleright$  muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - schoolbook method:  $n_W^2$  32-by-32-bit subproducts
  - final product fits into 2n<sub>W</sub> words
  - order for subproducts: operand scanning (simpler loop control)



ightharpoonup muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y

- ightharpoonup muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - multiplicand X kept in the register file (whence  $n_W \leq 16$ )
  - multiplier Y processed sequentially, word by word
  - n<sub>W</sub>-word accumulator for partial products

- ightharpoonup muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - multiplicand X kept in the register file (whence  $n_W \leq 16$ )
  - multiplier Y processed sequentially, word by word
  - *n*<sub>W</sub>-word accumulator for partial products
- ▶ Basic iteration:  $Acc \leftarrow Acc + X \times y_i$  (a.k.a. addmul\_1 in GMP)

$$-2$$
  $-1$  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 → Cycle LSU MAU ALU<sub>0/1</sub>

- ightharpoonup muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - multiplicand X kept in the register file (whence  $n_W \leq 16$ )
  - multiplier Y processed sequentially, word by word
  - n<sub>W</sub>-word accumulator for partial products
- ▶ Basic iteration:  $Acc \leftarrow Acc + X \times y_i$  (a.k.a. addmul\_1 in GMP)
  - first the even-rank partial products  $x_{2i} \times y_i$



- ightharpoonup muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - multiplicand X kept in the register file (whence  $n_W \leq 16$ )
  - multiplier Y processed sequentially, word by word
  - n<sub>W</sub>-word accumulator for partial products
- ▶ Basic iteration:  $Acc \leftarrow Acc + X \times y_i$  (a.k.a. addmul\_1 in GMP)
  - first the even-rank partial products  $x_{2i} \times y_i$
  - then the odd-rank partial products  $x_{2j+1} \times y_i$



- $\blacktriangleright$  muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - multiplicand X kept in the register file (whence  $n_W \leq 16$ )
  - multiplier Y processed sequentially, word by word
  - n<sub>W</sub>-word accumulator for partial products
- ▶ Basic iteration:  $Acc \leftarrow Acc + X \times y_i$  (a.k.a. addmul\_1 in GMP)
  - first the even-rank partial products  $x_{2i} \times y_i$
  - then the odd-rank partial products  $x_{2j+1} \times y_i$
  - constraints on register pairs: explicit accumulation using adddc



- ightharpoonup muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - multiplicand X kept in the register file (whence  $n_W \leq 16$ )
  - multiplier Y processed sequentially, word by word
  - n<sub>W</sub>-word accumulator for partial products
- ▶ Basic iteration:  $Acc \leftarrow Acc + X \times y_i$  (a.k.a. addmul\_1 in GMP)
  - first the even-rank partial products  $x_{2i} \times y_i$
  - then the odd-rank partial products  $x_{2j+1} \times y_i$
  - constraints on register pairs: explicit accumulation using adddc
  - Acc<sub>0</sub> requires an extra cycle to store the output carry



- ightharpoonup muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - multiplicand X kept in the register file (whence  $n_W \leq 16$ )
  - multiplier Y processed sequentially, word by word
  - n<sub>W</sub>-word accumulator for partial products
- ▶ Basic iteration:  $Acc \leftarrow Acc + X \times y_i$  (a.k.a. addmul\_1 in GMP)
  - first the even-rank partial products  $x_{2i} \times y_i$
  - then the odd-rank partial products  $x_{2j+1} \times y_i$
  - constraints on register pairs: explicit accumulation using adddc
  - Acc<sub>0</sub> requires an extra cycle to store the output carry



- $\blacktriangleright$  muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - multiplicand X kept in the register file (whence  $n_W \leq 16$ )
  - multiplier Y processed sequentially, word by word
  - n<sub>W</sub>-word accumulator for partial products
- ▶ Basic iteration:  $Acc \leftarrow Acc + X \times y_i$  (a.k.a. addmul\_1 in GMP)
  - first the even-rank partial products  $x_{2i} \times y_i$
  - then the odd-rank partial products  $x_{2j+1} \times y_i$
  - constraints on register pairs: explicit accumulation using adddc
  - Acc<sub>0</sub> requires an extra cycle to store the output carry
  - repeated n<sub>W</sub> times



- $\blacktriangleright$  muln(R, X, Y): multiplication of two  $n_W$ -word integers X and Y
  - multiplicand X kept in the register file (whence  $n_W \leq 16$ )
  - multiplier Y processed sequentially, word by word
  - n<sub>W</sub>-word accumulator for partial products
- ▶ Basic iteration:  $Acc \leftarrow Acc + X \times y_i$  (a.k.a. addmul\_1 in GMP)
  - first the even-rank partial products  $x_{2i} \times y_i$
  - then the odd-rank partial products  $x_{2j+1} \times y_i$
  - constraints on register pairs: explicit accumulation using adddc
  - Acc<sub>0</sub> requires an extra cycle to store the output carry
  - repeated n<sub>W</sub> times
- ▶ Total latency:  $n_W(n_W + 1) + O(1)$  cycles,  $\approx 1$  cycle / subproduct



- ▶ Perform computations modulo N ( $n_W$  words)
  - addition, subtraction: trivial
  - multiplication: use Montgomery reduction (REDC)

- $\triangleright$  Perform computations modulo N ( $n_W$  words)
  - addition, subtraction: trivial
  - multiplication: use Montgomery reduction (REDC)
- ▶ REDC :  $X \mapsto (X/2^{32n_W}) \mod N$

- ▶ Perform computations modulo N ( $n_W$  words)
  - addition, subtraction: trivial
  - multiplication: use Montgomery reduction (REDC)
- ▶ REDC :  $X \mapsto (X/2^{32n_W}) \mod N$ 
  - precomputation:  $\mu \leftarrow (-N)^{-1} \mod 2^{32}$

- ightharpoonup Perform computations modulo N ( $n_W$  words)
  - addition, subtraction: trivial
  - multiplication: use Montgomery reduction (REDC)
- ▶ REDC :  $X \mapsto (X/2^{32n_W}) \mod N$ 
  - precomputation:  $\mu \leftarrow (-N)^{-1} \mod 2^{32}$
  - repeat *n<sub>W</sub>* times:

- ightharpoonup Perform computations modulo N ( $n_W$  words)
  - addition, subtraction: trivial
  - multiplication: use Montgomery reduction (REDC)
- ▶ REDC :  $X \mapsto (X/2^{32n_W}) \mod N$ 
  - precomputation:  $\mu \leftarrow (-N)^{-1} \mod 2^{32}$
  - repeat  $n_W$  times:
    - $q \leftarrow (x_0 \cdot \mu) \mod 2^{32}$  // multiplication  $32 \leftarrow 32 \times 32$

- $\triangleright$  Perform computations modulo N ( $n_W$  words)
  - addition, subtraction: trivial
  - multiplication: use Montgomery reduction (REDC)
- ▶ REDC :  $X \mapsto (X/2^{32n_W}) \mod N$ 
  - precomputation:  $\mu \leftarrow (-N)^{-1} \mod 2^{32}$
  - repeat  $n_W$  times:

- $\triangleright$  Perform computations modulo N ( $n_W$  words)
  - addition, subtraction: trivial
  - multiplication: use Montgomery reduction (REDC)
- ▶ REDC :  $X \mapsto (X/2^{32n_W}) \mod N$ 
  - precomputation:  $\mu \leftarrow (-N)^{-1} \mod 2^{32}$
  - repeat n<sub>W</sub> times:

```
 \begin{array}{ll} \bullet & q \leftarrow (x_0 \cdot \mu) \bmod 2^{32} & // \mbox{ multiplication } 32 \leftarrow 32 \times 32 \\ \bullet & Y \leftarrow X + q \cdot N & // \mbox{ addmul\_1; } Y \equiv 0 \mbox{ (mod } 2^{32}) \\ \bullet & X \leftarrow Y/2^{32} & // \mbox{ exact division, for free} \end{array}
```

- ightharpoonup Perform computations modulo N ( $n_W$  words)
  - addition, subtraction: trivial
  - multiplication: use Montgomery reduction (REDC)
- ▶ REDC :  $X \mapsto (X/2^{32n_W}) \mod N$ 
  - precomputation:  $\mu \leftarrow (-N)^{-1} \mod 2^{32}$
  - repeat n<sub>W</sub> times:
    - ►  $q \leftarrow (x_0 \cdot \mu) \mod 2^{32}$  // multiplica ►  $Y \leftarrow X + q \cdot N$  // addmul\_1; ►  $X \leftarrow Y/2^{32}$  // exa
  - if X > N then  $X \leftarrow X N$

```
// multiplication 32 \leftarrow 32 \times 32
// addmul_1; Y \equiv 0 \pmod{2^{32}}
// exact division, for free
```

- ▶ Perform computations modulo N ( $n_W$  words)
  - addition, subtraction: trivial
  - multiplication: use Montgomery reduction (REDC)
- ▶ REDC :  $X \mapsto (X/2^{32n_W}) \mod N$ 
  - precomputation:  $\mu \leftarrow (-N)^{-1} \mod 2^{32}$
  - repeat n<sub>W</sub> times:
    - $\begin{array}{ll} \bullet & q \leftarrow (x_0 \cdot \mu) \bmod 2^{32} & // \mbox{ multiplication } 32 \leftarrow 32 \times 32 \\ \bullet & Y \leftarrow X + q \cdot N & // \mbox{ addmul\_1; } Y \equiv 0 \mbox{ (mod } 2^{32}) \\ \bullet & X \leftarrow Y/2^{32} & // \mbox{ exact division, for free} \end{array}$
  - if  $X \geq N$  then  $X \leftarrow X N$
- ▶ Total latency:  $n_W(n_W + 3) + O(1)$  cycles, still ≈ 1 cycle / subproduct

- $\triangleright$  Perform computations modulo N ( $n_W$  words)
  - addition, subtraction: trivial
  - multiplication: use Montgomery reduction (REDC)
- ▶ REDC :  $X \mapsto (X/2^{32n_W}) \mod N$ 
  - precomputation:  $\mu \leftarrow (-N)^{-1} \mod 2^{32}$
  - repeat n<sub>W</sub> times:
  - if  $X \ge N$  then  $X \leftarrow X N$
- ▶ Total latency:  $n_W(n_W + 3) + O(1)$  cycles, still ≈ 1 cycle / subproduct
- ▶ Multiplication then REDC:  $2n_W(n_W + 2) + O(1)$  cycles
  - for  $n_W = 16$ , in ASM: 614 cycles
  - same thing in C: 3221 cycles!

### Outline of the talk

- ► ECM in a nutshell
- ► The Kalray MPPA-256 processor
- ► Multiprecision modular arithmetic
- Results and conclusion



► ECM with  $B_1 = 256$  and  $B_2 = 2^{14}$  (cost: 5 381 modular mults)



► ECM with  $B_1 = 1024$  and  $B_2 = 7 \cdot 2^{14}$  (cost: 22 878 modular mults)



► ECM with  $B_1 = 8192$  and  $B_2 = 80 \cdot 2^{14}$  (cost: 181852 modular mults)



► ECM with  $B_1 = 8192$  and  $B_2 = 80 \cdot 2^{14}$  (cost: 181852 modular mults)





▶ ECM with  $B_1 = 256$  and  $B_2 = 2^{14}$  (cost: 5 381 modular mults)



ightharpoonup ECM with  $B_1=1024$  and  $B_2=7\cdot 2^{14}$  (cost: 22 878 modular mults)



► ECM with  $B_1 = 8192$  and  $B_2 = 80 \cdot 2^{14}$  (cost: 181852 modular mults)



- ► Library for efficient multiprecision modular arithmetic:
  - support for precision up to 512 bits
  - carefully optimized critical low-level functions

- ▶ Library for efficient multiprecision modular arithmetic:
  - support for precision up to 512 bits
  - carefully optimized critical low-level functions
  - ullet quasi-optimal cost for quadratic ops: pprox 1 cycle / subproduct

- ▶ Library for efficient multiprecision modular arithmetic:
  - support for precision up to 512 bits
  - carefully optimized critical low-level functions
  - ullet quasi-optimal cost for quadratic ops: pprox 1 cycle / subproduct
- State-of-the-art implementation of ECM:
  - quite fast: 2-3× speedup wrt. dual 8-core Intel CPU,
     2-3× slowdown wrt. high-end GPU

- ▶ Library for efficient multiprecision modular arithmetic:
  - support for precision up to 512 bits
  - carefully optimized critical low-level functions
  - ullet quasi-optimal cost for quadratic ops: pprox 1 cycle / subproduct
- State-of-the-art implementation of ECM:
  - quite fast: 2-3× speedup wrt. dual 8-core Intel CPU,
     2-3× slowdown wrt. high-end GPU
  - ullet energy efficient:  $\sim 30 imes$  better than CPU,  $\sim 5 imes$  better than GPU

- ▶ Library for efficient multiprecision modular arithmetic:
  - support for precision up to 512 bits
  - carefully optimized critical low-level functions
  - ullet quasi-optimal cost for quadratic ops: pprox 1 cycle / subproduct
- State-of-the-art implementation of ECM:
  - quite fast: 2-3× speedup wrt. dual 8-core Intel CPU,
     2-3× slowdown wrt. high-end GPU
  - ullet energy efficient:  $\sim 30 imes$  better than CPU,  $\sim 5 imes$  better than GPU
  - more on-chip memory: can tackle larger sizes than GPUs

## Thank you for your attention

# Questions?

#### More info:

- Git repo: https://gforge.inria.fr/projects/kalray-ecm
- Paper: https://eprint.iacr.org/2016/365