

# Silo: Speculative Hardware Logging for Atomic Durability in Persistent Memory

#### Ming Zhang, Yu Hua

Huazhong University of Science and Technology, China

29th IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2023











#### **Atomic Durability**

> A group of updates is written to PM in an *all-or-nothing* manner

Current 64-bit CPUs support only 8B atomic writes<sup>[1-3]</sup>



[1] Fast&Fair@FAST'18 [2] Level Hashing@OSDI'18 [3] Recipe@SOSP'19











#### Software Logging

| Transaction steps |
|-------------------|
| Tx_begin          |
| create Log        |
| write Log         |
| flush Log         |
| sfence            |
| write data        |
| flush data        |
| sfence            |
| Tx_end            |

- Log operations are on the critical path
- Throughput decreases by up to **70%**<sup>[1]</sup>
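The step order above can be illustrated with a small simulation. This is a minimal sketch, not the authors' code: `ToyPM` and `software_undo_tx` are hypothetical names, the log is assumed to be an undo log, and a flush is modeled as making a store durable immediately, with `sfence` recorded only to mark the ordering points.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPM:
    """Toy persistent memory: a store is durable only once it has been flushed."""
    durable: dict = field(default_factory=dict)  # survives a crash
    pending: dict = field(default_factory=dict)  # written, not yet flushed
    trace: list = field(default_factory=list)    # critical-path steps, in order

    def write(self, addr, value):
        self.pending[addr] = value
        self.trace.append(f"write {addr}")

    def flush(self, addr):
        self.durable[addr] = self.pending.pop(addr)
        self.trace.append(f"flush {addr}")

    def sfence(self):
        # Recorded only to show where the ordering points fall.
        self.trace.append("sfence")


def software_undo_tx(pm, updates):
    """Run one transaction following the step order listed for software logging."""
    pm.trace.append("Tx_begin")
    undo_log = {addr: pm.durable.get(addr) for addr in updates}  # create Log
    pm.write("log", undo_log)                                    # write Log
    pm.flush("log")                                              # flush Log
    pm.sfence()
    for addr, value in updates.items():                          # write data
        pm.write(addr, value)
    for addr in updates:                                         # flush data
        pm.flush(addr)
    pm.sfence()
    pm.trace.append("Tx_end")


pm = ToyPM(durable={"A": 1, "B": 2})
software_undo_tx(pm, {"A": 10, "B": 20})
log_steps = [s for s in pm.trace if s.endswith("log")]
print(pm.trace)
print(f"{len(log_steps)} log steps precede the first data write")
```

Every log step and both fences appear in the trace before `Tx_end`, which is what places logging on the critical path.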



#### Hardware Logging

Tx\_begin → write data → Tx\_end




- ✓ Better performance
- ✓ Easy programming











#### State of the Art: Log as Backup



### Challenges

Tx\_begin → write A → write B → Tx\_end








#### **Ordering Constraints**




Hardware undo+redo

- **FWB**<sup>[1]</sup> writes logs to PM before the updated data for each write
- **MorLog**<sup>[2]</sup> flushes logs to PM before commit to ensure durability

→ Increases latency

Logging enables data recovery after a system crash, but increases the write traffic

→ Degrades PM endurance
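To make the extra traffic concrete, the sketch below lists the PM operations produced when every write persists an undo+redo log entry before its data, as described for FWB. It models ordering and counts only; the entry layout `("log", addr, old, new)` and the function name are assumptions for illustration, not the format used by FWB or MorLog.

```python
def undo_redo_pm_trace(pm, writes):
    """Return the ordered PM operations for a list of (addr, new_value) writes."""
    trace = []
    for addr, new in writes:
        old = pm.get(addr)
        trace.append(("log", addr, old, new))  # undo+redo entry is persisted first
        trace.append(("data", addr, new))      # then the updated data
        pm[addr] = new
    return trace


pm = {"A": 1, "B": 2}
trace = undo_redo_pm_trace(pm, [("A", 10), ("B", 20), ("A", 30)])
log_writes = sum(1 for op in trace if op[0] == "log")
data_writes = sum(1 for op in trace if op[0] == "data")
print(log_writes, data_writes)
```

In this model each data write is paired with one log write, and the log write always comes first, which is the source of both the added traffic and the ordering delay.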



#### **Key Ideas**

#### Speculative Logging

- Crashes are rare for a single machine<sup>[1-2]</sup>
- Do not conservatively write logs to PM in the common case (no failures)
- Only write logs to PM in rare cases (e.g., crashes) to guarantee atomic durability

#### Log as Data

- Logs record the new data
- Use on-chip logs to update the PM data in place after commit in the common case

Make the common case fast and guarantee recoverability
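The two ideas can be sketched together as follows. This is a simplified model under stated assumptions, not the hardware design: `SpeculativeLogSketch` and its methods are hypothetical names, one transaction runs at a time, and recovery is assumed to undo an uncommitted transaction using the old values kept in the log entries.

```python
class SpeculativeLogSketch:
    def __init__(self, pm_data):
        self.pm = dict(pm_data)  # persistent data region
        self.pm_log = []         # persistent log region (written only on a crash)
        self.cache = {}          # volatile cached values
        self.log_buffer = []     # on-chip log entries: (addr, old, new)
        self.log_flushes = 0     # number of times logs were written to PM

    def tx_write(self, addr, new):
        old = self.cache.get(addr, self.pm.get(addr))
        self.cache[addr] = new
        self.log_buffer.append((addr, old, new))

    def evict(self, addr):
        # A cacheline eviction may push uncommitted data to PM.
        self.pm[addr] = self.cache.pop(addr)

    def commit(self):
        # Common case: use the new data in the on-chip logs to update PM in place.
        for addr, _old, new in self.log_buffer:
            self.pm[addr] = new
        self.log_buffer.clear()

    def crash(self):
        # Rare case: the on-chip logs are written to the PM log region.
        self.pm_log = list(self.log_buffer)
        self.log_flushes += 1
        self.log_buffer.clear()
        self.cache.clear()

    def recover(self):
        # Undo the uncommitted transaction, newest entry first.
        for addr, old, _new in reversed(self.pm_log):
            self.pm[addr] = old
        self.pm_log.clear()


ok = SpeculativeLogSketch({"A": 1, "B": 2})
ok.tx_write("A", 10)
ok.tx_write("B", 20)
ok.commit()

bad = SpeculativeLogSketch({"A": 1, "B": 2})
bad.tx_write("A", 10)
bad.evict("A")          # uncommitted value reaches PM
bad.tx_write("B", 20)
bad.crash()
bad.recover()
print(ok.pm, ok.log_flushes, bad.pm, bad.log_flushes)
```

The committed transaction never writes a log to PM, while the crashed one writes its log once and is rolled back to the pre-transaction state.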







Monitor L1D updates during txns and generate log entries





[1] ReDU@MICRO'18 [2] Deneva@VLDB'17 [3] Hybrid Index@SIGMOD'16 [4] FOEDUS@SIGMOD'15 [5] From Oracle (*https://www.oracle.com/database/what-isoltp/*) \* Lithium thin-film battery



Reduce the size of on-chip log buffer based on write behaviors

- A write does not modify the data
  - E.g., copy and assignment<sup>[1]</sup>
  - Old data == New data
  - Log generator ignores this write and does not produce a log entry
- Temporal locality of programs
  - Only the oldest and newest data are required
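Both reductions can be expressed in a few lines. The sketch below is an illustration under assumptions: the log is modeled as a dictionary keyed by address, `record_write` is a hypothetical name, and "oldest and newest data" is read as keeping the oldest old value and the newest new value when the same address is written repeatedly within a transaction.

```python
def record_write(log, addr, old, new):
    """Update the on-chip log for one store; `log` maps addr -> [oldest_old, newest_new]."""
    if old == new:
        return False          # the write does not modify the data: no log entry
    if addr in log:
        log[addr][1] = new    # keep the oldest old value, replace the newest new value
    else:
        log[addr] = [old, new]
    return True


log = {}
record_write(log, "A", 1, 1)  # ignored
record_write(log, "A", 1, 2)
record_write(log, "A", 2, 3)  # merged into the existing entry for A
record_write(log, "B", 7, 8)
print(log)
```

Four stores leave two entries in the buffer, which is the effect the slide relies on to keep the on-chip log small.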



- Use the new data in on-chip logs to update the data region in place
- Do not block cacheline evictions
  - If an updated cacheline is evicted, set the flush-bit to 1 so that the log is discarded after commit
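A minimal sketch of the flush-bit rule follows. The names `LogEntry`, `on_eviction`, and `commit` are hypothetical, addresses are byte addresses with 64B cachelines, and the eviction itself is simulated by the caller writing the cached value to `pm`.

```python
from dataclasses import dataclass

CACHELINE = 64  # bytes


@dataclass
class LogEntry:
    addr: int
    new: int
    flush_bit: int = 0


def on_eviction(log, line_addr):
    """Mark every log entry that falls in the evicted cacheline."""
    for entry in log:
        if entry.addr // CACHELINE == line_addr // CACHELINE:
            entry.flush_bit = 1


def commit(log, pm):
    """Apply unflushed entries in place; entries with flush_bit == 1 are discarded."""
    applied = 0
    for entry in log:
        if entry.flush_bit == 0:
            pm[entry.addr] = entry.new
            applied += 1
    log.clear()
    return applied


pm = {0: 1, 128: 2}
log = [LogEntry(0, 11), LogEntry(128, 22)]
pm[0] = 11            # the cacheline holding address 0 is evicted to PM
on_eviction(log, 0)
applied = commit(log, pm)
print(applied, pm)
```

Only the entry whose cacheline stayed in the cache is written at commit, so the evicted data is not written a second time from the log.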



#### Benefits

- Write reduction: Don't write logs to PM in common cases
- No ordering constraints: Don't wait for flushing logs (and cachelines) to the log (and data) regions

- Silo allows two update paths
  - 8B: Log in-place Updates (LU)
  - 64B: Cacheline Evictions (CE)
- LU and CE are coalesced in an on-PM buffer
  - W1-W3 have overlapped bytes
  - W4-W5 are not overlapped
  - W6 is merged into cachelines
- Correctness: No race risk
  - ① Flush-bit in log is 1. CE updates the data region
  - ② LU and CE are coalesced to update the data region
  - ③ LU writes the data region. CE will not write twice\*

\* By using bit-level write reduction schemes, e.g., DCW@ISCA'09
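The coalescing of 8B and 64B updates can be sketched with a buffer keyed by cacheline. This is an assumption-laden illustration: `CoalescingBuffer` is a hypothetical name, later updates simply overwrite earlier overlapping bytes, and one PM write is counted per drained cacheline.

```python
LINE = 64  # bytes per cacheline (CE granularity)
WORD = 8   # bytes per log in-place update (LU granularity)


class CoalescingBuffer:
    def __init__(self):
        self.lines = {}  # cacheline base address -> {byte offset: byte value}

    def log_update(self, addr, data):
        """LU path: an aligned 8B update taken from an on-chip log."""
        assert len(data) == WORD and addr % WORD == 0
        self._merge(addr, data)

    def cacheline_eviction(self, base, data):
        """CE path: a full 64B cacheline written back from the cache."""
        assert len(data) == LINE and base % LINE == 0
        self._merge(base, data)

    def _merge(self, addr, data):
        base = addr - addr % LINE
        line = self.lines.setdefault(base, {})
        for i, byte in enumerate(data):
            line[addr - base + i] = byte  # later bytes replace overlapping ones

    def drain(self, pm):
        """Write each buffered cacheline to PM once; return the number of line writes."""
        writes = 0
        for base, line in self.lines.items():
            for offset, byte in line.items():
                pm[base + offset] = byte
            writes += 1
        self.lines.clear()
        return writes


pm = bytearray(256)
buf = CoalescingBuffer()
buf.log_update(0, b"\x01" * 8)
buf.log_update(8, b"\x02" * 8)
buf.cacheline_eviction(64, b"\x03" * 64)
buf.log_update(64, b"\x04" * 8)
line_writes = buf.drain(pm)
print(line_writes)
```

Four incoming updates drain as two cacheline writes, since updates that land in the same line are merged before reaching the data region.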

































## **Evaluation**

#### Benchmarks

- Micro-benchmarks
  - Array, Btree, Hash, Queue, RBtree
- Macro-benchmarks
  - TPCC, YCSB

#### Comparisons

- Base: A hardware logging baseline
- **FWB**<sup>[2]</sup>: The hardware logging design of FWB
- **MorLog**<sup>[3]</sup>: The morphable hardware logging
- LAD<sup>[4]</sup>: The logless atomic durability design
- Silo: Our speculative logging design

#### **Gem5 Simulation**

#### Processor

| Component  | Configuration                                  |
|------------|------------------------------------------------|
| Cores      | 8 cores, x86-64, 2 GHz                         |
| L1 I/D     | Private, 64B per line, 32KB, 8-way, 4 cycles   |
| L2         | Private, 64B per line, 256KB, 8-way, 12 cycles |
| LLC        | Shared, 64B per line, 8MB, 16-way, 28 cycles   |
| Mem Ctrl   | FRFCFS, 64-entry queue in ADR domain           |
| Log Buffer | 680B per core, FIFO, 8 cycles, battery-backed  |

#### Persistent Memory

| Parameter | Value                                   |
|-----------|-----------------------------------------|
| Capacity  | 16GB phase-change memory                |
| Latency   | Read / Write: 50 / 150 ns<sup>[1]</sup> |

## **Transaction Throughput**



| Silo's throughput improvement over     | 1 core | 8 cores  |
|----------------------------------------|--------|----------|
| Existing hardware logging designs      | 1.4x   | 4.3x     |
| Existing hardware logless design (LAD) | 1.1x   | **1.5x** |





Existing designs must wait to persist logs and cachelines.



#### Write Traffic






## **Overhead of Log Buffer**






| Battery consumption\*                                            | Intel's eADR | BBB@HPCA'21   | Our Silo      |
|------------------------------------------------------------------|--------------|---------------|---------------|
| Flush size for 8 cores (KB)                                      | 10,496       | 16            | 5.3125        |
| Flush energy (µJ)                                                | 54,377       | 194           | 62            |
| Supercapacitor (size: mm<sup>3</sup>; area: mm<sup>2</sup>)      | 151; 28.4    | 0.54; 0.66    | 0.17; 0.31    |
| Lithium thin-film (size: mm<sup>3</sup>; area: mm<sup>2</sup>)   | 1.51; 1.32   | 0.0054; 0.031 | 0.0017; 0.014 |

\* Based on the energy calculation model from BBB@HPCA'21

Silo's backup battery is smaller than eADR's by **888.2x** (size) and **91.6x** (area), and smaller than BBB's by **3.2x** (size) and **2.1x** (area).
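The Silo flush size and the size and area ratios can be rechecked from numbers given elsewhere in the deck. The short calculation below assumes the 680B per-core log buffer from the simulation configuration is the only state Silo flushes on a power failure, and takes the supercapacitor row of the battery table as the basis for the ratios; variable names are illustrative.

```python
LOG_BUFFER_BYTES_PER_CORE = 680
CORES = 8

# Total on-chip log state to flush on a power failure, in KB.
flush_kb = LOG_BUFFER_BYTES_PER_CORE * CORES / 1024

# Supercapacitor (size in mm^3, area in mm^2) taken from the battery table.
supercap = {"eADR": (151, 28.4), "BBB": (0.54, 0.66), "Silo": (0.17, 0.31)}


def times_smaller(design):
    """How many times smaller Silo's supercapacitor is than `design`'s (size, area)."""
    size, area = supercap[design]
    silo_size, silo_area = supercap["Silo"]
    return round(size / silo_size, 1), round(area / silo_area, 1)


print(flush_kb)
print(times_smaller("eADR"))
print(times_smaller("BBB"))
```

The result of 5.3125 KB matches the flush size listed for Silo, and the supercapacitor row reproduces the 888.2x; 91.6x and 3.2x; 2.1x figures.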

# **More Results**

- Handle large transactions
  - Log overflow occurs
  - Throughput decreases by only 7.4%
- Change latency of log buffer
  - A 128-cycle log buffer only decreases the throughput by **3.3%** over an 8-cycle one

#### Find more details in our paper!



- Ensuring atomic durability becomes important for PM
- Prior hardware logging studies: Log as Backup
  - Heavy writes to PM
  - Ordering constraints between persisting logs and data
- We propose Silo, a speculative logging design: Log as Data
  - Use on-chip logs to update data in place (make the common case fast)
  - Write logs to back up data in rare cases (guarantee recoverability)

#### Benefits

- Improve transaction throughput
- Reduce write traffic to PM
- Low hardware overhead

# Thank you!