Fast and Correct Load-Link/Store-Conditional Instruction Handling in DBT Systems

Digital Object Identifier (DOI):
10.1109/TCAD.2020.3013048

Document Version:
Peer reviewed version

Published In:
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems


Martin Kristien, Tom Spink, Brian Campbell, Susmit Sarkar, Ian Stark, Björn Franke, Igor Böhm, and Nigel Topham

Abstract—Dynamic Binary Translation (DBT) requires the implementation of load-link/store-conditional (LL/SC) primitives for guest systems that rely on this form of synchronization. When targeting e.g. x86 host systems, LL/SC guest instructions are typically emulated using atomic Compare-and-Swap (CAS) instructions on the host. Whilst this direct mapping is efficient, this approach is problematic due to subtle differences between LL/SC and CAS semantics. In this paper, we demonstrate that this is a real problem, and we provide code examples that fail to execute correctly on QEMU and a commercial DBT system, which both use the CAS approach to LL/SC emulation. We then develop two novel and provably correct LL/SC emulation schemes: (1) A purely software based scheme, which uses the DBT system’s page translation cache for correctly selecting between fast, but unsynchronized, and slow, but fully synchronized memory accesses, and (2) a hardware accelerated scheme that leverages hardware transactional memory (HTM) provided by the host. We have implemented these two schemes in the Synopsys DesignWare® ARC® nSIM DBT system, and we evaluate our implementations against full applications, and targeted micro-benchmarks. We demonstrate that our novel schemes are not only correct, but also deliver competitive performance, on par with or better than the widely used, but broken, CAS scheme.

I. INTRODUCTION

Dynamic Binary Translation (DBT) is a widely used technique for on-the-fly translation and execution of binary code for a guest Instruction Set Architecture (ISA), on a host machine with a different native ISA. DBT has many uses, including cross-platform virtualisation for the migration of legacy applications to different hardware platforms (e.g. Apple Rosetta, and IBM PowerVM Lx86, both based on Transitive’s QuickTransit, or HP Aries) and the provision of virtual platforms for convenient software development for embedded systems (e.g. OVPsim by Imperas). A popular open-source DBT system is QEMU [1], which has been ported to support many different guest/host architecture pairs. QEMU is often regarded as a de-facto standard DBT system, but there exist many other proprietary systems such as Wabi, the Intel IA-32 Execution Layer, or the Transmeta Code Morphing software.

Atomic instructions are fundamental to multi-core execution, where shared memory is used for synchronization purposes. Complex Instruction Set Computer (CISC) architectures typically provide various read-modify-write instructions, which perform multiple memory accesses (usually to the same memory address) with atomic behavior. A prominent example of these instructions is the Compare-and-Swap (CAS) instruction. A CAS instruction atomically updates a memory location only if the current value at the location is equal to a particular expected value. The semantics of the CAS instruction is shown in Figure 1. For example, Intel’s x86 processors provide the CMPXCHG instruction to implement compare-and-swap semantics.
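The CAS semantics described above can be modelled in a few lines. The following is a minimal sequential sketch rather than host code: the `Memory` class and the `compare_and_swap` function are names assumed here for illustration, and real hardware performs the read, compare, and write as one indivisible operation.

```python
class Memory:
    """Toy word-addressed memory standing in for guest memory (illustrative only)."""

    def __init__(self):
        self.words = {}

    def load(self, addr):
        return self.words.get(addr, 0)

    def store(self, addr, value):
        self.words[addr] = value


def compare_and_swap(mem, addr, expected, new):
    """Model of CAS: write `new` only if the current value equals `expected`.

    Returns True if the write happened. Hardware executes the three steps
    atomically; this model is sequential, so atomicity holds trivially.
    """
    if mem.load(addr) != expected:
        return False
    mem.store(addr, new)
    return True


mem = Memory()
mem.store(0x1000, 1)
swapped = compare_and_swap(mem, 0x1000, expected=1, new=5)    # values match: succeeds
rejected = compare_and_swap(mem, 0x1000, expected=1, new=9)   # memory now holds 5: fails
```

Note that the only thing the operation inspects is the current value; it has no record of how many writes happened before it ran.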

RISC architectures avoid complex atomic read-modify-write instructions by dividing these operations into distinct read and write instructions. In particular, load-link (LL) and store-conditional (SC) instructions are used, and the operation of these instructions is shown in Figure 2. LL and SC instructions typically operate in pairs, and require some kind of loop to retry the operation if a memory conflict is detected.

A load-link (LL) instruction performs a memory read, and internally registers the memory location for exclusive access. A store-conditional (SC) instruction performs a memory write if, and only if, there has been no write to the memory location since the previous load-link instruction. Among competing store-conditional instructions only one can succeed, and unsuccessful competitors are required to repeat the entire LL/SC sequence.
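As a point of reference, the LL/SC behavior described above can be written as a small sequential model. This is a sketch under stated assumptions: the class `LLSCMemory`, its method names, and the one-link-per-core bookkeeping are illustrative choices, not a description of any particular hardware implementation.

```python
class LLSCMemory:
    """Sequential reference model of LL/SC (all names are illustrative)."""

    def __init__(self):
        self.words = {}
        self.links = {}  # core id -> linked address

    def _break_links(self, addr):
        # Any write to `addr` breaks every core's link to that address.
        for core in [c for c, a in self.links.items() if a == addr]:
            del self.links[core]

    def load_link(self, core, addr):
        self.links[core] = addr            # register the location for this core
        return self.words.get(addr, 0)

    def store(self, addr, value):
        self._break_links(addr)            # a regular store also breaks links
        self.words[addr] = value

    def store_conditional(self, core, addr, value):
        if self.links.get(core) != addr:
            return False                   # link lost: caller must retry from LL
        self._break_links(addr)            # clears competitors and this core's link
        self.words[addr] = value
        return True


def atomic_increment(mem, core, addr):
    """Typical retry loop around an LL/SC pair."""
    while True:
        old = mem.load_link(core, addr)
        if mem.store_conditional(core, addr, old + 1):
            return old + 1


mem = LLSCMemory()
mem.load_link(1, 0x1000)
mem.load_link(2, 0x1000)
first = mem.store_conditional(1, 0x1000, 10)    # succeeds and breaks core 2's link
second = mem.store_conditional(2, 0x1000, 20)   # fails: core 2 must repeat its LL
count = atomic_increment(mem, 1, 0x2000)
```

Of the two competing store-conditionals only the first succeeds, which is the property the retry loop in `atomic_increment` relies on.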

For DBT systems, a key challenge is how to emulate guest LL/SC instructions on hosts that only support CAS instructions. Since LL/SC linkage should be broken by any write to memory, the efficiency of detecting intervening memory writes becomes crucial for the overall performance. A naïve approach would synchronize all store instructions, using critical sections to enforce atomicity when executing the instruction. However, such a naïve scheme would incur a high runtime performance penalty. A better approach to detect intervening stores efficiently is to use a compare-and-swap instruction. In this approach, the guest load-link instruction reads memory, and stores the value of the memory location in an internal data structure. Then, the guest store-conditional instruction uses this value as the expected parameter of the host CAS instruction. If the value of the memory location has not changed, then the store-conditional can succeed. Unfortunately, although fast, this approach does not preserve the full semantics of LL/SC instructions, and it suffers from the ABA problem: A memory location is changed from value A to B and back to A [2]. In this scenario, two interfering writes to a single memory location have occurred, and the store-conditional instruction should fail. However, since the same value has been written back to memory, this goes unnoticed by the instruction, and it incorrectly succeeds. We refer to this broken approach as the CAS approximation, as this emulation strategy only approximates correct LL/SC semantics.

Fig. 1: The compare-and-swap instruction atomically reads from a memory address, and compares the current value with an expected value. If these values are equal, then a new value is written to memory. The instruction indicates (usually through a return register or flag) whether or not the value was successfully written.

M. Kristien, T. Spink, B. Campbell, I. Stark, B. Franke, and N. Topham are with the School of Informatics, University of Edinburgh, UK. S. Sarkar is with the University of St. Andrews, UK. I. Böhm is with Synopsys Inc. Manuscript received April 18, 2020; revised June 12, 2020; accepted July 6, 2020. This article was presented in the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems 2020 and appears as part of the ESWEEK-TCAD special issue.

A. Motivating Example

Figure 3 shows how the CAS approximation implements the operation of the load-link and store-conditional instructions. The load-link instruction records the value of memory (into linked-value) at the given address, and returns it as usual. The store-conditional instruction performs a compare-and-swap on the guest memory, by comparing the current memory value to linked-value (from the load-link instruction), and swapping in the new value if the old values match. If the CAS instruction performed the swap, then the store-conditional is deemed to have been successful.

1) Trivial Broken Example: The following sequence of events describes a possible interleaving of guest instructions, which will cause LL/SC instructions implemented with the CAS approximation to generate incorrect results. The sequence of events is:

- t0: Core 1 performs a load-link instruction, marking memory address 0x1000 for exclusive access. The value returned from memory is #1.
- t1: Core 2 writes the value #2 to memory, using a regular store instruction.
- t2: Core 2 writes the value #1 to memory, again using a regular store instruction.
- t3: Core 1 performs a store-conditional instruction, and since the value in memory is the same as when it was read in t0, incorrectly performs the actual store, and returns a SUCCESS result.

The CAS approximation approach violates the semantics of the LL/SC pair at t3, as from the point-of-view of the store-conditional instruction, the value in memory has not changed, and so the instruction incorrectly assumes no changes were made, and hence succeeds. However, this assumption is wrong, since two interfering writes (at t1 and t2) have been performed, and thus should cause the store-conditional instruction to return a FAILED result.
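The t0 to t3 sequence can be replayed deterministically against a small model of the CAS approximation. This is a sketch only: `CasApproxCore` and its fields are hypothetical names, the shared memory is a plain dictionary, and the comparison inside `store_conditional` stands in for a single host CAS instruction.

```python
class CasApproxCore:
    """LL/SC emulated with the CAS approximation (model of the broken scheme)."""

    def __init__(self, memory):
        self.memory = memory            # shared dict: address -> value
        self.linked_addr = None
        self.linked_value = None

    def load_link(self, addr):
        self.linked_addr = addr
        self.linked_value = self.memory[addr]   # remembers the value, not the writes
        return self.linked_value

    def store(self, addr, value):
        self.memory[addr] = value       # regular store: no linkage bookkeeping at all

    def store_conditional(self, addr, value):
        # Stand-in for a host CAS with `linked_value` as the expected value.
        if addr != self.linked_addr or self.memory[addr] != self.linked_value:
            return False
        self.memory[addr] = value
        return True


memory = {0x1000: 1}
core1, core2 = CasApproxCore(memory), CasApproxCore(memory)

core1.load_link(0x1000)                          # t0: reads 1
core2.store(0x1000, 2)                           # t1: A -> B
core2.store(0x1000, 1)                           # t2: B -> A
sc_result = core1.store_conditional(0x1000, 7)   # t3: value matches, SC "succeeds"
```

In this model `sc_result` is `True` even though two stores intervened, which is exactly the outcome that correct LL/SC semantics forbid.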

We constructed a simple program to test this behavior on both a real 64-bit Arm machine, and an emulation of a 64-bit Arm machine using QEMU. Our analysis found that the
Fig. 4: Implementation of a lock-free stack pop operation.

Since the CAS approximation detects modifications to memory only through changes of values, other techniques must be used to prevent the ABA problem and guarantee correctness.

Similar to our QEMU experiment to detect incorrect behavior, we also constructed a test program of an LL/SC based implementation of a lock-free stack. We discovered that in QEMU, interleaving stack accesses as depicted in Table I results in stack corruption.
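To make the stack corruption concrete, the sketch below replays the textbook ABA interleaving on a lock-free stack whose head is updated by value comparison. The interleaving is assumed for illustration and is not taken from Table I; `Node`, `CasStack`, and `cas_head` are hypothetical names, and the two "threads" are simulated sequentially so the outcome is deterministic.

```python
class Node:
    def __init__(self, name, next_node=None):
        self.name = name
        self.next = next_node


class CasStack:
    """Lock-free stack whose head is updated by value comparison only."""

    def __init__(self):
        self.head = None

    def cas_head(self, expected, new):
        # Stand-in for a host CAS on the head pointer.
        if self.head is not expected:
            return False
        self.head = new
        return True

    def push(self, node):
        while True:
            node.next = self.head
            if self.cas_head(node.next, node):
                return

    def pop(self):
        while True:
            top = self.head
            if top is None:
                return None
            if self.cas_head(top, top.next):
                return top


c = Node("C")
b = Node("B", c)
a = Node("A", b)
stack = CasStack()
stack.head = a                              # stack: A -> B -> C

# Thread 1 begins a pop: it has read head (A) and A.next (B), then stalls.
t1_top, t1_next = stack.head, stack.head.next

# Thread 2 runs to completion in the meantime.
stack.pop()                                 # removes A
stack.pop()                                 # removes B
stack.push(a)                               # stack: A -> C

# Thread 1 resumes: head is A again, so the value comparison passes.
t1_ok = stack.cas_head(t1_top, t1_next)     # head becomes B, which was already popped
```

After the final step the head points at node B, which Thread 2 had already removed. A store-conditional with correct semantics would have failed here, because the head was written three times after Thread 1 read it.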

In this paper we contribute a novel, provably correct scheme for implementing load-link/store-conditional instructions in a DBT system and show that our correct implementation delivers application performance levels comparable to the broken CAS approximation scheme.

II. SCHEMES FOR LL/SC HANDLING

We introduce four schemes for handling LL/SC instructions in DBT systems. These schemes range from a naïve baseline scheme, through to an implementation utilizing hardware transactional memory (HTM).

1) Naïve Scheme (Section II-A): This scheme inserts standard locks around every memory instruction for tracking linked addresses, effectively turning memory instructions into critical sections. The scheme is correct, but impractical due to the extensive performance penalty associated with synchronizing on every memory access.

2) Broken: Compare-and-Swap-based Scheme (Section II-B): This scheme is used in state-of-the-art DBT systems such as QEMU. The scheme maps guest LL/SC instructions onto the host system’s compare-and-swap instructions, resulting in high performance. However, it violates LL/SC semantics.

3) Correct: Software-based Scheme (Section II-C): This scheme utilizes facilities available in the DBT system to efficiently manage linked memory addresses, by taking advantage of the page translation cache.

4) Correct: Hardware Transactional Memory (Section II-D): This scheme exploits the hardware transactional memory to efficiently detect conflicting memory accesses in LL/SC pairs.

The handling of LL/SC instructions in DBT systems closely follows the observed hardware implementation of these instructions. Each load-link instruction creates a CPU-local record of the linked memory location (i.e. storing the memory address in a hidden register). Then, the system monitors all writes to the same memory location. If a write is detected, the linkage of all CPUs is broken, to ensure that no future store-conditional instruction targeting the same memory location can succeed.

Emulating store-conditional instructions requires verifying that the local linkage is still valid. If so, the store-conditional can succeed, and invalidate the linkage of other CPUs for the same memory location atomically.

Since emulating an SC instruction comprises several checks and updates that must appear to be atomic, this emulation must be properly synchronized with the emulation of other SC and LL instructions. Furthermore, concurrent regular stores that interleave between the LL/SC pairs have to be detected so that future SC instructions cannot succeed. Detecting interleaving regular stores is the main challenge in the efficient emulation of LL/SC instructions.

A. Naïve: Lock Every Memory Access

A naïve implementation of LL/SC instructions guards all memory operations to the corresponding memory address with the same lock. Conceptually, a global lock can be used to guarantee mutual exclusion of all LL/SC and regular stores. In practice, more fine-grained locking can be used to improve performance, by allowing independent memory locations to be accessed concurrently.

This emulation scheme is presented in Figure 5. A load-link instruction enters a critical section, and creates a local linkage under mutual exclusion with respect to store-conditional, and regular stores to the same locations.

The store-conditional instruction checks the local linkage, and if it matches, then it performs the actual write to the memory address. The linkages of other CPUs corresponding to the same memory address are also invalidated, so that no future store-conditional can succeed. The emulation of a regular store invalidates the linkage on all CPUs corresponding to the same memory address. The actual write is performed unconditionally.
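The three operations of the naïve scheme can be sketched as follows. This is a minimal model under stated assumptions, not the code of Figure 5: it uses a single global lock (the coarsest option mentioned above), and `NaiveLLSC`, `worker`, and the per-core `links` dictionary are names introduced here for illustration.

```python
import threading


class NaiveLLSC:
    """Model of the naive scheme: one global lock guards LL, SC, and every store."""

    def __init__(self):
        self.lock = threading.Lock()
        self.memory = {}
        self.links = {}                    # core id -> linked address

    def _invalidate(self, addr):
        # Caller must hold self.lock.
        for core in [c for c, a in self.links.items() if a == addr]:
            del self.links[core]

    def load_link(self, core, addr):
        with self.lock:
            self.links[core] = addr
            return self.memory.get(addr, 0)

    def store_conditional(self, core, addr, value):
        with self.lock:
            if self.links.get(core) != addr:
                return False
            self._invalidate(addr)
            self.memory[addr] = value
            return True

    def store(self, core, addr, value):
        with self.lock:                    # cost paid by every regular store
            self._invalidate(addr)
            self.memory[addr] = value


def worker(dbt, core, addr, iterations):
    """Atomically increment a shared counter using an LL/SC retry loop."""
    for _ in range(iterations):
        while True:
            old = dbt.load_link(core, addr)
            if dbt.store_conditional(core, addr, old + 1):
                break


dbt = NaiveLLSC()
threads = [threading.Thread(target=worker, args=(dbt, c, 0x1000, 1000)) for c in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = dbt.memory[0x1000]
```

Because every regular store takes the lock and invalidates matching linkages, an A-to-B-to-A sequence of stores between an LL and its SC causes the SC to fail, at the price of serializing all stores.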

Although simple, this scheme suffers a significant performance penalty, due to the critical sections causing a slowdown of all regular store instructions. Lock acquisition has to be performed by every regular store, even those that do not access memory locations currently in use by LL/SC pairs. For typical applications, the majority of regular stores are slowed down unnecessarily.

B. Broken: Using CAS Style Semantics

This scheme uses a compare-and-swap instruction to approximate the semantics of load-link/store-conditional pairs. The LL/SC linkage comprises not only the address, but also the memory value read by the LL instruction.

Emulating the load-link instructions saves the linked address as well as the linked value (i.e. the result of the corresponding read). Then, emulating the store-conditional instruction uses the linked value for comparison with the current value of the linked memory location using the CAS instruction. If the current value of the memory at the previously stored linked address does not match the linked value saved from the previous LL instruction, an interleaving memory access has been detected. Since intervening writes to the memory location are detected by changes in memory value, no linkage invalidation of other CPUs is required. Similarly, regular stores can proceed without any need of linkage invalidation or synchronization.

This scheme offers great performance, since the emulation of LL/SC and regular stores does not need to synchronize at all. Furthermore, the compare-and-swap instruction is a well established synchronization mechanism, and thus its performance is optimized by the host hardware. However, as we have demonstrated, the CAS scheme only approximates the semantics of LL/SC instructions and in particular, utilizing the CAS scheme for this purpose can cause the emulation of guest programs to break.

C. Correct Software-Only Scheme

This scheme improves upon the approach taken by the naïve scheme. The key idea is to slow down only those regular stores that access memory locations that are currently being used by LL/SC instructions. To achieve this, we take advantage of a software Page Translation Cache (PTC), which is used in the DBT system to speed up the translation of guest virtual addresses to host virtual addresses. Emulated memory accesses can query this cache to avoid a costly full memory address translation, which may incur walking guest page tables, and resolving guest physical pages to corresponding host virtual pages.

Upon a hit in the cache, the emulation can use a fast-path version of the instruction. For regular stores, the fast-path version involves no synchronization with LL/SC instructions. Note that each core has its own private PTC, and that there exist a number of separate per-core PTCs based on memory access type (e.g. read, write, execute).

This scheme allows regular stores that are not conflicting with LL/SC memory addresses to proceed without any slowdown, by only synchronizing if the Write-PTC query misses in its corresponding cache. To achieve this behavior, emulating a load-link instruction involves invalidating the Write-PTC entry for the corresponding memory address. Then, future regular stores will be guaranteed to miss in the cache, and will be forced to synchronize with concurrent LL/SC instructions. In other words, no concurrent regular store can enter the fast-path without invalidating all LL/SC linkages for the corresponding memory address.
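The fast-path/slow-path split can be illustrated with a sequential model. Several details below are assumptions made for the sketch rather than facts from the text: the Write-PTC is modelled as a per-core set of page numbers, pages are 4 KiB, load-link evicts the page from every core's Write-PTC, and the slow path conservatively breaks all linkages within the page before refilling the cache. The model runs on one thread, so it does not exhibit the race discussed next.

```python
import threading

PAGE_SHIFT = 12  # assumed 4 KiB guest pages


class PtcLLSC:
    """Sequential model of the software-only scheme (illustrative assumptions)."""

    def __init__(self, num_cores):
        self.lock = threading.Lock()
        self.memory = {}
        self.links = {}                                      # core id -> linked address
        self.write_ptc = [set() for _ in range(num_cores)]   # per-core cached pages
        self.slow_stores = 0

    def _break_links(self, matches):
        # Caller must hold self.lock.
        for core in [c for c, a in self.links.items() if matches(a)]:
            del self.links[core]

    def load_link(self, core, addr):
        self.links[core] = addr                  # 1. create the linkage
        for ptc in self.write_ptc:               # 2. force later stores onto the slow path
            ptc.discard(addr >> PAGE_SHIFT)
        return self.memory.get(addr, 0)          # 3. read the data

    def store(self, core, addr, value):
        page = addr >> PAGE_SHIFT
        if page in self.write_ptc[core]:
            self.memory[addr] = value            # fast path: no synchronization
            return
        with self.lock:                          # slow path: Write-PTC miss
            self.slow_stores += 1
            self._break_links(lambda a: a >> PAGE_SHIFT == page)
            self.memory[addr] = value
            self.write_ptc[core].add(page)       # refill only after links are broken

    def store_conditional(self, core, addr, value):
        with self.lock:
            if self.links.get(core) != addr:
                return False
            self._break_links(lambda a: a == addr)
            self.memory[addr] = value
            return True


dbt = PtcLLSC(num_cores=2)
dbt.store(1, 0x1000, 1)      # miss: slow path, refills core 1's Write-PTC
dbt.store(1, 0x1008, 5)      # hit: fast path
dbt.load_link(0, 0x1000)     # evicts the page from every Write-PTC
dbt.store(1, 0x1000, 2)      # miss again: slow path breaks core 0's linkage
dbt.store(1, 0x1000, 1)      # hit: fast path, linkage is already broken
sc_ok = dbt.store_conditional(0, 0x1000, 7)
```

Only two of the four regular stores take the lock, and the A-to-B-to-A pattern is still detected because the first store after the load-link is forced through the slow path.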

1) Page Translation Cache Race Condition: Implementing the Software-only scheme without regard for the ordering of operations between cores leaves room for a race condition to appear. This particular scenario is depicted in Table II.

<table>
<thead>
<tr>
<th>Time</th>
<th>Core 1 (Regular Store)</th>
<th>Core 2 (Load-link)</th>
</tr>
</thead>
<tbody>
<tr>
<td>$t_0$</td>
<td>PTC lookup</td>
<td></td>
</tr>
<tr>
<td>$t_1$</td>
<td>Lookup hits in cache</td>
<td></td>
</tr>
<tr>
<td>$t_2$</td>
<td></td>
<td>Create linkage</td>
</tr>
<tr>
<td>$t_3$</td>
<td></td>
<td>Invalidate PTC</td>
</tr>
<tr>
<td>$t_4$</td>
<td></td>
<td>Read data from memory</td>
</tr>
<tr>
<td>$t_5$</td>
<td>Write data to memory</td>
<td></td>
</tr>
</tbody>
</table>

**TABLE II:** A particular interleaving of operations that leads to a race-condition in the Software-only scheme.

In this interleaving, on Core 1, the emulation of a regular store performs a successful PTC lookup, and enters the fast-path at time $t_0$. At this point, Core 2 performs the full emulation of a load-link instruction, i.e. the linkage is established, the corresponding PTC entry is invalidated, and the data is read from memory ($t_2$ through $t_4$). Then, back on Core 1, the emulation of the regular store actually updates the data in memory, at time $t_5$.

After this execution, a store-conditional instruction on Core 2 can still succeed, as Core 1 did not invalidate the underlying linkage. However, the data read by Core 2 at time $t_4$ is now out of date, since an interleaving write to the memory address (Core 1 at $t_5$) has been missed.

This race-condition manifests itself when PTC invalidations performed by LL instructions happen after a successful PTC lookup by a regular store, but before the actual data write by the regular store instruction.

To prevent such behavior, we protect the PTC entry during fast-path execution of regular stores by making a cache line tag annotation. In particular, successful PTC lookups atomically annotate the underlying tag value by setting a bit corresponding to an ongoing fast-path execution. This tag bit can be cleared only after the actual data update has happened. While the tag bit is set, PTC invalidation during LL emulation is blocked.
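The effect of the tag bit can be shown with a small state machine that replays the problematic interleaving. This is a sketch under assumptions: `PtcEntry` and its method names are hypothetical, the steps run sequentially on one thread, and in a real implementation the lookup-and-set would be a single atomic operation, with LL emulation retrying while the bit is set.

```python
class PtcEntry:
    """One Write-PTC entry with a 'fast path in flight' tag bit (illustrative model)."""

    def __init__(self):
        self.valid = True
        self.busy = False    # set between a fast-path lookup and its data write

    def begin_fast_store(self):
        """Lookup by a regular store; on a hit, set the tag bit."""
        if not self.valid:
            return False
        self.busy = True
        return True

    def end_fast_store(self):
        """Clear the tag bit once the data write has happened."""
        self.busy = False

    def try_invalidate(self):
        """Invalidation by LL emulation; refused while the tag bit is set."""
        if self.busy:
            return False
        self.valid = False
        return True


memory = {0x1000: 1}
entry = PtcEntry()
trace = []

trace.append(entry.begin_fast_store())   # Core 1: PTC lookup hits, tag bit set
linked_addr = 0x1000                     # Core 2: create linkage
trace.append(entry.try_invalidate())     # Core 2: invalidation is blocked
memory[0x1000] = 2                       # Core 1: write data to memory
entry.end_fast_store()                   # Core 1: clear tag bit
trace.append(entry.try_invalidate())     # Core 2: invalidation now succeeds
linked_value = memory[0x1000]            # Core 2: reads the value Core 1 wrote
```

Because the invalidation cannot complete until the in-flight store has written its data, the load-link reads the post-store value, and the store is effectively ordered before the linkage rather than silently missed.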

2) Optimizations: There are a number of optimizations that can be applied to the software-only scheme, to make further improvements.

Firstly, since the linkage comprises a single word, it can be updated atomically. This means that mutual exclusion is not required for LL emulation. However, operations must still occur in a particular order to guarantee correctness. In the case of LL emulation, the data read must happen after the PTC invalidation, which must happen after the linkage is created.

The required ordering is depicted in Figure 7 using a red dotted line to indicate a memory fence across which operations cannot be reordered. We also provide a proof of correctness of this scheme in Section IV.

Secondly, since the emulation of LL does not use locks for mutual exclusion, the state of the underlying lock can be used to infer if a concurrent write operation is in flight.

This can be used to immediately fail store-conditional instructions without checking whether or not the linkage is present. If the emulation of an SC instruction observes the lock as being already held, then there is already a concurrent update to the memory location in progress, so the SC instruction will fail anyway. In this scenario, the SC instruction fails immediately. In other words, store-conditional emulation acquires the underlying lock if and only if it has enough confidence that it is still possible to succeed. Note that the emulation of regular stores always acquires the lock, as this write is unconditional.
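The early-failure optimization maps naturally onto a non-blocking lock acquisition. The sketch below is an assumption-laden model: `EarlyFailSC` is a hypothetical name, a single lock is used, and the PTC invalidation performed by load-link is omitted to keep the focus on the store-conditional path.

```python
import threading


class EarlyFailSC:
    """Model of SC emulation that fails at once if the lock is already held."""

    def __init__(self):
        self.lock = threading.Lock()
        self.memory = {}
        self.links = {}                      # core id -> linked address

    def load_link(self, core, addr):
        self.links[core] = addr              # single-word linkage, set without the lock
        return self.memory.get(addr, 0)

    def store_conditional(self, core, addr, value):
        if not self.lock.acquire(blocking=False):
            return False                     # a concurrent writer is in flight: fail now
        try:
            if self.links.get(core) != addr:
                return False
            for other in [c for c, a in self.links.items() if a == addr]:
                del self.links[other]
            self.memory[addr] = value
            return True
        finally:
            self.lock.release()


dbt = EarlyFailSC()
dbt.load_link(0, 0x1000)
dbt.lock.acquire()                                   # pretend another core is mid-update
contended = dbt.store_conditional(0, 0x1000, 1)      # fails without waiting
dbt.lock.release()
uncontended = dbt.store_conditional(0, 0x1000, 1)    # lock is free: normal SC path
```

The demonstration only exercises the lock state. In a full implementation the concurrent writer would also have broken the linkage, so the failed store-conditional loses nothing by not waiting.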

These optimizations reduce the amount of code that executes within a critical section, and especially in many-core applications, this results in less code having to serialize, thus resulting in much better scalability.

D. Correct Scheme Using Hardware TM

Hardware Transactional Memory (HTM) has been used for implementing memory consistency in cross-ISA emulation on multicore systems before [6]. We take inspiration from this approach, and develop a scheme for implementing LL/SC handling using similar HTM constructs.

HTM allows the execution of a number of memory accesses to be perceived as atomic. This can be exploited for the handling of LL/SC instructions by encapsulating the whole LL/SC sequence in a transaction. In particular, the emulation of a load-link begins a transaction (e.g. the XBEGIN Intel TSX instruction) and then reads the data. The emulation of a store-conditional writes the new data, and then commits the ongoing transaction (e.g. XEND).

Since all memory accesses between LL and SC instructions (inclusive) are part of a transaction, any concurrent operations (i.e. another LL/SC pair) succeed only if there are no conflicts between any memory accesses. Furthermore, even memory accesses that are not part of a transaction will abort ongoing transactions in the case of a conflict.
This behavior is guaranteed by the strong transactional semantics supported by many modern architectures. As a result, this scheme can avoid any locking or linkage invalidation by relying on transactional semantics to resolve conflicts between concurrent LL/SC pairs, and to detect intervening regular stores.

Using HTM always requires a fallback mechanism to be in place, as HTM is not guaranteed to succeed. In our case, the fallback mechanism is the Software-only scheme, described in Section II-C.

The emulation of an SC instruction can check if it is executing within a transaction by e.g. using the Intel TSX XTEST instruction. For any core, both LL and SC instructions either take the transactional path or the fallback path. However, it is still possible for one core to execute an LL/SC sequence transactionally, while another core concurrently executes LL/SC using the fallback mechanism. Therefore, the transactional path has to correctly interact with the fallback path.

Since the transaction is perceived to be atomic, we analyze the interaction from the perspective of the fallback execution, assuming that transactional execution happens instantaneously. The transaction can happen while inside a critical section, during the emulation of a store-conditional in the fallback path.

To guarantee correct execution, we use a lock elision technique [7] that aborts (e.g. XABORT) the transaction if any cores are currently executing in the fallback path (i.e. within a critical section).

Another scenario is when the transaction happens while the fallback execution emulates instructions between LL and SC pairs. Here, the lock elision technique cannot be used, since there is no lock to elide. Instead, the fallback linkage is invalidated, causing the future fallback SC instruction to fail. This approach allows the transactional execution to continue, to reduce the number of transaction aborts.

III. EVALUATION

We compiled applications from EEMBC Multibench [8] and PARSEC [9] benchmark suites for bare-metal parallel execution by linking the binaries against a custom bare-metal pthread library. The pthread library assigns each application thread to a virtual core. In our evaluation, we use 10-core execution to support eight workload threads for each benchmark. The two remaining cores are necessary to support non-workload threads, e.g. the main application thread that spawns the workload threads.

In addition to the application benchmarks, we constructed our own suite of micro-benchmarks. These are designed to stress the use of LL/SC instructions. The benchmarks vary in the level of congestion in space and time. The micro-benchmarks are described in Table III and the details of the host machine used for experimentation are shown in Table IV.

A. Key Results: Application Benchmarks

In this section, we present results demonstrating that our different schemes do not adversely affect normal application performance. In these benchmarks, the number of LL/SC sequences is low, but since the LL/SC mechanism interacts with regular memory accesses, we show that our schemes do not incur a significant performance overhead during normal operation.

Figure 9 shows that the Multibench suite does not exhibit much change in performance. For most benchmarks, all schemes result in the runtime performance falling within 5% relative to the Naïve scheme. This is due to the infrequent execution of the affected instructions (i.e. LL, SC, and regular store) by these benchmarks. For example, data-parallel benchmarks synchronize using LL/SC instructions only at the beginning and at the end of the execution. In this case, the
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>SC %</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>space_&lt;x&gt;</td>
<td>3.84</td>
<td>atomically increment random counter in an array of size x</td>
</tr>
<tr>
<td>space_indep</td>
<td>16.67</td>
<td>atomically increment a thread-private counter</td>
</tr>
<tr>
<td>time_&lt;x&gt;_&lt;y&gt;</td>
<td></td>
<td>variable workload loop performs y instructions, x of which are inside LL/SC sequence</td>
</tr>
<tr>
<td>stack</td>
<td>2.32</td>
<td>alternate pushing and popping element in a lock-free stack</td>
</tr>
<tr>
<td>prgn</td>
<td>16.67</td>
<td>generate random numbers by a lock-free random number generator</td>
</tr>
</tbody>
</table>

TABLE III: Micro-benchmarks: Heavy use of LL/SC corresponds to a high percentage of executed SC instructions.

Counter-intuitively, on average the CAS scheme performs slightly worse than HTM and PTC, but the difference is negligible, and we attribute this minor performance difference to indirect effects, such as the dynamic memory layout of the simulator.

Similar to the Multibench suite, Figure 10 shows that PARSEC benchmarks do not heavily use LL/SC instructions. However, the use of regular store instructions is much more frequent. This results in the PARSEC suite showing a significant performance improvement by using any other scheme compared to the Naïve scheme. Since the Naïve scheme incurs a synchronization overhead for all regular stores, its performance degrades compared to the other schemes that do not synchronize independent regular stores. The performance of the other schemes is comparable, with the PTC scheme achieving a speedup of 1.18× over Naïve on average.

B. Drilling Down: Micro-Benchmarks

These benchmarks exhibit more frequent use of LL/SC instructions and emphasise performance implications.

We evaluate the implementation overhead of each scheme by running a single-core version of the benchmarks. There is no congestion, as there are no concurrent cores. As a result, all SC instructions succeed and no LL/SC sequence is repeated. The single-core performance results are shown in Figure 11.

The CAS scheme results in the best performance, with an average speed-up of 1.04× over the naïve scheme. This is due to implementation simplicity and lack of explicit synchronization. However, this scheme does not preserve the LL/SC semantics. The performance of the PTC scheme is on par with the naïve scheme. This indicates that the additional cache invalidation and lookup present in the PTC scheme incur negligible overhead. The HTM scheme shows performance comparable to the other schemes only if the transactions are small, i.e. there are few instructions between LL and SC. In some benchmarks, there can be around 200 guest instructions between LL and SC, and we typically execute around 300 host instructions per guest instruction. This quickly leads to large transaction sizes on the host machine (e.g. in the order of 60,000 instructions), significantly increasing the chances of an abort. This causes a significant drop in performance for benchmarks that perform substantial work inside an LL/SC sequence.
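The transaction size quoted above follows directly from the two approximate figures in the text:

$$200 \ \text{guest instructions} \times 300 \ \frac{\text{host instructions}}{\text{guest instruction}} = 60{,}000 \ \text{host instructions per transaction}$$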

Next, we evaluate the performance of the proposed schemes in a multi-core context with eight cores. Here, each core must synchronize and communicate the LL/SC linkage, resulting in different LL/SC success rates for different schemes. The efficiency of linkage communication affects the overall performance beyond the single-core overhead. The results are shown in Figure 12.

In the case of high space congestion (space_1), PTC outperforms the naïve scheme by 2.6×. The broken CAS scheme results in the best performance. Fast LL/SC handling lowers the chances of interleaving concurrent LL/SC pairs, resulting in lower SC failure rates. Using the CAS scheme, each SC instruction fails three times on average before succeeding, while the PTC scheme results in six failures for each successful SC instruction.

Without congestion and all LL/SC pairs updating independent memory locations (space_indep), the naïve scheme outperforms both PTC and HTM. Since the benchmark has few regular stores, the naïve scheme does not suffer the overhead of unnecessary synchronization of all regular stores. In addition, the naïve scheme emulates LL instructions much more efficiently, as it does not need to handle the PTC invalidation.

In the case of time congestion (time_0_64), large frequent LL/SC sections result in low HTM performance. The poor HTM performance in this case has also been observed for single-core execution. If the LL/SC sections are small (time_0_64), the HTM scheme outperforms the PTC scheme with a speed-up over Naïve of 1.36× for HTM and 1.19× for PTC. This is because HTM can rely on transactional semantics to avoid data races and thus, if most transactions succeed, it can avoid explicit synchronization with other cores.

C. Lock Granularity

To implement critical sections in the proposed schemes, we use standard host-side locks. For correct LL/SC emulation, it is necessary to use the same lock for matching target memory addresses of concurrent LL/SC instructions. Utilising multiple locks enables different LL/SC targets to proceed independently.

To achieve this, we create a hash-map of host locks. By masking the target of an LL/SC instruction, we obtain an index into the hash-map, which contains the lock to be used for the critical section. Manipulating the size of the mask allows us to control the granularity of the host-side lock. We have experimented with several mask sizes using the space_1024 micro-benchmark, and the results are shown in Figure 13.
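One way to realize such a lock table is sketched below. The concrete choices are assumptions made for illustration: a 4-byte word granularity (`GRANULARITY_BITS = 2`), a power-of-two table of 1024 locks, and the helper names `lock_index` and `lock_for`. The text does not specify the hash function, so a simple shift-and-mask is used here.

```python
import threading

GRANULARITY_BITS = 2    # assumed: ignore the low 2 bits, i.e. one lock per 4-byte word
NUM_LOCKS = 1024        # assumed table size (must be a power of two for the mask below)

lock_table = [threading.Lock() for _ in range(NUM_LOCKS)]


def lock_index(addr, granularity_bits=GRANULARITY_BITS, num_locks=NUM_LOCKS):
    """Map a guest address to a table slot: drop the low bits, then fold into the table."""
    return (addr >> granularity_bits) & (num_locks - 1)


def lock_for(addr):
    """Return the host lock guarding the LL/SC target `addr`."""
    return lock_table[lock_index(addr)]


same_word = lock_for(0x1001) is lock_for(0x1002)            # same word, same lock
neighbours = lock_index(0x1000) != lock_index(0x1004)       # adjacent words differ
coarse = lock_index(0x1000, granularity_bits=32) == lock_index(0xFFFF0000, granularity_bits=32)
```

Raising `granularity_bits` makes more addresses share a lock; with a large enough value every 32-bit address maps to slot 0, which corresponds to a single global lock.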

The experimental data shows that using a single global lock results in poor performance. Note, the CAS scheme does not use host-side locking at all, and therefore is not affected by the lock granularity. Although correct, using a single lock to facilitate critical sections even for independent LL/SC instructions is an obvious performance bottleneck. We find that using even slightly finer-grained locking improves performance, and results in the level of congestion being predominantly controlled by the guest-side (as opposed to host-side) lock granularity.

In practice, there can only be as many independent LL/SC targets in flight at the same time as there are simulated cores. For example, in an 8-core configuration, any lock granularity that allows for eight random LL/SC targets to map to independent host locks can already achieve optimal performance. Therefore, in all other experiments, we use a word-sized mask, and a hash-map big enough to render collisions for (e.g. eight) random targets insignificant.
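To get a feeling for how large the lock table must be before collisions become insignificant, one can estimate the collision probability with a birthday-problem calculation. The sketch below assumes that targets map to locks uniformly and independently, which real address streams need not satisfy; the function name `collision_probability` is hypothetical.

```python
def collision_probability(num_targets, num_locks):
    """Probability that at least two of num_targets targets share a lock,
    assuming each target maps to a uniformly random lock."""
    if num_targets > num_locks:
        return 1.0  # pigeonhole: some lock must be shared
    p_all_distinct = 1.0
    for i in range(num_targets):
        # The (i+1)-th target must avoid the i locks already taken.
        p_all_distinct *= (num_locks - i) / num_locks
    return 1.0 - p_all_distinct


# Eight concurrent targets, as in the 8-core example, for several table sizes.
for table_bits in (3, 6, 10, 16):
    num_locks = 1 << table_bits
    print(f"{num_locks:6d} locks: {collision_probability(8, num_locks):.4f}")
```

The probability falls roughly in proportion to the inverse of the table size, so under these assumptions a modest table already makes host-side collisions rare for eight targets.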

D. Scalability

We evaluate the effect of varying the number of cores on the performance of the schemes. We vary the number of cores from one to eighteen in two-core increments. The maximum number of cores is selected such that all simulation threads, i.e. virtual cores, can be scheduled at the same time, on the same processor. This configuration results in the greatest level of parallelism without incurring any NUMA overhead caused by inter-processor communication.

All micro-benchmarks perform a constant amount of work overall. The benchmarks split the same workload evenly between all available cores. We expect to see two types of ideal scaling characteristics. First, data parallel benchmarks with a low level of congestion (such as space_indep) should exhibit runtimes inversely proportional to the number of cores. Second, high-congestion benchmarks (such as space_1) need to sequentialize almost all execution. In this case, adding more cores should result in only marginal runtime improvements, i.e. the runtime is expected to remain constant for any number of cores.
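The two ideal scaling behaviours can be written down compactly. The notation is an assumption introduced here: $T(n)$ denotes the runtime of a benchmark on $n$ simulated cores, with the total amount of work held constant.

```latex
T_{\text{low congestion}}(n) \approx \frac{T(1)}{n},
\qquad
T_{\text{high congestion}}(n) \approx T(1)
```

Measured curves can then be judged by how far they deviate from these two reference shapes.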

Without much congestion (space_indep), the schemes scale almost ideally. For a greater number of cores, the performance degrades most significantly for the HTM scheme. However, CAS exhibits little performance degradation even for a high number of cores. This is because CAS is a long-established synchronization mechanism that has been highly optimized by hardware designers, whereas hardware transaction technology is relatively new and implementations have not been optimized to the same degree.

High space-congestion benchmarks (space_1) exhibit near ideal scaling for CAS. HTM scales poorly and above ten cores becomes the worst performing scheme. For this many cores, very few transactions are able to complete without conflicts, resulting in almost all LL/SC instructions taking the fallback path. The PTC scheme scales significantly better than the Naïve scheme, especially at medium core counts (up to ten). The Naïve scheme shows no improvement even when moving from one-core to two-core execution. We attribute the better scalability of PTC to the optimizations it introduces, as discussed in Section II-C2. In particular, using a lock to handle both LL and SC instructions forces the execution (in the case of Naïve) to sequentialize unnecessarily.

The time-congestion benchmarks show similar behavior to the space-congestion benchmarks. CAS scales almost ideally for all levels of congestion. The PTC scheme scales similarly to CAS, and outperforms all other correct schemes. The HTM scheme shows poor and unstable performance. For low core counts, the performance is sensitive to the hardware implementation, which affects the efficiency of conflict detection and the rate of transaction failures. For higher numbers of cores, few LL/SC sequences succeed on the transactional path, resulting in most executions taking the fallback path.

IV. PROOF OF CORRECTNESS

We prove that the behavior of the naïve and software-only schemes is allowed, i.e. that it satisfies the LL/SC atomicity condition defined below. We do not consider the HTM scheme here for space reasons. We assume that the host has a Total-Store-Order memory model like x86, where each core’s writes appear in memory in program order. However, guest architectures with LL/SC such as Arm will often have a weaker memory model, so first we define our expectations for LL/SC in such a model.

a) Axiomatic Definition of Atomicity: In an axiomatic memory consistency model such as ARMv8 [11], we assume a coherence order relation (co) between all Writes to the same address, and a reads-from relation (rf) between a Write and a Read that reads its value. A from-reads relation (fr) can be derived from these as between a Read and a Write that is later (in co) than the Write the Read reads-from. We can also consider parts of these relations that relate events from different external threads, and call these coe, fre, rfe.

The atomicity condition for LL/SC is then that there is no fre followed by coe between any LL and a successful SC. In other words, no Write from another thread may lie (in coherence order co) after the Write that the LL reads-from and before the Write from a successful SC. This captures ARMv8-like LL/SC correctness. For architectures which allow no Writes (not even same-thread Writes) between the LL and SC, the condition can be stated as there is no fr followed by co between any LL and a successful SC.
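In relational notation the two conditions can be stated as follows. The relation name $\mathit{rmw}$, which pairs each LL with its successful SC, is an assumption borrowed from herd-style axiomatic models and is not defined in the text above; the symbol $;$ denotes relational composition.

```latex
% ARMv8-like atomicity: no external Write between the LL and its successful SC
\mathit{rmw} \cap (\mathit{fre} \mathbin{;} \mathit{coe}) = \emptyset

% Stricter variant: no Write at all (not even same-thread) in between
\mathit{rmw} \cap (\mathit{fr} \mathbin{;} \mathit{co}) = \emptyset
```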

b) Axiomatic Definition of Properly Locked Executions: For locks we have two events, lock acquire and lock release. The definition of Properly Locked executions says that any successful lock acquire is followed by the corresponding lock release, and preceded by the previous release (or no lock events for the initial acquire). This order must be common to all threads. Further, same thread events (Writes, Reads) after the lock acquire are globally ordered after the acquire, and same thread events before the release are globally ordered before the release.

c) Naïve Scheme from Section II-A: All LL, SC and Writes are guarded by locks, and furthermore the same lock is used for any particular address.

d) Proof: We have a global order for any properly locked execution. Since all LL, SC and Write events are between successful lock acquires and releases on the same thread, they are globally ordered by the lock ordering. We can read off the coherence order co (and coe) as a subset of this global order, and similarly for the from-reads order fr (and fre). Note that the place of an LL in this lock order corresponds to when its address is saved, not when the read is done. This is safe because if any Write (SC or normal) from another thread intervenes in the lock order between the LL and an SC, the locked address is guaranteed to be invalidated and thus the SC must fail. The atomicity condition is thus ensured by the check done for a successful SC (within a locked region) that no other thread has done a Write since the last LL.
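The argument can be made concrete with a small executable model of the naïve scheme. This is a sketch under simplifying assumptions, not the emulator's code: a single host lock guards every address, guest memory is a dictionary, guest threads are identified by an integer `tid`, and the names `NaiveLLSC`, `load_link`, `write` and `store_conditional` are hypothetical.

```python
import threading


class NaiveLLSC:
    """Toy model: every LL, SC and plain Write runs under one host lock."""

    def __init__(self, num_threads):
        self.lock = threading.Lock()         # assumption: one lock for all addresses
        self.memory = {}                     # guest address -> value
        self.linked = [None] * num_threads   # per-thread locked (linked) address

    def load_link(self, tid, addr):
        with self.lock:
            self.linked[tid] = addr          # save the locked address
            return self.memory.get(addr, 0)

    def write(self, tid, addr, value):
        with self.lock:
            self.memory[addr] = value
            self._clear_other_links(tid, addr)

    def store_conditional(self, tid, addr, value):
        with self.lock:
            ok = self.linked[tid] == addr    # still linked, so no external Write intervened
            self.linked[tid] = None
            if ok:
                self.memory[addr] = value
                self._clear_other_links(tid, addr)
            return ok

    def _clear_other_links(self, tid, addr):
        # A Write invalidates the matching locked address on all other threads.
        for other in range(len(self.linked)):
            if other != tid and self.linked[other] == addr:
                self.linked[other] = None


m = NaiveLLSC(num_threads=2)
m.write(0, 0x10, 1)
m.load_link(0, 0x10)                       # thread 0 links address 0x10
m.write(1, 0x10, 2)                        # thread 1 changes the value ...
m.write(1, 0x10, 1)                        # ... and restores it (ABA pattern)
print(m.store_conditional(0, 0x10, 5))     # False: an external Write intervened
```

A value-comparing CAS would accept the final store because the location again holds 1, whereas the model rejects it because thread 1's Writes cleared thread 0's locked address.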

e) Software-only Scheme from Section II-C: The SC and slow path Writes are still guarded by locks, but Writes on the fast path, used when the page translation cache lookup succeeds, are guarded by a different lock (tag bit) in the cache entry. An LL invalidates the cache entries for the matching virtual address on all other threads (within a locked region), spinning on the lock in the old cache entry to avoid races with ongoing fast path Writes, before invalidating all other locked addresses, and then reading the value. Plain Writes then check whether the virtual address is valid on that thread. If so, the Write action is done immediately (fast path). If not (slow path), the thread acquires the lock, performs the Write action, and then clears matching locked addresses on all other threads (and releases the lock). An SC checks whether the locked address is still held, and if so then does the Write and succeeds, otherwise it fails.

f) Proof: If all Writes go through the slow path then the proof is analogous to the naïve version. The only wrinkle is that the LL is not locked and therefore does not participate in the lock ordering. Consider however that the LL only does the underlying read of the value after saving the locked address, and the locked address is invalidated by any such Write, after the Write. So either the LL reads the old value, and then the Write (and possible invalidation) occurs, or the LL reads the new value, but if so the locked address will get invalidated before the writing thread releases the lock. Both cases are safe. Now let us consider the case that one or more Writes go through the fast path. We show that no such Write can come in between the LL and a successful SC (in the time order implied by the ordering between the LL and the SC’s acquire). Indeed, using the fast path means the virtual address is valid on the Write’s thread. Moreover, it must be valid when the Write happens because it is protected by the cache tag entry lock. Since the LL invalidates virtual addresses on all threads before doing the Read, this means either the Write is ordered before the invalidate which is part of the LL, or else the Write cannot go on the fast path until after the SC mutex release. In the first case, since the LL does the invalidate while spinning on the lock in the cache entry, the Write must have been done before the invalidate completed, i.e. definitely before the underlying read of the LL. In the second case, such a fast-path Write is by definition not between the LL and SC. Of course, when the LL has invalidated the Writer’s page table entry, the Write can still go on the slow path, but then we are back to the previous case.

V. RELATED WORK

[12] highlights the issues in correct handling of LL/SC instructions in QEMU, and a possible solution is provided. While similar to our software-only scheme, it does not provide implementation details of how the PTC invalidation is performed and how the concurrent fast-path is reintroduced subsequently. These details are crucial for both performance and correctness, and without them it is impossible to follow their correctness argument. In addition, the slow-path implementation relies on expensive mutexes. In contrast, we exploit lock-free optimizations for the emulation of both LL and SC instructions. Prior solutions developed in PQEMU [13] and COREMU [14] are shown to suffer from the ABA problem, i.e. they incorrectly implement the LL/SC semantics as demonstrated in our motivating example. Pico [15] introduces support for hardware transactional memory for the emulation of LL/SC synchronization, while an intermediate software approach uses extensive locking. In contrast, our approaches employ sparse locking and we deliver a formal proof-of-correctness. Qelt [16] is a recent development based on QEMU. While fast thanks to its floating-point acceleration, the paper does not offer any details on the implementation of their LL/SC emulation approach. [17] employs a lock-free FIFO queue for LL/SC emulation, but the paper is hard to follow and lacks a convincing correctness argument. XEMU [18] considers guest architectures with native LL/SC support. [19] presents a method to adapt algorithms using LL/SC to a pointer-size CAS without the ABA problem by using additional memory for each location, which would not be suitable for an emulator. The implementation presented assumes sequential consistency, but does include a formal proof mechanized in the PVS proof assistant. A more theoretical approach to LL/SC implementation without performance evaluation is taken in [20]. A wait-free multi-word Compare-and-Swap operation is the subject of [21].

VI. SUMMARY & CONCLUSIONS

We have shown that existing DBT systems implement an approximate version of load-link/store-conditional instructions that fails to fully capture their semantics. In particular, we showed how these implementations can cause bugs in real world applications, by causing the ABA problem to appear in e.g. lock-free data structures. We presented software-only and HTM assisted schemes that correctly implement LL/SC semantics in the commercial Synopsys DesignWare® ARC® nSIM DBT system. We evaluated our schemes and showed that provably correct LL/SC implementations can maintain high simulation throughput for application benchmarks.

REFERENCES


