Transaction Scheduling: From Conflicts to Runtime Conflicts

This paper studies how to improve the performance of main memory multicore OLTP systems for executing transactions with conflicts. A promising approach is to partition transaction workloads into mutually conflict-free clusters, and distribute the clusters to different cores for concurrent execution. We show that if transactions in each cluster are properly scheduled, transactions that are traditionally considered conflicting can be executed without conflicts at runtime. In light of this, we propose to schedule transactions and reduce runtime conflicts, instead of partitioning based on the conventional notion of conflicts. We formulate the transaction scheduling problem to minimize runtime conflicts, and show that the problem is NP-complete. This said, we develop an efficient scheduling algorithm to improve parallelism. Moreover, for transactions that are not packed in batches, we show that runtime conflict analysis also helps reduce conflict penalties, by proposing a proactive deferring method. Using standard and enhanced benchmarks, we show that on average our scheduling and proactive deferring methods improve the throughput of existing partitioners and concurrency control protocols by 131% and 109%, respectively, up to 294% and 152%.


Introduction
There has been an increasing demand for processing high volumes of concurrent transactions from, e.g., e-commerce, FinTech and cloud applications [28]. This, together with the growing dominance of multicore machines, highlights the need for pursuing higher throughput and parallelism in multi-thread transaction processing [4,31,38,56]. Due to contended operations that read and write the same data items, transaction execution has to be guarded against concurrency anomalies to uphold the desired isolation levels. There are mainly two types of approaches to maximizing concurrency while providing isolation guarantees: (a) partition-based approaches [14,21,31,34,38,45] that use transaction partitioners to decide a transaction-to-thread assignment for a batch of transactions before their execution; and (b) concurrency control (CC) based approaches [9,24,38,49,57] that, instead of searching for good transaction-to-thread assignments, rely on CC protocols to resolve contended transactions on the fly.
Transaction partitioning methods focus on a bundle of transactions, e.g., a set of transactions collected in a fixed duration [34] or workloads [21] for which the transaction logic is known in advance (e.g., stored procedures and hard-coded templates), as targeted by deterministic databases [4,36]. Such workloads allow the system to carry out analysis over the transactions, i.e., transaction partitioning, to decide the best way of assigning transactions to threads. More specifically, given a workload W over n threads, a transaction partitioning method computes a partitioning (P1, P2, ..., Pn) of W that assigns each Pi to a distinct thread, such that transactions in the same partition are executed serially, while concurrent execution is governed by CC for the desired isolation levels. Its objective is to minimize conflicts among transactions across different partitions to reduce CC cost, while maximizing load balance so as to minimize total parallel execution time. Instead of searching for the best thread assignment for the transactions, CC aims to resolve conflicts between transactions that are concurrently executed, and does not require the access sets of transactions before execution. There are two types of CC cost: (a) CC overhead, incurred on every transaction, and (b) conflict penalty when a conflict between transactions happens at runtime, e.g., blocking (waiting) time with locking-based CC [10,17] or abort/retry penalty with optimistic concurrency control protocols [5,24]. In practice, CC protocols with higher overhead are more likely to reduce conflict penalties. Hence, the effectiveness of CC relies on delicate trade-offs between the overhead and the ability to reduce conflict penalties.
We refer to transactions targeted by the partitioning and CC approaches as bundled transactions and unbundled ones, respectively.
Runtime conflict. Conflicts are conventionally defined relative to the isolation level adopted by transaction systems. For serializability [20], transactions T and T' are in conflict if they access the same data item and at least one of them updates it. For snapshot isolation [7], T and T' are conflicting if they write to the same data item.
However, transactions that are conventionally considered in conflict can be executed concurrently without conflicts at runtime as long as they are scheduled properly, as illustrated below.
Example 1: Consider a set W0 of transactions {T1, T2, T3, T4, T5} (whose read/write sequences are depicted in Fig. 1), where R[x1] denotes a read of data item x1 and W[x1] a write to x1. Assume that the targeted isolation level is serializability. Then T1, T2 and T3 are in conflict; similarly for (T2, T5) and (T4, T5). Assume that we have two cores. Following [34], we partition W0 into P1 = {T1, T2, T3} and P2 = {T4}, along with T5 as a cross-partition transaction, as shown in Fig. 1(a). One can execute P1 and P2 concurrently without CC, followed by T5 after both P1 and P2 are completed. Assuming that each read and write operation takes a unit time, the makespan of such an execution (the concurrent execution of the transactions) of W0 is 20 time units. Note that the workloads of P1 and P2 are not balanced, and the second core may idle for a long time before the cross-partition transaction T5 can start.
Alternatively, following [14,21], one can start T5 as soon as T4 completes, as shown in Fig. 1(b); however, this requires CC since T5 conflicts with T2, which is being concurrently executed at the other core. This may cause a retry of T2 (when using OCC), and a makespan of 17 for W0 (ignoring CC overhead for all transactions).
In contrast, if we schedule the transactions as two queues Q1 = ⟨T2, T1, T3⟩ and Q2 = ⟨T4, T5⟩, i.e., imposing an execution order on the transactions in the partitions, we can process W0 at two cores concurrently without CC as shown in Fig. 1(c), by executing Q1 and Q2 in their orders at the two cores independently. Although T2 of Q1 is in conflict with T5 of Q2, their executions do not actually overlap. In fact, the executions are serializable even without CC. Note that the workloads are more balanced than with partitioning, and the makespan of such an execution of W0 is 14 instead of 20. □

This example suggests that we consider runtime conflicts instead for partition-based approaches, which provide a more accurate characterization of transaction executions and more opportunities to improve concurrency. Moreover, runtime conflicts can also help CC-based approaches for unbundled transactions.
Example 2: Consider the execution of the transactions of Example 1 following a CC-based approach: we execute T1, T2 and T3 at core 1 and T4, T5 at core 2 in order, using OCC. Then T2 retries due to its conflict with T5, causing a total execution time of 17 (as shown in Fig. 1(b)).
If, prior to the start of T2, core 1 could find that T5 is being executed at core 2, which will incur a conflict with T2, then a better option is to "defer" T2 and execute T3 first instead. In this way, T3, T5 and T2 can all commit without retry, with a total time of 14 instead of 17, as shown in Fig. 1(d). That is, we can reduce runtime conflicts during execution for CC-based approaches, by proactively deferring transactions and changing their execution order on the fly. □

Transaction scheduling. For bundled transactions that are handled by existing partitioners, we propose to schedule transactions so as to minimize runtime conflicts. Given a workload W that has been partitioned over n cores, we generate queues (Q1, ..., Qn) and a residual set R, such that the Qi's consist of transactions in W that incur no runtime conflicts, and hence can even be executed without CC. As opposed to partitioning, transactions in each Qi are scheduled, by placing an ordering on their executions. Transactions in R are executed over all cores using CC.
Intuitively, transactions in Qi and those in Qj (i ≠ j) have no runtime conflict if they are executed on schedule, although they may be conventionally considered in conflict; hence transactions in Qi run serially at the i-th core, and in parallel with those in Qj (j ≠ i) without conflicts. Transactions in R inflict runtime conflicts and hence are executed at all cores with CC. As shown in Example 1, scheduling allows more transactions to be put in Q1, ..., Qn and executed concurrently without conflicts compared to partitioning.
We show that transaction scheduling is intractable. This is not surprising since it is more intricate than the transaction partitioning problem, which is already NP-hard [34]. This said, we develop an efficient algorithm that can either refine a transaction partition into a schedule, i.e., partitions with ordering, or compute a schedule from scratch when a partition is not available. Scheduling targets bundled transactions to which the partitioning methods are applicable. It not only reduces runtime conflicts but also balances workloads among the threads, improving the throughput. It is particularly effective for transactions with skewed costs or I/O latency, which existing partitioners do not handle very well.
Figure 2: A typical transaction system with TS. TS works as a plug-in for existing transaction systems that either use transaction partitioners to assign transactions to threads or directly rely on CC protocols. It sits between the transaction-to-thread assignment module and the transaction execution engines to reduce runtime conflicts.
Proactive transaction deferment. For unbundled transactions handled by CC-based approaches, we develop a lightweight method to reduce their CC cost by proactively deferring, on the fly, transactions that are on course to inflict runtime conflicts with other transactions being executed. Unlike partitioners, which are often used for preprocessing batched transactions prior to their execution, CC is part of the execution phase; hence proactive deferment has to come at a low cost, since otherwise the benefit of reduced runtime conflicts could be canceled out by the increased CC overhead.
To this end, we develop a lock-free structure that tracks the execution progress of all the threads, which allows us to efficiently detect, during execution, whether a runtime conflict would happen before executing a new transaction. The lock-free design avoids data-race overhead when threads update and look up execution progress concurrently. In addition, it has a parametric complexity to trade detection overhead for larger reductions in conflict penalties and hence higher throughput.
Prototype and evaluation. As a proof of concept, we develop TS, a lightweight tool as shown in Fig. 2 for improving existing partitioning-based or CC-based transaction systems via scheduling (module TsPart) and proactive deferment (module TsDefer). Given a workload W for partitioners, TS (TsPart) first learns rough runtime estimates of the transactions in W via execution histories, partial dry-runs used by partitioners [4], or more sophisticated cost estimators [11,42,46,51]. It then converts the partitioning of W into a schedule, i.e., queues (Q1, ..., Qn) and a residual R, to reduce runtime conflicts and improve load balance. The transactions in Q1, ..., Qn can be concurrently executed as scheduled without CC if the estimates are accurate. A subtle issue arises when the time estimates are not perfectly accurate; if so, some transactions may end up inflicting runtime conflicts. To cope with this, TS uses both CC and proactive transaction deferment to execute the scheduled queues and R. This guarantees that TS always executes transactions correctly, while still benefiting from reduced conflicts and balanced workloads among the threads via scheduling.
For unbundled transactions that are executed directly using CC, TS employs proactive transaction deferment (TsDefer) to reduce conflict penalties. TsDefer detects potential conflicts between transactions concurrently executed at runtime, and it imposes no restrictions on the workloads that CC schemes deal with.
To allow a fair and repeatable evaluation, we integrate TS into DBx1000, a popular open-source testbed for in-memory transaction processing with built-in implementations of major CC schemes [2,56]. Using TPC-C and YCSB, we verify that TS improves popular partitioners and CC protocols by 131% and 109% on average, respectively, and reduces their retries by 45.3% and 45.7%, up to 294% and 152% for throughput and 61.1% and 63.9% for retry reduction. Moreover, TS remains effective for transactions with varying degrees of long-tail I/O latency, making existing partitioners and CC schemes robust against, e.g., databases hosted on disks. We observe that TS is less effective for extremely short-run transactions, as conflict penalties play a much smaller role in their performance.
Contributions & organization. To summarize, in this work we develop techniques that improve both transaction partitioners and CC schemes by reducing runtime conflicts. More specifically:
• We formalize a notion of runtime conflicts and propose transaction scheduling to improve transaction partitioners by making conflicting transactions conflict-free at runtime (Section 2.2).
• We propose proactive transaction deferment for CC protocols to reduce runtime conflicts during execution time (Section 2.3).
• We develop TS, a tool that implements both methods. It can serve as a plug-in for existing transaction systems that use either transaction partitioners or CC protocols directly, providing fine-grained control over and flexible trade-offs between transaction retry penalties and CC overhead (Section 3).
• We settle the complexity of transaction scheduling and develop an efficient scheduling algorithm (Section 4).
• We develop a lightweight lock-free structure that enables proactive transaction deferment (Section 5).
• We empirically verify that TS improves the throughput of both partitioners and CC schemes, and makes them robust against large skewed I/O latency and runtime (Section 6).
We remark that TS is not meant to replace existing CC schemes or transaction partitioners. Instead, it is positioned as a tool that exposes opportunities to improve their performance in a non-intrusive way.
We discuss related work in Section 7 and future work in Section 8.

Reducing Runtime Conflicts for Transactions
In this section, we first review the basics of transaction processing (Section 2.1). We then present the notion of runtime conflicts, based on which we propose transaction scheduling (Section 2.2) and proactive transaction deferment (Section 2.3).

Transaction Preliminaries
A transaction T is a sequence of database actions that are to be executed as an atomic unit of work, including reads from and writes to a database. The tuples in the database that are read (resp. written) by a transaction T are referred to as the read (resp. write) set of T.
Conflicts. Two transactions T and T' are in conflict w.r.t. a CC protocol if they contain contended operations under that protocol. Contention is defined relative to the particular isolation level [7] that the protocol upholds. For instance, if the protocol enforces serializability, then T and T' are in conflict if they both access (read or write) the same data item x and at least one of them writes x. If it enforces snapshot isolation, then T and T' are in conflict if they both write the same data item.
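For concreteness, the following is a minimal sketch of the two conflict tests over read/write sets; Txn and the function names are illustrative, not the paper's or DBx1000's API.

```cpp
#include <set>
#include <string>

struct Txn {
  std::set<std::string> reads;   // read set: keys of the data items read
  std::set<std::string> writes;  // write set: keys of the data items written
};

static bool intersects(const std::set<std::string>& a, const std::set<std::string>& b) {
  for (const auto& x : a)
    if (b.count(x)) return true;
  return false;
}

// Serializability: both access some item and at least one of them writes it.
bool conflictSerializable(const Txn& t, const Txn& u) {
  return intersects(t.writes, u.writes) ||
         intersects(t.writes, u.reads) ||
         intersects(t.reads, u.writes);
}

// Snapshot isolation: conflict only on a write-write overlap.
bool conflictSnapshot(const Txn& t, const Txn& u) {
  return intersects(t.writes, u.writes);
}
```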
As an example, under serializability, T2 and T5 of Example 1 are in conflict; however, they do not conflict under snapshot isolation.
Transaction T is conflict-free with T' if they are not in conflict. Conflicting transactions require CC to ensure correctness at the particular isolation level upheld by the CC protocol.
Workload model. As shown in Fig. 2, we consider multi-thread transaction systems that either use transaction partitioners for bundled transactions or directly use CC for unbundled ones. Indeed, the majority of transaction systems fall into these two categories. They consist of (a) an input buffer I that receives transactions to be processed; and (b) thread-local buffers T1, ..., Tn, where each Ti (i ∈ [1, n]) contains the transactions to be executed by thread i.
For bundled workloads W, i.e., a set of transactions revealed to I all at once prior to execution, the practice is to first decide an assignment of the transactions in W to the thread-local buffers, by partitioning them based on their read/write sets [4,14,21,31,34,38,45]. All threads then execute the assigned transactions in their local buffers concurrently. By spending more preprocessing time, partitioners reduce transaction execution time with fewer aborts and less CC cost.
For transactions that arrive unbundled in I, they are periodically flushed to the thread-local buffers via a much lighter method than transaction partitioning, e.g., round-robin, random or lightweight ML-based assignment methods [41]. All transactions are executed concurrently with CC. Hence, the impact of CC is greater on unbundled transactions than on bundled workloads.

Concurrency control (CC). When transactions in the local buffers are executed concurrently, conflicting transactions may cause anomalies that the desired isolation level prohibits, impairing the correctness of the execution results. To this end, transaction systems use CC protocols [9,20,24,38,49,57] to prevent such anomalies. The cost of CC consists of (a) the overhead of the CC protocols, which is charged to every transaction even when it is involved in no conflict, and (b) the cost of preventing anomalies when conflicts happen, e.g., transaction blocking (waiting) time for locking-based CC and abort/retry costs for optimistic CC methods.
Transaction partitioning. Partitioning methods assign bundled transactions (a workload) to threads for concurrent execution by analyzing the conflicts between transactions. Partitioning is typically used as a preprocessing step and incurs computation cost [4].
Conceptually, the process by which transaction partitioners [4,14,33] partition a workload W can be modeled as a graph cut problem: by finding a minimum k-cut of the conflict graph of W, where vertices are transactions and an edge (T, T') indicates that T conflicts with T', one breaks W into k partitions, where the edge cut denotes conflicts between partitions. Each partition is executed serially by a dedicated thread and partitions are executed concurrently with CC.
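To illustrate the model, here is a sketch of the conflict-graph construction that such partitioners perform; the quadratic pairwise loop is for exposition only, as real partitioners index data items to find conflicting pairs faster.

```cpp
#include <functional>
#include <vector>

using ConflictTest = std::function<bool(int, int)>;  // do transactions i and j conflict?

// Adjacency lists of the undirected conflict graph over n transactions:
// vertices are transactions, edges connect conflicting pairs.
std::vector<std::vector<int>> buildConflictGraph(int n, const ConflictTest& conflict) {
  std::vector<std::vector<int>> adj(n);
  for (int i = 0; i < n; ++i)
    for (int j = i + 1; j < n; ++j)
      if (conflict(i, j)) {        // e.g., conflictSerializable above
        adj[i].push_back(j);
        adj[j].push_back(i);
      }
  return adj;
}
```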
One can also gather all conflicting transactions into a single set R such that the partitions become mutually conflict-free. We refer to R as the residual partition of W and to the remaining sets of transactions as CC-free partitions. It has been shown that by executing the CC-free partitions without CC first, and then executing R with CC after all CC-free partitions complete, one can achieve higher throughput for highly contended transactions [34].
In this setting, the objective of partitioning is to minimize the size of the residual set (or the k-cut) while balancing the partitions. For convenience, we represent a partitioning of W simply as (P1, ..., Pn, R), where the Pi's are CC-free partitions and R is the residual.

From Transaction Partitioning to Scheduling
We next propose transaction scheduling based on runtime conflicts.
Execution time. Transactions in practice often have varying execution times. The execution time of a transaction T, denoted by time(T), is the serial execution time of T by a single thread, from the start to the completion of T. It measures the duration (length) of T.
A variety of methods have been explored to estimate time(T), from a brute-force one that counts the reads and writes in T (e.g., Example 1) to advanced statistical and ML strategies [11,42,46,51].
For a set S of transactions, time(S) denotes the total execution time of the transactions in S, assuming that they are executed serially.
Transaction schedule. A schedule of a transaction workload W over n threads is a pair (f, ≺), where
• f is a function that partitions W into n + 1 disjoint sets Q1, ..., Qn and R such that (a) Q1 ∪ ... ∪ Qn ∪ R = W and (b) T and T' are not in runtime conflict (defined below) for any T ∈ Qi, T' ∈ Qj and i ≠ j; and
• ≺ is an order relation on W such that for each i ∈ [1, n], ≺ is a total order on Qi.
We refer to Qi as a conflict-free queue. Similar to conventional transaction partitioners, function f clusters the transactions in W by assigning them to threads. In contrast to partitioning, relation ≺ orders the transactions assigned to each of the queues. Moreover, the partition is guided by runtime conflicts (see below). As a consequence, the conflict-free queues may include residual transactions of the partitioning, and R only contains transactions that incur runtime conflicts with those in the queues.
Given a schedule (f, ≺) over n threads, each core serially executes the transactions assigned to it by f, in the order specified by ≺.
Runtime conflict. For a transaction T ∈ W that is assigned to queue Qi, denote by ts(T) the scheduled start time of T by schedule (f, ≺). It is defined as Σ_{T' ∈ pred(T)} time(T'), where pred(T) is the set of transactions in W that are assigned to the same Qi as T by f and are ordered prior to T by ≺ (recall that time(T') is the serial execution time of T'). Similarly, tc(T) denotes the scheduled completion time of T by (f, ≺) and is defined as ts(T) + time(T). The range [ts(T), tc(T)] is called the scheduled runtime of T.
Two transactions T and T' are in conflict at runtime by schedule (f, ≺) if they are in conflict and their scheduled runtimes overlap. If they are not in conflict at runtime, T and T' are called RC-free.
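A small sketch of these definitions, assuming per-transaction time estimates are given; the names are illustrative, not the paper's implementation.

```cpp
#include <vector>

struct ScheduledRuntime {
  double ts;  // scheduled start time: sum of time(T') over predecessors in the queue
  double tc;  // scheduled completion time: ts + time(T)
};

// Compute [ts(T), tc(T)] for the transactions of one queue, in ≺ order.
std::vector<ScheduledRuntime> scheduledRuntimes(const std::vector<double>& times) {
  std::vector<ScheduledRuntime> out(times.size());
  double clock = 0.0;
  for (size_t k = 0; k < times.size(); ++k) {
    out[k] = {clock, clock + times[k]};
    clock += times[k];
  }
  return out;
}

// T and T' are in runtime conflict iff they conflict AND their scheduled
// runtimes overlap; otherwise they are RC-free.
bool runtimeConflict(bool conflicting, const ScheduledRuntime& a, const ScheduledRuntime& b) {
  return conflicting && a.ts <= b.tc && b.ts <= a.tc;
}
```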
Consider a conventional transaction partitioner that splits workload W into a partition plan P = (P1, ..., Pn, R). Let (f, ≺) be a schedule of W over the n threads. Then we say that (f, ≺) refines partitioning P if Pi is a subset of the Qi partitioned by f for all i ∈ [1, n].
Intuitively, the schedule is more tolerant than the partitioning by including conventionally conflicting transactions in the RC-free queues. As a consequence, the residual of the schedule is a subset of the residual R of P, and hence more transactions can be executed concurrently.
Example 3: Continuing with Example 1, the queues Q1 and Q2 given there form a schedule for the workload W0. Moreover, the schedule refines the partitioning of W0 that consists of P1, P2 and residual {T5}. As shown there, the scheduling reduces the makespan of the execution of W0 from 20 to 14, and improves the throughput. □

Proactively Deferring Conflicting Transactions
For residual transactions or unbundled transactions that are directly executed using CC, we propose another method, called proactive transaction deferment, to further reduce the conflict penalties of CC.
Recall that each thread i executes the transactions in its local buffer Ti using CC so that anomalies are prevented when conflicting transactions are executed concurrently. Proactive deferment works by detecting whether runtime conflicts would happen when a thread is about to execute the next transaction in Ti after committing the current one; if so, it defers it and skips to the next transaction for execution. As will be seen in Section 5, the complexity of conflict detection is parameterized with knobs, to provide flexible trade-offs between the overhead of conflict detection and the conflict penalties incurred when runtime conflicts are not prevented in time.
For instance, as shown in Example 2, when core 1 detects that executing T2 would inflict a conflict with T5 being executed at core 2, it defers T2 and skips to T3, reducing the retry cost of T2.
We remark that proactive deferment is not a replacement for CC, since it does not aim to detect and prevent all runtime conflicts. Instead, it serves as a lightweight filter of the transactions in the thread-local buffers before passing them to the transaction execution engine. It makes CC more robust against transactions with varying complexities by virtue of its parametric cost.

A Prototype Transaction Scheduler
In this section, we present an overview of TS, a lightweight tool for reducing the runtime conflicts and CC penalties of transactions.
As shown in Fig. 2, TS works with existing transaction systems by serving as an intermediate layer between the transaction-to-thread assignment component and the execution engine. It consists of two components: (a) a transaction scheduling module (TsPart) that schedules and optimizes the partitioning generated by an existing transaction partitioner, and (b) a proactive transaction deferment module (TsDefer) that optimizes the thread-local buffers on the fly.
Scheduling (the TsPart module). Given a transaction workload W and a partitioning of W, TsPart computes a schedule (f, ≺) that turns the partitioning into RC-free queues (Q1, ..., Qn) and a residual R'. TsPart then assigns the transactions in Qi to the local buffer Ti of thread i in the scheduled order; all threads can execute the assigned transactions in parallel without CC if the time estimates are accurate. After all threads finish their local transactions, the transactions in R' are executed with CC by all threads (with TsDefer enabled; more below). By default, TsPart assigns the residual transactions to threads round-robin for its simplicity and load balance. One can also use other lightweight transaction-to-thread assignment methods supported by the underlying systems, e.g., random or ML-based [41].
To decide runtime conflicts for schedules, TS estimates the execution cost of the transactions in W following, e.g., [11,42,46,51]. TsPart does not rely on the actual transaction execution times; instead, it is only sensitive to the relative lengths of transactions. Hence, any estimates that roughly preserve the relative costs of transactions suffice. By default, TsPart uses execution histories to coarsely estimate costs: if T is instantiated with the same parameters as some T' in the history, and if T and T' are based on the same template (e.g., stored procedure), then TsPart uses the cost of T' as an estimate for T. When no T' has the same parameters as T, TsPart picks a T' with parameters close to those of T as a coarse estimate. For cases when execution histories are not available, TsPart adopts the partial dry-run approach used by, e.g., deterministic databases [4] to generate the estimates. The idea is to execute (samples of) transactions partially such that no writes are physically executed during the dry-run; this has also been shown effective for deducing the access sets of transactions for partitioning [4]. In the extreme case, TS uses the sizes of the access sets of the transactions as a fallback.
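A minimal sketch of such a history-based estimator; the key encoding and the parameter distance below are illustrative assumptions, not the paper's implementation.

```cpp
#include <cmath>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Params = std::vector<long>;

struct History {
  // (template name, instantiated parameters) -> observed serial execution time
  std::map<std::pair<std::string, Params>, double> cost;

  // Exact template+parameter hits reuse the recorded cost; otherwise the
  // closest recorded parameters of the same template give a coarse estimate.
  double estimate(const std::string& tmpl, const Params& p) const {
    auto it = cost.find({tmpl, p});
    if (it != cost.end()) return it->second;
    double best = -1.0, bestDist = 0.0;
    for (const auto& [key, t] : cost) {
      if (key.first != tmpl || key.second.size() != p.size()) continue;
      double d = 0.0;
      for (size_t i = 0; i < p.size(); ++i)
        d += std::abs(double(key.second[i] - p[i]));
      if (best < 0.0 || d < bestDist) { best = t; bestDist = d; }
    }
    return best;  // -1.0 signals "no history": fall back to dry-runs/access-set size
  }
};
```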
To cope with inaccurate or even missing estimates, by default TS uses CC and TsDefer (more below) to guard the execution of RC-free transactions against potentially overlooked conflicts. This ensures that TS always upholds the desired isolation level no matter how bad the estimates are. As will be shown in Section 6, TS still benefits from reduced conflicts and more balanced workloads among the cores via scheduling, and yields higher throughput, especially for transactions with runtime skewness.
We will study the core problem underlying TsPart in Section 4.
Proactive deferment (the TsDefer module). For unbundled transactions that are directly assigned to the thread-local buffers upon arrival, or transactions whose runtime estimates are unavailable, TS uses TsDefer to reduce conflict penalties by further reducing runtime conflicts. TsDefer works by altering the ordering of the transactions in the thread-local buffers during execution time, via carefully designed lightweight operations. It works as a dynamic filter of the transactions in the thread-local buffers: it passes transactions that are not likely to cause runtime conflicts to the execution engine, and defers problematic ones.
In a nutshell, TsDefer checks, prior to passing a transaction T for execution (with CC) at thread i, whether T is in conflict with transactions that are (very likely) being or about to be executed by some other thread. If so, it defers the execution of T by moving T to the end of the transaction queue in the local buffer of thread i; otherwise it passes T to the transaction execution engine.
We will discuss the implementation of TsDefer in Section 5.

Remarks.
(1) TS aims to improve existing transaction systems, no matter whether they are partition-based [4,14,16,31,33,34,37,38] or CC-based [43,45]. It neither imposes any new restrictions on nor makes assumptions about how these systems work. This enables existing systems to deploy TS without changes, to benefit from the runtime conflicts reduced by transaction scheduling and deferment, via reduced CC overhead and conflict penalties.
(2) TS also uses TsDefer to reduce the execution cost of the unscheduled residual R' returned by TsPart for bundled transactions.
(3) Neither TsPart nor TsDefer is fixed to a specific isolation level such as serializability; instead, they work with arbitrary isolation levels that the underlying systems uphold, by observing conflicts according to the preferred isolation level.

Limitations. Because of its non-intrusive design, TS (TsPart and TsDefer) inherits some limitations of the systems it optimizes.
(1) Access sets. Transaction partitioners require that the access sets of transactions are known upfront so that they can deduce conflicts between transactions. TsPart inherits this prerequisite when scheduling the partitioning of bundled transactions. As a result, neither partitioners nor TsPart target random client-driven transactions; instead, they target transactions in the form of stored procedures or hard-coded templates, which are prevalent in, e.g., banking, e-commerce and business applications. Due to the importance of transaction partitioning, techniques have been developed to identify access sets in various scenarios [4,36]. Moreover, for unbundled transactions whose access sets are not known beforehand, we can fall back to CC-based approaches and use TsDefer instead. Another inherited limitation is that TS executes range queries with CC, since partitioners do not optimize range queries, for which read/write sets are not available. On the positive side, TsPart inherits and reuses the conflict graphs constructed by partitioners [14,34] without reconstruction, as will be seen in Section 4.
(2) Application-specified dependencies. Similar to CC-based approaches for unbundled transactions, TsDefer operates only on the transactions visible during execution time; hence, it has no control over the global order of transaction execution to enforce transaction (e.g., causal) dependencies implied by application logic. A possible way of mitigating this is to integrate consistency protocols with CC so that we have both isolation and causal consistency guarantees; TsDefer could be extended for such cases by deferring a transaction only when a limited number of transactions depend on it.
Different from CC and TsDefer, transaction partitioners and TsPart can readily incorporate transaction dependencies by enforcing the dependencies in partitions and during scheduling.
(3) Generalization. In principle, TsPart is not limited to the in-memory setting; it can be applied to shared-nothing distributed systems. In contrast, TsDefer cannot be trivially generalized, as it relies on lightweight probing operations to detect runtime conflicts at execution time; such operations would incur too much overhead in a shared-nothing architecture due to the network latency involved.

Transaction Scheduling
In this section, we formulate the transaction scheduling problem, settle its complexity bound, and develop an efficient algorithm for it. The results serve as the foundation of the TsPart module.
Problem & complexity. The key to TsPart is to compute transaction schedules for transaction partitioners. Referred to as the transaction scheduling problem, this is abstracted as follows:
• Input: A transaction workload W, the number n of threads, and a partitioning of W, i.e., (P1, ..., Pn, R).
• Output: A schedule (f, ≺) of W over the n threads that minimizes both (a) the makespan of the RC-free queues and (b) the total execution time of the residual.
Here (a) the makespan of the RC-free queues is their concurrent execution time, calculated as the maximum of the serial execution times among all queues; and (b) minimizing the amount of work in the residual maximizes the work handled via RC-free queues. With both (a) and (b), we aim to find a schedule that minimizes the total execution time of all the transactions in workload W and hence improves the throughput for executing W.
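In symbols, the two criteria can be written as follows; this is a reconstruction from (a) and (b) above, and the exact way the two criteria are combined is not reproduced here.

```latex
\min_{(f,\prec)} \ \max_{1 \le i \le n} \mathrm{time}(Q_i)
\qquad \text{and} \qquad
\min_{(f,\prec)} \ \mathrm{time}(R')
```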
As opposed to transaction partitioning problems [14,34,38], this bi-criteria optimization problem is more challenging in that it considers both runtime conflicts and execution time.
Theorem 1: The transaction scheduling problem is NP-complete. It is already NP-hard to decide whether there exists a schedule such that (a) it has non-empty RC-free queues; or (b) the makespan of the schedule is no larger than a given number, even when n = 3.
The lower bounds hold even when all transactions take a unit time. □

Proof sketch: We give an NP algorithm that guesses a schedule for W over the n cores, and checks in PTIME whether the schedule satisfies the conditions. We prove the NP-hardness of (a) by reduction from the maximum independent set problem [32], and of (b) by reduction from the bounded independent sets problem [27]. □

Nonetheless, below we develop an efficient algorithm, denoted by TSgen, to compute transaction schedules for TsPart. Algorithm TSgen works in two settings. Given a workload W of transactions, (1) it can take as input a transaction partition plan and turn it into a schedule for W, as shown in Algorithm 1; (2) it can also compute a schedule for W from scratch. Below we first present TSgen for case (1), and then show how it handles case (2).
From partitions to schedules. Given a workload W and a partition plan P = (P1, ..., Pn, R) of W, TSgen refines P into a schedule (f, ≺) for W. It preserves the transactions of each cluster Pi in queue Qi, makes as many residual transactions in R RC-free as possible, and balances the workloads of the RC-free queues (Q1, ..., Qn).
TSgen (Algorithm 1) starts with empty RC-free queues Qi and an empty set R'. It examines the residual transactions in R and decides whether to merge them into the RC-free queues and, if so, how to do it. In the process, it schedules the transactions in the Pi's and preserves their assignment by P, i.e., a transaction in Pi is added to queue Qi at thread i. Along the way, TSgen generates the RC-free queues. The unscheduled residual transactions of R remain in R'.
More specifically, algorithm TSgen works as follows. Initially, the set R' and the RC-free queues Qi are empty for all i ∈ [1, n] (lines 1-2). TSgen iteratively expands the Qi's by examining the transactions in R one by one, following an ordering of R (lines 4-14; by default, TSgen picks a random ordering of R for simplicity). For each transaction T* in R (line 5), it checks whether T* can be merged into the input CC-free partition Pj with the smallest total execution time (line 6), in order to balance the workload of the RC-free queues.
To do this, it first finds all transactions in Pi (i ∈ [1, n], i ≠ j) that are in conflict with T*, appends them to the corresponding RC-free queue Qi, and removes them from Pi (lines 7-9). It then checks whether appending T* to RC-free queue Qj would cause a runtime conflict with transactions that are already in some queue Qi (i ≠ j), via procedure ckRCF (omitted). If not, it appends T* to Qj and adjusts the load (lenj) of thread j by adding time(T*) (lines 10-11). Otherwise, TSgen decides that T* is a residual transaction that cannot be scheduled, and moves it into R' (line 12).
After all transactions in R are examined and assigned to either one of the RC-free queues or R', TSgen appends the remaining transactions in each partition Pi to the corresponding RC-free queue Qi (lines 13-14). It returns the RC-free queues (Q1, ..., Qn) and the set R' of unscheduled residual transactions (line 15).

Example 4: Given the partitioning of W0 in Example 1, i.e., P1 = {T1, T2, T3}, P2 = {T4} and residual R = {T5}, algorithm TSgen assigns T5 of R to P2, which turns into RC-free queue Q2 = ⟨T4, T5⟩, with Q1 = ⟨T2, T1, T3⟩, exactly the same schedule as in Example 3. □

To complete TSgen, we next present how it identifies the transactions that conflict with T* in each input CC-free partition (lines 7-9).
More specifically, to identify the transactions in each input partition that are in conflict with T*, TSgen makes use of the conflict graph G of W. Here G is an undirected graph in which (i) the nodes are the transactions of W, and (ii) two transactions are connected by an edge if they are in conflict with each other. Note that variants of the conflict graph have already been used in transaction partitioners (e.g., [14,59]), and are found efficient to construct and effective in practice; TSgen re-uses their conflict graphs when looking up conflicts. TSgen identifies the transactions that are in conflict with T* by searching the neighbors of T* in G and checking whether they are in the input CC-free partitions (line 8). It appends such conflicting transactions in Pi to RC-free queue Qi (line 9), and checks whether T* incurs runtime conflicts with the expanded Qi's (ckRCF; line 10). It appends T* to Qj if there is no runtime conflict.
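Putting the pieces together, the following is a condensed sketch of TSgen's main loop under the description above; Algorithm 1's exact pseudocode is not reproduced, and ckRCF is abstracted as a callable.

```cpp
#include <algorithm>
#include <deque>
#include <functional>
#include <unordered_set>
#include <vector>

struct Sched {
  std::vector<std::deque<int>> Q;  // RC-free queues Q1..Qn (transaction ids)
  std::vector<int> Rp;             // R': unscheduled residual transactions
};

// P: input CC-free partitions; R: input residual; adj: conflict graph of W;
// t: per-transaction time estimates; ckRCF(txn, j, s): would appending txn
// to s.Q[j] cause a runtime conflict with some Q[i], i != j?
Sched tsgen(const std::vector<std::vector<int>>& P, const std::vector<int>& R,
            const std::vector<std::vector<int>>& adj, const std::vector<double>& t,
            const std::function<bool(int, int, const Sched&)>& ckRCF) {
  const int n = P.size();
  Sched s{std::vector<std::deque<int>>(n), {}};
  std::vector<double> len(n, 0.0);               // load per thread (lines 1-2)
  std::vector<std::unordered_set<int>> left(n);  // transactions still in P_i
  for (int i = 0; i < n; ++i) {
    left[i].insert(P[i].begin(), P[i].end());
    for (int u : P[i]) len[i] += t[u];           // initial load: time(P_i)
  }
  for (int txn : R) {                            // examine R in order (lines 4-14)
    // partition/thread j with the smallest total execution time (line 6)
    int j = int(std::min_element(len.begin(), len.end()) - len.begin());
    for (int u : adj[txn])                       // conflicting neighbors of txn
      for (int i = 0; i < n; ++i)
        if (i != j && left[i].erase(u))          // move u from P_i to Q_i (lines 7-9):
          s.Q[i].push_back(u);                   // u stays at thread i, len unchanged
    if (!ckRCF(txn, j, s)) {                     // RC-free: schedule it (lines 10-11)
      s.Q[j].push_back(txn);
      len[j] += t[txn];
    } else {
      s.Rp.push_back(txn);                       // cannot be scheduled (line 12)
    }
  }
  for (int i = 0; i < n; ++i)                    // leftovers of P_i (lines 13-14)
    for (int u : P[i])
      if (left[i].count(u)) s.Q[i].push_back(u);
  return s;                                      // (Q1..Qn, R') (line 15)
}
```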
As will be seen in Section 6, TSgen improves the throughput by 131% on average, and up to 294%; the overhead it adds to transaction partitioners is less than 5% of that of the partitioners it optimizes.
Scheduling without an input partitioning. Algorithm TSgen can also compute a schedule for workload W in the absence of a partition plan P. More specifically, given a transaction workload W, we simply treat W as the residual and run TSgen with empty CC-free partitions, i.e., P1 = ... = Pn = ∅. TSgen then works in the same way as when it refines a non-empty partitioning of W.
Complexity. Algorithm TSgen can be implemented in O(|W| + (n − 1)|R|) time, where |W| (resp. |R|) is the number of transactions in W (resp. residual R) and n is the number of threads, by reusing the conflict graph from the partitioners. That is, TSgen is linear in |W| when the number of threads is a constant. Indeed, (a) each transaction in the CC-free partitions is examined only once to decide its assignment, and (b) each transaction in R is examined by ckRCF in O(n) time.
Remark. TSgen looks up conflicts between transactions in the conflict graph G of W. However, it does not necessarily need to construct G during scheduling. Typically, transaction partitioners already build G [14] or its variants [34] for partitioning, in order to minimize cross-partition conflicts, and TSgen re-uses them from the partitioners. TSgen aims to strike a balance between the scheduling cost added to the partitioners and the quality of the schedules computed. It reduces the execution cost of the residual by preserving conflict-free transactions of the partitions Pi in the RC-free queues and moving residual transactions of R to the RC-free queues.

Proactive Transaction Deferment
In this section, we present how TsDefer reduces runtime conflicts during execution time, for the unscheduled residual R' from TsPart and for unbundled transactions that are directly assigned to threads.
TsDefer acts directly on the thread-local buffers and reduces runtime conflicts for CC by proactively deferring transactions that are likely to cause runtime conflicts with other transactions being executed. Its purpose is to reduce conflict penalties. Nonetheless, unlike transaction partitioners and TsPart, TsDefer inflicts extra overhead on the CC protocols during execution. Hence, it needs to be extremely lightweight while remaining effective, since otherwise its overhead would cancel out any benefit from the reduced conflicts.
This imposes challenges on the design of TsDefer. In particular, tracking and sharing execution progress among threads are essentially contended operations, while locking is not an option due to its large overhead on execution. Moreover, a thread cannot afford to simply inspect all concurrent transactions to decide runtime conflicts, due to the overhead of reading their read/write sets.
To tackle these challenges, TsDefer adopts two techniques.
(a) It uses a lock-free structure for all threads to keep track of their execution progress and to look up the progress of other threads for conflict detection. This avoids locking overhead and data races among threads when tracking and sharing progress.
(b) To suppress the overhead of checking conflicts among transactions on the fly, TsDefer randomly probes data items of the read/write sets of the active transactions at other threads a limited number of times, and defers the transaction with a certain probability if conflicting items are found. The crux is that each data item probe takes only constant time in a lock-free manner, independent of both the size of the transactions and the number of threads.

Runtime progress tracking. As shown in Fig. 3, TsDefer represents the transaction queue in each thread-local buffer as an array of transaction IDs; the buffer is shared globally across all threads. Each thread uses two pointers, headp and tailp, that point to the next transaction to be executed and to the end of the transaction list, respectively. Both the array and the pointers are writable by their own thread only and are read-only for other threads. TsDefer implements three operations for each thread i (a minimal sketch follows the list):
• regPos, which updates the progress of thread i by moving headp to the next transaction ID once the current one commits;
• lookup, which, upon each invocation, returns in constant time a data item in the write set of some active transaction of another thread, where a transaction is active for thread i if it (a) is at thread j (j ≠ i) and (b) is pointed to by the headp of thread j; and
• defer, which moves a transaction to the end of the queue, which maintains the deferred transactions of the current thread.
Proactive deferment. With these operations, TsDefer works as follows.
(1) When thread i is about to execute a new transaction T in its local buffer Ti, it checks whether T would cause a runtime conflict with active transactions at other threads, by invoking a bounded number #lookups of lookup operations.
(2) Let d be the number of distinct data items retrieved by lookup that are not accessed by T, i.e., that do not witness a conflict with T. If #lookups − d reaches a threshold (typically 1), TsDefer decides that T is likely to cause a runtime conflict. In such cases, thread i defers T via the defer operation with a probability of deferp%.
(3) If T is deferred, thread i then moves on to the next transaction and updates its progress via regPos. In addition, it records the deferred T at tailp, and then moves tailp to the next slot.
(4) If T is not deferred, thread i goes on to execute T. If T subsequently commits successfully, thread i updates its progress via regPos. If T aborts, thread i retries it immediately until it commits; during this period, it does not update its progress.
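A sketch of the decision in steps (1)-(2), assuming that each lookup returning an item also accessed by T counts toward #lookups − d (the sampled items are treated as distinct); lookup() and the knob names are illustrative.

```cpp
#include <functional>
#include <random>
#include <set>
#include <string>

// lookup() returns one sampled item from the write set of an active remote
// transaction; an item that T also accesses witnesses a runtime conflict.
bool shouldDefer(const std::set<std::string>& accessSetOfT,
                 const std::function<std::string()>& lookup,
                 int kLookups, int kThreshold, double deferP,
                 std::mt19937& rng) {
  int hits = 0;  // lookups witnessing a conflict with T (#lookups - d)
  for (int i = 0; i < kLookups; ++i)
    if (accessSetOfT.count(lookup())) ++hits;
  if (hits < kThreshold) return false;              // unlikely to conflict: execute T
  return std::bernoulli_distribution(deferP)(rng);  // defer with probability deferp%
}
```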
Lock-free implementation. We next describe the implementation of the progress-tracking structure. The key idea is to keep the overhead low and controllable; the rationale is that TsDefer does not replace CC, i.e., it does not detect all runtime conflicts during execution time. In light of this, it implements both operations regPos and lookup with C++ atomic builtins [1] in a lock-free manner. Note that lookup may read slightly stale progress due to the lock-free design. However, we find that such staleness has negligible implications, as TsDefer is supposed to be lightweight and does not aim to identify all potential runtime conflicts. For long-run transactions whose conflicts are costly, one can compensate for this by instructing lookup to check transactions that are further in the future w.r.t. the one it sees at headp, within a bounded number of steps.
To ensure that each lookup has constant complexity, each thread maintains the predicted access sets of all other transactions. These do not have to be exact, since some of the read/write items cannot be known without execution; for such cases, TsDefer only records the estimated access sets based on the constants and template of the transaction. For instance, the warehouse id (w_id), district id (d_id) and customer id (c_id) of an instantiated Payment transaction in TPC-C largely determine its access sets; similarly, the access sets of YCSB transactions can also be accurately inferred from the instantiated parameters (keys) in most cases.
Each invocation of lookup then randomly picks a thread ID j and an index k via reservoir sampling; it returns the k-th item in the write set of the transaction that is currently pointed to by the headp of thread j. Note that each invocation requires only one read of the global structure, to retrieve the transaction ID determined by headp, say T*. Then lookup simply reads the data item of T* indexed by k in its local copy of the read/write set of T*.
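A sketch of lookup under this description: one read of the shared structure (the remote thread's active transaction ID), then an index into a thread-local copy of that transaction's predicted write set. ThreadQueue is the structure sketched above; localWsets is an assumed thread-local cache.

```cpp
#include <random>
#include <string>
#include <vector>

extern std::vector<ThreadQueue*> queues;                  // shared, one per thread
extern std::vector<std::vector<std::string>> localWsets;  // txn id -> predicted write set

std::string lookup(int self, int nthreads, std::mt19937& rng) {
  int j = std::uniform_int_distribution<int>(0, nthreads - 2)(rng);
  if (j >= self) ++j;                      // pick any thread but this one
  int star = queues[j]->active();          // single shared read: active txn id T*
  const auto& ws = localWsets[star];       // local predicted write set of T*
  if (ws.empty()) return {};
  int k = std::uniform_int_distribution<int>(0, int(ws.size()) - 1)(rng);
  return ws[k];                            // the k-th item of T*'s write set
}
```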
Example 5: Recall from Example 2 that thread 1 holds ⟨T1, T2, T3⟩ and thread 2 holds ⟨T4, T5⟩ in their local buffers, and assume deferp% = 100%. With TsDefer, thread 1 can defer T2 with a probability of 50% with one lookup (i.e., #lookups = 1), and defer T2 for certain with only two lookups (i.e., #lookups = 2). Indeed, for thread 1 and T2, the only active transaction is T5, which has two data items, x1 and x5. Hence one lookup has a 50% chance of returning x1, which witnesses a conflict with T2, while two lookup calls return x1 for certain, triggering TsDefer to defer T2. Note that reading the access set of T5 in full would already cost 6 reads of data items, a higher overhead than the deferment of T2. □

Parameters and trade-offs. Observe that the accuracy and overhead of TsDefer are parameterized by the following two parameters:
• #lookups: the number of lookup operations invoked; and
• deferp%: the probability of deferring a candidate.
The overhead of TsDefer is determined by the number of lookup operations, i.e., #lookups. With a larger #lookups, more potentially contended operations at remote threads can be discovered; hence, the chance of not deferring a transaction that is in runtime conflict with other active transactions becomes lower. This helps reduce the CC cost (abort/retry penalty) via reduced conflicts among the transactions being executed. The probability deferp% allows TsDefer to adapt to varying contention levels: for extremely highly contended workloads, TsDefer uses a relatively lower deferp% to avoid deferring an excessive number of transactions.
With these tunable parameters, TsDefer can adapt to transaction workloads with varying characteristics, by trading deferment overhead for the accuracy of conflict reduction for CC. When conflict penalties (abort/retry costs) are a larger factor, e.g., for long-run transactions or transactions with heavy application-level abort penalties, one may prefer a larger #lookups for a higher chance of identifying a runtime conflict, avoiding a costly abort/retry at slightly higher overhead. On the contrary, for light and simple transactions, e.g., key accesses over a key-value table as in YCSB, one may prefer a smaller #lookups, since aborts/retries are cheap and the overhead is more noticeable. In the extreme case, one can disable TsDefer with #lookups = 0, avoiding any overhead to CC at all. Such a flexible trade-off between overhead and conflict penalty reduction is particularly valuable for transactions that have complex application logic and hence varying abort/retry costs.

Experimental Study
Using benchmarks, we evaluated the effectiveness of TS in improving both partitioning-based systems for bundled transaction workloads (Section 6.2) and CC-based systems for unbundled transactions (Section 6.3), in particular for transactions with varying runtime skewness. We specify the experimental settings in Section 6.1 and summarize the evaluation results in Section 6.4.

Implementation and Experimental Setup
Systems. We implemented a prototype of TS as described in Sections 3-5, and integrated it with DBx1000 [2,56] as the transaction execution engine, which implements multiple CC protocols. Since DBx1000 directly initializes the thread-local buffers with instantiated unbundled transactions, to evaluate the effectiveness of TS for partitioning-based systems we extended it by porting external transaction partitioners into DBx1000, making it a full testbed for both partitioning-based and non-partitioning CC-based transaction processing methods. This is done by initializing the transaction buffer of each thread in DBx1000 with a partition generated by the partitioner, so that transactions are executed according to the partitioning. When TS is enabled for transaction partitioners, each thread-local buffer receives the transaction queue generated by the partitioner and TS.

TS instances. We deployed five instances of TS, depending on whether it partitions transactions and which partitioner it employs:
• TS[S]: an instance of TS that targets systems with partitioning; it employs Strife [34], a recent partitioner that has been shown effective for highly contended transaction workloads [34]. We used its open-source implementation from [3].
• TS[C] and TS[H]: instances of TS that employ the general partitioner of [14] and the hard-coded partitioner H of [33], respectively.
• TS[0]: an instance of TS that computes schedules from scratch without an input partitioning, by treating the entire workload as residual (Section 4).
• TS[CC]: an instance of TS targeting unbundled transactions that are directly handled by DBx1000's default transaction-to-thread assignment with CC; it only uses TsDefer.
Among the three partitioners, Strife generates a partitioning of the transactions with an explicit residual set, while the partitioners of [14] and [33] do not. For the partitionings of the latter two, TS first extracts a residual set that contains all transactions that are in conflict with some transaction in another partition, and then carries out the scheduling as it does with Strife.
Baselines. We compared the performance of the scheduling-based TS[S], TS[C] and TS[H] with their partitioning counterparts, Strife (S) [34], the partitioner of [14] and the hard-coded partitioner H [33], respectively, for bundled transactions that are known to the partitioners prior to execution. Both the partitioner of [14] and Strife [34] are general partitioners that work with any given transaction workload with known read/write sets. H is hard-coded for the TPC-C [48] and YCSB [12] workloads, and is not a full-fledged partitioner.
To study the effectiveness of TS (TsDefer) for unbundled transactions that are not known beforehand, we also compared TS[CC] with DBx1000's default configuration that executes transactions using CC without partitioning (denoted by D-CC).
Configuration. For all TS instances, all transactions are executed with CC to guarantee correctness. This is a suboptimal implementation of TS and is in favor of the baselines, since one can retain the lower cost of CC-free execution of the RC-free queues by enforcing the scheduled order via, e.g., the dependency tracking of [35,36]. TsPart uses the warm-up dry-run trials of DBx1000 as the source of histories to derive coarse cost estimates for transactions. The isolation level of all tests is set to serializability.

Table 1: Workload and system parameters: blue/yellow for TPC-C/YCSB parameters; green for system parameters; red for runtime skewness and I/O latency; and gray for TsDefer parameters. When varying a parameter, we use the default for all the other parameters.
Benchmarks. We used two transaction benchmarks.

TPC-C [48]. We tested with full TPC-C transactions. Since the built-in TPC-C implementation in DBx1000 contains only the NewOrder and Payment transactions without insertions (updates only), we extended it to cover full TPC-C. In particular, we enabled insertions in NewOrder and Payment, and added the OrderStatus, StockLevel and Delivery transactions to DBx1000 by following [6].
To evaluate the impact of contention levels, we also enabled TPC-C to vary the originally hard-coded percentage of transactions that cross multiple warehouses. We set this percentage to 25% and #whn/#core to 2 when varying the number of cores (#core), to simulate workloads with high contention, where #whn is the number of warehouses.

YCSB [12]. We also tested with the YCSB benchmark. We used the built-in YCSB driver in DBx1000, which implements the YCSB core workload A [55]. We used a single table of 20M records, where each record is 128 bytes in size and is accessed by a unique key; each transaction accesses 16 records. Contention in YCSB is configured by a Zipfian distribution that controls data skewness; we varied the Zipfian parameter theta from 0.7 to 0.9 (theta = 0.8 by default).
Extension with runtime skewness. Both TPC-C and YCSB transactions are short, e.g., a YCSB transaction consists of key accesses in a key-value table. In practice, transactions may have varying lengths (i.e., runtimes) and skewness in their distribution. To test more complicated transactions, we extend both TPC-C and YCSB by "lower bounding" the runtime of transactions: assuming that a transaction T is lower bounded by min_T, if the actual execution time of T exceeds min_T, then nothing changes and T commits as usual; otherwise, T delays its commit until its total runtime reaches min_T.
More specifically, we extend TPC-C and YCSB with three parameters, minT, k and theta, to emulate runtime skewness. Transactions are assigned a minimum runtime randomly drawn from the range [minT·avg, k·minT·avg], following a Zipfian distribution with skewness parameter theta. Here avg is the average transaction runtime and minT (≤ 1) is a small coefficient such that minT·avg serves as the "unit" time of transaction execution, while the integer k constrains the maximum "lower bound" runtime. By varying minT, k and theta, we emulate different runtime patterns of the transactions. We set minT small, e.g., as low as 1/8, so that the original short-run TPC-C and YCSB transactions are included as a subclass of the workloads in all tests. These parameters improve the expressiveness of TPC-C and YCSB; they do not restrict the benchmarks.
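A sketch of how such runtime lower bounds could be drawn under the stated parameters (minT, k, theta); the benchmark's actual sampler may differ.

```cpp
#include <cmath>
#include <random>
#include <vector>

// Draw a bucket b in [1, k] with P(b) proportional to 1/b^theta (Zipfian),
// so most transactions get the smallest bound and a few get long ones.
int zipfBucket(int k, double theta, std::mt19937& rng) {
  std::vector<double> w(k);
  double sum = 0.0;
  for (int b = 1; b <= k; ++b) sum += (w[b - 1] = 1.0 / std::pow(b, theta));
  double u = std::uniform_real_distribution<double>(0.0, sum)(rng);
  for (int b = 0; b < k; ++b)
    if ((u -= w[b]) <= 0.0) return b + 1;
  return k;  // guard against floating-point rounding
}

// Minimum runtime of a transaction: a multiple of the unit time minT * avg.
double minRuntime(double avg, double minT, int k, double theta, std::mt19937& rng) {
  return zipfBucket(k, theta, rng) * (minT * avg);
}
```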
Extension with I/O latency. To evaluate the impact of I/O latency, we further extended DBx1000 with a new knob, similar to [47], that adds an artificial delay at transaction commit time to simulate I/O latency. The delays draw values from [0, kIO·minIO] following a Zipfian distribution with skewness parameter thetaIO, where (a) minIO is set to 5,000 CPU cycles, about 1/6 (resp. 1/8) of the average TPC-C (resp. YCSB) transaction runtime, and (b) kIO varies in [0, 100]. By varying kIO and thetaIO, we obtain various patterns of I/O latency in DBx1000. In particular, a larger kIO means longer worst-case I/O latency, and a higher thetaIO indicates a "longer-tail" latency distribution.
Metrics. We measure the performance of the systems by the following:
• throughput: the number of transactions committed per second; and
• #retry: the total number of retries per 100,000 transactions.
Experimental setup. The experiments were run on an AWS EC2 m5.8xlarge instance with 32 vCPUs and 128 GB of memory. Each experiment was run 3 times and the average is reported here. All the parameters, including their configuration ranges and default settings, are summarized in Table 1. When varying a parameter, we use the default configuration of all other parameters. By default, each bundle consists of 10,000 transactions.

TS on Partitioning-based Systems
We first evaluated the effectiveness of transaction scheduling in improving the throughput and #retry of partition-based systems.

We compared the performance of TS[S], TS[C] and TS[H] with their partitioner counterparts, respectively, to find out the improvement over transaction partitioners introduced by TS via scheduling. Varying each parameter, we tested the throughput and #retry of all methods on TPC-C, YCSB and their skewed extensions.
Throughput: scheduling vs. partitioning. We first compared the transaction throughput of the schedules computed by TS[S], TS[C] and TS[H] with that of the partitionings generated by their counterparts, respectively, with varying parameters. The results over YCSB and TPC-C are shown in Figures 4a-4h. (1) On average, the throughput of TS[S], TS[C] and TS[H] is higher than that of S, the partitioner of [14] and H, respectively (101% higher for H), up to 183%, 104% and 141%. The reason is two-fold.
(a) Higher concurrency. By scheduling (TsPart), TS achieves better balanced loads among the cores, yielding a higher level of concurrency. In contrast, existing partitioners can be sub-optimal in balancing the load, e.g., the average ratio of the largest partition over the smallest partition of S is 3.2 over YCSB, while this reduces to 1.2 after deploying TsPart atop S. Moreover, by executing R' with CC and proactive transaction deferment (TsDefer), TS further balances the workloads across the threads.
(b) Reduced #retry. Furthermore, by combining TsPart and TsDefer, this higher level of concurrency does not incur larger CC costs, due to the ability of TS to reduce runtime conflicts. Indeed, in contrast to the normal pattern that higher concurrency means a larger #retry, the #retry of all systems even decreases with TS. For instance, the #retry of TS[S], TS[C] and TS[H] is consistently lower than that of their partitioner counterparts in all cases (e.g., Fig. 4i). On average, TS reduces the #retry of S, the partitioner of [14] and H by 49.7%, 43.6% and 33.6% on YCSB, respectively, and by 54.4%, 36.7% and 53.9% over TPC-C.
(2) The effectiveness of TS is robust w.r.t. CC protocols; it is even more evident with added cores, as shown in Figures 4b-4c. For instance, over YCSB, TS[S] consistently achieves over 203% higher throughput than S with OCC, SILO or TICTOC. The gap even increases with a larger #core, e.g., TS[H] improves H by 133% with 8 cores, and by 294% with 32 cores.
(3) Overall, TS is more effective for workloads with higher contention; e.g., the throughput improvement of TS[H] over H increases from 143% to 225% when theta varies from 0.7 to 0.9 over YCSB, as shown in Fig. 4a; similarly, when the cross-warehouse percentage increases from 15% to 35% for TPC-C, the throughput improvement of TS[C] over its partitioner counterpart increases from 80.5% to 98.2%, as depicted in Fig. 4g.
(4) Without an input partitioning, TS[0] still achieves throughput on average 85.8%, 65.1% and 69.7% higher than S, C and H over TPC-C, respectively, and 184%, 29.1% and 101% higher over YCSB, as shown in Fig. 4. Compared to partitioners, TS[0] achieves better load balancing via scheduling, by treating all transactions as residual; in addition, it incurs few retries due to reduced conflicts. This suggests that TS[0] is an option for non-partitioned workloads or for partitioners that do not produce a residual.
(5) To analyze the individual contributions of TP and TD to the effectiveness of TS, we tested TS[S] against TP[S] and TD[S], where TP[S] is TS on S with TD disabled, and TD[S] is TS with only TD enabled to execute the partitions of S; similarly for C and H. As shown in Fig. 4j, over YCSB, on average TS[S], TP[S] and TD[S] improve the throughput of S by 203%, 108% and 73.2%, respectively; similarly for C and H, and over TPC-C. This shows that for bundled workloads, TP plays a bigger role in improving the performance. Moreover, TP and TD perform the best when working together, e.g., even better than the sum of the improvements by the two separately on S.
Runtime skewness. TS improves the throughput of all systems on YCSB by 171% on average with the transaction-length variance parameter set to 32, and by 186% when it is set to 64. This is because the quality of partitions degrades with transactions of varying lengths, whereas with TS the systems can adapt to the variance and skewness in transaction runtime. Moreover, longer transactions inflict larger conflict penalties, and TP is more effective in reducing their runtime conflicts.
I/O latency. We next evaluated the impact of I/O latency. Varying maxIO and θIO, we tested the throughput of all methods; partial results over YCSB and TPC-C are shown in Figures 4k-4l. We find that TS still improves all three partitioners in all cases. Although the raw throughput degrades for all methods with larger maxIO or smaller θIO (i.e., a less long-tailed distribution), the improvement brought by TS is relatively stable, e.g., consistently around 205% for S over YCSB; similarly for the other partitioners and over TPC-C. This is because while latency reduces the accuracy of scheduling by TP, it also increases the cost of retries; and TS consistently reduces the retries of all partitioners (e.g., Fig. 4l) and balances the partitions in the presence of long-tail latency. Hence, the effectiveness of TS remains robust against I/O latency. This suggests that TS can also work with transactions involving I/O or network latency.
Overhead. We also tested the overhead that TP adds to the transaction partitioners, measured as the ratio of the runtime of TSgen to the partitioning time of the partitioners, denoted by overheadR. For workloads consisting of 100,000 transactions, we find that the average overheadR of TP over S and C is 4.1% and 3.7% over YCSB, respectively, and 4.6% and 4.4% over TPC-C. This verifies that TP has moderate overhead.
Cost estimation. By default, TP uses the warm-up process of DBx1000 to assess the cost of transactions when scheduling (recall Section 3). To assess the accuracy of scheduling with such coarse estimates, we measured (a) the average scheduled percentage of residual transactions that were merged into RC-free queues; and (b) the #retry when executing the RC-free queues, with and without employing TD. The results are shown in Table 2. We find that TS schedules a decent percentage of residual transactions, e.g., 30.7% and 62.3% over TPC-C and YCSB, respectively. Due to the coarse estimates used by TS, it is understandable that RC-free queues incur conflicts and retries. Nonetheless, after employing TD, TS significantly reduces the #retry of the RC-free transactions. For instance, TS[S], TS[C] and TS[H] reduce the #retry of the RC-free transactions by 66.9%, 63.1% and 62.2% over YCSB, and by 53.0%, 41.0% and 39.2% over TPC-C, respectively. As shown earlier, this even makes the #retry of all systems much lower than without scheduling, although they have a much higher level of concurrency and better load balancing with TS deployed.
Although the percentage of scheduled transactions over TPC-C is lower than over YCSB, the throughput improvement of TS over TPC-C is comparable to that over YCSB. This is because even with a moderate number of scheduled residual transactions, TS is already able to make the TPC-C partitions balanced.
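To see why coarse cost estimates matter here, recall that a runtime conflict requires both overlapping execution windows and contended accesses; mis-estimated windows can let transactions that the schedule meant to serialize overlap after all. Below is a minimal sketch of this check under our own naming (the Txn fields and runtime_conflict are illustrative, not the system's API):

```python
from dataclasses import dataclass, field

@dataclass
class Txn:
    tid: int
    start: float   # estimated start time within its queue
    end: float     # estimated finish time (coarse, e.g., from warm-up runs)
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)

def runtime_conflict(t1: Txn, t2: Txn) -> bool:
    """t1 and t2 conflict at runtime only if their estimated execution
    windows overlap AND they contend on some item (write-write or
    read-write). Disjoint windows let conventionally conflicting
    transactions run conflict-free; under-estimated costs shift the
    windows and may re-introduce overlap, hence the residual retries."""
    if t1.end <= t2.start or t2.end <= t1.start:
        return False  # serialized by the schedule: no runtime conflict
    if t1.writes & t2.writes:
        return True   # write-write contention
    return bool((t1.reads & t2.writes) | (t1.writes & t2.reads))
```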

TS on Non-partitioning CC-based Systems
We also evaluated the effectiveness of TS (TD) for transaction systems that use CC without partitioning. To do this, we compared the performance of TS[CC] with that of DCC, with varying parameters and CC protocols. The results over YCSB are shown in Figure 5 (TPC-C is similar and omitted due to the space limit).
Contention. Varying θ from 0.7 to 0.9 for YCSB and the percentage of cross-warehouse transactions from 15% to 35% for TPC-C, we tested the impact of the transaction contention level on the effectiveness of TD. We find that TD consistently improves DCC in both throughput and #retry. As shown in Fig. 5a, on YCSB the throughput of TS[CC] is on average 111% higher than that of DCC, while the #retry of TS[CC] is 49.8% lower. By digging into the profiling statistics, we find that the reduction of retries by TD is largely consistent with that of the mutex contention of the execution. In fact, TD reduces #contended_mutex [29], the total number of times that a mutex was contended (i.e., already locked when a lock request was made), in DCC by 53.8% on average.
Moreover, the improvement is even more significant with larger θ. For instance, TD improves the throughput of DCC by 44.8% and reduces the #retry by 43.9% when θ is 0.7, while these increase to 171% and 53.4%, respectively, with θ = 0.9. The results over TPC-C are similar. This is because workloads with higher contention have more runtime conflicts, and hence a higher chance to benefit from TD. This is also confirmed by the runtime statistics, e.g., TS[CC] reduces the #contended_mutex of DCC by 44.5% with θ = 0.7, and this increases to 56.4% with θ = 0.9.
Runtime skewness. Varying minT and the two transaction-length distribution parameters, we evaluated the effectiveness of TS in response to different degrees of runtime skewness. As shown in Figures 5d-5f, TD is particularly effective for longer (larger minT) and more skewed and variable transactions. For instance, over YCSB, when minT increases from 1/8 to 1, the throughput improvement of TS[CC] over DCC increases from 34.5% to 119%. Likewise, when the second distribution parameter is 0.9, TS[CC] reduces the #retry of DCC by 49.5%, and this increases to 53.3% when it is 0.7; similarly over TPC-C.
I/O latency. Varying maxIO and θIO, we evaluated the impact of I/O latency on TD for unbundled transactions. We find that with larger maxIO or smaller θIO, the throughput of both DCC and TS[CC] decreases, as transactions are prolonged by the I/O stalls; however, TS[CC] remains robust in improving DCC in both throughput and retries, as shown in Fig. 6. This is because TD is insensitive to transaction lengths owing to its runtime progress tracking; hence its effectiveness is stable w.r.t. varying patterns of I/O latency.
Parameters of TD. We also evaluated the impact of the parameters of TD, by varying #lookups and deferp%. We find that a larger #lookups naturally improves the effectiveness of TD in reducing runtime conflicts and retries, e.g., over YCSB the #retry of DCC is reduced by 49.7% with TS enabled when #lookups = 1, and this increases to 54.1% with #lookups = 5 (as shown in Fig. 5g). However, a larger #lookups comes with higher overhead, which pays off for longer transactions whose runtime and retry cost can cover the overhead of TD. For short TPC-C and YCSB transactions, we find that #lookups = 2 gives the best throughput, e.g., TS[CC] improves DCC by 116% over YCSB and 105% over TPC-C. Similarly, we find that a higher deferp% gives a better reduction in #retry for TS over DCC.
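Read together, #lookups and deferp% can be understood through the following minimal sketch of the deferring decision: probe the access sets of a few randomly chosen in-flight transactions, and defer (with probability deferp%) upon detecting a potential conflict. The function should_defer and its arguments are our own illustration of the mechanism described above, not the system's actual interface.

```python
import random

def should_defer(txn_access, inflight_accesses, num_lookups=2, deferp=0.8):
    """Decide whether to proactively defer an incoming transaction.

    txn_access:        set of items the transaction will read/write
                       (possibly only a fraction of its true access set).
    inflight_accesses: list of access sets of currently running txns.
    num_lookups:       how many in-flight txns to probe (#lookups);
                       more probes catch more conflicts but cost more.
    deferp:            probability of deferring on a detected conflict.
    """
    if not inflight_accesses:
        return False
    probes = random.sample(inflight_accesses,
                           min(num_lookups, len(inflight_accesses)))
    for other in probes:
        if txn_access & other:            # potential runtime conflict found
            return random.random() < deferp
    return False
```

This sampling view also explains the robustness to inaccurate access sets reported below: a probe only needs some contended item to appear in the (partial) sets to fire.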
CC and scalability. Varying CC protocols and the number of cores, we tested the impact of CC and the scalability of both TS[CC] and DCC. The results over YCSB are depicted in Figures 5b-5c. We find that TS[CC] consistently improves the throughput and #retry of DCC via TD, by 117% and 51.8% on average, respectively. Moreover, the gap widens with larger #core. TS works the best with TICTOC, yielding a throughput improvement of 152% and a reduction of #retry by 63.9% over YCSB. It has the least advantage with OCC (the default CC used in all tests), but still improves DCC by 116% in throughput and 52.0% in #retry.
Impact of inaccurate access sets. Finally, we evaluated the impact of inaccurately determined transaction read/write sets on the effectiveness of TS[CC]. To do this, we restricted TS[CC] such that it could only use an α-fraction of the actual access sets, i.e., the determined transaction read/write sets have an accuracy of α. Varying α in [0.5, 1], we compared the performance of TS[CC] with DCC. The results over YCSB are shown in Fig. 5h (the results over TPC-C are similar and thus omitted). We find that TS[CC] still improves the throughput of DCC even when it overlooks 50% of the actual access sets, i.e., when α is as low as 0.5. This is because TD only randomly probes the access sets of concurrent transactions to detect potential conflicts and needs only part of the access sets; hence, it still observes conflicts and improves DCC even when reads and writes are moderately overlooked. Naturally, the effectiveness of TS[CC] improves when the access sets are determined more accurately with higher α.

Summary
From the experiments we find the following.
(1) Transaction scheduling (TP) is effective in improving the performance of partition-based transaction systems, e.g., it improves the throughput of state-of-the-art transaction partitioners by 131% on average, and by up to 294%, over the TPC-C and YCSB benchmarks.
(2) The benefit of TS (TP) is even more evident when a larger number of cores is available. For instance, over YCSB, TS improves the throughput of the partitioners by 75.3% on average with 8 cores; the improvement increases to 214% with 32 cores.
(3) Proactive transaction deferment (TD) is effective for CC-based systems. On average it improves the throughput of DCC by 109% and reduces its #retry by 45.7%, up to 152% and 63.9%, respectively.
(4) With TD, TP can handle inaccurate cost estimates of transactions while still improving the throughput of partitioners.
(5) TS is particularly effective for transactions with skewed or long runtimes or with I/O latency. It makes both partition-based and CC-based systems robust against runtime and I/O skewness.
(6) TS works well with all CC protocols. Its improvement over partitioning and over DCC (CC-based) is evident for each CC protocol tested.

Related Work
We categorize the related work as follows.
Transaction partitioning. There has been a host of work on transaction partitioning for parallel OLTP systems [14,33,37,54,59]. These methods are designed for distributed systems under a shared-nothing architecture, where the database is partitioned across multiple computing nodes so as to minimize the number of transactions that access data from multiple partitions, since such transactions require costly distributed CC.
Closer to this work is the study of multi-core transaction processing [16,31,34,38,45]. Instead of assigning transactions to cores, [31] decomposes transactions into smaller read and write actions, and assigns these actions to threads so as to minimize contention. Adaptive concurrency control is proposed in [45] for changing workloads, by dynamically clustering data and adapting the optimistic CC (OCC) protocols for each cluster. [16] uses transaction batching and reordering during the validation phase to reduce the retries of OCC-based protocols. [38] develops Orthrus, a multi-core transaction system that separates transaction logic from concurrency control and employs a set of dedicated cores for the latter; the deterministic approach of [39] pre-processes transaction workloads via partitioning. In particular, [34] proposes Strife, a transaction partitioner that dynamically clusters finer-grained batches of contended workloads that do not have a good static partition.
Different from these partitioning strategies, we propose to schedule transactions of possibly varying execution times; transaction scheduling not only partitions transactions but also orders them within each partition. It is based on runtime conflicts, which characterize contention among transactions at a finer granularity than traditional transaction conflicts, and allow conventionally conflicting transactions to be executed without conflicts. We also develop an algorithm that can refine an existing transaction partitioning into a transaction schedule while minimizing runtime conflicts. In addition, we develop a strategy that proactively detects and defers transactions that are highly likely to cause runtime conflicts on-the-fly, to reduce aborts and retries. Our method is not restricted to OCC.
Transaction assignment for OLTP. Most OLTP systems use random or round-robin-like strategies to assign unbundled transactions that arrive and are processed on-the-fly [60]. Unlike partitioners, they do not have prior knowledge of the entire workload, and hence assign transactions one by one upon arrival. [41] proposes a lightweight ML model that predicts, for an incoming transaction, which thread it should be assigned to in order to lower its abort chance. Our work complements these methods by altering the assignment during execution, by tracking thread progress and proactively deferring transactions that are likely to cause conflicts.
Concurrency control. Concurrency control (CC) provides isolation guarantees for concurrent transaction processing. A variety of CC protocols have been proposed, which mostly fall into two classes: locking-based and timestamp-based. Locking-based ones, such as two-phase locking [10,17] and its variants, are pessimistic in that a transaction accesses a tuple only after it acquires a lock with the required permission. For high-contention transactions, timestamp-based protocols such as OCC [5,24], multi-version concurrency control (MVCC) [9] and their variants [8,15,18,26,30,40,44,52,53,57,58] have been verified effective in reducing conflicts and blocking time. Hybrid approaches that combine the two strategies have also been explored [43,45]. There has also been work on learned CC that tunes the mix of CC parameters to specialize to a given workload [50].
While these CC protocols provide controlled concurrent execution of conflicting transactions and ensure isolation guarantees, they do not specify how transactions should be allocated to threads or in what order they should be executed; both have a significant impact on the performance of concurrent transaction execution. Our work aims to bridge this gap by minimizing runtime conflicts via transaction scheduling and proactive transaction deferment, as an alternative to conventional approaches that only cluster transactions or directly invoke CC protocols.
Multi-core in-memory OLTP systems. A number of multi-core in-memory OLTP systems have been developed recently, e.g., TicToc [57], Cicada [26], Foedus [23], ERMIA [22], Silo [49] and Orthrus [38]. These systems focus on the design of new CC protocols and architectural optimizations for OLTP workloads.
Different from these systems, we do not aim to build yet another full-fledged OLTP database system. Instead, we present TS as a lightweight tool that can be incorporated into these OLTP systems to improve their throughput by reducing runtime conflicts.
Deterministic approaches. Also related is the work on deterministic databases (see [4] for a survey), which execute bundled transactions in a predetermined order that is typically decided via transaction dependency analysis [13,19,25,35,36,38,47]. They scale well in distributed systems, where the predetermined ordering ensures that transactions always read consistent data across multiple cache copies.
Different from these, our work studies how to reduce runtime conflicts via transaction scheduling and proactive deferring. It neither imposes restrictions on how CC should work, e.g., deterministically or non-deterministically, nor breaks down transactions. This said, this work and the previous work on deterministic databases complement each other. For instance, their techniques for determining read/write sets and analyzing workloads can be used by transaction partitioners for clustering and assigning transactions, and their dependency-based execution method can be adopted by transaction scheduling when runtime estimates are unreliable.

Conclusion
We have shown that by considering runtime conflicts, more transactions can be executed concurrently with lower conflict penalties than under the conventional notion of conflicts. We have proposed transaction scheduling, which places an ordering on the execution of transactions, instead of transaction partitioning. We have shown that the scheduling problem is NP-complete. This said, we have developed an efficient algorithm that refines a partitioning into a schedule for bundled transactions. Moreover, for unbundled transactions targeted by CC protocols, we have proposed a proactive deferring strategy to reduce conflict penalties. Our experimental study has verified that transaction scheduling and deferment are promising in improving throughput and reducing conflict penalties.
One topic for future work is to develop ML models that decide TS parameters specialized for given workloads. Another topic is to combine transaction scheduling with other strategies, e.g., transaction decomposition [31] and dynamic adjustment [45].

Figure 1: Different transaction executions in Examples 1-2.
(TsPar) • Output: A transaction schedule σ = (Π, ≺). • Objective: For the RC-free queues (Q1, . . ., Qn) and residual set R generated by σ, to (a) minimize the makespan of the RC-free queues; and (b) minimize the number of unscheduled transactions in R.

Figure 3: Structure for transaction progress tracking: grey grids are committed transactions; blue grids represent transactions about to execute; yellow grids reserve slots for deferred transactions.
