For my Ph.D. thesis, I have designed and implemented a novel scheduling algorithm for solving the data consistency problem in a cluster of database replicas that serve as back-ends for a dynamic content web server. The conventional wisdom has been that replicated database clusters can achieve either strong consistency or scalability, but not both. In particular, traditional eager protocols, which provide strong consistency for database replication, scale poorly due to frequent conflict delays. On the other hand, traditional lazy database replication scales well but provides no consistency guarantees. More recently, multiple levels of loose consistency have been defined in an effort to provide both good scaling and a predefined set of consistency features. However, such consistency models may require non-trivial programmer effort. Furthermore, they cannot improve performance for applications that do require strong consistency (i.e., transactional serializability).
The goal of my dissertation is to provide scaling and data availability through replication while also achieving serializability. The key idea of my design is to exploit the level of indirection in accessing the database data that any practical implementation offers: I interpose a scheduler between the web server tier and the database tier. This scheduler, augmented with reliable state, internally applies a series of performance optimizations based on lazy replication and conflict awareness, while externally conforming to serializability as perceived by the user. This novel technique, called conflict-aware scheduling, provides scaling, consistency, and data availability at the same time for a replicated dynamic content web site.
In a nutshell, the scheduler directs operations to those replicas where it knows they will find a copy of the data consistent with a serializable execution. For example, a write on a data item may have been sent to all replicas but completed only at a subset of them. The scheduler keeps track of the completion status of each write. Using this knowledge, the scheduler ensures that a read that must happen after a particular write (in order to achieve serializability) is sent to a replica where it knows the write has already completed. High data availability is achieved by replicating the scheduler state among several schedulers. Additional fault tolerance is provided by also logging this information to stable storage.
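To make the bookkeeping concrete, here is a minimal Python sketch of the idea; the names (WriteTracker, eligible_for_read, and so on) are hypothetical and not taken from the actual prototype:

    # Minimal sketch (hypothetical names): the scheduler records which
    # replicas have completed each write, and routes a dependent read only
    # to a replica whose state already reflects those writes.

    class WriteTracker:
        def __init__(self, replicas):
            # completed[replica] = ids of writes known to have finished there
            self.completed = {r: set() for r in replicas}

        def write_completed(self, replica, write_id):
            # Called when a replica acknowledges completion of a write.
            self.completed[replica].add(write_id)

        def eligible_for_read(self, required_writes):
            # Replicas at which every write the read depends on has completed.
            return [r for r, done in self.completed.items()
                    if required_writes <= done]

    tracker = WriteTracker(["db1", "db2", "db3"])
    tracker.write_completed("db1", "w42")      # w42 finished only at db1 so far
    print(tracker.eligible_for_read({"w42"}))  # -> ['db1']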
I have implemented a proof-of-concept prototype using common software for building dynamic content sites: the Apache web server with the PHP scripting module and the MySQL database. Together with other researchers at Rice, I have also implemented three dynamic content applications for evaluating my techniques: an e-commerce site, an auction site, and a bulletin board site [WWC-5 '02]. I have experimented on small clusters of up to 8 databases, and performed simulations for larger clusters (of up to 60 databases) and faster database engines.
The results show that query-level conflict-aware scheduling brings considerable benefits in terms of both overall throughput scaling and latency reduction compared to an eager protocol, traditionally the only method for obtaining serializability. My dissertation further shows that the cost of this novel scheduling method is minimal in terms of data availability and fault tolerance.
In the following, I briefly describe how the scheduler maintains strong consistency and the scheduling optimizations it uses: lazy replication, query-level scheduling, and conflict awareness and avoidance. Finally, I describe further scaling enhancements and other studies I have done on this topic.
The scheduler maintains serializability through a technique I call sequencing. At run-time, the scheduler assigns a unique sequence number to each transaction. Tagging each operation with its transaction's sequence number and executing the operations of conflicting transactions in sequence-number order at all database replicas ensures serializability. Maintaining persistent state at the scheduler allows writes to execute asynchronously at each replica, as in a lazy replication protocol, without the need for a two-phase commit protocol between replicas.
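The following Python sketch illustrates the mechanism under simplifying assumptions: the names are hypothetical, each transaction is reduced to a single operation, and all transactions are ordered, although the protocol only requires ordering conflicting ones:

    # Sketch of sequencing (hypothetical names, one operation per
    # transaction for brevity). The scheduler tags each transaction with a
    # unique sequence number; each replica applies operations strictly in
    # that order, buffering any that arrive early.

    import heapq
    import itertools

    class Sequencer:
        def __init__(self):
            self._counter = itertools.count(1)

        def next_sequence_number(self):
            return next(self._counter)  # unique and monotonically increasing

    class Replica:
        def __init__(self):
            self._pending = []   # min-heap of (sequence number, operation)
            self._next_seq = 1   # next sequence number this replica may apply

        def submit(self, seq, op):
            heapq.heappush(self._pending, (seq, op))
            # Apply in sequence order; a gap means an earlier transaction's
            # operation has not arrived yet, so later ones stay buffered.
            while self._pending and self._pending[0][0] == self._next_seq:
                _, ready = heapq.heappop(self._pending)
                print(f"apply {ready} (txn {self._next_seq})")
                self._next_seq += 1

    sequencer = Sequencer()
    replica = Replica()
    t1, t2 = sequencer.next_sequence_number(), sequencer.next_sequence_number()
    replica.submit(t2, "UPDATE stock ...")  # arrives early: buffered
    replica.submit(t1, "UPDATE price ...")  # both now apply, in order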
In the basic version of the protocol, a conservative two-phase locking algorithm is used for concurrency control. This algorithm avoids deadlocks, which would otherwise become frequent at large cluster sizes. I have also developed a variant of the conflict-aware algorithm based on a different concurrency control scheme that uses explicit versioning.
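As a rough illustration, the sketch below shows the conservative locking discipline (hypothetical names; single node, exclusive locks only): a transaction acquires its entire lock set atomically before running, so it never waits while holding a lock, the wait-for graph contains no cycles, and deadlock is impossible:

    # Sketch of conservative two-phase locking (hypothetical names). The
    # entire lock set is taken atomically up front, so a transaction never
    # blocks while holding locks, and deadlock cannot arise. Exclusive
    # locks only, for brevity.

    import threading

    class ConservativeLockManager:
        def __init__(self):
            self._cv = threading.Condition()
            self._held = set()  # data items currently locked

        def acquire_all(self, items):
            items = set(items)
            with self._cv:
                # Wait until every requested item is free, then take them
                # all in one step; a partial lock set is never held.
                while items & self._held:
                    self._cv.wait()
                self._held |= items

        def release_all(self, items):
            with self._cv:
                self._held -= set(items)
                self._cv.notify_all()

Note that conservative locking presumes the lock set is known before the transaction starts; how that set is derived is outside the scope of this sketch.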
I have designed and implemented a scheduling technique, called conflict-aware scheduling, that avoids conflicts in replicated database clusters [USITS '03]. This scheduler differs from traditional schedulers in two ways. First, request distribution in a replicated database cluster has traditionally been done solely based on load. The scheduler I designed is the first to optimize for conflicts at the back-ends rather than for load. Avoiding conflicts addresses the main scaling bottleneck of any solution that maintains serializability: waiting due to conflicts.
Second, previous approaches to lazy replication schedule all queries of a transaction on the same replica and respond to the user only when that replica responds. The scheduler I designed returns the response for a replicated operation (e.g., lock, write, commit) as soon as that operation executes at any replica. Furthermore, for each read query, the scheduler selects the least loaded replica from the set of conflict-free replicas that have completed the previous writes of the same transaction.
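A minimal sketch of this selection rule, with hypothetical names: restrict the candidates to conflict-free replicas that have completed the transaction's earlier writes, then load-balance among them:

    # Sketch of per-query replica selection (hypothetical names).

    def choose_replica(replicas, load, has_conflict, completed, prior_writes):
        # replicas:     list of replica ids
        # load:         dict replica -> outstanding queries at that replica
        # has_conflict: dict replica -> True if a conflicting op is pending
        # completed:    dict replica -> ids of writes completed there
        # prior_writes: ids of writes issued earlier in this transaction
        candidates = [r for r in replicas
                      if not has_conflict[r] and prior_writes <= completed[r]]
        if not candidates:
            return None  # caller must wait for a write to complete somewhere
        return min(candidates, key=lambda r: load[r])

    load = {"db1": 4, "db2": 1, "db3": 2}
    has_conflict = {"db1": False, "db2": True, "db3": False}
    completed = {"db1": {"w7"}, "db2": {"w7"}, "db3": {"w7"}}
    print(choose_replica(["db1", "db2", "db3"], load,
                         has_conflict, completed, {"w7"}))  # -> db3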
These optimizations require that the scheduler maintain, as part of its internal state, the completion status of replicated operations at all database replicas. I have compared this scheme against a transaction-level conflict-aware scheme and against both transaction-level and query-level conflict-oblivious (i.e., plain load-balancing) schemes. I have shown that the improvements from conflict avoidance and per-query load balancing are substantial (up to a factor of 2) compared to equivalent schedulers without these optimizations.