CRDTs for Large-Scale Storage Systems
Research unit: UMR 7606, Laboratoire d'informatique de Paris 6
Topic proposed by: Scality/UPMC
Thesis advisor: Marc Shapiro (UPMC)
Co-advisor: Vianney Rancurel (Scality)
Research unit: Scality/Lip6
Scope: Distributed Systems
Topic: CRDTs for large-scale storage systems
Storage architectures for large enterprises are evolving towards a hybrid cloud model, mixing private storage (pure SSD solutions, virtualization-on-premise) with cloud-based service provider infrastructures, in order to distribute content, to enable collaboration, and to ensure data protection. Users will be able to both share data through the common cloud space, and to retain replicas in local storage. In this context we need to design data structures suitable for storage, access, update and consistency of massive amounts of data at the object, block or file system level.
Current designs are based on some restrictive assumptions, notably that objects, blocks and files are maintained in indexes (trees or B+-Trees) that are strongly consistent and partition-tolerant (CP). Unfortunately, replicating a CP index across sites is painful. The traditional approaches include locking, journaling and replaying of logs, snapshots and Merkle trees.
All of these are difficult to scale using generic approaches, although it is possible to scale them in some specific instances. For instance, synchronization in a single direction (the Active/Passive model) is relatively simple but very limited. A multi-master (Active/Active) model, where updates are allowed at multiple replicas and synchronization occurs in both directions, is difficult to achieve with the above techniques.
Our previous work has shown that:
- In existing storage systems, some operations are not commutative, and must be totally ordered (strong consistency).
- It is possible to design large-scale strongly-consistent synchronization protocols (e.g., replicated state machine algorithms based on Paxos), but these remain fragile and slow.
- Some storage operations commute. This enables a highly scalable Eventual Consistency approach, executing them in any order.
- It is possible to design useful concurrent data structures in which all updates commute (CRDTs).
- It is possible to combine both strong and eventual consistency in a single system (RedBlue consistency, Generalized Paxos).
Under some assumptions, the operations of storage systems can be organized into Abelian groups, which ensures that they commute partially or totally. This suggests that it is possible to design CRDT-like data structures for storage systems, enabling unprecedented scalability.
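As a minimal illustration of the commutativity that the approach relies on, consider a state-based grow-only counter (G-Counter), one of the simplest CRDTs: merging takes the coordinate-wise maximum of per-replica counts, a join on a monotonic semi-lattice, so merges are commutative, associative and idempotent. The sketch below is illustrative only and is not a structure proposed in this document.

```python
# Minimal state-based G-Counter CRDT sketch (illustrative only).
# Each replica increments its own slot; merge takes the per-replica maximum,
# a join on the monotonic semi-lattice of count vectors, so merges commute.

class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Join: per-replica maximum; the order in which merges are applied
        # does not matter, and re-merging the same state is a no-op.
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

a, b = GCounter("A"), GCounter("B")
a.increment(3)
b.increment(2)
a.merge(b)   # A sees both updates
b.merge(a)   # B converges to the same state
assert a.value() == b.value() == 5
```

Because the merge is a lattice join, replicas can exchange states in any order, any number of times, and still converge; this is the property the proposal aims to obtain for storage-level operations.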
The objective of the research will be to design new algorithms for object, block and file storage systems. Note that these three kinds of systems, although related, support different kinds of operations and have different consistency requirements.

The first phase consists of a detailed study of these three types of systems, the different types of operations they support, the corresponding consistency requirements, and the possible trade-offs that might increase their eventual consistency (for instance, in a given folder, “mknod foo” and “mkdir foo” do not commute, but can be made to commute by giving the directory a higher priority than the file). The work will characterize these systems in the PACELC reformulation of the CAP theorem.

The second phase will be to design data structures for large-scale storage systems suitable for a hybrid cloud architecture, under realistic application scenarios. For instance, the PhD student might study the following structures:
- Large Container CRDT: a generic data structure that supports:
  - Fast indexing of a large number of keys
  - Key versioning (storing successive versions under the same key)
  - Fast lookup queries
  - Search by prefix
  - Fast paging queries (query entry number n)
  - Simple replication across sites
- Synchronization CRDT: a data structure for large-scale storage system replication. Internally, it uses a CRDT to store its content (instead of a non-scalable binary tree). Features:
  - Maintains a hash cascade for detecting modifications of a Large Container CRDT
  - Quickly locates the keys that diverge between two Large Container CRDT replicas
  - Fast updates

These data structures shall be designed in a principled manner, by leveraging monotonic semi-lattices or commutative types, and shall be suitable for various kinds of replication (operation-based or state-based).
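One possible reading of the Synchronization CRDT's hash cascade is a multi-level Merkle-style digest over buckets of keys: replicas first compare a single root hash, and only on mismatch drill down to bucket hashes to locate diverging keys. The two-level sketch below is an assumption-laden illustration (the function names, bucket count and hashing scheme are invented for this example), not the proposal's design.

```python
# Illustrative two-level hash cascade over a key-value container
# (all names and parameters are invented for this sketch, not the
# proposal's design). Replicas compare a root hash first; on mismatch
# they compare bucket hashes to quickly locate the diverging keys.

import hashlib

NUM_BUCKETS = 16

def _h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def bucket_of(key: str) -> int:
    # Deterministic key-to-bucket placement shared by all replicas.
    return int(_h(key.encode()), 16) % NUM_BUCKETS

def bucket_hashes(store: dict) -> list:
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for k in sorted(store):
        buckets[bucket_of(k)].append(f"{k}={store[k]}")
    return [_h("|".join(b).encode()) for b in buckets]

def root_hash(store: dict) -> str:
    return _h("".join(bucket_hashes(store)).encode())

def diverging_buckets(s1: dict, s2: dict) -> list:
    if root_hash(s1) == root_hash(s2):
        return []  # replicas agree after a single comparison
    h1, h2 = bucket_hashes(s1), bucket_hashes(s2)
    return [i for i in range(NUM_BUCKETS) if h1[i] != h2[i]]

r1 = {"a": 1, "b": 2, "c": 3}
r2 = {"a": 1, "b": 9, "c": 3}  # "b" diverges
assert diverging_buckets(r1, r1) == []
assert len(diverging_buckets(r1, r2)) >= 1
```

In the proposal's setting the cascade itself would be maintained incrementally as a CRDT, so that updating a key touches only its bucket's digest and the levels above it, rather than rehashing the whole container.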
In the third phase, the student will prototype the solutions and carry out performance measurements using different scenarios with these new data structures, according to different consistency models (strong, causal, eventual, RedBlue).