Coding for Distributed Storage Systems

Today's big data applications depend on distributed storage systems to store a tremendous amount of data, by distributing data into a large number of commodity servers. Given the large number of servers and the commodity hardware that build the server, failures are common in a distributed storage system. Hence, maintaining data availability at such a large scale poses major challenges on the design and implementation of distributed storage systems. To tolerate data losses resulted from various failures, it is essential for distributed storage systems to add redundancy. Replication, the conventional way of adding redundancy, can incur an overwhelming storage overhead when storing a substantial volume of data. Hence, various practical distributed storage systems have been replacing replicated redundancy with parity redundancy computed with erasure coding, which has demonstrated a significant saving of storage overhead with an equal or even higher level of failure tolerance.

Traditional erasure codes, such as Reed-Solomon codes, can tolerate a certain number of failures with an optimal storage overhead. However, they suffer from some other critical system metrics that hinder their further deployment, such as high network traffic and low I/O performance. In this context, my research interests are designing novel erasure codes that overcome the limitations of traditional erasure codes in distributed storage systems, and designing and prototyping systems that novelly deploy erasure coding to achieve high performance not only in the distributed storage system itself but in the big data application running upon it.

Data reconstruction with (near) optimal network traffic

One of the challenges that hinder the deployment of traditional erasure codes in distributed storage systems is the high network traffic incurred to reconstruct unavailable data after failures. While typically unavailable data at different servers are reconstructed separately, it has been shown that failures in a distributed storage system can be correlated, and a significant amount of network traffic can be saved by reconstructing data on different servers at the same time. In particular, a particular class of erasure codes, called minimum-storage cooperative regenerating (MSCR) codes, achieve the optimal network traffic to reconstruct data on multiple servers with the optimal storage overhead. However, existing constructions of MSCR codes only work with specific values of some system parameters.

My research established a theoretical connection between MSCR codes reconstructing two failures with MSR codes, a special class of MSCR codes that reconstruct only one server, and from existing constructions of MSR codes we can henceforth get a construction of MSCR codes to reconstruct unavailable data on two servers. I further studied a more general scenario of reconstructing data on an arbitrary number of servers. I propose Beehive codes, working with a wide range of system parameters, that reconstructs multiple failures with network traffic very close to the optimum. In other words, Beehive codes achieve very similar performance as MSCR codes in terms of network traffic during reconstruction, without strict restrictions on system parameters.

Related publications:

Exploiting topology properties in the reconstruction

Besides proposing novel erasure codes to save the overhead of reconstruction, I have also proposed mechanisms for saving reconstruction overhead for existing erasure codes. Traditional erasure codes, as well as most modern erasure codes specifically proposed for distributed storage systems, require the replacement server that reconstructs unavailable data to download data from a certain number of existing servers. Receiving data from a large number of servers may make the incoming link at the replacement server become the bottleneck, or incur a high volume of disk I/O at existing servers in overall. To tackle such challenges, I have designed a pipelined topology where the number of servers to contact can be reduced from \(n\) to \( O(\sqrt{n}) \), without change existing properties of deployed erasure codes. My research has also proposed a tree-structured topology where bandwidth heterogeneity, especially for the distributed storage system working in the wide-area network, can be exploited to significantly save the time to finish reconstruction and boot data availability.

Related publications:

Handling skewed demand inside distributed storage systems.

Another challenge of deploying erasure coding in a distributed storage system is the high overhead to read or write data with erasure coding, include the computational overhead to encode or decode data, as well as additional overhead in terms of network traffic and disk I/O. Because of this, deploying erasure coding in a distributed storage system can lead to tradeoffs between different system metrics, such as storage overhead vs. decoding overhead and write latency vs. failure tolerance.

On the output side, my research designs algorithms to mitigate the decoding overhead brought by erasure coding. Erasure coding can bring higher decoding overhead with a saving of storage overhead. However, it is challenging to find the optimal value of system parameters to encode all data as the demand of data can be highly skewed. In fact, it is desirable to achieve low decoding overhead for hot data and low storage overhead for cold data. Though existing solutions can classify data into a certain number of tiers by their demand, they all require manually choosing values of system parameters in each tier. My work proposed a framework that can automatically determine the number of tiers and group data into tiers with parameter values chosen according to the demand to minimize the overall decoding overhead. I further design algorithms for two representative erasure codes to save the network traffic when data are migrated between tiers as demand may change over time.

Related publications:

Efficient dissemination of erasure-coded data

On the input side, when data are written into distributed storage system, redundant data are added at the same time. The more failures we need to tolerate, the more redundant data we need to add. Unlike replication, it takes significantly more time in existing distributed storage systems to compute parity redundancy with erasure coding and distribute them into servers. My work designed a novel mechanism and built a prototype system to distribute the original data with parity redundancy into receiving servers, by taking advantage of the failure tolerance of erasure coding. This way, it takes much less time to finish writing data with erasure coding, just slightly more than the time to write replications.

Related publication: