Storing data for a distributed system can be a complex affair: there are multiple approaches, and plenty of nuts and bolts that need tweaking to get data storage just right for a specific application. The choice of storage approach has an impact on performance, storage cost, redundancy, engineering complexity and more, and these decisions need to be taken in the context of the application, its dataset and its usage patterns.
There are two main ways to deal with storage redundancy: one approach is data replication, the other is what is called “Erasure Coding”. Both methods have their particular advantages and disadvantages and are better suited to different needs.
The simplest approach to storage redundancy is data replication, where we store an exact copy of the data in case anything happens to the original. In computer storage, such as desktops and servers, this is usually implemented through something called RAID 1 (RAID stands for Redundant Array of Inexpensive Disks).
HDFS has typically relied on a replication factor for reliability and redundancy. This works by keeping direct copies of HDFS “blocks”, usually stored on different racks.
When a block becomes unavailable, it can be copied directly from one of its replicas.
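As a rough illustration of the idea (a toy sketch only, not HDFS's actual placement policy, and the rack and node names are made up), spreading replicas across racks could look like this:

```python
import random

# Toy cluster layout: racks mapped to the nodes they contain (illustrative names).
CLUSTER = {
    "rack-1": ["node-a", "node-b"],
    "rack-2": ["node-c", "node-d"],
    "rack-3": ["node-e", "node-f"],
}

def place_replicas(block_id, replication_factor=3):
    """Pick one node in each of `replication_factor` distinct racks,
    so that losing a single rack never loses every copy of the block."""
    racks = random.sample(list(CLUSTER), k=replication_factor)
    return {rack: random.choice(CLUSTER[rack]) for rack in racks}

print(place_replicas("blk_0001"))
# e.g. {'rack-2': 'node-c', 'rack-1': 'node-b', 'rack-3': 'node-e'}
```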
Having full replicas of the data means that if there is an issue sourcing the data from one copy, it is always possible to retrieve it from another copy without interruption.
It also means that we can serve multiple concurrent reads on the same dataset. This can be useful for popular tables in a data lake, for instance: only operations on pieces of data that are actually inaccessible would need to be queued, not those on the rest of the data.
This approach has drawbacks, however: the software needs to do some work to keep the replicas in sync, and there is the extra cost of storing them.
Erasure coding is meant to provide a more space-efficient way to create redundancy than straight data replication. It works by generating “parity” bits that allow the original data to be reconstructed should one of the original bits go missing.
The concept of erasure coding has existed for some years in different forms: as part of RAID 5 and 6, in special kinds of memory (ECC), or in file archives such as the .par and .par2 formats popular with newsgroup downloaders. It was introduced in Hadoop 3.0 as part of HDFS-EC.
One of the simplest forms of parity check is an even/odd parity bit. The parity bit records whether the sum of all the bits is even or odd. We can see this in the example below.
We have four source bits: two 1s and two 0s. Their sum is 2, which is even, so the remainder is 0, and that becomes the parity bit. Now let’s imagine one of the four original bits goes missing. We can still recover the information contained in that bit using the parity bit.
Continuing the example, suppose our third original bit is missing. Because we know the size of the original chunk (4 bits), we know that the full sum can only be 1 (if the 3rd bit is 0) or 2 (if the 3rd bit is 1). But because the parity bit is equal to zero, we know the sum needs to be even: the sum must therefore be 2, and the third bit must be 1.
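To make the arithmetic concrete, here is a minimal Python sketch of the same even-parity scheme, using the four source bits from the example (with the third one lost):

```python
def parity_bit(bits):
    """Even parity: 0 if the sum of the bits is even, 1 if it is odd."""
    return sum(bits) % 2

def recover_missing_bit(bits_with_gap, parity):
    """Reconstruct a single missing bit (marked as None) from the parity bit."""
    known_sum = sum(b for b in bits_with_gap if b is not None)
    # The missing bit is whatever value makes the overall parity match again.
    return (parity - known_sum) % 2

source = [1, 0, 1, 0]              # sum is 2 (even), so the parity bit is 0
parity = parity_bit(source)
damaged = [1, 0, None, 0]          # the third bit went missing
print(recover_missing_bit(damaged, parity))   # -> 1
```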
As we can see, erasure coding offers a much more space-efficient way to create redundancy than replication. Where with replication the total space taken was number of bits × replication factor, the total space occupied is now number of bits + (number of bits / bits per parity bit). Compared with the example above, that is a multiplier of 2 for replication and of 1.25 for erasure coding.
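Those multipliers are straightforward to compute; a quick sketch using the figures from the example (a replication factor of 2, and one parity bit per four data bits):

```python
def replication_multiplier(replication_factor):
    """Total space as a multiple of the original data size."""
    return replication_factor

def erasure_coding_multiplier(bits_per_parity_bit):
    """number of bits + (number of bits / bits per parity bit), as a multiple."""
    return 1 + 1 / bits_per_parity_bit

print(replication_multiplier(2))       # 2     -> one full extra copy
print(erasure_coding_multiplier(4))    # 1.25  -> one parity bit per 4 data bits
```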
This approach has a few drawbacks, however: repairing a missing bit is more computationally intensive than a straight copy, and reconstruction takes some time. Another drawback is that with one parity bit, only one missing bit can be reconstructed at a time; you would need to increase the number of parity bits if more failures need to be tolerated.
Erasure coding provides redundancy upon failure, making it possible to reconstruct the missing information, but it does not offer the same ability to serve multiple concurrent reads on the same dataset that replication does.
Data locality has an impact on application performance. In distributed systems, where data can sit far apart from its processing layer, performance can be significantly affected. Approaches such as data/processing collocation, coupled with sharding, can bring about significant improvements in speed.
When dealing with data transfers, there are a couple of factors to take into account: the overall transfer speed and the latency. The time spent transferring is driven by the overall size of the message that needs to be transferred, while latency is a fixed overhead paid to initiate each transfer.
Both of these factors can become bottlenecks in distributed and cloud computing. Transfer speed tends to be the major bottleneck when messages are fairly large, as is usually the case with media content such as pictures or video, while latency leads to bottlenecks when a very large number of short calls is made to the datastore.
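A back-of-the-envelope model makes the two bottlenecks explicit. The numbers below are purely illustrative, assuming roughly a 1 Gbit/s link (about 125 MB/s) and 1 ms of per-call latency:

```python
def transfer_time(message_bytes, bandwidth_bytes_per_s=125e6, latency_s=0.001):
    """Simple model: a fixed per-call latency plus size divided by bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s

# A single 100 MB media file is dominated by bandwidth...
print(transfer_time(100e6))        # ~0.801 s, latency is negligible
# ...while 1,000 tiny 1 KB lookups are dominated by latency.
print(1000 * transfer_time(1e3))   # ~1.008 s, almost all of it latency
```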
There are a couple of approaches to deal with some of these issues: one is to use mini-batching to reduce the relative amount of overhead in the transfers, the other is to collocate the data with the application.
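Mini-batching amortizes the per-call latency across many records; continuing the same toy model as above:

```python
def total_time(n_records, records_per_call, bytes_per_record,
               bandwidth_bytes_per_s=125e6, latency_s=0.001):
    """Each call pays the latency once for its whole batch of records."""
    calls = -(-n_records // records_per_call)       # ceiling division
    payload = records_per_call * bytes_per_record
    return calls * (latency_s + payload / bandwidth_bytes_per_s)

print(total_time(100_000, 1, 200))       # one record per call   -> ~100 s
print(total_time(100_000, 1_000, 200))   # mini-batches of 1,000 -> ~0.26 s
```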
With collocation, since the data is stored closer to the application, latency tends to decrease, and since the data is stored locally and does not need to travel through the network, the overall transfer speed increases. Platforms such as Apache Ignite allow for this type of collocated computation.
Collocation plays particularly well with a replication factor, in which case it becomes possible to efficiently parallelize computation on the same “partition” of the dataset.
This also applies to different datasets that might need to be joined together, something that can be addressed through what is called a database shard. A database shard partitions the datasets based on a “shard key” and groups rows with the same key together on the same servers.
In large databases, sharding has a real impact, particularly on join operations. Joins require matching and merging datasets based on set conditions, usually relying on a specific ID. If the data were not sharded and collocated, it would need to be constantly transferred between servers, resulting in a lot of inefficiency and network I/O.
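A toy sketch of hash-based sharding shows why: rows from two datasets that share the same shard key (here a hypothetical user_id) always land on the same shard, so a join on that key never has to move rows between servers.

```python
NUM_SHARDS = 4

def shard_for(shard_key):
    """Route a row to a shard based on the hash of its shard key."""
    return hash(shard_key) % NUM_SHARDS

users  = [{"user_id": i, "name": f"user-{i}"} for i in range(10)]
orders = [{"user_id": i % 10, "amount": i * 10} for i in range(30)]

shards = [{"users": [], "orders": []} for _ in range(NUM_SHARDS)]
for row in users:
    shards[shard_for(row["user_id"])]["users"].append(row)
for row in orders:
    shards[shard_for(row["user_id"])]["orders"].append(row)

# Each shard can now join its own slice locally, in parallel, with no network I/O.
for i, shard in enumerate(shards):
    joined = [(u["name"], o["amount"])
              for u in shard["users"]
              for o in shard["orders"] if o["user_id"] == u["user_id"]]
    print(f"shard {i}: {len(joined)} joined rows")
```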
Sharding, however, introduces an extra level of complexity when it is not handled natively by the database (as it is with MongoDB, for example).
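With MongoDB, sharding is handled by the database itself; a sketch of what enabling it might look like through pymongo (the connection string, database, collection and shard key below are illustrative, and this assumes a sharded cluster fronted by a mongos router):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")  # hypothetical mongos address

# Enable sharding on the database, then shard the collection on a hashed key.
client.admin.command("enableSharding", "mydb")
client.admin.command("shardCollection", "mydb.orders",
                     key={"user_id": "hashed"})
```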
Besides storage locality, compression methods have a big impact on performance and space utilization.
Beyond the space savings, compressing data can be beneficial for other reasons. Compression can increase the effective data transfer speed, and it can also prove more performant to read a compressed file and decompress it in memory than to read a very large file straight off disk, particularly when dealing with highly compressible data such as text files.
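A rough model of the disk-read case, with deliberately made-up numbers (disk throughput, decompression speed and compression ratio all vary widely in practice):

```python
disk_mb_per_s = 200        # sequential read throughput (illustrative)
decompress_mb_per_s = 800  # decompression speed of a fast codec (illustrative)
ratio = 4                  # compression ratio on text-like data (illustrative)

raw_mb = 10 * 1024         # a 10 GB file
read_raw = raw_mb / disk_mb_per_s
read_compressed = (raw_mb / ratio) / disk_mb_per_s + raw_mb / decompress_mb_per_s
print(f"raw read: {read_raw:.0f}s, compressed read + decompress: {read_compressed:.0f}s")
# raw read: 51s, compressed read + decompress: 26s
```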
The compression algorithms most typically used in Big Data are GZIP, Snappy, LZO, BZIP2 and LZMA. Each algorithm performs differently in terms of CPU utilization, memory requirements, compression ratio, and compression and decompression time.
Snappy and LZO tend to have very fast compression and decompression times but a relatively low compression ratio, making them well suited to “hot” data. GZIP offers a more balanced overall performance, while BZIP2 and LZMA offer a high compression ratio, making them suited to storing archives and historical data.
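The trade-off is easy to see with the algorithms that ship in Python's standard library (Snappy and LZO require third-party bindings, e.g. python-snappy); a quick, unscientific comparison on repetitive log-style text:

```python
import bz2
import gzip
import lzma
import time

data = b"timestamp=2021-01-01 level=INFO msg='user logged in'\n" * 200_000

for name, compress in [("gzip", gzip.compress),
                       ("bz2", bz2.compress),
                       ("lzma", lzma.compress)]:
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    print(f"{name:5s} ratio={len(data) / len(compressed):5.1f}x time={elapsed:.2f}s")
```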
Datasets can leverage replication to achieve redundancy and higher performance, but at a higher storage cost than erasure coding; this can be particularly beneficial for frequently accessed data. Archive/historical data, on the other hand, might be better suited to erasure coding, as a way to store large amounts of data at a lower cost. In either case, both measures add to the total amount of storage consumed, and it is important to plan for this extra need.
Collocating processing with the data is an impactful performance measure, reducing the impact of latency and data transfer; it is supported by platforms such as Apache Ignite. Sharding is a technique that facilitates this move, albeit at the cost of increased complexity.
Compression often makes sense for data in distributed systems; there are multiple compression algorithms that can be used for storing and transferring data, each with its own pros and cons, and the choice of algorithm should be situation dependent.