What is Deduplication


According to Wikipedia, “Data deduplication is a specific form of compression where redundant data is eliminated, typically to improve storage utilization. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored. However, indexing of all data is still retained should that data ever be required. Deduplication is able to reduce the required storage capacity since only the unique data is stored. For example, a typical email system might contain 100 instances of the same one megabyte (MB) file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB storage space. With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is just referenced back to the one saved copy. In this example, a 100 MB storage demand could be reduced to only 1 MB. Different applications have different levels of data redundancy. Backup applications generally benefit the most from de-duplication due to the nature of repeated full backups of an existing file system.”[1]

SDFS leverages data deduplication for primary storage. It acts as a normal file system that can be used for typical I/O operations, similar to EXT3, NTFS, etc. The difference is that SDFS hashes blocks of data as they are written to the file system and writes only the unique ones to disk. Blocks that are not unique simply reference the data that is already on disk.
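
The write path described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not SDFS's actual implementation: the `BlockStore` class, the fixed 4 KB block size, and the choice of SHA-256 as the hash are all hypothetical and chosen only for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


class BlockStore:
    """Toy in-memory deduplicating store (hypothetical, not the SDFS API)."""

    def __init__(self):
        self.blocks = {}  # block hash -> block bytes, each stored once
        self.files = {}   # file name -> ordered list of block hashes

    def write(self, name, data):
        refs = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            # Store the block only if its hash has not been seen before;
            # otherwise the file just references the existing copy.
            if digest not in self.blocks:
                self.blocks[digest] = block
            refs.append(digest)
        self.files[name] = refs

    def read(self, name):
        # Rebuild the file by following its block references in order.
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())


# Mirror the email example: the same 1 MiB attachment written 100 times.
attachment = b"".join(i.to_bytes(4, "big") * 1024 for i in range(256))
store = BlockStore()
for i in range(100):
    store.write(f"mail-{i}.bin", attachment)
```

After the loop, `store.stored_bytes()` is 1,048,576 (1 MiB) even though 100 MiB were logically written, and `store.read("mail-42.bin")` still returns the full attachment.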

Why it matters

Using deduplication has two big advantages over a normal file system.

  • Reduced Storage Allocation – Deduplication can reduce storage needs by as much as 90% to 99% for files such as VMDKs and backups. Workloads that store a lot of redundant data can see huge benefits.
  • Efficient Volume Replication – Since only unique data is written to disk, only those blocks need to be replicated. This can reduce replication traffic by 90% to 99%, depending on the application.
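
The replication benefit can be illustrated with a short sketch. It assumes each volume is modeled as a dictionary mapping block hashes to block contents; the `replicate` function and that data model are hypothetical and are not taken from SDFS.

```python
import hashlib


def block_hash(block):
    return hashlib.sha256(block).hexdigest()


def replicate(source_blocks, target_blocks):
    """Copy only the blocks the target lacks; return the bytes sent."""
    sent = 0
    for digest, block in source_blocks.items():
        if digest not in target_blocks:
            target_blocks[digest] = block
            sent += len(block)
    return sent


# Source volume holding 100 unique 4 KB blocks.
source = {}
for i in range(100):
    block = i.to_bytes(4, "big") * 1024
    source[block_hash(block)] = block

target = {}
first_sync = replicate(source, target)   # every block is new to the target

# Five new unique blocks are written to the source afterwards.
for i in range(100, 105):
    block = i.to_bytes(4, "big") * 1024
    source[block_hash(block)] = block

second_sync = replicate(source, target)  # only the five new blocks travel
```

In this sketch the first sync sends all 409,600 bytes, while the second sends only the 20,480 bytes belonging to the five new blocks.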

Using SDFS adds a couple of additional advantages based on its architecture.

  • Scalability: SDFS can store huge amounts of data and can deduplicate at block sizes as small as 4 KB.
  • IO Performance: SDFS can be set up on multiple nodes sharing the same backend object store. Each node can read and write independently at over 2000 MB/s.

Great Applications for SDFS and Deduplication

  • Backups
  • Virtual Machines
  • Network shares for unstructured data such as office documents and PSTs
  • Any application with a large amount of duplicate data

Applications that are not a good fit for Deduplication

  • Anything that has totally unique data
  • Pictures
  • Music Files
  • Movies/Videos
  • Encrypted Data
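
To see why such data is a poor fit, the sketch below compares the deduplication ratio of repeated backups with that of random bytes. It is an illustration under assumptions: `dedup_ratio` is a hypothetical helper, the block size is fixed at 4 KB, and `os.urandom` output stands in for encrypted data, which looks random at the block level.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


def dedup_ratio(data):
    """Logical blocks divided by unique blocks; 1.0 means no savings."""
    hashes = [
        hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
        for i in range(0, len(data), BLOCK_SIZE)
    ]
    if not hashes:
        return 1.0
    return len(hashes) / len(set(hashes))


# Ten full backups of the same 1 MiB data set (256 distinct blocks each).
base = b"".join(i.to_bytes(4, "big") * 1024 for i in range(256))
backups = base * 10

# Random bytes of the same size, standing in for encrypted data.
random_like = os.urandom(len(backups))

backup_ratio = dedup_ratio(backups)
random_ratio = dedup_ratio(random_like)
```

Here `backup_ratio` is 10.0, because each unique block is stored once for ten logical copies. The random data yields no repeated blocks in practice, so its ratio stays at 1.0 and deduplication saves nothing.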


  1. Wikipedia – Data Deduplication