Administration Guide

Introduction:

This is intended to be a detailed guide for the SDFS file-system and SDVOL volume management. For most purposes, the Quickstart Guide will get you going but if you are interested in advanced topics, this is the place to look.

SDVOL is a distributed and expandable volume manager designed to provide inline deduplication and replication to any filesystem.

SDFS is a distributed and expandable filesystem designed to provide inline deduplication and flexibility for applications. Services such as backup, archiving, NAS storage, and Virtual Machine primary and secondary storage can benefit greatly from SDFS.

SDFS and SDVOL can be deployed in both standalone and distributed, multi-node configurations. As a standalone volume and filesystem, SDFS and SDVOL provide inline deduplication, replication, and unlimited snapshot capabilities. In a multi-node configuration, global deduplication across volumes, block storage redundancy, and block storage expandability are also added. In a multi-node configuration, file systems and volumes store and share unique data blocks with other volumes within the cluster. These volumes and file systems can also specify a level of redundancy for data stored in the cluster.

Features:

SDFS and SDVOL are designed to provide maximum performance for read and write activity in addition to the features below.

  • High Availability : All block data can be replicated across up to 7 Block Storage nodes (2 by default).
  • Global Deduplication across all SDFS volumes in a cluster
  • Block Storage Node expandability up to 126 independent nodes
  • Unlimited Snapshot capability without IO impact
  • Very low Intra-Cluster network IO since only globally unique blocks are written
  • Efficient, deduplication aware replication

Architecture:

SDFS’s unique design allows for many advantages over a traditional filesystem. The complete decoupling of block data from file metadata is the main characteristic of this design. Any number of logical files can reference the same unique data block. The unique data block has no knowledge of which files reference it or where those files are located. The metadata references a hash associated with a logical location within a file and, in the case of a cluster, the nodes that block can be fetched from. Because all data is deduplicated and shared between volumes, IO is reduced significantly, both across the network and on the local node.
SDFS is comprised of 4 basic components:
  • SDFS file-system service (Volume)
  • IP Cluster Communication
  • Deduplication Storage Engine (DSE)
  • Data Chunks

END TO END PROCESS

SDFS FILE META-DATA

Each logical SDFS file is represented by two different pieces of metadata, held in two different files. The first piece of metadata is called the “MetaDataDedupFile”. This file is stored in a filesystem structure that directly mimics the filesystem namespace that is presented when the filesystem is mounted. Each of these files is named the same as it appears when mounted and sits in the same directory structure under /opt/sdfs/volumes/<volume-name>/files/. This file contains all of the filesystem attributes associated with the file, including size, atime, ctime, ACLs, and a link to the associated map file.
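
For example, assuming a volume named "pool0" mounted at /media/pool0 (names illustrative), the MetaDataDedupFile for a file seen at /media/pool0/vm/disk1.img would live at:

/opt/sdfs/volumes/pool0/files/vm/disk1.img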

The second metadata file is the mapping file. This file contains a list of records corresponding to the locations of the blocks that make up the data in the file. Each record contains a hash entry, whether the data was a duplicate, and, when the data is stored on remote nodes, which nodes that data can be found on.

Each record has the following data structure:

| dup (1 byte) | hash (hash algorithm length) | reserved (1 byte) | hash location(s) (8 bytes) |

The locations field is an 8 byte array. Byte 1 in the array designates, again, whether the record contains a duplicate block or not. Bytes 2-8 indicate locations where the block can be found; each byte designates a different location.

WRITING DATA

When data is written to SDFS it is sent to the file-system process from the kernel via the FUSE library. SDFS grabs the data from the FUSE layer API and breaks the data into fixed chunks. These chunks are associated with fixed positions within the file as they were written. These chunks are immediately cached, on a per-file basis, for active IO reads and writes in a FIFO buffer. The size of this FIFO buffer is set to 1M by default but can be changed via the "max-file-write-buffers" attribute within the SDFS configuration file. Chunks within the FIFO buffer are expired as new data enters or if the data is untouched for over 2 seconds. When data expires from the FIFO buffer it is moved to a flushing buffer. This flushing buffer is emptied by a pool of threads configured by the "write-threads" attribute. These threads perform the process of computing the hash for the block of data, searching the system to see if the hash/data has already been stored, determining on what nodes the data is stored (not applicable for standalone), confirming the data has been persisted, and finally writing the record associated with the block to the mapping file.
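
As a rough sketch, the two attributes mentioned above could be tuned in the volume's XML configuration (/etc/sdfs/<volume-name>-volume-cfg.xml); the values below are placeholders only, and the exact element they belong to should be taken from your generated configuration file:

max-file-write-buffers="16"
write-threads="8"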

READING DATA

When data is read from SDFS, the requested file position and requested data length are sent through the FUSE layer to the SDFS application. The record(s) associated with the file position and length are then looked up in the mapping file, and the block data is recalled either by looking up the hash within the record locally or, in the case of a multi-node configuration, by requesting the hash from one of the servers found within the hash locations field for that record. If the data is not found on that server it is then looked up on successive servers until the block is located or the location list is exhausted.

READING DATA FROM AMAZON GLACIER

SDFS can utilize storage lifecycle policies with Amazon to support moving data to Glacier. When data is read from Amazon Glacier, the DSE first tries to retrieve the data normally, as if it were any other S3 blob. If this fails because the blob has been moved to Glacier, the DSE informs the OpenDedupe volume service that the data has been archived. The OpenDedupe volume service then initiates a Glacier archive retrieval process for all blocks associated with chunks used by the file being read. The read operation will be blocked until all blocks have been successfully restored from Glacier.

SDFS FILE-SYSTEM SERVICE

The SDFS File-System Service (FSS) is synonymous with the operating system concepts of a volume and a filesystem, as it performs the function of both. It is a logical container that is responsible for all file system level activity. This includes filesystem IO activity, chunking data into fixed blocks for deduplication, file system statistics, as well as all enhanced functions such as snapshots. Each File System Service, or volume, contains a single SDFS namespace, or filesystem instance, and is responsible for presenting and storing individual files and folders. The volume is mounted through “mount.sdfs”. This is the primary way users, applications, and services interact with SDFS.

The SDFS file-system service provides a typical POSIX-compliant view of deduplicated files and folders. The SDFS file-system service stores metadata regarding files and folders. This metadata includes information such as file size, file path, and most other aspects of files and folders other than the actual file data. In addition to metadata, the SDFS file-system service also manages the file maps that map logical data locations to deduplicated/non-deduplicated chunks. The chunks themselves live either within the local Deduplication Storage Engine or in the cluster, depending on the configuration.

SDFS Volumes can be exported through ISCSI or NFS.

If deployed in a multi-node configuration, multiple file-system services can share the same unique block data and use the same set of dedup storage engines. This allows for global deduplication.

IP Cluster Communication

SDFS can be deployed either standalone or in a multi-node configuration. In a standalone configuration the FSS and the DSE are contained within the same process. In a multi-node configuration several FSS communicate across the network to several DSEs.

In a multi-node configuration all of the SDFS File-System Service (FSS) instances and Deduplication Storage Engines (DSE) use a network protocol and communication engine provided by JGroups. By using JGroups, SDFS in a multi-node configuration is designed to be share-nothing and provides:

  • Automatic node discovery and removal
  • Spanning of LANs or WANs
  • Automatic membership detection and notification about joined/left/crashed cluster nodes

All SDFS testing is done using JGroups over multicast and UDP, but it can be configured to use TCP as well. The JGroups communication layer is configured through jgroups.cfg.xml. A sample jgroups config is provided with SDFS and can be modified as needed. Many sample configurations are available at https://github.com/belaban/JGroups/blob/master/conf/.

When new SDFS File-System Services or DSEs come online they announce themselves and are auto-discovered by the other nodes in the cluster. SDFS File-System Services are discovered so that garbage collection is performed consistently across all nodes or not at all. This ensures that blocks claimed by services that are offline are not removed. If a service goes offline its reference is still persisted but is no longer associated with an active node. That way, when garbage collection is performed, if a service is present but not associated with a node, the garbage collection will fail and exit until the next run when the node is back online. File-System Service references can be added and removed manually with the sdfscli utility. You will want to make sure all volumes are listed in the active cluster, whether or not they are online. Otherwise, data will be lost from volumes that are offline when global garbage collection occurs. To add a volume run:

sdfscli --cluster-volume-add <volume-name>

When a DSE comes online it is added to the available list of nodes where unique blocks can be written. Each DSE has a unique numeric id in the range of [1-127]. This numeric id is configured during DSE creation but can be changed after the fact within the DSE configuration file. If the id is changed after data is written to the DSE, all references to that data will be lost until a “cluster-redundancy-check” is performed on all FSS instances. If a duplicate id exists in the cluster the DSE will not start. SDFS File-System Services (FSS) will add the available storage on the DSE to their capacity and begin actively writing unique blocks of data to that node and referencing that node id for retrieval. DSE nodes are written to based on a weighted random distribution determined independently by each FSS. This distribution algorithm is weighted toward even storage usage across all nodes, with some randomness that is weighted based on the available capacity of the DSEs.

The communication sequence between the File-System Service and the DSE in a multi-node cluster is similar to the standalone configuration. One difference is that all hash lookups are multicast to all DSEs before writes occur. The hash lookups are batched into pools of 40 by individual File-System Services to reduce latency. Each of the DSEs will respond with a corresponding list of yes or no answers indicating whether they have already stored the block associated with that hash. The File System Service (FSS) will wait for all responses or until a timeout is reached (5 seconds by default). It will then check to make sure all redundancy requirements are met for the block of data. As an example, if a block of data is only stored on one DSE but the volume configuration requires that the block be stored on 2 nodes, then the FSS will write the block to a secondary node.

If the responses have met the cluster redundancy requirements then the FSS will store the cluster node numbers and the hash in the map file associated with the specific write request. If the block has not yet been stored, or if the block does not meet the cluster redundancy requirements, then the block will be written to nodes that have not already stored that block. The determination of where that block will be stored is based on the distribution algorithm described above. These writes are done via unicast to the specific storage nodes.

Deduplication Storage Engine

The Deduplication Storage Engine (DSE) stores, retrieves, and removes all deduped chunks. The Deduplication Storage Engine can be run as part of an SDFS volume, which is the default, or as part of a global deduplication storage pool. Chunks of data are stored on disk, or at a cloud provider, and indexed for retrieval with a custom written hash table. The hash table is broken up into segments that become read-only after they fill up. Each DSE database is stored in /chunkstore/hdb/.

Data Blocks

Unique data chunks are stored together in data blocks by the Dedupe Storage Engine (DSE), either on disk or in the cloud. The dedupe storage engine stores collections of data chunks, in sequence, within data blocks in the chunkstore directory. By default the data block size is no more than 40MB but can be set to anything up to 2GB. Blocks are closed and are no longer writable when either their size limit is reached or the block times out waiting for new data. The timeout is 6 seconds by default. Writable blocks are stored in chunkstore/outgoing/.

The DSE creates new blocks as unique data is written into the DSE. Each new block is designated by a unique long integer. When unique data is written in, the DSE (optionally) compresses/encrypts the chunk and then writes the chunk into a new block in sequence. It then stores a reference to the unique chunk hash and the block’s unique id. The block itself keeps track of where unique chunks are located in a map file associated with each block. As blocks reach their size limit or time out, they are closed for writing and then either uploaded to the cloud and cached locally or moved to a permanent location on disk under /chunkstore/[first three numbers of the unique id]/. The map file and the blocks are stored together.

Cloud Storage of Data Blocks

When data is uploaded to the cloud, the DSE creates a new thread to upload the closed block. The number of blocks that can be simultaneously uploaded is 16 by default but can be changed within the XML config (io-threads). For slower connections it may make sense to lower this number, or to raise it for faster connections. Blocks are stored in the bucket under the /blocks/ subdirectory and the associated maps are stored in the /keys/ directory.
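
For example, to reduce the number of concurrent uploads on a slow connection, the attribute could be lowered in the DSE section of the XML configuration (the value shown is illustrative):

io-threads="4"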

Data chunks are always read locally when requested by the volume. If cloud storage is leveraged, the data block where a requested chunk is located is retrieved from the cloud storage provider and cached locally. The unique chunk is then read from the local cache and restored.

If data is stored in an Amazon Glacier repository, the DSE informs the volume that the data is archived. The volume will then initiate an archive retrieval process.

Data Chunk

Data chunks are the unit by which raw data is processed and stored within SDFS. Chunks of data are stored either with the Deduplication Storage Engine or the SDFS file-system service depending on the deduplication process (see below). Chunks are, by default, hashed using the Murmur3 hashing algorithm. SDFS also includes other hashing algorithms, but Murmur3 is fast, collision resistant, and requires half the memory of SHA-256 to store.

Redundancy and High Availability

SDFS provides high availability for all unique block data but not for the volume metadata. This means that if any DSE goes down, block reads can still be guaranteed from the cluster. In addition, all DSE writes can still be supported by the remaining DSE nodes in the cluster. The high availability threshold is the number of cluster block replicas specified for the FSS minus 1. So, if you specified “cluster-block-replicas=6” at the time the volume was created, up to 5 DSE nodes could go down and the volume could still be serviced.

Here are a few other facts regarding unique data redundancy:

  • Unique data is written to multiple DSE nodes asynchronously.
  • If two volumes specify different redundancy requirements and share unique data, redundancy will be met based on the volume with the highest redundancy requirement.
  • If the number of DSEs drops below the replica requirement, writes will still occur to the remaining DSE nodes.

There are plans to provide high availability for volume metadata within SDFS as well, but even without it there are ways to ensure high availability of the volume metadata using tools such as DRBD. With DRBD you could replicate the volume containing the SDFS metadata to a second node. This second node could have the SDFS FSS actively mounted in a standby configuration or unmounted. The SDFS volume configuration file would also need to be the same on both nodes. In this configuration, very limited amounts of metadata will be replicated between nodes since all block data is already in the cluster.

You can periodically ensure that all data is written based on redundancy requirements using the sdfscli command option “cluster-redundancy-check”. This option will ensure that all data meets the redundancy requirements of the volume that it is run on.
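
For example, to run the check against the mounted volume:

sdfscli --cluster-redundancy-check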

Planning Your SDFS Architecture:

Standalone VS. Multi-Node Cluster

When deciding on a standalone architecture versus a multi-node architecture, consider the following advantages of each:

STANDALONE ADVANTAGES :

  • IO Speeds : Typically higher since all data transfers between the FSS and DSE are just API calls within the same process
  • Simplicity : Setup does not require network considerations or multinode complexity

MULTI-NODE CLUSTER ADVANTAGES :

  • High Availability : All block data can be replicated across multiple DSE nodes. Therefore if any single DSE (or multiple, depending on the configuration) is down, the data for all served FSSs is still available
  • Scalability : DSE Storage can be automatically added, as required to the cluster, without impact or downtime.
  • Global Deduplication : Significant storage savings from deduplication between all FSS’s since all unique data is shared across all nodes in the cluster.
  • Performance : IO performance can increase as more DSE nodes are added to the cluster. SDFS load balances all IO activity between DSE nodes in order to make sure that storage usage and IO load is evenly distributed.

To create a volume you will want to consider the following:

For all volume types

  • Where to store unique blocks – on local disk, on clustered DSEs, or in the cloud. Cloud storage is currently only supported on standalone volumes; it is typically redundant but slow, and cannot be shared between volumes. Local storage is fast but not redundant and cannot be shared between volumes. Clustered DSEs are redundant and scalable, but typically not as fast as local disk.
  • Total storage to allocate to unique data – SDFS stores unique data in a contiguous file on disk. Local storage and cloud storage cannot be easily expanded. In a multi-node cluster, DSEs can be easily added to expand storage.
  • Dedup Chunk Size – Larger chunks mean lower system requirements and smaller chunks mean better deduplication. Once you set the chunk size it cannot be changed. This is true for both standalone and clustered configurations. The dedup chunk size for Virtual Machines should be set to 4K, while for unstructured data chunk sizes of 32K to 128K will suffice (see the examples following this list).
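
As an illustration, the following commands (volume names and capacities are placeholders) create one volume tuned for virtual machines and one tuned for unstructured data:

mkfs.sdfs --volume-name=vmstore --volume-capacity=1TB --io-chunk-size=4
mkfs.sdfs --volume-name=archive --volume-capacity=2TB --io-chunk-size=128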

For Multi-Node Clusters

  • Redundancy Requirements – SDFS volumes can be configured to make all unique blocks redundant. Up to 7 copies of any block can be stored, depending on the number of nodes.
  • Network IO Speed – Multi-node clusters can be chatty, with lots of small messages. Consider network requirements based on performance needs. Bandwidth requirements will depend on read vs write needs. Read performance will depend completely on raw network bandwidth from the DSE pool. Write speed will primarily depend on network latency and raw bandwidth.
  • Network Architecture – SDFS, by default, uses multicast to communicate between nodes and for initial discovery. If the default configuration is chosen, consider having all nodes in the same subnet or making sure multicast is enabled on switches and routers.
  • Number of DSEs in the cluster – Consider having at least 2 DSEs for any multi-node cluster. This will improve speed and resilience. Additional DSE nodes can be added later, without downtime, as capacity needs grow.
  • Hardware Requirements – FSS instances are very CPU intensive, since they hash all data writes. DSE nodes are very memory and disk IO intensive since they contain the hash lookup table and block storage for data writes and retrievals.

Fixed and Variable Block Deduplication

SDFS can perform both fixed and variable block deduplication. Fixed block deduplication takes fixed blocks of data and hashes those blocks. Variable block deduplication attempts to find natural breaks within a stream of data and creates variable blocks at those break points.

Fixed block deduplication is performed on volume-defined fixed byte buffers within SDFS. These fixed blocks are defined when the volume is created and are set to 4K by default but can be set to a maximum value of 128K. Fixed block deduplication is very useful for active, structured data such as running VMDKs or databases. Fixed block deduplication is simple to perform and can therefore be very fast for most applications.

Variable block deduplication is performed using Rabin window borders (http://en.wikipedia.org/wiki/Rabin_fingerprint). SDFS uses fixed buffers of 256K and then runs a rolling hash across each buffer to find natural breaks. The minimum size of a variable block is 4K and the maximum size is 32K. Variable block deduplication is very good at finding duplicate blocks in unstructured data such as uncompressed tar files and documents. Variable block deduplication typically creates blocks of 10K-16K. This makes variable block deduplication more scalable than fixed block deduplication performed at 4K block sizes. The downside of variable block deduplication is that it can be computationally intensive and sometimes slower for write processing.

Variable block deduplication can only be enabled when the volume is created, using --hash-type=VARIABLE_MURMUR3.

Variable block deduplication can only be performed on local chunkstores or for cloud data.
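
For example, a 1TB standalone volume with variable block deduplication enabled could be created as follows (volume name and capacity are illustrative):

mkfs.sdfs --volume-name=dedup --volume-capacity=1TB --hash-type=VARIABLE_MURMUR3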

Creating and Mounting SDFS File Systems:

Both standalone and clustered SDFS volumes are created through the mkfs.sdfs command line. There are many options available within the command line, but most of the options default to their optimal setting. Multiple SDFS volumes can be hosted on a single host. All volume configurations are stored, by default, in /etc/sdfs.

CREATING A STANDALONE SDFS VOLUME

To create a simple standalone volume named “dedup” with a dedup capacity of 1TB, run the following command:

mkfs.sdfs --volume-name=dedup --volume-capacity=1TB

The following will create a volume that has a dedup capacity of 1TB and a unique block size of 32K

mkfs.sdfs --volume-name=dedup --volume-capacity=1TB --io-chunk-size=32

To add the ability for a volume to be a replication source or “master”, use the following command.

mkfs.sdfs --volume-name=dedup --volume-capacity=1TB --enable-replication-master --sdfscli-password=<apasswrd>

CREATING A CLUSTER AWARE SDFS VOLUME

Volumes that are created to join a cluster require some additional configuration. In addition to the standalone options described above, you will also want to make sure that:

  • The block size is the same across all nodes in the cluster
  • You have a shared name for the cluster. This is used to identify that all nodes listening are in the same cluster
  • Your network is bound to the right IP address
  • The Jgroups config is similar across all nodes in the cluster

To create a simple clustered volume use the following command. This will create a 1TB volume that will store its unique blocks in the cluster, use a cluster name of “sdfs-cluster”, and have a block size of 4K.

mkfs.sdfs --chunk-store-local=false --volume-name=test --volume-capacity=1TB

In addition to creating the volume, you will also want to modify the jgroups configuration associated with this volume. By default the configuration references /etc/sdfs/jgroups.cfg.xml but this can be changed within the SDFS XML configuration. Within the jgroups config, you will want to add the attribute bind_addr="<my ip address that is visible to other nodes>" to the "UDP" xml tag. This will ensure that the SDFS volume is bound to the appropriate NIC/IP.
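
A minimal sketch of the modified UDP tag is shown below. The IP address is illustrative, and the other attributes shipped in the sample jgroups.cfg.xml (represented here by "...") should be left in place:

<UDP bind_addr="10.0.0.24" ... />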

To create a clustered volume with specific redundancy requirements, use the following command, which will create a volume that ensures at least one copy of each block exists on at least 3 nodes.

mkfs.sdfs --chunk-store-local=false --volume-name=test --volume-capacity=1TB --cluster-block-replicas=3

By default volumes store all data in the folder structure /opt/sdfs/<volume-name>. This may not be optimal and can be changed before a volume is mounted for the first time. In addition, volume configurations are held in the /etc/sdfs folder. Each volume configuration is created when the mkfs.sdfs command is run and stored as an XML file and its naming convention is <volume-name>-volume-cfg.xml.

SDFS volumes are mounted with the mount.sdfs command. Mounting a volume typically is executed by running “mount.sdfs -v <volume-name> -m <mount-point>”. As an example, “mount.sdfs -v sdfs -m /media/dedup” will mount the volume as configured by /etc/sdfs/sdfs-volume-cfg.xml to the path /media/dedup. Volume mounting options are as follows:

-c              sdfs volume will be compacted and then exit
-d              debug output
-forcecompact   sdfs volume will be compacted even if it is missing blocks. This option is used in conjunction with -c
-h              displays available options
-m <arg>        mount point for SDFS file system e.g. /media/dedup
-nossl          If set, SSL will not be used for sdfscli traffic.
-o <arg>        fuse mount options. Will default to direct_io,big_writes,allow_other,fsname=SDFS
-rv <arg>       comma separated list of remote volumes that should also be accounted for when doing garbage collection. If not entered the volume will attempt to identify other volumes in the cluster.
-s              Run single threaded
-v <arg>        sdfs volume to mount e.g. dedup
-vc <arg>       sdfs volume configuration file to mount e.g. /etc/sdfs/dedup-volume-cfg.xml

Volumes are unmounted automatically when the mount.sdfs process is killed or the volume is unmounted using the umount command.

To mount a volume run:

mount.sdfs -v <volume-name> -m <mount-point>

Exporting SDFS Volumes:

SDFS can be shared through NFS or ISCSI exports on Linux kernel 2.6.31 and above.

NFS Exports

SDFS is supported with and has been tested with NFSv3. NFS opens and closes files with every read or write. File opens and closes are expensive for SDFS and as such can degrade performance when running over NFS. SDFS volumes can be optimized for NFS with the option “--io-safe-close=false” when creating the volume. This will leave files open for NFS reads and writes. File data will still be synced with every write command, so data integrity will still be maintained. Files will be closed after an inactivity period has been reached. By default this inactivity period is 15 minutes (900 seconds) but can be changed at any time, along with the io-safe-close option, within the XML configuration file located in /etc/sdfs/<volume-name>-volume-cfg.xml.
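
As an example, a volume intended to be exported over NFS could be created with safe close disabled (volume name and capacity are placeholders):

mkfs.sdfs --volume-name=nfsvol --volume-capacity=1TB --io-safe-close=false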

To export an SDFS volume or FSS via NFS, use the fsid=<a unique id> option as part of the syntax in your /etc/exports. As an example, if an SDFS volume is mounted at /media/pool0 and you wanted to export it to the world, you would use the following syntax in your /etc/exports:

/media/pool0	*(rw,async,no_subtree_check,fsid=12)

ISCSI Exports

SDFS is supported with and has been tested with LIO using fileio. This means that LIO serves up a file on an SDFS volume as a virtual volume itself. On the SDFS volume the exported volumes are represented as large files within the filesystem. This is a common setup for ISCSI exports. The following command sequence will serve up a 100GB ISCSI volume from an SDFS filesystem mounted at /media/pool0 without any authentication.

tcm_node --fileio fileio_0/test /media/pool0.test.iscsi 107374182400
lio_node --addlun iqn.2013.org.opendedup.iscsi 1 0 test fileio_0/test
lio_node --addnp iqn.2013.org.opendedup.iscsi 1 0.0.0.0:3260
lio_node --permissive iqn.2013.org.opendedup.iscsi 1
lio_node --disableauth=iqn.2013.org.opendedup.iscsi 1
echo 0 > /sys/kernel/config/target/iscsi/iqn.2013.org.opendedup.iscsi/tpgt_1/attrib/demo_mode_write_protect

Managing SDFS Volumes for Virtual Machines:

It was an original goal of SDFS to be a filesystem for virtual machines. To get proper deduplication rates for VMDK files, set io-chunk-size to “4” when creating the volume. This will usually match the cluster size of the guest OS filesystem. NTFS allows 32K cluster sizes, but not on root volumes. It may be advantageous, for Windows guest environments, to have the root volumes on one mounted SDFS path with a 4K chunk size and the data volumes on another SDFS path with a 32K chunk size. Then format the data NTFS volumes, within the guest, with a 32K cluster size. This will provide optimal performance.
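
As a sketch, the split described above might look like the following pair of volumes (names and capacities are placeholders):

mkfs.sdfs --volume-name=vm-root --volume-capacity=500GB --io-chunk-size=4
mkfs.sdfs --volume-name=vm-data --volume-capacity=1TB --io-chunk-size=32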

Managing SDFS Volumes through SDFS command line

Online SDFS management is done through sdfscli. This is a command line executable that allows access to management functions and information for a particular SDFS volume. The volume in question must be mounted when the command line is executed. Below are the command line parameters that can be run. Help is also available for the command line when run as sdfscli --help. The volume itself will listen as an HTTPS service on a port starting at 6442. By default the volume will only listen on the loopback adapter. This can be changed during volume creation by adding the “enable-replication-master” option. In addition, after creation, this can be changed by modifying the “listen-address” attribute in the sdfscli tag within the XML config. If multiple volumes are mounted, each volume will automatically choose the next highest available port. The TCP port can be determined by running “df -h”; the port will be designated after the “:” within the device name.
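
For example, if a second mounted volume reports port 6443 in its “df -h” device name, its statistics could be queried with the following command (the port shown is illustrative):

sdfscli --port=6443 --volume-info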

usage: sdfs.cmd <options>

--archive-out <arg>                   Creates an archive tar for a particular file or folder and outputs the location.
--change-password <arg>               Changes the administrative password.
--cleanstore <minutes>                Cleans the dedup storage engine of data that is older than the defined minutes and is unclaimed by current files. This command only works if the dedup storage engine is local and not in network mode.
--cluster-dse-info                    Returns Dedup Storage Engine statistics for all storage nodes in the cluster.
--cluster-make-gc-master              Makes this host the current Garbage Collection Coordinator.
--cluster-redundancy-check            Makes sure that the storage cluster maintains the required number of copies for each block of data.
--cluster-volume-add <arg>            Adds an unassociated volume in the cluster.
--cluster-volume-remove <arg>         Removes an unassociated volume in the cluster.
--cluster-volumes                     Returns a list of SDFS volumes in the cluster.
--debug                               Makes output more verbose.
--debug-info                          Returns debug information.
--dedup-file <true|false>             Deduplicates all file blocks if set to true, otherwise it will only dedup blocks that are already stored in the DSE. Used with --file-path=<file to flush>.
--dse-info                            Returns Dedup Storage Engine statistics.
--expandvolume <arg>                  Expands the local volume, online, to a size in MB, GB, or TB.
--file-info                           Returns IO file attributes such as dedup rate and file IO statistics, e.g. --file-info --file-path=<path to file or folder>.
--file-path <RELATIVE PATH>           The relative path to the file or folder to take action on.
--flush-all-buffers                   Flushes all buffers within an SDFS file system.
--flush-file-buffers                  Flushes the buffer of a particular file.
--help                                Displays these options.
--import-archive <arg>                Imports an archive created using archive out. Used with --replication-master=<server-ip> and --replication-master-password=<server-password>.
--nossl                               If set, tries to connect to the volume without SSL.
--password <arg>                      Password to authenticate to the SDFS CLI interface for the volume.
--perfmon-on <true|false>             Turns the volume performance monitor on or off.
--port <arg>                          SDFS CLI interface TCP listening port for the volume.
--replication-batch-size <arg>        The size, in MB, of the batch that the replication client will request from the replication master (--replication-batch-size=<size in MB>). If ignored or set to <1 it will default to whatever is set on the replication client volume, which is currently 30 MB. This will default to “-1”.
--replication-master-port <arg>       The server port associated with the archive imported. This will default to “6442”.
--server <arg>                        SDFS host location.
--snapshot                            Creates a snapshot for a particular file or folder.
--snapshot-path <RELATIVE PATH>       The relative path to the destination of the snapshot.
--volume-info                         Returns SDFS volume statistics.

File-system Snapshots:

SDFS provides snapshot functions for files and folders. The snapshot command is "sdfscli --snapshot --snapshot-path=<relative-target-path> --file-path=<relative-source-path>". The destination path is relative to the mount point of the sdfs filesystem.

As an example, to snapshot a file “/mounted-sdfs-volume/source.bin” to /mounted-sdfs-volume/folder/target.bin you would run the following command:

sdfscli --snapshot --snapshot-path=folder/target.bin --file-path=source.bin

The snapshot command makes a copy of the MetaDataDedupFile and the map file and associates this copy with the snapshot path. This means that no actual data is copied, and unlimited snapshots can be created without performance impact to the target or source, since they are not associated or linked in any way.

SDFS File System Service Volume Storage Reporting, Utilization, and Compaction

Reporting

An FSS reports its size and usage to the operating system, by default, based on the capacity and current usage of the data stored within the DSE. This means that a volume could contain a much larger amount of logical data (the sum of all file sizes in the filesystem) than is reported to the OS. This method of reporting is the most accurate as it reports actual physical capacity, and with deduplication you should see the current size as much smaller than the logical data would suggest.

It is also possible for an FSS to report the logical capacity and utilization of the volume. This means that SDFS will report the logical capacity, as specified during volume creation via “--volume-capacity”, and current usage based on the logical size of the files as reported during an “ls” command. To change the reporting, the following parameters will need to be changed in the SDFS XML configuration while the volume is not mounted.

To set the FSS to report capacity based on the --volume-capacity:

use-dse-capacity="false"

To set the FSS to report current utilization based on logical capacity:

use-dse-size="false"

Volume size can be reported in many ways. Both operating system tools and the sdfscli can be used to view the capacity and usage of an FSS. A quick way to view how the filesystem sees an SDFS volume is to run “df -h”. The OS will report the volume name as a concatenation of the config and the port that the sdfscli service is listening on. The port is important because it can be used to connect to multiple volumes from the sdfscli using the “--port” option. The size and used columns report either the capacity and usage of the DSE or the logical capacity and usage of the volume, depending on configuration.

Volume usage statistics are also reported by the SDFS command line. For both standalone volumes and volumes in a cluster configuration, the command line “sdfscli.sh --volume-info” can be executed. This will output statistics about the local volume.

To view stats on all volumes in the cluster use “sdfscli.sh --cluster-volume-info”

Utilization

SDFS volumes grow over time as more unique data is stored. As unique data is de-referenced from volumes it is deleted from the DSE, if not claimed, during the garbage collection process. Deleted blocks are overwritten over time. Typically this allows for efficient use of storage, but if data is aggressively added and deleted a volume can have a lot of empty space and fragmentation where deleted unique blocks used to reside. Reclaiming this space through compaction (described below) is only available for standalone volumes.

Compaction

If a volume has over-allocated the space used on disk, it can be compacted offline. The compacting process will require enough space to hold the current DSE and the new “compact” DSE. The process for compacting is as follows:

  1. Unmount the volume
  2. Remount the volume with the -c option. e.g. mount.sdfs -v pool0 -c -m /media/pool0. The compact process first claims all unique data used by files and then creates a new DSE with only the data used by the files currently on the volume. Once finished the volume will be unmounted.
  3. Remount the volume as normal. The first mount after a compact will run through an integrity check to make sure all data is written and available within the DSE

Dedup Storage Engine (DSE):

The Dedup Storage Engine (DSE) provides services to store, retrieve, and remove deduplicated chunks of data. By default, each SDFS volume contains and manages its own DSE. When a DSE runs as part of a volume it is referred to as running in local mode. A DSE can also be configured to run as a clustered network service and support many volumes across multiple physical servers. When a DSE is running as a clustered network service it is referred to as running in clustered DSE mode.

A DSE running in local mode is configured automatically when a volume is created and works well in most environments. In local mode, most settings can be changed after volume creation. Keep in mind that if directory locations are changed, data will need to be migrated to the new directories as well. In local mode the DSE cannot be resized once created.

CLUSTERED DSE NODES

A clustered DSE node is an independent service that runs on a host. Multiple DSE nodes can run on a single host. A clustered DSE node is uniquely identified by its cluster node id; each node must have a unique id. It will join an SDFS cluster automatically and register its id. If another node in the cluster already contains that id, the DSE will not start.

A clustered DSE node uses the IP Cluster Communication layer and does not contain any online management capability. Once it is started it will begin to write, read, and expunge data according to the demand from attached volumes.

The advantage of running the DSE in clustered mode is that it allows for global deduplication, better availability, and better scalability. The disadvantage is the complexity of configuration and possibly performance.

CREATING A CLUSTER MODE DSE

Creating a clustered DSE requires that it has network IP connectivity to the other nodes in the cluster. You will want to ensure this before proceeding. To create a DSE that has a capacity of 1TB and will accept requests for 128K blocks, run the following command.

mkdse --dse-name=sn1 --dse-capacity=1TB --cluster-node-id=1 --listen-ip=10.0.0.24

To start a second DSE run this on the next node

mkdse --dse-name=sn1 --dse-capacity=1TB --cluster-node-id=2 --listen-ip=10.0.0.25

This command will create a configuration document that is stored in /etc/sdfs/sn1-dse-cfg.xml. Before you start the server, you will want to edit the jgroups config associated with the DSE. By default the configuration references /etc/sdfs/jgroups.cfg.xml but this can be changed within the XML configuration. Within the jgroups config, you will want to add the attribute bind_addr="<my ip address that is visible to other nodes>" to the "UDP" xml tag. This will ensure that the DSE is bound to the appropriate NIC/IP.

Running a DSE in network mode is executed through the startDSEService.sh script. It requires the path to the DSE configuration file.

To run the DSE execute :

startDSEService.sh -c /etc/sdfs/sn1-dse-cfg.xml

Enabling a volume to use a DSE requires that they can all talk with each other over the network. To ensure a volume can see a DSE, run “sdfscli --cluster-dse-info” on a volume in the cluster. It will output the DSE stores available for reads and writes.

The list of command options for startDSEService.sh is:

-c <arg>    sdfs cluster configuration file to start this storage node
-d          debug output
-h          displays available options
-rv <arg>   comma separated list of remote volumes that should also be accounted for when doing garbage collection. If not entered the volume will attempt to identify other volumes in the cluster.

Dedup Storage Engine – Cloud Based Deduplication:

The DSE can be configured to store data to the Amazon S3 cloud storage service or Azure. When enabled, all unique blocks will be stored in a bucket of your choosing. Each block is stored as an individual blob in the bucket. Data can be encrypted before transit and at rest in the S3 cloud using AES-256 encryption. In addition, all data is compressed by default before being sent to the cloud.

The purpose of deduplicating data before sending it to cloud storage is to minimize storage and maximize write performance. The concept behind deduplication is to store only unique blocks of data. If only unique data is sent to cloud storage, bandwidth can be optimized and cloud storage can be reduced. Opendedup approaches cloud storage differently than a traditional cloud based file system. The volume data, such as the namespace and file metadata, is stored locally on the system where the SDFS volume is mounted. Only the unique chunks of data are stored at the cloud storage provider. This ensures maximum performance by allowing all file system functions to be performed locally except for data reads and writes. In addition, local read and write caching should make writing smaller files transparent to the user or service writing to the volume.

Cloud based storage has been enabled for the Amazon S3 web service and Azure. To create a volume using cloud storage, refer to the cloud storage guide.

Dedup Storage Engine Memory:

The SDFS filesystem itself uses about 1GB of RAM for internal processing and caching. For hash table caching and chunk storage, kernel memory is used. It is advisable to have enough memory to store the entire hash table so that SDFS does not have to scan swap space or the file system to look up hashes.

To calculate memory requirements, keep in mind that stored chunks take up approximately 400 MB of RAM per 1 TB of unique storage.
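
For example, a DSE sized for 10 TB of unique data would need roughly 4 GB of RAM for chunk indexing (10 x 400 MB), in addition to the roughly 1 GB used by the filesystem process itself. Keep in mind that smaller chunk sizes raise the per-TB memory cost (see the Data Chunks section below).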

Dedup Storage Engine Crash Recovery:

If a Dedup Storage Engine crashes, a recovery process will be initiated on the next start. In a clustered node setup, a DSE determines that it was not shut down gracefully if <relative-path>/chunkstore/hdb/.lock exists on startup. In a standalone setup, the SDFS volume detects a crash if closed-gracefully="false" is set within the configuration XML. If a crash is detected, the Dedup Storage Engine will go through the recovery process.

The DSE recovery process re-hashes all of the blocks stored on disk, or gets the hashes from the cloud. It then verifies the hashes are in the hash database and adds an entry if none exists. This process will claim all hashes, even previously de-referenced hashes that were removed during garbage collection.

In a standalone setup, the volume will then perform a garbage collection to de-reference any orphaned block. In a clustered configuration the garbage collection will need to be performed manually.

Data Chunks:

The chunk size must match for both the SDFS Volume and the Deduplication Storage Engine. The default for SDFS is to store chunks at 4K size. The chunk size must be set at volume and Deduplication Storage Engine creation. When Volumes are created with their own local Deduplication Storage Engine chunk sizes are matched up automatically, but, when the Deduplication Storage Engine is run as a network service this must be set before the data is stored within the engine.

Within an SDFS volume the chunk size is set upon creation with the option --io-chunk-size. The option --io-chunk-size sets the size of the chunks that are hashed and can only be changed before the file system is mounted for the first time. The default setting is 4K but it can be set as high as 128K. The chunk size determines the efficiency at which files will be deduplicated, at the cost of RAM. As an example, a 4K chunk size provides perfect deduplication for Virtual Machines (VMDKs) because it matches the cluster size of most guest OS file systems, but can cost as much as 6GB of RAM per 1TB stored. In contrast, setting the chunk size to 128K is perfect for archived, unstructured data, such as rsync backups, and will allow you to store as much as 32TB of data with the same 6GB of memory.

To create a volume that will store VMs (VMDK files), create a volume using a 4K chunk size as follows:
sudo ./mkfs.sdfs --volume-name=sdfs_vol1 --volume-capacity=150GB --io-chunk-size=4

As stated, when running SDFS volumes with a local DSE, chunk sizes are matched automatically, but if running the DSE as a network service, then a parameter within the DSE configuration XML file will need to be set before any data is stored. The parameter is:

page-size="<chunk-size in bytes>"

As an example, to set a 4K chunk size the option would need to be set to:

page-size="4096"

File and Folder Placement:

Deduplication is IO intensive. SDFS, by default, writes data to /opt/sdfs. SDFS does a lot of writes when persisting data and a lot of random IO when reading data. For IO intensive applications it is suggested that you split at least the chunk-store-data-location and chunk-store-hashdb-location onto fast and separate physical disks. From experience these are the most IO intensive stores and can take the most advantage of faster IO.
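
As a hedged example, the relevant locations in the volume's XML configuration could point at separate physical disks as shown below; the paths are illustrative, and existing data must be migrated if these locations are changed after data has been written:

chunk-store-data-location="/mnt/disk1/chunkstore"
chunk-store-hashdb-location="/mnt/disk2/hdb"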

Other options and extended attributes:

SDFS uses extended attributes to manipulate the SDFS file system and files contained within. It is also used to report on IO performance. To get a list of commands and readable IO statistics run “getfattr -d *” within the mount point of the sdfs file system.

sdfscli --file-info --file-path=<relative-path to file or folder>

SDFS Volume Replication:

SDFS now provides asynchronous master/slave volume and subvolume replication through the sdfsreplicate service and script. SDFS volume replication takes a snapshot of the designated master volume or subfolder and then replicates metadata and unique blocks to the secondary, or slave, SDFS volume. Only unique blocks that are not already stored on the slave volume are replicated, so data transfer should be minimal. The benefits of SDFS replication are:

* Fast replication – SDFS can replicate large volume sets quickly.
* Reduced bandwidth – Only unique data is replicated between volumes.
* Built-in scheduling – The sdfsreplicate service has a built-in scheduling engine based on cron style syntax.
* Sub-volume replication – The sdfsreplicate service can replicate volumes or subfolders to slave volumes. In addition, replication can be targeted to sub-volumes on the slave.
* Sub-volume targets on the slave allow for wildcard naming such as an appended timestamp or the hostname of the master.

The steps SDFS uses to perform asynchronous replication are the following:

1. The sdfsreplicate service, on the slave volume host, requests a snapshot of the master volume or subfolder over the master’s tcp management channel (typically port 6442).
2. The master volume creates a snapshot of all SDFS metadata and data maps.
3. The master volume tars and zips the snapshot metadata and data maps.
4. The sdfsreplicate service, on the slave volume host, downloads the snapshot tar over the master’s tcp management channel (typically port 6442).
5. The slave volume unzips and imports the tar to its volume structure.
6. The slave volume imports data associated with the master snapshot to its dedup storage engine from the master volume over the master’s management cli channel. This defaults to TCP port 6442.

The steps required to setup master/slave replication are the following:

1. Configure your SDFS master volume to allow replication. This is done by creating an SDFS volume with the command line parameter "--enable-replication-master", e.g. mkfs.sdfs --volume-name=vol0 --volume-capacity=1TB --io-chunk-size=4 --chunk-store-size=200GB --enable-replication-master
2. Configure the replication.props configuration file on the slave. An example of this file is included in the etc/sdfs directory and includes the following parameters:
#Replication master settings
#IP address of the server where the master volume is located
replication.master=master-ip
#Number of copies of the replicated folder to keep. This will use First In First Out.
#It must be used in combination with the replication.slave.folder option %d. If set to -1 it is ignored
replication.copies=-1
#the password of the master. This defaults to “admin”
replication.master.password=admin
#The sdfscli port on the master server. This defaults to 6442
replication.master.port=6442
#The folder within the volume that should be replicated. If you would like to replicate the entire volume use “/”
replication.master.folder=/
#Replication slave settings
#The local ip address that the sdfscli is listening on for the slave volume
replication.slave=localhost
#the password used on the sdfscli for the slave volume. This defaults to admin
replication.slave.password=admin
#The tcp port the sdfscli is listening on for the slave
replication.slave.port=6442
#The folder where you would like to replicate to wild cards are %d (date as yyMMddHHmmss) %h (remote host)
#the slave folder to replicate to, e.g. backup-%h-%d will output "backup-<master-name>-<timestamp>"
replication.slave.folder=backup-%h-%d
#The batch size the replication slave requests data from the server in MB. This defaults to 30MB but can be anything up to 128 MB.
replication.batchsize=-1
#Replication service settings
#The folder where the SDFS master snapshot will be downloaded to on the slave. The snapshot tar archive is deleted after import.
archive.staging=/tmp
#The log file that will output replication status
logfile=/var/log/sdfs/replication.log
#Schedule cron = as a cron job, single = run one time
schedule.type=cron
#Every 30 minutes take a look at http://www.quartz-scheduler.org/documentation/quartz-2.x/tutorials/tutorial-lesson-06 for scheduling tutorial
schedule.cron=0 0/30 * * * ?
#The folder where job history will be persisted. This defaults to a folder called "replhistory" under the same directory where this file is located.
#job.history.folder=/etc/sdfs/replhistory

3. Run the sdfsreplicate script on the slave. This will either run once and exit if schedule.type=single or will run continuously with schedule.type=cron

e.g. sdfsreplicate /etc/sdfs/replication.props

Data Chunk Removal:

SDFS uses two methods to remove unused data from a Dedup Storage Engine (DSE). If the SDFS volume has its own dedup storage engine, which it does by default, unused, or orphaned, chunks are removed as the size of the DSE increases in 10% increments and at a specified schedule (defaults to midnight). The schedule can be configured at creation with the io-claim-chunks-schedule option; otherwise it can be configured afterwards with the sdfscli command option --set-gc-schedule (an example follows the process steps below). Take a look at the cron format to review the accepted syntax. The process for garbage collection is detailed below.

1. SDFS Volume scans all files, claims, and informs the DSE what chunks are currently in use. This happens when chunks are first stored and then every time the ChunkStore grows by 10% or at a specified time period.
2. The DSE checks for data that has not been claimed by the file system and time stamps all data that has been claimed by the volume.
3. The chunks that have not been claimed by the volume are de-referenced and put into a pool for re-use as new data is written to the DSE.
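
As an illustrative sketch, the garbage collection schedule mentioned above could be changed on a mounted volume with a quartz/cron style expression; the expression below, which would run the claim process daily at 11 PM, is only an example:

sdfscli --set-gc-schedule="0 0 23 * * ?"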

In a multi-node cluster SDFS does all of the above in a distributed fashion, coordinated by an FSS that is promoted to GC Master. This master monitors the DSE size and initiates garbage collection based on the schedule and growth parameters listed above. The GC Master is determined based on whether it is the first FSS node in the cluster or one already exists. If the GC Master FSS goes down, another FSS node will take over the duties. The GC Master can be determined with the following command.

sdfscli.sh --cluster-get-gc-master

To make an FSS the GC Master run the following command on the node.

sdfscli.sh --cluster-make-gc-master

When Garbage Collection is initiated, the GC master determines if all volumes/FSS nodes are online. The SDFS cluster keeps track of all volumes that have been added or gone offline. This is done to ensure that claimed data is not expunged for volumes that are currently offline. If they are not online then the garbage collection will fail until all nodes are back online or the volume is manually removed from the cluster by executing :

sdfscli --cluster-volume-remove <volume-name>

It is very important that the cluster is always aware of offline volumes so their data is not removed during garbage collection. Offline volumes can be added to the cluster using the following command :

sdfscli --cluster-volume-add <volume-name>

The Dedup Storage Engine can be cleaned manually by running:

sdfscli --cleanstore=<minutes>

The size of the chunks.chk will not diminish but rather SDFS will re-allocate space already written to, but unclaimed.

All of this is configurable and can be changed after a volume is written to. 

TroubleShooting:

There are a few common errors with simple fixes.

1. OutOfMemoryError – This is caused by the Dedup Storage Engine’s memory requirements being larger than the heap size allocated for the JVM. To fix this, edit the mount.sdfs script and increase -Xmx2g to something larger (e.g. -Xmx3g).

2. java.io.IOException: Too Many Open Files – This is caused by not enough file handles being available for underlying filesystem processes. To fix this, add the following lines to /etc/security/limits.conf and then re-login or restart your system.

* soft nofile 65535
* hard nofile 65535