Managing and Mounting SDFS Volumes
Mounting SDFS Volumes as NFS Shares
Managing SDFS Volumes for Virtual Machines
Managing SDFS Volumes through Extended Attributes
File-system Snapshots for SDFS Volumes
Dedup Storage Engine
Cloud Based Deduplication
Dedup Storage Engine Memory
File and Folder Placement
Other options and Extended Attributes
This is intended to be a detailed guide for the SDFS file-system. For most purposes, the Quickstart Guide will get you going but if you are interested in advanced topics, this is the place to look.
SDFS is a filesystem designed to provide inline deduplication and flexibility for applications. Services such as backup, archiving, NAS storage, and Virtual Machine primary and secondary storage can benefit greatly from SDFS.
According to wikipedia, "Data deduplication is a specific form of compression where redundant data is eliminated, typically to improve storage utilization. In the deduplication process, duplicate data is deleted, leaving only one copy of the data to be stored. However, indexing of all data is still retained should that data ever be required. Deduplication is able to reduce the required storage capacity since only the unique data is stored. For example, a typical email system might contain 100 instances of the same one megabyte (MB) file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB storage space. With data deduplication, only one instance of the attachment is actually stored; each subsequent instance is just referenced back to the one saved copy. In this example, a 100 MB storage demand could be reduced to only 1 MB. Different applications have different levels of data redundancy. Backup applications generally benefit the most from de-duplication due to the nature of repeated full backups of an existing file system."
Virtual Machines can also benefit greatly from deduplication as most operating system specific binaries are similar across different guest instances. In many cases deduplication with SDFS can provide between 80% and 90% storage reduction.
SDFS consists of 4 basic components:
SDFS Volume
SDFS file-system service
Deduplication Storage Engine (DSE)
Data Chunks
The SDFS Volume is a mounted file-system presented to the operating system. This is the primary way applications and services interact with SDFS. SDFS Volumes can be shared through SAMBA or NFS.
The SDFS file-system service provides a typical POSIX-compliant view of deduplicated files and folders within volumes. The SDFS file-system service stores meta-data regarding files and folders. This meta-data includes information such as file size, file path, and most other aspects of files and folders other than the actual file data. Each SDFS Volume manages its own SDFS file-system service. In addition to meta-data, the SDFS file-system service also manages the file maps that map file data locations to deduplicated and non-deduplicated chunks. The chunks themselves live either within the SDFS file-system service or within the Deduplication Storage Engine, depending on the deduplication process (see Data Chunks below).
The Deduplication Storage Engine (DSE) stores, retrieves, and removes all deduplicated chunks. Chunks of data are stored on disk and indexed for retrieval with a custom-written in-memory hashtable. The Deduplication Storage Engine can be run as part of an SDFS Volume, which is the default, or as a network service.
Data Chunks are the unit by which raw data is processed and stored within SDFS. Chunks of data are stored either in the Deduplication Storage Engine or in the SDFS file-system service depending on the deduplication process (see below). Chunks are hashed using the Tiger hash computation.
SDFS Volumes are created with the command mkfs.sdfs. There are many options for creating a file system; to see them, run "./mkfs.sdfs --help" or take a look at them here.
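For example, a minimal sketch of creating a 4TB volume (the volume name "pool0" and capacity are illustrative; adjust them for your environment):
sudo ./mkfs.sdfs --volume-name=pool0 --volume-capacity=4TB
This writes the volume configuration to /etc/sdfs/pool0-volume-cfg.xml, following the naming convention described below.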
By default volumes store all data in the folder structure /opt/sdfs/<volume-name>. This may not be optimal and can be changed before a volume is mounted for the first time. In addition, volume configurations are held in the /etc/sdfs folder. Each volume configuration is created when the mkfs.sdfs command is run and stored as an XML file named <volume-name>-volume-cfg.xml. A typical volume configuration is as follows:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<subsystem-config version="1.1.0">
  <locations dedup-db-store="/opt/sdfs/volumes/pool0/ddb" io-log="/opt/sdfs/volumes/pool0/ioperf.log"/>
  <io chunk-size="4" claim-hash-schedule="0 0 0/2 * * ?" dedup-files="true" file-read-cache="5" hash-size="16" log-level="1" max-file-inactive="900" max-file-write-buffers="1" max-open-files="1024" meta-file-cache="1024" multi-read-timeout="1000" safe-close="false" safe-sync="false" system-read-cache="1000" write-threads="18"/>
  <permissions default-file="0644" default-folder="0755" default-group="0" default-owner="0"/>
  <volume capacity="4TB" closed-gracefully="true" current-size="0" maximum-percentage-full="-1.0" path="/opt/sdfs/volumes/pool0/files"/>
  <launch-params class-path="/usr/share/sdfs/lib/truezip-samples-7.3.2-jar-with-dependencies.jar:/usr/share/sdfs/lib/commons-collections-3.2.1.jar:/usr/share/sdfs/lib/sdfs.jar:/usr/share/sdfs/lib/jacksum.jar:/usr/share/sdfs/lib/slf4j-log4j12-1.5.10.jar:/usr/share/sdfs/lib/slf4j-api-1.5.10.jar:/usr/share/sdfs/lib/simple-4.1.21.jar:/usr/share/sdfs/lib/commons-io-1.4.jar:/usr/share/sdfs/lib/clhm-release-1.0-lru.jar:/usr/share/sdfs/lib/trove-3.0.0a3.jar:/usr/share/sdfs/lib/quartz-1.8.3.jar:/usr/share/sdfs/lib/log4j-1.2.15.jar:/usr/share/sdfs/lib/bcprov-jdk16-143.jar:/usr/share/sdfs/lib/commons-codec-1.3.jar:/usr/share/sdfs/lib/commons-httpclient-3.1.jar:/usr/share/sdfs/lib/commons-logging-1.1.1.jar:/usr/share/sdfs/lib/java-xmlbuilder-1.jar:/usr/share/sdfs/lib/jets3t-0.7.4.jar:/usr/share/sdfs/lib/commons-cli-1.2.jar" java-options="-Djava.library.path=/usr/share/sdfs/bin/ -Dorg.apache.commons.logging.Log=fuse.logging.FuseLog -Dfuse.logging.level=INFO -server -XX:+UseG1GC -Xmx3000m -Xmn400m" java-path="/usr/share/sdfs/jre1.7.0/bin/java"/>
  <sdfscli enable="true" enable-auth="true" listen-address="0.0.0.0" password="cbe709382e6ee7cf22116ffb1cb645df0be48b52bb378ddd895d9a0d561714b0" port="6442" salt="o3iVPw"/>
  <local-chunkstore allocation-size="214748364800" chunk-gc-schedule="0 0 0/4 * * ?" chunk-store="/opt/sdfs/volumes/pool0/chunkstore/chunks" chunk-store-dirty-timeout="1000" chunk-store-read-cache="5" chunkstore-class="org.opendedup.sdfs.filestore.FileChunkStore" enabled="true" encrypt="false" encryption-key="2CUKKdfav6PM28p71hX-r7sG@hutaV5bDFX" eviction-age="6" gc-class="org.opendedup.sdfs.filestore.gc.PFullGC" hash-db-store="/opt/sdfs/volumes/pool0/chunkstore/hdb" pre-allocate="false" read-ahead-pages="8">
    <network enable="true" hostname="0.0.0.0" port="2222" upstream-enabled="false" upstream-host="" upstream-host-port="2222" upstream-password="admin" use-udp="false"/>
  </local-chunkstore>
</subsystem-config>
SDFS Volumes are mounted with the mount.sdfs command. Mounting a volume is typically executed by running "mount.sdfs -v <volume-name> -m <mount-point>". As an example, "mount.sdfs -v sdfs -m /media/dedup" will mount the volume as configured by /etc/sdfs/sdfs-volume-cfg.xml to the path /media/dedup. Volume mounting options are as follows:
-m <arg> mount point for SDFS file system
-o <arg> fuse mount options.
Will default to:
-r <arg> path to Dedup Storage Engine routing file.
Will default to:
-v <arg> sdfs volume to mount
-vc <arg> sdfs volume configuration file to mount
Volumes are unmounted automatically when the mount.sdfs process is killed or the volume is unmounted using the umount command.
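For example, to mount and later unmount a volume named "pool0" (the name and mount point are illustrative):
mount.sdfs -v pool0 -m /media/pool0
umount /media/pool0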
SDFS can be shared through NFS exports on Linux kernel 2.6.31 and above. It can be shared on kernel levels below that, but performance will suffer and you will need to disable the fuse direct_io option when mounting the SDFS filesystem. NFS opens and closes files with every read or write. File opens and closes are expensive for SDFS and as such can degrade performance when running over NFS. SDFS volumes can be optimized for NFS with the option "--io-safe-close=false" when creating the volume. This will leave files open for NFS reads and writes. File data will still be synced with every write command, so data integrity will still be maintained. Files will be closed after an inactivity period has been reached. By default this inactivity period is 15 minutes (900 seconds) but can be changed at any time, along with the io-safe-close option, within the XML configuration file located in /etc/sdfs/<volume-name>-volume-cfg.xml.
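A minimal sketch of an NFS-friendly setup, assuming a volume named "pool0" mounted at /media/pool0 (the export options are standard NFS settings, not SDFS-specific; an fsid is generally required when exporting a FUSE file system):
sudo ./mkfs.sdfs --volume-name=pool0 --volume-capacity=4TB --io-safe-close=false
mount.sdfs -v pool0 -m /media/pool0
Then add an entry such as the following to /etc/exports and re-export the shares:
/media/pool0 *(rw,sync,fsid=77,no_subtree_check)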
It was the original goal of SDFS to be a file system for virtual machines. Again, to get proper deduplication rates for VMDK files, set io-chunk-size to "4" when creating the volume. This will usually match the cluster size of the guest OS file system. NTFS allows 32k cluster sizes, but not on root volumes. It may be advantageous, for Windows guest environments, to have the root volume on one mounted SDFS path with a 4k chunk size and data volumes on another SDFS path with a 32k chunk size. Then format the NTFS data volumes, within the guest, with a 32k cluster size. This will provide optimal performance.
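A sketch of the two-volume layout described above (volume names and capacities are illustrative; the chunk-size form follows the example later in this guide):
sudo ./mkfs.sdfs --volume-name=vm_root --volume-capacity=500GB --io-chunk-size=4k
sudo ./mkfs.sdfs --volume-name=vm_data --volume-capacity=2TB --io-chunk-size=32k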
This functionality has been deprecated in favor of the sdfscli command line. SDFS provides IO metrics and SDFS-specific file/folder management through extended file attributes. Extended file attributes can be accessed through the setfattr and getfattr commands available on most Linux distributions. The command line for getting extended attributes for a specific file or folder is "getfattr -d <file-or-folder-name>". An example of reading a single attribute follows the list of metrics below.
Metrics attributes are as follows:
user.sdfs.ActualBytesWritten - The actual number of bytes written to the SDFS volume for this specific file. This does not include dedup chunks that were not written because they were already there.
user.sdfs.BytesRead - The actual number of bytes read for the file.
user.sdfs.DuplicateData - The amount of data within the file that is duplicate
user.sdfs.UniqueData - The amount of data within the file that is unique
user.sdfs.VMDK - Whether the file is a VMDK
user.sdfs.VirtualBytesWritten - The amount of data that would have been written to the volume if it were not deduplicated. This includes dedup and non-dedup data.
user.sdfs.dedupAll - Whether all data will be deduped or not, true means all data will be deduped. See Inline vs Batch mode for more details.
user.sdfs.dfGUID - The GUID for the dedup file map. This can be used to determine the location of the map file on disk
user.sdfs.file.isopen - True if the file is open
user.sdfs.fileGUID - The meta data file GUID.
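For example, to read a single metric for a specific file (the file path here is illustrative):
getfattr -n user.sdfs.UniqueData /media/pool0/backup.tar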
The commands to manage SDFS volumes through setting extended attributes are as follows:
user.cmd.cleanstore - cleans the DSE, if local, of data older than the defined minutes e.g. setfattr -n user.cmd.cleanstore -v 5555:<minutes> <mount point> (See Data Chunks for more detail)
user.cmd.dedupAll - sets the file to dedup all chunks or not. Set to true if you would like to dedup all chunks <unique-command-id:true or false>
user.cmd.file.flush - Flush write cache for specified file <unique-command-id>
user.cmd.flush.all - Flush write cache for all files
user.cmd.ids.clearstatus - clear all command id status
user.cmd.ids.status - get the status of a specific command e.g. to get the status of command id 54333 run getfattr -n user.cmd.ids.status.54333
user.cmd.nextid - returns a random GUID
user.cmd.optimize - optimize the file by specifying a specific length <unique-command-id:length-in-bytes>
user.cmd.snapshot - Take a Snapshot of a File or Folder <unique-command-id:snapshotdst> (See File-System Snapshots)
user.cmd.vmdk.make - Creates a simple flat VMDK in this directory <unique-command-id:vmdkname:size(TB|GB|MB)>. The command must be executed on a directory. e.g. setfattr -n user.cmd.vmdk.make -v 5556:bigvserver:500GB /dir
Setting extended attributes takes the following command line structure:
setfattr -n <attribute> -v <attribute-value> <file or folder>
setfattr -n user.cmd.vmdk.make -v 5556:win2k8-0:40GB dedup/vmfs/
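The status of the command can then be checked using the command id supplied above (5556), as described under user.cmd.ids.status; the attribute is queried against the same directory here:
getfattr -n user.cmd.ids.status.5556 dedup/vmfs/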
Online SDFS management is done through sdfscli. This is a command line executable that allows access to management functions and information about a particular SDFS volume. The volume in question must be mounted when the command line is executed. Below are the command line parameters that can be run. Help is also available for the command line when run as sdfscli --help. The volume itself will listen on a TCP port. If multiple volumes are mounted you will want to use the port option when running the command. The TCP port can be determined by running "df -h"; the port will be designated after the ":" within the device name.
usage: sdfs.cmd <options>
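For example, a sketch of querying file information on a volume listening on port 6442 (the --port option name is an assumption here; check sdfscli --help for the exact flag in your version):
sdfscli --file-info --file-path=backup.tar --port=6442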
SDFS provides snapshot functions for files and folders. To snapshot a file or folder through extended attributes you will need the setfattr command available on your system; on Ubuntu this is available through the attr package. The snapshot command is "sdfscli --snapshot --snapshot-path=<relative-target-path> --file-path=<relative-source-path>". The destination path is relative to the mount point of the SDFS filesystem.
As an example, to snapshot a file "/mounted-sdfs-volume/source.bin" to /mounted-sdfs-volume/folder/target.bin you would run the following command:
sdfscli --snapshot --snapshot-path=folder/target.bin --file-path=source.bin
SDFS Volumes grow over time as more unique data is stored. As unique data is deleted from the volume, the space used by the deleted data is reused over time. Typically this allows for efficient use of storage, but if data is aggressively added and deleted a volume can end up with a lot of empty space where deleted unique blocks used to reside.
If a volume has overallocated the space used on disk, it can be compacted offline. The compacting process will require enough space to hold both the current DSE and the new "compact" DSE. The process for compacting is as follows (a consolidated command sketch follows these steps):
- Unmount the volume
- Remount the volume with the -c option. e.g. mount.sdfs -v pool0 -c -m /media/pool0. The compact process first claims all unique data used by files and then creates a new DSE with only the data used by the files currently on the volume. Once finished the volume will be unmounted.
- Remount the volume as normal. The first mount after a compact will run through an integrity check to make sure all data is written and available within the DSE
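A consolidated sketch of the compaction sequence for a volume named "pool0" mounted at /media/pool0 (names are illustrative; the compact mount unmounts itself when it finishes):
umount /media/pool0
mount.sdfs -v pool0 -c -m /media/pool0
mount.sdfs -v pool0 -m /media/pool0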
The Dedup Storage Engine (DSE) provides services to store, retrieve, and remove deduplicated chunks of data. By default, each SDFS Volume contains and manages its own DSE. When a DSE runs as part of a volume it is referred to as running in local mode. A DSE can also be configured to run as a network service and support many volumes across multiple physical servers. When a DSE is running as a network service it is referred to as running in network mode.
A DSE running in local mode is configured automatically when a volume is created and works well in most environments. In local mode, all settings can be changed after volume creation. Keep in mind that if directory locations are changed, data will need to be migrated to the new directories as well.
The advantage of running the DSE in network mode is that it allows for global deduplication and better scalability. The disadvantage is the complexity of configuration.
Running a DSE in network mode is done through the startDSE.sh script. It requires the path to the DSE configuration file. A sample DSE configuration file is located at "etc/hashserver-config.xml" in the directory of the SDFS package. The parameters are almost identical to those in a local volume configuration and look as follows:
<network port="2222" hostname="0.0.0.0" use-udp="false"/>
<locations chunk-store="/opt/dchunks/chunkstore/chunks" hash-db-store="/opt/ddb/hdb"/>
<chunk-store pre-allocate="false" chunk-gc-schedule="0 0 0/2 * * ?" eviction-age="4" allocation-size="161061273600" page-size="4096" read-ahead-pages="8"/>
To run the DSE, execute the startDSE.sh script with the path to the DSE configuration file. Then, to configure an SDFS volume to use the network DSE:
- Make sure the chunk sizes are exactly the same for the volume and the DSE. The DSE config determines chunk size with the attribute "page-size", in bytes.
- Edit the volume configuration file and change the attribute enabled="true" to enabled="false" within the "local-chunkstore" element.
- Edit the server element within routing config file to point at the DSE. A sample routing config file is located at etc/routing-config.xml
- Mount the volume with the "-r" tag, e.g. ./mount.sdfs -r etc/routing-config.xml -v sdfs -m /media/dedup
The DSE can be configured to store data to the Amazon S3 cloud storage service. When enabled, all unique blocks will be stored to an S3 bucket of your choosing. Data can be encrypted before transit and at rest within S3 using AES 256-bit encryption. In addition, all data is compressed by default before being sent to the cloud.
The purpose of deduplicating data before sending it to cloud storage is to minimize storage and maximize write performance. The concept behind deduplication is to only store unique blocks of data. If only unique data is sent to cloud storage, bandwidth can be optimized and cloud storage can be reduced. Opendedup approaches cloud storage differently than a traditional cloud-based file system. The volume data, such as the name space and file meta-data, is stored locally on the system where the SDFS volume is mounted. Only the unique chunks of data are stored at the cloud storage provider. This ensures maximum performance by allowing all file system functions to be performed locally except for data reads and writes. In addition, local read and write caching should make writing smaller files transparent to the user or service writing to the volume.
Cloud-based storage has been enabled for the Amazon S3 web service. To create a volume using cloud storage take a look at the cloud storage guide here.
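A minimal sketch of creating a cloud-backed volume is shown below. The flag names used here (--aws-enabled, --aws-access-key, --aws-secret-key, --aws-bucket-name) are assumptions and may differ by version; consult mkfs.sdfs --help or the cloud storage guide for the exact options:
sudo ./mkfs.sdfs --volume-name=cloud_vol --volume-capacity=1TB --aws-enabled=true --aws-access-key=<access-key> --aws-secret-key=<secret-key> --aws-bucket-name=<bucket-name>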
The SDFS filesystem itself uses about 2GB of RAM for internal processing and caching. For hash table caching and chunk storage, kernel memory is used. It is advisable to have enough memory to store the entire hashtable so that SDFS does not have to scan swap space or the file system to look up hashes. For volumes that require high IOPS, expand the memory by editing the "-Xmx2g" setting within mount.sdfs or the startDSE.sh script to something better suited to your environment. For most environments -Xmx3g is sufficient.
To calculate memory requirements keep in mind that each stored chunk takes up approximately 25 bytes of RAM. To calculate how much RAM you will need for a specific volume divide the volume size (in bytes) by the chunk size (in bytes) and multiply that times 25.
Memory Requirements Calculation:
(volume size in bytes / chunk size in bytes) * 25
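For example, a 1TB volume with a 4k chunk size would require roughly:
(1,099,511,627,776 / 4,096) * 25 = 6,710,886,400 bytes, or about 6.25GB of RAM
The same volume with the default 128k chunk size would require roughly 200MB.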
The chunk size must match for both the SDFS Volume and the Deduplication Storage Engine. The default for SDFS is to store chunks at 128k size. This size provides optimal memory performance and IO throughput. The chunk size must be set at volume and Deduplication Storage Engine creation. When Volumes are created with their own local Deduplication Storage Engine chunk sizes are matched up automatically, but, when the Deduplication Storage Engine is run as a network service this must be set before the data is stored within the engine.
Within an SDFS volume the chunk size is set upon creation with the option --io-chunk-size. The option --io-chunk-size sets the size of chunks that are hashed and can only be changed before the file system is mounted for the first time. The default setting is 128k but it can be set as low as 4k. The chunk size determines how efficiently files will be deduplicated, at the cost of RAM. As an example, a 4k chunk size is perfect for Virtual Machines (VMDKs) because it matches the cluster size of most guest OS file systems, but can cost as much as 8GB of RAM per 1TB of stored data. In contrast, setting the chunk size to 128k is perfect for archived, unstructured data, such as rsync backups, and will allow you to store as much as 32TB of data with the same 8GB of memory.
To create a volume that will store VMs (VMDK files) create a volume using 4k chunk size as follows:
sudo ./mkfs.sdfs --volume-name=sdfs_vol1 --volume-capacity=150GB --io-chunk-size=4k --io-max-file-write-buffers=32
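In contrast, a sketch of a volume intended for archived or backup data, which simply keeps the default 128k chunk size (the volume name and capacity are illustrative):
sudo ./mkfs.sdfs --volume-name=sdfs_backup --volume-capacity=2TB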
As stated, when running SDFS Volumes with a local DSE, chunk sizes are matched automatically, but if running the DSE as a network service, then a parameter within the DSE configuration XML file will need to be set before any data is stored. The parameter is:
page-size="<chunk-size in bytes>".
As an example, to set a 4k chunk size the option would need to be set to page-size="4096".
Deduplication is IO intensive. SDFS, by default, writes data to /opt/sdfs. SDFS does a lot of writes when persisting data and a lot of random IO when reading data. For highly IO-intensive applications it is suggested that you split at least the chunk-store-data-location and chunk-store-hashdb-location onto fast and separate physical disks. From experience these are the most IO-intensive stores and benefit the most from faster IO.
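For example, a sketch that places the chunk data and the hash database on separate disks, assuming the corresponding mkfs.sdfs options take the form shown (check mkfs.sdfs --help for the exact names in your version):
sudo ./mkfs.sdfs --volume-name=pool0 --volume-capacity=4TB --chunk-store-data-location=/mnt/fastdisk1/chunks --chunk-store-hashdb-location=/mnt/fastdisk2/hdb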
SDFS uses extended attributes to manipulate the SDFS file system and files contained within. It is also used to report on IO performance. To get a list of commands and readable IO statistics run "getfattr -d *" within the mount point of the sdfs file system.
sdfscli --file-info --file-path=<relative-path to file or folder>
SDFS now provides asynchronous master/slave volume and subvolume replication through the sdfsreplicate service and script. SDFS volume replication takes a snapshot of the designated master volume or subfolder and then replicates meta-data and unique blocks to the secondary, or slave, SDFS volume. Only unique blocks that are not already stored on the slave volume are replicated, so data transfer should be minimal. The benefits of SDFS Replication are:
The steps SDFS uses to perform asynchronous replication are the following:
The steps required to setup master/slave replication are the following:
#Number of copies of the replicated folder to keep. This will use First In First Out.
#the password of the master. This defaults to "admin"
#The sdfscli port on the master server. This defaults to 6442
#The sdfs master DSE port. This defaults to 2222
#The folder within the volume that should be replicated. If you would like to replicate the entire volume use "/"
#Replication slave settings
#The local IP address that the sdfscli is listening on for the slave volume.
#The tcp port the sdfscli is listening on for the slave
#The folder where you would like to replicate to. Wild cards are %d (date as yyMMddHHmmss) and %h (remote host)
#Replication service settings
#The folder where the SDFS master snapshot will be downloaded to on the slave. The snapshot tar archive is deleted after import.
#The log file that will output replication status
#Schedule cron = as a cron job, single = run one time
#Every 30 minutes. Take a look at http://www.quartz-scheduler.org/documentation/quartz-2.x/tutorials/tutorial-lesson-06 for a scheduling tutorial
#The folder where job history will be persisted. This defaults to a folder called "replhistory" under the same directory where this file is located.
3. Run the sdfsreplicate script on the slave. This will either run once and exit if schedule.type=single or will run continuously with schedule.type=cron
e.g. ./sdfsreplicate /etc/sdfs/replication.props
SDFS uses two methods to remove unused data from a Dedup Storage Engine (DSE). If the SDFS volume has its own dedup storage engine, which it does by default, unused, or orphaned, chunks are removed as the size of the DSE increases in 10% increments and at a specified schedule (defaults to midnight). The specified schedule can be configured at creation with the io-claim-chunks-schedule option. Otherwise it can be configured afterwards within the XML configuration by changing claim-hash-schedule. Below details the process for garbage collection:
If the DSE is decoupled from the SDFS volume, a batch process removes unused blocks of hashed data. This process is used because the file-system is decoupled from the back-end storage (Dedup Storage Engine) where the actual data is held. As hashed data becomes stale, it is removed from the Dedup Storage Engine. The process for determining and removing stale chunks is as follows.
- SDFS file-system informs the Dedup Storage Engine what chunks are currently in use. This happens when chunks are first created and then every 2 hours on the hour after that.
- The DSE checks for data that has not been claimed in the last 8 hours upon mount and then every 4 hours after that.
- The chunks that have not been claimed in the last 10 hours upon mount and 6 hours after that are put into a pool and overwritten as new data is written to the Dedup Storage Engine.
The Dedup Storage Engine can be cleaned manually by running the user.cmd.cleanstore extended attribute command described above, e.g. setfattr -n user.cmd.cleanstore -v 5555:<minutes> <mount point>.
The size of the chunks.chk file will not diminish; rather, SDFS will re-allocate space that was already written to but is now unclaimed.
As stated above, the volume claims chunks every two hours when the DSE is decoupled from the SDFS Volume. This can be configured to happen more or less frequently by editing the SDFS configuration file and modifying the "claim-hash-schedule" attribute. This should always occur more frequently than the "eviction-age" attribute set for the DSE ("chunk-store" tag).
The DSE claim schedule can be modified through the "chunk-gc-schedule" attribute. Again, this should occur more frequently than the "eviction-age" attribute set for the DSE ("chunk-store" tag).
Finally, the "eviction-age" is set in hours and by default it is 6. This can be changed but should be greater than the "claim-hash-schedule" and "chunk-gc-schedule" intervals.
All of this is configurable and can be changed after a volume is written to. Take a look at cron format for more details.
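For example, these attributes take Quartz cron expressions; the values below (claim every 2 hours, DSE garbage collection every 4 hours, eviction after 6 hours) match the sample configurations shown earlier in this guide:
claim-hash-schedule="0 0 0/2 * * ?"
chunk-gc-schedule="0 0 0/4 * * ?"
eviction-age="6"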
There are a few common errors with simple fixes.
1. OutOfMemoryError - This is caused by the Dedup Storage Engine memory requirements being larger than the heap size allocated for the JVM. To fix this, edit the mount.sdfs script and increase the -Xmx2g setting to something larger (e.g. -Xmx3g).
2. java.io.IOException: Too Many Open Files - This is caused by there not being enough available file handles for underlying filesystem processes. To fix this, add the following lines to /etc/security/limits.conf and then re-login or restart your system.
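The original lines are not reproduced here; a typical setting that raises the open-file limit looks like the following (the limit value is illustrative; adjust it for your environment):
* soft nofile 65535
* hard nofile 65535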