What is deduplication technology?

Deduplication (sometimes called Single-Instance Storage, Capacity Optimization or Factoring) is a data reduction technology that eliminates redundant (duplicate) data on a storage system by saving only one instance of each data item, which reduces both disk space and network bandwidth consumption. Deduplication technologies rely on an index that tracks the data in the repository and makes it possible to identify redundant data. The management software looks at the new data, compares it to the data that already exists on the system, and then stores only the data that doesn't match existing data.

For example, suppose that a company has 100 employees and each employee's mailbox holds around 1 GB. Most of the emails are the same: emails distributed and forwarded among company staff, or emails sent to several staff members from outside. That's 100 GB of disk space consumed to store basically the same information. Data deduplication ensures that only the unique data is saved to disk. Subsequent copies of the data are saved only as references that point to the stored copy, so end-users still see their own files in place.
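The mailbox example can be sketched in a few lines of Python. This is only an illustration of the single-instance idea, not any product's implementation: the class name `SingleInstanceStore`, its methods, and the use of SHA-256 as the index key are all assumptions made for the example.

```python
import hashlib


class SingleInstanceStore:
    """Toy store (hypothetical): each unique content is kept once."""

    def __init__(self):
        self.blobs = {}       # index: digest -> content, stored a single time
        self.references = {}  # (owner, name) -> digest of the stored copy

    def save(self, owner, name, content):
        digest = hashlib.sha256(content).hexdigest()
        # Only store the content if the index has not seen it before.
        self.blobs.setdefault(digest, content)
        self.references[(owner, name)] = digest

    def load(self, owner, name):
        # Each user still "sees" their own file through the reference.
        return self.blobs[self.references[(owner, name)]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


store = SingleInstanceStore()
memo = b"Quarterly results attached." * 1000

# The same message lands in 100 mailboxes.
for employee in range(100):
    store.save(f"user{employee}", "memo.eml", memo)

logical_bytes = 100 * len(memo)
print("logical size:", logical_bytes, "bytes")
print("stored size: ", store.stored_bytes(), "bytes")
```

Every mailbox keeps its own reference, but the message body occupies disk space only once.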

There are three kinds of deduplication technology:

  • File deduplication. Only one copy of each identical file is stored. This technology is also known as Single File Instance technology.
  • Block-level deduplication. The information is divided into blocks, and only one copy of each identical block is stored.
  • Byte-level deduplication. The content of the information to be deduplicated is analyzed at byte level, and only the unique data is stored. This is the only approach that guarantees full elimination of redundancy.

This means that different deduplication technologies offer different levels of granularity, removing redundant portions of files down to the block level or even to the byte level.
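To make the difference in granularity concrete, here is a minimal sketch that compares file-level and block-level deduplication on two files that differ by a single byte. The 4096-byte block size, the helper names, and the use of SHA-256 are assumptions chosen for the example; real products differ.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, for illustration only


def split_blocks(data, size=BLOCK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]


def stored_bytes_file_level(files):
    # One copy per identical file: any difference means a full second copy.
    unique = {hashlib.sha256(f).digest(): len(f) for f in files}
    return sum(unique.values())


def stored_bytes_block_level(files):
    # One copy per identical block, across all files.
    unique = {}
    for f in files:
        for block in split_blocks(f):
            unique[hashlib.sha256(block).digest()] = len(block)
    return sum(unique.values())


# Ten distinct blocks: block i is filled with byte value i.
original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))

# A second version of the file with one byte changed (inside block 1).
edited = bytearray(original)
edited[5000] ^= 0xFF
files = [original, bytes(edited)]

file_level = stored_bytes_file_level(files)
block_level = stored_bytes_block_level(files)
print("file-level stores: ", file_level, "bytes")
print("block-level stores:", block_level, "bytes")
```

File-level deduplication sees two different files and keeps both in full, while block-level deduplication keeps the ten original blocks plus the one block that changed.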

When evaluating a deduplication product, it's important to understand the granularity offered by the platform.

[ Please, click here to download the Lortu Deduplication White Paper ]

Benefits of deduplication technology.

By not storing duplicate pieces of data, deduplication can yield potentially huge savings in disk space. For instance, byte-level deduplication technologies can reduce the total amount of stored data by a ratio of 50:1 or more, depending on the environment. In other words, if you are keeping a terabyte of disk backups today, that number could drop to 20 GB. The 980 GB of storage that is freed up means you can defer additional storage purchases, possibly for years.
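The arithmetic behind these figures is simple enough to check directly. The function name below is hypothetical, and the calculation assumes decimal units (1 TB = 1000 GB), as the 20 GB figure implies.

```python
def stored_after_dedup(logical_gb, ratio):
    """Physical space needed for `logical_gb` of data at a ratio of `ratio`:1."""
    return logical_gb / ratio


logical_gb = 1000  # 1 TB of disk backups, in decimal gigabytes
ratio = 50         # the 50:1 ratio quoted in the text

stored_gb = stored_after_dedup(logical_gb, ratio)
freed_gb = logical_gb - stored_gb
savings_pct = 100 * freed_gb / logical_gb

print(f"stored: {stored_gb:.0f} GB, freed: {freed_gb:.0f} GB "
      f"({savings_pct:.0f}% saved)")
```

Note that the ratio and the percentage saved are not linear: a 50:1 ratio corresponds to a 98% saving, so going from 10:1 to 50:1 only moves the saving from 90% to 98%.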

Freeing up storage capacity also lets you keep more data online, and because the deduplicated data is so much smaller, it can be sent over a secure WAN to remote sites for replication or disaster recovery purposes.

How does deduplication differ from other similar technologies?

Data deduplication differs from compression in that compression looks only for repeating patterns within the data it is given and encodes them more compactly. For example, a compressed file cannot be compressed much further because its contents already have high entropy. Data deduplication, by contrast, works regardless of the data's internal format: it compares the content of the file with previously stored versions and extracts only the new, unique data. This can provide a much greater data reduction than compression. In fact, most products apply compression algorithms after deduplicating the data to achieve an even higher data reduction.
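A small experiment illustrates the point. The sketch below builds a high-entropy 100 KiB block (standing in for an already compressed file), stores it twice, and compares `zlib` compression with a simple block-hash deduplication. The block size and the data are assumptions for the demonstration. zlib's DEFLATE format can only refer back 32 KiB, so it cannot see that the second copy repeats the first.

```python
import hashlib
import random
import zlib

BLOCK_SIZE = 4096  # assumed block size for the toy deduplicator

# Deterministic pseudo-random bytes: high entropy, like compressed data.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(25 * BLOCK_SIZE))

# The same 100 KiB payload appears twice, e.g. in two successive backups.
data = payload + payload

# Compression finds no patterns inside the payload, and the duplicate is
# too far away (more than 32 KiB) for DEFLATE to reference.
compressed_size = len(zlib.compress(data, 9))

# Deduplication only asks "have I stored this block before?".
unique_blocks = {}
for offset in range(0, len(data), BLOCK_SIZE):
    block = data[offset:offset + BLOCK_SIZE]
    unique_blocks[hashlib.sha256(block).digest()] = block
deduplicated_size = sum(len(b) for b in unique_blocks.values())

print("original:    ", len(data))
print("compressed:  ", compressed_size)
print("deduplicated:", deduplicated_size)
```

Compression leaves the data essentially the same size, while deduplication halves it, independently of how compressible the content is.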

Deduplication also differs from incremental or differential backups in that it backs up only the byte-level changes. Incremental backups scan selected files for changes. If a file has changed, even by a single bit, the whole file is saved in the newest backup: if that file is 500 MB, all 500 MB are written again. Data deduplication stores only the pieces of data that have changed, not the entire file.
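The single-bit case can be simulated at a smaller scale. In this sketch a 512 KiB file stands in for the 500 MB file, and deduplication is approximated with fixed 4096-byte blocks; both function names, the file name, and the block size are assumptions for the example, and a byte-level product would store even less than one block.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed granularity of the toy deduplicator


def incremental_backup_size(previous, current):
    """File-granular backup: every changed file is copied in full."""
    return sum(len(data) for name, data in current.items()
               if previous.get(name) != data)


def dedup_backup_size(previous, current):
    """Only blocks not already present in the previous backup are stored."""
    known = {hashlib.sha256(data[i:i + BLOCK_SIZE]).digest()
             for data in previous.values()
             for i in range(0, len(data), BLOCK_SIZE)}
    new_bytes = 0
    for data in current.values():
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).digest()
            if digest not in known:
                known.add(digest)
                new_bytes += len(block)
    return new_bytes


# 128 distinct blocks (512 KiB), a scaled-down stand-in for a 500 MB file.
big_file = b"".join(i.to_bytes(4, "big") * (BLOCK_SIZE // 4) for i in range(128))
monday = {"db.bak": big_file}

# Tuesday: the same file with a single bit flipped.
changed = bytearray(big_file)
changed[100_000] ^= 0x01
tuesday = {"db.bak": bytes(changed)}

incremental = incremental_backup_size(monday, tuesday)
deduplicated = dedup_backup_size(monday, tuesday)
print("incremental backup writes:", incremental, "bytes")
print("deduplicated backup writes:", deduplicated, "bytes")
```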


How does Lortu deduplication technology differ from other deduplication technologies?

There are several approaches to implement deduplication, and even though each approach has its own pros and cons, some are much better than others.

The main differences between the approaches:

Post-process deduplication vs. in-line deduplication:

The main advantage of post-process deduplication over in-line deduplication is higher backup throughput and a shorter backup window. This is because the information is first stored in the appliance as-is and deduplicated later, without interfering with the backup process.
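The two-phase flow can be sketched as follows. This is a hypothetical model, not Lortu's implementation: the class `PostProcessAppliance`, its method names, and the block-hash deduplication used in the second phase are all assumptions chosen to keep the example short.

```python
import hashlib


class PostProcessAppliance:
    """Toy model (hypothetical): fast ingest now, deduplication later."""

    def __init__(self):
        self.landing = []   # raw backups waiting to be deduplicated
        self.blocks = {}    # digest -> unique block
        self.recipes = []   # per backup: ordered list of block digests

    def ingest(self, stream):
        # Backup window: just write the data, no hashing on this path.
        self.landing.append(stream)

    def deduplicate(self, block_size=4096):
        # Runs after the backup has finished, outside the backup window.
        while self.landing:
            stream = self.landing.pop(0)
            recipe = []
            for i in range(0, len(stream), block_size):
                block = stream[i:i + block_size]
                digest = hashlib.sha256(block).digest()
                self.blocks.setdefault(digest, block)
                recipe.append(digest)
            self.recipes.append(recipe)

    def restore(self, index):
        return b"".join(self.blocks[d] for d in self.recipes[index])

    def stored_bytes(self):
        raw = sum(len(s) for s in self.landing)
        return raw + sum(len(b) for b in self.blocks.values())


appliance = PostProcessAppliance()
backup = b"".join(bytes([i]) * 4096 for i in range(8))

appliance.ingest(backup)   # night 1
appliance.ingest(backup)   # night 2, identical data
size_before = appliance.stored_bytes()

appliance.deduplicate()    # later, off the backup path
size_after = appliance.stored_bytes()
print("before:", size_before, "after:", size_after)
```

The sketch also shows the usual trade-off of this design: until the second phase runs, the appliance needs enough capacity to hold the backups in their original, non-deduplicated form.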

Lortu provides post-process deduplication.

Byte-level differencing vs. pattern matching (storing a hash for each pattern or block):

Pattern matching scales less well than differencing as the data to be deduplicated grows, because the hash table consumes more memory and CPU as it has to manage more data. However, its greatest drawback is the restore time.

If backup time is critical, restore time is even more so. Since the patterns are spread over the full disk in very small blocks of information, the system has to read one or two clusters for each small pattern. This means that a restore can be more than 10 times slower than copying the non-deduplicated information. With byte-level differencing, the information is stored in much larger blocks, and the restore time is usually very close to that of copying the non-deduplicated information.

Also, pattern matching technology requires several weeks before the deduplication process becomes effective. With byte-level differencing, deduplication is very effective from the second backup onward, and effectiveness improves as new files are included in the vault.
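To give a feel for what differencing means, here is a minimal sketch that computes a delta between two versions of a file and rebuilds the new version from the old one. It uses Python's standard `difflib.SequenceMatcher`, which is far too slow for real backups and is not the algorithm any product uses; the function names and the delta format are assumptions for the example.

```python
from difflib import SequenceMatcher


def make_delta(base, new):
    """Describe `new` as copies from `base` plus literal new bytes."""
    ops = []
    matcher = SequenceMatcher(None, base, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))       # reuse bytes already stored
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))   # only these bytes are new
        # "delete": nothing from `base` is needed, so nothing is emitted
    return ops


def apply_delta(base, ops):
    """Rebuild the new version from the base version and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


# First backup: 100 small records. Second backup: 8 bytes inserted mid-file.
first_backup = b"".join(b"record %04d\n" % i for i in range(100))
second_backup = first_backup[:600] + b"INSERTED" + first_backup[600:]

delta = make_delta(first_backup, second_backup)
new_bytes = sum(len(op[1]) for op in delta if op[0] == "data")

print("second backup size:", len(second_backup))
print("new bytes stored:  ", new_bytes)
```

Because the delta refers to long contiguous ranges of the previous version, the second backup is already stored very compactly, and an insertion that shifts all later data does not defeat the comparison the way it would with fixed-size blocks.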

Lortu provides byte-level differencing deduplication.

Data agnostic vs. content-aware approach:

Data agnostic technologies work with any kind of information or file format. The drawback of the content-aware approach is that the technology needs to understand the format of the files. If the file format is different than expected (a new version of the application for instance), or if the application isn't supported by the technology, the deduplication process is not possible.

Lortu deduplication technology is agnostic to the data. It can deduplicate data of any kind, file format or file type.
