Boafft: Distributed Deduplication for Big Data Storage in the Cloud

Shengmei Luo, Guangyan Zhang, Chengwen Wu, Samee U. Khan, Keqin Li

Research output: Contribution to journal › Article › peer-review

8 Citations (Scopus)


As data volumes grow within data centers, cloud storage systems continuously face challenges in saving storage capacity and in moving big data within an acceptable time frame. In this paper, we present Boafft, a cloud storage system with distributed deduplication. Boafft achieves scalable throughput and capacity by using multiple data servers to deduplicate data in parallel, with a minimal loss of deduplication ratio. First, Boafft uses an efficient data routing algorithm based on data similarity that reduces network overhead by quickly identifying the storage location. Second, Boafft maintains an in-memory similarity index in each data server that helps avoid a large number of random disk reads and writes, which in turn accelerates local data deduplication. Third, Boafft constructs a hot fingerprint cache in each data server based on access frequency, so as to improve the data deduplication ratio. Our comparative analysis with EMC's stateful routing algorithm reveals that Boafft can provide a comparatively high deduplication ratio with a low network bandwidth overhead. Moreover, Boafft makes better use of storage space, with higher read/write bandwidth and good load balance.
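The idea of similarity-based routing combined with a per-server in-memory similarity index can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify chunking, signature selection, or tie-breaking, so fixed-size 4 KiB chunks, SHA-1 fingerprints, a signature made of the k smallest fingerprints, and a least-loaded tie-break are all assumptions, and every name here (`DataServer`, `route`, `write`) is hypothetical. The hot fingerprint cache is omitted.

```python
import hashlib


def chunk_fingerprints(data: bytes, chunk_size: int = 4096) -> list[str]:
    """Split data into fixed-size chunks (an assumption) and fingerprint each."""
    return [
        hashlib.sha1(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def representative_fingerprints(fps: list[str], k: int = 4) -> list[str]:
    """Use the k smallest distinct fingerprints as a similarity signature."""
    return sorted(set(fps))[:k]


class DataServer:
    """Hypothetical data server holding a small in-memory similarity index."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.similarity_index: set[str] = set()  # representative fingerprints only
        self.chunk_store: set[str] = set()       # stand-in for the on-disk chunk index

    def similarity(self, reps: list[str]) -> int:
        """Count how many representative fingerprints this server already knows."""
        return sum(1 for r in reps if r in self.similarity_index)

    def store(self, fps: list[str], reps: list[str]) -> int:
        """Deduplicate locally; return the number of chunks actually written."""
        new_chunks = set(fps) - self.chunk_store
        self.chunk_store |= new_chunks
        self.similarity_index.update(reps)
        return len(new_chunks)


def route(servers: list[DataServer], reps: list[str]) -> DataServer:
    """Pick the most similar server; break ties toward the least loaded one."""
    return max(servers, key=lambda s: (s.similarity(reps), -len(s.chunk_store)))


def write(servers: list[DataServer], data: bytes) -> tuple[str, int]:
    """Route one data block and return (server name, new chunks stored)."""
    fps = chunk_fingerprints(data)
    reps = representative_fingerprints(fps)
    target = route(servers, reps)
    return target.name, target.store(fps, reps)
```

In this sketch only the short signature is compared against each server's index, rather than every chunk fingerprint, which is one way a similarity-based scheme can keep routing traffic low while still steering similar data to the server that already holds most of its chunks.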

Original language: English
Article number: 7364228
Pages (from-to): 1199-1211
Number of pages: 13
Journal: IEEE Transactions on Cloud Computing
Issue number: 4
Publication status: Published - 2020 Oct 1

All Science Journal Classification (ASJC) codes

  • Software
  • Information Systems
  • Hardware and Architecture
  • Computer Science Applications
  • Computer Networks and Communications
