Abstract
Content-defined chunking (CDC) is a prevalent data de-duplication technique for removing redundant data segments in storage systems. Existing CDC research does not consider the distinct content characteristics of different file types: chunk boundaries are determined in an essentially random, content-agnostic way, and a single strategy is applied to all file types. Such methods have proven suitable for text and other simple content, but they do not achieve optimal performance on compound files. A compound file consists of unstructured data, typically occupies a large amount of storage space, and often contains multimedia data. Object-based data de-duplication, currently the most advanced approach, is an effective solution for detecting duplicate data in such files. We analyze the content characteristics of OpenXML files and develop an object extraction method, and we propose an algorithm that determines the de-duplication granularity from the object structure and distribution. The goal is to effectively detect identical objects within a file or across different files, and to de-duplicate compound files effectively even when their physical layout changes. Simulation experiments on a typical collection of unstructured data show that efficiency improves by 10% overall compared with the CDC method.
Original language | English |
---|---|
Pages (from-to) | 1546-1557 |
Number of pages | 12 |
Journal | Jisuanji Yanjiu yu Fazhan/Computer Research and Development |
Volume | 52 |
Issue number | 7 |
DOIs | |
Publication status | Published - 1 Jul 2015 |
Keywords
- Compound file
- Content defined chunking (CDC)
- Data de-duplication
- Object
- OpenXML standard
- Unstructured data