A Method of Object-based de-duplication

Fang Yan; Yu An Tan

doi:10.4304/jnw.6.12.1705-1712

Fang Yan^*, Yu An Tan

^*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

13 Citations (Scopus)

Abstract

Today, the world is increasingly awash in unstructured data, not only because of the Internet, but also because data that used to be collected on paper or on media such as film, DVDs and compact discs has moved online [1]. Most of this data is unstructured and comes in diverse formats such as e-mail, documents, graphics, images, and videos. Object storage has a clear advantage in managing the complexity and scalability of unstructured data. Object-based data de-duplication is currently the most advanced method and an effective solution for detecting duplicate data. It can detect common embedded data in the first backup across completely unrelated files, even when the physical block layout changes. However, almost all current research on data de-duplication does not consider the content of different file types and has no knowledge of the backup data format. It has been proven that such methods cannot achieve optimal performance for compound files. In our proposed system, we first extract objects from files; Object_IDs are then obtained by applying a hash function to the objects. The resulting Object_IDs are used as indexing keys in a B+ tree-like index structure. We thus avoid the need for a full object index, and the search time for duplicate objects is reduced to O(log n). We introduce a new concept, the duplicate object resolver. The object resolver mediates access to all objects and is the central point for managing all their metadata and indexes. All objects are addressable by their IDs, which are universally unique. The resolver stores metadata in a triple format. This improved metadata management strategy allows us to set, add and resolve object properties with high flexibility, and allows the same metadata to be reused among duplicate objects.
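
The pipeline the abstract describes (hash each extracted object to an Object_ID, look the ID up in an ordered index, and keep metadata as triples in a resolver) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name ObjectResolver, the choice of SHA-256, the (object_id, property, value) triple layout, and the sorted list used as a stand-in for a B+ tree are all assumptions made for illustration, and object extraction from compound files is assumed to have happened already.

```python
from __future__ import annotations

import bisect
import hashlib


def object_id(data: bytes) -> str:
    """Derive an Object_ID from object content (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


class ObjectResolver:
    """Hypothetical resolver: ordered ID index plus metadata triples."""

    def __init__(self) -> None:
        self._ids: list[str] = []               # sorted keys, stand-in for a B+ tree
        self._store: dict[str, bytes] = {}      # one stored copy per unique object
        self._meta: dict[tuple[str, str], str] = {}  # (object_id, property) -> value

    def contains(self, oid: str) -> bool:
        # Binary search over sorted keys: O(log n) comparisons per lookup.
        i = bisect.bisect_left(self._ids, oid)
        return i < len(self._ids) and self._ids[i] == oid

    def put(self, data: bytes) -> tuple[str, bool]:
        """Store an object once; return (Object_ID, was_duplicate)."""
        oid = object_id(data)
        if self.contains(oid):
            return oid, True
        # List insertion is O(n); a real B+ tree would insert in O(log n).
        bisect.insort(self._ids, oid)
        self._store[oid] = data
        return oid, False

    def set_property(self, oid: str, prop: str, value: str) -> None:
        self._meta[(oid, prop)] = value

    def triples(self, oid: str) -> list[tuple[str, str, str]]:
        """Return the metadata of one object as (object_id, property, value)."""
        return sorted((o, p, v) for (o, p), v in self._meta.items() if o == oid)


def backup(resolver: ObjectResolver, objects: list[bytes]) -> tuple[list[str], int]:
    """Feed already-extracted objects to the resolver; count duplicates."""
    ids, duplicates = [], 0
    for data in objects:
        oid, was_duplicate = resolver.put(data)
        ids.append(oid)
        duplicates += int(was_duplicate)
    return ids, duplicates


# Two unrelated "files" that embed the same object (placeholder bytes).
resolver = ObjectResolver()
ids_a, dup_a = backup(resolver, [b"embedded-logo", b"chapter one"])
ids_b, dup_b = backup(resolver, [b"embedded-logo", b"chapter two"])
resolver.set_property(ids_a[0], "type", "image")
```

In this sketch the shared object is stored once, the second backup reports one duplicate, and the metadata attached through the first file is visible through the second file's Object_ID, which mirrors the reuse of metadata among duplicate objects that the abstract mentions.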

Original language	English
Pages (from-to)	1705-1712
Number of pages	8
Journal	Journal of Networks
Volume	6
Issue number	12
DOI	https://doi.org/10.4304/jnw.6.12.1705-1712
Publication status	Published - 2011
Externally published	Yes

Cite this

@article{8509dc86a7c84c5185047b7efd8be83e,

title = "A Method of Object-based de-duplication",

keywords = "Backup, Data de-duplication, Metadata, Object index, Object-based",

author = "Yan, Fang and Tan, {Yu An}",

year = "2011",

doi = "10.4304/jnw.6.12.1705-1712",

language = "English",

volume = "6",

pages = "1705--1712",

journal = "Journal of Networks",

issn = "1796-2056",

publisher = "Academy Publisher",

number = "12",

}

TY - JOUR

T1 - A Method of Object-based de-duplication

AU - Yan, Fang

AU - Tan, Yu An

PY - 2011

Y1 - 2011

KW - Backup

KW - Data de-duplication

KW - Metadata

KW - Object index

KW - Object-based

UR - http://www.scopus.com/inward/record.url?scp=83455164534&partnerID=8YFLogxK

U2 - 10.4304/jnw.6.12.1705-1712

DO - 10.4304/jnw.6.12.1705-1712

M3 - Article

AN - SCOPUS:83455164534

SN - 1796-2056

VL - 6

SP - 1705

EP - 1712

JO - Journal of Networks

JF - Journal of Networks

IS - 12

ER -
