Extraction of informative blocks from web pages

Yu Juan Cao; Zhen Dong Niu; Liu Ling Dai; Yu Ming Zhao

doi:10.1109/ALPIT.2008.106

Yu Juan Cao*, Zhen Dong Niu, Liu Ling Dai, Yu Ming Zhao

*Corresponding author for this work

School of Computer Science

Beijing Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

12 Citations (Scopus)

Abstract

Web pages typically contain a large amount of banner ads, navigation bars, copyright notices, and similar material. Such irrelevant information is not part of the main content of a page, and it seriously harms Web mining and searching. In this paper, we develop and evaluate a method that uses both visual features and semantic information to extract informative blocks. First, we partition a web page into semantic blocks using vision-based page segmentation. Visual features and semantic information obtained by LSI (Latent Semantic Indexing) are then extracted to form the feature vector of each block. Second, we manually label the blocks as informative or uninformative. The labeled blocks are used as a training dataset to train a classification model, and the informative blocks can then be extracted using that model. Our experiments show that the proposed EIBA (Extract Informative Block Arithmetic) dramatically improves the results in near-duplicate detection and classification tasks.
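The pipeline described in the abstract (segment, build a feature vector from visual and LSI features, train on labeled blocks, classify) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the four sample blocks, the two visual features (`link_density`, `relative_area`), the use of power iteration to approximate LSI directions, and a plain perceptron standing in for the classification model (the record's keywords mention SVM). The segmentation step is skipped, so the blocks are given directly.

```python
import math
from collections import Counter

# Hypothetical blocks standing in for the output of a vision-based segmenter.
# "visual" = [link_density, relative_area]; label +1 = informative, -1 = not.
BLOCKS = [
    {"text": "new method extracts informative blocks from web pages",
     "visual": [0.05, 0.60], "label": 1},
    {"text": "experiments on near duplicate detection of web pages",
     "visual": [0.10, 0.55], "label": 1},
    {"text": "home news sports login register",
     "visual": [0.90, 0.10], "label": -1},
    {"text": "copyright all rights reserved contact login",
     "visual": [0.85, 0.05], "label": -1},
]


def term_matrix(texts):
    """Build a block-by-term count matrix over a sorted vocabulary."""
    vocab = sorted({w for t in texts for w in t.split()})
    rows = []
    for t in texts:
        counts = Counter(t.split())
        rows.append([float(counts[w]) for w in vocab])
    return rows


def lsi_directions(rows, k, iters=200):
    """Approximate the top-k right singular vectors (LSI term directions)
    by power iteration on A^T A with deflation."""
    n_terms = len(rows[0])
    found = []
    for _component in range(k):
        v = [1.0 / (j + 1) for j in range(n_terms)]  # deterministic start
        for _step in range(iters):
            # Remove components along directions already found.
            for u in found:
                dot = sum(a * b for a, b in zip(v, u))
                v = [a - dot * b for a, b in zip(v, u)]
            # Multiply by A^T A: first A v, then A^T (A v).
            av = [sum(a * b for a, b in zip(row, v)) for row in rows]
            v = [sum(rows[i][j] * av[i] for i in range(len(rows)))
                 for j in range(n_terms)]
            norm = math.sqrt(sum(a * a for a in v))
            if norm == 0.0:
                break  # no variance left; keep the zero direction
            v = [a / norm for a in v]
        found.append(v)
    return found


def features(blocks, k=2):
    """Concatenate each block's visual features with its k LSI coordinates."""
    rows = term_matrix([b["text"] for b in blocks])
    dirs = lsi_directions(rows, k)
    feats = []
    for block, row in zip(blocks, rows):
        lsi = [sum(a * c for a, c in zip(row, d)) for d in dirs]
        feats.append(block["visual"] + lsi)
    return feats


def train_perceptron(xs, ys, epochs=1000):
    """Train a linear classifier; stops once an epoch makes no mistakes."""
    w, bias = [0.0] * len(xs[0]), 0.0
    for _epoch in range(epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            if y * (sum(a * b for a, b in zip(w, x)) + bias) <= 0:
                w = [a + y * b for a, b in zip(w, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, bias


def predict(w, bias, x):
    """Return +1 (informative) or -1 (uninformative)."""
    return 1 if sum(a * b for a, b in zip(w, x)) + bias > 0 else -1


X = features(BLOCKS)
Y = [b["label"] for b in BLOCKS]
W, B = train_perceptron(X, Y)
print([predict(W, B, x) for x in X])  # → [1, 1, -1, -1]
```

On this toy data the link-density feature alone separates the two classes, so the perceptron is guaranteed to fit the training blocks. A real system would need a proper segmenter, a larger labeled set, and evaluation on held-out pages.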

Original language	English
Title of host publication	Proceedings - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology
Pages	544-549
Number of pages	6
DOI	https://doi.org/10.1109/ALPIT.2008.106
Publication status	Published - 2008
Event	ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology - Liaoning, China. Duration: 23 Jul 2008 → 25 Jul 2008

Publication series

Name	Proceedings - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology

Conference

Conference	ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology
Country/Territory	China
City	Liaoning
Period	23/07/08 → 25/07/08

Access to document

10.1109/ALPIT.2008.106

Other files and links

Link to publication in Scopus

Cite this

Cao, Y. J., Niu, Z. D., Dai, L. L., & Zhao, Y. M. (2008). Extraction of informative blocks from web pages. In Proceedings - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology (pp. 544-549). Article 4584425 (Proceedings - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology). https://doi.org/10.1109/ALPIT.2008.106

@inproceedings{032eaa6950f44149a2e8059dca1ce815,

title = "Extraction of informative blocks from web pages",

abstract = "Web pages typically contain a large amount of banner ads, navigation bars, copyright notices, and similar material. Such irrelevant information is not part of the main content of a page, and it seriously harms Web mining and searching. In this paper, we develop and evaluate a method that uses both visual features and semantic information to extract informative blocks. First, we partition a web page into semantic blocks using vision-based page segmentation. Visual features and semantic information obtained by LSI (Latent Semantic Indexing) are then extracted to form the feature vector of each block. Second, we manually label the blocks as informative or uninformative. The labeled blocks are used as a training dataset to train a classification model, and the informative blocks can then be extracted using that model. Our experiments show that the proposed EIBA (Extract Informative Block Arithmetic) dramatically improves the results in near-duplicate detection and classification tasks.",

keywords = "Data mining, Information extraction, LSI, SVM, VIPS, Web, Web page segmentation",

author = "Cao, {Yu Juan} and Niu, {Zhen Dong} and Dai, {Liu Ling} and Zhao, {Yu Ming}",

year = "2008",

doi = "10.1109/ALPIT.2008.106",

language = "English",

isbn = "9780769532738",

series = "Proceedings - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology",

pages = "544--549",

booktitle = "Proceedings - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology",

note = "ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology ; Conference date: 23-07-2008 Through 25-07-2008",

}

Cao, YJ, Niu, ZD, Dai, LL & Zhao, YM 2008, Extraction of informative blocks from web pages. In Proceedings - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology., 4584425, Proceedings - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology, pp. 544-549, ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology, Liaoning, China, 23/07/08. https://doi.org/10.1109/ALPIT.2008.106

Extraction of informative blocks from web pages. / Cao, Yu Juan; Niu, Zhen Dong; Dai, Liu Ling et al.
Proceedings - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology. 2008. pp. 544-549 4584425 (Proceedings - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology).

TY - GEN

T1 - Extraction of informative blocks from web pages

AU - Cao, Yu Juan

AU - Niu, Zhen Dong

AU - Dai, Liu Ling

AU - Zhao, Yu Ming

PY - 2008

Y1 - 2008

AB - Web pages typically contain a large amount of banner ads, navigation bars, copyright notices, and similar material. Such irrelevant information is not part of the main content of a page, and it seriously harms Web mining and searching. In this paper, we develop and evaluate a method that uses both visual features and semantic information to extract informative blocks. First, we partition a web page into semantic blocks using vision-based page segmentation. Visual features and semantic information obtained by LSI (Latent Semantic Indexing) are then extracted to form the feature vector of each block. Second, we manually label the blocks as informative or uninformative. The labeled blocks are used as a training dataset to train a classification model, and the informative blocks can then be extracted using that model. Our experiments show that the proposed EIBA (Extract Informative Block Arithmetic) dramatically improves the results in near-duplicate detection and classification tasks.

KW - Data mining

KW - Information extraction

KW - LSI

KW - SVM

KW - VIPS

KW - Web

KW - Web page segmentation

UR - http://www.scopus.com/inward/record.url?scp=51949096948&partnerID=8YFLogxK

U2 - 10.1109/ALPIT.2008.106

DO - 10.1109/ALPIT.2008.106

M3 - Conference contribution

AN - SCOPUS:51949096948

SN - 9780769532738

T3 - Proceedings - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology

SP - 544

EP - 549

BT - Proceedings - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology

T2 - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology

Y2 - 23 July 2008 through 25 July 2008

ER -

Cao YJ, Niu ZD, Dai LL, Zhao YM. Extraction of informative blocks from web pages. In Proceedings - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology. 2008. p. 544-549. 4584425. (Proceedings - ALPIT 2008, 7th International Conference on Advanced Language Processing and Web Information Technology). doi: 10.1109/ALPIT.2008.106
