Exploiting link structure for web page genre identification

Handle URI:
http://hdl.handle.net/10754/566107
Title:
Exploiting link structure for web page genre identification
Authors:
Zhu, Jia; Xie, Qing (0000-0003-4530-588X); Yu, Shoou I.; Wong, Wai Hung
Abstract:
As the World Wide Web develops at an unprecedented pace, identifying web page genre has recently attracted increasing attention because of its importance in web search. A common approach for identifying genre is to use textual features that can be extracted directly from a web page, that is, On-Page features. The extracted features are subsequently inputted into a machine learning algorithm that will perform classification. However, these approaches may be ineffective when the web page contains limited textual information (e.g., the page is full of images). In this study, we address genre identification of web pages under the aforementioned situation. We propose a framework that uses On-Page features while simultaneously considering information in neighboring pages, that is, the pages that are connected to the original page by backward and forward links. We first introduce a graph-based model called GenreSim, which selects an appropriate set of neighboring pages. We then construct a multiple classifier combination module that utilizes information from the selected neighboring pages and On-Page features to improve performance in genre identification. Experiments are conducted on well-known corpora, and favorable results indicate that our proposed framework is effective, particularly in identifying web pages with limited textual information. © 2015 The Author(s)
KAUST Department:
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division; Computer Science Program
Publisher:
Springer Science + Business Media
Journal:
Data Mining and Knowledge Discovery
Issue Date:
7-Jul-2015
DOI:
10.1007/s10618-015-0428-8
Type:
Article
ISSN:
1384-5810; 1573-756X
Appears in Collections:
Articles; Computer Science Program; Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
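
The abstract describes two steps: selecting a set of neighboring pages, then combining classifiers that use neighbor information with a classifier that uses On-Page features. The record does not give the details of GenreSim or of the combination module, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that a similarity score per neighbor is already available, picks the top-k neighbors, and combines predictions by a simple weighted average (soft voting). All function names, the top-k rule, and the weight are hypothetical.

```python
from typing import Dict, List, Tuple

# Genre label -> probability, as produced by some classifier (assumed given).
Probs = Dict[str, float]


def select_neighbors(scored: List[Tuple[str, float]], k: int) -> List[str]:
    """Keep the k neighbors with the highest similarity score.

    This top-k rule is a stand-in assumption; it is NOT the GenreSim model.
    """
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return [page for page, _ in ranked[:k]]


def combine(on_page: Probs, neighbor_probs: List[Probs],
            on_page_weight: float = 0.5) -> Probs:
    """Weighted average of the on-page prediction and the mean neighbor prediction."""
    if not neighbor_probs:
        # No usable neighbors: fall back to the on-page prediction alone.
        return dict(on_page)
    labels = set(on_page)
    for probs in neighbor_probs:
        labels |= set(probs)
    combined = {}
    for label in labels:
        neighbor_mean = sum(p.get(label, 0.0) for p in neighbor_probs) / len(neighbor_probs)
        combined[label] = (on_page_weight * on_page.get(label, 0.0)
                           + (1.0 - on_page_weight) * neighbor_mean)
    return combined


def predict(probs: Probs) -> str:
    """Return the genre label with the highest combined probability."""
    return max(probs, key=probs.get)


# Hypothetical image-heavy page: the on-page classifier is nearly undecided,
# while the selected neighbors lean clearly toward one genre.
selected = select_neighbors([("a", 0.9), ("b", 0.1), ("c", 0.5)], k=2)
on_page = {"blog": 0.52, "shop": 0.48}
neighbors = [{"blog": 0.2, "shop": 0.8}, {"blog": 0.3, "shop": 0.7}]
combined = combine(on_page, neighbors)
label = predict(combined)
```

In this made-up example the on-page prediction alone would favor "blog" by a narrow margin, while the combined prediction favors "shop" because the neighbors agree with each other. This illustrates why neighbor information can help when a page carries little text.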

Full metadata record

DC Field | Value | Language
dc.contributor.author | Zhu, Jia | en
dc.contributor.author | Xie, Qing | en
dc.contributor.author | Yu, Shoou I. | en
dc.contributor.author | Wong, Wai Hung | en
dc.date.accessioned | 2015-08-12T09:28:30Z | en
dc.date.available | 2015-08-12T09:28:30Z | en
dc.date.issued | 2015-07-07 | en
dc.identifier.issn | 1384-5810 | en
dc.identifier.issn | 1573-756X | en
dc.identifier.doi | 10.1007/s10618-015-0428-8 | en
dc.identifier.uri | http://hdl.handle.net/10754/566107 | en
dc.description.abstract | As the World Wide Web develops at an unprecedented pace, identifying web page genre has recently attracted increasing attention because of its importance in web search. A common approach for identifying genre is to use textual features that can be extracted directly from a web page, that is, On-Page features. The extracted features are subsequently inputted into a machine learning algorithm that will perform classification. However, these approaches may be ineffective when the web page contains limited textual information (e.g., the page is full of images). In this study, we address genre identification of web pages under the aforementioned situation. We propose a framework that uses On-Page features while simultaneously considering information in neighboring pages, that is, the pages that are connected to the original page by backward and forward links. We first introduce a graph-based model called GenreSim, which selects an appropriate set of neighboring pages. We then construct a multiple classifier combination module that utilizes information from the selected neighboring pages and On-Page features to improve performance in genre identification. Experiments are conducted on well-known corpora, and favorable results indicate that our proposed framework is effective, particularly in identifying web pages with limited textual information. © 2015 The Author(s) | en
dc.publisher | Springer Science + Business Media | en
dc.rights | The final publication is available at Springer via http://dx.doi.org/10.1007/s10618-015-0428-8 | en
dc.subject | Genre identification | en
dc.subject | Multiple classifiers | en
dc.subject | Neighboring pages selection | en
dc.title | Exploiting link structure for web page genre identification | en
dc.type | Article | en
dc.contributor.department | Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division | en
dc.contributor.department | Computer Science Program | en
dc.identifier.journal | Data Mining and Knowledge Discovery | en
kaust.author | Xie, Qing | en
All Items in KAUST are protected by copyright, with all rights reserved, unless otherwise indicated.