
dc.contributor.author: Zhu, Jia
dc.contributor.author: Xie, Qing
dc.contributor.author: Yu, Shoou I.
dc.contributor.author: Wong, Wai Hung
dc.date.accessioned: 2015-08-12T09:28:30Z
dc.date.available: 2015-08-12T09:28:30Z
dc.date.issued: 2015-07-07
dc.identifier.citation: Zhu, J., Xie, Q., Yu, S.-I., & Wong, W. H. (2015). Exploiting link structure for web page genre identification. Data Mining and Knowledge Discovery, 30(3), 550–575. doi:10.1007/s10618-015-0428-8
dc.identifier.issn: 1384-5810
dc.identifier.issn: 1573-756X
dc.identifier.doi: 10.1007/s10618-015-0428-8
dc.identifier.uri: http://hdl.handle.net/10754/566107
dc.description.abstract: As the World Wide Web develops at an unprecedented pace, identifying web page genre has recently attracted increasing attention because of its importance in web search. A common approach for identifying genre is to use textual features that can be extracted directly from a web page, that is, On-Page features. The extracted features are subsequently inputted into a machine learning algorithm that will perform classification. However, these approaches may be ineffective when the web page contains limited textual information (e.g., the page is full of images). In this study, we address genre identification of web pages under the aforementioned situation. We propose a framework that uses On-Page features while simultaneously considering information in neighboring pages, that is, the pages that are connected to the original page by backward and forward links. We first introduce a graph-based model called GenreSim, which selects an appropriate set of neighboring pages. We then construct a multiple classifier combination module that utilizes information from the selected neighboring pages and On-Page features to improve performance in genre identification. Experiments are conducted on well-known corpora, and favorable results indicate that our proposed framework is effective, particularly in identifying web pages with limited textual information. © 2015 The Author(s)
dc.publisher: Springer Nature
dc.rights: The final publication is available at Springer via http://dx.doi.org/10.1007/s10618-015-0428-8
dc.subject: Genre identification
dc.subject: Multiple classifiers
dc.subject: Neighboring pages selection
dc.title: Exploiting link structure for web page genre identification
dc.type: Article
dc.contributor.department: Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
dc.contributor.department: Computer Science Program
dc.identifier.journal: Data Mining and Knowledge Discovery
kaust.person: Xie, Qing
refterms.dateFOA: 2016-07-07T00:00:00Z
dc.date.published-online: 2015-07-07
dc.date.published-print: 2016-05


Files in this item

Name: manuscript.pdf
Size: 3.141Mb
Format: PDF
Description: Article - Accepted Manuscript
