research-article

TitleFinder: extracting the headline of news web pages based on cosine similarity and overlap scoring similarity

Authors:
Hadi Mohammadzadeh, University of Ulm, Ulm, Germany
Thomas Gottron, Universität Koblenz-Landau, Koblenz, Germany
Franz Schweiggert, University of Ulm, Ulm, Germany
Gerhard Heyer, Universität Leipzig, Leipzig, Germany

WIDM '12: Proceedings of the twelfth international workshop on Web information and data management, November 2012, Pages 65–72
https://doi.org/10.1145/2389936.2389950

Published: 02 November 2012

ABSTRACT

Automatically extracting the headline of online web articles has many applications in web mining and information retrieval. In this paper, we developed TitleFinder, a content-based, domain- and language-independent approach for unsupervised extraction of the headline of web articles. TitleFinder starts by using a heuristic to select a candidate headline. In a second step, the contents of each text fragment in the HTML file are compared to the candidate headline. We implemented four types of similarity for this comparison: two variations of cosine similarity based on tf and tf-idf weighting schemes, an overlap scoring similarity, and an aggregated metric combining the scores of the previous three similarities. Our method achieves high performance in terms of effectiveness and efficiency, and it outperforms approaches operating on structural and visual features on a test set of 11,218 news web pages from 15 different domains.
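
To make the comparison step concrete, the following is a minimal Python sketch of three of the similarity measures named in the abstract: cosine similarity over tf vectors, cosine similarity over tf-idf vectors, and an overlap score. It is an illustration under stated assumptions, not the authors' implementation. The tokenizer, the choice to treat each text fragment of the page as one document when computing idf, the textbook form of the overlap score (sum of the fragment's tf-idf weights over the candidate's terms), and all function names (`tokenize`, `idf_table`, `weight_vector`, `cosine`, `overlap_score`, `score_fragments`) are hypothetical. The candidate headline is taken as given, since the abstract does not describe the selection heuristic.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase word tokens; the paper's tokenizer may differ.
    return re.findall(r"\w+", text.lower())


def idf_table(token_lists):
    # Assumption: each text fragment of the page counts as one "document".
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return {t: math.log(n / d) for t, d in df.items()}


def weight_vector(tokens, idf=None):
    # Raw term frequencies, optionally scaled by idf (unseen terms get weight 0).
    tf = Counter(tokens)
    if idf is None:
        return dict(tf)
    return {t: f * idf.get(t, 0.0) for t, f in tf.items()}


def cosine(u, v):
    # Cosine of the angle between two sparse vectors stored as dicts.
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def overlap_score(candidate_tokens, fragment_tokens, idf):
    # Sum, over the candidate's distinct terms, of their tf-idf weight in the fragment.
    tf = Counter(fragment_tokens)
    return sum(tf[t] * idf.get(t, 0.0) for t in set(candidate_tokens))


def score_fragments(candidate, fragments):
    # Score every text fragment against the candidate headline.
    cand = tokenize(candidate)
    frags = [tokenize(f) for f in fragments]
    idf = idf_table(frags)
    scores = []
    for tokens in frags:
        scores.append({
            "cos_tf": cosine(weight_vector(cand), weight_vector(tokens)),
            "cos_tfidf": cosine(weight_vector(cand, idf), weight_vector(tokens, idf)),
            "overlap": overlap_score(cand, tokens, idf),
        })
    return scores
```

Calling `score_fragments(candidate, fragments)` with a candidate string and a list of fragment strings returns one dictionary of scores per fragment; the fragment with the highest score under a given measure would be that measure's headline guess. The aggregated metric is left out because the abstract does not state how the three scores are combined.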

Published in
WIDM '12: Proceedings of the twelfth international workshop on Web information and data management
November 2012, 90 pages
ISBN: 9781450317207
DOI: 10.1145/2389936
Program Chairs: George H.L. Fletcher (Eindhoven University of Technology, The Netherlands), Prasenjit Mitra (The Pennsylvania State University, USA)
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cosine similarity
headline extraction
html web pages
information retrieval
overlap scoring similarity
title extraction
vector space model