Abstract
Owing to the variety of Web authoring tools, new Web standards, and improved Web accessibility, a wide variety of Web content is being produced very quickly. In such an environment, providing Web services that match users' needs requires quickly and accurately extracting relevant information from Web documents and removing irrelevant content such as advertisements. In this paper, we propose a method that accurately extracts the main content from HTML Web documents. The method builds a decision tree and uses it to classify whether each block of text is part of the main content. For classification, we use contextual features around text blocks, including word density, link density, HTML tag distribution, and distances between text blocks. We evaluated the method on a published data set and on a data set that we collected. The experimental results show that our method performs 19% better in F-measure than the best-performing existing method.
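As a rough illustration of the kind of per-block contextual features the abstract mentions, the following sketch splits an HTML document into text blocks and computes word density and link density for each block and its neighbours. It is not the authors' implementation: the set of block-level tags, the definition of word density (words per line after wrapping at 80 characters), the tag count used as a crude stand-in for block distance, and all names (`Block`, `BlockSplitter`, `extract_blocks`, `context_features`) are assumptions made for this example.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
import textwrap

# Tags treated as block boundaries. This set is an assumption for the sketch.
BLOCK_TAGS = {"p", "div", "li", "td", "tr", "table", "ul", "ol",
              "h1", "h2", "h3", "br"}
# Elements whose text is never visible content.
SKIP_TAGS = {"script", "style"}


@dataclass
class Block:
    text: str
    words: int
    link_words: int   # words that appeared inside <a> elements
    tags_before: int  # start tags seen since the previous block ended

    @property
    def link_density(self) -> float:
        return self.link_words / self.words if self.words else 0.0

    @property
    def word_density(self) -> float:
        # Assumed definition: words per line after wrapping at 80 characters.
        lines = textwrap.wrap(self.text, width=80)
        return self.words / len(lines) if lines else 0.0


class BlockSplitter(HTMLParser):
    """Collects text blocks delimited by block-level tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._parts = []
        self._link_words = 0
        self._in_link = 0
        self._skip = 0
        self._tags = 0

    def flush(self):
        # Emit the pending text as a block, if there is any.
        text = " ".join(" ".join(self._parts).split())
        if text:
            self.blocks.append(
                Block(text, len(text.split()), self._link_words, self._tags))
            self._tags = 0
        self._parts, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        if tag == "a":
            self._in_link += 1
        if tag in SKIP_TAGS:
            self._skip += 1
        self._tags += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_words += len(data.split())


def extract_blocks(html: str) -> list:
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text
    return parser.blocks


def context_features(blocks: list, i: int) -> list:
    """Features of block i plus densities of its previous and next blocks."""
    def densities(j):
        if 0 <= j < len(blocks):
            return [blocks[j].word_density, blocks[j].link_density]
        return [0.0, 0.0]  # padding at document boundaries

    b = blocks[i]
    own = [b.word_density, b.link_density, float(b.tags_before)]
    return densities(i - 1) + own + densities(i + 1)
```

Feature vectors produced by `context_features`, paired with main-content labels, could then be passed to any decision tree learner. A navigation bar typically yields a link density close to 1, while a paragraph of article text yields a value close to 0.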
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kim, M., Kim, Y., Song, W., Khil, A. (2013). Main Content Extraction from Web Documents Using Text Block Context. In: Decker, H., Lhotská, L., Link, S., Basl, J., Tjoa, A.M. (eds) Database and Expert Systems Applications. DEXA 2013. Lecture Notes in Computer Science, vol 8056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40173-2_10
DOI: https://doi.org/10.1007/978-3-642-40173-2_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40172-5
Online ISBN: 978-3-642-40173-2
eBook Packages: Computer Science, Computer Science (R0)