Automatic Web Information Extraction Based on Rules

Hu, Fanghuai; Ruan, Tong; Shao, Zhiqing; Ding, Jun

doi:10.1007/978-3-642-24434-6_21

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6997)

Abstract

Web information extraction is the initial step of effective web mining. This paper summarizes a few heuristic rules that describe the characteristics of the main content of web pages. The rules are built from pre-defined terms and metrics, which can be considered reusable and extensible across different kinds of HTML pages. A probabilistic model that uses the rules and metrics is then proposed, and the corresponding algorithm is implemented. The algorithm is tested on 1000 randomly selected web pages. The experiment shows that the algorithm is more precise than other algorithms and more applicable to the diverse structures of different web sites.

This research is supported by the National Science and Technology Pillar Program of China under grant number 2009BAH46B03 and the National Natural Science Foundation of China under grant number 61003126.
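
The abstract does not give the rules, metrics, or model in detail, so the following is only a minimal sketch of the general idea: a few heuristic rules are evaluated on candidate page blocks and combined in a naive-Bayes-style score to pick the main content. Everything here is an assumption made for illustration, including the `Block` fields, the three rules, their thresholds, and the probability values; none of it is taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Block:
    """A candidate page region with pre-computed counts (hypothetical fields)."""
    name: str
    text_chars: int   # characters of visible text in the block
    link_chars: int   # characters of text inside link elements
    tag_count: int    # number of HTML tags inside the block


# Hypothetical heuristic rules: each maps a block to True or False.
RULES = {
    "long_text": lambda b: b.text_chars >= 200,
    "low_link_density": lambda b: b.link_chars / max(b.text_chars, 1) < 0.3,
    "high_text_to_tag": lambda b: b.text_chars / max(b.tag_count, 1) > 20,
}

# Illustrative values only: (P(rule fires | main content), P(rule fires | noise)).
LIKELIHOODS = {
    "long_text": (0.9, 0.2),
    "low_link_density": (0.85, 0.3),
    "high_text_to_tag": (0.8, 0.25),
}

PRIOR_MAIN = 0.2  # assumed prior probability that a block is main content


def main_content_probability(block: Block) -> float:
    """Combine rule outcomes under a naive independence assumption."""
    log_odds = math.log(PRIOR_MAIN / (1 - PRIOR_MAIN))
    for rule_name, rule in RULES.items():
        p_main, p_noise = LIKELIHOODS[rule_name]
        if rule(block):
            log_odds += math.log(p_main / p_noise)
        else:
            log_odds += math.log((1 - p_main) / (1 - p_noise))
    return 1 / (1 + math.exp(-log_odds))


def pick_main_block(blocks: list[Block]) -> Block:
    """Return the block with the highest main-content probability."""
    return max(blocks, key=main_content_probability)


blocks = [
    Block("nav", text_chars=120, link_chars=110, tag_count=40),
    Block("article", text_chars=3000, link_chars=150, tag_count=60),
    Block("footer", text_chars=250, link_chars=200, tag_count=30),
]

for b in blocks:
    print(f"{b.name}: {main_content_probability(b):.3f}")
print("main block:", pick_main_block(blocks).name)
```

Working in log-odds keeps the combination of rules additive, so a new rule can be added by extending the two dictionaries without touching the scoring function. In practice the likelihoods would be estimated from labeled pages rather than set by hand.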

Author information

Authors and Affiliations

Department of Computer Science and Engineering, East China University of Science and Technology, China
Fanghuai Hu, Tong Ruan, Zhiqing Shao & Jun Ding

Editor information

Editors and Affiliations

Information Engineering Laboratory, CSIRO ICT Centre, Australia
Athman Bouguettaya
Digital Enterprise Research Institute (DERI), National University of Ireland, IDA Business Park, Lower Dangan, Galway, Ireland
Manfred Hauswirth
College of Computing, Georgia Institute of Technology, 266 Ferst Drive, 30332-0765, Atlanta, GA, USA
Ling Liu

About this paper

Cite this paper

Hu, F., Ruan, T., Shao, Z., Ding, J. (2011). Automatic Web Information Extraction Based on Rules. In: Bouguettaya, A., Hauswirth, M., Liu, L. (eds) Web Information System Engineering – WISE 2011. WISE 2011. Lecture Notes in Computer Science, vol 6997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24434-6_21

DOI: https://doi.org/10.1007/978-3-642-24434-6_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24433-9
Online ISBN: 978-3-642-24434-6
eBook Packages: Computer Science, Computer Science (R0)
