Abstract
Web information extraction is the first step of effective web mining. This article summarizes a few heuristic rules that describe the characteristics of the main content of web pages. The rules are built from predefined terms and metrics, which are reusable and extensible across different kinds of HTML pages. A probabilistic model that uses these rules and metrics is then proposed, and the corresponding algorithm is implemented. The algorithm is tested on 1000 randomly selected web pages. The experiment shows that the algorithm is more precise, and more applicable to the diverse structures of different web sites, than other algorithms.
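The abstract does not state the actual rules, metrics, or model, so the following is only a minimal sketch of the general idea, under assumptions chosen for illustration. It assumes two hypothetical metrics per block (text length and link density), two hypothetical rules with made-up conditional probabilities, and a naive-Bayes-style combination; none of these names or numbers come from the paper.

```python
import math
from html.parser import HTMLParser

# Assumption: these tags delimit candidate content blocks.
BLOCK_TAGS = {"div", "td", "article", "section"}


class BlockCollector(HTMLParser):
    """Collect text length and anchor-text length for each block element."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open blocks: [tag, text_chars, link_chars]
        self.blocks = []   # closed blocks: (tag, text_chars, link_chars)
        self.in_link = 0   # depth of currently open <a> elements

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append([tag, 0, 0])
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        elif tag in BLOCK_TAGS and self.stack and self.stack[-1][0] == tag:
            self.blocks.append(tuple(self.stack.pop()))

    def handle_data(self, data):
        n = len(data.strip())
        if n and self.stack:
            # Text is credited to the innermost open block only.
            self.stack[-1][1] += n
            if self.in_link:
                self.stack[-1][2] += n


# Hypothetical rules: (name, predicate, P(holds | main), P(holds | noise)).
# The probabilities are illustrative placeholders, not measured values.
RULES = [
    ("long_text", lambda text, link: text >= 200, 0.90, 0.20),
    ("low_link_density",
     lambda text, link: text > 0 and link / text < 0.3, 0.85, 0.30),
]


def main_content_probability(text, link, prior=0.5):
    """Combine rule outcomes as independent evidence (naive Bayes style)."""
    log_odds = math.log(prior / (1 - prior))
    for _, holds, p_main, p_noise in RULES:
        if holds(text, link):
            log_odds += math.log(p_main / p_noise)
        else:
            log_odds += math.log((1 - p_main) / (1 - p_noise))
    return 1 / (1 + math.exp(-log_odds))


def best_block(html):
    """Return the (tag, text_chars, link_chars) block with the highest score."""
    parser = BlockCollector()
    parser.feed(html)
    return max(parser.blocks,
               key=lambda b: main_content_probability(b[1], b[2]))


page = ('<div><a href="/a">Home</a> <a href="/b">News</a></div>'
        '<div>' + 'word ' * 60 + '</div>')
print(best_block(page))
```

In this sketch a navigation block made up entirely of links fails both rules and scores low, while a long block of plain text satisfies both and scores high. Adding a rule means adding one entry to `RULES`, which is one way to read the reusability and extensibility the abstract describes.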
This research is supported by the National Science and Technology Pillar Program of China under grant number 2009BAH46B03 and the National Natural Science Foundation of China under grant number 61003126.
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hu, F., Ruan, T., Shao, Z., Ding, J. (2011). Automatic Web Information Extraction Based on Rules. In: Bouguettaya, A., Hauswirth, M., Liu, L. (eds) Web Information System Engineering – WISE 2011. WISE 2011. Lecture Notes in Computer Science, vol 6997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24434-6_21
DOI: https://doi.org/10.1007/978-3-642-24434-6_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24433-9
Online ISBN: 978-3-642-24434-6
eBook Packages: Computer Science, Computer Science (R0)