Abstract
Information extraction (IE) has emerged as a new discipline in computer science. In IE, intelligent algorithms extract the required data and structure it so that it can be queried. Most IE systems use web-page structure, e.g. HTML tags, to recognize the target information. This article develops an algorithm that recognizes the main region of a web page containing the target information by means of an ontology, the web-page structure, and the goodness-of-fit χ² test. After the main region is recognized, the records in that region are identified, and each record is written to a text file.
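The abstract does not say how the goodness-of-fit test is applied, so the following is only a minimal sketch of the general idea under stated assumptions: count how often ontology terms occur in each candidate region, compare those counts with an expected distribution using Pearson's χ² statistic, and treat the region with the best fit as the main region. The term list, the expected proportions, the region texts, and the "smallest statistic wins" rule are all hypothetical choices made for illustration, not details taken from the paper.

```python
import re
from collections import Counter


def chi_square_statistic(observed, expected_probs):
    """Pearson's goodness-of-fit statistic: sum((O - E)^2 / E).

    expected_probs must be strictly positive and sum to 1.
    """
    total = sum(observed)
    if total == 0:
        # No ontology term occurs at all: treat as the worst possible fit.
        return float("inf")
    stat = 0.0
    for o, p in zip(observed, expected_probs):
        e = total * p  # expected count for this term
        stat += (o - e) ** 2 / e
    return stat


def term_counts(text, terms):
    """Count occurrences of each ontology term in a region's text."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return [words[t] for t in terms]


# Hypothetical ontology terms and their assumed expected proportions.
terms = ["title", "author", "price"]
expected = [1 / 3, 1 / 3, 1 / 3]

# Hypothetical text of two candidate regions of a page.
regions = {
    "nav": "home about contact price",
    "main": "title author price title author price title author price",
}

scores = {
    name: chi_square_statistic(term_counts(text, terms), expected)
    for name, text in regions.items()
}

# Assumption: the region whose term counts fit the expected
# distribution best (smallest statistic) is taken as the main region.
best = min(scores, key=scores.get)
print(best, scores)
```

In practice the statistic would be compared against a χ² critical value for the appropriate degrees of freedom (here, number of terms minus one) rather than only ranked, and candidate regions would come from the page's HTML structure rather than from hand-written strings.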
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Keshavarzi, A., Rahmani, A.M., Mohsenzadeh, M., Keshavarzi, R. (2008). Recognition of Data Records in Semi-structured Web-Pages Using Ontology and χ² Statistical Distribution. In: Tang, C., Ling, C.X., Zhou, X., Cercone, N.J., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2008. Lecture Notes in Computer Science, vol 5139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88192-6_71
DOI: https://doi.org/10.1007/978-3-540-88192-6_71
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88191-9
Online ISBN: 978-3-540-88192-6
eBook Packages: Computer Science, Computer Science (R0)