An Experiment on Using Datamining Techniques to Extract Information from the Web

Fernández, Gretel; Sleiman, Hassan A.

doi:10.1007/978-3-642-19931-8_21


Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 90)


Abstract

The Web has become an indispensable source of knowledge because of the large amount of information available on it. Unfortunately, accessing this information through automated processes is not a simple task. Besides the Semantic Web and Web services, information extraction is one solution to this problem. Our proposal is to explore how data mining techniques for inferring classifiers can help us extract information from web pages. In this paper we present a first approach to our technique, which consisted of evaluating how these techniques work on DOM tree features. Our first results show good precision and recall, which suggests that the technique is promising.
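The idea of working on DOM tree features can be illustrated with a minimal sketch. It is not the authors' implementation: the feature set (enclosing tag, depth, text length), the sample page, and the hand-written rule standing in for an inferred classifier are all assumptions made for illustration. The sketch turns each text node into a feature record, applies the stand-in rule, and scores the result with precision and recall.

```python
from html.parser import HTMLParser


class DomFeatureExtractor(HTMLParser):
    """Collects one feature record per non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open tags from the root down to the current node
        self.records = []  # feature records, in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.records.append({
                "text": text,
                "tag": self.stack[-1] if self.stack else None,
                "depth": len(self.stack),
                "length": len(text),
            })


def precision_recall(predicted, actual):
    """Precision and recall of a predicted set against a reference set."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical sample page; the goal is to extract the price.
PAGE = (
    "<html><body>"
    "<div><h1>Laptop X</h1><span>USD 999</span></div>"
    "<div><p>Free shipping</p></div>"
    "</body></html>"
)

extractor = DomFeatureExtractor()
extractor.feed(PAGE)
records = extractor.records

# Hand-written stand-in for a classifier that a learner would infer
# from labelled feature records (an assumption, not a learned model).
def looks_like_price(record):
    return record["tag"] == "span" and record["length"] < 20

predicted = {r["text"] for r in records if looks_like_price(r)}
actual = {"USD 999"}  # hypothetical manual annotation
precision, recall = precision_recall(predicted, actual)
```

In a real experiment the rule would be replaced by a classifier trained on annotated pages, and the feature records would be exported in whatever format the chosen data mining toolkit expects.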



Author information

Authors and Affiliations

University of Sevilla, Spain
Gretel Fernández & Hassan A. Sleiman


Editor information

Editors and Affiliations

Departamento de Informática y Automática, Facultad de Ciencias, Universidad de Salamanca, Plaza de la Merced S/N, 37008, Salamanca, Spain
Juan M. Corchado
Escuela Universitaria de Informática, Universidad Pontificia de Salamanca, Compañía 5, 37002, Salamanca, Spain
Javier Bajo Pérez
The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Campusvej 55, DK-5230, Odense M, Denmark
Kasper Hallenborg
Institute of Management Engineering, Poznan University of Technology, Strzelecka 11, 60-965, Poznan, Poland
Paulina Golinska
ETSI Informática, Avda. Reina Mercedes, s/n, 41012, Sevilla, Spain
Rafael Corchuelo


About this paper

Cite this paper

Fernández, G., Sleiman, H.A. (2011). An Experiment on Using Datamining Techniques to Extract Information from the Web. In: Corchado, J.M., Pérez, J.B., Hallenborg, K., Golinska, P., Corchuelo, R. (eds) Trends in Practical Applications of Agents and Multiagent Systems. Advances in Intelligent and Soft Computing, vol 90. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19931-8_21


DOI: https://doi.org/10.1007/978-3-642-19931-8_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19930-1
Online ISBN: 978-3-642-19931-8
eBook Packages: Engineering, Engineering (R0)
