
An Experiment on Using Datamining Techniques to Extract Information from the Web

  • Conference paper
In: Trends in Practical Applications of Agents and Multiagent Systems

Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 90)

Abstract

The Web has become an indispensable source of knowledge because of the large amount of information available on it. Unfortunately, accessing this information through automated processes is not a simple task. Alongside the semantic web and Web services, information extraction is one solution to this problem. Our proposal is to explore how datamining techniques that infer classifiers can help us extract information from web pages. In this paper we present a first approach to our technique, which consisted of evaluating how these techniques perform on DOM tree features. Our first results show good precision and recall, which indicates that the technique is promising.
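As a rough illustration of what "DOM tree features" could look like in practice, the following minimal sketch turns each element of an HTML page into one feature row. The abstract does not list the features the authors used, so the feature set here (tag, depth, parent tag, number of child elements, text length) and the names `DomFeatureExtractor` and `extract_features` are assumptions made for illustration only. Rows of this kind, once labelled, could be given to any off-the-shelf classifier learner.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DomFeatureExtractor(HTMLParser):
    """Collects one feature row per element (hypothetical feature set)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements, root first
        self.rows = []   # finished feature rows, in closing order

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        if parent is not None:
            parent["children"] += 1
        node = {
            "tag": tag,
            "depth": len(self.stack),
            "parent": parent["tag"] if parent else None,
            "children": 0,
            "text_len": 0,
        }
        if tag in VOID_TAGS:
            self.rows.append(node)   # void elements are complete at once
        else:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Ignore stray closing tags; otherwise close up to the matching element.
        if not any(node["tag"] == tag for node in self.stack):
            return
        while self.stack:
            node = self.stack.pop()
            self.rows.append(node)
            if node["tag"] == tag:
                break

    def handle_data(self, data):
        # Text is credited to the innermost open element only.
        if self.stack:
            self.stack[-1]["text_len"] += len(data.strip())


def extract_features(html):
    """Return a list of feature dicts, one per element in the page."""
    parser = DomFeatureExtractor()
    parser.feed(html)
    parser.close()
    # Flush elements that were never closed in the source.
    while parser.stack:
        parser.rows.append(parser.stack.pop())
    return parser.rows


SAMPLE = ("<html><body><ul>"
          "<li>Title: <b>Dune</b></li>"
          "<li>Price: <b>9.99</b></li>"
          "</ul></body></html>")

for row in extract_features(SAMPLE):
    print(row)
```

Each row describes a node by its position and content shape rather than by its text, which is one plausible way to let a classifier learn which nodes hold the data to extract.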




Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fernández, G., Sleiman, H.A. (2011). An Experiment on Using Datamining Techniques to Extract Information from the Web. In: Corchado, J.M., Pérez, J.B., Hallenborg, K., Golinska, P., Corchuelo, R. (eds) Trends in Practical Applications of Agents and Multiagent Systems. Advances in Intelligent and Soft Computing, vol 90. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19931-8_21


  • DOI: https://doi.org/10.1007/978-3-642-19931-8_21

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-19930-1

  • Online ISBN: 978-3-642-19931-8

  • eBook Packages: Engineering, Engineering (R0)
