Abstract
The Web has become an indispensable source of knowledge because of the large amount of information available on it. Unfortunately, accessing this information from automated processes is not a simple task. Alongside the Semantic Web and Web services, information extraction is one solution to this problem. We propose to explore how datamining techniques for inferring classifiers can be used to extract information from web pages. In this paper we present a first approach, which consists of evaluating how these techniques perform on DOM tree features. Our first results show good precision and recall, which suggests that the technique is promising.
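The abstract does not say which DOM tree features or which classifiers were evaluated, so the following is only a minimal sketch of the general idea under stated assumptions. It derives a few illustrative features (tag, class attribute, depth, text length) for each text node of a toy page, labels the nodes with a single hand-written rule that stands in for a learned classifier, and scores the result with precision and recall. The feature set, the rule, the sample page, and all names (`DomFeatureExtractor`, `extract_features`, `precision_recall`) are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class DomFeatureExtractor(HTMLParser):
    """Collect one feature dict per non-empty text node (assumed feature set)."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open elements as (tag, class) pairs
        self.rows = []   # one feature dict per text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class") or ""))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            tag, css_class = self.stack[-1]
            self.rows.append({
                "text": text,
                "tag": tag,                # tag of the enclosing element
                "class": css_class,        # its class attribute, if any
                "depth": len(self.stack),  # depth of that element in the tree
                "length": len(text),
            })


def extract_features(html):
    parser = DomFeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def precision_recall(predicted, actual):
    """Precision and recall for two parallel lists of booleans."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy page, invented for illustration.
PAGE = """
<html><body>
  <div class="book"><h2>Dune</h2><span class="price">9.99</span></div>
  <div class="book"><h2>Emma</h2><span class="price">7.50</span></div>
  <p class="price">Prices include tax</p>
</body></html>
"""

rows = extract_features(PAGE)
# Hypothetical stand-in for a learned classifier: one rule on one feature.
predicted = [row["class"] == "price" for row in rows]
# Hand-made annotations: which text nodes really are prices.
actual = [row["text"] in {"9.99", "7.50"} for row in rows]
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy page the rule finds both prices but also flags the footer note, so recall is 1.0 and precision is 2/3. In a real experiment the rule would be replaced by a classifier trained on annotated pages, and the scores would be measured on pages held out from training.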
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fernández, G., Sleiman, H.A. (2011). An Experiment on Using Datamining Techniques to Extract Information from the Web. In: Corchado, J.M., Pérez, J.B., Hallenborg, K., Golinska, P., Corchuelo, R. (eds) Trends in Practical Applications of Agents and Multiagent Systems. Advances in Intelligent and Soft Computing, vol 90. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19931-8_21
DOI: https://doi.org/10.1007/978-3-642-19931-8_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19930-1
Online ISBN: 978-3-642-19931-8
eBook Packages: Engineering, Engineering (R0)