Abstract
In recent years, more and more information has appeared on the web. Extracting this information and converting it into a regular format has become important work. After examining a number of web sites, we found that most useful information is contained in web sources that hold a large number of similarly structured web documents. In this paper we therefore present an approach for discovering such useful information sources on the web and extracting information from them. We propose a method for discovering useful web information sources and a novel information extraction method. We also develop a prototype system, WIEAS (Web Information Extraction, Analysis And Services), to implement our idea, and use the information extracted by WIEAS to provide a variety of services.
Supported by the National Grand Fundamental Research 973 Program of China under Grant No. G1999032705 and by the National High Technology Development 863 Program of China under Grant No. 2002AA4Z3440.
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, L., Tang, S., Yang, D., Wang, T., Deng, Z., Su, Z. (2004). WIEAS: Helping to Discover Web Information Sources and Extract Data from Them. In: Yu, J.X., Lin, X., Lu, H., Zhang, Y. (eds) Advanced Web Technologies and Applications. APWeb 2004. Lecture Notes in Computer Science, vol 3007. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24655-8_79
DOI: https://doi.org/10.1007/978-3-540-24655-8_79
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21371-0
Online ISBN: 978-3-540-24655-8
eBook Packages: Springer Book Archive