Abstract
Pattern discovery has become a fundamental technique for modern information extraction tasks. This paper presents a new two-phase pattern (2PP) discovery technique for information extraction. 2PP consists of orthographic pattern discovery (OPD) and semantic pattern discovery (SPD). OPD determines the structural features of an identified region of a document, and SPD discovers a dominant semantic pattern for the region via inference, apposition, and analogy. 2PP then applies the discovered pattern back to the region to extract the required data items through pattern matching. Experimental evaluation on a large number of identified regions indicates that our 2PP technique achieves effective results.
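As a rough illustration of the orthographic phase and the final matching step, the sketch below maps each line of a region to a coarse character-class "shape", picks the most frequent shape as the dominant pattern, and then matches that pattern back against the region. This is not the authors' OPD algorithm, which the abstract does not specify; the shape alphabet (`A`, `a`, `9`), the function names, and the sample region are all assumptions made for illustration, and the semantic phase (SPD) is not covered.

```python
from collections import Counter


def shape(token: str) -> str:
    """Map a token to a coarse orthographic shape.

    Assumed alphabet: 'A' = uppercase run, 'a' = lowercase run,
    '9' = digit run; any other character is kept as-is.
    Consecutive characters of the same class are collapsed.
    """
    out = []
    for ch in token:
        if ch.isupper():
            cls = "A"
        elif ch.islower():
            cls = "a"
        elif ch.isdigit():
            cls = "9"
        else:
            cls = ch
        if not out or out[-1] != cls:
            out.append(cls)
    return "".join(out)


def line_pattern(line: str) -> str:
    """Orthographic pattern of a whole line: token shapes joined by spaces."""
    return " ".join(shape(tok) for tok in line.split())


def dominant_pattern(region: list[str]) -> str:
    """Most frequent line pattern among the non-empty lines of a region."""
    counts = Counter(line_pattern(line) for line in region if line.strip())
    return counts.most_common(1)[0][0]


def extract(region: list[str], pattern: str) -> list[str]:
    """Apply the pattern back to the region and keep the matching lines."""
    return [line for line in region if line_pattern(line) == pattern]


# Hypothetical identified region: three records and one stray line.
region = [
    "Smith, J. 2001",
    "Jones, A. 1999",
    "See also the appendix",
    "Brown, K. 2003",
]

pattern = dominant_pattern(region)
records = extract(region, pattern)
```

Here the three record lines share one shape while the stray line does not, so matching the dominant pattern against the region keeps only the records. A real system would need a richer feature set and a tolerance for near matches; exact equality of shapes is used here only to keep the sketch short.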
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ma, L., Shepherd, J. (2004). Information Extraction via Automatic Pattern Discovery in Identified Region. In: Galindo, F., Takizawa, M., Traunmüller, R. (eds) Database and Expert Systems Applications. DEXA 2004. Lecture Notes in Computer Science, vol 3180. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30075-5_23
DOI: https://doi.org/10.1007/978-3-540-30075-5_23
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22936-0
Online ISBN: 978-3-540-30075-5
eBook Packages: Springer Book Archive