Supervised learning for the legacy document conversion

DOI: 10.1145/1030397.1030439
Published: 28 October 2004

Abstract

We consider the problem of converting documents from rendering-oriented HTML markup into semantic-oriented XML annotation defined by user-specific DTDs or XML Schema descriptions. We represent both source and target documents as rooted ordered trees, so the conversion can be achieved by applying a set of tree transformations. We apply a supervised learning framework to the conversion task, in which the tree transformations are learned from a set of training examples. Because of the complexity of tree-to-tree transformations, we develop a two-step approach to the conversion problem that first labels the leaves of the source trees and then recomposes target trees from the leaf labels. We present two solutions, based on classifying leaves with target terminals and with target paths. Moreover, we develop three methods for the leaf classification. All methods and solutions have been tested on two real collections.
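The two-step idea can be illustrated with a small Python sketch. It is not the method from the paper: the abstract does not describe the three classifiers or the recomposition procedure, so the sketch substitutes a lookup table for the learned leaf classifier and a simple "reuse the most recent matching inner node" heuristic for recomposition. The function names, the tag-to-path table, and the sample leaves are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a learned leaf classifier: it maps the HTML tag
# that wraps a leaf to a target path (relative to the target root).
TAG_TO_PATH = {
    "h1": "title",
    "b": "author/name",
    "i": "author/affiliation",
    "p": "section/para",
}


def label_leaves(source_leaves, classify):
    """Step 1: assign a target path to every source leaf, in document order."""
    return [(classify(html_tag), text) for html_tag, text in source_leaves]


def recompose(labeled_leaves, root_tag):
    """Step 2: rebuild a target tree from (path, text) pairs.

    Assumed heuristic: an inner node is reused when it is the most recently
    added child of its parent and has the required tag; otherwise a new
    inner node is opened. Terminal nodes are always created anew.
    """
    root = ET.Element(root_tag)
    for path, text in labeled_leaves:
        node = root
        steps = path.split("/")
        for tag in steps[:-1]:
            last = node[-1] if len(node) else None
            if last is None or last.tag != tag:
                last = ET.SubElement(node, tag)
            node = last
        ET.SubElement(node, steps[-1]).text = text
    return root


# Source leaves as (enclosing HTML tag, text), in document order.
source_leaves = [
    ("h1", "A sample title"),
    ("b", "A. Author"),
    ("i", "Some Lab"),
    ("p", "First paragraph."),
]

labeled = label_leaves(source_leaves, TAG_TO_PATH.get)
xml_text = ET.tostring(recompose(labeled, "article"), encoding="unicode")
print(xml_text)
```

In this sketch the name and affiliation leaves end up under a single `author` element because their paths share a prefix and arrive consecutively. A real system would need a recomposition rule that also respects the target DTD or schema.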


Published In

DocEng '04: Proceedings of the 2004 ACM symposium on Document engineering
October 2004
252 pages
ISBN: 1581139381
DOI: 10.1145/1030397

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. XML markup
  2. legacy document conversion
  3. machine learning

Qualifiers

  • Article

Conference

DocEng04: ACM Symposium on Document Engineering
October 28 - 30, 2004
Milwaukee, Wisconsin, USA

Acceptance Rates

Overall Acceptance Rate 194 of 564 submissions, 34%

Cited By

  • (2011) LAG. Future Generation Computer Systems 27:1 (32-39). DOI: 10.1016/j.future.2010.07.004. Online publication date: 1-Jan-2011
  • (2010) Evolution of XPath lists for document data selection. Proceedings of the 11th international conference on Parallel problem solving from nature: Part II (341-350). DOI: 10.5555/1887255.1887293. Online publication date: 11-Sep-2010
  • (2010) Evolution of XPath Lists for Document Data Selection. Parallel Problem Solving from Nature, PPSN XI (341-350). DOI: 10.1007/978-3-642-15871-1_35. Online publication date: 2010
  • (2007) From layout to semantic. Large Scale Semantic Access to Content (Text, Image, Video, and Sound) (433-448). DOI: 10.5555/1931390.1931432. Online publication date: 30-May-2007
  • (2007) XML Structure Mapping. Comparative Evaluation of XML Information Retrieval Systems (540-551). DOI: 10.1007/978-3-540-73888-6_49. Online publication date: 2007
  • (2007) A Taxonomy for XML Retrieval Use Cases. Comparative Evaluation of XML Information Retrieval Systems (413-422). DOI: 10.1007/978-3-540-73888-6_39. Online publication date: 2007
  • (2007) Probabilistic Model for Structured Document Mapping. Proceedings of the 5th international conference on Machine Learning and Data Mining in Pattern Recognition (854-867). DOI: 10.1007/978-3-540-73499-4_64. Online publication date: 18-Jul-2007
