Probabilistic Model for Structured Document Mapping

Wisniewski, Guillaume; Maes, Francis; Denoyer, Ludovic; Gallinari, Patrick

doi:10.1007/978-3-540-73499-4_64

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4571)

Included in the following conference series:

International Workshop on Machine Learning and Data Mining in Pattern Recognition

Abstract

We address the problem of automatically learning to map heterogeneous semi-structured documents onto a mediated target XML schema. We adopt a machine learning approach in which the mapping between input and target documents is learned from a training corpus of documents. We first introduce a general stochastic model of semi-structured document generation and transformation. This model relies on the concept of a meta-document, a latent variable that provides a link between input and target documents. It allows us to learn the correspondences when the input documents are expressed in a large variety of schemas. We then detail an instance of the general model for the particular task of HTML to XML conversion. This instance is tested on three different corpora using two different inference methods: a dynamic programming method and an approximate LaSO-based method.
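
The abstract does not give the details of the dynamic programming inference, so the following is only a minimal illustrative sketch and not the authors' model. It assumes a strongly simplified setting in which each content node of an HTML page is observed through its tag name, each node receives one target XML label, and a label depends only on the previous one. Under those assumptions, exact inference reduces to the standard Viterbi recursion. The label set, the tag names, all probabilities, and the names `viterbi`, `to_log` and `UNSEEN` are hypothetical.

```python
import math

# Hypothetical log-score for an (HTML tag, label) pair missing from the toy model.
UNSEEN = math.log(1e-6)


def to_log(table):
    """Convert a dict of probabilities into a dict of log-probabilities."""
    return {key: math.log(p) for key, p in table.items()}


def viterbi(obs, labels, log_start, log_trans, log_emit):
    """Return the highest-scoring label sequence for the observations `obs`."""
    if not obs:
        return []
    # best[i][y]: best log-score of any labeling of obs[:i + 1] that ends in y.
    best = [{y: log_start[y] + log_emit[y].get(obs[0], UNSEEN) for y in labels}]
    back = []  # back[i][y]: best predecessor of label y at position i + 1
    for o in obs[1:]:
        scores, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[-1][p] + log_trans[p][y])
            scores[y] = (best[-1][prev] + log_trans[prev][y]
                         + log_emit[y].get(o, UNSEEN))
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy model: every number below is made up for illustration.
labels = ["title", "author", "para"]
log_start = to_log({"title": 0.8, "author": 0.1, "para": 0.1})
log_trans = {
    "title": to_log({"title": 0.1, "author": 0.6, "para": 0.3}),
    "author": to_log({"title": 0.1, "author": 0.3, "para": 0.6}),
    "para": to_log({"title": 0.1, "author": 0.1, "para": 0.8}),
}
log_emit = {
    "title": to_log({"h1": 0.7, "b": 0.2, "p": 0.1}),
    "author": to_log({"i": 0.6, "b": 0.3, "p": 0.1}),
    "para": to_log({"p": 0.8, "i": 0.1, "b": 0.1}),
}

html_tags = ["h1", "i", "p", "p"]  # tag names of the content nodes, in document order
print(viterbi(html_tags, labels, log_start, log_trans, log_emit))
```

The sketch labels a flat sequence only. Mapping onto a target schema also requires building the tree structure of the output document, which this example does not attempt.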

Author information

Authors and Affiliations

LIP6, University of Paris 6, 104 avenue du Président Kennedy, 75015 Paris
Guillaume Wisniewski, Francis Maes, Ludovic Denoyer & Patrick Gallinari

Editor information

Petra Perner

About this paper

Cite this paper

Wisniewski, G., Maes, F., Denoyer, L., Gallinari, P. (2007). Probabilistic Model for Structured Document Mapping. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2007. Lecture Notes in Computer Science, vol 4571. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73499-4_64

DOI: https://doi.org/10.1007/978-3-540-73499-4_64
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73498-7
Online ISBN: 978-3-540-73499-4
eBook Packages: Computer Science, Computer Science (R0)
