Transforming Large Collections of Scientific Publications to XML

Stamerjohanns, Heinrich; Kohlhase, Michael; Ginev, Deyan; David, Catalin; Miller, Bruce

doi:10.1007/s11786-010-0024-7

Published: 27 February 2010

Volume 3, pages 299–307 (2010)
Mathematics in Computer Science

Heinrich Stamerjohanns¹, Michael Kohlhase¹, Deyan Ginev¹, Catalin David¹ & Bruce Miller²

Abstract

We describe an experiment transforming large collections of LaTeX documents to more machine-understandable representations. Concretely, we are translating the collection of scientific publications of the Cornell e-Print Archive (arχiv) using LaTeXML, a LaTeX to XML converter currently under development. While the long-term goal is a large body of scientific documents available for semantic analysis, search indexing and other experimentation, the immediate goals are tools for creating such corpora. The first task of our arXMLiv project is to develop LaTeXML bindings for the (thousands of) LaTeX classes and packages used in the arχiv collection, as well as methods for coping with the eccentricities that TeX encourages. We have created a distributed build system that runs LaTeXML over the collection, in part or entirely, while collecting statistics about missing bindings and other errors. This guides debugging and development efforts, leading to iterative improvements in both the tools and the quality of the converted corpus. The build system thus serves as both a production conversion engine and software test harness. We have now processed the complete arχiv collection through 2006 consisting of more than 400,000 documents (a complete run is a processor-year-size undertaking), continuously improving our success rate. We are now able to convert more than 90% of these documents to XHTML+MathML. We consider over 60% to be successes, converted with no or minor warnings. While the remaining 30% can also be converted, their quality is doubtful, due to unsupported macros or conversion errors.
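The kind of bookkeeping the abstract describes (tallying conversion outcomes and missing bindings across a run) might be sketched as follows. This is only an illustration under stated assumptions: the `ConversionResult` record, the `classify` and `summarize` helpers, the status names, and the warning threshold are all hypothetical and are not taken from the arXMLiv build system or from LaTeXML's actual log format.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class ConversionResult:
    """Hypothetical per-document record (not the project's real log format)."""
    doc_id: str
    converted: bool          # did the converter produce any output at all?
    warnings: int = 0        # number of minor warnings reported
    errors: int = 0          # number of conversion errors reported
    missing_bindings: List[str] = field(default_factory=list)


def classify(result: ConversionResult, max_warnings: int = 5) -> str:
    """Assign a coarse status; the threshold of 5 warnings is an assumption."""
    if not result.converted:
        return "failed"
    if result.errors or result.missing_bindings or result.warnings > max_warnings:
        return "doubtful"
    return "success"


def summarize(results: Iterable[ConversionResult]) -> Tuple[Counter, Counter]:
    """Return (status counts, missing-binding counts) for a batch of results."""
    results = list(results)
    statuses = Counter(classify(r) for r in results)
    missing = Counter(b for r in results for b in r.missing_bindings)
    return statuses, missing


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        ConversionResult("doc-1", converted=True, warnings=1),
        ConversionResult("doc-2", converted=True, missing_bindings=["foo.sty"]),
        ConversionResult("doc-3", converted=True, errors=2,
                         missing_bindings=["foo.sty", "bar.cls"]),
        ConversionResult("doc-4", converted=False, errors=1),
    ]
    statuses, missing = summarize(sample)
    for status, count in statuses.most_common():
        print(f"{status}: {count} ({100 * count / len(sample):.0f}%)")
    # The most frequently missing bindings suggest where development effort pays off.
    for name, count in missing.most_common():
        print(f"missing binding {name}: {count} document(s)")
```

Ranking missing bindings by frequency reflects the feedback loop mentioned in the abstract, where run statistics guide which bindings to develop next; the actual prioritization used by the project is not described here.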

Author information

Authors and Affiliations

Department of Computer Science, Jacobs University, Bremen, Germany
Heinrich Stamerjohanns, Michael Kohlhase, Deyan Ginev & Catalin David
National Institute of Standards and Technology, Gaithersburg, MD, USA
Bruce Miller

Corresponding author

Correspondence to Heinrich Stamerjohanns.

About this article

Cite this article

Stamerjohanns, H., Kohlhase, M., Ginev, D. et al. Transforming Large Collections of Scientific Publications to XML. Math.Comput.Sci. 3, 299–307 (2010). https://doi.org/10.1007/s11786-010-0024-7

Received: 26 January 2009
Revised: 30 July 2009
Accepted: 08 January 2010
Published: 27 February 2010
Issue Date: May 2010
DOI: https://doi.org/10.1007/s11786-010-0024-7
