DOI: 10.1145/2390148.2390157
poster

Exploiting semantic annotations in math information retrieval

Published: 02 November 2012

ABSTRACT

This paper describes the exploitation of semantic annotations in the design and architecture of the MIaS (Math Indexer and Searcher) system for mathematics retrieval. Based on the claim that navigational and research search are 'killer' applications for a digital library such as the European Digital Mathematics Library (EuDML), we argue for an approach based on Natural Language Processing techniques, as used in corpus management systems such as the Sketch Engine, that will reach web scalability and avoid inference problems. The main ideas are 1) to augment surface texts (including math formulae) with additional linked representations bearing semantic information (expanded formulae as text, canonicalized text and subformulae) for indexing, including support for indexing structural information (expressed as Content MathML or other tree structures), and 2) to use semantic user preferences to order the documents found.
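As a rough illustration of idea 1), the sketch below derives extra index terms from a small Content MathML tree: one term per subformula, plus a canonicalized variant in which identifiers are renamed in order of first appearance. It is not the MIaS implementation. The function names, the flat string serialization, and the `id1`, `id2`, ... renaming scheme are assumptions made for this example, and the input is assumed to be Content MathML without an XML namespace.

```python
import copy
import xml.etree.ElementTree as ET


def serialize(node):
    """Flatten a tree into a compact string, e.g. 'apply(power,ci:b,cn:2)'.

    The string format is a hypothetical choice made for this sketch.
    """
    text = (node.text or "").strip()
    label = f"{node.tag}:{text}" if text else node.tag
    children = list(node)
    if not children:
        return label
    return label + "(" + ",".join(serialize(c) for c in children) + ")"


def canonicalize(node):
    """Return a copy with <ci> identifiers renamed id1, id2, ... by first use."""
    clone = copy.deepcopy(node)
    names = {}
    for el in clone.iter("ci"):
        name = (el.text or "").strip()
        names.setdefault(name, f"id{len(names) + 1}")
        el.text = names[name]
    return clone


def index_terms(mathml):
    """Collect surface and canonicalized terms for every subformula."""
    root = ET.fromstring(mathml)
    terms = set()
    for sub in root.iter():  # the formula itself and each of its subtrees
        terms.add(serialize(sub))
        terms.add(serialize(canonicalize(sub)))
    return terms


# a + b^2 in (namespace-free) Content MathML
FORMULA = (
    "<apply><plus/><ci>a</ci>"
    "<apply><power/><ci>b</ci><cn>2</cn></apply></apply>"
)
# x^2, a query that uses a different variable name
QUERY = "<apply><power/><ci>x</ci><cn>2</cn></apply>"

if __name__ == "__main__":
    shared = index_terms(FORMULA) & index_terms(QUERY)
    print(sorted(shared))
```

In this toy setup the query x^2 shares the canonicalized term for the subformula b^2 with the indexed formula, even though the variable names differ. This is the kind of match that the additional representations are meant to make possible.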

The semantic enhancements of the MIaS system are being implemented as a math-aware search engine based on the state-of-the-art system Apache Lucene, with support for [MathML] tree indexing. Scalability issues have been checked against more than 400,000 arXiv documents.

