ABSTRACT
This paper describes the exploitation of semantic annotations in the design and architecture of the MIaS (Math Indexer and Searcher) system for mathematics retrieval. Based on the claim that navigational and research search are 'killer' applications for digital libraries such as the European Digital Mathematics Library (EuDML), we argue for an approach based on Natural Language Processing techniques, as used in corpus management systems such as the Sketch Engine, that will reach web scalability and avoid inference problems. The main ideas are 1) to augment surface texts (including math formulae) with additional linked representations bearing semantic information (expanded formulae as text, canonicalized text, and subformulae) for indexing, including support for indexing structural information (expressed as Content MathML or other tree structures), and 2) to use semantic user preferences to order the documents found.
The semantic enhancements of the MIaS system are being implemented as a math-aware search engine based on the state-of-the-art system Apache Lucene, with support for MathML tree indexing. Scalability has been checked against more than 400,000 arXiv documents.
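The idea of indexing a formula together with its subformulae can be illustrated with a minimal sketch. It is not the MIaS implementation: the function names `canonical`, `subformulae`, and `index_terms` are hypothetical, the input is assumed to be a small namespace-free Presentation MathML string, and the only canonicalization performed is whitespace stripping. Each subtree is emitted with its nesting depth, which an indexer could use, for example, to weight matches.

```python
import xml.etree.ElementTree as ET


def canonical(node):
    """Serialize a MathML node to a compact string without whitespace."""
    tag = node.tag.split('}')[-1]  # drop an XML namespace prefix, if any
    text = (node.text or '').strip()
    children = ''.join(canonical(child) for child in node)
    return f"<{tag}>{text}{children}</{tag}>"


def subformulae(node, depth=0):
    """Yield (depth, canonical string) for the node and every descendant."""
    yield depth, canonical(node)
    for child in node:
        yield from subformulae(child, depth + 1)


def index_terms(mathml):
    """Parse a MathML string and list all subformulae in pre-order."""
    root = ET.fromstring(mathml)
    return list(subformulae(root))


# Hypothetical example formula: a^2 + b
formula = (
    "<math><mrow><msup><mi>a</mi><mn>2</mn></msup>"
    "<mo>+</mo><mi>b</mi></mrow></math>"
)
terms = index_terms(formula)
for depth, term in terms:
    print(depth, term)
```

Every element of the tree, from the whole formula down to single identifiers, becomes a candidate index term, so a query for a small subexpression can still match a larger formula that contains it.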
Index Terms
- Exploiting semantic annotations in math information retrieval