An Approach to Mathematical Search Through Query Formulation and Data Normalization

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4573)

Abstract

This article describes an approach to searching for mathematical notation. The approach aims at a search system that can be deployed effectively and economically, and that produces good results with a large portion of the mathematical content freely available on the World Wide Web today. The basic concept is to linearize mathematical notation as a sequence of text tokens, which are then indexed by a traditional text search engine. However, naive generalization of the "phrase query" of text search to mathematical expressions performs poorly. For adequate precision and recall in the mathematical context, more complex combinations of atomic queries are required. Our approach is to query for a weighted collection of significant subexpressions, where weights depend on expression complexity, nesting depth, expression length, and special boosting of well-known expressions.
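The linearization and weighted-subexpression idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Node` tree type, the prefix-style token format, and the weight formula (token count scaled by a per-level `depth_decay`) are hypothetical and are not taken from the paper, which also weights by complexity and boosts well-known expressions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A node in a simplified expression tree (hypothetical data model)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def linearize(node: Node) -> List[str]:
    """Flatten a tree into prefix-order text tokens with explicit brackets."""
    if not node.children:
        return [node.label]
    tokens = [node.label, "("]
    for child in node.children:
        tokens.extend(linearize(child))
    tokens.append(")")
    return tokens


def weighted_subexpressions(
    node: Node, depth: int = 0, depth_decay: float = 0.5
) -> List[Tuple[str, float]]:
    """Return (phrase, weight) pairs for every non-leaf subexpression.

    Illustrative weight only: longer phrases score higher, and each
    level of nesting multiplies the weight by depth_decay.
    """
    if not node.children:
        return []
    tokens = linearize(node)
    pairs = [(" ".join(tokens), len(tokens) * depth_decay ** depth)]
    for child in node.children:
        pairs.extend(weighted_subexpressions(child, depth + 1, depth_decay))
    return pairs


# Example expression: x^2 + 1
expr = Node("plus", [Node("power", [Node("x"), Node("2")]), Node("1")])
for phrase, weight in weighted_subexpressions(expr):
    print(f"{weight:4.1f}  {phrase}")
```

Each phrase could then be submitted to a text search engine as a boosted phrase query, so that a document matching only the inner `power ( x 2 )` subexpression still scores, but lower than one matching the full expression.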

To make this approach perform well with the technical content that is readily obtainable on the World Wide Web, either directly or through conversion, it is necessary to extensively normalize mathematical expression data to eliminate accidental or irrelevant encoding differences. To do this, a multi-pass normalization process is applied. In successive stages, MathML and XML errors are corrected, character data is canonicalized, white space and other insignificant data are removed, and heuristics are applied to disambiguate expressions. Following these preliminary stages, the MathML tree structure is canonicalized via an augmented precedence parsing step. Finally, mathematical synonyms and some variable names are canonicalized.
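A multi-pass normalization of this kind can be sketched as a list of small tree transformations applied in order. The sketch below covers only three of the stages (white space removal, dropping spacing elements, and character and synonym canonicalization). It assumes namespace-free presentation MathML, and the pass names and the `SYNONYMS` table are hypothetical. Error correction, disambiguation heuristics, and the precedence parsing step are not attempted.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical synonym table: different encodings of the same operator.
SYNONYMS = {"\u2212": "-", "\u22c5": "*", "\u00d7": "*"}


def strip_whitespace(root: ET.Element) -> None:
    """Trim token text and discard inter-element white space."""
    for node in root.iter():
        node.text = (node.text or "").strip() or None
        node.tail = None


def drop_spacing(root: ET.Element) -> None:
    """Remove <mspace> elements, which carry layout but no meaning."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "mspace":
                parent.remove(child)


def canonicalize_tokens(root: ET.Element) -> None:
    """Apply Unicode NFKC folding, then map synonyms to one spelling."""
    for node in root.iter():
        if node.text:
            text = unicodedata.normalize("NFKC", node.text)
            node.text = SYNONYMS.get(text, text)


PASSES = [strip_whitespace, drop_spacing, canonicalize_tokens]


def normalize(mathml: str) -> str:
    """Run every pass in order and serialize the result."""
    root = ET.fromstring(mathml)
    for apply_pass in PASSES:
        apply_pass(root)
    return ET.tostring(root, encoding="unicode")


messy = "<math><mrow> <mi> x </mi> <mspace/> <mo>\u2212</mo> <mn>1</mn> </mrow></math>"
clean = "<math><mrow><mi>x</mi><mo>-</mo><mn>1</mn></mrow></math>"
print(normalize(messy) == normalize(clean))
```

Keeping each stage as a separate function makes the order explicit, so a later pass can rely on the guarantees of earlier ones. Here `canonicalize_tokens` assumes token text has already been trimmed.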

Editor information

Manuel Kauers Manfred Kerber Robert Miner Wolfgang Windsteiger

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Miner, R., Munavalli, R. (2007). An Approach to Mathematical Search Through Query Formulation and Data Normalization. In: Kauers, M., Kerber, M., Miner, R., Windsteiger, W. (eds) Towards Mechanized Mathematical Assistants. MKM 2007, Calculemus 2007. Lecture Notes in Computer Science (LNAI), vol 4573. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73086-6_27

  • DOI: https://doi.org/10.1007/978-3-540-73086-6_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-73083-5

  • Online ISBN: 978-3-540-73086-6

  • eBook Packages: Computer Science, Computer Science (R0)
