Abstract
We present TexSpaSearch, a search engine for text documents that have associated locations in space. We define three search queries, denoted Q1(t), Q2(t, r), and Q3(p, r), for finding documents containing text t that intersect a disc centered at position p with radius r. Testing used the UNB Connell Memorial Herbarium database, whose records normally contain the location where each plant specimen was collected along with associated textual data. The sample herbarium database of \(N = 40,791\) records with associated locations was indexed using a novel R*-tree and suffix tree data structure to achieve efficient search for the defined queries. Significant preprocessing was required to transform the database into the index data structure used by TexSpaSearch. We compared TexSpaSearch to a Google Search Appliance using 20 example Q1 text-only queries, and also tested a significant number of example Q2 and Q3 queries. Search results are ranked by a modified Lucene scoring algorithm, combined with a spatial rank for Q2 search. A theoretical analysis shows that TexSpaSearch requires \(O(A^{2}\overline{|b|})\) average time for Q1 search, where A is the number of single words in the query string t and \(\overline{|b|}\) is the average length of a subphrase in t. Q2 and Q3 queries require \(O(A^{2}\overline{|b|} + Z\log _{\mathcal {M}}\mathcal {D}_N + y)\) and \(O(\log _{\mathcal {M}}\mathcal {D}_N + y)\) time, respectively, where Z is the number of point records in the list \(\mathcal {P}\) of text search results, \(\mathcal {D}_N\) is the number of data objects indexed in the R*-tree for N records, \(\mathcal {M}\) is the maximum number of entries of an interior node in the R*-tree, and y is the number of R*-tree leaf nodes found in range in a Q3 query.
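One way to read the \(A^{2}\) factor in the Q1 bound is sketched below. The sketch assumes, which the abstract does not state, that a subphrase means a contiguous run of words in t and that each subphrase is looked up once. A query of A words then has A(A+1)/2 subphrases, and matching one of average length \(\overline{|b|}\) against a suffix tree costs time proportional to its length. The function name `subphrases` and the sample query are hypothetical and are not taken from TexSpaSearch.

```python
def subphrases(query):
    """Return every contiguous run of words in the query, longest first.

    Assumption: a "subphrase" is a contiguous word sequence; the actual
    TexSpaSearch definition may differ.
    """
    words = query.split()
    a = len(words)
    out = []
    for length in range(a, 0, -1):            # longest phrases first
        for start in range(a - length + 1):   # slide a window over the words
            out.append(" ".join(words[start:start + length]))
    return out


phrases = subphrases("red maple leaf")        # hypothetical query, A = 3
count = len(phrases)                          # A * (A + 1) / 2 subphrases
```

For this three-word query the enumeration yields six subphrases, which grows quadratically in A and matches the shape of the stated bound under the assumptions above.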
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Han, A., Nickerson, B.G. (2015). Efficient Combined Text and Spatial Search. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2015. ICCSA 2015. Lecture Notes in Computer Science, vol 9157. Springer, Cham. https://doi.org/10.1007/978-3-319-21470-2_52
DOI: https://doi.org/10.1007/978-3-319-21470-2_52
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21469-6
Online ISBN: 978-3-319-21470-2
eBook Packages: Computer Science, Computer Science (R0)