DOI: 10.1145/2676536.2676542
research-article

Partitioning strategies for spatio-textual similarity join

Published: 04 November 2014

Abstract

Given a collection of geo-tagged objects with associated textual descriptors, the spatio-textual similarity join (STJoin) problem is to identify all pairs of similar objects that are close in distance. This task, which is useful in localized recommendations and other applications, is challenging because the cost of computing the join grows super-linearly with the size of the collection. In this paper, we explore partitioning strategies for tackling STJoin. One approach is to start with a spatial data structure, traverse its regions, and apply All-Pairs, an existing algorithm for identifying similar pairs of textual documents. An alternative approach is to construct a global index but partition postings spatially and modify the All-Pairs algorithm to prune candidates based on distance. We evaluate these approaches on two real-world datasets and find that, when running in a single thread, both approaches are comparable in terms of performance. However, a multi-threaded implementation of the global index approach achieves far better speedup, because it can parallelize at a finer granularity and so avoid skewed distributions in task sizes. In addition to using All-Pairs as the underlying textual similarity join algorithm, we also explored an alternative algorithm known as PPJ. The findings are consistent across both, which suggests that load balancing is a fundamental issue affecting parallel implementations of STJoin algorithms.
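
To make the problem statement concrete, the following is a minimal Python sketch of an STJoin that partitions space first and then compares texts within nearby regions. It is not the paper's implementation. The abstract does not name the similarity measure or the spatial structure, so Jaccard similarity over token sets, Euclidean distance, a uniform grid, and the function names `stjoin_brute_force` and `stjoin_grid` are all assumptions made for illustration. The per-cell text comparison is a naive loop standing in for All-Pairs.

```python
from collections import defaultdict
from itertools import combinations
from math import floor, hypot


def jaccard(a, b):
    """Jaccard similarity of two token sets (an assumed text similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_match(p, q, max_dist, min_sim):
    """True if two (x, y, tokens) objects are close in space and similar in text."""
    (x1, y1, t1), (x2, y2, t2) = p, q
    return hypot(x1 - x2, y1 - y2) <= max_dist and jaccard(t1, t2) >= min_sim


def stjoin_brute_force(objects, max_dist, min_sim):
    """Reference join: test every pair, which is quadratic in the collection size."""
    return {
        (i, j)
        for i, j in combinations(range(len(objects)), 2)
        if is_match(objects[i], objects[j], max_dist, min_sim)
    }


def stjoin_grid(objects, max_dist, min_sim):
    """Spatial-partition-first join over a uniform grid (hypothetical stand-in
    for the spatial data structure; requires max_dist > 0)."""
    # The cell side equals max_dist, so any matching pair lies in the same
    # cell or in two adjacent cells.
    cells = defaultdict(list)
    for idx, (x, y, _) in enumerate(objects):
        cells[(floor(x / max_dist), floor(y / max_dist))].append(idx)

    result = set()
    for (cx, cy), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                neighbours = cells.get((cx + dx, cy + dy), [])
                for i in members:
                    for j in neighbours:
                        # i < j reports each unordered pair exactly once.
                        if i < j and is_match(objects[i], objects[j], max_dist, min_sim):
                            result.add((i, j))
    return result


# Toy data: (x, y, token set). Values are invented for illustration only.
objects = [
    (0.0, 0.0, {"coffee", "shop", "wifi"}),
    (0.5, 0.2, {"coffee", "shop", "bakery"}),
    (0.9, 0.9, {"book", "store"}),
    (5.0, 5.0, {"coffee", "shop", "wifi"}),
    (1.2, 0.1, {"coffee", "wifi", "shop"}),
]

pairs = stjoin_grid(objects, max_dist=1.0, min_sim=0.5)
print(sorted(pairs))  # [(0, 1), (1, 4)]
```

In this sketch each grid cell is a natural unit of work for a thread. On real data some cells would hold far more objects than others, which is one way the skew in task sizes mentioned in the abstract can arise.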


    Published In

    BigSpatial '14: Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data
    November 2014
    69 pages
    ISBN:9781450331326
    DOI:10.1145/2676536

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. geotagged data
    2. indexing
    3. spatio-textual similarity join

    Qualifiers

    • Research-article

    Conference

    SIGSPATIAL '14

    Acceptance Rates

    BigSpatial '14 Paper Acceptance Rate 8 of 13 submissions, 62%;
    Overall Acceptance Rate 32 of 58 submissions, 55%

    Cited By

    • (2025) Efficient Parallel Processing of Semantic Trajectory Similarity Joins. IEEE Internet of Things Journal 12:4 (3534-3548). DOI: 10.1109/JIOT.2024.3427676. Online publication date: 15-Feb-2025
    • (2023) A distributed framework for large-scale semantic trajectory similarity join. Multimedia Tools and Applications 83:6 (16205-16229). DOI: 10.1007/s11042-023-15236-w. Online publication date: 13-Jul-2023
    • (2021) Feat-SKSJ. Proceedings of the 29th International Conference on Advances in Geographic Information Systems (15-24). DOI: 10.1145/3474717.3483629. Online publication date: 2-Nov-2021
    • (2021) Efficient Spatio-Textual Similarity Join Processing on NUMA Systems. 2021 22nd IEEE International Conference on Mobile Data Management (MDM) (79-88). DOI: 10.1109/MDM52706.2021.00022. Online publication date: Jun-2021
    • (2021) Location- and keyword-based querying of geo-textual data: a survey. The VLDB Journal — The International Journal on Very Large Data Bases 30:4 (603-640). DOI: 10.1007/s00778-021-00661-w. Online publication date: 30-Mar-2021
    • (2020) NUMA-Aware Spatio-Textual Similarity Join. Proceedings of the 28th International Conference on Advances in Geographic Information Systems (139-142). DOI: 10.1145/3397536.3422227. Online publication date: 3-Nov-2020
    • (2020) Parallel Semantic Trajectory Similarity Join. 2020 IEEE 36th International Conference on Data Engineering (ICDE) (997-1008). DOI: 10.1109/ICDE48307.2020.00091. Online publication date: Apr-2020
    • (2018) Spatio-textual user matching and clustering based on set similarity joins. The VLDB Journal — The International Journal on Very Large Data Bases 27:3 (297-320). DOI: 10.1007/s00778-018-0498-5. Online publication date: 1-Jun-2018
    • (2017) Clue-based spatio-textual query. Proceedings of the VLDB Endowment 10:5 (529-540). DOI: 10.14778/3055540.3055546. Online publication date: 1-Jan-2017
    • (2017) Cultural Heritage Routing. Journal on Computing and Cultural Heritage 10:4 (1-20). DOI: 10.1145/3040200. Online publication date: 31-Jul-2017
