A novel similarity measure for spatial entity resolution based on data granularity model: Managing inconsistencies in place descriptions

Applied Intelligence

Abstract

Tremendous amounts of data are generated every day by different sources and stored in heterogeneous databases. Providing an integrated view through data fusion is essential to enhance data utilization. Spatial data is an indispensable type of data, with diverse application domains including GIS, e-commerce, military, and tourism. The concept of location forms a key part of user-generated data and poses serious challenges, including uncertainty. A particular location may have different names and, conversely, different locations may share the same name. Furthermore, the geographical coordinates of locations may not be expressed accurately in datasets. Other challenges have received less attention. In particular, various data sources might describe locations at different levels of detail, which increases data inconsistency and decreases the quality of data fusion. This paper focuses on spatial data granulation to deal with this variety. If these differences in level of detail are not taken into consideration, different descriptions of the same location may be interpreted as different places and, in turn, not be fused. The contributions of this paper are: (a) a granular approach to measuring the similarity between two place descriptions, which manages apparent differences and improves the quality of the geocoding and data fusion phases; and (b) a novel data blocking method that uses geographical features to decrease the number of pairwise comparisons. For evaluation, we developed a dataset from two real aviation accident datasets. The evaluation shows that the quality of entity recognition and data fusion improved when our proposed data granulation technique was used.
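
The two ideas named in the abstract, granularity-aware similarity and blocking, can be illustrated with a minimal sketch. It is not the authors' method: the level names in `LEVELS`, the scoring rule, and the functions `granular_similarity`, `block_by`, and `candidate_pairs` are all hypothetical, chosen only to show how a coarse description can still match a finer one and how blocking on a geographical attribute reduces the number of pairwise comparisons.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical granularity levels, coarse to fine (an assumption for this
# sketch; the paper's actual granularity model may differ).
LEVELS = ("country", "region", "city", "place")


def _norm(value):
    """Normalize a level value for comparison; None stays None."""
    return value.strip().lower() if value else None


def granular_similarity(a, b):
    """Score two place descriptions given as dicts keyed by LEVELS.

    Levels are compared from coarse to fine. A level missing from either
    description stops the comparison without penalty, so a coarse
    description can still match a finer one. A conflicting level gives 0.0.
    The score is the fraction of levels on which both descriptions agree.
    """
    matched = 0
    for level in LEVELS:
        va, vb = _norm(a.get(level)), _norm(b.get(level))
        if va is None or vb is None:
            break
        if va != vb:
            return 0.0
        matched += 1
    return matched / len(LEVELS)


def block_by(records, level="country"):
    """Group records by one geographical level (the blocking key)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[_norm(rec.get(level))].append(rec)
    return blocks


def candidate_pairs(records, level="country"):
    """Yield only the pairs of records that fall into the same block."""
    for block in block_by(records, level).values():
        yield from combinations(block, 2)


records = [
    {"country": "Iran", "region": "Isfahan", "city": "Kashan", "place": "Tepe Sialk"},
    {"country": "Iran", "region": "Isfahan", "city": "Kashan"},
    {"country": "India", "region": "Uttar Pradesh", "city": "Agra", "place": "Taj Mahal"},
    {"country": "Iran", "region": "Isfahan", "city": "Isfahan", "place": "Naghsh-e Jahan Square"},
]

for left, right in candidate_pairs(records):
    print(left.get("place") or left["city"], "|",
          right.get("place") or right["city"], "|",
          granular_similarity(left, right))
```

With these four records, blocking on `country` leaves three candidate pairs instead of the six produced by comparing every record with every other. The city-level description of Kashan scores 0.75 against the full description of Tepe Sialk, because the three levels they share agree and the missing fourth level is not penalized.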

Notes

  1. https://www.kaggle.com/saurograndi/airplane-crashes-since-1908

  2. Taj Mahal is an ivory-white marble mausoleum in the Indian city of Agra (Wikipedia).

  3. Tepe Sialk is an ancient archeological site in a suburb of the city of Kashan, Isfahan Province, in Iran (Wikipedia).

  4. Naghsh-e Jahan Square is in Isfahan, Iran (Wikipedia).

  5. http://api.geonames.org/hierarchy

  6. https://www.wanderlust.co.uk/content/londons-around-the-world/

  7. Full list of Feature Classes and Feature codes: https://www.geonames.org/export/codes.html

  8. http://api.geonames.org/neighbours

  9. http://api.geonames.org/search

  10. https://download.geonames.org/export/dump/alternatenames

  11. http://www.movable-type.co.uk/scripts/latlong-vincenty.html
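
Note 11 points to Vincenty's inverse formula, which computes geodesic distance on an ellipsoid. As a simpler stand-in (not Vincenty's method), the sketch below uses the haversine formula on a sphere to show how a distance between two coordinate pairs can be obtained. The function name `haversine_km` and the constant `EARTH_RADIUS_KM` are assumptions made for this illustration, and the spherical result can differ slightly from the ellipsoidal one.

```python
import math

# Mean Earth radius in kilometres (spherical approximation, an assumption
# of this sketch; Vincenty's formula uses an ellipsoid instead).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the valid range.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# A quarter of the way around the equator.
print(round(haversine_km(0.0, 0.0, 0.0, 90.0), 1))
```

A distance of this kind could serve as one input to a similarity measure or to a blocking key, for example by treating two records as candidates only when they lie within a chosen radius of each other.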

Funding

Not applicable

Author information

Contributions

Mohammad Khodizadeh Nahari proposed the idea for solving the problem, implemented and tested the required algorithm, and drafted the manuscript.

Nasser Ghadiri defined the problem, designed the algorithm, reviewed the manuscript, and supervised the project.

Ahmad Baraani Dastjerdi assisted in defining the problem and designing the solution.

Jörg-R. Sack contributed significantly to the drafting of the manuscript and provided critical feedback.

Corresponding author

Correspondence to Nasser Ghadiri.

Ethics declarations

Conflict of interests

The authors declare that they have no conflict of interest.

Additional information

Availability of data and material:

A portion of the data has been uploaded to https://github.com/khodizadeh/GSDF. If needed, the entire dataset can be published.

Code availability:

Name of code: GSDF

Accessible on https://github.com/khodizadeh/GSDF, first available: 2020

Developer and contact address: M. Khodizadeh, m.khodizadeh@ec.iut.ac.ir

Hardware: CPU: Intel® Core™ i3, 2 GHz; RAM: 4 GB; HDD: 500 GB

Software: MS SQL Server 2016, Python 3.7. Programming languages: SQL, Python

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Khodizadeh-Nahari, M., Ghadiri, N., Baraani-Dastjerdi, A. et al. A novel similarity measure for spatial entity resolution based on data granularity model: Managing inconsistencies in place descriptions. Appl Intell 51, 6104–6123 (2021). https://doi.org/10.1007/s10489-020-01959-y

  • DOI: https://doi.org/10.1007/s10489-020-01959-y
