
Linking place records using multi-view encoders

Original Article · Neural Computing and Applications

A Correction to this article was published on 16 May 2021.

Abstract

Extracting information about Web entities has become commonplace in academia and industry alike. In particular, data about places stand out as rich sources of geolocalized information and spatial context, serving as a foundation for a wide range of applications. Such data, however, are inherently noisy and suffer from several issues, such as data replication. In this work, we detect replicated places using a deep-learning model, named PlacERN, that relies on multi-view encoders. These encoders learn different representations from distinct information levels of a place, using intermediate mappings and non-linearities; the representations are then compared to predict whether a pair of places is a duplicate. We also show how this model can solve the place linkage problem in an end-to-end fashion by fitting it into a pipeline. PlacERN is evaluated on two distinct datasets containing missing values and high class imbalance. The results show that: (1) PlacERN is effective at place deduplication, even on such challenging datasets; and (2) it outperforms previous place deduplication approaches as well as competitive algorithms, namely Random Forest and LightGBM with pairwise features, on both datasets across different metrics (F-score, Gini coefficient, and area under the precision-recall curve).




Notes

  1. https://www.factual.com/data-set/global-places/.

  2. https://www.safegraph.com/.

  3. https://www.tripadvisor.com/.

  4. https://www.swarmapp.com/.

  5. https://www.inloco.com.br/solutions.

  6. https://schema.org/.

  7. https://colab.research.google.com.

  8. https://github.com/facebookresearch/fastText/blob/master/reduce_model.py.

References

  1. Akiba T, Sano S, Yanase T, Ohta T, Koyama M (2019) Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery and data mining. Association for Computing Machinery, New York, NY, USA, pp 2623–2631. https://doi.org/10.1145/3292500.3330701

  2. Barbosa L (2018) Learning representations of web entities for entity resolution. Int J Web Inf Syst 15(3):246–256. https://doi.org/10.1108/ijwis-07-2018-0059


  3. Berjawi B (2017) Integration of heterogeneous data from multiple location-based services providers: a use case on tourist points of interest. Ph.D. thesis, École doctorale d'informatique et mathématique de Lyon

  4. Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146


  5. Breiman L (2001) Random forests. Mach Learn 45(1):5–32. https://doi.org/10.1023/a:1010933404324


  6. Buscaldi D (2009) Toponym ambiguity in geographical information retrieval. In: Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval, SIGIR’09. Association for Computing Machinery, New York, NY, USA, p 847. https://doi.org/10.1145/1571941.1572168

  7. Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1724–1734. https://doi.org/10.3115/v1/D14-1179

  8. Cohen WW, Ravikumar P, Fienberg SE (2003) A comparison of string metrics for matching names and records. In: KDD workshop on data cleaning and object consolidation. Association for Computing Machinery, Washington, DC. https://www.cs.cmu.edu/afs/cs/Web/People/wcohen/postscript/kdd-2003-match-ws.pdf

  9. Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa PP (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537


  10. Cousseau V (2020) A linkage pipeline for place records using multi-view encoders. Master's thesis, Universidade Federal de Pernambuco (UFPE), Pernambuco, Brazil. https://github.com/vinimoraesrc/placern

  11. Cousseau V, Barbosa L (2019) Industrial paper: large-scale record linkage of web-based place entities. In: Anais Principais do XXXIV Simpósio Brasileiro de Banco de Dados. SBC, Porto Alegre, RS, Brazil, pp 181–186. https://doi.org/10.5753/sbbd.2019.8820

  12. Cui Y, Jia M, Lin TY, Song Y, Belongie SJ (2019) Class-balanced loss based on effective number of samples. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR). Computer Vision Foundation/IEEE, Long Beach, California, pp 9260–9269

  13. Dalvi N, Olteanu M, Raghavan M, Bohannon P (2014) Deduplicating a places database. In: Proceedings of the 23rd international conference on world wide web, WWW'14. Association for Computing Machinery, New York, NY, USA, pp 409–418. https://doi.org/10.1145/2566486.2568034

  14. Damgaard C, Weiner J (2000) Describing inequality in plant size or fecundity. Ecology 81:1139–1142. https://doi.org/10.2307/177185


  15. Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113. https://doi.org/10.1145/1327452.1327492


  16. Dempster AP (1967) Upper and lower probabilities induced by a multivalued mapping. Ann Math Stat 38(2):325–339


  17. Deng Y, Luo A, Liu J, Wang Y (2019) Point of interest matching between different geospatial datasets. ISPRS Int J Geo Inf 8(10):435. https://doi.org/10.3390/ijgi8100435


  18. Dixon PM, Weiner J, Mitchell-olds T, Woodley R (1987) Bootstrapping the Gini coefficient of inequality. Ecology 68:1548–1551


  19. Dong XL (2020) Big data integration

  20. Gini C (1912) Variabilità e mutabilità: contributo allo studio delle distribuzioni e delle relazioni statistiche. [Fasc. I.]. Studi economico-giuridici pubblicati per cura della facoltà di Giurisprudenza della R. Università di Cagliari. Tipogr. di P. Cuppini, Cagliari, Italy. https://books.google.com.br/books?id=fqjaBPMxB9kC

  21. Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. J Mach Learn Res Proc Track 9:249–256


  22. Grave E, Bojanowski P, Gupta P, Joulin A, Mikolov T (2018) Learning word vectors for 157 languages. In: Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan, pp 3483–3487. https://www.aclweb.org/anthology/L18-1550

  23. Guo J, Fan Y, Ai Q, Croft WB (2016) A deep relevance matching model for ad-hoc retrieval. In: Proceedings of the 25th ACM international on conference on information and knowledge management, CIKM’16. Association for Computing Machinery, New York, NY, USA, pp 55–64. https://doi.org/10.1145/2983323.2983769

  24. Guo X, Gao L, Liu X, Yin J (2017) Improved deep embedded clustering with local structure preservation. In: Proceedings of the 26th international joint conference on artificial intelligence, IJCAI’17. AAAI Press, Melbourne, Australia, pp 1753–1759

  25. Hu B, Lu Z, Li H, Chen Q (2014) Convolutional neural network architectures for matching natural language sentences. In: Proceedings of the 27th international conference on neural information processing systems—vol 2, NIPS’14. MIT Press, Cambridge, MA, USA, pp 2042–2050

  26. Jiang X, de Souza EN, Pesaranghader A, Hu B, Silver DL, Matwin S (2017) Trajectorynet: an embedded GPS trajectory representation for point-based classification using recurrent neural networks. In: Proceedings of the 27th annual international conference on computer science and software engineering, CASCON’17. IBM Corp., USA, pp 192–200

  27. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu TY (2017) Lightgbm: a highly efficient gradient boosting decision tree. In: Advances in neural information processing systems 30. NIPS’17. Curran Associates Inc., Red Hook, NY, USA, pp 3149–3157

  28. Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1746–1751. https://doi.org/10.3115/v1/D14-1181

  29. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Bengio Y, LeCun Y (eds) 3rd international conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, conference track proceedings. ICLR, San Diego, California. http://arxiv.org/abs/1412.6980

  30. Lin T, Goyal P, Girshick RB, He K, Dollár P (2017) Focal loss for dense object detection. In: IEEE international conference on computer vision, ICCV 2017, Venice, Italy, October 22–29, 2017. IEEE Computer Society, Venice, Italy, pp 2999–3007. https://doi.org/10.1109/ICCV.2017.324

  31. Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30. Curran Associates Inc, Long Beach, pp 4765–4774


  32. Marinho A (2018) Approximate string matching and duplicate detection in the deep learning era. Master's thesis, Instituto Superior Técnico - Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001 Lisboa, Portugal

  33. Mikolov T, Grave E, Bojanowski P, Puhrsch C, Joulin A (2018) Advances in pre-training distributed word representations. In: Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan, pp 52–55. https://www.aclweb.org/anthology/L18-1008

  34. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th international conference on neural information processing systems—Volume 2, NIPS’13. Curran Associates Inc., Red Hook, NY, USA, pp 3111–3119

  35. Morton G (1966) A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Company, Armonk, NY, USA. https://books.google.com.br/books?id=9FFdHAAACAAJ

  36. Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th international conference on international conference on machine learning, ICML’10. Omnipress, Madison, WI, USA, pp 807–814

  37. Ng A (2019) Machine learning yearning: technical strategy for AI engineers in the era of deep learning. https://www.deeplearning.ai/machine-learning-yearning/

  38. Niemeyer G (2008) geohash.org is public! https://web.archive.org/web/20080305102941/http://blog.labix.org/2008/02/26/geohashorg-is-public/

  39. Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philos Mag 2:559–572


  40. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830


  41. Perozzi B, Al-Rfou R, Skiena S (2014) Deepwalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, KDD’14. Association for Computing Machinery, New York, NY, USA, pp 701–710. https://doi.org/10.1145/2623330.2623732

  42. Řehůřek R, Sojka P (2010) Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks. ELRA, Valletta, Malta, pp 45–50. http://is.muni.cz/publication/884893/en

  43. Saito T, Rehmsmeier M (2015) The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE. https://doi.org/10.1371/journal.pone.0118432


  44. Santos R, Murrieta-Flores P, Calado P, Martins B (2017) Toponym matching through deep neural networks. Int J Geogr Inf Sci 32:1–25


  45. Santos R, Murrieta-Flores P, Martins B (2017) Learning to combine multiple string similarity metrics for effective toponym matching. Int J Digit Earth 11(9):913–938. https://doi.org/10.1080/17538947.2017.1371253


  46. Shafer G (1976) A mathematical theory of evidence. Princeton University Press, Princeton


  47. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958


  48. Stefanidis K, Efthymiou V, Herschel M, Christophides V (2014) Entity resolution in the web of data. In: Proceedings of the 23rd international conference on world wide web, WWW'14 Companion. Association for Computing Machinery, New York, NY, USA, pp 203–204. https://doi.org/10.1145/2567948.2577263

  49. The pandas development team (2020) pandas-dev/pandas: Pandas. https://doi.org/10.5281/zenodo.3509134

  50. W3Techs (2020) Usage statistics of structured data formats for websites

  51. Wang D, Zhang J, Cao W, Li J, Zheng Y (2018) When will you arrive? Estimating travel time based on deep neural networks. In: AAAI. AAAI Press, New Orleans, LA, USA, pp 2500–2507


  52. Winkler WE (1990) String comparator metrics and enhanced decision rules in the Fellegi–Sunter model of record linkage. In: Proceedings of the section on survey research, issues in matching and administrative records section. American Statistical Association, Alexandria, VA, pp 354–359

  53. Xiong C, Zhong V, Socher R (2017) Dynamic coattention networks for question answering. In: 5th international conference on learning representations, ICLR 2017, conference track proceedings. OpenReview.net, Toulon, France. https://openreview.net/forum?id=rJeKjwvclx

  54. Yalavarthi VK, Ke X, Khan A (2017) Select your questions wisely: for entity resolution with crowd errors. In: Proceedings of the 2017 ACM on conference on information and knowledge management, CIKM'17. Association for Computing Machinery, New York, NY, USA, pp 317–326. https://doi.org/10.1145/3132847.3132876

  55. Yang C, Bai L, Zhang C, Yuan Q, Han J (2017) Bridging collaborative filtering and semi-supervised learning: a neural approach for POI recommendation. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, KDD'17. Association for Computing Machinery, New York, NY, USA, pp 1245–1254. https://doi.org/10.1145/3097983.3098094

  56. Yang C, Hoang DH, Mikolov T, Han J (2019) Place deduplication with embeddings. In: The world wide web conference, WWW’19. Association for Computing Machinery, New York, NY, USA, pp 3420–3426. https://doi.org/10.1145/3308558.3313456

  57. Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A, Gonzalez J, Shenker S, Stoica I (2016) Apache spark: a unified engine for big data processing. Commun ACM 59(11):56–65. https://doi.org/10.1145/2934664



Author information


Corresponding author

Correspondence to Luciano Barbosa.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Hyperparameters

To improve the reproducibility of our results, this appendix describes the optimization process used for the previous approaches, for our supervised baseline models, and for PlacERN. We also provide the best hyperparameter values found during optimization. The value ranges used as input to the optimization algorithms can be found in [10].

1.1 Previous approaches

We use Bayesian optimization from Optuna [1] to tune the parameters of PE [56] and a grid search for the model of Dalvi et al. [13], both on validation data. The EM algorithm from [13] is run for 10 iterations on \(Pairs_{BR}\) and 5 iterations on \(Pairs_{US}\), while PE [56] runs for 100 Optuna trials on \(Pairs_{BR}\) and 50 trials on \(Pairs_{US}\), using a median pruner after 5 warm-up trials and 3 warm-up steps. Geohashes are used to create the tiles in both methods.
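For concreteness, the following is a minimal sketch of this Optuna setup. The objective is a stand-in and its search ranges are illustrative (the real search spaces are listed in [10]); the pruner arguments map the warm-up settings above onto Optuna's MedianPruner.

```python
import optuna

# Stand-in objective: in practice this trains PE with the sampled
# hyperparameters and returns a validation metric to maximize.
def objective(trial):
    alpha = trial.suggest_float("alpha", 0.0, 1.0)
    t_c = trial.suggest_float("t_c", 0.5, 0.95)
    return -((alpha - 0.4) ** 2 + (t_c - 0.75) ** 2)

study = optuna.create_study(
    direction="maximize",
    # 5 warm-up trials and 3 warm-up steps, as described above.
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=3),
)
study.optimize(objective, n_trials=100)  # 100 trials on Pairs_BR, 50 on Pairs_US
```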

The optimal values obtained for the hyperparameters of [13] on the \(Pairs_{BR}\) dataset are: \(\lambda = 0.0\), Geohash precision \(= 6\), \(\alpha = 0.3\), and \(t_c = 0.5\). On \(Pairs_{US}\): \(\lambda = 0.0\), Geohash precision \(= 6\), \(\alpha = 0.9\), and \(t_c = 0.5\).
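To make the tiling concrete, here is a short sketch at the Geohash precision found above. It assumes the third-party pygeohash package and made-up coordinates; neither method prescribes a specific geohash implementation.

```python
import pygeohash as pgh

# Illustrative place records: (name, latitude, longitude).
places = [
    ("Cafe A", -8.0476, -34.8770),
    ("Cafe B", -8.0478, -34.8772),
]

tiles = {}
for name, lat, lon in places:
    # Precision 6 yields cells of roughly 1.2 km x 0.6 km; places in
    # the same cell share the same 6-character code.
    code = pgh.encode(lat, lon, precision=6)
    tiles.setdefault(code, []).append(name)

# Candidate duplicate pairs are then generated only within each tile.
print(tiles)
```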

Meanwhile, the embedding smoothing process from PE runs for 10 full epochs and uses 100,000 smoothing random walks with a fixed length of 10, a minimum frequency of 1, and a half-window size of 5. The training batch size is fixed at 512 for \(Pairs_{BR}\) and 1024 for \(Pairs_{US}\), and the MLP uses 3 feedforward layers. The best values for the tuned hyperparameters of PE on the \(Pairs_{BR}\) dataset are: \(\alpha = 0.4\), smoothing negative sampling ratio \(= 20\), neurons \(= (512, 128, 128)\), \(t_c = 0.75\). On \(Pairs_{US}\): \(\alpha = 0.5\), smoothing negative sampling ratio \(= 5\), neurons \(= (512, 128, 128)\), \(t_c = 0.75\).

1.2 Supervised baseline models

Both models (PRF and PLGBM) are tuned by Bayesian optimization, using Optuna for 100 trials on the \(Pairs_{BR}\) and \(Pairs_{US}\) validation sets. The hyperparameter names for each model follow their respective libraries, and any value not shown assumes its default.

PRF uses the class_weight parameter set to balanced and the oob_score parameter set to True on both sets. The optimal tuned hyperparameter values for PRF on \(Pairs_{BR}\) are: n_estimators \(= 110\), max_features \(= log2\), max_leaf_nodes \(= 150\), min_samples_split \(= 3\), \(t_c = 0.9\). On \(Pairs_{US}\): n_estimators \(= 150\), max_features \(= sqrt\), max_leaf_nodes \(= 150\), min_samples_split \(= 5\), \(t_c = 0.9\).
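As a sketch, the \(Pairs_{BR}\) configuration above maps onto scikit-learn as follows; feature construction and the decision threshold \(t_c\) are handled outside the estimator.

```python
from sklearn.ensemble import RandomForestClassifier

prf_br = RandomForestClassifier(
    n_estimators=110,
    max_features="log2",
    max_leaf_nodes=150,
    min_samples_split=3,
    class_weight="balanced",
    oob_score=True,
)
# After fitting on pairwise features, predictions are thresholded at t_c:
# y_pred = (prf_br.predict_proba(X)[:, 1] >= 0.9).astype(int)
```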

The PLGBM model uses 10 early stopping rounds over 100 iterations in each trial, with a fixed class_weight value of balanced. The optimal values for its hyperparameters on \(Pairs_{BR}\) are: lambda_l1 \(= 1.247 \cdot 10^{-8}\), lambda_l2 \(= 0.659\), num_leaves \(= 95\), feature_fraction \(= 0.5\), bagging_fraction \(= 1.0\), bagging_freq \(= 0\), min_child_samples \(= 5\), \(t_c = 0.5\). On \(Pairs_{US}\), the values are: lambda_l1 \(= 1.132 \cdot 10^{-8}\), lambda_l2 \(= 0.247\), num_leaves \(= 256\), feature_fraction \(= 0.62\), bagging_fraction \(= 1.0\), bagging_freq \(= 0\), min_child_samples \(= 20\), \(t_c = 0.5\).
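Similarly, a sketch of the \(Pairs_{BR}\) PLGBM configuration using LightGBM's scikit-learn wrapper; lambda_l1, lambda_l2, feature_fraction, bagging_fraction, and bagging_freq are aliases of the wrapper parameters used below.

```python
import lightgbm as lgb

plgbm_br = lgb.LGBMClassifier(
    n_estimators=100,        # 100 boosting iterations per trial
    reg_alpha=1.247e-8,      # lambda_l1
    reg_lambda=0.659,        # lambda_l2
    num_leaves=95,
    colsample_bytree=0.5,    # feature_fraction
    subsample=1.0,           # bagging_fraction
    subsample_freq=0,        # bagging_freq
    min_child_samples=5,
    class_weight="balanced",
)
# The 10 early stopping rounds are passed at fit time, e.g.:
# plgbm_br.fit(X_tr, y_tr, eval_set=[(X_val, y_val)],
#              callbacks=[lgb.early_stopping(10)])
```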

1.3 PlacERN

PlacERN is implemented with \(L = 3\) feed-forward network layers, the first of them having \(H_0 = 256\) neurons. Regarding the sequence lengths noted in Sect. 3, we use the 90th percentile of lengths for each field to extract \(S^w = 5\), and \(S^{ct} = 3\) for both sets, \(S^{ch}_n = 42, S^{ch}_a = 32\) for \(Pairs_{BR}\), and \(S^{ch}_n = 26, S^{ch}_a = 23\) for \(Pairs_{US}\). We use \(B = 100\) distance buckets in the geographical encoder, and \(m = 100\) dimensions for the embedding layers.
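As a small illustration of the sequence-length extraction above (the field values here are made up; the real computation runs over the training split of each dataset):

```python
import numpy as np

# Character lengths of the name field over illustrative examples.
names = ["Padaria Central", "Cafe do Porto", "Mercado de Sao Jose"]
name_char_lengths = [len(n) for n in names]

# S^ch_n is the 90th percentile of these lengths; name sequences are
# padded or truncated to this length.
S_ch_n = int(np.percentile(name_char_lengths, 90))
print(S_ch_n)
```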

To use the pre-trained FastText embeddings in our model, their dimensionality is first reduced to 100 by means of a Principal Component Analysis [39] dimensionality reduction script (footnote 8). To further improve reproducibility, we also fix the random seed to 6810818.
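The reduction script linked in footnote 8 is fastText's own; a simplified scikit-learn equivalent looks like this, with random vectors standing in for the pre-trained embedding matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6810818)       # the fixed seed noted above
embeddings = rng.normal(size=(5000, 300))  # stand-in for FastText vectors

pca = PCA(n_components=100)                # project to m = 100 dimensions
reduced = pca.fit_transform(embeddings)
print(reduced.shape)                       # (5000, 100)
```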

The model is tuned by Bayesian optimization from Optuna for 50 trials on the \(Pairs_{BR}\) validation set and 20 trials on the \(Pairs_{US}\) one, with a median pruner after 3 warm-up trials and 1 warm-up step. An additional early stopping callback with a patience of 2 epochs and a minimum change of \(10^{-3}\) is added as insurance against degenerate cases not detected by the median pruner. A batch size of 64 for \(Pairs_{BR}\) and 1024 for \(Pairs_{US}\) is used during training. The best tuned hyperparameter values for PlacERN and its ablated versions on both datasets are reported in Table 5.
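A sketch of these two safeguards, assuming a tf.keras training loop (the text does not pin down the deep-learning framework):

```python
import optuna
from tensorflow.keras.callbacks import EarlyStopping

# Median pruner: 3 warm-up trials, 1 warm-up step, as described above.
pruner = optuna.pruners.MedianPruner(n_startup_trials=3, n_warmup_steps=1)
study = optuna.create_study(direction="maximize", pruner=pruner)

# Early stopping: patience of 2 epochs, minimum change of 1e-3.
early_stop = EarlyStopping(monitor="val_loss", patience=2, min_delta=1e-3)
# The callback is passed to model.fit(..., callbacks=[early_stop]) inside
# the Optuna objective, then:
# study.optimize(objective, n_trials=50)  # 50 trials on Pairs_BR, 20 on Pairs_US
```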

Table 5 Hyperparameters of PlacERN and its ablations, with the best values obtained in each data set


About this article


Cite this article

Cousseau, V., Barbosa, L. Linking place records using multi-view encoders. Neural Comput & Applic 33, 12103–12119 (2021). https://doi.org/10.1007/s00521-021-05932-9
