Abstract
Extracting information about Web entities has become commonplace in academia and industry alike. In particular, data about places stand out as a rich source of geolocated information and spatial context, serving as a foundation for a wide range of applications. Such data, however, are inherently noisy and suffer from several issues, including record duplication. In this work, we detect replicated places using a deep-learning model, named PlacERN, that relies on multi-view encoders. These encoders learn different representations from distinct information levels of a place, using intermediate mappings and non-linearities, and are then compared to predict whether a place pair is a duplicate. We further show how this model can solve the place linkage problem in an end-to-end fashion by fitting it into a pipeline. PlacERN is evaluated on two distinct datasets, both containing missing values and a high class imbalance. The results show that: (1) PlacERN is effective at place deduplication, even on such challenging datasets; and (2) it outperforms previous place deduplication approaches, as well as competitive algorithms, namely Random Forest and LightGBM using pairwise features, on both datasets across different metrics (F-score, Gini coefficient, and area under the precision-recall curve).



Change history
16 May 2021
A Correction to this paper has been published: https://doi.org/10.1007/s00521-021-06076-6
References
Akiba T, Sano S, Yanase T, Ohta T, Koyama M (2019) Optuna: a next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery and data mining. Association for Computing Machinery, New York, NY, USA, pp 2623–2631. https://doi.org/10.1145/3292500.3330701
Barbosa L (2018) Learning representations of web entities for entity resolution. Int J Web Inf Syst 15(3):246–256. https://doi.org/10.1108/ijwis-07-2018-0059
Berjawi B (2017) Integration of heterogeneous data from multiple location-based services providers: a use case on tourist points of interest. Ph.D. thesis, École doctorale d'informatique et mathématique de Lyon
Bojanowski P, Grave E, Joulin A, Mikolov T (2017) Enriching word vectors with subword information. Trans Assoc Comput Linguist 5:135–146
Breiman L (2001) Random forests. Mach Learn 45(1):5–32. https://doi.org/10.1023/a:1010933404324
Buscaldi D (2009) Toponym ambiguity in geographical information retrieval. In: Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval, SIGIR’09. Association for Computing Machinery, New York, NY, USA, p 847. https://doi.org/10.1145/1571941.1572168
Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1724–1734. https://doi.org/10.3115/v1/D14-1179
Cohen WW, Ravikumar P, Fienberg SE (2003) A comparison of string metrics for matching names and records. In: KDD workshop on data cleaning and object consolidation. Association for Computing Machinery, Washington, DC. https://www.cs.cmu.edu/afs/cs/Web/People/wcohen/postscript/kdd-2003-match-ws.pdf
Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa PP (2011) Natural language processing (almost) from scratch. J Mach Learn Res 12:2493–2537
Cousseau V (2020) A linkage pipeline for place records using multi-view encoders. Master's thesis, Universidade Federal de Pernambuco (UFPE), Pernambuco, Brazil. https://github.com/vinimoraesrc/placern
Cousseau V, Barbosa L (2019) Industrial paper: large-scale record linkage of web-based place entities. In: Anais Principais do XXXIV Simpósio Brasileiro de Banco de Dados. SBC, Porto Alegre, RS, Brasil, pp 181–186. https://doi.org/10.5753/sbbd.2019.8820
Cui Y, Jia M, Lin TY, Song Y, Belongie SJ (2019) Class-balanced loss based on effective number of samples. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR). Computer Vision Foundation/IEEE, Long Beach, California, pp 9260–9269
Dalvi N, Olteanu M, Raghavan M, Bohannon P (2014) Deduplicating a places database. In: Proceedings of the 23rd international conference on world wide web, WWW'14. Association for Computing Machinery, New York, NY, USA, pp 409–418. https://doi.org/10.1145/2566486.2568034
Damgaard C, Weiner J (2000) Describing inequality in plant size or fecundity. Ecology 81:1139–1142. https://doi.org/10.2307/177185
Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113. https://doi.org/10.1145/1327452.1327492
Dempster AP (1967) Upper and lower probabilities induced by a multivalued mapping. Ann Math Stat 38(2):325–339
Deng Y, Luo A, Liu J, Wang Y (2019) Point of interest matching between different geospatial datasets. ISPRS Int J Geo Inf 8(10):435. https://doi.org/10.3390/ijgi8100435
Dixon PM, Weiner J, Mitchell-olds T, Woodley R (1987) Bootstrapping the Gini coefficient of inequality. Ecology 68:1548–1551
Dong XL (2020) Big data integration
Gini C (1912) Variabilità e mutabilità: contributo allo studio delle distribuzioni e delle relazioni statistiche. [Fasc. I.]. Studi economico-giuridici pubblicati per cura della facoltà di Giurisprudenza della R. Università di Cagliari. Tipogr. di P. Cuppini, Cagliari, Italy. https://books.google.com.br/books?id=fqjaBPMxB9kC
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. J Mach Learn Res Proc Track 9:249–256
Grave E, Bojanowski P, Gupta P, Joulin A, Mikolov T (2018) Learning word vectors for 157 languages. In: Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan, pp 3483–3487. https://www.aclweb.org/anthology/L18-1550
Guo J, Fan Y, Ai Q, Croft WB (2016) A deep relevance matching model for ad-hoc retrieval. In: Proceedings of the 25th ACM international on conference on information and knowledge management, CIKM’16. Association for Computing Machinery, New York, NY, USA, pp 55–64. https://doi.org/10.1145/2983323.2983769
Guo X, Gao L, Liu X, Yin J (2017) Improved deep embedded clustering with local structure preservation. In: Proceedings of the 26th international joint conference on artificial intelligence, IJCAI’17. AAAI Press, Melbourne, Australia, pp 1753–1759
Hu B, Lu Z, Li H, Chen Q (2014) Convolutional neural network architectures for matching natural language sentences. In: Proceedings of the 27th international conference on neural information processing systems—vol 2, NIPS’14. MIT Press, Cambridge, MA, USA, pp 2042–2050
Jiang X, de Souza EN, Pesaranghader A, Hu B, Silver DL, Matwin S (2017) Trajectorynet: an embedded GPS trajectory representation for point-based classification using recurrent neural networks. In: Proceedings of the 27th annual international conference on computer science and software engineering, CASCON’17. IBM Corp., USA, pp 192–200
Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu TY (2017) Lightgbm: a highly efficient gradient boosting decision tree. In: Advances in neural information processing systems 30. NIPS’17. Curran Associates Inc., Red Hook, NY, USA, pp 3149–3157
Kim Y (2014) Convolutional neural networks for sentence classification. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pp 1746–1751. https://doi.org/10.3115/v1/D14-1181
Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Bengio Y, LeCun Y (eds) 3rd international conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, conference track proceedings. ICLR, San Diego, California. http://arxiv.org/abs/1412.6980
Lin T, Goyal P, Girshick RB, He K, Dollár P (2017) Focal loss for dense object detection. In: IEEE international conference on computer vision, ICCV 2017, Venice, Italy, October 22–29, 2017. IEEE Computer Society, Venice, Italy, pp 2999–3007. https://doi.org/10.1109/ICCV.2017.324
Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems 30. Curran Associates Inc, Long Beach, pp 4765–4774
Marinho A (2018) Approximate string matching and duplicate detection in the deep learning era. Master's thesis, Instituto Superior Técnico - Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001 Lisboa, Portugal
Mikolov T, Grave E, Bojanowski P, Puhrsch C, Joulin A (2018) Advances in pre-training distributed word representations. In: Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki, Japan, pp 52–55. https://www.aclweb.org/anthology/L18-1008
Mikolov T, Sutskever I, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th international conference on neural information processing systems—vol 2, NIPS'13. Curran Associates Inc., Red Hook, NY, USA, pp 3111–3119
Morton G (1966) A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Company, Armonk, NY, USA. https://books.google.com.br/books?id=9FFdHAAACAAJ
Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th international conference on international conference on machine learning, ICML’10. Omnipress, Madison, WI, USA, pp 807–814
Ng A (2019) Machine learning yearning: technical strategy for AI engineers in the era of deep learning. https://www.deeplearning.ai/machine-learning-yearning/
Niemeyer G (2008) geohash.org is public! https://web.archive.org/web/20080305102941/http://blog.labix.org/2008/02/26/geohashorg-is-public/
Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philos Mag 2:559–572
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Perozzi B, Al-Rfou R, Skiena S (2014) Deepwalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, KDD’14. Association for Computing Machinery, New York, NY, USA, pp 701–710. https://doi.org/10.1145/2623330.2623732
Řehůřek R, Sojka P (2010) Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 workshop on new challenges for NLP frameworks. ELRA, Valletta, Malta, pp 45–50. http://is.muni.cz/publication/884893/en
Saito T, Rehmsmeier M (2015) The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10(3):e0118432. https://doi.org/10.1371/journal.pone.0118432
Santos R, Murrieta-Flores P, Calado P, Martins B (2017) Toponym matching through deep neural networks. Int J Geogr Inf Sci 32:1–25. https://doi.org/10.1080/13658816.2017.1390119
Santos R, Murrieta-Flores P, Martins B (2017) Learning to combine multiple string similarity metrics for effective toponym matching. Int J Digit Earth 11(9):913–938. https://doi.org/10.1080/17538947.2017.1371253
Shafer G (1976) A mathematical theory of evidence. Princeton University Press, Princeton
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R (2014) Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 15(1):1929–1958
Stefanidis K, Efthymiou V, Herschel M, Christophides V (2014) Entity resolution in the web of data. In: Proceedings of the 23rd international conference on world wide web, WWW'14 Companion. Association for Computing Machinery, New York, NY, USA, pp 203–204. https://doi.org/10.1145/2567948.2577263
The pandas development team (2020) pandas-dev/pandas: Pandas. Zenodo. https://doi.org/10.5281/zenodo.3509134
W3Techs (2020) Usage statistics of structured data formats for websites
Wang D, Zhang J, Cao W, Li J, Zheng Y (2018) When will you arrive? Estimating travel time based on deep neural networks. In: Proceedings of the thirty-second AAAI conference on artificial intelligence. AAAI Press, New Orleans, LA, USA, pp 2500–2507
Winkler WE (1990) String comparator metrics and enhanced decision rules in the Fellegi–Sunter model of record linkage. In: Proceedings of the section on survey research, issues in matching and administrative records section. American Statistical Association, Alexandria, VA, pp 354–359
Xiong C, Zhong V, Socher R (2017) Dynamic coattention networks for question answering. In: 5th international conference on learning representations, ICLR 2017, conference track proceedings. OpenReview.net, Toulon, France. https://openreview.net/forum?id=rJeKjwvclx
Yalavarthi VK, Ke X, Khan A (2017) Select your questions wisely: for entity resolution with crowd errors. In: Proceedings of the 2017 ACM on conference on information and knowledge management, CIKM'17. Association for Computing Machinery, New York, NY, USA, pp 317–326. https://doi.org/10.1145/3132847.3132876
Yang C, Bai L, Zhang C, Yuan Q, Han J (2017) Bridging collaborative filtering and semi-supervised learning: a neural approach for POI recommendation. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, KDD'17. Association for Computing Machinery, New York, NY, USA, pp 1245–1254. https://doi.org/10.1145/3097983.3098094
Yang C, Hoang DH, Mikolov T, Han J (2019) Place deduplication with embeddings. In: The world wide web conference, WWW’19. Association for Computing Machinery, New York, NY, USA, pp 3420–3426. https://doi.org/10.1145/3308558.3313456
Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A, Gonzalez J, Shenker S, Stoica I (2016) Apache spark: a unified engine for big data processing. Commun ACM 59(11):56–65. https://doi.org/10.1145/2934664
Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix: Hyperparameters
To improve the reproducibility of our results, this appendix describes the optimization process used for the previous approaches, for our supervised baseline models, and for PlacERN. We also provide the best hyperparameter values found during optimization. The value ranges used as input to the optimization algorithms can be found in [10].
1.1 Previous approaches
We use Bayesian optimization from Optuna [1] to tune the parameters of PE [56], and a grid search for the model of Dalvi et al. [13], both on validation data. The EM algorithm from [13] is run for 10 iterations on \(Pairs_{BR}\) and 5 iterations on \(Pairs_{US}\), while PE [56] runs for 100 Optuna trials on \(Pairs_{BR}\) and 50 trials on \(Pairs_{US}\), with a median pruner activated after 5 warm-up trials and 3 warm-up steps. Geohashes are used to create the tiles required by both methods.
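For concreteness, the sketch below shows what such a tuning loop could look like in Optuna; the search space and objective body are illustrative placeholders rather than the authors' actual code (the real value ranges are listed in [10]).

```python
# Minimal sketch of the Optuna tuning loop; the search space and the
# objective body are placeholders, not the authors' code.
import optuna

def objective(trial: optuna.Trial) -> float:
    alpha = trial.suggest_float("alpha", 0.0, 1.0)  # hypothetical range
    t_c = trial.suggest_float("t_c", 0.5, 0.95)     # hypothetical range
    # ... train PE with (alpha, t_c) and evaluate on the validation split ...
    score = (1.0 - abs(alpha - 0.5)) * t_c          # dummy stand-in score
    return score                                    # e.g., validation F-score

study = optuna.create_study(
    direction="maximize",
    # Median pruner activated after 5 warm-up trials and 3 warm-up steps.
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=3),
)
study.optimize(objective, n_trials=100)  # 100 trials on Pairs_BR, 50 on Pairs_US
print(study.best_params)
```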
The optimal values obtained for the hyperparameters of [13] on the \(Pairs_{BR}\) dataset are: \(\lambda = 0.0\), Geohash precision \(= 6\), \(\alpha = 0.3\), and \(t_c = 0.5\). On \(Pairs_{US}\): \(\lambda = 0.0\), Geohash precision \(= 6\), \(\alpha = 0.9\), and \(t_c = 0.5\).
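To illustrate how a precision-6 Geohash groups nearby places into tiles, the snippet below uses the pygeohash package with hypothetical coordinates; it is not necessarily the authors' implementation.

```python
# Illustrative only: Geohash tiles at precision 6 group nearby places,
# so candidate pairs need only be compared within a tile.
import pygeohash as pgh

place_a = (-8.0476, -34.8770)  # hypothetical (lat, lng) coordinates
place_b = (-8.0479, -34.8765)

tile_a = pgh.encode(*place_a, precision=6)
tile_b = pgh.encode(*place_b, precision=6)
# Precision-6 cells are roughly 1.2 km x 0.6 km, so nearby places
# typically fall in the same tile.
print(tile_a, tile_b, tile_a == tile_b)
```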
Meanwhile, the embedding smoothing process from PE runs for 10 full epochs and uses 100,000 smoothing random walks with a fixed length of 10, a minimum frequency of 1, and a half-window size of 5. The training batch size is fixed at 512 for \(Pairs_{BR}\) and 1024 for \(Pairs_{US}\), and the MLP uses 3 feedforward layers. The best values for the tuned hyperparameters of PE on the \(Pairs_{BR}\) dataset are: \(\alpha = 0.4\), smoothing negative sampling ratio \(= 20\), neurons \(= (512, 128, 128)\), \(t_c = 0.75\). On \(Pairs_{US}\): \(\alpha = 0.5\), smoothing negative sampling ratio \(= 5\), neurons \(= (512, 128, 128)\), \(t_c = 0.75\).
1.2 Supervised baseline models
Both models (PRF and PLGBM) are tuned by Bayesian optimization, using Optuna for 100 trials on the \(Pairs_{BR}\) and \(Pairs_{US}\) validation sets. The hyperparameter descriptions for each model follow the parameter names from their respective libraries; any value not shown assumes its default.
PRF uses the class_weight parameter set to balanced and the oob_score parameter set to True on both sets. The optimal tuned hyperparameter values for PRF on \(Pairs_{BR}\) are: n_estimators \( = 110\), max_features \( = log2\), max_leaf_nodes \( = 150\), min_samples_split \( = 3\), \(t_c = 0.9\). On \(Pairs_{US}\): n_estimators \( = 150\), max_features \( = sqrt\), max_leaf_nodes \( = 150\), min_samples_split \( = 5\), \(t_c = 0.9\).
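Under these values, a minimal scikit-learn sketch of the \(Pairs_{BR}\) configuration could look as follows; the random matrix stands in for the real pairwise feature data.

```python
# Sketch of the PRF configuration for Pairs_BR; synthetic data is a
# placeholder for the real pairwise features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 12))
y_train = rng.integers(0, 2, size=1000)

prf = RandomForestClassifier(
    n_estimators=110,
    max_features="log2",
    max_leaf_nodes=150,
    min_samples_split=3,
    class_weight="balanced",
    oob_score=True,
)
prf.fit(X_train, y_train)
# Apply the tuned decision threshold t_c = 0.9 to the duplicate probability.
duplicates = prf.predict_proba(X_train)[:, 1] >= 0.9
```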
The PLGBM model uses 10 early stopping rounds over 100 boosting iterations in each trial, with a fixed class_weight value of balanced. The optimal values for its hyperparameters on \(Pairs_{BR}\) are: lambda_l1 \( = 1.247 \cdot 10^{-8}\), lambda_l2 \( = 0.659\), num_leaves \( = 95\), feature_fraction \( = 0.5\), bagging_fraction \( = 1.0\), bagging_freq \( = 0\), min_child_samples \( = 5\), \(t_c = 0.5\). On \(Pairs_{US}\), the values are: lambda_l1 \( = 1.132 \cdot 10^{-8}\), lambda_l2 \( = 0.247\), num_leaves \( = 256\), feature_fraction \( = 0.62\), bagging_fraction \( = 1.0\), bagging_freq \( = 0\), min_child_samples \( = 20\), \(t_c = 0.5\).
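A corresponding sketch for PLGBM on \(Pairs_{BR}\), assuming LightGBM's scikit-learn interface; the native parameter names are mapped to their scikit-learn aliases (lambda_l1 to reg_alpha, feature_fraction to colsample_bytree, and so on), and synthetic data again stands in for the real features.

```python
# Sketch of the PLGBM setup for Pairs_BR under the reported values.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 12)), rng.integers(0, 2, size=1000)
X_train, y_train, X_val, y_val = X[:800], y[:800], X[800:], y[800:]

plgbm = lgb.LGBMClassifier(
    n_estimators=100,        # 100 boosting iterations per trial
    num_leaves=95,
    reg_alpha=1.247e-8,      # lambda_l1
    reg_lambda=0.659,        # lambda_l2
    colsample_bytree=0.5,    # feature_fraction
    subsample=1.0,           # bagging_fraction
    subsample_freq=0,        # bagging_freq (0 disables bagging)
    min_child_samples=5,
    class_weight="balanced",
)
plgbm.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=10)],  # 10 early stopping rounds
)
duplicates = plgbm.predict_proba(X_val)[:, 1] >= 0.5  # threshold t_c = 0.5
```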
1.3 PlacERN
PlacERN is implemented with \(L = 3\) feed-forward network layers, the first of them having \(H_0 = 256\) neurons. Regarding the sequence lengths noted in Sect. 3, we use the 90th percentile of lengths for each field to extract \(S^w = 5\), and \(S^{ct} = 3\) for both sets, \(S^{ch}_n = 42, S^{ch}_a = 32\) for \(Pairs_{BR}\), and \(S^{ch}_n = 26, S^{ch}_a = 23\) for \(Pairs_{US}\). We use \(B = 100\) distance buckets in the geographical encoder, and \(m = 100\) dimensions for the embedding layers.
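As an illustration of the geographical encoder's input, the sketch below maps a pairwise distance into one of \(B = 100\) buckets before embedding; the log-spaced bucket boundaries are our own assumption, not the paper's exact scheme.

```python
# A minimal sketch, under our own assumptions, of distance bucketing for
# the geographical encoder; boundaries are illustrative only.
import numpy as np

B = 100
# Hypothetical log-spaced boundaries from 1 m to 100 km.
edges = np.logspace(0, 5, num=B - 1)

def distance_bucket(distance_m: float) -> int:
    """Return the bucket index (0..B-1) for a distance in meters."""
    return int(np.searchsorted(edges, distance_m))

# Each bucket index would then feed an m = 100 dimensional embedding layer.
print(distance_bucket(50.0), distance_bucket(2_000.0))
```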
To use the pre-trained FastText embeddings in our model, their dimensionality is reduced to 100 beforehand by means of a Principal Component Analysis [39] dimensionality reduction script. To improve reproducibility, we also fix the random seeds to 6810818.
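A minimal sketch of such a reduction, assuming a plain scikit-learn PCA rather than the exact script referenced above; the random matrix is a placeholder for the 300-dimensional pre-trained FastText vectors.

```python
# Sketch of the PCA reduction of FastText vectors to 100 dimensions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6810818)          # seed fixed as in the text
vectors_300d = rng.normal(size=(5000, 300))   # placeholder for real vectors

pca = PCA(n_components=100)
vectors_100d = pca.fit_transform(vectors_300d)  # shape: (vocab_size, 100)
```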
The model is tuned by Bayesian optimization from Optuna for 50 trials on the \(Pairs_{BR}\) validation set and 20 trials on the \(Pairs_{US}\) one, with a median pruner activated after 3 warm-up trials and 1 warm-up step. An additional early stopping callback with a patience of 2 epochs and a minimum change of \(10^{-3}\) is added as insurance against degenerate cases not detected by the median pruner. A batch size of 64 for \(Pairs_{BR}\) and 1024 for \(Pairs_{US}\) is used during training. The best tuned hyperparameter values for PlacERN and its ablated versions on both datasets are given in Table 5.
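Sketched with the Keras callback API (an assumption about the underlying training framework), the early stopping safeguard amounts to:

```python
# Early stopping safeguard; the Keras API and the monitored quantity
# are assumptions, not confirmed by the paper.
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",  # assumed monitored quantity
    min_delta=1e-3,      # minimum change of 10^-3
    patience=2,          # patience of 2 epochs
)
# model.fit(..., batch_size=64, callbacks=[early_stop])  # 64 for Pairs_BR
```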
About this article
Cite this article
Cousseau, V., Barbosa, L. Linking place records using multi-view encoders. Neural Comput & Applic 33, 12103–12119 (2021). https://doi.org/10.1007/s00521-021-05932-9