Abstract
This article addresses the problem of hotel deduplication. Obvious approaches, such as comparing names or locations, fail because hotel descriptions differ across databases. The most accurate approach is to employ professionally trained content managers, but this is expensive, so an automatic solution is needed. We propose a method for testing the hypothesis that a pair of hotels is identical and compare its performance with alternative solutions. The proposed method satisfies the business requirements set for the precision and recall of the hotel deduplication task. The method is based on a machine learning approach that uses several distinctive features, including some built with computer vision algorithms.
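To make the pairwise setting concrete, the sketch below computes a few simple similarity scores for a candidate pair of hotel records. It is an illustration only: the abstract does not list the features actually used, so the record fields (`name`, `lat`, `lon`), the function names, and the choice of scores (word-set Jaccard similarity, normalized edit distance, great-circle distance) are all assumptions. Scores like these would be inputs to a trained classifier rather than a decision rule on their own, which is consistent with the abstract's point that name or location comparison alone is not sufficient.

```python
import math


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the lowercase word sets of two names."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two coordinates."""
    radius_km = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


def pair_features(h1: dict, h2: dict) -> list:
    """Hypothetical feature vector for a candidate pair of hotel records.

    Assumes each record has 'name', 'lat' and 'lon' keys; these field
    names are illustrative and not taken from the paper.
    """
    n1, n2 = h1["name"].lower(), h2["name"].lower()
    max_len = max(len(n1), len(n2)) or 1
    return [
        jaccard(n1, n2),                          # word overlap of names
        1.0 - levenshtein(n1, n2) / max_len,      # normalized string similarity
        haversine_km(h1["lat"], h1["lon"], h2["lat"], h2["lon"]),
    ]
```

A vector such as `pair_features(a, b)` could then be passed, together with labeled duplicate and non-duplicate pairs, to any standard binary classifier.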
Acknowledgement
The authors would like to thank Vladislav Dolbilov for his active involvement in hypothesis testing, feature implementation, and machine learning experiments; Margarita Pyartel for help in preparing the final learning dataset and for providing expert classification results in difficult cases; Andrey Filchenkov for valuable advice and for reviewing this article; Andrey Tarkhov for proofreading; the Yandex.Travel team for support and help; our partners for providing us with the hotel data; and the Yandex computer vision team for their expertise. This work was financially supported by the Government of the Russian Federation, Grant 074-U01.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Kozhevnikov, I., Gorovoy, V. (2016). Comparison of Different Approaches for Hotels Deduplication. In: Ngonga Ngomo, AC., Křemen, P. (eds) Knowledge Engineering and Semantic Web. KESW 2016. Communications in Computer and Information Science, vol 649. Springer, Cham. https://doi.org/10.1007/978-3-319-45880-9_18
DOI: https://doi.org/10.1007/978-3-319-45880-9_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45879-3
Online ISBN: 978-3-319-45880-9
eBook Packages: Computer Science, Computer Science (R0)