Comparison of Different Approaches for Hotels Deduplication

  • Conference paper
Knowledge Engineering and Semantic Web (KESW 2016)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 649))

Abstract

This article addresses the problem of hotel deduplication. Obvious approaches, such as comparing names or locations, fail because descriptions of the same hotel differ between databases. The most accurate solution is to rely on professionally trained content managers, but this is expensive, so an automatic solution is needed. We propose a method to improve the hypothesis that a pair of hotels is identical, and compare its performance with alternative solutions. The proposed method satisfies the business requirements set for the precision and recall of the hotel deduplication task. It is based on a machine learning approach that uses some unique features, including features built with the help of computer vision algorithms.
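To make the idea of pairwise comparison concrete, the following is a minimal sketch of how simple similarity features for a pair of hotel records might be computed before being passed to a classifier. It is not the authors' feature set: the choice of features (token-level Jaccard similarity, edit distance between names, great-circle distance between coordinates), the record fields `name`, `lat`, `lon`, and the function names are all assumptions made for illustration only.

```python
import math


def jaccard(a, b):
    """Jaccard similarity of the lowercase token sets of two names."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty names are treated as identical
    return len(ta & tb) / len(ta | tb)


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(h))


def pair_features(h1, h2):
    """Feature vector for a candidate pair (hypothetical record layout)."""
    return [
        jaccard(h1["name"], h2["name"]),
        float(levenshtein(h1["name"].lower(), h2["name"].lower())),
        haversine_km(h1["lat"], h1["lon"], h2["lat"], h2["lon"]),
    ]
```

A vector like this could serve as one row of a training matrix for a binary classifier that labels each candidate pair as duplicate or distinct. As the abstract notes, name and location comparisons alone are not sufficient, which is why the proposed method adds further features.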

Notes

  1. https://toloka.yandex.com/

  2. http://scikit-learn.org/

Acknowledgement

The authors would like to thank Vladislav Dolbilov for his active involvement in hypothesis testing, feature implementation and machine learning experiments; Margarita Pyartel for her help in preparing the final learning dataset and for providing expert classification results in difficult cases; Andrey Filchenkov for valuable advice and for reviewing this article; Andrey Tarkhov for proofreading; the Yandex.Travel team for support and help; our partners for providing us with the hotel data; and the Yandex computer vision team for their expertise. This work was financially supported by the Government of the Russian Federation, Grant 074-U01.

Author information

Corresponding author

Correspondence to Ivan Kozhevnikov.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Kozhevnikov, I., Gorovoy, V. (2016). Comparison of Different Approaches for Hotels Deduplication. In: Ngonga Ngomo, AC., Křemen, P. (eds) Knowledge Engineering and Semantic Web. KESW 2016. Communications in Computer and Information Science, vol 649. Springer, Cham. https://doi.org/10.1007/978-3-319-45880-9_18

  • DOI: https://doi.org/10.1007/978-3-319-45880-9_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45879-3

  • Online ISBN: 978-3-319-45880-9

  • eBook Packages: Computer Science, Computer Science (R0)
