Abstract
Many overlapping problems in information retrieval involve retrieving objects that have certain features, or objects that are similar to other objects. If the features that define these objects can be extracted, the objects can be reduced to a common representation that preserves pairwise similarity but discards all other data, which facilitates compact storage and scalable retrieval. In this paper we introduce TopSig, an open-source tool for hashing and retrieving topology-sensitive document signatures.
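To make the idea of a similarity-preserving signature concrete, the following is a minimal sketch of one generic way to build fixed-width binary signatures from text and compare them by Hamming distance. It is an illustration under stated assumptions, not TopSig's implementation: the 256-bit width, the whitespace tokenizer, the SHA-256 seeding, and the names `term_vector`, `signature`, and `hamming` are all hypothetical choices made for this example.

```python
import hashlib
import random

SIG_BITS = 256  # assumed signature width, chosen only for this sketch


def term_vector(term, bits=SIG_BITS):
    """Return a pseudo-random +1/-1 vector derived deterministically from a term."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [1 if rng.getrandbits(1) else -1 for _ in range(bits)]


def signature(text, bits=SIG_BITS):
    """Sum the term vectors of a document and keep only the sign of each position."""
    totals = [0] * bits
    for term in text.lower().split():  # naive whitespace tokenizer (assumption)
        for i, value in enumerate(term_vector(term, bits)):
            totals[i] += value
    sig = 0
    for i, total in enumerate(totals):
        if total > 0:
            sig |= 1 << i  # positive sums become 1 bits, all others stay 0
    return sig


def hamming(a, b):
    """Count the bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog"
    doc_b = "the quick brown fox jumps over the lazy cat"
    doc_c = "signature files support compact storage and scalable retrieval today"
    print("near pair:", hamming(signature(doc_a), signature(doc_b)))
    print("far pair: ", hamming(signature(doc_a), signature(doc_c)))
```

Documents that share most of their terms should yield signatures that differ in few bits, while unrelated documents should differ in roughly half of them. Each document is stored in a fixed number of bits regardless of its length, which is what makes this kind of representation compact.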
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Chappell, T., Geva, S. (2015). TopSig: A Scalable System for Hashing and Retrieving Document Signatures. In: Zuccon, G., Geva, S., Joho, H., Scholer, F., Sun, A., Zhang, P. (eds) Information Retrieval Technology. AIRS 2015. Lecture Notes in Computer Science, vol 9460. Springer, Cham. https://doi.org/10.1007/978-3-319-28940-3_40
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-28939-7
Online ISBN: 978-3-319-28940-3
eBook Packages: Computer Science, Computer Science (R0)