Improving NCD accuracy by combining document segmentation and document distortion

Granados, Ana; Martínez, Rafael; Camacho, David; Rodríguez, Francisco de Borja

doi:10.1007/s10115-013-0664-4

Regular Paper
Published: 06 June 2013

Knowledge and Information Systems, Volume 41, pages 223–245 (2014)

Abstract

Compression distances have been applied to a broad range of domains because of their parameter-free nature, wide applicability and leading efficacy. However, they have a characteristic that becomes a drawback under particular circumstances: when two objects of very different sizes are compared, the distance does not consider them similar even if they are related by a substring relationship. This work addresses that issue for the case in which compression distances are used to calculate similarities between documents. The proposed approach combines document segmentation and document distortion. Segmentation is used to tackle the size-mismatch drawback, while distortion is used to help compression distances obtain more reliable similarities. The results show that combining both techniques gives better results than applying neither of them or applying them separately, and these results are consistent across datasets of diverse nature.
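As a rough illustration of the ideas in the abstract, the following Python sketch computes the normalized compression distance of Cilibrasi and Vitányi, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), with zlib standing in for the compressor C. The abstract does not say which compressor, segmentation strategy or distortion method the paper uses, so everything beyond the NCD formula is an assumption made for illustration: the fixed-size chunking in `segments`, the minimum-over-chunks rule in `segmented_ncd`, and the word masking in `distort` with its caller-supplied word list are all hypothetical.

```python
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes; zlib is an assumed stand-in compressor."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def segments(doc: bytes, size: int) -> list[bytes]:
    """Cut a document into consecutive fixed-size chunks (hypothetical strategy)."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [doc[i:i + size] for i in range(0, len(doc), size)]


def segmented_ncd(small: bytes, large: bytes) -> float:
    """Smallest NCD between `small` and any chunk of `large` of the same size.

    Taking the minimum over chunks is an assumption, not the paper's rule.
    """
    return min(ncd(small, seg) for seg in segments(large, len(small)))


def distort(text: str, frequent: set[str]) -> str:
    """Mask every word found in `frequent` with asterisks of equal length.

    The masking scheme and the word list are illustrative assumptions.
    """
    return " ".join(
        "*" * len(word) if word.lower() in frequent else word
        for word in text.split()
    )
```

Under these assumptions, when a short document is cut out of a much longer one, `ncd(short, long)` tends to stay close to 1 because the compressed size of the long document dominates both the numerator and the denominator, whereas `segmented_ncd(short, long)` can fall close to 0 once a chunk lines up with the short document. A distorted comparison would pass `distort(a, words).encode()` and `distort(b, words).encode()` to either function.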

Acknowledgments

We want to thank Francisco Aura for his useful comments on the draft. We want to thank the anonymous referees for their constructive comments on the manuscript. This work was partially supported by the Spanish Ministry of Science and Innovation under TIN2010-19607 and TIN2010-19872/TSI.

Author information

Authors and Affiliations

Department of Computer Science, Escuela Politécnica Superior, Universidad Autónoma de Madrid, Madrid, Spain
Ana Granados, Rafael Martínez, David Camacho & Francisco de Borja Rodríguez

Corresponding author

Correspondence to Ana Granados.

About this article

Cite this article

Granados, A., Martínez, R., Camacho, D. et al. Improving NCD accuracy by combining document segmentation and document distortion. Knowl Inf Syst 41, 223–245 (2014). https://doi.org/10.1007/s10115-013-0664-4

Received: 23 June 2012
Revised: 20 December 2012
Accepted: 25 May 2013
Published: 06 June 2013
Issue Date: October 2014
DOI: https://doi.org/10.1007/s10115-013-0664-4
