DOI: 10.1145/2557977.2558053
research-article

Multi-level sequence alignment: a trade-off between speed and accuracy in similar text searching

Published: 09 January 2014

ABSTRACT

Fingerprinting algorithms and sequence alignment are widely used to calculate document similarity. Fingerprinting is simple and fast, but it cannot locate the specific regions that are similar. Sequence alignment identifies similar regions by arranging sequences of strings against each other; it can locate specific similar regions, but it requires more computation time. Multi-level alignment (MLA) is a new method designed to exploit the advantages of both. MLA divides the input documents into uniform-length blocks, extracts a fingerprint from each block, and calculates the similarity of each block pair by comparing fingerprints, building a similarity table in the process. Sequence alignment is then used to identify the longest similar regions in the similarity table. MLA lets users change the block size to control the relative proportion of work done by the fingerprinting algorithm and by sequence alignment. Because a document is divided into several blocks, a similar region can also be fragmented across two or more blocks. To address this fragmentation problem, we propose a united block method, which integrates adjacent fragmented similar regions to increase the similarity value. Our experiments demonstrated that computing a document's similarity using the united block method was more accurate than the original MLA method, with minor reductions in time.
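
The pipeline described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the fingerprint scheme, the block-pair similarity measure, or the alignment algorithm, so the sketch assumes hashed word 2-grams as fingerprints, the Dice coefficient for block similarity, and a Smith-Waterman-style local alignment over the similarity table. All function names, the `threshold` and `gap` parameters, and their default values are hypothetical.

```python
import hashlib


def split_blocks(text, block_size):
    """Divide a document into uniform-length blocks of words."""
    words = text.split()
    return [words[i:i + block_size] for i in range(0, len(words), block_size)]


def fingerprint(block, k=2):
    """Return a set of hashed word k-grams (an assumed fingerprint scheme)."""
    if len(block) < k:
        grams = [" ".join(block)] if block else []
    else:
        grams = [" ".join(block[i:i + k]) for i in range(len(block) - k + 1)]
    # Keep the first 8 bytes of SHA-1 as a compact integer hash.
    return {
        int.from_bytes(hashlib.sha1(g.encode("utf-8")).digest()[:8], "big")
        for g in grams
    }


def dice(a, b):
    """Dice coefficient of two fingerprint sets (assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def similarity_table(doc_a, doc_b, block_size):
    """Compare every block of doc_a with every block of doc_b."""
    fa = [fingerprint(b) for b in split_blocks(doc_a, block_size)]
    fb = [fingerprint(b) for b in split_blocks(doc_b, block_size)]
    return [[dice(x, y) for y in fb] for x in fa]


def best_local_alignment(table, threshold=0.5, gap=0.5):
    """Smith-Waterman-style local alignment over the block similarity table.

    A block pair contributes (similarity - threshold), so similar pairs add
    to the score and dissimilar pairs subtract from it. Returns the best
    score and the 1-based (row, column) cell where that alignment ends.
    """
    rows = len(table)
    cols = len(table[0]) if table else 0
    h = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    best, best_end = 0.0, (0, 0)
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            score = table[i - 1][j - 1] - threshold
            h[i][j] = max(
                0.0,
                h[i - 1][j - 1] + score,  # extend along the diagonal
                h[i - 1][j] - gap,        # skip a block in doc_a
                h[i][j - 1] - gap,        # skip a block in doc_b
            )
            if h[i][j] > best:
                best, best_end = h[i][j], (i, j)
    return best, best_end
```

In this sketch the `block_size` argument plays the role of the user-controlled block size: smaller blocks produce a larger table and shift more of the work to alignment, while larger blocks rely more on fingerprint comparison. A shared passage that straddles a block boundary scores lower in each affected cell, which is the fragmentation effect the united block method is meant to address.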


Published in

ICUIMC '14: Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication
January 2014
757 pages
ISBN: 9781450326445
DOI: 10.1145/2557977

Copyright © 2014 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

ICUIMC '14 Paper Acceptance Rate: 116 of 407 submissions, 29%
Overall Acceptance Rate: 251 of 941 submissions, 27%