ABSTRACT
Fingerprinting algorithms and sequence alignment are widely used to calculate the similarity of documents. The fingerprinting method is simple and fast, but it cannot locate specific similar regions. A string alignment method identifies similar regions by arranging sequences of strings. It can locate specific similar regions, but it requires more computational time. Multi-level alignment (MLA) is a new method designed to exploit the advantages of both. MLA divides the input documents into blocks of uniform length, extracts a fingerprint from each block, and calculates the similarity of each block pair by comparing fingerprints. A similarity table is created during this process. Finally, sequence alignment is used to identify the longest similar regions in the similarity table. MLA allows users to change the block size to control the relative share of work done by the fingerprint algorithm and by sequence alignment. Because a document is divided into several blocks, a similar region can be fragmented across two or more blocks. To address this fragmentation problem, we propose a united block method, which integrates adjacent fragmented similar regions to increase the similarity value. Our experiments demonstrated that computing a document's similarity using the united block method was more accurate than the original MLA method, with minor reductions in time.
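The pipeline described in the abstract (blocks, per-block fingerprints, a similarity table, then alignment over the table) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the word-based blocks, the CRC32-hashed word bigrams used as fingerprints, the Dice coefficient used for block similarity, and the Smith-Waterman-style scoring with its `threshold` and `gap` values are all assumptions made for the example. The united block method is not shown.

```python
import zlib


def split_blocks(text, block_size):
    """Split a document into blocks of `block_size` words (assumed unit)."""
    words = text.split()
    return [words[i:i + block_size] for i in range(0, len(words), block_size)]


def fingerprint(block, k=2):
    """Fingerprint a block as a set of hashed word k-grams (an assumption)."""
    if len(block) < k:
        grams = [tuple(block)] if block else []
    else:
        grams = [tuple(block[i:i + k]) for i in range(len(block) - k + 1)]
    return {zlib.crc32(" ".join(g).encode("utf-8")) for g in grams}


def dice(a, b):
    """Dice coefficient of two fingerprint sets, in [0, 1]."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def similarity_table(doc_a, doc_b, block_size, k=2):
    """Table whose cell [i][j] is the similarity of block i of A and block j of B."""
    fa = [fingerprint(b, k) for b in split_blocks(doc_a, block_size)]
    fb = [fingerprint(b, k) for b in split_blocks(doc_b, block_size)]
    return [[dice(x, y) for y in fb] for x in fa]


def best_local_alignment(table, threshold=0.5, gap=0.5):
    """Smith-Waterman-style local alignment over the similarity table.

    A block pair scores (similarity - threshold), so only pairs above the
    hypothetical threshold extend a region. Returns the best region score.
    """
    rows = len(table)
    cols = len(table[0]) if rows else 0
    h = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    best = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            h[i][j] = max(
                0.0,
                h[i - 1][j - 1] + table[i - 1][j - 1] - threshold,
                h[i - 1][j] - gap,
                h[i][j - 1] - gap,
            )
            best = max(best, h[i][j])
    return best


if __name__ == "__main__":
    a = "a b c d e f g h"
    table = similarity_table(a, a, block_size=4)
    print(table)
    print(best_local_alignment(table))
```

In this sketch a smaller `block_size` yields a larger table, so more of the work shifts to the alignment step, while a larger `block_size` leaves more of it to fingerprint comparison. That mirrors the trade-off the abstract attributes to the block size.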
Index Terms
- Multi-level sequence alignment: a trade-off between speed and accuracy in similar text searching