research-article

Multi-level sequence alignment: a trade-off between speed and accuracy in similar text searching

Authors:
Jong-kyu Seo, Pusan National University, Korea
Hae-sung Tak, Pusan National University, Korea
Hwan-gue Cho, Pusan National University, Korea

ICUIMC '14: Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication, January 2014, Article No.: 73, Pages 1–8, https://doi.org/10.1145/2557977.2558053

Published: 09 January 2014

ABSTRACT

Fingerprinting algorithms and sequence alignment are widely used to calculate the similarity of documents. The fingerprinting method is simple and fast, but it cannot locate the specific regions that are similar. A string alignment method identifies similar regions by arranging sequences of strings; it can locate specific similar regions, but it requires more computational time. Multi-level alignment (MLA) is a new method designed to exploit the advantages of both. MLA divides the input documents into uniform-length blocks, extracts a fingerprint from each block, and calculates the similarity of each block pair by comparing fingerprints. This process produces a similarity table. Finally, sequence alignment is applied to the similarity table to identify the longest similar regions. MLA lets users change the block size to control the relative proportion of work done by the fingerprint algorithm and by sequence alignment. Because a document is divided into several blocks, a similar region can itself be fragmented across two or more blocks. To address this fragmentation problem, we propose a united block method, which integrates adjacent fragmented similar regions to increase the similarity value. Our experiments demonstrated that computing a document's similarity using the united block method was more accurate than the original MLA method, with minor reductions in time.
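The MLA pipeline described in the abstract (blocks, per-block fingerprints, a similarity table, then alignment over that table) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not specify the fingerprint function, the block similarity measure, or the alignment scoring, so the character k-gram fingerprint, the Jaccard similarity, the Smith-Waterman-style scoring, and the names `split_blocks`, `fingerprint`, `similarity_table`, `best_local_alignment`, `threshold` and `gap` are all assumptions made for the example. The united block method is not sketched, because the abstract gives no detail on how adjacent regions are merged.

```python
from typing import List, Set, Tuple


def split_blocks(text: str, block_size: int) -> List[str]:
    """Cut the text into consecutive blocks of block_size characters."""
    return [text[i:i + block_size] for i in range(0, len(text), block_size)]


def fingerprint(block: str, k: int = 3) -> Set[str]:
    """Assumed fingerprint: the set of character k-grams in the block."""
    return {block[i:i + k] for i in range(len(block) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Assumed block similarity: Jaccard index of two fingerprints."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_table(doc_a: str, doc_b: str, block_size: int,
                     k: int = 3) -> List[List[float]]:
    """table[i][j] is the similarity of block i of doc_a and block j of doc_b."""
    fa = [fingerprint(b, k) for b in split_blocks(doc_a, block_size)]
    fb = [fingerprint(b, k) for b in split_blocks(doc_b, block_size)]
    return [[jaccard(x, y) for y in fb] for x in fa]


def best_local_alignment(table: List[List[float]], threshold: float = 0.5,
                         gap: float = 1.0) -> Tuple[float, int, int]:
    """Smith-Waterman-style local alignment over blocks (assumed scoring).

    Aligning block i with block j scores table[i][j] - threshold, so only
    block pairs above the threshold extend a region. Returns the best score
    and the (row, column) block indices where the best region ends.
    """
    rows = len(table)
    cols = len(table[0]) if rows else 0
    score = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    best = (0.0, 0, 0)
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            match = score[i - 1][j - 1] + (table[i - 1][j - 1] - threshold)
            score[i][j] = max(0.0, match,
                              score[i - 1][j] - gap,
                              score[i][j - 1] - gap)
            if score[i][j] > best[0]:
                best = (score[i][j], i - 1, j - 1)
    return best


if __name__ == "__main__":
    a = "xxxxxxxx" + "abcdefgh" + "ijklmnop"
    b = "abcdefgh" + "ijklmnop" + "zzzzzzzz"
    table = similarity_table(a, b, block_size=8)
    print(table)
    print(best_local_alignment(table))
```

In this toy input the shared text happens to start on a block boundary in both documents. When it does not, one similar region is spread over partially matching blocks, which is the fragmentation problem the abstract raises. Smaller blocks shift work toward alignment (a larger table), while larger blocks shift it toward fingerprint comparison.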

Index Terms

1. Information systems
  1. Information retrieval
  2. Information storage systems

Published in
ICUIMC '14: Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication
January 2014
757 pages
ISBN:9781450326445
DOI:10.1145/2557977
Conference Chairs: Dongsoo S. Kim (Indiana University), Sang-Wook Kim (Hanyang University, Korea)
General Chairs: Suk-Han Lee (Sungkyunkwan University, Korea), Lajos Hanzo (University of Southampton, UK), Roslan Ismail (Universiti Kuala Lumpur, Malaysia)
Copyright © 2014 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
document comparing
document fingerprinting
plagiarism detection
string alignment
Acceptance Rates
ICUIMC '14 Paper Acceptance Rate: 116 of 407 submissions, 29%
Overall Acceptance Rate: 251 of 941 submissions, 27%