A Detection of the Most Influential Documents

Ceglarek, Dariusz; Haniewicz, Konstanty

doi:10.1007/978-3-642-32518-2_5

Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 185)

Abstract

This work results from ongoing research on semantic compression and on robust algorithms applicable to plagiarism detection. The article briefly describes the Sentence Hashing Algorithm for Plagiarism Detection (SHAPD) and compares it with available alternatives that use frame structures for subsequence detection. The core of the publication is devoted to applying SHAPD to the task of discovering the most influential documents in a corpus. The experiments were carried out on multiple datasets that differ in structure and content, and the observations gathered during them are summarised in the article. The experiments allowed the authors to verify their initial hypothesis that the most important documents in a corpus can be singled out by capturing the citation relations among them.

Author information

Authors and Affiliations

Poznan School of Banking, Poznan, Poland
Dariusz Ceglarek
Poznan University of Economics, Poznan, Poland
Konstanty Haniewicz

Corresponding author

Correspondence to Dariusz Ceglarek.

Editor information

Editors and Affiliations

Department of Computer Science, Eindhoven University of Technology, Eindhoven, 5600, Netherlands
Mykola Pechenizkiy
Institute of Computing Science, Poznan University of Technology, ul. Piotrowo 2, Poznan, 60-965, Poland
Marek Wojciechowski

About this paper

Cite this paper

Ceglarek, D., Haniewicz, K. (2013). A Detection of the Most Influential Documents. In: Pechenizkiy, M., Wojciechowski, M. (eds) New Trends in Databases and Information Systems. Advances in Intelligent Systems and Computing, vol 185. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32518-2_5

DOI: https://doi.org/10.1007/978-3-642-32518-2_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32517-5
Online ISBN: 978-3-642-32518-2
eBook Packages: Engineering, Engineering (R0)
