A Data Cleaning Method for CiteSeer Dataset

Wang, Yan; Zhang, Hao; Li, Yaxin; Wang, Deyun; Ma, Yanlin; Zhou, Tong; Lu, Jianguo

doi:10.1007/978-3-319-48740-3_3

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10041)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

CiteSeer is considered the first academic search engine and has been serving data for almost twenty years. Recently, CiteSeer graciously made all of its data public, including raw PDF files, text converted from the PDFs, and metadata extracted from the text. Numerous efforts have been made to improve the accuracy of the metadata extraction, but the problem is inherently challenging and errors are abundant. In this paper, we propose an innovative record-linkage-based method for data cleaning, which uses two new matching algorithms to significantly improve the cleaning performance for the CiteSeer dataset. One is an enhanced matching algorithm for local datasets; the other is developed for online datasets. Experimental results show that 48.1% of the wrong metadata entries can be corrected by our method in total, and the improvement is more than 539% compared to existing state-of-the-art data cleaning methods.
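As a rough illustration of what record-linkage-based cleaning means in general, the sketch below links a noisy extracted record to a trusted reference record by title similarity and copies the reference metadata over. It is not the paper's algorithm: the abstract does not specify the two matching algorithms, so the Jaccard token similarity, the 0.6 threshold, the record fields, and the function names `tokens`, `jaccard`, and `clean_record` are all assumptions made for this example.

```python
import re


def tokens(text):
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def clean_record(noisy, reference, threshold=0.6):
    """Link a noisy record to its most similar reference record by title.

    If the best similarity reaches the (assumed) threshold, the title and
    authors are replaced with the reference values; otherwise the noisy
    record is returned unchanged.
    """
    noisy_tokens = tokens(noisy["title"])
    best, best_score = None, 0.0
    for ref in reference:
        score = jaccard(noisy_tokens, tokens(ref["title"]))
        if score > best_score:
            best, best_score = ref, score
    if best is not None and best_score >= threshold:
        return {**noisy, "title": best["title"], "authors": best["authors"]}
    return noisy


# Toy data: a small trusted reference set and one badly extracted record
# (a page number leaked into the title and the authors were lost).
reference = [
    {"title": "A theory for record linkage",
     "authors": ["Fellegi, I.P.", "Sunter, A.B."]},
    {"title": "Lucene in Action",
     "authors": ["Hatcher, E.", "Gospodnetic, O."]},
]
noisy = {"id": "doc-1",
         "title": "A Theory for Record Linkage 1183",
         "authors": ["Unknown"]}

cleaned = clean_record(noisy, reference)
unmatched = clean_record(
    {"id": "doc-2", "title": "Completely unrelated words", "authors": []},
    reference,
)
```

The linear scan over the reference set is only workable for toy inputs; linking at the scale of a full digital library would need an index or blocking step to limit the candidate pairs that are compared.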

Acknowledgements

This work has been partially supported by the National Key Research Program of China (2016YFB1001101), NSFC (No. 61440020, No. 61272398 and No. 61309030), an NSERC Discovery grant (RGPIN-2014-04463), and Programs for Innovation Research in CUFE.

Author information

Authors and Affiliations

School of Information, Central University of Finance and Economics, Beijing, China
Yan Wang, Hao Zhang, Yaxin Li, Deyun Wang & Yanlin Ma
School of Computer Science, University of Windsor, Windsor, Ontario, N9B 3P4, Canada
Tong Zhou & Jianguo Lu

Corresponding author

Correspondence to Yan Wang.

Editor information

Editors and Affiliations

Poznań University of Economics, Poznan, Poland
Wojciech Cellary
University of Minnesota, Minneapolis, Minnesota, USA
Mohamed F. Mokbel
Tsinghua University, Beijing, China
Jianmin Wang
Victoria University, Melbourne, Victoria, Australia
Hua Wang
Victoria University, Melbourne, Victoria, Australia
Rui Zhou
Victoria University, Melbourne, Victoria, Australia
Yanchun Zhang

About this paper

Cite this paper

Wang, Y. et al. (2016). A Data Cleaning Method for CiteSeer Dataset. In: Cellary, W., Mokbel, M., Wang, J., Wang, H., Zhou, R., Zhang, Y. (eds) Web Information Systems Engineering – WISE 2016. Lecture Notes in Computer Science, vol 10041. Springer, Cham. https://doi.org/10.1007/978-3-319-48740-3_3

DOI: https://doi.org/10.1007/978-3-319-48740-3_3
Published: 02 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48739-7
Online ISBN: 978-3-319-48740-3
eBook Packages: Computer Science, Computer Science (R0)
