CiteSeer^x: A Scholarly Big Dataset

Caragea, Cornelia; Wu, Jian; Ciobanu, Alina; Williams, Kyle; Fernández-Ramírez, Juan; Chen, Hung-Hsuan; Wu, Zhaohui; Giles, Lee

doi:10.1007/978-3-319-06028-6_26

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8416)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

The CiteSeer^x digital library stores and indexes research articles in Computer Science and related fields. Although its main purpose is to make it easier for researchers to search for scientific information, CiteSeer^x has proven to be a powerful resource for many data mining, machine learning and information retrieval applications that use rich metadata such as titles, abstracts, authors, venues and reference lists. Metadata extraction in CiteSeer^x is done using automated techniques. Although fairly accurate, these techniques still produce noisy metadata. Since the performance of models trained on these data depends heavily on data quality, we propose an approach to CiteSeer^x metadata cleaning that incorporates information from an external data source. The result is a subset of CiteSeer^x that is substantially cleaner than the entire set. Our goal is to make the new dataset available to the research community to facilitate future work in Information Retrieval.
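
The abstract does not say how noisy records are matched against the external source. As a rough illustration only, the idea of keeping the subset of records confirmed by a curated source might be sketched as follows. The record layout (`title`, `authors`), the helper names, the Jaccard title similarity and the 0.8 threshold are all assumptions made for this sketch, not the authors' method.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def last_names(authors):
    """Set of normalized last tokens of each author name."""
    return {normalize(name).split()[-1] for name in authors if normalize(name)}


def clean_subset(noisy_records, external_records, title_threshold=0.8):
    """Keep only noisy records confirmed by an external record.

    Hypothetical matching rule: titles must be near-identical after
    normalization and at least one author last name must be shared.
    """
    cleaned = []
    for rec in noisy_records:
        rec_tokens = set(normalize(rec["title"]).split())
        rec_names = last_names(rec["authors"])
        for ext in external_records:
            ext_tokens = set(normalize(ext["title"]).split())
            if jaccard(rec_tokens, ext_tokens) < title_threshold:
                continue
            if rec_names & last_names(ext["authors"]):
                # Trust the curated source for the matched fields.
                cleaned.append(
                    {**rec, "title": ext["title"], "authors": ext["authors"]}
                )
                break
    return cleaned


# Toy data, invented for illustration.
noisy = [
    {"id": "a", "title": "Link-based Classification.", "authors": ["Q. Lu", "L. Getoor"]},
    {"id": "b", "title": "Introduction 1", "authors": ["Q. Lu"]},
]
external = [
    {"title": "Link-based classification", "authors": ["Qing Lu", "Lise Getoor"]},
]
result = clean_subset(noisy, external)
```

In this toy run only record `a` survives, with its title and authors replaced by the external values. The nested loop compares every pair of records, so a dataset of realistic size would need blocking or an index on normalized titles.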

Author information

Authors and Affiliations

Computer Science and Engineering
Cornelia Caragea, Juan Fernández-Ramírez, Hung-Hsuan Chen, Zhaohui Wu & Lee Giles
Information Sciences and Technology
Jian Wu, Kyle Williams & Lee Giles
Computer Science
Alina Ciobanu
University of North Texas, Denton, TX, USA
Cornelia Caragea
Pennsylvania State University, University Park, PA, USA
Jian Wu, Kyle Williams, Hung-Hsuan Chen, Zhaohui Wu & Lee Giles
University of Bucharest, Bucharest, Romania
Alina Ciobanu
University of the Andes, Bogota, Colombia
Juan Fernández-Ramírez

Editor information

Editors and Affiliations

University of Amsterdam, Amsterdam, The Netherlands
Maarten de Rijke & Tom Kenter
Centrum Wiskunde en Informatica, Amsterdam, The Netherlands and Delft University of Technology, Delft, The Netherlands
Arjen P. de Vries
University of Illinois at Urbana-Champaign, Urbana, IL, USA
ChengXiang Zhai
University of Twente, Twente, The Netherlands and Erasmus University Rotterdam, Rotterdam, The Netherlands
Franciska de Jong
SalesPredict, Haifa, Israel
Kira Radinsky
Microsoft Research, Cambridge, UK
Katja Hofmann

About this paper

Cite this paper

Caragea, C. et al. (2014). CiteSeer^x: A Scholarly Big Dataset. In: de Rijke, M., et al. Advances in Information Retrieval. ECIR 2014. Lecture Notes in Computer Science, vol 8416. Springer, Cham. https://doi.org/10.1007/978-3-319-06028-6_26

DOI: https://doi.org/10.1007/978-3-319-06028-6_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06027-9
Online ISBN: 978-3-319-06028-6
eBook Packages: Computer Science, Computer Science (R0)
