Document Analysis and Retrieval Tasks in Scientific Digital Libraries

Gollapalli, Sujatha Das; Caragea, Cornelia; Li, Xiaoli; Giles, C. Lee

doi:10.1007/978-3-319-25485-2_1

Part of the book series: Communications in Computer and Information Science (CCIS, volume 505)

Included in the following conference series: Russian Summer School in Information Retrieval

Abstract

Machine Learning (ML) algorithms have opened up new possibilities for the acquisition and processing of documents in Information Retrieval (IR) systems. Indeed, it is now possible to automate several labor-intensive document tasks, such as categorization and entity extraction. Consequently, the application of machine learning techniques to large-scale IR tasks has attracted significant research interest in both the ML and IR communities. This tutorial provides a reference summary of our research in applying machine learning techniques to diverse tasks in Digital Libraries (DL). Digital library portals are specialized IR systems that work on collections of documents related to particular domains. We focus on open-access scientific digital libraries such as CiteSeerX, which involve several crawling, ranking, content analysis, and metadata extraction tasks. We elaborate on the challenges involved in these tasks and highlight how machine learning methods can successfully address them.

Notes

1. http://citeseerx.ist.psu.edu/
2. http://scholar.google.com/
3. http://academic.research.microsoft.com/
4. http://romip.ru/russir2014/
5. http://dl.acm.org/
6. http://www.ncbi.nlm.nih.gov/pubmed
7. http://www.cs.cmu.edu/afs/cs/project/theo20/www/data/
8. The WebKB dataset was created in 1997.
9. http://wordnet.princeton.edu/
10. http://nlp.stanford.edu/ner/index.shtml



Author information

Authors and Affiliations

Institute for Infocomm Research, Agency for Science and Technology Research, Singapore, Singapore
Sujatha Das Gollapalli & Xiaoli Li
Computer Science and Engineering, University of North Texas, Denton, USA
Cornelia Caragea
Information Sciences and Technology, Computer Science and Engineering, The Pennsylvania State University, State College, USA
C. Lee Giles

Corresponding author

Correspondence to Sujatha Das Gollapalli.

Editor information

Editors and Affiliations

Ural Federal University, Yekaterinburg, Russia
Pavel Braslavski
National Research University Higher School of Economics, Nizhniy Novgorod, Russia
Nikolay Karpov
Intelligent Systems Laboratory, University of Amsterdam, Amsterdam, The Netherlands
Marcel Worring
Barcelona Media Research Foundation, Barcelona, Spain
Yana Volkovich
National Research University Higher School of Economics, Moscow, Russia
Dmitry I. Ignatov

About this chapter

Cite this chapter

Gollapalli, S.D., Caragea, C., Li, X., Giles, C.L. (2015). Document Analysis and Retrieval Tasks in Scientific Digital Libraries. In: Braslavski, P., Karpov, N., Worring, M., Volkovich, Y., Ignatov, D.I. (eds) Information Retrieval. RuSSIR 2014. Communications in Computer and Information Science, vol 505. Springer, Cham. https://doi.org/10.1007/978-3-319-25485-2_1

DOI: https://doi.org/10.1007/978-3-319-25485-2_1
Published: 10 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25484-5
Online ISBN: 978-3-319-25485-2
eBook Packages: Computer Science (R0)
