Abstract
Machine Learning (ML) algorithms have opened up new possibilities for the acquisition and processing of documents in Information Retrieval (IR) systems. Indeed, it is now possible to automate several labor-intensive tasks related to documents such as categorization and entity extraction. Consequently, the application of machine learning techniques for various large-scale IR tasks has gathered significant research interest in both the ML and IR communities. This tutorial provides a reference summary of our research in applying machine learning techniques to diverse tasks in Digital Libraries (DL). Digital library portals are specialized IR systems that work on collections of documents related to particular domains. We focus on open-access, scientific digital libraries such as CiteSeer\(^x\), which involve several crawling, ranking, content analysis, and metadata extraction tasks. We elaborate on the challenges involved in these tasks and highlight how machine learning methods can successfully address these challenges.
Notes
- 8. The WebKB dataset was created in 1997.
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Gollapalli, S.D., Caragea, C., Li, X., Giles, C.L. (2015). Document Analysis and Retrieval Tasks in Scientific Digital Libraries. In: Braslavski, P., Karpov, N., Worring, M., Volkovich, Y., Ignatov, D.I. (eds) Information Retrieval. RuSSIR 2014. Communications in Computer and Information Science, vol 505. Springer, Cham. https://doi.org/10.1007/978-3-319-25485-2_1
DOI: https://doi.org/10.1007/978-3-319-25485-2_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25484-5
Online ISBN: 978-3-319-25485-2
eBook Packages: Computer Science, Computer Science (R0)