Abstract
Natural Language Processing and Information Retrieval have been studied for a long time, but the recent availability of very large resources (Web pages, digital documents…) and the development of statistical machine learning methods that exploit annotated texts (manual annotation by crowdsourcing has become a major new source of such data) have transformed these fields. As a result, these approaches are no longer limited to highly specialized domains, and the cost of implementing them has fallen. In this chapter, we present some popular statistical text-mining approaches for information retrieval and information extraction, and discuss the practical limits of current systems, which raise challenges for future work.
Notes
- 1.
Alternatives to probability, Bayesian networks and other probabilistic graphical models have been proposed for dealing with uncertainty, among them fuzzy logic and Dempster-Shafer theory.
- 2.
Precision is the fraction of retrieved items that are relevant or well classified while recall is the fraction of relevant items that are retrieved and provided as result. F-score is the harmonic mean of precision and recall.
- 3.
Stemming reduces the morphological variants of a word to a common root. See for example Snowball, which makes light stemming available for many languages (http://snowball.tartarus.org). Lemmatization can be seen as an advanced form of stemming.
- 4.
Google Books Ngram (http://books.google.com/ngrams) and the n-grams from the Corpus of Contemporary American English, COCA (http://www.ngrams.info/), are two popular and freely downloadable word n-gram sets for English.
- 5.
- 6.
- 7.
CoNLL 2003 (Conference on Computational Natural Language Learning) hosted a challenge on language-independent named entity recognition. Many other Natural Language Processing tasks have been organized at CoNLL conferences: grammatical error correction, multilingual parsing, dependency analysis… (http://www.clips.ua.ac.be/conll/).
- 8.
As of June 2013, Freebase (https://developers.google.com/freebase/) contained more than 37 million entities, 1,998 types and 30,000 properties.
- 9.
- 10.
- 11.
LDC catalog number LDC2002T31 (http://www.ldc.upenn.edu).
- 12.
- 13.
http://ir.dcs.gla.ac.uk/test_collections/blog06info.html (about 40 GB of data for feeds only).
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
DBpedia (http://dbpedia.org) is a large knowledge base, localized in 111 languages, built by extracting structured information from Wikipedia. As of June 2013, more than 3.77 million things were classified in its ontology.
- 21.
- 22.
- 23.
- 24.
- 25.
Text Encoding Initiative (http://www.tei-c.org/Guidelines/).
- 26.
This project was supported by the 6th Framework Research Programme of the European Union (EU), Project LUNA, IST contract no 33549 (www.ist-luna.eu).
- 27.
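As a worked illustration of note 2, the definitions of precision, recall and F-score can be sketched in a few lines of Python. This is a minimal sketch, not code from the chapter: the function name `precision_recall_f1`, the document identifiers, and the convention of returning 0.0 when a denominator is empty are all assumptions made for the example.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F-score) for two collections of item ids.

    Hypothetical helper: empty denominators yield 0.0 by convention.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved items that are relevant

    # Precision: fraction of retrieved items that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant items that are retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    # F-score: harmonic mean of precision and recall.
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Illustrative data: 4 documents retrieved, 3 relevant, 2 in common.
p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

With these illustrative sets, two of the four retrieved documents are relevant (precision 1/2) and two of the three relevant documents are retrieved (recall 2/3), so the harmonic mean is 4/7.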
Copyright information
© 2014 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Bellot, P., Bonnefoy, L., Bouvier, V., Duvert, F., Kim, YM. (2014). Large Scale Text Mining Approaches for Information Retrieval and Extraction. In: Faucher, C., Jain, L. (eds) Innovations in Intelligent Machines-4. Studies in Computational Intelligence, vol 514. Springer, Cham. https://doi.org/10.1007/978-3-319-01866-9_1
DOI: https://doi.org/10.1007/978-3-319-01866-9_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-01865-2
Online ISBN: 978-3-319-01866-9
eBook Packages: Engineering (R0)