Abstract
Since 80% of all information in the World Wide Web (WWW) is in textual form, most user search activity is based on groups of search words that form queries representing the users' information needs. The quality of the returned results, usually evaluated using measures such as precision and recall, depends mostly on the quality of the chosen query terms. Therefore, the relatedness of these terms must be evaluated and matched against the documents to be found. To do so properly, this paper introduces the notion of n-term co-occurrences and distinguishes it from the related concepts of n-grams and higher-order co-occurrences. Finally, the applicability of n-term co-occurrences for search, clustering and data mining processes is considered.
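The distinction drawn above can be sketched in a few lines. The abstract does not state the formal definitions, so this sketch assumes that an n-gram is a contiguous, ordered sequence of n tokens, while an n-term co-occurrence is an unordered set of n distinct terms appearing together in the same sentence. The function names and sample sentences are hypothetical and serve only as illustration; this is not the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Return all contiguous, ordered sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def n_term_cooccurrences(sentences, n):
    """Count unordered sets of n distinct terms that share a sentence.

    Assumption: the sentence is the co-occurrence window.
    """
    counts = Counter()
    for tokens in sentences:
        # Sort the distinct terms so each set has one canonical key.
        for combo in combinations(sorted(set(tokens)), n):
            counts[combo] += 1
    return counts


# Hypothetical, already tokenised sentences.
sentences = [
    ["web", "search", "query", "terms"],
    ["query", "terms", "web", "documents"],
]

trigrams = Counter(g for s in sentences for g in ngrams(s, 3))
cooc = n_term_cooccurrences(sentences, 3)

# The three terms share both sentences, but appear as a contiguous
# sequence in only one of them.
print(cooc[("query", "terms", "web")])      # count as 3-term co-occurrence
print(trigrams[("query", "terms", "web")])  # count as 3-gram
```

Under these assumptions, word order and adjacency matter for the n-gram count but not for the co-occurrence count, which is why the two numbers differ for the same three terms.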
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Kubek, M., Unger, H. (2014). On N-term Co-occurrences. In: Boonkrong, S., Unger, H., Meesad, P. (eds) Recent Advances in Information and Communication Technology. Advances in Intelligent Systems and Computing, vol 265. Springer, Cham. https://doi.org/10.1007/978-3-319-06538-0_7
DOI: https://doi.org/10.1007/978-3-319-06538-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-06537-3
Online ISBN: 978-3-319-06538-0
eBook Packages: Engineering, Engineering (R0)