Abstract
The Gorkana Group provides high-quality media monitoring services to its clients. This paper describes an ongoing project aimed at increasing the amount of automation in Gorkana Group’s workflow through the application of machine learning and language processing technologies. It is important that Gorkana Group’s clients have a very high level of confidence that, if an article is relevant to one of their briefs, they will be shown the article. However, delivering this high-quality media monitoring service means that humans are required to read through very large quantities of data, only a small portion of which is typically deemed relevant. The challenge addressed by the work reported in this paper is how to achieve such high-quality media monitoring efficiently in the face of huge increases in the amount of data that needs to be monitored. We show that, while machine learning can be applied successfully to this real-world business problem, the constraints of the task give rise to a number of interesting challenges.




Notes
Note that the true population probabilities of types in natural language text are a hypothetical and ill-defined concept. Probabilities can be measured for large corpora, but any such measurement is only an estimate of the hypothesised true probabilities.
Cite this article
Lyra, M., Clarke, D., Morgan, H. et al. High Value Media Monitoring With Machine Learning. Künstl Intell 27, 255–265 (2013). https://doi.org/10.1007/s13218-013-0255-2
DOI: https://doi.org/10.1007/s13218-013-0255-2