Abstract
News websites are rich sources of terms that can compose a linguistic corpus. By introducing a corpus into a Data Warehousing environment, applications can take advantage of the flexibility that a multidimensional model and OLAP operations provide. This paper presents Newsminer, an exploratory OLAP framework that offers a consistent and clean set of texts as a multidimensional corpus for consumption by external applications. The proposal integrates real-time news gathering with semantic enrichment, which adds automatic annotations to the corpus. The multidimensional facet allows users and applications to obtain different corpora by selecting news categories, time slices, and terms. We performed two experiments to evaluate the semantic enrichment and the feasibility of real-time processing during Newsminer’s ETL.
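The selection by category, time slice, and term described in the abstract can be pictured as filters over a small in-memory corpus, followed by a roll-up of term counts per category. The following is only a minimal sketch under stated assumptions: `NewsItem`, `slice_corpus`, and `term_counts_by_category` are hypothetical names introduced for illustration, the sample news items are invented, and nothing here reflects Newsminer's actual schema or implementation.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class NewsItem:
    """Hypothetical fact record: one news text with two dimensions."""
    category: str
    published: date
    text: str


def slice_corpus(items, categories=None, start=None, end=None, terms=None):
    """Return the items that pass every filter that was supplied.

    A filter left as None is ignored, so each call selects a different
    sub-corpus from the same underlying collection.
    """
    wanted = {t.lower() for t in terms} if terms else None
    selected = []
    for item in items:
        if categories is not None and item.category not in categories:
            continue
        if start is not None and item.published < start:
            continue
        if end is not None and item.published > end:
            continue
        # Naive whitespace tokenization; a real pipeline would do more.
        if wanted is not None and not (wanted & set(item.text.lower().split())):
            continue
        selected.append(item)
    return selected


def term_counts_by_category(items):
    """Roll-up style aggregation: term frequencies grouped by category."""
    counts = {}
    for item in items:
        counts.setdefault(item.category, Counter()).update(item.text.lower().split())
    return counts


# Invented sample data, for illustration only.
corpus = [
    NewsItem("sports", date(2019, 10, 1), "team wins final match"),
    NewsItem("politics", date(2019, 10, 2), "senate votes on budget"),
    NewsItem("sports", date(2019, 11, 5), "coach leaves team"),
]

october_sports = slice_corpus(
    corpus, categories={"sports"}, start=date(2019, 10, 1), end=date(2019, 10, 31)
)
team_news = slice_corpus(corpus, terms=["team"])
rollup = term_counts_by_category(corpus)
```

In this sketch each combination of filters yields a different corpus from the same collection, which is the flexibility the multidimensional model is meant to provide; a data warehouse would express the same selections as OLAP slice and roll-up operations rather than Python loops.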
Acknowledgments
This work was financially supported by FAPESP (São Paulo Research Foundation, Brazil), grant number 2011/12115-1. The authors also acknowledge the CAPES graduate program.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
González, S.M., Sakata, T.C., Nogueira, R.R. (2020). Newsminer: Enriched Multidimensional Corpus for Text-Based Applications. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2020. Lecture Notes in Computer Science, vol 12416. Springer, Cham. https://doi.org/10.1007/978-3-030-61534-5_21
DOI: https://doi.org/10.1007/978-3-030-61534-5_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-61533-8
Online ISBN: 978-3-030-61534-5
eBook Packages: Computer Science, Computer Science (R0)