Newsminer: Enriched Multidimensional Corpus for Text-Based Applications

  • Conference paper
Artificial Intelligence and Soft Computing (ICAISC 2020)

Abstract

News websites are rich sources of terms from which a linguistic corpus can be built. Placing such a corpus in a data warehousing environment lets applications exploit the flexibility of a multidimensional model and OLAP operations. This paper presents Newsminer, an exploratory OLAP framework that offers a consistent, clean set of texts as a multidimensional corpus for consumption by external applications. The proposal integrates real-time news gathering with semantic enrichment, which adds automatic annotations to the corpus. The multidimensional facet allows users and applications to obtain different corpora by selecting news categories, time slices, and terms. We performed two experiments to evaluate the semantic enrichment and the feasibility of real-time operation during Newsminer’s ETL.
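
To make the idea of obtaining different corpora by category, time slice, and term more concrete, the following is a minimal sketch of an OLAP-style slice and roll-up over a toy in-memory fact table. It is not Newsminer's implementation: the `NewsFact` record, the `slice_corpus` and `roll_up_by_term` helpers, and the sample rows are all hypothetical and serve only to illustrate the kind of selection the abstract describes.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class NewsFact:
    """One hypothetical fact row: a term count for a news item."""
    category: str    # news category dimension
    published: date  # time dimension
    term: str        # term dimension
    count: int       # measure: occurrences of the term


# Invented sample rows, for illustration only.
FACTS = [
    NewsFact("sports", date(2019, 1, 5), "match", 3),
    NewsFact("sports", date(2019, 2, 10), "goal", 2),
    NewsFact("politics", date(2019, 1, 20), "vote", 4),
    NewsFact("politics", date(2019, 3, 1), "match", 1),
]


def slice_corpus(facts, categories=None, start=None, end=None, terms=None):
    """Keep the facts that satisfy every filter that was supplied."""
    selected = []
    for fact in facts:
        if categories is not None and fact.category not in categories:
            continue
        if start is not None and fact.published < start:
            continue
        if end is not None and fact.published > end:
            continue
        if terms is not None and fact.term not in terms:
            continue
        selected.append(fact)
    return selected


def roll_up_by_term(facts):
    """Aggregate the count measure over all dimensions except term."""
    totals = {}
    for fact in facts:
        totals[fact.term] = totals.get(fact.term, 0) + fact.count
    return totals


# A sub-corpus: sports news published in January and February 2019.
sports_early_2019 = slice_corpus(
    FACTS,
    categories={"sports"},
    start=date(2019, 1, 1),
    end=date(2019, 2, 28),
)
term_totals = roll_up_by_term(FACTS)
```

Each optional argument corresponds to one dimension, so omitting an argument leaves that dimension unrestricted; a real data warehouse would express the same selection as a query over dimension and fact tables.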

Acknowledgments

This work was supported by FAPESP (São Paulo Research Foundation, Brazil), grant number 2011/12115-1. The authors also acknowledge the CAPES graduate program.

Author information

Corresponding author

Correspondence to Sahudy Montenegro González.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

González, S.M., Sakata, T.C., Nogueira, R.R. (2020). Newsminer: Enriched Multidimensional Corpus for Text-Based Applications. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds) Artificial Intelligence and Soft Computing. ICAISC 2020. Lecture Notes in Computer Science, vol 12416. Springer, Cham. https://doi.org/10.1007/978-3-030-61534-5_21

  • DOI: https://doi.org/10.1007/978-3-030-61534-5_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-61533-8

  • Online ISBN: 978-3-030-61534-5

  • eBook Packages: Computer Science, Computer Science (R0)
