Structuring Unstructured Data—Or: How Machine Learning Can Make You a Wine Sommelier

Abstract

Textual data, for example in the form of e-mails, instant messages, or social media posts, is ubiquitous today. Because textual data typically comes in unstructured formats and is often ambiguous in meaning, it is difficult to analyze with computational tools. However, advances in machine learning and the increasing availability of training data now make it possible to extract useful knowledge from large amounts of unstructured textual data. In this chapter, we showcase the use of unsupervised machine learning algorithms and visualization techniques to bring structure to, and thereby learn from, more than 100,000 professional wine reviews. This could be useful, for example, when choosing suitable wines for the celebration of your 60th birthday.
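
As a rough illustration of what "bringing structure" to review text can mean, the sketch below turns a few free-text reviews into a document-term matrix, the tabular form that unsupervised algorithms typically take as input. It is not the chapter's pipeline: the three reviews, the stop-word list, and the helper names `tokenize` and `document_term_matrix` are hypothetical and invented for this example, and only the Python standard library is used.

```python
import re
from collections import Counter

# Hypothetical toy reviews, invented for illustration (not taken from the dataset).
reviews = [
    "Ripe cherry and plum with soft tannins.",
    "Crisp citrus and green apple, bright acidity.",
    "Dark cherry, firm tannins and a long finish.",
]

# A tiny hand-picked stop-word list (an assumption; real lists are much larger).
STOP_WORDS = {"and", "with", "a", "the"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def document_term_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


vocabulary, matrix = document_term_matrix(reviews)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one review and each column one term, so reviews that share vocabulary (here, the two mentioning "cherry" and "tannins") end up with overlapping non-zero columns. A matrix of this kind could then be passed to a clustering or topic-modeling algorithm.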

Notes

  1. All steps of the following analysis were performed in Python (mainly using the spaCy, NLTK, gensim, and scikit-learn packages), and all visualizations were created with Tableau.

  2. The dataset can be downloaded at https://www.kaggle.com/zynicide/wine-reviews.

  3. For a discussion of the optimal number of topics, see, e.g., Debortoli, Müller, Junglas, & vom Brocke (2016) or Schmiedel, Müller, & vom Brocke (2018).

References

  • Blei, D. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.

  • Blei, D., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

  • BrightLocal. (2014). Local consumer review survey 2014. Retrieved September 19, 2018, from https://www.brightlocal.com/learn/local-consumer-review-survey-2014/.

  • Debortoli, S., Müller, O., Junglas, I., & vom Brocke, J. (2016). Text mining for information systems researchers: An annotated topic modeling tutorial. Communications of the Association for Information Systems, 39(1).

  • Dhar, V. (2013). Data science and prediction. Communications of the ACM, 56(12), 64–73.

  • Fan, W., Wallace, L., Rich, S., & Zhang, Z. (2006). Tapping the power of text mining. Communications of the ACM, 49(9), 76–82.

  • Frawley, W. J., Piatetsky-Shapiro, G., & Matheus, C. J. (1992). Knowledge discovery in databases: An overview. AI Magazine, 13(3), 57–70.

  • Friedman, J., Hastie, T., & Tibshirani, R. (2013). The elements of statistical learning. New York: Springer.

  • Halevy, A., Norvig, P., & Pereira, F. (2009). The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2), 8–12.

  • Hastie, T., Tibshirani, R., & Friedman, J. (2013). The elements of statistical learning. New York: Springer.

  • IDC. (2014). The 2014 digital universe study. Retrieved September 19, 2018, from http://www.emc.com/leadership/digital-universe/index.htm#2014.

  • Jurafsky, D., & Martin, J. H. (2000). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Pearson.

  • Mudambi, S. M., & Schuff, D. (2010). What makes a helpful online review? A study of customer reviews on Amazon.com. MIS Quarterly, 34(1), 185–200.

  • Schmiedel, T., Müller, O., & vom Brocke, J. (2018). Topic modeling as a strategy of inquiry in organizational research: A tutorial with an application example on organizational culture. Organizational Research Methods.

  • Statista. (2017). Number of user reviews and opinions on TripAdvisor worldwide from 2014 to 2017. Retrieved September 19, 2018, from https://www.statista.com/statistics/684862/tripadvisor-number-of-reviews/.

Author information

Corresponding author

Correspondence to Oliver Müller.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Müller, O. (2019). Structuring Unstructured Data—Or: How Machine Learning Can Make You a Wine Sommelier. In: Bergener, K., Räckers, M., Stein, A. (eds) The Art of Structuring. Springer, Cham. https://doi.org/10.1007/978-3-030-06234-7_29
