Skip to main content

Extracting Topics from Semi-structured Data for Enhancing Enterprise Knowledge Graphs

  • Conference paper
  • First Online:

Abstract

Unifying information across the organizational data silos that lack documentation, structure and automated semantic discovery has been of an intense interest in the recent years. Enterprise knowledge graph is a common tool of data integration and knowledge discovery and it has become a backbone to APIs that demand access to structured knowledge. A piece which was previously unnoticed in building enterprise knowledge graph, is adding an abstract layer of themes and concepts which is mapped to various documents stored as semi-structured files in databases. Augmenting enterprise knowledge graphs by concepts will help companies to find the trends in their data and get a holistic view over their entire data stores. Extracting topics from semi-structured data suffers from lack of corpus or description as its major challenge. In this research, we investigate the impact of self-supplementation of words and documents on probabilistic topic modeling upon semi-structured data. Another contribution of this paper is finding the best tuning of probabilistic topic modeling that fits semi-structured data. The extracted topics are potential summaries and concepts about the dataset. Moreover, they can be mapped to their sources of origin in order to extend the enterprise knowledge graph. We consider 2 inferencing techniques and demonstrate the results on real data pools from Open City data and Kaggle data containing 7.5 GB and 1.15 GB of data stored in MongoDB collections, respectively. We also propose a selection heuristic for effective identification of topics hidden in various data sources.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

References

  1. Pan, J.Z., Vetere, G., Gómez-Pérez, J.M., Wu, H.: Exploiting Linked Data and Knowledge Graphs in Large Organisations. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-45654-6

    Book  Google Scholar 

  2. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)

    MATH  Google Scholar 

  3. Hype cycle for emerging technologies. https://www.gartner.com/smarterwithgar-tner/5-trends-emerge-in-gartner-hype-cycle-for-emerging-technologies-2018/. Accessed 22 Oct 2018

  4. Halevy, A.Y., et al.: Managing Google’s data lake: an overview of the goods system. IEEE Data Eng. Bull. 39(3), 5–14 (2016)

    Google Scholar 

  5. Chang, F., et al.: Bigtable: a distributed storage system for structured data. ACM Trans. Comput. Syst. (TOCS) 26(2), 4 (2008)

    Article  Google Scholar 

  6. Hellerstein, J.M., et al.: Ground: a data context service. In: CIDR (2017)

    Google Scholar 

  7. Beheshti, A., Benatallah, B., Nouri, R., Tabebordbar, A.: CoreKG: a knowledge lake service. Proce. VLDB Endowment 11(12), 1942–1945 (2018)

    Article  Google Scholar 

  8. Beheshti, A., Benatallah, B., Nouri, R., Chhieng, V.M., Xiong, H., Zhao, X.: CoreDB: a data lake service. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 2451–2454. ACM (2017)

    Google Scholar 

  9. Data discovery and lineage for big data ecosystem. https://github.com/linkedin/WhereHows. Accessed 22 Jan 2018

  10. Abolhassani, N., et al.: Universal metadata repository: integrating data profiles across an organization. In: 2018 IEEE International Conference on Information Reuse and Integration (IRI), pp. 452–459. IEEE (2018)

    Google Scholar 

  11. Hofmann, T.: Probabilistic latent semantic analysis. In: Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence, pp. 289–296. Morgan Kaufmann Publishers Inc. (1999)

    Google Scholar 

  12. Song, Y., Wang, H., Wang, Z., Li, H., Chen, W.: Short text conceptualization using a probabilistic knowledgebase. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence-Volume Volume Three, pp. 2330–2336. AAAI Press (2011)

    Google Scholar 

  13. Wu, W., Li, H., Wang, H., Zhu, K.Q.: Probase: a probabilistic taxonomy for text understanding. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 481–492. ACM (2012)

    Google Scholar 

  14. Jin, O., Liu, N.N., Zhao, K., Yu, Y., Yang, Q.: Transferring topical knowledge from auxiliary long texts for short text clustering. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 775–784. ACM (2011)

    Google Scholar 

  15. Qiang, J., Chen, P., Wang, T., Wu, X.: Topic modeling over short texts by incorporating word embeddings. In: Kim, J., Shim, K., Cao, L., Lee, J.-G., Lin, X., Moon, Y.-S. (eds.) PAKDD 2017. LNCS (LNAI), vol. 10235, pp. 363–374. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-57529-2_29

    Chapter  Google Scholar 

  16. Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In: Proceedings of the 22nd International Conference on World Wide Web, pp. 1445–1456. ACM (2013)

    Google Scholar 

  17. Quan, X., Kit, C., Ge, Y., Pan, S.J.: Short and sparse text topic modeling via self-aggregation. In: IJCAI, pp. 2270–2276 (2015)

    Google Scholar 

  18. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proc. Nat. Acad. Sci. 101(suppl 1), 5228–5235 (2004)

    Article  Google Scholar 

  19. Hoffman, M., Bach, F.R., Blei, D.M.: Online learning for latent Dirichlet allocation. In: Advances in Neural Information Processing Systems, pp. 856–864 (2010)

    Google Scholar 

  20. Sievert, C., Shirley, K.: LDAvis: a method for visualizing and interpreting topics. In: Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, pp. 63–70 (2014)

    Google Scholar 

  21. Kaggle open datasets. https://www.kaggle.com/datasets/. Accessed 02 Oct 2019

  22. Mimno, D., Wallach, H.M., Talley, E., Leenders, M., McCallum, A.: Optimizing semantic coherence in topic models. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 262–272. Association for Computational Linguistics (2011)

    Google Scholar 

Download references

Acknowledgments

The authors would like to thank Dr. Pouya Asrar and Kamal Shadi for their valuable feedback in this project. This research has been partially funded by the National Science Foundation (NSF) under grants CCF-1442672 and SCC-1637277 and gifts from Accenture Research Labs. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF or other funding agencies and companies mentioned above.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Neda Abolhassani .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2019 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Abolhassani, N., Ramaswamy, L. (2019). Extracting Topics from Semi-structured Data for Enhancing Enterprise Knowledge Graphs. In: Wang, X., Gao, H., Iqbal, M., Min, G. (eds) Collaborative Computing: Networking, Applications and Worksharing. CollaborateCom 2019. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 292. Springer, Cham. https://doi.org/10.1007/978-3-030-30146-0_8

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-30146-0_8

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30145-3

  • Online ISBN: 978-3-030-30146-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics