Empowering Bridging Term Discovery for Cross-Domain Literature Mining in the TextFlows Platform

Perovšek, Matic; Juršič, Matjaž; Cestnik, Bojan; Lavrač, Nada

doi:10.1007/978-3-319-50478-0_4

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9605)

Abstract

Given its immense growth, scientific literature can be explored to reveal new discoveries based on as yet uncovered relations between knowledge from different, relatively isolated fields of research specialization. This chapter proposes a bisociation-based text mining approach, which is shown to be effective for cross-domain knowledge discovery. The proposed cross-domain literature mining functionality, including text acquisition, text preprocessing, and bisociative cross-domain literature mining facilities, is made publicly available within TextFlows, a new browser-based workflow execution engine that supports visual construction and execution of text mining and natural language processing (NLP) workflows. To support bisociative cross-domain literature mining, the TextFlows platform includes implementations of several elementary and ensemble heuristics that guide the expert in the process of exploring new cross-context bridging terms. We have extended the TextFlows platform with several components which, together with the document exploration and visualization features of the CrossBee human-computer interface, make it a powerful, user-friendly text analysis tool for exploratory cross-domain knowledge discovery. Another novelty of the developed technology is that it enables the use of controlled vocabularies to improve bridging term extraction. The potential of the developed functionality was showcased in two medical benchmark domains.

Notes

1. Our new text mining platform, named TextFlows, is publicly available for use at http://textflows.org. The source code (open sourced under the MIT Licence) is available at https://github.com/xflows/textflows. Detailed installation instructions are provided with the source code.
2. http://textflows.org/workflow/486/.
3. Our platform TextFlows is a fork of the data mining platform ClowdFlows [21], adapted to text mining and enriched with text analytics and natural language processing algorithms. As a fork of ClowdFlows, it benefits from its service-oriented architecture, which allows the user to utilize arbitrary web services as workflow components. In addition to the new functionality, its novelty is a common text representation structure and the development of ‘hubs’ for algorithm execution.
4. LATINO (Link Analysis and Text Mining Toolbox) is open source (mostly under the LGPL license) and is available at https://github.com/LatinoLib/LATINO/.
5. https://www.python.org/.
6. LATINO (Link Analysis and Text Mining Toolbox library) is open source (mostly under the LGPL license) and is available at https://github.com/LatinoLib/LATINO/.
7. Natural Language Toolkit.
8. Compressed Sparse Row (CSR) matrices are implemented in the scipy.sparse package http://docs.scipy.org/doc/scipy/reference/sparse.html.
9. The Calculate Term Heuristic Scores widget also takes as input the BowModelConstructor object and the AnnotatedDocumentCorpus. The parse settings from the BowModelConstructor object are used to construct Compressed Sparse Row (CSR) matrices, which represent the BoW model. TextFlows uses the mathematical libraries numpy and scipy to perform the heuristics calculations efficiently.
10. Due to the large number of heuristics and auxiliary functions, we use the so-called camel-case multi-word naming scheme for easier distinction; names are formed by concatenating words and capitalizing every word except the first (e.g., freqProdRel and tfidfProduct).
11. LemmaGen is an open source lemmatizer with 15 prebuilt European lexicons. Its source code and documentation are publicly available at http://lemmatise.ijs.si/.
12. http://www.nlm.nih.gov/mesh/filelist.html.
13. This workflow is publicly available at http://textflows.org/workflow/497/.
14. Note that Swanson did not state that this was an exclusive list, hence there may exist other important bridging terms which he did not list.
15. If a heuristic is perfect (it detects all the B-terms and ranks them at the top of the ordered list), we get a curve that first goes straight up and then straight right, with an AUROC of 100%. The worst possible heuristic sorts all the terms randomly, regardless of whether they are B-terms or not, and achieves an AUROC of 50%. This random heuristic is represented by the diagonal in the ROC space.
16. In such cases, the AUROC calculation can either maximize the AUROC by sorting all the B-terms in front of all the other terms inside equal-scoring sets or minimize it by putting the B-terms at the back. The AUROC calculation can also produce many AUROC values between these two extremes by using different (e.g., random) orderings of equal-scoring sets. Heuristics with a smaller interval are preferable, since a smaller interval implies that they produce smaller and fewer equal-scoring sets.
17. In contrast to the results reported in [4, 5], the AUROC scores presented in this chapter take into account only the terms which appear in both domains. This results in lower AUROC scores, which are thus not directly comparable between the studies. The reason for this approach lies in the definition of a bridging term, which requires the term to appear in both domains, as it cannot form a connection otherwise.
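
The AUROC interval described in notes 15 and 16 can be illustrated with a minimal sketch. It is not the TextFlows implementation: the function name auroc_interval and its arguments scores and is_bterm are hypothetical, and the sketch assumes the standard pairwise (Mann-Whitney) formulation of AUROC, in which the score is the fraction of (B-term, non-B-term) pairs where the B-term is ranked higher. Pairs with equal scores count fully toward the maximum and not at all toward the minimum.

```python
from itertools import groupby


def auroc_interval(scores, is_bterm):
    """Return (min_auroc, max_auroc) over all orderings of equal-scoring terms.

    scores   -- heuristic score per term (higher means ranked earlier)
    is_bterm -- True where the term is a known bridging term (B-term)
    Both names are illustrative assumptions, not part of TextFlows.
    """
    ranked = sorted(zip(scores, is_bterm), key=lambda pair: -pair[0])
    n_pos = sum(1 for _, flag in ranked if flag)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one B-term and one non-B-term")

    strictly_ahead = 0  # pairs where the B-term has a strictly higher score
    tied = 0            # pairs where B-term and non-B-term share a score
    neg_seen = 0        # non-B-terms in this and all higher-scoring sets

    # Walk through equal-scoring sets from the highest score to the lowest.
    for _, group in groupby(ranked, key=lambda pair: pair[0]):
        flags = [flag for _, flag in group]
        pos_here = sum(flags)
        neg_here = len(flags) - pos_here
        neg_seen += neg_here
        strictly_ahead += pos_here * (n_neg - neg_seen)
        tied += pos_here * neg_here

    total_pairs = n_pos * n_neg
    return strictly_ahead / total_pairs, (strictly_ahead + tied) / total_pairs
```

For a perfect heuristic with no ties both bounds equal 1.0, while a heuristic that assigns the same score to every term yields the widest possible interval, from 0.0 to 1.0, whose midpoint matches the 50% of the random heuristic.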

References

1. Koestler, A.: The Act of Creation, vol. 13 (1964)
2. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I., et al.: Fast discovery of association rules. Adv. Knowl. Discov. Data Min. 12(1), 307–328 (1996)
3. Dubitzky, W., Kötter, T., Schmidt, O., Berthold, M.R.: Towards creative information exploration based on Koestler’s concept of bisociation. In: Berthold, M.R. (ed.) Bisociative Knowledge Discovery. LNCS (LNAI), vol. 7250, pp. 11–32. Springer, Heidelberg (2012). doi:10.1007/978-3-642-31830-6_2
4. Juršič, M., Cestnik, B., Urbančič, T., Lavrač, N.: Bisociative literature mining by ensemble heuristics. In: Berthold, M.R. (ed.) Bisociative Knowledge Discovery. LNCS (LNAI), vol. 7250, pp. 338–358. Springer, Heidelberg (2012). doi:10.1007/978-3-642-31830-6_24
5. Juršič, M., Cestnik, B., Urbančič, T., Lavrač, N.: Cross-domain literature mining: finding bridging concepts with CrossBee. In: Proceedings of the 3rd International Conference on Computational Creativity, pp. 33–40 (2012)
6. Berthold, M.R. (ed.): Bisociative Knowledge Discovery. LNCS (LNAI), vol. 7250. Springer, Heidelberg (2012)
7. Swanson, D.R.: Medical literature as a potential source of new knowledge. Bull. Med. Libr. Assoc. 78(1), 29 (1990)
8. Smalheiser, N., Swanson, D., et al.: Using ARROWSMITH: a computer-assisted approach to formulating and assessing scientific hypotheses. Comput. Methods Programs Biomed. 57(3), 149–154 (1998)
9. Hristovski, D., Peterlin, B., Mitchell, J.A., Humphrey, S.M.: Using literature-based discovery to identify disease candidate genes. Int. J. Med. Inf. 74(2), 289–298 (2005)
10. Yetisgen-Yildiz, M., Pratt, W.: Using statistical and knowledge-based approaches for literature-based discovery. J. Biomed. Inform. 39(6), 600–611 (2006)
11. Holzinger, A., Yildirim, P., Geier, M., Simonic, K.M.: Quality-based knowledge discovery from medical text on the web. In: Pasi, G., Bordogna, G., Jain, L.C. (eds.) Quality Issues in the Management of Web Information. ISRL, vol. 50, pp. 145–158. Springer, Heidelberg (2013)
12. Kastrin, A., Rindflesch, T.C., Hristovski, D.: Link prediction on the semantic MEDLINE network. In: Džeroski, S., Panov, P., Kocev, D., Todorovski, L. (eds.) DS 2014. LNCS (LNAI), vol. 8777, pp. 135–143. Springer, Heidelberg (2014). doi:10.1007/978-3-319-11812-3_12
13. Swanson, D.R.: Migraine and magnesium: eleven neglected connections. Perspect. Biol. Med. 31(4), 526–557 (1988)
14. Lindsay, R.K., Gordon, M.D.: Literature-based discovery by lexical statistics. J. Am. Soc. Inform. Sci. Technol. 1, 574–587 (1999)
15. Srinivasan, P.: Text mining: generating hypotheses from MEDLINE. J. Am. Soc. Inform. Sci. Technol. 55(5), 396–413 (2004)
16. Weeber, M., Klein, H., de Jong-van den Berg, L.T.W.: Using concepts in literature-based discovery: simulating Swanson’s Raynaud-fish oil and migraine-magnesium discoveries. J. Am. Soc. Inform. Sci. Technol. 52(7), 548–557 (2001)
17. Petrič, I., Cestnik, B., Lavrač, N., Urbančič, T.: Outlier detection in cross-context link discovery for creative literature mining. Comput. J. 55(1), 47–61 (2012)
18. Sluban, B., Juršič, M., Cestnik, B., Lavrač, N.: Exploring the power of outliers for cross-domain literature mining. In: Berthold, M.R. (ed.) Bisociative Knowledge Discovery. LNCS (LNAI), vol. 7250, pp. 325–337. Springer, Heidelberg (2012). doi:10.1007/978-3-642-31830-6_23
19. Urbančič, T., Petrič, I., Cestnik, B., Macedoni-Lukšič, M.: Literature mining: towards better understanding of autism. In: Bellazzi, R., Abu-Hanna, A., Hunter, J. (eds.) AIME 2007. LNCS (LNAI), vol. 4594, pp. 217–226. Springer, Heidelberg (2007). doi:10.1007/978-3-540-73599-1_29
20. Aggarwal, C.: Outlier Analysis. Springer, Heidelberg (2013)
21. Kranjc, J., Podpečan, V., Lavrač, N.: ClowdFlows: a cloud based scientific workflow platform. In: Flach, P.A., Bie, T., Cristianini, N. (eds.) ECML PKDD 2012. LNCS (LNAI), vol. 7524, pp. 816–819. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33486-3_54
22. Grčar, M.: Mining text-enriched heterogeneous information networks. Ph.D. thesis, Jožef Stefan International Postgraduate School (2015) (To appear)
23. Bird, S.: NLTK: the natural language toolkit. In: Proceedings of the COLING/ACL on Interactive Presentation Sessions, pp. 69–72. Association for Computational Linguistics (2006)
24. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
25. Feldman, R., Sanger, J.: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, New York (2007)
26. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manage. 24(5), 513–523 (1988)
27. Dietterich, T.G.: Ensemble methods in machine learning. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000). doi:10.1007/3-540-45014-9_1
28. Rokach, L.: Pattern classification using ensemble methods. World Scientific (2009)
29. Hoi, S.C., Jin, R.: Semi-supervised ensemble ranking. In: AAAI, pp. 634–639 (2008)
30. Juršič, M.: Text mining for cross-domain knowledge discovery. Ph.D. thesis, Jožef Stefan International Postgraduate School (2015)
31. Juršič, M., Mozetič, I., Erjavec, T., Lavrač, N.: LemmaGen: multilingual lemmatisation with induced ripple-down rules. J. Univ. Comput. Sci. 16(9), 1190–1214 (2010)
32. Sluban, B., Gamberger, D., Lavrač, N.: Ensemble-based noise detection: noise ranking and visual performance evaluation. Data Mining Knowl. Discov. 28, 265–303 (2013)
33. Petrič, I., Urbančič, T., Cestnik, B., Macedoni-Lukšič, M.: Literature mining method RaJoLink for uncovering relations between biomedical concepts. J. Biomed. Inform. 42(2), 219–227 (2009)
34. Provost, F.J., Fawcett, T., Kohavi, R.: The case against accuracy estimation for comparing induction algorithms. In: ICML, vol. 98, pp. 445–453 (1998)
35. Holzinger, A.: Human-computer interaction and knowledge discovery (HCI-KDD): what is the benefit of bringing those two fields to work together? In: Cuzzocrea, A., Kittl, C., Simos, D.E., Weippl, E., Xu, L. (eds.) CD-ARES 2013. LNCS, vol. 8127, pp. 319–328. Springer, Heidelberg (2013). doi:10.1007/978-3-642-40511-2_22
36. Holzinger, A.: Interactive machine learning for health informatics: when do we need the human-in-the-loop? Springer Brain Inform. (BRIN) 3, 1–13 (2016)

Acknowledgements

This work was supported by the Slovenian Research Agency and the FP7 European Commission FET projects MUSE (grant no. 296703) and ConCreTe (grant no. 611733).

Author information

Authors and Affiliations

Jožef Stefan Institute, Ljubljana, Slovenia
Matic Perovšek, Matjaž Juršič, Bojan Cestnik & Nada Lavrač
Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Matic Perovšek, Matjaž Juršič & Nada Lavrač
Temida d.o.o, Ljubljana, Slovenia
Bojan Cestnik
University of Nova Gorica, Nova Gorica, Slovenia
Nada Lavrač

Corresponding author

Correspondence to Matic Perovšek.

Editor information

Editors and Affiliations

Institute for Medical Informatics, Statistics and Documentation, Medical University Graz, Graz, Austria
Andreas Holzinger

About this chapter

Cite this chapter

Perovšek, M., Juršič, M., Cestnik, B., Lavrač, N. (2016). Empowering Bridging Term Discovery for Cross-Domain Literature Mining in the TextFlows Platform. In: Holzinger, A. (ed.) Machine Learning for Health Informatics. Lecture Notes in Computer Science, vol 9605. Springer, Cham. https://doi.org/10.1007/978-3-319-50478-0_4

DOI: https://doi.org/10.1007/978-3-319-50478-0_4
Published: 10 December 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50477-3
Online ISBN: 978-3-319-50478-0
eBook Packages: Computer Science, Computer Science (R0)
