Building Concise Text Corpora from Web Contents

Bartussek, Wolfram

doi:10.1007/978-3-662-55433-3_8

Abstract

This chapter reports on ongoing work in a research project for small and medium-sized enterprises (SMEs), funded by the German Federal Ministry of Education and Research (funding ID: 01IS15056D; project duration: Jan 2016 – Dec 2017). The project, named OntoPMS, targets post-market surveillance (PMS) of medical devices as required by the Medical Device Regulation (MDR; Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices, OJ L, pp 1–175, 2017), which entered into force following its formal publication in May 2017. Being a regulation, it is immediately legally binding in all member states of the European Union. The project aims to provide both technical support and supporting procedures to satisfy recital 4 of the MDR: “Key elements of the existing regulatory approach, such as the supervision of notified bodies, conformity assessment procedures, clinical investigations and clinical evaluation, vigilance and market surveillance should be significantly reinforced, whilst provisions ensuring transparency and traceability regarding medical devices should be introduced, to improve health and safety.” This chapter focuses on one component of the software system under development, the corpus builder. This component retrieves scientific publications of interest from the web and other sources, checks them for relevance, and transfers them to a linguistic corpus and, in parallel, to a search engine based on the open-source package Elasticsearch. The challenge in this case was not to take everything one can get hold of (whole-web crawling) but to find and take only those publications that really belong to the domain of interest and are relevant with respect to surveillance aspects. The guiding principle was therefore to build comprehensive yet minimal corpora for the purposes at hand.
Although the software has been developed in the context of medical device PMS, its use is not bound in any way to this specific application area.
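
The retrieve, check, and transfer flow described in the abstract can be illustrated with a minimal Python sketch. It is not the OntoPMS implementation: the `Document` and `SketchCorpusBuilder` classes, the term-frequency relevance score, the `min_score` threshold, the example URLs, and the in-memory `search_index` dictionary standing in for a search engine such as Elasticsearch are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    url: str
    text: str


@dataclass
class SketchCorpusBuilder:
    """Hypothetical, simplified stand-in for a corpus builder."""

    # Assumed domain vocabulary; a real system would use a richer relevance check.
    domain_terms: frozenset
    # Assumed threshold: minimum share of tokens that must be domain terms.
    min_score: float = 0.1
    corpus: list = field(default_factory=list)
    # Plain dict standing in for a search engine index.
    search_index: dict = field(default_factory=dict)

    def relevance(self, doc: Document) -> float:
        """Return the fraction of tokens in the text that are domain terms."""
        tokens = re.findall(r"[a-z]+", doc.text.lower())
        if not tokens:
            return 0.0
        hits = sum(1 for token in tokens if token in self.domain_terms)
        return hits / len(tokens)

    def add(self, doc: Document) -> bool:
        """Store the document in corpus and index only if it is relevant."""
        if self.relevance(doc) < self.min_score:
            return False  # keep the corpus minimal: drop off-domain material
        self.corpus.append(doc)
        self.search_index[doc.url] = doc.text
        return True


# Usage with two made-up documents.
builder = SketchCorpusBuilder(domain_terms=frozenset({"stent", "implant", "adverse"}))
docs = [
    Document("https://example.org/stent-report",
             "Adverse events after stent implant surgery"),
    Document("https://example.org/football",
             "Football results from the weekend"),
]
accepted = [builder.add(doc) for doc in docs]
```

In this sketch, the relevance check runs before anything is stored, so off-domain documents never reach either the corpus or the index. That ordering is one simple way to keep a corpus "comprehensive yet minimal"; the scoring function itself is the part a real system would replace.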

Notes

1. Web Crawler: https://en.wikipedia.org/wiki/Web_crawler.
2. Uniform Resource Locator: https://en.wikipedia.org/wiki/URL.
3. Web Scraping: https://en.wikipedia.org/wiki/Web_scraping.
4. Inverted Index: https://en.wikipedia.org/wiki/Inverted_index.
5. Vertical Search: https://en.wikipedia.org/wiki/Vertical_search.
6. Markup Language: https://en.wikipedia.org/wiki/Markup_language.
7. Hypertext Markup Language: https://en.wikipedia.org/wiki/HTML.
8. Portable Document Format: https://en.wikipedia.org/wiki/Portable_Document_Format.
9. Standard Operating Procedure: https://en.wikipedia.org/wiki/Standard_operating_procedure.
10. Ontology: https://en.wikipedia.org/wiki/Ontology.
11. Ontology in Computer Science: https://en.wikipedia.org/wiki/Ontology_(information_science).
12. Part of Speech Tagging: https://en.wikipedia.org/wiki/Part-of-speech_tagging.
13. Query Language: https://en.wikipedia.org/wiki/Query_language.
14. Ontology: https://en.wikipedia.org/wiki/Ontology.
15. Ontology in Computer Science: https://en.wikipedia.org/wiki/Ontology_(information_science).
16. Apache Lucene Core: https://lucene.apache.org/core/.
17. Apache Solr: http://lucene.apache.org/solr/features.html.
18. gensim, Topic Modeling for Humans: https://radimrehurek.com/gensim/.
19. Domain Names: https://en.wikipedia.org/wiki/Domain_name.
20. Apache Nutch Web Crawler: http://nutch.apache.org/.
21. NLTK Natural Language Toolkit: http://www.nltk.org/.
22. Industrial-Strength Natural Language Processing: https://spacy.io/.
23. Angular cross platform web framework: https://angular.io/.
24. MAUDE – Manufacturer and User Facility Device Experience: https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfmaude/search.cfm.

References

Herre H (2010) General formal ontology (GFO): a foundational ontology for conceptual modelling. In: Poli R, Healy M, Kameas A (eds) Theory and applications of ontology: computer applications. Springer, Dordrecht, pp 297–345
Uciteli A, Goller C, Burek P, Siemoleit S, Faria B, Galanzina H, Weiland T, Drechsler-Hake D, Bartussek W, Herre H (2014) Search ontology, a new approach towards semantic search. In: Plödereder E, Grunske L, Schneider E, Ull D (eds) FoRESEE: Future Search Engines 2014, 44th annual meeting of the GI, Stuttgart. GI edition proceedings LNI. Köllen, Bonn, pp 667–672
Medical Device Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices, OJ. L (2017) pp 1–175

Acknowledgements

Acknowledgements go to all participants of the OntoPMS consortium. With respect to ontologies, accompanying workflows, and available technologies, I would like to thank Prof. Heinrich Herre, Alexandr Uciteli, and Stephan Kropf from the IMISE at the University of Leipzig for many inspiring conversations. I wouldn’t have had much chance to understand medical regulations in Europe without the help of the novineon personnel Timo Weiland (consortium project lead), Prof. Marc O. Schurr, Stefanie Meese, and Klaus Gräf, and the quality manager from Ovesco, Matthias Leenen. The participants from the BfArM, the German Federal Institute for Drugs and Medical Devices, Prof. Wolfgang Lauer and Robin Seidel, helped me understand the MAUDE database (see note 24) and how to connect it to the CorpusBuilder. IntraFind (Christoph Goller and Philipp Blohm) developed an ingenious enhancement to the search engine exploiting the corpus, and MT2IT (Prof. Jörg-Uwe Meyer, Michael Witte) will provide the structures of the overall system in which the CorpusBuilder will be embedded. I would also like to thank my colleagues at OntoPort, Anatol Reibold and Günter Lutz-Misof, for their astute remarks on earlier versions of this chapter.

Author information

Authors and Affiliations

OntoPort UG, Sulzbach, Germany
Wolfram Bartussek

Authors

Wolfram Bartussek

Corresponding author

Correspondence to Wolfram Bartussek.

Editor information

Editors and Affiliations

Datenlabor Berlin, Berlin, Germany
Thomas Hoppe
Fachbereich Informatik, Hochschule Darmstadt, Darmstadt, Germany
Bernhard Humm
OntoPort UG, Sulzbach, Germany
Anatol Reibold

About this chapter

Cite this chapter

Bartussek, W. (2018). Building Concise Text Corpora from Web Contents. In: Hoppe, T., Humm, B., Reibold, A. (eds) Semantic Applications. Springer Vieweg, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-55433-3_8

DOI: https://doi.org/10.1007/978-3-662-55433-3_8
Published: 14 April 2018
Publisher Name: Springer Vieweg, Berlin, Heidelberg
Print ISBN: 978-3-662-55432-6
Online ISBN: 978-3-662-55433-3
eBook Packages: Computer Science, Computer Science (R0)
