Building Concise Text Corpora from Web Contents

Semantic Applications

Abstract

This chapter reports on ongoing work in a research project for small and medium-sized enterprises (SMEs), funded by the German Federal Ministry of Education and Research (funding ID: 01IS15056D; project duration: Jan 2016 – Dec 2017). The project, named OntoPMS, targets post-market surveillance (PMS) of medical devices as required by the Medical Device Regulation (MDR; Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices, OJ L, pp 1–175, 2017), which entered into force following its formal publication in May 2017. Being a regulation, it is immediately legally binding in all member states of the European Union. The project aims to provide both technical support and supporting procedures to satisfy recital 4 of the MDR: “Key elements of the existing regulatory approach, such as the supervision of notified bodies, conformity assessment procedures, clinical investigations and clinical evaluation, vigilance and market surveillance should be significantly reinforced, whilst provisions ensuring transparency and traceability regarding medical devices should be introduced, to improve health and safety.” This chapter focuses on one component of the software system under development, the corpus builder. This component retrieves scientific publications of interest from the web and other sources, checks them for relevance, and transfers them to a linguistic corpus and, in parallel, to a search engine based on the open source package Elasticsearch. The challenge in this case was not to take everything one can get hold of (whole-web crawling) but to find and take only those publications that really belong to the domain of interest and are relevant to surveillance aspects. The guiding principle was therefore to build comprehensive yet minimal corpora for the purposes at hand. Although the software has been developed in the context of medical device PMS, its use is not bound in any way to this specific application area.
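
The retrieve, check, and transfer flow described in the abstract can be illustrated with a minimal sketch. This is not the chapter's implementation: the class names, the term-ratio relevance score, the threshold value, and the in-memory `index_queue` (a stand-in for a search-engine client such as one for Elasticsearch) are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    url: str
    text: str


@dataclass
class CorpusBuilder:
    # Hypothetical domain vocabulary; the real relevance check is not
    # specified in the abstract and is likely more elaborate.
    domain_terms: frozenset
    threshold: float = 0.05  # assumed cut-off, chosen arbitrarily
    corpus: list = field(default_factory=list)
    index_queue: list = field(default_factory=list)  # stand-in for a search index

    def relevance(self, doc: Document) -> float:
        """Return the fraction of tokens found in the domain vocabulary."""
        tokens = re.findall(r"[a-z]+", doc.text.lower())
        if not tokens:
            return 0.0
        hits = sum(1 for token in tokens if token in self.domain_terms)
        return hits / len(tokens)

    def add(self, doc: Document) -> bool:
        """Keep a document only if it is relevant, then feed both sinks."""
        if self.relevance(doc) < self.threshold:
            return False
        self.corpus.append(doc)
        self.index_queue.append({"url": doc.url, "body": doc.text})
        return True


builder = CorpusBuilder(domain_terms=frozenset({"stent", "implant", "recall", "adverse"}))
kept = builder.add(Document("https://example.org/a", "Adverse events after stent implant recall"))
dropped = builder.add(Document("https://example.org/b", "Football results from the weekend"))
```

The point of the sketch is the design choice named in the abstract: documents are filtered before they are stored, so the corpus and the search index stay small and receive exactly the same set of documents.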

Notes

  1. Web Crawler: https://en.wikipedia.org/wiki/Web_crawler.

  2. Uniform Resource Locator: https://en.wikipedia.org/wiki/URL.

  3. Web Scraping: https://en.wikipedia.org/wiki/Web_scraping.

  4. Inverted Index: https://en.wikipedia.org/wiki/Inverted_index.

  5. Vertical Search: https://en.wikipedia.org/wiki/Vertical_search.

  6. Markup Language: https://en.wikipedia.org/wiki/Markup_language.

  7. Hypertext Markup Language: https://en.wikipedia.org/wiki/HTML.

  8. Portable Document Format: https://en.wikipedia.org/wiki/Portable_Document_Format.

  9. Standard Operating Procedure: https://en.wikipedia.org/wiki/Standard_operating_procedure.

  10. Ontology: https://en.wikipedia.org/wiki/Ontology.

  11. Ontology in Computer Science: https://en.wikipedia.org/wiki/Ontology_(information_science).

  12. Part of Speech Tagging: https://en.wikipedia.org/wiki/Part-of-speech_tagging.

  13. Query Language: https://en.wikipedia.org/wiki/Query_language.

  14. Ontology: https://en.wikipedia.org/wiki/Ontology.

  15. Ontology in Computer Science: https://en.wikipedia.org/wiki/Ontology_(information_science).

  16. Apache Lucene Core: https://lucene.apache.org/core/.

  17. Apache Solr: http://lucene.apache.org/solr/features.html.

  18. gensim, Topic Modeling for Humans: https://radimrehurek.com/gensim/.

  19. Domain Names: https://en.wikipedia.org/wiki/Domain_name.

  20. Apache Nutch Web Crawler: http://nutch.apache.org/.

  21. NLTK Natural Language Toolkit: http://www.nltk.org/.

  22. Industrial-Strength Natural Language Processing: https://spacy.io/.

  23. Angular cross platform web framework: https://angular.io/.

  24. MAUDE – Manufacturer and User Facility Device Experience: https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfmaude/search.cfm.

References

  1. Herre H (2010) General formal ontology (GFO): a foundational ontology for conceptual modelling. In: Poli R, Healy M, Kameas A (eds) Theory and applications of ontology: computer applications. Springer, Dordrecht, pp 297–345

  2. Uciteli A, Goller C, Burek P, Siemoleit S, Faria B, Galanzina H, Weiland T, Drechsler-Hake D, Bartussek W, Herre H (2014) Search ontology, a new approach towards semantic search. In: Plödereder E, Grunske L, Schneider E, Ull D (eds) FoRESEE: Future Search Engines 2014, 44th annual meeting of the GI, Stuttgart. GI edition proceedings LNI. Köllen, Bonn, pp 667–672

  3. Medical Device Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices, OJ. L (2017) pp 1–175

Acknowledgements

Acknowledgements go to all participants of the OntoPMS consortium. With respect to ontologies, accompanying workflows, and available technologies, I would like to thank Prof. Heinrich Herre, Alexandr Uciteli, and Stephan Kropf from the IMISE at the University of Leipzig for many inspiring conversations. I wouldn’t have had much chance to understand medical regulations in Europe without the help of the novineon personnel Timo Weiland (consortium project lead), Prof. Marc O. Schurr, Stefanie Meese, and Klaus Gräf, and the quality manager from Ovesco, Matthias Leenen. The participants from the BfArM, the German Federal Institute for Drugs and Medical Devices, Prof. Wolfgang Lauer and Robin Seidel, helped me understand the MAUDE database (see Note 24) and how to connect it to the CorpusBuilder. IntraFind (Christoph Goller and Philipp Blohm) developed an ingenious enhancement to the search engine exploiting the corpus, and MT2IT (Prof. Jörg-Uwe Meyer, Michael Witte) will provide the structures of the overall system in which the CorpusBuilder will be embedded. I would also like to thank my colleagues at OntoPort, Anatol Reibold and Günter Lutz-Misof, for their astute remarks on earlier versions of this chapter.

Author information

Correspondence to Wolfram Bartussek.

Copyright information

© 2018 Springer-Verlag GmbH Germany, part of Springer Nature

About this chapter

Cite this chapter

Bartussek, W. (2018). Building Concise Text Corpora from Web Contents. In: Hoppe, T., Humm, B., Reibold, A. (eds) Semantic Applications. Springer Vieweg, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-55433-3_8

  • DOI: https://doi.org/10.1007/978-3-662-55433-3_8

  • Publisher Name: Springer Vieweg, Berlin, Heidelberg

  • Print ISBN: 978-3-662-55432-6

  • Online ISBN: 978-3-662-55433-3

  • eBook Packages: Computer Science, Computer Science (R0)
