

(2018) Semantic applications, Dordrecht, Springer.

Building concise text corpora from web contents

Wolfram Bartussek

pp. 95-109

This is a report on ongoing work in a research project for small and medium-sized enterprises (SMEs), funded by the German Federal Ministry of Education and Research (Funding ID: 01IS15056D; project duration: Jan 2016 – Dec 2017). The project, named OntoPMS, targets post-market surveillance (PMS) of medical devices as required by the Medical Device Regulation (MDR; Regulation (EU) 2017/745 of the European Parliament and of the Council of 5 April 2017 on medical devices, OJ L, pp. 1–175, 2017), which entered into force following its formal publication in May 2017. Being a regulation, it is immediately legally binding in all member states of the European Union. The project aims to provide both technical support and supporting procedures to satisfy recital 4 of the MDR: "Key elements of the existing regulatory approach, such as the supervision of notified bodies, conformity assessment procedures, clinical investigations and clinical evaluation, vigilance and market surveillance should be significantly reinforced, whilst provisions ensuring transparency and traceability regarding medical devices should be introduced, to improve health and safety."
This chapter focuses on one component of the software system under development, the corpus builder. This component retrieves scientific publications of interest from the web and other sources, checks them for relevance, and transfers them to a linguistic corpus and, in parallel, to a search engine based on the open-source package Elasticsearch. The challenge here was not to collect everything one can get hold of (whole-web crawling) but to find and take only those publications that truly belong to the domain of interest and are relevant to surveillance. The guiding principle was therefore to build comprehensive yet minimal corpora for the purposes at hand.
Although the software has been developed in the context of medical device PMS, its use is not bound in any way to this specific application area.

Publication details

DOI: 10.1007/978-3-662-55433-3_8

Full citation:

Bartussek, W. (2018). Building concise text corpora from web contents, in T. Hoppe, B. Humm & A. Reibold (eds.), Semantic applications, Dordrecht, Springer, pp. 95-109.
