The COVID-19 epidemiology and monitoring ontology

The novel COVID-19 infectious disease emerged and spread, causing high mortality and morbidity rates worldwide. The OBO Foundry hosts more than one hundred ontologies for sharing and analysing large-scale datasets in the biological and biomedical sciences. However, this pandemic revealed that we lack tools for an efficient and timely exchange of the epidemiological data needed to assess the impact of disease outbreaks and the efficacy of mitigating interventions, and to provide a rapid response. In this study we present our findings and contributions for the bio-ontologies community.

One year ago, the novel COVID-19 infectious disease emerged and spread, causing high mortality and morbidity rates worldwide. The OBO Foundry hosts more than one hundred ontologies for sharing and analysing large-scale datasets in the biological and biomedical sciences. However, this pandemic revealed that we lack tools for an efficient and timely exchange of the epidemiological data needed to assess the impact of disease outbreaks and the efficacy of mitigating interventions, and to provide a rapid response [1]. Recently, several new COVID-19 ontologies have been developed, such as the IDO extension [2] and CIDO [3]. Hence, our research question was whether epidemiological quantitative concepts are well represented in OBO ontologies. Our objectives were to identify missing COVID-19 epidemiological terms and to implement axiom patterns, either as extensions to existing ontologies or as a new, logically well-formed and accurate ontology in OBO. In this study we present our findings and contributions for the bio-ontologies community.

II. Method
This work was conceived and mainly developed during open community hackathons (Virtual-biohackathon covid-19-bh20, BioHackathon-EU-2020 and SWAT4HCLS 2021). Our approach consisted of three steps: first, extracting a list of relevant epidemiological terms through manual curation of recent COVID-19 epidemiological studies published in peer-reviewed journals, medRxiv and public health surveillance websites, and mapping them to existing OBO ontologies, with curation focused on quantitative data and indicators; second, developing a minimal ontological representation of COVID-19 epidemiological quantitative information; and third, refining and evaluating the model with domain expert input.
Our formal modeling followed a rationale already used in other studies: 1) determining the domain and scope of the ontology; 2) reusing ontologies and addressing the poor ontological coverage of COVID-19 epidemiology; and 3) developing a conceptual model [4,5]. We extracted core domain knowledge concepts from [6,7,8]. We re-used ontological terms and models as much as possible, using ontology search engines. To build an interoperable biomedical ontology, we decided to build an OBO ontology and to use OWL 2, a description logic (DL)-based formalism and semantic web standard for knowledge representation, to enable data sharing and formal reasoning. We applied knowledge-engineering best practices, following the OBO principles and modularization guidelines [9], to achieve a logically well-formed model. Finally, we based our decisions on building a FAIR resource for health data and research, following recent recommendations published by international data standard organizations [10,11]. More information on the method, the list of sources used for curation, the extracted terms, and the developed OWL ontology are open and publicly available on GitHub for reproducibility and community re-use.
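The curation-and-mapping step described in this section can be illustrated with a minimal Python sketch. It is not the procedure used in this work, which relied on manual curation and ontology search engines; it only shows the idea of matching curated terms against ontology labels by normalized exact match. The function names, the term list and the placeholder identifiers are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Optional


def normalize(label: str) -> str:
    """Lower-case a label and treat hyphens and repeated spaces as one space."""
    return " ".join(label.lower().replace("-", " ").split())


def map_terms(terms: List[str],
              ontology_labels: Dict[str, Dict[str, str]]) -> Dict[str, Optional[str]]:
    """Map each curated term to the first ontology term with the same
    normalized label, or to None when no ontology covers it."""
    index: Dict[str, str] = {}
    for labels in ontology_labels.values():
        for term_id, label in labels.items():
            # setdefault keeps the first identifier seen for a given label
            index.setdefault(normalize(label), term_id)
    return {term: index.get(normalize(term)) for term in terms}


# Hypothetical toy data: identifiers are placeholders, not real ontology IDs.
ONTOLOGY_LABELS = {
    "IDO": {"IDO:placeholder-1": "incidence"},
    "STATO": {"STATO:placeholder-1": "case-fatality rate"},
}
CURATED_TERMS = ["Case fatality rate", "incidence", "Basic reproduction number"]

if __name__ == "__main__":
    for term, term_id in map_terms(CURATED_TERMS, ONTOLOGY_LABELS).items():
        print(f"{term}: {term_id if term_id else 'not covered'}")
```

Terms that map to None in such a sketch correspond to candidates for new ontology terms; in practice, synonyms and near matches would still need manual review.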

III. Results
We provide a formal ontological model for COVID-19 epidemiology and monitoring (graphical and OWL representations are available on our GitHub). With the rise of new variants of the virus that may challenge vaccine efficacy, there is an urgent need for a compatible logical model for quantities that enables researchers to represent and share machine-readable patient monitoring and epidemiological surveillance data for rapid analysis, modeling and response. In this work we re-used the SIO design pattern for measurements, a model already applied to patient health data for rare diseases in the EJP RD, to clinical research data at the LUMC [12], and in the measurements schema of the new GA4GH Phenopackets release [13]. The taxonomic structure is extended from IDO, a core ontology for infectious diseases. For domain concepts we re-used GFO [14] to formalize timeline concepts using the 'chronoid', together with the GFO-based 'mortality' model approach [15]. Because linking patient and population data is an RDA COVID-19 recommendation on data sharing, we checked common data models such as OMOP and re-used the relationship used in Phenopackets, which is based on composition semantics.
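To make the idea of a measurement design pattern more concrete, the following Python sketch builds the kind of subject-predicate-object statements such a pattern produces for one epidemiological quantity. It is an illustration under stated assumptions, not the ontology's actual axioms: the predicate names are human-readable labels written in the style of SIO, the `ex:` identifiers and the node-naming scheme are hypothetical, and the exact IRIs and pattern shape should be checked against SIO and the OWL file in the project repository.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def measurement_triples(entity: str, attribute_type: str,
                        value: str, unit: str) -> List[Triple]:
    """Build triples linking an entity (a patient or a population) to a
    measured attribute and to the measurement output carrying value and unit."""
    # Hypothetical node identifiers derived from the entity identifier.
    attribute = f"{entity}/attribute"
    process = f"{entity}/measurement-process"
    output = f"{entity}/measurement-output"
    return [
        (entity, "sio:has-attribute", attribute),   # what is being measured
        (attribute, "rdf:type", attribute_type),    # kind of quantity
        (process, "sio:has-output", output),        # measuring yields an output
        (output, "sio:refers-to", attribute),       # output is about the attribute
        (output, "sio:has-value", value),           # literal value
        (output, "sio:has-unit", unit),             # unit of measurement
    ]


if __name__ == "__main__":
    # Hypothetical example: an incidence rate observed for a population.
    for triple in measurement_triples("ex:population-1", "ex:incidence-rate",
                                      "12.5", "ex:cases-per-100000"):
        print(" ".join(triple))
```

Because the same shape applies whether the entity is an individual or a population, one query pattern can in principle retrieve both clinical and surveillance measurements, which is the motivation for reusing a single measurement model.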
We filled the gap for epidemiological surveillance terms in OBO by adding 100 new terms. From an initial set of 138 manually extracted terms, only 38 are covered by bio-ontologies: 21% (30 terms) by IDO [16] and 24% (33 terms) by STATO [17] (although including fallbacks this percentage could increase to 50%), and the rest by epidemiology-related ontologies such as APOLLO_SV [18] and GENEPIO [19]. We noticed that EPO [20] has not been maintained since its publication and has been deprecated from the OBO Foundry, and that IDO is working towards epidemiological enrichment [21]. While interoperability within the OBO landscape is fostered by adopting the BFO backbone structure, the link with GFO can lead to incompatible temporal regions due to logical inconsistency [22]. Another issue that could be improved is the current absence of axioms and definition patterns that relate epidemiology (i.e., observations of a population) to clinical ontologies (i.e., observations on an individual) and allow reasoning for discovery. The re-use of the EQ model [23] or the adaptation of the REA model [24] will be evaluated. In the future, we will evaluate our ontology with domain experts and logical competency questions [25]. Moreover, we expect to use this model in FAIR-based projects such as TWOC [26] to publish epidemiological claims as nanopublications for trust [27]. We aim at FAIR reasoning and analytics of person-level real-world observations over epidemiological surveillance information [28]. We therefore checked common data models such as Phenopackets and the OHDSI standards, to enable the development of applications that discover patterns with ontology-guided machine learning algorithms and that support translational research.
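The coverage figures reported above can be recomputed with a short Python sketch. The counts are taken as reported in the text; the per-ontology counts are not assumed to be disjoint, and the helper name `share` is hypothetical.

```python
def share(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


# Counts as reported in the text.
TOTAL_TERMS = 138    # manually extracted terms
COVERED_TERMS = 38   # terms already covered by bio-ontologies
IDO_TERMS = 30       # terms reported for IDO
STATO_TERMS = 33     # terms reported for STATO

# Terms with no existing coverage, matching the 100 new terms added.
new_terms = TOTAL_TERMS - COVERED_TERMS

if __name__ == "__main__":
    print(f"new terms: {new_terms}")
    print(f"IDO share of extracted terms: {share(IDO_TERMS, TOTAL_TERMS):.1f}%")
    print(f"STATO share of extracted terms: {share(STATO_TERMS, TOTAL_TERMS):.1f}%")
    print(f"covered share of extracted terms: {share(COVERED_TERMS, TOTAL_TERMS):.1f}%")
```

All percentages here are relative to the 138 extracted terms, which is the denominator that reproduces the reported IDO and STATO shares to within rounding.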

IV. Conclusion
In the context of an infectious disease outbreak, it is imperative to make epidemiological data as FAIR as possible to facilitate rapid analysis and to support timely, evidence-based decision making and trust. To enable the community to provide machine-readable epidemiological quantitative data and to make these data easier to share, we contributed an ontological representation built on ontology engineering best practices such as reuse and formalization through upper-level ontologies (i.e., GFO, SIO).

V. Acknowledgements
The authors would like to specially thank Dr. Birgit Meldal for her input and ideas. This initiative has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement N°825575 (the European Joint Programme Rare Diseases), and the Trusted World of Corona (TWOC; LSH