Due to the continuously increasing demands for higher spatial and temporal resolution, data volumes from new generations of space instrumentation (e.g., Meteosat Third Generation and TROPOMI) and numerical models (e.g., Harmony and EC-Earth) grow faster than available storage, network and computing capacities.
To provide scientists with tools to handle these data, new technologies for computationally intensive science (e-Science) are under development. The concepts of distributed computing (clouds, Grid) and of storing datasets in (virtual) data centres require that both the ‘findability’ of data and the data exchange mechanisms be optimized and harmonized. Findability is addressed by various initiatives to harmonize metadata, while exchange mechanisms are optimized and harmonized by defining requirements that promote the interoperability of web services. These developments are endorsed by governmental organizations (e.g. the WMO and the European Union through the INSPIRE framework) as well as by non-governmental, standardizing organizations such as the Open Grid Forum (OGF) for Grid computing and the Open Geospatial Consortium (OGC) for geospatial services. These harmonization efforts will foster interdisciplinary research: for example, it will become possible to couple climate model results (almost) directly to climate effect models. As datasets become more widely accessible, insight into the quality of the data becomes increasingly important. Therefore, tools are being developed to annotate datasets with standardized metadata and other annotations.
Over the past years KNMI has been involved in both national and international consortia to help develop the above-mentioned new technologies. This has resulted in various web services supporting the climate and seismology communities, such as the web services from the ADAGUC project, which provide access to satellite data ranging from monthly averaged ozone to soil moisture maps. New data standards have also been developed that will affect the way KNMI data are disseminated in the near future. This highlight describes some of these developments and provides an outlook to the future.
Metadata, annotations and formats
Metadata provide a description of data. When exchanging data, it is important that the data requestor is able to quickly find the data of interest. The best way to make data searchable is to provide metadata and annotations. Supplying these to registers and search engines makes the data ‘findable’ and, at the same time, enhances the usability and quality of a dataset. Various metadata standards and models, of varying extent, are in use. Below we discuss two important metadata models.
The ISO 19115 standard for metadata (3) defines a set of attributes, which can be mandatory, conditional or optional. Only a small set is mandatory, and communities like the WMO (4) can expand on this core set. ISO 19115 metadata will be the backbone of emerging data infrastructures such as the WMO Information System, INSPIRE, the EUROCONTROL Weather Information Exchange Model and the National Geo-Register. In this model, data are described at the ‘dataset’ level, where a dataset is defined as ‘an identifiable collection of data’. WMO and OGC have also embraced the Observations and Measurements data model, which has since become the ISO 19156 standard.
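The division into a small mandatory core plus optional extensions can be illustrated with a minimal sketch. The mandatory elements listed below follow the ISO 19115 core set as commonly summarized; community profiles (such as the WMO profile) add further mandatory elements, and the record values shown are purely illustrative.

```python
# Minimal sketch of an ISO 19115-style metadata record check.
# The mandatory elements below reflect the ISO 19115 core set; community
# profiles (e.g. the WMO profile) extend this set with their own elements.

MANDATORY_CORE = {
    "title",           # dataset title
    "date",            # dataset reference date
    "language",        # dataset language
    "topic_category",  # dataset topic category
    "abstract",        # abstract describing the dataset
    "contact",         # metadata point of contact
    "date_stamp",      # metadata date stamp
}

def missing_core_elements(record: dict) -> set:
    """Return the mandatory core elements absent from a metadata record."""
    return MANDATORY_CORE - record.keys()

# An illustrative, complete record (values are placeholders, not real data).
record = {
    "title": "Monthly averaged total ozone",
    "date": "2009-01-01",
    "language": "eng",
    "topic_category": "climatologyMeteorologyAtmosphere",
    "abstract": "Monthly mean total ozone columns from satellite data.",
    "contact": "datacentre@example.org",
    "date_stamp": "2009-06-15",
}
assert not missing_core_elements(record)  # all mandatory elements present
```

A community profile would simply extend `MANDATORY_CORE` with its own required elements, which is exactly how ISO 19115 is designed to be expanded.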
Another important metadata standard is the Climate and Forecast (CF) convention (5), primarily used in Network Common Data Form (netCDF) files (6). This community standard is widely accepted within climate modelling.
It is important to note that the above-mentioned metadata standards do not compete: all serve the same goal at different levels. Where ISO 19115 is more generic and covers descriptive metadata such as points of contact, coverage and time information, CF covers the domain-specific parameters within the dataset itself.
Both metadata models are combined in the ADAGUC data format specification (7) (Figure 1): the ISO metadata describes the dataset as a whole, while the CF metadata fields describe the content of the individual data products. The format specification is in use and under revision control by KNMI, VU University Amsterdam and SRON Netherlands Institute for Space Research.
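The two metadata levels can be sketched as plain dictionaries. In an actual ADAGUC netCDF file these would be netCDF global attributes (ISO level) and variable attributes (CF level), written with e.g. the netCDF4 library; the values and the choice of variable below are illustrative, not taken from an actual ADAGUC product.

```python
# Sketch of the two metadata levels in an ADAGUC-style netCDF-CF product.
# In a real file these dictionaries would be netCDF global attributes
# (ISO level) and variable attributes (CF level); all values are
# illustrative placeholders.

iso_dataset_metadata = {          # dataset level: descriptive ISO metadata
    "title": "Monthly averaged total ozone",
    "abstract": "Monthly mean ozone columns derived from satellite data.",
    "contact": "datacentre@example.org",
}

cf_variable_attributes = {        # product level: CF attributes per variable
    "standard_name": "atmosphere_mole_content_of_ozone",  # CF standard name
    "units": "mol m-2",           # units consistent with the standard name
    "long_name": "total ozone column",
    "_FillValue": -9999.0,        # value marking missing data points
}

def has_cf_essentials(attrs: dict) -> bool:
    """CF expects each data variable to carry at least units and a name."""
    return "units" in attrs and (
        "standard_name" in attrs or "long_name" in attrs
    )

assert has_cf_essentials(cf_variable_attributes)
```

The point of the split is visible even in this sketch: a search engine only needs the ISO-level dictionary to find the dataset, while a plotting or analysis tool only needs the CF-level attributes to interpret the numbers.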
Portals, services and networks at KNMI
KNMI operates many successful data access facilities, e.g. ECA&D, PROMOTE, Climate Explorer, KODAC, FTPPRO, CESAR, ADAGUC, NERIES, LVNL, and the KNMI website itself. Here we focus on two recently developed portals: the NERIES (Network of Research Infrastructures for European Seismology) portal for seismological data and the ADAGUC (Atmospheric Data Access for the Geospatial User Community) portal for atmospheric data. Both are built using state-of-the-art services and both provide and use metadata, but they rely on different technology standards.
The NERIES portal brings together diverse distributed European seismological data to provide a single access facility from which researchers can search for and download selected data and data products (Figure 2). It also provides a framework through which new tools and processing can be included and accessed, including linkages to external processing resources and e-Science infrastructures. The NERIES portal aggregates individual standardized portlets (8). Portlets are specific targeted web applications that conform to a standard application programming interface, allowing them to be aggregated within a portlet container (a runtime environment for portlets). The NERIES portlets are deployed both locally at the portal and remotely at participating data centres. These portlet applications access data archives through local and remote web services, which in turn provide the integration points for access by other science infrastructures, such as GEMS and GEOSS.
The ADAGUC portal (Figure 3) brings together data from different OGC-compliant services, running both locally at KNMI and remotely at the VU. The OGC services used are the Web Map Service (WMS) for map display of the data and the Web Coverage Service (WCS) and Web Feature Service (WFS) interfaces for data download. All are accessible through the standard HTTP protocol, so others can integrate these web services into their own applications, as is done by e.g. the GIS software company ESRI (Figure 4) and RIVM. The ADAGUC portal is based on ADAGUC-formatted netCDF4 datasets; from these files, the services can deliver the data in a large number of other formats, using the Geospatial Data Abstraction Library for conversion. Implementing innovative new technologies yields valuable experience with the advantages and disadvantages of specific technologies, which will be used for e.g. building the next generation of the KNMI Data Centre and Satellite Data Centre.
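Because the OGC services run over plain HTTP, a client integrates them simply by constructing request URLs. The sketch below assembles a WMS GetMap request; the parameter names follow the WMS 1.1.1 specification, but the endpoint URL and layer name are hypothetical placeholders, not actual ADAGUC identifiers.

```python
# Sketch of how a client assembles an OGC Web Map Service GetMap request.
# Parameter names follow the WMS 1.1.1 specification; the endpoint and
# layer name below are hypothetical placeholders.
from urllib.parse import urlencode

def wms_getmap_url(endpoint: str, layer: str, bbox: tuple,
                   width: int = 800, height: int = 600) -> str:
    """Build a WMS 1.1.1 GetMap URL for a lat/lon bounding box."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",                            # default style
        "SRS": "EPSG:4326",                      # plain lat/lon coordinates
        "BBOX": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "WIDTH": str(width),
        "HEIGHT": str(height),
        "FORMAT": "image/png",
    }
    return endpoint + "?" + urlencode(params)

# Hypothetical request for a global map of a monthly ozone layer.
url = wms_getmap_url("https://example.org/adaguc/wms", "ozone_monthly",
                     (-180, -90, 180, 90))
```

This is also why third parties such as ESRI can integrate the services so easily: any HTTP client that can build such a URL can retrieve the map, with no ADAGUC-specific software involved.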
Distributed Computing infrastructures
Distributed Computing infrastructures (DCI) are composed of services for sharing computing power and data storage capacity over the Internet, e.g. Grid, Cloud and High Performance Computing services. Although the Earth Sciences (ES) have been active in Grid computing since the year 2000, this has not resulted in community-wide use of DCI platforms. The main reasons for this are the time investment needed to learn to use DCI platforms and the technical complexity of these platforms (9). Moreover, in an independent development, ES communities have spent recent decades building domain-specific data and processing infrastructures.
These developments have two important drawbacks: the limited use of DCI platforms inhibits optimal use of computing resources, while domain-specific infrastructure development impedes interdisciplinary scientific collaboration (Figure 5a). These issues were investigated in the FP6 project DEGREE, which proposed an ES-Grid roadmap (10,11) to improve Grid usage in the ES community. One of the major milestones in this roadmap is the establishment of an “Earth Science Grid platform”, envisioning a collaborative environment where researchers can easily benefit from both DCI platforms and ES data. KNMI has participated in many DCI projects and experiments (DataGrid, DEGREE, SciaGrid).
With this experience, KNMI has started the Earth Science Gateway (ES-G) initiative, which aims to develop a modular component framework for building ES community-specific science gateways (Figure 5b). This framework will have components for integrating ES-specific services (e.g. OGC), security, service-level agreements (SLAs), and services for providing access to Grid functionality. These components will provide the crucial missing link that allows the ES community to use DCI platforms while retaining access to existing community-based resources. Where possible, ES-G will reuse existing components. Close cooperation between all involved communities (technology providers such as DCI and user communities such as Earth Science) is essential and is foreseen to be one of the main activities in this initiative. To open the development process to contributions from others, an open-source software development approach will be taken.
It is foreseen that this initiative can provide components for the ambitious National Model and Data Centre (NMDC), in which KNMI plays an important role. The RapidSeis project (partners: NERIES, the University of Liverpool and the National e-Science Centre in the UK) has proved the feasibility of integrating computational applications within an existing data portal. RapidSeis has prototyped a science gateway, via a web portal, that allows seismologists to pick up data from Orfeus (the central repository for earthquake data in Europe, hosted at KNMI) and then run several analyses on these data on DCI infrastructures. RapidSeis was funded by the UK Joint Information Systems Committee (JISC), an independent advisory body that works with further and higher education by providing strategic guidance, advice and opportunities to use ICT to support learning, teaching, research and administration.
Future role of data centres: curation of data
Data centres, like KNMI, are sources of information for scientists and have the important task of preserving and curating data for future research. To support researchers, data centres nowadays provide basic search, browse and download services for datasets. However, researchers often need more information and functionality than is generally provided. The increasing use of datasets in large (cross-domain) projects, and European Commission initiatives like INSPIRE, requires data centres to interconnect and harmonize their services. Such an approach is adopted in the NERIES and ADAGUC projects. This will make datasets more easily accessible for scientific users but, at the same time, makes the amount of available data even more overwhelming.
Future data centres will therefore need to provide additional information, in the form of annotations, to assist scientists in selecting the optimal datasets for their work. By allowing scientists to annotate datasets, the relation between scientists and data archives develops from a one-way relation (download services) into a truly interactive one, in which scientists are offered functionality to upload information and contribute to the data archive. We aim to initiate developments towards this next generation of data archives, in which an active relation exists between the archive and its users. This interactive aspect is one of the main requirements for the next-generation KNMI Data Centre. The first steps involve developing standards for data annotation and upload services, and demonstrating the advantages of this new way of working to end-users.
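To make the idea of an annotation upload concrete, the sketch below shows one possible shape of an annotation record as a data archive might accept it from a user. The field names and the dataset identifier are our own illustration, assumed for the example; no existing annotation standard is implied.

```python
# Sketch of a dataset annotation record such as a next-generation archive
# might accept from its users. Field names and identifiers are illustrative
# assumptions, not an existing standard.
import json
from dataclasses import dataclass, asdict

@dataclass
class Annotation:
    dataset_id: str   # identifier of the annotated dataset (hypothetical)
    author: str       # who contributed the annotation
    created: str      # ISO 8601 timestamp of the annotation
    comment: str      # free-text remark on data quality or usage

note = Annotation(
    dataset_id="dataset-0001",
    author="scientist@example.org",
    created="2009-06-15T12:00:00Z",
    comment="Elevated noise in tropical pixels during 2004; use with care.",
)

# Serialize to JSON, as an upload service might receive it over HTTP.
payload = json.dumps(asdict(note))
assert json.loads(payload)["dataset_id"] == "dataset-0001"
```

Standardizing such a record (rather than the free-form comments of today) is what would let annotations travel with the dataset between interconnected archives.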