2. Data Management¶
This chapter describes all technical aspects related to data within the SOSPilot project. The “data” as mentioned here applied initially to LML RIVM data, i.e. Dutch Air Quality Data. In 2015 this setup was also used to handle air quality data from the Smart Emission (Nijmegen) project. This is described in a separate chapter: Smart Emission Project.
- obtaining raw source data: observations&measurements (O&M), metadata (stations etc).
- data transformation (ETL) to target various (OWS) services, INSPIRE, E-reporting and custom services
- tooling (databases and ETL-tools) for the above
Source code for data management and ETL can be found in GitHub: https://github.com/Geonovum/sospilot/tree/master/src
2.1. Architecture¶
Figure 1 sketches the overall SOSPilot architecture with emphasis on the flow of data (arrows). Circles depict harvesting/ETL processes. Server-instances are in rectangles. Datastores are shown with “DB”-icons.
The main challenge/flow of data is from Raw Source APIs such as raw XML files with Air Quality data provided by the RIVM LML server. In a later stage of the project this architecture was reused for other raw O&M sources such as the CityGIS Sensor REST API in the Smart Emission project (see Smart Emission Project).
The RIVM LML file server provides a directory of XML files with hourly measurements of the past month. The big circle “ETL” embeds the data transformation processes and storage that is required to deliver OGC services for WMS, WFS (via GeoServer) and SOS (via the 52 North SOS Server) and other Custom Services at a later stage (dashed rectangle) such as EU IPR-compliant E-reporting.
2.2. ETL Design¶
In this section the ETL as used for RIVM LML data is elaborated in more detail as depicted in the figure below.
The ETL design comprises three main processing steps and three datastores. The three ETL Steps are:
- File Harvester: fetch source data from RIVM e.g. from http://lml.rivm.nl/xml and store locally
- AQ ETL: transform this local source data to intermediate “Core AQ Data” in PostGIS
- SOS ETL: transform and publish “Core AQ Data” to the 52N SOS DB via SOS-Transactions (SOS-T)
The detailed dataflow from source to destination is as follows:
- The File Harvester fetches XML files with AQ/LML measurements from the RIVM server
- The File Harvester puts these files as XML blobs 1-1 in a Postgres/PostGIS database
- The AQ ETL process reads these file blobs and transforms these to the Core AQ DB (Raw Measurements)
- The Core AQ DB contains measurements + stations in regular tables 1-1 with original data, including a Time column
- The Core AQ DB can be used for OWS (WMS/WFS) services via GeoServer (possibly using VIEW by Measurements/Stations JOIN)
- The SOS ETL process transforms core AQ data to SOS Observations and publishes Observations using SOS-T InsertObservation
- These three processes run continuously (via cron)
- Each process always knows its progress and where it needs to resume, even after it has been stopped (by storing a progress/checkpoint info)
These last two ETL processes manage their last sync-time
using a separate progress table
within the database.
The first (Harvester) only needs to check if a particular XML file (as they have a unique file name) has already been stored.
Advantages of this approach:
- backups of source data possible
- incrementally build up of history past the last month
- in case of (design) errors we can always reset the ‘progress timestamp(s)’ and restart anew
- simpler ETL scripts than “all-in-one”, e.g. from “Core AQ DB” to “52N SOS DB” may even be in plain SQL
- migration with changed in 52N SOS DB schema simpler
- prepared for op IPR/INSPIRE ETL (source is Core OM DB)
- OWS server (WMS/WFS evt WCS) can directly use op Core OM DB (possibly via Measurements/Stations JOIN VIEW evt, see below)
The Open Source ETL tool Stetl, Streaming ETL , is used for most of the transformation steps. Stetl provides standard modules for building an ETL Chain via a configuration file. This ETL Chain is a linkage of Input, Filter and Output modules. Each module is a Python class derived from Stetl base classes. In addition a developer may add custom modules where standard Stetl modules are not available or to specialize processing aspects.
Stetl has been used sucessfully to publish BAG (Dutch Addresses and Buildings) to INSPIRE Addresses via
XSLT and WFS-T (to the deegree WFS server
) but also for transformation of Dutch topography (Top10NL and BGT)
to PostGIS. As Stetl is written in Python it is well-integrated with standard ETL and Geo-tools like GDAl/OGR, XSLT and
PostGIS.
At runtime Stetl (via the stetl
command) basically reads the config file,
creates all modules and links their inputs and outputs. This also makes for an easy programming model
as one only needs to concentrate on a single ETL step.
2.2.1. ETL Step 1. - Harvester¶
The RIVM data server provides measurements of the past month in a collection of XML files served by an Apache HTTP server. See figure below.
The LML Harvester will continuously read these XML files and store
these in the DB as XML Blobs with their filename in the Postgres
table lml_files
.
This can be effected by a simple Stetl process activated every 30 mins via the linux
cron
service. Stetl has a built-in module for Apache dir listing reading.
Only a derived version needed to be developed in order to track which files have been
read already. This is implemented in the file https://github.com/Geonovum/sospilot/blob/master/src/rivm-lml/apachedirinput.py.
Note: there are two data streams with AQ Data from RIVM: “XML” oriented and “SOS” oriented. We will use the “XML” oriented as the file format is simpler to process and less redundant with station info. The URL is http://www.lml.rivm.nl/xml.
For completeness, the “SOS” oriented are identical in measurements, though not rounded, but that should be within error range.
There also seem to be differences, for example “SOS”:
<ROW>
<OPST_OPDR_ORGA_CODE>RIVM</OPST_OPDR_ORGA_CODE>
<STAT_NUMMER>633</STAT_NUMMER>
<STAT_NAAM>Zegveld-Oude Meije</STAT_NAAM>
<MCLA_CODE>regio achtergr</MCLA_CODE>
<MWAA_WAARDE>-999</MWAA_WAARDE>
<MWAA_BEGINDATUMTIJD>20140527120000</MWAA_BEGINDATUMTIJD>
<MWAA_EINDDATUMTIJD>20140527130000</MWAA_EINDDATUMTIJD>
</ROW>
vs “XML”:
<meting>
<datum>27/05/2014</datum>
<tijd>13</tijd>
<station>633</station>
<component>CO</component>
<eenheid>ug/m3</eenheid>
<waarde>223</waarde>
<gevalideerd>0</gevalideerd>
</meting>
Gotcha: there is a file called actueel.xml
in the XML stream. This file has to be skipped to avoid double records.
2.2.2. ETL Step 2 - Raw Measurements¶
This step produces raw AQ measurements, “AQ ETL” in Figure 2, from raw source (file) data harvested
in the table lml_files
(see Step 1).
Two tables: stations
and measurements
. This is a 1:1 transformation from the raw XML.
The measurements
refers to the stations
by a FK station_id
. The table etl_progress
is
used to track the last file processed from lml_files
.
2.2.2.1. Stations¶
Station info is available from Eionet as a CSV file. Coordinates are in EPSG:4258 (also used in INSPIRE).
To create “clean” version of eionet RIVM stations understood by ogr2ogr to read into PostGIS:
- download CSV from http://cdr.eionet.europa.eu/Converters/run_conversion?file=nl/eu/aqd/d/envurreqq/REP_D-NL_RIVM_20131220_D-001.xml&conv=450&source=remote
- this file saves as
REP_D-NL_RIVM_20131220_D-001.csv
- copy to stations.csv for cleaning
- stations.csv: remove excess quotes, e.g. “”“
- stations.csv: replace in CSV header
Pos
withLat,Lon
- stations.csv: replace space between coordinates with comma: e.g
,51.566389 4.932792,
becomes,51.566389,4.932792,
- fix DateTime formatting to comply with OGR: replace
T00:00:00
to `` 00:00:00`` so1976-04-02T00:00:00+01:00
becomes1976-04-02 00:00:00+01:00
- test result stations.csv by uploading in e.g. Geoviewer: http://kadviewer.kademo.nl
- create or update
stations.vrt
for OGR mapping - use stations2postgis.sh to map to PostGIS table
- use stations2gml.sh to map to GML file
See details in GitHub: https://github.com/Geonovum/sospilot/tree/master/data/rivm-lml/stations
Test first by uploading and viewing in a geoviewer, for example in http://kadviewer.kademo.nl See result.
Reading into PostGIS
Note that not all stations may be active. This is indicated by the activity_end
column. We create a VIEW with
only active stations to be used for SOS Sensors (see below).
-- create a view with active stations
DROP VIEW IF EXISTS rivm_lml.active_stations CASCADE;
CREATE VIEW rivm_lml.active_stations AS
SELECT * FROM rivm_lml.stations WHERE activity_end is NULL;
2.2.2.2. Measurements¶
Reading raw measurements from the files stored in the lml_files
table is done with a Stetl
process. A specific Stetl Input module was developed to effect reading and parsing the files
and tracking the last id of the file processed.
https://github.com/Geonovum/sospilot/blob/master/src/rivm-lml/lmlfiledbinput.py
The Stetl process is defined in https://github.com/Geonovum/sospilot/blob/master/src/rivm-lml/files2measurements.cfg
The invokation of that Stetl process is via shell script: https://github.com/Geonovum/sospilot/blob/master/src/rivm-lml/files2measurements.sh
The data is stored in the measurements
table, as below. station_id
is a foreign key
into the stations
table.
Tracking ETL progress for the worker files2measurements
is done in the etl_progress
table.
The last_id
field is the identifier of the last record in the lml_files
table
processed. On each new run the ETL process starts from new records since that last record.
Some query examples:
-- Laatste 24 uur aan metingen voor station en component
SELECT * FROM rivm_lml.measurements
WHERE sample_time > current_timestamp::timestamp without time zone - '1 day'::INTERVAL
AND component = 'NO' AND station_id = '136' order by sample_time desc;
-- Laatste meting voor station en component
SELECT * FROM rivm_lml.measurements
WHERE sample_time > current_timestamp::timestamp without time zone - '1 day'::INTERVAL
AND component = 'NO' AND station_id = '136' order by sample_time desc limit 1;
2.2.3. ETL Step 3 - SOS Publication¶
In this step the Raw Measurements data (see Step 2) is transformed to “SOS Ready Data”, i.e. data that can be handled by the 52North SOS server. Three options:
- direct transform into the SOS database of the 52N SOS server
- via “SOS Transactions” i.e. publishing via SOS-protocol (ala WFS-T)
- via REST
Discussion:
- Direct publication into the SOS DB (39 tables!) seems to be cumbersome and error prone and not future-proof
- via “SOS Transactions” (SOS-T) seems a good and standard option
- Using the REST-API seems the quickest/most efficient way to go, but the status of the REST implementation is unsure.
So from here on publication via SOS-T is further expanded.
2.2.3.1. SOS Transaction - PoC¶
A small Proof-of-Concept using the available requests and sensor ML as example was quite promising. This also provides an example for the mapping strategy.
We have created JSON insert-sensor
and insert-observation
requests and executed these
in the Admin SOS webclient. Each Sensor denotes a single station with Input just “Air” and one
Output for each chemical Component (here O3, MO, NO2, PM10). These files can serve later as templates
for the ETL via Stetl. The insert-sensor
needs to be done once per Station before invoking any InsertObservation
.
The insert-observation
is performed per measurement, though we may consider using an
insert-result-template
and then insert-result
or SOS-Batch
operations for efficiency.
See the images below.
And the observation insert below.
2.2.3.2. SOS Publication - Stetl Strategy¶
As Stetl only supports WFS-T, not yet SOS, a SOS Output module sosoutput.py
was developed derived
from the standard httpoutput.py
module. See https://github.com/Geonovum/sospilot/blob/master/src/rivm-lml/sosoutput.py.
Most importantly, the raw RIVM-LML data
from Step 2 needs to be transformed to OWS O&M data. The easiest is to use substitutable templates
, like the
Stetl config itself also applies. This means we develop files with SOS Requests in which all variable parts get a
symbolic value like {sample_value}
. These templates can be found under
https://github.com/Geonovum/sospilot/tree/master/src/rivm-lml/sostemplates in particular
- https://github.com/Geonovum/sospilot/blob/master/src/rivm-lml/sostemplates/insert-sensor.json InsertSensor
- https://github.com/Geonovum/sospilot/blob/master/src/rivm-lml/sostemplates/procedure-desc.xml Sensor ML
- https://github.com/Geonovum/sospilot/blob/master/src/rivm-lml/sostemplates/insert-observation.json InsertObservation
These templates were derived from the sample SOS requests available in the 52N SOS Admin Client. Note that we use JSON for the requests, as this is simpler than XML. The Sensor ML is embedded in the insert-sensor JSON request.
2.2.3.3. SOS Publication - Sensors¶
This step needs to be performed only once, or when any of the original Station data (CSV) changes.
The Stetl config https://github.com/Geonovum/sospilot/blob/master/src/rivm-lml/stations2sensors.cfg
uses a Standard Stetl module, inputs.dbinput.PostgresDbInput
for obtaining Record data from a Postgres database.
{{
"request": "InsertSensor",
"service": "SOS",
"version": "2.0.0",
"procedureDescriptionFormat": "http://www.opengis.net/sensorML/1.0.1",
"procedureDescription": "{procedure-desc.xml}",
"observableProperty": [
"http://sensors.geonovum.nl/rivm-lml/observableProperty/benzeen",
"http://sensors.geonovum.nl/rivm-lml/observableProperty/CO",
"http://sensors.geonovum.nl/rivm-lml/observableProperty/NH3",
"http://sensors.geonovum.nl/rivm-lml/observableProperty/NO",
"http://sensors.geonovum.nl/rivm-lml/observableProperty/NO2",
"http://sensors.geonovum.nl/rivm-lml/observableProperty/O3",
"http://sensors.geonovum.nl/rivm-lml/observableProperty/PM10",
"http://sensors.geonovum.nl/rivm-lml/observableProperty/PM25",
"http://sensors.geonovum.nl/rivm-lml/observableProperty/SO2"
],
"observationType": [
"http://www.opengis.net/def/observationType/OGC-OM/2.0/OM_Measurement"
],
"featureOfInterestType": "http://www.opengis.net/def/samplingFeatureType/OGC-OM/2.0/SF_SamplingPoint"
}}
The SOSTOutput module will expand {procedure-desc.xml}
with the Sensor ML template.
2.2.3.4. SOS Publication - Observations¶
The Stetl config https://github.com/Geonovum/sospilot/blob/master/src/rivm-lml/measurements2sos.cfg
uses an extended Stetl module (inputs.dbinput.PostgresDbInput
) for obtaining Record data from a Postgres database:
https://github.com/Geonovum/sospilot/blob/master/src/rivm-lml/measurementsdbinput.py.
This is required to track progress in the etl_progress
table similar as in Step 2.
The last_id
is remembered.
The Observation template looks as follows.
{{
"request": "InsertObservation",
"service": "SOS",
"version": "2.0.0",
"offering": "http://sensors.geonovum.nl/rivm-lml/offering/{station_id}",
"observation": {{
"identifier": {{
"value": "{unique_id}",
"codespace": "http://www.opengis.net/def/nil/OGC/0/unknown"
}},
"type": "http://www.opengis.net/def/observationType/OGC-OM/2.0/OM_Measurement",
"procedure": "http://sensors.geonovum.nl/rivm-lml/procedure/{station_id}",
"observedProperty": "http://sensors.geonovum.nl/rivm-lml/observableProperty/{component}",
"featureOfInterest": {{
"identifier": {{
"value": "http://sensors.geonovum.nl/rivm-lml/featureOfInterest/{station_id}",
"codespace": "http://www.opengis.net/def/nil/OGC/0/unknown"
}},
"name": [
{{
"value": "{municipality}",
"codespace": "http://www.opengis.net/def/nil/OGC/0/unknown"
}}
],
"geometry": {{
"type": "Point",
"coordinates": [
{station_lat},
{station_lon}
],
"crs": {{
"type": "name",
"properties": {{
"name": "EPSG:4326"
}}
}}
}}
}},
"phenomenonTime": "{sample_time}",
"resultTime": "{sample_time}",
"result": {{
"uom": "ug/m3",
"value": {sample_value}
}}
}}
}}
It is quite trivial in sosoutput.py
to substitute these values from the measurements
-table records.
Like in ETL Step 2 the progress is remembered in the table rivm_lml.etl_progress
by updating the last_id
field
after publication, where that value represents the gid
value of rivm_lml.measurements
.
2.2.3.5. SOS Publication - Results¶
We can observe the database being filled:
Via the standard SOS protocol the results can be tested:
- GetCapabilities: http://sensors.geonovum.nl/sos/service?service=SOS&request=GetCapabilities
- DescribeSensor (station 807, Hellendoorn): http://tinyurl.com/mmsr9hl (URL shortened)
- GetObservation: http://tinyurl.com/ol82sxv (URL shortened)
2.2.3.6. REST API¶
For now the REST API will not be used since SOS-T is used (see above). Below is for possible future reference.
Documentation REST API: http://52north.org/files/sensorweb/docs/sos/restful/restful_sos_documentation.pdf
REST root URL: http://sensors.geonovum.nl/sos/service/rest
From the documentation the mapping seems to make sense as follows:
sensor-create
- to create new sensor resources –> map fromstations
tableobservation-create
- to create observation resources –> map frommeasurements
table
Design:
- use Stetl: input Postgres Query, output SOS-REST module
- similar to ETL step 2
- track progress in
etl_progress
table - new Stetl output, similar to WFS-T and deegree-publisher
- use Python XML templates for the requests
- problem: make SML, Sensor per Station, or Sensor per Station-Component ?