This project normalises and merges data from different sources, which we call datasets here. Each dataset has an ID (int) and a name.

The aim of this documentation is to give an overview of the process. To see how to run these steps in practice, refer to the README of the GitLab project. The data of each dataset goes through a series of steps:

(Figure: overview of the data process, data_flow)

Each of the steps is a folder in the GitLab repository (see here). In each folder, the file of each dataset for that step is stored as a CSV, always named [dataset_id]__[dataset_name].csv (note the 2 underscores between the id and the name).
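
For illustration, a dataset with id 42 and name example_dataset would therefore be stored as 42__example_dataset.csv in each step folder (the id and name here are made up). A minimal Python sketch of the convention:

```python
from pathlib import Path

def dataset_csv_path(step_folder: str, dataset_id: int, dataset_name: str) -> Path:
    """Build the CSV path of a dataset in a given step folder (illustrative helper)."""
    # Two underscores separate the id from the name.
    return Path(step_folder) / f"{dataset_id}__{dataset_name}.csv"

print(dataset_csv_path("data/normalised", 42, "example_dataset"))
# -> data/normalised/42__example_dataset.csv
```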

Please note that the GitLab repository only stores the raw data, not the files in the readable, normalised and final folders. Those files would be too heavy and are not needed, since the scripts to produce them are part of the project.

Each dataset has a normaliser and can have an extractor. You can find links to these directly on the datasets page, or in the GitLab project in the src/extract and src/normalise folders.

Any choice we had to make in the processing of a particular dataset should be explicitly mentioned in the doc of the extract or normalise method. These comments are automatically extracted and shown on the datasets page.
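
As a hypothetical illustration (the dataset, the processing choice and the class name are invented; in the real project the class would live in src/normalise):

```python
class ExampleDatasetNormaliser:  # would extend SLNormaliser in the project
    def normalise(self):
        """Normalise the (invented) example dataset.

        Processing choice: samples without coordinates are dropped,
        because the source file provides no location information for them.
        Comments like this one are extracted automatically and shown on
        the datasets page.
        """
```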

Steps

Extract

This step prepares the data until the result is one readable CSV (meaning: one line of header, one piece of information per column, etc.). Sometimes there is a source data file in the data/raw folder; sometimes the extract step consists of scraping a website or using an API to get the data.

If the raw file is already a readable CSV, there is no need to create a specific extractor for the dataset: a generic one will be used to extract it.
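
As a rough sketch of what a dataset-specific extractor could look like when the data comes from an API (the class name, the endpoint URL and the method name are assumptions for illustration, not the project's actual interface):

```python
import csv
import requests

class ExampleDatasetExtractor:
    """Fetch the raw data from a (made-up) API and write it as one readable CSV."""

    URL = "https://example.org/api/pfas-samples"  # hypothetical endpoint

    def extract(self, out_path: str) -> None:
        records = requests.get(self.URL, timeout=30).json()  # list of dicts
        with open(out_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=records[0].keys())
            writer.writeheader()       # exactly one line of header
            writer.writerows(records)  # one piece of information per column
```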

Normalise

In this class you will find a description of the columns of the source dataset, the static values that should be applied to all lines, the method used to parse the values to numeric, the EPSG code of the projection used in the source file, etc. It can be seen as a source file descriptor.

The idea is that the normaliser contains the parameters (the column descriptions, for example), while the actual code lives in the abstract classes Normaliser, SLNormaliser and MLNormaliser. However, it is still possible to change the behaviour for some edge cases by implementing one or more methods of the abstract classes.
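
A hypothetical sketch of this separation (the attribute names columns, static_values and epsg, the column mapping and the overridden method are illustrative assumptions, not the project's exact interface):

```python
class ExampleDatasetNormaliser:  # would extend SLNormaliser in the project
    # Parameters only: they describe the source file rather than process it.
    columns = {
        "date": "sampling_date",
        "lat": "latitude",
        "lon": "longitude",
        "pfos": "pfos",
    }
    static_values = {"country": "UK"}  # applied to all lines
    epsg = 4326                        # projection used in the source file

    # Edge case: this source uses a decimal comma, so the generic numeric
    # parsing of the abstract class is overridden here.
    def parse_value(self, raw: str) -> float:
        return float(raw.replace(",", "."))
```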

The result of the normalise step is a csv in the data format described here.

There are 2 types of normalisers because there are 2 types of source files, which need to be processed quite differently.

Single line normalisers

They correspond to datasets in which each sampling is represented by one line. The PFAS concentrations are stored in different columns, one per substance.

Example:

| date | matrix | lat | lon | name | pfbs | pfhps | pfhxs | pfos |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 14/02/2022 | Surface water | 54,116628 | -2,5110948 | Fire marshal training ground | 1,2 | 1,9 | 25,5 | 36,1 |

Multiple lines normalisers

They correspond to datasets in which each sampling is represented by several lines. The PFAS concentrations are all in the same column, with another column indicating which substance is measured.

Example:

| date | matrix | lat | lon | name | param | value |
| --- | --- | --- | --- | --- | --- | --- |
| 14/02/2022 | Surface water | 54,116628 | -2,5110948 | Fire marshal training ground | pfbs | 1,2 |
| 14/02/2022 | Surface water | 54,116628 | -2,5110948 | Fire marshal training ground | pfhps | 1,9 |
| 14/02/2022 | Surface water | 54,116628 | -2,5110948 | Fire marshal training ground | pfhxs | 25,5 |
| 14/02/2022 | Surface water | 54,116628 | -2,5110948 | Fire marshal training ground | pfos | 36,1 |
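
Both layouts carry the same information: a multiple-lines file is essentially the single-line file "unpivoted". A small pandas sketch of the relationship, using the column names from the examples above (the rest is illustrative):

```python
import pandas as pd

# Single-line layout: one row per sampling, one column per substance.
wide = pd.DataFrame({
    "date": ["14/02/2022"],
    "name": ["Fire marshal training ground"],
    "pfbs": [1.2], "pfhps": [1.9], "pfhxs": [25.5], "pfos": [36.1],
})

# Multiple-lines layout: one row per (sampling, substance) pair.
long = wide.melt(id_vars=["date", "name"], var_name="param", value_name="value")
print(long)
```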

Merge

This last step combines all normalised datasets into a single dataset, written as full.parquet and full.csv. It also generates datasets.csv, a CSV file with some statistics about each dataset.
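
A minimal sketch of what this step could look like with pandas, assuming the folder layout described above (the dataset_id column and the computed statistics are illustrative assumptions; only full.parquet, full.csv and datasets.csv come from the project):

```python
from pathlib import Path
import pandas as pd

frames = []
for csv_file in sorted(Path("data/normalised").glob("*__*.csv")):
    df = pd.read_csv(csv_file)
    # Keep track of which dataset each line comes from (illustrative column).
    df["dataset_id"] = int(csv_file.stem.split("__")[0])
    frames.append(df)

full = pd.concat(frames, ignore_index=True)
full.to_parquet("full.parquet")
full.to_csv("full.csv", index=False)

# Per-dataset statistics (illustrative: the real stats may differ).
stats = full.groupby("dataset_id").size().rename("n_samples").reset_index()
stats.to_csv("datasets.csv", index=False)
```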

These files are the “output” of the pdh_data project; they are used as the input for the pdh_web project (this website).