Metadata#
Each measurement file should be annotated with metadata, including data descriptors (units, description of the measured parameter), the experimental conditions (samples, instruments, environment parameters, …), and possibly additional metadata (users, project, …). Usually, only a limited number of descriptors will change in a series of measurements. For example, you could perform a series of measurements at different temperatures. Hence we suggest storing all information in a template, preferably in YAML. The YAML format has the advantage of being easily readable by humans and machines. A detailed description of how to write YAML can be found elsewhere.
A simple example
could be as follows:
experimentalist: Max Doe
supervisor: John Mustermann
research question: Resistance of a resistor connected in series to a power supply.
figure description:
fields:
- name: t
unit: s
- name: U
unit: mV
description: Voltage across resistor 1.
Designing a metadata template#
For further processing of your data, it is helpful to acquire as many details on the measurement in a structured way. The following categories are used by the echemdb metadata-schema and serve as examples to create your schema.
curation: Details on the people involved in the data acquisition process.
Note
The name curation refers to the curation process of the measured data. The list can be extended later, when data is modified, reused or curated by other people.
Note
In principle, a list of ORCIDs would be sufficient. From that unique identifier, any other information on the users can be retrieved.
process:
- role: experimentalist # the first role is usually the experimentalist
name: Max Doe
orcid: https://orcid.org/0000-0002-9686-3948
- role: supervisor
name: The Boss
orcid: https://orcid.org/0000-0002-9686-3948
- role: curation
name: The Data Steward
orcid: https://orcid.org/0000-0002-9686-3948
projects: List of projects related to the source data, which can be a large third-party funded project, simply a PhD thesis or a small project within a PhD thesis.
# Descriptor of projects linked to the data.
- name: PhD
ORCID: url to ORCID
identifier: MDO_phd # An internal identifier for this Thesis, such as the user id suffixed with `phd`.
url: https://some.internal.website.org # link to the description of the project, for example in an ELN
- name: PPS # an acronym or unique identifier
title: A title # A more elaborate title
type: internal
url: https://some.internal.website.org # link to the description of the project, for example in an ELN
- name: Serious EU project
title: A title # A more elaborate title
type: external
url: https://some.public.website.org # project homepage
grant number: XXX
identifuer: https://public.website.org/with/json/content.json
figure description: Contains descriptors for the underlying data, which are necessary for exporting the data in a meaningful way. It also contains information on what would be displayed in the figure and which other data is associated with the present figure/data.
Note
To create unitpackages
, the structure for the fields must be followed. The name
for each field must match a field/column name in the CSV (datafile).
type: raw # raw, digitized, computed
measurement type: temperature variation
simultaneous measurements:
- humidity
- light intensity
fields:
- name: t
unit: s
description: relative time
- name: T
unit: K
description: Room temperature
experimental: Details on the equipment and experimental procedures applied to the system that is described below.
instrumentation:
- type: potentiostat
manufacturer: Electrocemistry Supplier
name: Poti1 # a unique internal name of the device
- type: themometer
model: xyz
manufacturer: Electronic Shop
name: TProbe2
url: https://doi.org/10.1016/0039-6028 # For example: doi referring to a published work describing this setup or configuration
description: Experimental description # A string or a new set of metadata. Can be outsourced to a separate markdown file.
system: Details on the system on which the experiment is performed, including anything that is in contact with or any external parameters that can have an impact on the system/measurement. Consider putting a banana in a beaker to study how its color changes when the external parameters, such as contact with microorganisms during transport, as well as the atmosphere, temperature and light intensity/source are varied in different experiments. Also, include information that does not seem to be relevant at the time of the data acquisition.
Note
This part is presumably the most important one. With increasing amount of information, more correlations can be drawn and possible issues can be found. It also allows elucidating possible variations in future measurements. For example, assume a new student uses a new beaker in a new study on the same system, but does not receive the same results. If the beaker would be the problem, but its properties (glass, thickness, supplier, an internal ID, …) were not recorded properly, one could not so easily pin down the origin of the problem.
container:
type: beaker
identifier: beaker1
components:
- name: beaker
manufacturer: Glass Company
material: glass
LOT: abc123
- name: lid
manufacturer: 3D printed
material: some plastic
description: lid with holes
LOT: def456
purity:
impurities:
- name: Some chemical
concentration: 1 ppm
atmosphere:
components:
- name: air
temperature:
# chose a unit system such as that from astropy or pint for further processing
unit: K
value: 300
humidity:
unit: pct
value: 25
light source:
name: sun
An extensive example of a more complete YAML template for an electrochemical system can be found here.
A complete YAML would contain all of the above (and other custom) categories.
curation:
...
projects:
...
figure description:
...
expeimental:
...
system:
...
Loading templates#
The YAML files can be loaded as a Python dictionary.
import yaml
with open('../data/files/data.csv.meta.yaml', 'rb') as f:
metadata = yaml.load(f, Loader=yaml.SafeLoader)
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
Cell In[8], line 2
1 import yaml
----> 2 with open('../data/files/data.csv.meta.yaml', 'rb') as f:
3 metadata = yaml.load(f, Loader=yaml.SafeLoader)
File /usr/share/miniconda3/envs/test/lib/python3.12/site-packages/IPython/core/interactiveshell.py:324, in _modified_open(file, *args, **kwargs)
317 if file in {0, 1, 2}:
318 raise ValueError(
319 f"IPython won't let you open fd={file} by default "
320 "as it is likely to crash IPython. If you know what you are doing, "
321 "you can use builtins' open."
322 )
--> 324 return io_open(file, *args, **kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '../data/files/data.csv.meta.yaml'
Annotating data automatically#
When (measurement) files are created in the file system, preferably the content of the YAML template is directly associated with the file. This can be achieved by monitoring the file system for newly created files. Following, we illustrate the approach by using the Python watchdog package. Besides, we also provide a solution with a graphical user interface.
Watchdog#
The following code observes the folder /data
for newly created .csv
files.
On file creation, the content from the YAML template is written in the same folder where the file is created.
The name will be identical to the newly created file and a .meta.yaml
is appended to the existing filename.
Note
We decided to append both suffixes meta
and yaml
to the original suffix, to clearly indicate that this file
contains metadata for the recorded CSV and that the content is YAML.
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
from pathlib import Path
import yaml
# adapt accordingly
observed_dir = './data/'
yaml_template = '/files/yaml_templates/demo.yaml'
suffix = '.csv' # mind the dot
def create_metadata(filename):
# load the metadata from a yaml template
with open(yaml_template, 'rb') as f:
metadata = yaml.load(f, Loader=yaml.SafeLoader)
# Add further methods that enhance or modifiy the YAML template
# write an output YAML file
outyaml = Path(filename).with_suffix(suffix + '.meta.yaml')
with open(outyaml, 'w') as f:
yaml.dump(metadata, f)
class NewFileHandler(FileSystemEventHandler):
def on_created(self, event):
if Path(event.src_path).suffix == suffix:
# print the filename
print(event.src_path, ' ' , Path(event.src_path).suffix)
# When a new file is created we catch the filename and parse it to a method
# that generates output yaml files and markdown files for additional notes
create_metadata(event.src_path)
# create an observer
observer = Observer()
# schedule the observer to observe the folder
observer.schedule(NewFileHandler(), path=observed_dir, recursive=False)
# start the observer
observer.start()
To stop watching the folder execute in a separate cell
observer.stop()
GUI#
We are experimenting with different approaches to create applications that annotate data automatically.
autotag-metadata: A standalone application, based on Qt for Python. It includes an editor for modifying YAML templates. This approach is preferable for end-users, for example, in the laboratory.
autoquetado-voila: Based on ipywidgets, which can be served, for example, via voila. It can be included in Jupyter notebooks and due to its modular structure can be adapted more simply to specific needs. (Early development stage)