Creating Unitpackages

Unitpackages (frictionless Data Packages with unit-annotated fields and additional metadata describing the data) can be created from CSV files or pandas DataFrames. After creation, metadata and field descriptions (including units) can be added to produce a complete, self-describing data package.

Quick start

A typical workflow for creating a complete unitpackage involves four steps:

  1. Creating an entry from a CSV file or a pandas DataFrame
  2. Adding metadata
  3. Adding field descriptions (units)
  4. Saving the result

from unitpackage.entry import Entry

# 1. Create entry from CSV
entry = Entry.from_csv(csvname="../files/demo_package.csv")

# 2. Load metadata from a YAML file
entry = entry.load_metadata("../files/demo_package.csv.yaml")

# 3. Update field descriptions (units)
fields = [{'name': 't', 'unit': 's'}, {'name': 'j', 'unit': 'A / cm2'}]
entry = entry.update_fields(fields=fields)

# 4. Save
entry.save(outdir="../generated/files/csv_entry/")

The saved entry now consists of a CSV and a JSON file in the output directory.

import os
os.listdir("../generated/files/csv_entry/")
['demo_package.json', 'demo_package.csv']

The following sections describe each step in more detail.

From a CSV file

An entry can be created directly from a CSV file.

from unitpackage.entry import Entry

entry = Entry.from_csv(csvname="../files/demo_package.csv")
entry
Entry('demo_package')

The entry’s field descriptions are inferred from the CSV.

entry.fields
[{'name': 't', 'type': 'integer'}, {'name': 'j', 'type': 'integer'}]

The data can be accessed as a pandas DataFrame

entry.df.head()
   t  j
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4

or through the data attribute of the underlying resource

entry.resource.data
   t  j
0  0  0
1  1  1
2  2  2
3  3  3
4  4  4
5  5  5
6  6  6
From a pandas DataFrame

Similarly, an entry can be created from a pandas DataFrame. A basename must be provided to name the entry.

import pandas as pd
from unitpackage.entry import Entry

data = {'x': [1, 2, 3], 'v': [1, 3, 2]}
df = pd.DataFrame(data)

entry = Entry.from_df(df, basename='df_data')
entry
Entry('df_data')

Adding field descriptions

Field descriptions such as units can be added or updated using update_fields(). Only fields with matching names are updated; non-matching fields are ignored.

fields = [{'name': 'x', 'unit': 'm'}, {'name': 'v', 'unit': 'm / s', 'description': 'velocity'}]
entry = entry.update_fields(fields=fields)
entry.fields
[{'name': 'x', 'type': 'integer', 'unit': 'm'},
 {'name': 'v', 'type': 'integer', 'description': 'velocity', 'unit': 'm / s'}]

Note

update_fields() returns a new entry; the original entry is not modified.
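
The matching rule described above (update by name, ignore unknown names, leave the original untouched) can be illustrated with plain dictionaries. The function merge_field_descriptions below is a hypothetical stand-in written for this sketch; it is not unitpackage's implementation and may differ from it in details.

```python
import copy


def merge_field_descriptions(existing, updates):
    """Return a new field list with updates applied to fields of the same name.

    Illustrative only; this is not unitpackage's implementation.
    """
    updates_by_name = {update["name"]: update for update in updates}
    merged = []
    for field in existing:
        new_field = copy.deepcopy(field)  # leave the input list untouched
        # Updates whose name matches no existing field are never looked up,
        # so they are silently ignored.
        new_field.update(updates_by_name.get(field["name"], {}))
        merged.append(new_field)
    return merged


original = [{"name": "x", "type": "integer"}, {"name": "v", "type": "integer"}]
updates = [
    {"name": "x", "unit": "m"},
    {"name": "v", "unit": "m / s", "description": "velocity"},
    {"name": "unknown", "unit": "K"},  # no field with this name: ignored
]
merged = merge_field_descriptions(original, updates)
```

Here merged carries the units and the description, while original still holds only the name and type of each field.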

Adding metadata

Metadata can be added to an entry in several ways.

From a Python dictionary

entry.metadata.from_dict({'experimentInfo': {'user': 'Max Doe', 'date': '2021-07-09'}})
entry.metadata
{'experimentInfo': {'user': 'Max Doe', 'date': '2021-07-09'}}

From a YAML or JSON file

Metadata can be loaded from a YAML file using load_metadata(), which supports method chaining.

entry = Entry.from_csv(csvname="../files/demo_package.csv")
entry = entry.load_metadata("../files/demo_package.csv.yaml")
entry.metadata
{'user': 'Max Doe', 'dataDescription': {'fields': [{'name': 't', 'unit': 's'}, {'name': 'j', 'unit': 'A cm-2'}]}}

A key can be passed to store the loaded metadata under that key instead of at the top level. This is useful when metadata should be organized according to a certain schema, keeping different metadata sources separated. See the unitpackage structure description for more details on how metadata schemas are organized within a resource.

entry = Entry.from_csv(csvname="../files/demo_package.csv")
entry = entry.load_metadata("../files/demo_package.csv.yaml", key="experiment")
entry.metadata['experiment']
{'user': 'Max Doe', 'dataDescription': {'fields': [{'name': 't', 'unit': 's'}, {'name': 'j', 'unit': 'A cm-2'}]}}
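
Why separate keys help can be seen with plain dictionaries. The sketch below uses general Python dict behavior only; the second source and its value are hypothetical, and nothing here calls unitpackage.

```python
experiment = {"user": "Max Doe"}
source = {"user": "John Doe"}  # hypothetical second source with the same top-level name

# Merging both sources at the top level: the later value replaces the earlier one.
flat = {}
flat.update(experiment)
flat.update(source)

# Storing each source under its own key keeps both values.
nested = {"experiment": experiment, "source": source}
```

In flat only one user survives, whereas nested retains the value from each source under its own key.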

The same works with a JSON file; the format is auto-detected from the file extension.

entry = Entry.from_csv(csvname="../files/demo_package.csv")
entry = entry.load_metadata("../files/demo_package.json", key="source")
entry.metadata['source']
{'resources': [{'name': 'demo_package', 'type': 'table', 'path': 'demo_package.csv', 'scheme': 'file', 'format': 'csv', 'mediatype': 'text/csv', 'encoding': 'utf-8', 'schema': {'fields': [{'name': 't', 'type': 'number', 'unit': 's'}, {'name': 'j', 'type': 'number', 'unit': 'A / m2'}]}, 'metadata': {'echemdb': {'dataDescription': {'fields': [{'name': 't', 'type': 'number', 'unit': 's'}, {'name': 'j', 'type': 'number', 'unit': 'A / m2'}]}, 'description': 'Sample data for the unitpackage module.', 'curation': {'process': [{'role': 'experimentalist', 'name': 'John Doe', 'laboratory': 'Institute of Good Scientific Practice', 'date': '2021-07-09'}]}}}}]}
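
The nested descriptor shown above can be navigated with ordinary dictionary access. As a minimal sketch, the hypothetical helper field_units collects the unit of each field from an abridged copy of that structure; it is not a unitpackage function.

```python
# Abridged copy of the descriptor structure shown above.
descriptor = {
    "resources": [
        {
            "name": "demo_package",
            "path": "demo_package.csv",
            "schema": {
                "fields": [
                    {"name": "t", "type": "number", "unit": "s"},
                    {"name": "j", "type": "number", "unit": "A / m2"},
                ]
            },
        }
    ]
}


def field_units(descriptor, resource_index=0):
    """Map field names to units for one resource (hypothetical helper)."""
    fields = descriptor["resources"][resource_index]["schema"]["fields"]
    # Fields without a 'unit' entry are reported as None.
    return {field["name"]: field.get("unit") for field in fields}


units = field_units(descriptor)
```

For a descriptor stored on disk, the same dictionary could be obtained with json.load() before calling the helper.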

Note

metadata.from_dict(), metadata.from_yaml(), and metadata.from_json() modify the entry’s metadata in place. In contrast, load_metadata() is a convenience method on the entry that returns self for method chaining.
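
The difference between the two calling styles can be sketched with toy classes. ToyMetadata and ToyEntry are hypothetical stand-ins written for this illustration; they do not reproduce unitpackage's classes, and whether the real methods merge or replace existing metadata is not specified here.

```python
class ToyMetadata(dict):
    """Stand-in for a metadata container (not unitpackage's class)."""

    def from_dict(self, data):
        # Changes the container in place; nothing is returned.
        self.update(data)


class ToyEntry:
    """Stand-in for an entry that owns a metadata container."""

    def __init__(self):
        self.metadata = ToyMetadata()

    def load_metadata(self, data):
        self.metadata.from_dict(data)
        return self  # returning self makes calls chainable


entry = ToyEntry()
in_place_result = entry.metadata.from_dict({"user": "Max Doe"})
chained = entry.load_metadata({"date": "2021-07-09"}).load_metadata({"note": "demo"})
```

The in-place call yields None, so it cannot be chained, while each chained call hands back the same entry object.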