Creating Unitpackages
Unitpackages (frictionless Data Packages with unit-annotated fields and additional metadata describing the data) can be created from CSV files or pandas DataFrames. After creation, metadata and field descriptions (including units) can be added to produce a complete, self-describing data package.
Quick start
A typical workflow to create a complete unitpackage from a CSV file involves:
Creating an entry from the CSV or pandas DataFrame
Loading metadata, for example from a YAML file
Adding field descriptions (units)
Saving the result
from unitpackage.entry import Entry
# 1. Create entry from CSV
entry = Entry.from_csv(csvname="../files/demo_package.csv")
# 2. Load metadata from a YAML file
entry = entry.load_metadata("../files/demo_package.csv.yaml")
# 3. Update field descriptions (units)
fields = [{'name': 't', 'unit': 's'}, {'name': 'j', 'unit': 'A / cm2'}]
entry = entry.update_fields(fields=fields)
# 4. Save
entry.save(outdir="../generated/files/csv_entry/")
The saved entry now consists of a CSV and a JSON file in the output directory.
import os
os.listdir("../generated/files/csv_entry/")
['demo_package.json', 'demo_package.csv']
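The units added in step 3 end up in the resource schema of the saved JSON descriptor. As a quick check, the descriptor can be inspected with the standard json module (a minimal sketch; it assumes the descriptor follows the frictionless layout with a resources list, as shown in the output under "Adding metadata" below):
import json
# Open the Data Package descriptor written by entry.save()
with open("../generated/files/csv_entry/demo_package.json") as descriptor:
    package = json.load(descriptor)
# The field descriptions, including the units added above, live in the resource schema.
package["resources"][0]["schema"]["fields"]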
The following sections describe each step in more detail.
From a CSV file
An entry can be created directly from a CSV file.
from unitpackage.entry import Entry
entry = Entry.from_csv(csvname="../files/demo_package.csv")
entry
Entry('demo_package')
The entry’s field descriptions are inferred from the CSV.
entry.fields
[{'name': 't', 'type': 'integer'}, {'name': 'j', 'type': 'integer'}]
The data can be accessed as a pandas DataFrame
entry.df.head()
|   | t | j |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
| 4 | 4 | 4 |
or from the resource data
entry.resource.data
|   | t | j |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
| 4 | 4 | 4 |
| 5 | 5 | 5 |
| 6 | 6 | 6 |
From a pandas DataFrame
Similarly, an entry can be created from a pandas DataFrame.
A basename must be provided to name the entry.
import pandas as pd
from unitpackage.entry import Entry
data = {'x': [1, 2, 3], 'v': [1, 3, 2]}
df = pd.DataFrame(data)
entry = Entry.from_df(df, basename='df_data')
entry
Entry('df_data')
Adding field descriptions
Field descriptions such as units can be added or updated using update_fields().
Only fields with matching names are updated; non-matching fields are ignored.
fields = [{'name': 'x', 'unit': 'm'}, {'name': 'v', 'unit': 'm / s', 'description': 'velocity'}]
entry = entry.update_fields(fields=fields)
entry.fields
[{'name': 'x', 'type': 'integer', 'unit': 'm'},
{'name': 'v', 'type': 'integer', 'description': 'velocity', 'unit': 'm / s'}]
Note
update_fields() returns a new entry — the original entry is not modified.
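Both points can be illustrated with the df_data entry from above (a sketch; the field name 'T' is made up and does not exist in the entry):
# 'x' matches an existing field, so its description is updated;
# 'T' does not exist in the entry and is ignored.
updated = entry.update_fields(fields=[{'name': 'x', 'unit': 'km'}, {'name': 'T', 'unit': 'K'}])
updated.fields
# The original entry keeps its previous field descriptions.
entry.fields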
Adding metadata
Metadata can be added to an entry in several ways.
From a Python dictionary
entry.metadata.from_dict({'experimentInfo': {'user': 'Max Doe', 'date': '2021-07-09'}})
entry.metadata
{'experimentInfo': {'user': 'Max Doe', 'date': '2021-07-09'}}
From a YAML or JSON file
Metadata can be loaded from a YAML file using load_metadata(), which supports method chaining.
entry = Entry.from_csv(csvname="../files/demo_package.csv")
entry = entry.load_metadata("../files/demo_package.csv.yaml")
entry.metadata
{'user': 'Max Doe', 'dataDescription': {'fields': [{'name': 't', 'unit': 's'}, {'name': 'j', 'unit': 'A cm-2'}]}}
An optional key can be specified under which the loaded metadata is stored.
This is useful when metadata should be organized according to a certain schema, keeping different metadata sources separated.
See the unitpackage structure description for more details on how metadata schemas are organized within a resource.
entry = Entry.from_csv(csvname="../files/demo_package.csv")
entry = entry.load_metadata("../files/demo_package.csv.yaml", key="experiment")
entry.metadata['experiment']
{'user': 'Max Doe', 'dataDescription': {'fields': [{'name': 't', 'unit': 's'}, {'name': 'j', 'unit': 'A cm-2'}]}}
The same works with a JSON file — the format is auto-detected from the file extension.
entry = Entry.from_csv(csvname="../files/demo_package.csv")
entry = entry.load_metadata("../files/demo_package.json", key="source")
entry.metadata['source']
{'resources': [{'name': 'demo_package',
   'type': 'table',
   'path': 'demo_package.csv',
   'scheme': 'file',
   'format': 'csv',
   'mediatype': 'text/csv',
   'encoding': 'utf-8',
   'schema': {'fields': [{'name': 't', 'type': 'number', 'unit': 's'},
     {'name': 'j', 'type': 'number', 'unit': 'A / m2'}]},
   'metadata': {'echemdb': {'dataDescription': {'fields': [{'name': 't', 'type': 'number', 'unit': 's'},
       {'name': 'j', 'type': 'number', 'unit': 'A / m2'}]},
     'description': 'Sample data for the unitpackage module.',
     'curation': {'process': [{'role': 'experimentalist',
        'name': 'John Doe',
        'laboratory': 'Institute of Good Scientific Practice',
        'date': '2021-07-09'}]}}}}]}
Note
metadata.from_dict(), metadata.from_yaml(), and metadata.from_json() modify the entry’s metadata in-place.
In contrast, load_metadata() is a convenience method on the entry that returns self for method chaining.
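A minimal sketch contrasting the two styles, reusing only the calls shown above:
# metadata.from_dict() changes the entry's metadata in place
entry = Entry.from_csv(csvname="../files/demo_package.csv")
entry.metadata.from_dict({'experimentInfo': {'user': 'Max Doe'}})
# load_metadata() returns the entry, so creation and metadata loading can be chained
entry = Entry.from_csv(csvname="../files/demo_package.csv").load_metadata("../files/demo_package.csv.yaml", key="experiment")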