Unitpackage Structure
The unitpackage module extends the Python API of the frictionless framework.
To create a unitpackage entry or a unitpackage collection, a frictionless datapackage must have a certain structure, i.e., follow a certain schema. We briefly illustrate the structure of frictionless datapackages, describe which changes were necessary to adapt the schema to scientific data, and describe the structure of the datapackage for use with unitpackage.
Frictionless Datapackage
The description of the frictionless datapackage presented here is based on adapted examples from the frictionless documentation.
A minimal datapackage in your file system consists of two files:
data.csv
datapackage.json
The CSV file contains some data. For unitpackage we focus on CSV files that contain numbers. Such data is common in the natural sciences.
var1,var2,var3
a,1,2.1
b,3,4.5
In the corresponding JSON file the data in the CSV is described in a resource.
{
"profile": "tabular-data-package",
"name": "my-dataset",
// here we list the data files in this dataset
"resources": [
{
"path": "data.csv",
"name": "data",
"profile": "tabular-data-resource",
"schema": {
"fields": [
{
"name": "var1",
"type": "string"
},
{
"name": "var2",
"type": "integer"
},
{
"name": "var3",
"type": "number"
}
]
}
}
]
}
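To make the relation between the two files concrete, the following sketch writes a descriptor and a CSV into a temporary directory using only the Python standard library, then checks that the CSV header matches the field names in the schema. The helper `header_matches_schema` and the sample rows are hypothetical and only serve as illustration; frictionless and unitpackage provide their own loading and validation.

```python
import csv
import json
import tempfile
from pathlib import Path

# Descriptor mirroring the example above.
descriptor = {
    "profile": "tabular-data-package",
    "name": "my-dataset",
    "resources": [
        {
            "path": "data.csv",
            "name": "data",
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": "var1", "type": "string"},
                    {"name": "var2", "type": "integer"},
                    {"name": "var3", "type": "number"},
                ]
            },
        }
    ],
}


def header_matches_schema(package_dir):
    """Hypothetical helper: compare each CSV header with its schema field names."""
    package_dir = Path(package_dir)
    text = (package_dir / "datapackage.json").read_text(encoding="utf-8")
    package = json.loads(text)
    for resource in package["resources"]:
        csv_path = package_dir / resource["path"]
        with open(csv_path, newline="", encoding="utf-8") as f:
            header = next(csv.reader(f))
        names = [field["name"] for field in resource["schema"]["fields"]]
        if header != names:
            return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Illustrative rows: one string, one integer and one number per line.
    (root / "data.csv").write_text(
        "var1,var2,var3\na,1,2.1\nb,3,4.5\n", encoding="utf-8"
    )
    (root / "datapackage.json").write_text(
        json.dumps(descriptor, indent=2), encoding="utf-8"
    )
    matches = header_matches_schema(root)

print(matches)
```

The check only compares names; it does not verify that the values in each column agree with the declared types.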
A frictionless datapackage can have multiple resources.
{
"profile": "tabular-data-package",
"name": "my-dataset",
// here we list the data files in this dataset
"resources": [
{
"path": "data.csv",
"name": "data",
"profile": "tabular-data-resource",
"schema": {
"fields": [
{
"...":"..."
}
]
}
},
{
"path": "data2.csv",
"name": "data2",
"profile": "tabular-data-resource",
"schema": {
"fields": [
{
"...":"..."
}
]
}
}
]
}
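Since each resource carries its own name, a specific resource can be looked up by that name. The sketch below does this on a plain dictionary; `find_resource` is a hypothetical helper written for this illustration (frictionless offers a comparable lookup through `get_resource`, which is used later on this page), and the schemas are left out for brevity.

```python
# Descriptor with two resources, as in the example above (schemas omitted).
descriptor = {
    "profile": "tabular-data-package",
    "name": "my-dataset",
    "resources": [
        {"path": "data.csv", "name": "data", "profile": "tabular-data-resource"},
        {"path": "data2.csv", "name": "data2", "profile": "tabular-data-resource"},
    ],
}


def find_resource(package, name):
    """Hypothetical helper: return the resource descriptor with the given name."""
    for resource in package["resources"]:
        if resource["name"] == name:
            return resource
    raise KeyError(f"no resource named {name!r}")


resource_names = [resource["name"] for resource in descriptor["resources"]]
data2_path = find_resource(descriptor, "data2")["path"]

print(resource_names)
print(data2_path)
```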
Requirements for Scientific Data
Tabular scientific data are often time series data, where one or more properties are recorded over a certain time scale, such as the temperature \(T\) or pressure \(p\) in a laboratory. In some cases one variable is changed with time and one or more variables are recorded to observe the change induced to a system. This could, for example, be the change in current \(I\) in a load by varying a voltage \(V\).
A CSV file contains the underlying data.
t,U,I
0,1,2.1
1,3,4.5
Warning
The unitpackage module currently only supports CSV files with a single header line. CSV files whose header contains additional lines or information must be converted first (see #23).
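What such a conversion looks like depends on the layout of the original file. As one possible case, the sketch below assumes a hypothetical layout in which the first line holds the column names and the second line holds the units. It writes a CSV with a single header line and collects the units as field descriptors for the schema. The function `split_units_row` is not part of unitpackage, and the field type `number` is an assumption.

```python
import csv
import io

# Assumed input layout (hypothetical): names on line 1, units on line 2.
raw = "t,U,I\ns,mV,uA\n0,1,0\n1,2,1\n"


def split_units_row(text):
    """Return (single-header CSV text, list of field descriptors with units)."""
    rows = list(csv.reader(io.StringIO(text)))
    names, units, data = rows[0], rows[1], rows[2:]

    # Write the names and the data rows back, dropping the units row.
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    writer.writerows(data)

    # The units move into the schema; "number" is assumed for every column.
    fields = [
        {"name": name, "type": "number", "unit": unit}
        for name, unit in zip(names, units)
    ]
    return out.getvalue(), fields


clean_csv, fields = split_units_row(raw)
print(clean_csv)
print(fields)
```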
The units are often not included in the field/column names. They can instead be specified in the resource schema of the datapackage.
Note
We suggest providing the units according to the astropy unit notation.
{
"profile": "tabular-data-package",
"name": "my-dataset",
"resources": [
{
"path": "data.csv",
"name": "data",
"profile": "tabular-data-resource",
"schema": {
"fields": [
{
"name": "t",
"type": "number",
"unit": "s"
},
{
"name": "U",
"type": "integer",
"unit": "mV"
},
{
"name": "I",
"type": "number",
"unit": "uA"
}
]
}
}
]
}
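Because the unit is stored as an ordinary key of each field, it can be read back with plain dictionary access. The following sketch collects the units per column and builds axis-style labels from them. The label format `name / unit` is chosen only for illustration, and the field types are omitted for brevity.

```python
# Field descriptors as in the schema above (types omitted for brevity).
schema = {
    "fields": [
        {"name": "t", "unit": "s"},
        {"name": "U", "unit": "mV"},
        {"name": "I", "unit": "uA"},
    ]
}

# Map each column name to its unit; fields without a unit map to None.
units = {field["name"]: field.get("unit") for field in schema["fields"]}

# Build labels such as "U / mV"; fall back to the bare name without a unit.
labels = [f"{name} / {unit}" if unit else name for name, unit in units.items()]

print(units)
print(labels)
```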
Additional metadata describing the underlying data or its origin is stored in the metadata descriptor of the resource. The metadata descriptor can hold several named metadata descriptors, each following a different metadata schema. This allows storing metadata in different formats or from different sources.
{
"resources": [
{
"name": "demo_package_metadata",
"type": "table",
"path": "demo_package_metadata.csv",
"scheme": "file",
"format": "csv",
"mediatype": "text/csv",
"encoding": "utf-8",
"schema": {
"fields": [
{
"name": "t",
"type": "number",
"unit": "s"
},
{
"name": "U",
"type": "number",
"unit": "mV"
},
{
"name": "I",
"type": "number",
"unit": "uA"
}
]
},
"metadata": {
"echemdb": {
"description": "Sample data for the unitpackage module.",
"curation": {
"process": [
{
"role": "experimentalist",
"name": "John Doe",
"laboratory": "Institute of Good Scientific Practice",
"date": "2021-07-09"
}
]
}
},
"generic": {
"description": "Sample data for the unitpackage module.",
"experimentalist": "John Doe",
"laboratory": "Institute of Good Scientific Practice",
"date recorded": "2021-07-09"
}
}
}
]
}
Warning
The unitpackage module currently only provides direct access to the resource metadata stored within the echemdb descriptor (see #20).
The above example can be found here, named demo_package_metadata. To demonstrate how the different properties of the datapackage can be accessed, we load this specific entry.
from unitpackage.collection import Collection
db = Collection.from_local('../files/')
entry = db['demo_package_metadata']
entry
Entry('demo_package_metadata')
The keys within the echemdb metadata descriptor are directly accessible as properties of the entry.
entry.curation
{'process': [{'role': 'experimentalist', 'name': 'John Doe', 'laboratory': 'Institute of Good Scientific Practice', 'date': '2021-07-09'}]}
Other metadata schemas are currently only accessible via the frictionless framework.
entry.package.get_resource("demo_package_metadata").custom["metadata"]["generic"]
{'description': 'Sample data for the unitpackage module.',
'experimentalist': 'John Doe',
'laboratory': 'Institute of Good Scientific Practice',
'date recorded': '2021-07-09'}
Unitpackage Interface
Upon closer inspection of the entry created with unitpackage, you will notice that it actually contains two resources.
entry.package
{'resources': [{'name': 'demo_package_metadata',
'type': 'table',
'path': 'demo_package_metadata.csv',
'scheme': 'file',
'format': 'csv',
'mediatype': 'text/csv',
'encoding': 'utf-8',
'schema': {'fields': [{'name': 't',
'type': 'number',
'unit': 's'},
{'name': 'U',
'type': 'number',
'unit': 'mV'},
{'name': 'I',
'type': 'number',
'unit': 'uA'}]},
'metadata': {'echemdb': {'description': 'Sample data for the '
'unitpackage module.',
'curation': {'process': [{'role': 'experimentalist',
'name': 'John '
'Doe',
'laboratory': 'Institute '
'of '
'Good '
'Scientific '
'Practice',
'date': '2021-07-09'}]}},
'generic': {'description': 'Sample data for the '
'unitpackage module.',
'experimentalist': 'John Doe',
'laboratory': 'Institute of Good '
'Scientific Practice',
'date recorded': '2021-07-09'}}},
{'name': 'echemdb',
'type': 'table',
'data': [],
'format': 'pandas',
'mediatype': 'application/pandas',
'schema': {'fields': [{'name': 't',
'type': 'number',
'unit': 's'},
{'name': 'U',
'type': 'number',
'unit': 'mV'},
{'name': 'I',
'type': 'number',
'unit': 'uA'}]}}]}
One resource is named after the CSV file. The units provided in that resource describe the data within that CSV.
The second resource is called echemdb. It is created when a datapackage is loaded with the unitpackage module and stores the data of the CSV as a pandas dataframe. The dataframe is directly accessible from the entry and shows the data from the echemdb resource. Directly after loading, the values and units in the CSV and in the pandas dataframe are identical.
entry.df.head(3)
|   | t | U | I |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 1 | 2 | 1 |
| 2 | 2 | 3 | 2 |
The reason for the separation into two resources is as follows. With unitpackage we can convert the values within the dataframe to new units. This process leaves the data in the CSV unchanged. Only the pandas dataframe and the units of the corresponding fields of the echemdb resource are updated.
rescaled_entry = entry.rescale({'U': 'V', 'I': 'mA'})
rescaled_entry.df.head(3)
|   | t | U     | I     |
|---|---|-------|-------|
| 0 | 0 | 0.001 | 0.000 |
| 1 | 1 | 0.002 | 0.001 |
| 2 | 2 | 0.003 | 0.002 |
rescaled_entry.package.get_resource('echemdb')
{'name': 'echemdb',
'type': 'table',
'data': [],
'format': 'pandas',
'mediatype': 'application/pandas',
'schema': {'fields': [{'name': 't', 'type': 'number', 'unit': 's'},
{'name': 'U', 'type': 'number', 'unit': 'V'},
{'name': 'I', 'type': 'number', 'unit': 'mA'}]}}
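The idea behind keeping two resources can be sketched without any library: rescaling works on copies of the values and of the field descriptors, so the description of the original data stays untouched. This is a conceptual illustration only and not how unitpackage implements `rescale`. The table `UNITS` and the function `rescale_columns` are hypothetical and cover just the units used on this page.

```python
import copy

# Hypothetical unit table: unit -> (base unit, factor relative to the base unit).
UNITS = {
    "V": ("V", 1.0),
    "mV": ("V", 1e-3),
    "A": ("A", 1.0),
    "mA": ("A", 1e-3),
    "uA": ("A", 1e-6),
}


def rescale_columns(columns, fields, new_units):
    """Return rescaled copies of the columns and fields; inputs are not modified."""
    new_columns = {name: list(values) for name, values in columns.items()}
    new_fields = copy.deepcopy(fields)
    for field in new_fields:
        target = new_units.get(field["name"])
        if target is None:
            continue  # no new unit requested for this column
        base_old, factor_old = UNITS[field["unit"]]
        base_new, factor_new = UNITS[target]
        if base_old != base_new:
            raise ValueError(f"cannot convert {field['unit']} to {target}")
        scale = factor_old / factor_new
        new_columns[field["name"]] = [v * scale for v in columns[field["name"]]]
        field["unit"] = target
    return new_columns, new_fields


# Values and units as shown for the demo entry above.
columns = {"t": [0, 1, 2], "U": [1, 2, 3], "I": [0, 1, 2]}
fields = [
    {"name": "t", "type": "number", "unit": "s"},
    {"name": "U", "type": "number", "unit": "mV"},
    {"name": "I", "type": "number", "unit": "uA"},
]

rescaled_columns, rescaled_fields = rescale_columns(
    columns, fields, {"U": "V", "I": "mA"}
)
print(rescaled_fields)
```

After the call, `fields` and `columns` still describe the original data in mV and uA, while `rescaled_fields` and `rescaled_columns` carry the new units and values.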
Refer to the usage section for a more detailed description of the unitpackage API.