Unitpackage Structure

The unitpackage module extends the Python API of the frictionless framework. To create a unitpackage entry or a unitpackage collection, a frictionless datapackage must follow a certain structure (schema). We briefly illustrate the structure of frictionless datapackages, describe which changes were necessary to adapt the schema to scientific data, and describe the structure of the datapackage for use with unitpackage.

Frictionless Datapackage

The description of the frictionless datapackage presented here is based on examples adapted from the frictionless documentation.

A minimal datapackage in your file system consists of two files:

data.csv
datapackage.json

The CSV file contains some data. For unitpackage we focus on CSV files which contain numbers. Such data is common in the natural sciences.

var1,var2,var3
a,1,2.1
b,3,4.5

In the corresponding JSON file the data in the CSV is described in a resource.

{
  "profile": "tabular-data-package",
  "name": "my-dataset",
  // here we list the data files in this dataset
  "resources": [
    {
      "path": "data.csv",
      "name": "data",
      "profile": "tabular-data-resource",
      "schema": {
        "fields": [
          {
            "name": "var1",
            "type": "string"
          },
          {
            "name": "var2",
            "type": "integer"
          },
          {
            "name": "var3",
            "type": "number"
          }
        ]
      }
    }
  ]
}
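
The link between the two files can be checked with a few lines of plain Python. The following is a minimal sketch that uses only the standard library, not the frictionless API: it compares the field names declared in the resource schema with the header line of the CSV. The in-memory strings stand in for data.csv and datapackage.json, the data rows are made up for illustration, and the helper names `schema_field_names` and `csv_header` are hypothetical. The `//` comment from the listing above is left out because the `json` module does not accept comments.

```python
import csv
import io
import json

# Hypothetical in-memory stand-ins for data.csv and datapackage.json.
CSV_TEXT = "var1,var2,var3\na,1,2.1\nb,3,4.5\n"
DESCRIPTOR_TEXT = """
{
  "profile": "tabular-data-package",
  "name": "my-dataset",
  "resources": [
    {
      "path": "data.csv",
      "name": "data",
      "profile": "tabular-data-resource",
      "schema": {
        "fields": [
          {"name": "var1", "type": "string"},
          {"name": "var2", "type": "integer"},
          {"name": "var3", "type": "number"}
        ]
      }
    }
  ]
}
"""


def schema_field_names(descriptor, resource_name):
    """Return the field names declared in the schema of the named resource."""
    for resource in descriptor["resources"]:
        if resource["name"] == resource_name:
            return [field["name"] for field in resource["schema"]["fields"]]
    raise KeyError(resource_name)


def csv_header(text):
    """Return the first row of a CSV document as a list of column names."""
    return next(csv.reader(io.StringIO(text)))


descriptor = json.loads(DESCRIPTOR_TEXT)
declared = schema_field_names(descriptor, "data")
found = csv_header(CSV_TEXT)
print(declared == found)  # True when schema and CSV header agree
```

This only checks the column names. Validating types and values against the schema is the job of the frictionless framework itself.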

A frictionless datapackage can have multiple resources.

{
  "profile": "tabular-data-package",
  "name": "my-dataset",
  // here we list the data files in this dataset
  "resources": [
    {
      "path": "data.csv",
      "name": "data",
      "profile": "tabular-data-resource",
      "schema": {
        "fields": [
          {
            "...": "..."
          }
        ]
      }
    },
    {
      "path": "data2.csv",
      "name": "data2",
      "profile": "tabular-data-resource",
      "schema": {
        "fields": [
          {
            "...": "..."
          }
        ]
      }
    }
  ]
}

Requirements for Scientific Data

Tabular scientific data are often time series data, where one or more properties are recorded over a certain time scale, such as the temperature \(T\) or pressure \(p\) in a laboratory. In some cases one variable is changed with time and one or more variables are recorded to observe the change induced to a system. This could, for example, be the change in current \(I\) in a load by varying a voltage \(V\).

A CSV file contains the underlying data.

t,U,I
0,1,2.1
1,3,4.5

Warning

The unitpackage module currently only supports CSV files with a single header line. CSV files whose headers span several lines or include additional information must be converted first (see #23).
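
One possible conversion is sketched below. It is not part of unitpackage and assumes the simplest layout: the additional information sits on lines before the column names, and the caller knows on which line the column names are found. The function name `strip_extra_header_lines`, its `header_row_index` parameter, and the sample file content are hypothetical.

```python
import csv
import io


def strip_extra_header_lines(text, header_row_index):
    """Drop every line before the row that holds the column names.

    header_row_index is the zero-based line number of the column names.
    It is an assumption that the caller has to supply.
    """
    lines = text.splitlines()
    kept = lines[header_row_index:]
    return "\n".join(kept) + "\n"


# Hypothetical file content with two lines of extra information.
RAW = (
    "# instrument: hypothetical potentiostat\n"
    "# operator: John Doe\n"
    "t,U,I\n"
    "0,1,0\n"
    "1,2,1\n"
)

clean = strip_extra_header_lines(RAW, header_row_index=2)
rows = list(csv.reader(io.StringIO(clean)))
print(rows[0])  # ['t', 'U', 'I']
```

Files with other layouts, for example a separate line holding the units below the column names, need a different treatment.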

The units are often not included in the field/column names. They can instead be included in the resource schema of the datapackage.

Note

We suggest providing the units according to the astropy unit notation.

{
  "profile": "tabular-data-package",
  "name": "my-dataset",
  "resources": [
    {
      "path": "data.csv",
      "name": "data",
      "profile": "tabular-data-resource",
      "schema": {
        "fields": [
          {
            "name": "t",
            "type": "number",
            "unit": "s"
          },
          {
            "name": "U",
            "type": "integer",
            "unit": "mV"
          },
          {
            "name": "I",
            "type": "number",
            "unit": "uA"
          }
        ]
      }
    }
  ]
}
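
Since `unit` is just an additional key on each field, it can be read back with the standard library alone. The sketch below is not the unitpackage API. It collects the units of one resource into a dictionary, and the helper name `field_units` is hypothetical. The `type` keys are omitted from the embedded descriptor for brevity.

```python
import json

# Shortened copy of the descriptor above (the "type" keys are omitted).
DESCRIPTOR_TEXT = """
{
  "name": "my-dataset",
  "resources": [
    {
      "path": "data.csv",
      "name": "data",
      "schema": {
        "fields": [
          {"name": "t", "unit": "s"},
          {"name": "U", "unit": "mV"},
          {"name": "I", "unit": "uA"}
        ]
      }
    }
  ]
}
"""


def field_units(descriptor, resource_name):
    """Map each field name of the named resource to its unit.

    Fields without a "unit" key are mapped to None.
    """
    for resource in descriptor["resources"]:
        if resource["name"] == resource_name:
            fields = resource["schema"]["fields"]
            return {field["name"]: field.get("unit") for field in fields}
    raise KeyError(resource_name)


units = field_units(json.loads(DESCRIPTOR_TEXT), "data")
print(units)  # {'t': 's', 'U': 'mV', 'I': 'uA'}
```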

Additional metadata describing the underlying data or its origin is stored in the resource metadata descriptor. The metadata can contain several metadata descriptors, each stored under its own key and following a different metadata schema. This allows storing metadata in different formats or from different sources.

{
    "resources": [
        {
            "name": "demo_package_metadata",
            "type": "table",
            "path": "demo_package_metadata.csv",
            "scheme": "file",
            "format": "csv",
            "mediatype": "text/csv",
            "encoding": "utf-8",
            "schema": {
                "fields": [
                    {
                        "name": "t",
                        "type": "number",
                        "unit": "s"
                    },
                    {
                        "name": "U",
                        "type": "number",
                        "unit": "mV"
                    },
                    {
                        "name": "I",
                        "type": "number",
                        "unit": "uA"
                    }
                ]
            },
            "metadata": {
                "echemdb": {
                    "description": "Sample data for the unitpackage module.",
                    "curation": {
                        "process": [
                            {
                                "role": "experimentalist",
                                "name": "John Doe",
                                "laboratory": "Institute of Good Scientific Practice",
                                "date": "2021-07-09"
                            }
                        ]
                    }
                },
                "generic": {
                    "description": "Sample data for the unitpackage module.",
                    "experimentalist": "John Doe",
                    "laboratory": "Institute of Good Scientific Practice",
                    "date recorded": "2021-07-09"
                }
            }
        }
    ]
}
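
Each descriptor sits under its own key in `metadata`, so selecting one is a plain dictionary lookup. The sketch below illustrates this with the standard library only, independent of unitpackage and frictionless. The helper name `metadata_descriptor` is hypothetical, and the embedded resource is a shortened copy of the listing above.

```python
import json

# Shortened copy of the resource above, reduced to its metadata.
RESOURCE_TEXT = """
{
  "name": "demo_package_metadata",
  "metadata": {
    "echemdb": {
      "description": "Sample data for the unitpackage module.",
      "curation": {
        "process": [
          {
            "role": "experimentalist",
            "name": "John Doe",
            "laboratory": "Institute of Good Scientific Practice",
            "date": "2021-07-09"
          }
        ]
      }
    },
    "generic": {
      "description": "Sample data for the unitpackage module.",
      "experimentalist": "John Doe",
      "laboratory": "Institute of Good Scientific Practice",
      "date recorded": "2021-07-09"
    }
  }
}
"""


def metadata_descriptor(resource, schema_name):
    """Return the metadata stored under one schema name, or an empty dict."""
    return resource.get("metadata", {}).get(schema_name, {})


resource = json.loads(RESOURCE_TEXT)
echemdb = metadata_descriptor(resource, "echemdb")
generic = metadata_descriptor(resource, "generic")

# The same person is recorded in both descriptors, but at different places.
print(echemdb["curation"]["process"][0]["name"])  # John Doe
print(generic["experimentalist"])  # John Doe
```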

Warning

The unitpackage module currently only provides direct access to the resource metadata stored within the echemdb descriptor. (see #20)

The above example can be found here under the name demo_package_metadata. To demonstrate how the different properties of the datapackage can be accessed, we load this entry.

from unitpackage.collection import Collection

db = Collection.from_local('../files/')
entry = db['demo_package_metadata']
entry
Entry('demo_package_metadata')

The keys within the echemdb metadata descriptor are directly accessible as properties from the main entry.

entry.curation
{'process': [{'role': 'experimentalist', 'name': 'John Doe', 'laboratory': 'Institute of Good Scientific Practice', 'date': '2021-07-09'}]}

Other metadata schemas are currently only accessible via the frictionless framework.

entry.package.get_resource("demo_package_metadata").custom["metadata"]["generic"]
{'description': 'Sample data for the unitpackage module.',
 'experimentalist': 'John Doe',
 'laboratory': 'Institute of Good Scientific Practice',
 'date recorded': '2021-07-09'}

Unitpackage Interface

Upon closer inspection of the entry created with unitpackage, you will notice that it actually contains two resources.

entry.package
{'resources': [{'name': 'demo_package_metadata',
                'type': 'table',
                'path': 'demo_package_metadata.csv',
                'scheme': 'file',
                'format': 'csv',
                'mediatype': 'text/csv',
                'encoding': 'utf-8',
                'schema': {'fields': [{'name': 't',
                                       'type': 'number',
                                       'unit': 's'},
                                      {'name': 'U',
                                       'type': 'number',
                                       'unit': 'mV'},
                                      {'name': 'I',
                                       'type': 'number',
                                       'unit': 'uA'}]},
                'metadata': {'echemdb': {'description': 'Sample data for the '
                                                        'unitpackage module.',
                                         'curation': {'process': [{'role': 'experimentalist',
                                                                   'name': 'John '
                                                                           'Doe',
                                                                   'laboratory': 'Institute '
                                                                                 'of '
                                                                                 'Good '
                                                                                 'Scientific '
                                                                                 'Practice',
                                                                   'date': '2021-07-09'}]}},
                             'generic': {'description': 'Sample data for the '
                                                        'unitpackage module.',
                                         'experimentalist': 'John Doe',
                                         'laboratory': 'Institute of Good '
                                                       'Scientific Practice',
                                         'date recorded': '2021-07-09'}}},
               {'name': 'echemdb',
                'type': 'table',
                'data': [],
                'format': 'pandas',
                'mediatype': 'application/pandas',
                'schema': {'fields': [{'name': 't',
                                       'type': 'number',
                                       'unit': 's'},
                                      {'name': 'U',
                                       'type': 'number',
                                       'unit': 'mV'},
                                      {'name': 'I',
                                       'type': 'number',
                                       'unit': 'uA'}]}}]}

One resource is named after the CSV file. The units provided in that resource describe the data within that CSV.

The second resource is called echemdb. It is created upon loading a datapackage with the unitpackage module and stores the data of the CSV as a pandas dataframe. The dataframe is directly accessible from the entry and shows the data from the echemdb resource. Upon loading the data, both the numbers and units in the CSV and pandas dataframe are identical.

entry.df.head(3)
   t  U  I
0  0  1  0
1  1  2  1
2  2  3  2

The reason for the separation into two resources is as follows. With unitpackage we can transform the values within the dataframe to new units. This process leaves the data in the CSV unchanged. The pandas dataframe, in turn, is updated, and so are the units of the corresponding fields of the echemdb resource.

rescaled_entry = entry.rescale({'U': 'V', 'I': 'mA'})
rescaled_entry.df.head(3)
   t      U      I
0  0  0.001  0.000
1  1  0.002  0.001
2  2  0.003  0.002
rescaled_entry.package.get_resource('echemdb')
{'name': 'echemdb',
 'type': 'table',
 'data': [],
 'format': 'pandas',
 'mediatype': 'application/pandas',
 'schema': {'fields': [{'name': 't', 'type': 'number', 'unit': 's'},
                       {'name': 'U', 'type': 'number', 'unit': 'V'},
                       {'name': 'I', 'type': 'number', 'unit': 'mA'}]}}
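
The numbers in the rescaled dataframe follow from the ratio of the unit prefixes: 1 mV = 0.001 V and 1 uA = 0.001 mA. The sketch below reproduces this arithmetic with the standard library. It is not how unitpackage performs the conversion. The prefix table `PREFIX` and the helper `conversion_factor` are hypothetical and only cover the prefixes used in this example.

```python
from fractions import Fraction

# Hypothetical, deliberately tiny table of SI prefixes. Real unit handling
# (for example compound units) needs a proper units library.
PREFIX = {"": Fraction(1), "m": Fraction(1, 10**3), "u": Fraction(1, 10**6)}


def conversion_factor(old_unit, new_unit, base):
    """Return the factor that converts values from old_unit to new_unit.

    Both units must consist of a known prefix followed by the same base
    unit, e.g. "mV" and "V" with base "V".
    """
    if not (old_unit.endswith(base) and new_unit.endswith(base)):
        raise ValueError("units must share the base unit " + base)
    old_prefix = old_unit[: -len(base)]
    new_prefix = new_unit[: -len(base)]
    return PREFIX[old_prefix] / PREFIX[new_prefix]


# The first three rows of the U and I columns shown above.
U_mV = [1, 2, 3]
I_uA = [0, 1, 2]

U_V = [float(value * conversion_factor("mV", "V", "V")) for value in U_mV]
I_mA = [float(value * conversion_factor("uA", "mA", "A")) for value in I_uA]

print(U_V)   # [0.001, 0.002, 0.003]
print(I_mA)  # [0.0, 0.001, 0.002]
```

Exact fractions are used for the prefixes so that the factors do not pick up floating-point rounding before the final conversion to `float`.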

Refer to the usage section for a more detailed description of the unitpackage API.