TL;DR: Easydata’s Dataset dependency hypergraph, described.
Hypergraph or Bipartite Graph?
For this post, I’m still talking about the hypergraph of data dependencies that I mentioned last time; however, for this discussion, I’ll switch from a hypergraph-based description to a bipartite graph-based description of the dependencies.
Why? For starters, there’s not necessarily a commonly accepted notion of a directed hypergraph. When I use the term, I mean a hypergraph where the vertices of each edge are partitioned into two sets: the head-set and the tail-set of the edge.
It’s perhaps interesting (and often surprising) to note the constructs that appear when trying to describe data flow as a directed hypergraph. In our case, we often end up with a hypergraph where data originates from a transformer function (like when we have synthetic or downloaded data). This leads to a directed hyperedge with no input nodes, only output nodes; i.e. the head-set is empty, but the tail-set is not. What does one even call that? A source edge?
Anyway, to avoid some of these rabbit holes, we can switch to a bipartite graph representation of this construct. The two representations (hypergraph, bipartite graph) are interchangeable. To construct the bipartite graph, list the transformers (the hyper “edges”) down one side of the page, Datasets (the hyper “nodes”) down the other, and join them with directed edges to indicate data dependencies (inbound edges to a transformer are its input datasets, outbound edges are its output datasets).
More on the Dataset Graph
A Dataset is an on-disk object representing a point-in-time snapshot (a cached copy) of data and its associated metadata. The Dataset objects themselves are serialized to data/processed. Metadata about these objects is serialized to catalog/datasets.json.
A Transformer is a function that takes in zero or more Dataset objects and produces one or more Dataset objects. While the functions themselves are stored in the source module (by default in src/user/transformers.py), metadata describing these functions and their input/output Dataset objects is serialized to the catalog file catalog/transformers.json.
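To make this concrete, here is a minimal sketch of what a transformer function might look like. It is purely illustrative: the function name, its signature, and the use of a plain scikit-learn Bunch as a stand-in for an Easydata Dataset are assumptions for the example, not Easydata’s actual API.

# Hedged sketch: a transformer-style function that consumes one Dataset-like
# object and produces a new one. Uses sklearn's Bunch as a stand-in Dataset.
from sklearn.utils import Bunch

def lowercase_text(dset, **kwargs):
    """Return a new Dataset-like object with lowercased text data."""
    metadata = dict(dset.metadata)
    metadata["lowercased"] = True
    return Bunch(data=[s.lower() for s in dset.data],
                 target=dset.target,
                 metadata=metadata)

# Toy usage:
raw = Bunch(data=["Hello", "World"], target=None, metadata={"source": "toy"})
cleaned = lowercase_text(raw)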
We’ll define the DatasetGraph as the bipartite graph formed by the two distinct sets of vertices above: Dataset objects and Transformer functions. The edges of this graph are directed, indicating the direction of dependency from the perspective of the Transformer; i.e. since output_datasets depend on input_datasets, arrows are directed from input Dataset objects to Transformer functions, and from Transformer functions to output Dataset objects.
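As a quick illustration, here is one way such a bipartite edge list could be assembled from catalog-style entries. The transformer entries and node-naming scheme below are invented for the example; only the input-to-transformer and transformer-to-output edge directions follow the description above.

# Illustrative only: build directed edges for a small DatasetGraph.
transformer_catalog = {
    "split_train_test": {
        "input_datasets": ["raw_corpus"],
        "output_datasets": ["train", "test"],
    },
    "make_synthetic": {  # a "source edge": no input datasets, one output
        "input_datasets": [],
        "output_datasets": ["synthetic_blobs"],
    },
}

edges = []
for tname, spec in transformer_catalog.items():
    for ds in spec["input_datasets"]:
        edges.append((("dataset", ds), ("transformer", tname)))   # input -> transformer
    for ds in spec["output_datasets"]:
        edges.append((("transformer", tname), ("dataset", ds)))   # transformer -> output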
The whole goal of this exercise is to capture the information about the data transformations, from raw data to processed data, in a way that can be serialized to disk and committed as if it were code. These instructions are stored in the data catalog in JSON format. There is some trickiness here, as function objects don’t serialize in a platform-independent way, so we just make some assumptions about namespaces (we set up a standard location in the python module for user-generated functions: src.user.*), and use Python introspection to map the serialization back to function objects when the pipeline is loaded.
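A minimal sketch of the kind of name-to-function lookup this implies, assuming serialized transformations refer to functions by name and that those functions live under a known namespace. The helper name and default module below are assumptions for illustration, not Easydata’s actual implementation.

# Hedged sketch: resolve a serialized function name back to a callable.
import importlib

def resolve_function(name, default_module="src.user.transformers"):
    """Accept either a fully qualified name ("src.user.transformers.clean_text")
    or a bare name looked up in the default module."""
    if "." in name:
        module_name, func_name = name.rsplit(".", 1)
    else:
        module_name, func_name = default_module, name
    module = importlib.import_module(module_name)
    return getattr(module, func_name)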
Transformer Serialization
Note that transformers can take zero datasets as input (but must produce at least one output). This special case occurs in one of two ways:
- Synthetic Data: The data is synthetic, and the transformer actually generates a Dataset object from scratch. The JSON in this case looks like:

  "synthetic_source": {
      "output_dataset": "ds_name",
      "transformations": [
          (synthetic_generator, kwargs_dict),
          (optional_function_2, kwargs_dict_2),
          ...
      ],
  }
- Data Conversion: The data originates from something that isn’t a Dataset (e.g. a DataSource object), and the transformer converts it to a Dataset. This is really no different than the synthetic data case, except we supply a dataset_from_datasource() wrapper so the user doesn’t have to constantly reimplement it:

  "datasource_edge": {
      "output_dataset": "ds_name",
      "transformations": [
          (dataset_from_datasource, {"datasource_name": datasource_name, **datasource_opts}),
          (optional_function_2, kwargs_dict_2),
          ...
      ],
  }
In all other cases, a transformer consumes one or more Datasets as input and emits one or more Datasets as output:

"hyperedge": {
    "input_datasets": [in_dset_1, in_dset_2],
    "output_datasets": [out_dset_1, out_dset_2],
    "transformations": [
        (function_1, kwargs_dict_1),
        (function_2, kwargs_dict_2),
        ...
    ],
    "suppress_output": False,  # defaults to True
},
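For intuition only, here is a sketch of how a hyperedge entry like the one above could be executed: apply the transformation functions in sequence, each receiving the datasets produced so far plus its keyword arguments. The calling convention (each function takes and returns a list of datasets) is an assumption for the example, not a description of Easydata’s actual runner.

# Hedged sketch: apply a hyperedge's transformation chain to its input datasets.
def apply_transformations(input_datasets, transformations):
    datasets = list(input_datasets)
    for func, kwargs in transformations:
        datasets = func(datasets, **kwargs)  # assumed convention: list in, list out
    return datasets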
Dataset Serialization
A complete Dataset object contains both the data itself and an associated metadata dictionary. On disk, this is serialized to two files, typically located in paths['processed_data_path']:
- dataset_name.dataset: The complete Dataset object
- dataset_name.metadata: A copy of the metadata portion of the Dataset. As the Dataset can be quite large, metadata-only operations save time and memory by reading this file instead. If the Dataset has been reproducibly generated, this metadata should match whatever is serialized into the dataset catalog.
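As a hedged illustration of why the metadata sidecar file is useful, here is one way a metadata-only read might look. The helper name, and the assumption that the .metadata file is JSON, are mine for the example; they are not statements about Easydata’s actual on-disk format.

# Hedged sketch: read only the (small) metadata file, skipping the full Dataset.
import json
from pathlib import Path

def peek_metadata(dataset_name, processed_data_path):
    meta_file = Path(processed_data_path) / f"{dataset_name}.metadata"
    with open(meta_file) as f:
        return json.load(f)  # assumes JSON; the real format may differ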
One of the design goals of Easydata is that this processed dataset can be deleted at any time and (reproducibly and deterministically) recreated when needed.
Dataset Metadata
The master copy of the generated metadata is stored in the catalog file: catalog/datasets.json.
Dataset metadata is fairly freeform. It is based on scikit-learn’s Bunch object (basically a dictionary where the keys can be accessed as attributes). This object typically contains 4 attributes: .data, .target (which is often None for unsupervised learning problems), .metadata, and .hashes. The latter contains a hash of all the non-metadata attributes of the Dataset; e.g.
"hashes": {
"data":"sha1:d1d5ac9a5872e09b3a88618177dccc481df022d1",
"target":"sha1:38f65f3b11da4851aaaccc19b1f0cf4d3806f83b",
},
where data and target are whatever data types make sense for the problem at hand (e.g. matrix, pandas DataFrame, numpy array, etc.).
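For illustration, here is one way hashes of this form could be computed. The exact hashing scheme (which bytes get hashed, how dataframes are normalized) is an assumption here, not a description of Easydata’s internals.

# Hedged sketch: compute "sha1:<hexdigest>" strings for array-like attributes.
import hashlib
import numpy as np

def hash_value(value):
    """Illustrative only; hashing real dataframes/matrices needs more care."""
    return "sha1:" + hashlib.sha1(np.asarray(value).tobytes()).hexdigest()

hashes = {
    "data": hash_value(np.arange(10)),
    "target": hash_value(np.zeros(10)),
}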