TL;DR: How the Dataset DAG became a hypergraph became the DatasetGraph.
DatasetGraph as a top-level object.
Recall from a few weeks ago, I described a bipartite graph (or Hypergraph), now called a DatasetGraph
, which describes how Dataset
objects are generated from other Dataset
objects. I originally named it a TransformerGraph
, because that’s how the directionality of the edges works out in the bipartite representation, but that turns out to be a little more confusing for the user. In the hypergraph, the Dataset
objects are the nodes, so DatasetGraph
it is.
One of the unintended consequences of introducing a DatasetGraph
class in Easydata is that it turns out to be the right place to do a lot of things. That’s why we ended up exposing it to the user, instead of just using it internally to the Dataset
.
Before we created the DatasetGraph
, we used to have a top-level functions add_transformer()
to add a dataset transformation to the global catalog. but it turns out a much more natural place to put it is in the DatasetGraph
class directly.
Sticking with the “edges are functions, nodes are datasets” hypergraph terminology, the API becomes something like this”
>>> dag = DatasetGraph()
>>> xp = create_transformer_pipeline([list, of, transformer, functions, or, partials, ...])
>>> dag.add_source(datasource_name="dsrc_name", datasource_opts={}, output_dataset="dset_name")
>>> dag.add_edge(input_dataset=None, input_datasets=(),
output_dataset=None, output_datasets=(),
transformer_pipeline=xp, **kwargs)
>>> dataset = dag.generate('node_name')
This gives us a clean separation between adding a source node to the graph, and adding an edge. Both technically add edges, but the idea of a “source edge” in this hypergraph just feels weird, so the details are hidden by this API. This is perhaps why describing the dependency graph as a bipartite graph is less troublesome. See my last post for more on that.