TL;DR: Caching is finicky, but magical when you get it right.
Cache is Magic
My self-declared milestone for an Easydata 1.0 release is being able to do this:
>>> ds = Dataset.load('dataset_name')
And having it “just work” regardless of whether the Dataset is already on disk, or whether it needs to be regenerated by traversing the DatasetGraph and rebuilding some or all of the intermediate Dataset objects (including raw data fetches, if necessary).
A magical component of this generation process is the caching: if a Dataset is already on disk and matches the hashes recorded in the Dataset catalog, the generation step is skipped. Seems easy enough, but as with most things in software, “do what I mean” turns out to be much, much harder than I secretly hoped. The good news is, the implementation is starting to just work.
After much usability wrangling, here’s how we cache Dataset objects in Easydata.
Datasets and Metadata
Recall, a Dataset is a set of binary blobs with standard names like .data and .target, along with its associated metadata.
Metadata is not an afterthought. It’s an essential component of the Dataset. Metadata can be anything that is JSON-serializable (in fact, under the hood, it’s just a dict), but usually contains:
- the .DESCR (readme) text, describing what this dataset is all about.
- the .LICENSE, listing the conditions under which this data can be used.
- .HASHES: hash values for each of the binary attributes like data and target (essential for data provenance).
- Any other information that you want to keep with the data itself, and preserve through the Dataset transformation process.
Though under the hood it’s implemented as a dict, we steal a great idea from the sklearn Bunch object and tweak it a bit to make metadata access easier. In addition to the standard dictionary-style access, metadata is accessible by referring to uppercase property names; e.g. ds.LICENSE returns the metadata stored at ds.metadata['license'].
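As a rough illustration of that access pattern, here is a minimal sketch assuming a plain dict subclass (illustrative only, not the actual Easydata implementation):

```python
class MetadataBunch(dict):
    """Sketch of Bunch-style access: uppercase attributes map to lowercase keys."""

    def __getattr__(self, name):
        # __getattr__ only fires when normal attribute lookup fails,
        # so regular dict methods are unaffected.
        if name.isupper() and name.lower() in self:
            return self[name.lower()]
        raise AttributeError(name)


meta = MetadataBunch(license="CC-BY-4.0", descr="A short readme blurb")
assert meta.LICENSE == meta["license"]
assert meta.DESCR == "A short readme blurb"
```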
It’s important (as you’ll see in a second) that this metadata is both hashable and JSON-serializable.
How caching works (in Easydata)
The global Dataset catalog is a dictionary of the form:
{dataset_name: str, dataset_metadata: dict}
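For illustration, a catalog entry might look something like this (the field names mirror the metadata described above; the values and the exact hash format are placeholders, not Easydata’s real schema):

```python
catalog = {
    "beer_review_all": {
        "descr": "Readme text describing the dataset ...",
        "license": "Terms under which the data may be used ...",
        "hashes": {
            # placeholder values, not real hashes
            "data": "sha1:<hash-of-data-blob>",
            "target": "sha1:<hash-of-target-blob>",
        },
    },
}
```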
Caching works by hashing the metadata dictionary (which includes the data hashes) and using that hash as the filename for the cached copy of the dataset. Caches are stored in paths['cache_path'], and consist of a pair of files: <metadata_hash>.dataset and <metadata_hash>.metadata.
-rw-r--r-- 1 hackalog staff 301636179 9 May 14:22 1b1adb100d8088955878a9d7b3d071710c2db3bf.dataset
-rw-r--r-- 1 hackalog staff 478 9 May 14:22 1b1adb100d8088955878a9d7b3d071710c2db3bf.metadata
-rw-r--r-- 1 hackalog staff 301636175 9 May 14:21 756974a0ce41ffb9f53b47c234cd1e8b721dacfd.dataset
-rw-r--r-- 1 hackalog staff 474 9 May 14:21 756974a0ce41ffb9f53b47c234cd1e8b721dacfd.metadata
The .dataset file is a joblib serialization of the Dataset object. The .metadata file is a JSON file containing just the metadata dictionary, which is handy if we don’t want to spend the time loading the whole dataset just to get at its hashes, say.
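A minimal sketch of what that write path could look like, assuming SHA-1 over the sorted, JSON-serialized metadata (an assumption that happens to match the 40-character filenames above, not necessarily Easydata’s exact recipe; the helper names are hypothetical):

```python
import hashlib
import json
from pathlib import Path

import joblib


def cache_dataset(ds, metadata: dict, cache_path: Path) -> str:
    """Write <hash>.dataset (joblib) and <hash>.metadata (JSON) to the cache."""
    digest = hashlib.sha1(
        json.dumps(metadata, sort_keys=True).encode("utf-8")
    ).hexdigest()
    joblib.dump(ds, cache_path / f"{digest}.dataset")
    (cache_path / f"{digest}.metadata").write_text(json.dumps(metadata, indent=2))
    return digest


def peek_metadata(digest: str, cache_path: Path) -> dict:
    """Read only the small .metadata file, e.g. to check hashes quickly."""
    return json.loads((cache_path / f"{digest}.metadata").read_text())
```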
Once in a while, a Dataset is in a polished enough form that we dump it directly to a named Dataset in the paths['processed_data_path'] directory. We often do this at the end of a data cleaning session, or after an analysis. The idea is that we can blow away the paths['interim_data_path'] or paths['cache_path'] directory to reclaim disk space, and still have our generated Dataset objects available.
-rw-r--r-- 1 hackalog staff 301636179 9 May 14:22 beer_review_all.dataset
-rw-r--r-- 1 hackalog staff 478 9 May 14:22 beer_review_all.metadata
Note that these are exactly the same as their associated cache files: 1b1adb100d8088955878a9d7b3d071710c2db3bf.{dataset|metadata}.
The end result is that we can accumulate multiple versions of a Dataset in the cache directory, and continue to use them so long as we have the disk space.
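Putting the pieces together, the lookup-or-rebuild behaviour described in this post might look roughly like the following sketch (hypothetical names, same assumed metadata-hash filename scheme as above):

```python
import hashlib
import json
from pathlib import Path

import joblib


def load_or_generate(name: str, catalog: dict, cache_path: Path, generate):
    """Return the cached Dataset for `name` if one matches the catalog metadata;
    otherwise rebuild it with `generate(name)` and cache the result."""
    metadata = catalog[name]
    digest = hashlib.sha1(
        json.dumps(metadata, sort_keys=True).encode("utf-8")
    ).hexdigest()
    ds_file = cache_path / f"{digest}.dataset"
    if ds_file.exists():              # cached copy matches the catalog hashes
        return joblib.load(ds_file)
    ds = generate(name)               # e.g. walk the DatasetGraph, fetch raw data
    joblib.dump(ds, ds_file)
    (cache_path / f"{digest}.metadata").write_text(json.dumps(metadata, indent=2))
    return ds
```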
At some point, we’d love for this cache to be shared within a workgroup, but that’s a feature for another day.