Cache is Magic

Hackalog · May 6, 2020

TL;DR: Caching is finicky, but magical when you get it right.


My self-declared milestone for an Easydata 1.0 release is being able to do this:

>>> ds = Dataset.load('dataset_name')

And having it “just work” regardless of whether the Dataset is already on disk, or if it needs to be regenerated by traversing the DatasetGraph and regenerating some or all of the intermediate Dataset objects (including raw data fetches, if necessary).

A magical component of this generation is the caching: if a Dataset is already on disk and matches the hashes recorded in the Dataset catalog, the generation step is skipped. Seems easy enough, but as with most things in software, “do what I mean” turns out to be much, much harder than I secretly hoped. The good news is that the implementation is starting to just work.
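In pseudocode, that “just work” behavior boils down to a cache check before regeneration. Here is a simplified sketch, where cache_key (defined in the hashing sketch below), regenerate_from_graph, read_cached, and write_cache are illustrative stand-ins for the real machinery:

from pathlib import Path

def load(dataset_name, catalog, cache_dir):
    """Simplified sketch of Dataset.load(); helper names are illustrative."""
    metadata = catalog[dataset_name]                    # the catalog maps name -> metadata dict
    cached = Path(cache_dir) / f"{cache_key(metadata)}.dataset"
    if cached.exists():
        return read_cached(cached)                      # cache hit: skip generation entirely
    ds = regenerate_from_graph(dataset_name)            # cache miss: traverse the DatasetGraph,
                                                        # fetching raw data if necessary
    write_cache(ds, cached)
    return ds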

After much usability wrangling, here’s how we cache Dataset objects in Easydata.

Datasets and Metadata

Recall that a Dataset is a set of binary blobs with standard names like .data and .target, along with its associated metadata.

Metadata is not an afterthought. It’s an essential component of the Dataset. Metadata can be anything that is JSON-serializable (in fact, under the hood, it’s just a dict), but usually contains:

  • the .DESCR (readme) text, describing what this dataset is all about.
  • the .LICENSE, listing the conditions under which this data can be used.
  • .HASHES: hash values for each of the binary attributes like data and target (essential for data provenance)
  • Any other information that you want to keep with the data itself, and preserve through the Dataset transformation process.
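For illustration, a (completely made-up) metadata dict might look something like this:

metadata = {
    "descr": "An example dataset: one row per observation, one column per feature.",
    "license": "CC-BY-4.0",
    "hashes": {
        "data": "sha1:<hash of the .data blob>",
        "target": "sha1:<hash of the .target blob>",
    },
    "added_by": "hackalog",  # anything else worth keeping with the data
}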

Though under the hood it’s implemented as a dict, we steal a great idea from the sklearn Bunch object and tweak it a bit to make metadata access easier. In addition to the standard dictionary-style access, metadata is accessible via uppercase property names; e.g. ds.LICENSE returns the metadata stored at ds.metadata['license'].
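A minimal sketch of that trick (not the actual Easydata implementation) is a dict subclass that routes uppercase attribute lookups to lowercase keys:

class Metadata(dict):
    """A dict with uppercase attribute access, à la sklearn's Bunch."""
    def __getattr__(self, name):
        if name.isupper():
            try:
                return self[name.lower()]
            except KeyError:
                raise AttributeError(name) from None
        raise AttributeError(name)

meta = Metadata(license="MIT", descr="An example dataset")
assert meta.LICENSE == meta["license"] == "MIT"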

It’s important (as you’ll see in a second) that this metadata is both hashable and JSON-serializable.

How caching works (in Easydata)

The global Dataset catalog is a dictionary of the form:

{dataset_name (str): dataset_metadata (dict)}

Caching works by hashing the metadata dictionary (which includes the data hashes) and using this hash as a filename for the cached copy of the dataset. Caches are stored in paths['cache_path'], and consist of a pair of files: dataset_name.dataset and dataset_name.metadata.
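In sketch form, assuming SHA-1 over a sorted-key JSON serialization (the exact hash function here is an assumption for illustration, not a promise about Easydata internals):

import hashlib
import json
from pathlib import Path

def cache_key(metadata):
    # Serialize with sorted keys so equal dicts always produce the same hash
    return hashlib.sha1(json.dumps(metadata, sort_keys=True).encode("utf-8")).hexdigest()

def cache_files(metadata, cache_dir):
    """Derive the pair of cache filenames from the metadata dict."""
    key = cache_key(metadata)
    return Path(cache_dir) / f"{key}.dataset", Path(cache_dir) / f"{key}.metadata"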

-rw-r--r--  1 hackalog  staff  301636179  9 May 14:22 1b1adb100d8088955878a9d7b3d071710c2db3bf.dataset
-rw-r--r--  1 hackalog  staff        478  9 May 14:22 1b1adb100d8088955878a9d7b3d071710c2db3bf.metadata
-rw-r--r--  1 hackalog  staff  301636175  9 May 14:21 756974a0ce41ffb9f53b47c234cd1e8b721dacfd.dataset
-rw-r--r--  1 hackalog  staff        474  9 May 14:21 756974a0ce41ffb9f53b47c234cd1e8b721dacfd.metadata

The .dataset file is a joblib serialization of the Dataset object. The .metadata file is a JSON file containing just the metadata dictionary, which is useful when, say, we don’t want to spend the time loading the whole dataset just to get at its hashes.
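This split pays off in practice: reading the small JSON file is nearly free, while joblib has to deserialize the entire (possibly multi-hundred-megabyte) object. For example:

import json
import joblib

def load_cached_metadata(cache_stem):
    # Cheap: just the metadata dict, no large-blob deserialization
    with open(f"{cache_stem}.metadata") as f:
        return json.load(f)

def load_cached_dataset(cache_stem):
    # Expensive: the full Dataset object
    return joblib.load(f"{cache_stem}.dataset")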

Once in a while, a Dataset is in a polished enough form that we dump it directly to a named Dataset in the paths['processed_data_path'] directory. We often do this at the end of a data cleaning session, or after an analysis. The idea is that we can blow away the paths['interim_data_path'] or paths['cache_path'] directory to reclaim disk space, and still have our generated Dataset objects available.

-rw-r--r--  1 hackalog  staff  301636179  9 May 14:22 beer_review_all.dataset
-rw-r--r--  1 hackalog  staff        478  9 May 14:22 beer_review_all.metadata

Note that these are exactly the same as their associated cache files: 1b1adb100d8088955878a9d7b3d071710c2db3bf.{dataset|metadata}
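Mechanically, then, “dumping” a polished Dataset amounts to writing the same pair of files under a human-readable name instead of the metadata hash. A hypothetical sketch, reusing cache_files from above:

import shutil
from pathlib import Path

def dump_processed(ds_name, metadata, cache_dir, processed_dir):
    """Promote a cached Dataset to a named copy in processed_data_path."""
    dataset_file, metadata_file = cache_files(metadata, cache_dir)
    shutil.copy(dataset_file, Path(processed_dir) / f"{ds_name}.dataset")
    shutil.copy(metadata_file, Path(processed_dir) / f"{ds_name}.metadata")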

The end result is that we can accumulate multiple versions of a Dataset in the cache directory, and continue to use them so long as we have the disk space.

At some point, we’d love for this cache to be shared within a workgroup, but that’s a feature for another day.
