# SNEK - Python Dependency Manager
![snek logo](sneklogo.png)
`snek.py` is a dependency manager that lets you build declarative data science
or other processing pipelines. Its goal is to be as minimal and non-intrusive as
possible to allow rapid prototyping, while being flexible enough to build large,
customizable pipelines that can be configured depending on ad-hoc needs. In
`snek.py` (or simply `snek`), targets are specified in a spirit similar to
Makefiles, but directly within Python.
Adhering to the goal of minimalism, serialization/deserialization for caching is
not part of `snek.py` itself. Rather, `snek.py` calls particular functions if
they are provided by the return type of a pipeline step, or functions that you
specify when declaring a particular target. Hence, users can implement their own
favorite methods and are not constrained by `snek.py`. Several examples for
serialization are provided in the demos that ship with `snek.py`, including how
to save dataclasses to numpy files and how to use an SQLite database as a
caching backend.
## Use Cases
In data science, we often build pipelines with lots of configuration options.
Yet, most pipelines can be broken down into small functions that compute
artifacts, which are then recombined in particular ways. Essentially, such
pipelines can be described as a computational graph, and intermediate results
might have to be cached to avoid expensive recomputation.
As an example of `snek`'s behavior, let's have a look at the first run of
[demo.py](demo/demo.py) with caching enabled (meaning it'll store intermediate
results) and verbose output:
```sh
$ ./demo.py
Resolved: A cb890cae7cdc934e705dd064dad9e30843267181
Resolved: C d9d7fadf0eb3bde7bbe4763033f25432527ef81d
Resolved: B 2b5f5080f8750e1e6b721c46772c271aeba79a16
called compute_A()
Reading file demoinput.txt
called compute_C()
called compute_B()
BResult(value='Result B following Result A (101) C Result')
```
Subsequent runs of `demo.py` will then only look as follows:
```sh
$ ./demo.py
Resolved: A cb890cae7cdc934e705dd064dad9e30843267181
Resolved: C d9d7fadf0eb3bde7bbe4763033f25432527ef81d
Resolved: B 2b5f5080f8750e1e6b721c46772c271aeba79a16
Loaded B from cache
BResult(value='Result B following Result A (101) C Result')
```
That is, instead of computing all intermediate artifacts required to generate B
(which are A and C in the example presented in [demo.py](demo/demo.py)), `snek`
loads B from the cache. You can find more details about the `cache` and what
`snek` really provides below.
## Usage
Let's dive right in:
```python
from snek import DependencyManager

dm = DependencyManager()

@dm.target(cacheable=False)
def awesomesauce():
    print("Hello, World!")

if __name__ == "__main__":
    dm.make('awesomesauce')
```
The above code illustrates how to use `snek` to make a particular target called
`awesomesauce`. In this example, the name of the target is automatically
inferred from the function name. You can find this demo in
[demo2.py](demo/demo2.py).
Of course this might not be very exciting, but once we introduce targets that
themselves depend on other targets, things get more interesting:
```python
from snek import DependencyManager

dm = DependencyManager()

@dm.target(cacheable=False)
def awesomesauce():
    return "Hello, World!"

@dm.target(cacheable=False)
def printstr(awesomesauce):
    print(awesomesauce)

if __name__ == "__main__":
    dm.make('printstr')
```
Now we call `dm.make` on the `printstr` target. `awesomesauce` itself no longer
prints anything directly, but returns a string. This string is the result of the
target `awesomesauce`. We then define a target `printstr`, which requires
`awesomesauce` as input -- inferred from the name of the argument to `printstr`
-- and actually prints it.
The above code can be found in [demo3.py](demo/demo3.py). `snek` will figure out
which function to call by itself if you don't specify a different target name
for what a function provides. Examples of changing the target name via the
`provides` argument to `dm.target` can be found further below.
You probably already see where this is going, but another, longer example helps
make things clearer and also introduces changing the name of the target that a
function provides:
```python
#!/usr/bin/env python
from snek import DependencyManager
from dataclasses import dataclass

dm = DependencyManager()

@dataclass
class AResult:
    value: str

@dataclass
class BResult:
    value: str

@dataclass
class CResult:
    value: str

@dm.target(provides='A', cacheable=False)
def compute_A() -> AResult:
    print("called compute_A()")
    return AResult("Result of A()")

@dm.target(provides='C', cacheable=False)
def compute_C() -> CResult:
    print("called compute_C()")
    return CResult("C Result")

@dm.target(provides='B', requires=['A', 'C'], cacheable=False)
def compute_B(A, C) -> BResult:
    print("called compute_B()")
    return BResult("Result B following " + A.value + " " + C.value)

if __name__ == "__main__":
    result = dm.make('B')
    print(result)
```
In the example above, we define three different targets A, B, and C (specified
using the `provides` argument to `dm.target`, because the functions have
different names than what they produce). While A and C can stand independently,
B requires A and C. Of course, this could also be expressed by simply chaining
some functions, but often -- especially in large data science pipelines -- this
gets cumbersome, and it's easier to think in terms of (intermediate) artifacts
that get generated and then treat them as inputs for other functions. Also,
configuration parameters for a pipeline might influence whether and how
intermediate artifacts are generated. The code for the example above is located
in [demo4.py](demo/demo4.py).
## Cache
At this point, you might wonder what the `cacheable=False` argument to
`dm.target` is about. Very often in data science pipelines, we create
intermediate artifacts that are expensive to compute -- be the cost time, space,
or other resources. It'd be nice to avoid recomputation and simply cache them.
This is where the `cacheable` flag comes in. By default, all targets in `snek`'s
`DependencyManager` are assumed to be cacheable (because this is what I built it
for). For this to work, we need to provide some information to `snek` so that it
knows how to serialize and deserialize objects. In fact, `snek` doesn't really
care how you want to cache the intermediate objects; it simply invokes certain
functions when caching happens.
By default, `snek` calls the method `serialize` to cache an (intermediate)
object -- or more precisely, the result of a target -- and `deserialize` to load
it from some cache on disk. Note that on top of that, `snek`'s
`DependencyManager` also uses an in-memory cache to avoid loading things that
were already deserialized. Anyway, let's get back to the caching mechanism,
illustrated with the following example:
```python
from pathlib import Path
from dataclasses import dataclass
# helper functions, see demo/dcutils.py
from dcutils import dataclass_to_npz, npz_to_dataclass

class Serializable:
    def serialize(self, name: str, unique_id: str):
        fpath = Path('cache') / (name + "_" + unique_id + ".npz")
        dataclass_to_npz(self, fpath)

    @classmethod
    def deserialize(cls, name: str, unique_id: str):
        fpath = Path('cache') / (name + "_" + unique_id + ".npz")
        if fpath.exists():
            return npz_to_dataclass(cls, fpath)
        return None

@dataclass
class AResult(Serializable):
    value: str
```
In the code above, which is part of [demo.py](demo/demo.py), we first define a
generic class that has a method `serialize` and a classmethod `deserialize`.
`snek`'s `DependencyManager` will look for these two methods on any result of a
target. The type of the target will be inferred from type hints, but can also be
specified as an argument to `dm.target`. Internally, `snek` will then try to
call these functions to load/store the result of a target. In the example above,
I made use of two small helper functions, `dataclass_to_npz` and
`npz_to_dataclass`, which, as their names imply, turn simple dataclasses into
npz files and back, avoiding the use of `pickle`. If you want to read more about
these functions and how they work, have a look at [dcutils](demo/dcutils.py).
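If you just want a rough idea of what such helpers could look like, here is a
hypothetical sketch. The names `dataclass_to_npz` and `npz_to_dataclass` are
taken from the demo, but the bodies below are illustrative assumptions, not the
actual implementation found in [dcutils](demo/dcutils.py):

```python
# Hypothetical sketch of npz-based dataclass helpers; the real implementation
# in demo/dcutils.py may differ in detail.
from dataclasses import fields
from pathlib import Path
import numpy as np

def dataclass_to_npz(obj, fpath: Path):
    # Store each dataclass field as a named array inside a .npz archive.
    fpath.parent.mkdir(parents=True, exist_ok=True)
    np.savez(fpath, **{f.name: np.asarray(getattr(obj, f.name)) for f in fields(obj)})

def npz_to_dataclass(cls, fpath: Path):
    # Rebuild the dataclass, converting 0-d arrays back to Python scalars.
    with np.load(fpath) as data:
        kwargs = {f.name: data[f.name].item() if data[f.name].ndim == 0 else data[f.name]
                  for f in fields(cls)}
    return cls(**kwargs)
```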
In the `Serializable` class above, you can also see that `snek` doesn't really
care how your caching mechanism works. It merely asks you to serialize or
deserialize something based on its target name and the unique id that was
computed for that specific target. Which leads us to the next point.
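To tie this together, a cacheable target built on such a return type might look
like the following sketch (this mirrors the decorator usage from the demos; the
exact caching behavior also depends on the parameters and dependencies you
declare):

```python
# Hypothetical sketch: with the default cacheable=True, snek will look for
# AResult.deserialize(name, unique_id) before recomputing, and call
# result.serialize(name, unique_id) after a fresh computation.
@dm.target(provides='A')
def compute_A() -> AResult:
    print("called compute_A()")
    return AResult("Result of A()")
```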
### Unique IDs
One of the things you might notice above is the use of a `unique_id`. `snek` was
designed with the knowledge that artifacts generated in any data science or
Makefile-like pipeline usually depend on some configuration options, files, as
well as their parents. For instance, if a pipeline has targets A, B, and C, then
each of them might require some configuration. Let's say A and C are required to
make B, and B has a change in its configuration; then only B should be rebuilt.
However, if C has a change in configuration, then both C and B have to be
rebuilt, because C is a dependency of B. To handle this, `snek` expects you to
tell it, in the call to `dm.target`, which parameters should be considered, as
in the following example:
```python
@dm.target(provides='B', requires=['A', 'C'], params={'config_b': config.config_b})
def compute_B(A, C, config_b) -> BResult:
    print("called compute_B()")
    return BResult("Result B following " + A.value + " " + C.value)
```
In the code above, taken from [demo.py](demo/demo.py), we specify that the
function `compute_B`, which provides target `B`, depends on two other targets,
`A` and `C`. Moreover, `compute_B` depends on some extra parameters which
determine what is computed, and we specify this using the `params` argument to
`dm.target`.
If you have many parameters for a function, the above can become verbose to
write. For instance, take the following example:
```python
@dm.target(provides='B', requires=['A', 'C'], params={'param1': config.param1, 'param2': config.param2, 'param3': config.param3})
def compute_B(A, C, param1, param2, param3) -> BResult:
    print("called compute_B()")
    return BResult("Result B following " + A.value + " " + C.value)
```
While this is obviously contrived, and the params are not used at all, it
highlights an issue: repetition. To avoid having to repeat parameter names and
let `snek` automatically figure out what to pass in, you can do the following:
```python
@dm.target(provides='B', requires=['A', 'C'], params='auto', param_source=config)
def compute_B(A, C, param1, param2, param3) -> BResult:
    print("called compute_B()")
    return BResult("Result B following " + A.value + " " + C.value)
```
When using `params='auto'` in combination with `param_source`, `snek` will first
try to map all "required" arguments to the function. Subsequently, it will try
to fetch all remaining arguments from `param_source`. Note that for this to
work, the names need to match, i.e. if `param1` is not an element of `config`,
the above might fail.
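For illustration, `param_source` could be something as simple as the following
hypothetical configuration object; the field names must match the function
arguments that are not covered by the requirements (whether `snek` reads
attributes, dict entries, or both is best checked in snek.py):

```python
# Hypothetical configuration object used as param_source in the examples above.
from dataclasses import dataclass

@dataclass
class Config:
    param1: int = 42
    param2: float = 0.5
    param3: str = "fast"

config = Config()
```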
Combining `params='auto'` with automatic deduction of the target name and
dependencies allows you to specify targets with as little effort as:
```python
@dm.target(params='auto', param_source=config)
def kerneldensityestimate(prefiltered, sidechannelinfo, param1, param2, param3) -> KDEResult:
    # ...
    return KDEResult(some_values)
```
This will register a target named `kerneldensityestimate` that requires
`prefiltered` and `sidechannelinfo` as inputs and takes additional parameters
from `config`. Thus, not much overhead.
### A note on Unique IDs and return types
The unique id mentioned above is based not only on the parameters that go into a
target and the unique ids of its parents, but also on the structure of the
return type, if it is available. That is, if the return type of a target
function provides a classmethod `structural_hash` or is a dataclass, then the
structural information will be taken into account as well (or the result of
`structural_hash`, which must be something hashable, like a string). Here's an
example:
```python
class CustomTensor:
    shape: tuple
    dtype: str
    # ...

    @classmethod
    def structural_hash(cls):
        return hash_obj({"type": "CustomTensor", "shape": cls.default_shape, "dtype": cls.default_dtype})
```
In the example above, `hash_obj` is a function that ships with `snek.py`.
The reason to include the structure of the return type is that `snek.py` was
developed with rapid prototyping in mind. In this scenario, the return types
might not yet be fixed, and changing the structure of a return type should
require recomputation of the corresponding target.
Using a classmethod `structural_hash` provides flexibility and transparency.
When this classmethod is not provided, `snek` will extract structural
information only if the return type is a dataclass, in which case the hash is
computed over the names of the fields and their types.
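As a rough illustration of the dataclass case, you can think of the structural
information as the tuple of field names and types; the exact hashing inside
`snek.py` may differ:

```python
# Illustration only: summarize a dataclass's structure from its field names and
# types. Renaming a field or changing its type changes this summary, and hence
# the unique id of any target returning BResult.
from dataclasses import dataclass, fields

@dataclass
class BResult:
    value: str

structure = tuple((f.name, str(f.type)) for f in fields(BResult))
print(structure)
```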
## File Dependencies
`snek` has a tiny bit more magic: it understands that we're often working with
files. An example, again taken from [demo.py](demo/demo.py), shows how to use
files:
```python
@dm.target(provides='A', requires=['@demoinput.txt'], params=asdict(config.config_a))
def compute_A(fpath, **kwargs) -> AResult:
    print("called compute_A()")
    print(f"Reading file {fpath}")
    return AResult("Result A (" + str(kwargs['option_a_1']) + ")")
```
Essentially, target `A` depends on the file `demoinput.txt`. File dependencies
are specified simply by prefixing the file name with `@` in the list of
requirements for the target.
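As a small, hypothetical sketch of another file-backed target (the path and
target name below are made up for illustration; the way the path is handed to
the function mirrors the demo above):

```python
from pathlib import Path

# Hypothetical sketch: '@' marks 'data/measurements.csv' as a file dependency,
# and the path is passed to the function.
@dm.target(provides='RAW', requires=['@data/measurements.csv'], cacheable=False)
def load_raw(fpath) -> str:
    print(f"Reading file {fpath}")
    return Path(fpath).read_text()
```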
## Analysing the compute graph
You might wish to analyze or visualize the compute graph that is constructed to
make a certain target. Internally, `snek` first builds that graph and then
resolves the nodes within it. Let's have a look at the implementation:
```python
def make(self, name, verbose: bool = False, use_cache: bool = True):
    node = self.build_graph(name, verbose)
    result = self.resolve(node, use_cache, verbose)
    return result
```
If you want to examine the tree, you can call `build_graph` and then walk the
tree given by the returned `node`. I invite you to have a look at
[snek.py](snek.py) to see what information a node contains.
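As a hedged sketch of such a walk: the attribute names used below (`name` and
`requires`) are assumptions for illustration only, so check snek.py for the
fields a node actually carries:

```python
# Hypothetical depth-first walk over the graph returned by build_graph.
def walk(node, depth=0):
    print("  " * depth + str(getattr(node, "name", node)))
    for child in getattr(node, "requires", None) or []:
        walk(child, depth + 1)

root = dm.build_graph('B', False)  # same positional arguments as used in make()
walk(root)
```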
## Parallel Builds
`snek` has some preliminary capabilities for building targets in parallel, which
can be invoked using `make(..., parallel=True)`. Internally, `snek` will then
sort the compute graph topologically and build all targets that can be built
independently in parallel using Python's `ThreadPoolExecutor`. Note that using a
`ThreadPoolExecutor` is not always advisable, and you should carefully consider
whether it is appropriate for your case.
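Invoking a parallel build then looks like this (using target `B` from the demo
above):

```python
# Build target 'B'; independent targets (A and C in the demo) may be computed
# concurrently via ThreadPoolExecutor. CPU-bound pure-Python steps still contend
# for the GIL, so threads mainly help with I/O-bound or GIL-releasing work.
result = dm.make('B', parallel=True)
print(result)
```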
## Installation and Hackability
One of the main goals of `snek` is to be as minimal as possible. In fact, at
fewer than 400 lines of Python code, `snek` is probably as short as it gets. You
might not even want to install it as a package; just throw [snek.py](snek.py)
into your project source folder, extend it, change it around, and automate other
things.
If you want to install it properly, say into a virtual environment, then you
can use pip:
```sh
$ # activate the desired virtual environment
$ git clone https://github.com/nwaniek/snek.git
$ cd snek
$ pip install .
```
## Contributing
Contributions are welcome!
If you have suggestions, bug reports, or want to contribute code, please open an
issue or submit a pull request on GitHub.
## License
`snek` is licensed under the MIT License. See the [LICENSE](LICENSE) file for
details.