# SNEK - Python Dependency Manager
![snek logo](sneklogo.png)
`snek.py` is a dependency manager that lets you build declarative data science
or other processing pipelines. Its goal is to be as minimal and non-intrusive as
possible to allow rapid prototyping, while being flexible enough to build large,
customizable pipelines that can be configured depending on ad-hoc needs. In
`snek.py` (or simply `snek`), targets are specified in a spirit similar to
Makefiles, but directly within Python.
Adhering to the goal of minimalism, serialization/deserialization for caching is
not part of `snek.py` itself. Rather, `snek.py` calls particular functions if
they are provided by the return type of a pipeline step, or functions that you
specify when declaring a particular target. Hence, users can implement their own
favorite methods and are not constrained by `snek.py`. Several examples for
serialization are provided in the demos that ship with `snek.py`, including how
to save dataclasses to numpy files and how to use an SQLite database as a
caching backend.
## Use Cases
In data science, we often build pipelines with lots of configuration options.
Yet, most pipelines can be broken down into small functions that compute
artifacts, which are then recombined in particular ways. Essentially, such
pipelines can be described as a computational graph, and intermediate results
might have to be cached to avoid expensive recomputation.
As an example of `snek`'s behavior, let's have a look at the first run of
[demo.py](demo/demo.py) with caching enabled (meaning it'll store intermediate
results) and verbose output:
```sh
$ ./demo.py
Resolved: A cb890cae7cdc934e705dd064dad9e30843267181
Resolved: C d9d7fadf0eb3bde7bbe4763033f25432527ef81d
Resolved: B 2b5f5080f8750e1e6b721c46772c271aeba79a16
called compute_A()
Reading file demoinput.txt
called compute_C()
called compute_B()
BResult(value='Result B following Result A (101) C Result')
```
Subsequent runs of `demo.py` will then only look as follows:
```sh
$ ./demo.py
Resolved: A cb890cae7cdc934e705dd064dad9e30843267181
Resolved: C d9d7fadf0eb3bde7bbe4763033f25432527ef81d
Resolved: B 2b5f5080f8750e1e6b721c46772c271aeba79a16
Loaded B from cache
BResult(value='Result B following Result A (101) C Result')
```
That is, instead of computing all intermediate artifacts required to generate B
(which are A and C in the example presented in [demo.py](demo/demo.py)), `snek`
loads B from the cache. You can find more details about the `cache` and what
`snek` really provides below.
## Usage
Let's dive right in:
```python
from snek import DependencyManager

dm = DependencyManager()

@dm.target(cacheable=False)
def awesomesauce():
    print("Hello, World!")

if __name__ == "__main__":
    dm.make('awesomesauce')
```
The above code illustrates how to use `snek` to make a particular target called
`awesomesauce`. In this example, the name of the target is automatically
inferred from the function name. You can find this demo in
[demo2.py](demo/demo2.py).
Of course this might not be very exciting, but once we introduce targets that
themselves depend on other targets, things get more interesting:
```python
from snek import DependencyManager

dm = DependencyManager()

@dm.target(cacheable=False)
def awesomesauce():
    return "Hello, World!"

@dm.target(cacheable=False)
def printstr(awesomesauce):
    print(awesomesauce)

if __name__ == "__main__":
    dm.make('printstr')
```
Now we call `dm.make` on the `printstr` target. `awesomesauce` itself no longer
prints anything directly, but returns a string. This string is the result of the
target `awesomesauce`. We then define a target `printstr`, which requires
`awesomesauce` as input -- inferred from the name of the argument to `printstr`
-- and actually prints it.
The above code can be found in [demo3.py](demo/demo3.py). `snek` will figure out
which function to call by itself if you don't specify a different target name
for what a function provides. Examples of changing the target name via the
`provides` argument to `dm.target` can be found further below.
You probably already see where this is going, but another, longer example helps
make things clearer and also introduces changing the name of the target that a
function provides:
```python
#!/usr/bin/env python
from snek import DependencyManager
from dataclasses import dataclass

dm = DependencyManager()

@dataclass
class AResult:
    value: str

@dataclass
class BResult:
    value: str

@dataclass
class CResult:
    value: str

@dm.target(provides='A', cacheable=False)
def compute_A() -> AResult:
    print("called compute_A()")
    return AResult("Result of A()")

@dm.target(provides='C', cacheable=False)
def compute_C() -> CResult:
    print("called compute_C()")
    return CResult("C Result")

@dm.target(provides='B', requires=['A', 'C'], cacheable=False)
def compute_B(A, C) -> BResult:
    print("called compute_B()")
    return BResult("Result B following " + A.value + " " + C.value)

if __name__ == "__main__":
    result = dm.make('B')
    print(result)
```
In the example above, we define three different targets A, B, and C (specified
using the `provides` argument to `dm.target`, because the functions have
different names than what they produce). While A and C can stand independently,
B requires A and C. Of course, this could also be expressed by simply chaining
some functions, but often -- especially in large data science pipelines -- this
gets cumbersome, and it's easier to think in terms of (intermediate) artifacts
that get generated and then treat them as inputs for other functions. Also,
configuration parameters for a pipeline might influence whether and how
intermediate artifacts are generated. The code for the example above is located
in [demo4.py](demo/demo4.py).
## Cache
At this point, you might wonder what the `cacheable=False` argument to
`dm.target` is about. Very often in data science pipelines, we create
intermediate artifacts that are expensive to compute -- be the cost time, space,
or other resources. It'd be nice to avoid recomputation and simply cache them.
This is where the `cacheable` flag comes in. By default, all targets in `snek`'s
`DependencyManager` are assumed to be cacheable (because this is what I built it
for). For this to work, we need to provide some information to `snek` so that it
knows how to serialize and deserialize objects. In fact, `snek` doesn't really
care how you want to cache the intermediate objects; it simply invokes certain
functions when caching happens.
By default, `snek` calls the method `serialize` to cache an (intermediate)
object -- or more precisely, the result of a target -- and `deserialize` to load
it from some cache on disk. Note that on top of that, `snek`'s
`DependencyManager` also uses an in-memory cache to avoid loading things that
were already deserialized. Anyway, let's get back to the caching mechanism,
illustrated with the following example:
```python
from pathlib import Path
from dataclasses import dataclass
# helper functions, see demo/dcutils.py
from dcutils import dataclass_to_npz, npz_to_dataclass

class Serializable:
    def serialize(self, name: str, unique_id: str):
        fpath = Path('cache') / (name + "_" + unique_id + ".npz")
        dataclass_to_npz(self, fpath)

    @classmethod
    def deserialize(cls, name: str, unique_id: str):
        fpath = Path('cache') / (name + "_" + unique_id + ".npz")
        if fpath.exists():
            return npz_to_dataclass(cls, fpath)
        return None

@dataclass
class AResult(Serializable):
    value: str
```
In the code above, which is part of [demo.py](demo/demo.py), we first define a
generic class that has a method `serialize` and a classmethod `deserialize`.
`snek`'s `DependencyManager` will look for these two methods on any result of a
target. The type of the target will be inferred from type hints, but can also be
specified as an argument to `dm.target`. Internally, `snek` will then try to
call these functions to load/store the result of a target. In the example above,
I made use of two small helper functions, `dataclass_to_npz` and
`npz_to_dataclass`, which, as their names imply, turn simple dataclasses into
npz files and back, avoiding the use of `pickle`. If you want to read more about
these functions and how they work, have a look at [dcutils](demo/dcutils.py).
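If you just want a rough idea of what such helpers could look like, here is a
hypothetical sketch. The names `dataclass_to_npz` and `npz_to_dataclass` are
taken from the demo, but the bodies below are illustrative assumptions, not the
actual implementation found in [dcutils](demo/dcutils.py):

```python
# Hypothetical sketch of npz-based dataclass helpers; the real implementation
# in demo/dcutils.py may differ in detail.
from dataclasses import fields
from pathlib import Path
import numpy as np

def dataclass_to_npz(obj, fpath: Path):
    # Store each dataclass field as a named array inside a .npz archive.
    fpath.parent.mkdir(parents=True, exist_ok=True)
    np.savez(fpath, **{f.name: np.asarray(getattr(obj, f.name)) for f in fields(obj)})

def npz_to_dataclass(cls, fpath: Path):
    # Rebuild the dataclass, converting 0-d arrays back to Python scalars.
    with np.load(fpath) as data:
        kwargs = {f.name: data[f.name].item() if data[f.name].ndim == 0 else data[f.name]
                  for f in fields(cls)}
    return cls(**kwargs)
```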
In the `Serializable` class above, you can also see that `snek` doesn't really
care how your caching mechanism works. It merely asks you to serialize or
deserialize something based on its target name and the unique id that was
computed for that specific target. Which leads us to the next point.
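To tie this together, a cacheable target built on such a return type might look
like the following sketch (this mirrors the decorator usage from the demos; the
exact caching behavior also depends on the parameters and dependencies you
declare):

```python
# Hypothetical sketch: with the default cacheable=True, snek will look for
# AResult.deserialize(name, unique_id) before recomputing, and call
# result.serialize(name, unique_id) after a fresh computation.
@dm.target(provides='A')
def compute_A() -> AResult:
    print("called compute_A()")
    return AResult("Result of A()")
```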
### Unique IDs
One of the things you might notice above is the use of a `unique_id`. `snek` was
designed with the knowledge that artifacts generated in any data science or
Makefile-like pipeline usually depend on some configuration options, files, as
well as their parents. For instance, if a pipeline has targets A, B, and C, then
each of them might require some configuration. Let's say A and C are required to
make B, and B has a change in its configuration; then only B should be rebuilt.
However, if C has a change in configuration, then both C and B have to be
rebuilt, because C is a dependency of B. To handle this, `snek` expects you to
tell it, in the call to `dm.target`, which parameters should be considered, as
in the following example:
```python
@dm.target(provides='B', requires=['A', 'C'], params={'config_b': config.config_b})
def compute_B(A, C, config_b) -> BResult:
    print("called compute_B()")
    return BResult("Result B following " + A.value + " " + C.value)
```
In the code above, taken from [demo.py](demo/demo.py), we specify that the
function `compute_B`, which provides target `B`, depends on two other targets,
`A` and `C`. Moreover, `compute_B` depends on some extra parameters which
determine what is computed, and we specify this using the `params` argument to
`dm.target`.
If you have many parameters for a function, the above can become verbose to
write. For instance, take the following example:
```python
@dm.target(provides='B', requires=['A', 'C'], params={'param1': config.param1, 'param2': config.param2, 'param3': config.param3})
def compute_B(A, C, param1, param2, param3) -> BResult:
    print("called compute_B()")
    return BResult("Result B following " + A.value + " " + C.value)
```
While this is obviously contrived, and the params are not used at all, it
highlights an issue: repetition. To avoid having to repeat parameter names and
let `snek` automatically figure out what to pass in, you can do the following:
```python
@dm.target(provides='B', requires=['A', 'C'], params='auto', param_source=config)
def compute_B(A, C, param1, param2, param3) -> BResult:
    print("called compute_B()")
    return BResult("Result B following " + A.value + " " + C.value)
```
When using `params='auto'` in combination with `param_source`, `snek` will first
try to map all "required" arguments to the function. Subsequently, it will try
to fetch all remaining arguments from `param_source`. Note that for this to
work, the names need to match, i.e. if `param1` is not an element of `config`,
the above might fail.
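For illustration, `param_source` could be something as simple as the following
hypothetical configuration object; the field names must match the function
arguments that are not covered by the requirements (whether `snek` reads
attributes, dict entries, or both is best checked in snek.py):

```python
# Hypothetical configuration object used as param_source in the examples above.
from dataclasses import dataclass

@dataclass
class Config:
    param1: int = 42
    param2: float = 0.5
    param3: str = "fast"

config = Config()
```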
Combining `params='auto'` with automatic deduction of the target name and
dependencies allows you to specify targets with as little effort as:
```python
@dm.target(params='auto', param_source=config)
def kerneldensityestimate(prefiltered, sidechannelinfo, param1, param2, param3) -> KDEResult:
    # ...
    return KDEResult(some_values)
```
This will register a target named `kerneldensityestimate` that requires
`prefiltered` and `sidechannelinfo` as inputs and takes additional parameters
from `config`. Thus, not much overhead.
### A note on Unique IDs and return types
The unique id mentioned above is based not only on the parameters that go into a
target and the unique ids of its parents, but also on the structure of the
return type, if it is available. That is, if the return type of a target
function provides a classmethod `structural_hash` or is a dataclass, then the
structural information will be taken into account as well (or the result of
`structural_hash`, which must be something hashable, like a string). Here's an
example:
```python
class CustomTensor:
    shape: tuple
    dtype: str
    # ...

    @classmethod
    def structural_hash(cls):
        return hash_obj({"type": "CustomTensor", "shape": cls.default_shape, "dtype": cls.default_dtype})
```
In the example above, `hash_obj` is a function that ships with `snek.py`.
The reason to include the structure of the return type is that `snek.py` was
developed with rapid prototyping in mind. In this scenario, the return types
might not yet be fixed, and changing the structure of a return type should
require recomputation of the corresponding target.
Using a classmethod `structural_hash` provides flexibility and transparency.
When this classmethod is not provided, `snek` will extract structural
information only if the return type is a dataclass, in which case the hash is
computed over the names of the fields and their types.
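As a rough illustration of the dataclass case, you can think of the structural
information as the tuple of field names and types; the exact hashing inside
`snek.py` may differ:

```python
# Illustration only: summarize a dataclass's structure from its field names and
# types. Renaming a field or changing its type changes this summary, and hence
# the unique id of any target returning BResult.
from dataclasses import dataclass, fields

@dataclass
class BResult:
    value: str

structure = tuple((f.name, str(f.type)) for f in fields(BResult))
print(structure)
```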
## File Dependencies
`snek` has a tiny bit more magic: it understands that we're often working with
files. An example, again taken from [demo.py](demo/demo.py), shows how to use
files:
```python
@dm.target(provides='A', requires=['@demoinput.txt'], params=asdict(config.config_a))
def compute_A(fpath, **kwargs) -> AResult:
    print("called compute_A()")
    print(f"Reading file {fpath}")
    return AResult("Result A (" + str(kwargs['option_a_1']) + ")")
```
Essentially, target `A` depends on the file `demoinput.txt`. File dependencies
are specified simply by prefixing the file name with `@` in the list of
requirements for the target.
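As a small, hypothetical sketch of another file-backed target (the path and
target name below are made up for illustration; the way the path is handed to
the function mirrors the demo above):

```python
from pathlib import Path

# Hypothetical sketch: '@' marks 'data/measurements.csv' as a file dependency,
# and the path is passed to the function.
@dm.target(provides='RAW', requires=['@data/measurements.csv'], cacheable=False)
def load_raw(fpath) -> str:
    print(f"Reading file {fpath}")
    return Path(fpath).read_text()
```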
## Analysing the compute graph
You might wish to analyze or visualize the compute graph that is constructed to
make a certain target. Internally, `snek` first builds that graph and then
resolves the nodes within it. Let's have a look at the implementation:
```python
def make(self, name, verbose: bool = False, use_cache: bool = True):
    node = self.build_graph(name, verbose)
    result = self.resolve(node, use_cache, verbose)
    return result
```
If you want to examine the tree, you can call `build_graph` and then walk the
tree given by the returned `node`. I invite you to have a look at
[snek.py](snek.py) to see what information a node contains.
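As a hedged sketch of such a walk: the attribute names used below (`name` and
`requires`) are assumptions for illustration only, so check snek.py for the
fields a node actually carries:

```python
# Hypothetical depth-first walk over the graph returned by build_graph.
def walk(node, depth=0):
    print("  " * depth + str(getattr(node, "name", node)))
    for child in getattr(node, "requires", None) or []:
        walk(child, depth + 1)

root = dm.build_graph('B', False)  # same positional arguments as used in make()
walk(root)
```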
## Parallel Builds
`snek` has some preliminary capabilities for building targets in parallel, which
can be invoked using `make(..., parallel=True)`. Internally, `snek` will then
sort the compute graph topologically and build all targets that can be built
independently in parallel using Python's `ThreadPoolExecutor`. Note that using a
`ThreadPoolExecutor` is not always advisable, and you should carefully consider
whether it is appropriate for your case.
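Invoking a parallel build then looks like this (using target `B` from the demo
above):

```python
# Build target 'B'; independent targets (A and C in the demo) may be computed
# concurrently via ThreadPoolExecutor. CPU-bound pure-Python steps still contend
# for the GIL, so threads mainly help with I/O-bound or GIL-releasing work.
result = dm.make('B', parallel=True)
print(result)
```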
## Installation and Hackability
One of the main goals of `snek` is to be as minimal as possible. In fact, at
fewer than 400 lines of Python code, `snek` is probably as short as it gets. You
might not even want to install it as a package; just throw [snek.py](snek.py)
into your project source folder, extend it, change it around, and automate other
things.
If you want to install it properly, say into a virtual environment, then you
can use pip:
```sh
$ # activate the desired virtual environment
$ git clone https://github.com/nwaniek/snek.git
$ cd snek
$ pip install .
```
## Contributing
Contributions are welcome!
If you have suggestions, bug reports, or want to contribute code, please open an
issue or submit a pull request on GitHub.
## License
`snek` is licensed under the MIT License. See the [LICENSE](LICENSE) file for
details.