- Objective
  - A docker image that downloads data from the web into a volume.
- State
  - In progress

`python src/main.py https://raw.githubusercontent.com/greyhypotheses/discourses/develop/reader/resources/images.yml`

or

`python src/main.py https://raw.githubusercontent.com/greyhypotheses/discourses/develop/reader/resources/images.yml --limit 31`

Here, images.yml is the input file of parameters that guides the downloading of data files, whilst the optional argument `--limit` specifies the number of files to download.
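
The command line above, a positional parameters-file URL plus an optional `--limit`, can be sketched with `argparse`. This is a minimal illustration only: the function name `parse_arguments` and the argument name `parameters_url` are hypothetical, and the actual `src/main.py` may be organised differently.

```python
import argparse


def parse_arguments(args=None):
    """Parse the parameters-file URL and the optional download limit.

    Hypothetical sketch; the names used here are assumptions, not the project's code.
    """
    parser = argparse.ArgumentParser(
        description='Download data files from the web into a volume.')

    # Positional argument: the URL of the YAML parameters file, e.g., .../images.yml
    parser.add_argument('parameters_url', type=str,
                        help='URL of the YAML file of parameters')

    # Optional argument: the number of files to download; None is read as "no limit"
    parser.add_argument('--limit', type=int, default=None,
                        help='the number of files to download')

    return parser.parse_args(args)
```

Passing a list such as `['https://example.com/images.yml', '--limit', '31']` to `parse_arguments` mimics the second command; omitting `--limit` leaves the limit as `None`.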

| parameter | type | description |
|---|---|---|
| `rootURL` | str | The root URL from whence files will be downloaded |
| `metadataFileURL` | str | A CSV file that includes a field of the names of the files to be downloaded |
| `fileStringsField` | str | The name of the field of file names |
| `fileStringsIncludeExt` | bool | Do the file names, in the file names field, include file extensions? |
| `archived` | bool | Archived files? If true, they will be de-archived. Presently, only zip files can be de-archived. |
| `ext` | str | File extension, e.g., .zip. This parameter is mandatory if `fileStringsIncludeExt` is false. |
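
To illustrate how `rootURL`, `fileStringsIncludeExt`, and `ext` interact, here is a minimal sketch. It assumes the parameters have already been read into a dictionary and the file names have already been extracted from the metadata file; the function name `build_file_urls` is hypothetical and not part of the project.

```python
def build_file_urls(parameters, file_names):
    """Build the download URL of each file from the parameters.

    Hypothetical sketch: 'parameters' is a dict that uses the keys of the table
    above; 'file_names' is the list of values of the file names field.
    """
    include_ext = parameters['fileStringsIncludeExt']
    ext = parameters.get('ext')

    # 'ext' is mandatory if the file names do not include extensions
    if not include_ext and not ext:
        raise ValueError("'ext' is mandatory if fileStringsIncludeExt is false")

    # Ensure exactly one slash between the root URL and each file name
    root = parameters['rootURL'].rstrip('/') + '/'

    # Append the extension only if the names lack it
    suffix = '' if include_ext else ext
    return [root + name + suffix for name in file_names]
```

For example, with `rootURL` set to `https://example.com/data`, `fileStringsIncludeExt` set to false, and `ext` set to `.zip`, the name `a` maps to `https://example.com/data/a.zip`.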

- The switch from `dask` to `multiprocessing`
- Dockerfile
- GitHub Actions .yml: For (a) automated pytest, coverage, and pylint tests, (b) building & deploying docker images.
- Automated Tests: GitHub Actions will highlight deficiencies w.r.t. tests and/or conventions
- Brief, but comprehensible, docstrings throughout
- At present, data is always downloaded into a volume named `data`. This set-up might be changed such that data is downloaded into a volume whose name is declared in the parameters file.

The local environment is `.../reader`:

    conda create --prefix .../reader

and the requirements are summarised via filter.txt & requirements.txt:

    pip freeze -r docs/filter.txt > requirements.txt

Note:

- `python-graphviz` can't be included in filter.txt/requirements.txt; the reason why GitHub Actions rejects it is unclear.
- Always ascertain that the `dask` setting in requirements.txt is `dask[complete]`; this avoids GitHub Actions errors.

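
The `dask[complete]` check in the note above could be automated with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `dask_is_complete` is hypothetical, and the check only looks for a requirements line that begins with `dask[complete]`.

```python
def dask_is_complete(requirements_text):
    """Return True if a requirements line declares dask[complete].

    Hypothetical helper; it inspects the text of a requirements.txt file.
    """
    for line in requirements_text.splitlines():
        # Ignore surrounding whitespace and letter case
        entry = line.strip().lower()
        if entry.startswith('dask[complete]'):
            return True
    return False
```

It might be run against the contents of requirements.txt, e.g., `dask_is_complete(open('requirements.txt').read())`, before pushing to GitHub.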
The explicitly installed packages are listed in filter.txt. Foremost:

    conda activate reader
    conda install -c anaconda dask==2021.10.0
    conda install -c anaconda python==3.7.10
    conda install -c anaconda pytest coverage pytest-cov pylint
    conda install -c anaconda requests

    # dotmap
    pip install dotmap==1.3.23

    # python-graphviz installs graphviz & python-graphviz
    conda install -c anaconda python-graphviz

A few points w.r.t. dask: dask might install an old version of Pillow that will trigger a GitHub security alert, hence

    pip install Pillow

Additionally, dask might install an old version of Jinja2 that will trigger a GitHub security alert, hence

    pip install jinja2

- Renaming conda environments: deleting, then re-creating, the environment seems to be the effective option.