Reader: Python

Objective: A docker image that downloads data from the web into a volume.
State: In progress

Running

python src/main.py https://raw.githubusercontent.com/greyhypotheses/discourses/develop/reader/resources/images.yml

or

python src/main.py https://raw.githubusercontent.com/greyhypotheses/discourses/develop/reader/resources/images.yml --limit 31

Here, images.yml is a file of parameters that guides the downloading of data files; its URL is the program's input argument. The optional argument --limit specifies the number of files to download.
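The two invocations above can be sketched with argparse. This is only an illustration of the command-line shape, not the contents of src/main.py: the names build_parser and parameters_url are hypothetical, the example.com URL is a placeholder, and treating an absent --limit as "no limit" is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts a parameters-file URL and an optional --limit."""
    parser = argparse.ArgumentParser(
        description='Download the data files described by a YAML parameters file.')
    # Positional argument: the URL of the parameters file, e.g. .../images.yml
    parser.add_argument('parameters_url', type=str,
                        help='URL of the YAML parameters file')
    # Optional argument: the number of files to download.
    # Assumption: None means that every listed file is downloaded.
    parser.add_argument('--limit', type=int, default=None,
                        help='the number of files to download')
    return parser


# Parsing an explicit argument list, equivalent to the second invocation above
# (the URL here is a placeholder)
example = build_parser().parse_args(
    ['https://example.com/images.yml', '--limit', '31'])
print(example.parameters_url, example.limit)
```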

| parameter | type | description |
| --- | --- | --- |
| rootURL | str | The root URL from which files will be downloaded |
| metadataFileURL | str | The URL of a CSV file that includes a field listing the names of the files to download |
| fileStringsField | str | The name of the field of file names |
| fileStringsIncludeExt | bool | Do the file names, in the file names field, include file extensions? |
| archived | bool | Archived files? If true, they will be de-archived. Presently, only zip files can be de-archived. |
| ext | str | File extension, e.g., .zip. This parameter is mandatory if fileStringsIncludeExt is false. |
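To illustrate how these parameters might fit together, here is a minimal sketch that reads the file names from the metadata CSV text and builds one download URL per file. The function names are hypothetical, and forming each URL as rootURL + file name (+ ext) is an assumption; the project may compose its URLs differently.

```python
import csv
import io
from typing import Dict, List


def read_file_names(csv_text: str, field: str) -> List[str]:
    """Extract the column named by fileStringsField from the metadata CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[field] for row in reader]


def build_file_urls(params: Dict, names: List[str]) -> List[str]:
    """Join rootURL, file name and, if required, ext (assumed URL layout)."""
    root = params['rootURL'].rstrip('/') + '/'
    if params['fileStringsIncludeExt']:
        # The names already carry their extensions
        return [root + name for name in names]
    ext = params.get('ext')
    if not ext:
        # ext is mandatory if fileStringsIncludeExt is false
        raise ValueError('ext is mandatory if fileStringsIncludeExt is false')
    return [root + name + ext for name in names]


# Hypothetical parameter values, for illustration only
params = {'rootURL': 'https://example.com/images', 'fileStringsField': 'name',
          'fileStringsIncludeExt': False, 'archived': True, 'ext': '.zip'}
names = read_file_names('name,label\na,0\nb,1\n', params['fileStringsField'])
print(build_file_urls(params, names))
```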


In Progress or Upcoming

  • The switch from dask to multiprocessing

  • Dockerfile

  • GitHub Actions .yml: For (a) automated pytest, coverage, and pylint tests, (b) building & deploying docker images.

  • Automated Tests: GitHub Actions will highlight deficiencies w.r.t. tests and/or conventions

  • Brief, but comprehensible, docstrings throughout
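As a rough idea of what the planned move away from dask could look like, the sketch below maps a fetch function over a list of URLs with a pool from the standard multiprocessing package. It is an assumption-laden illustration, not the project's design: run_in_parallel and its arguments are hypothetical, and it uses multiprocessing.pool.ThreadPool (the thread-backed pool with the same map API) because downloads are mostly I/O-bound. A process-backed multiprocessing.Pool could be swapped in, provided the fetch function is a picklable top-level function.

```python
from multiprocessing.pool import ThreadPool
from typing import Callable, List, Optional


def run_in_parallel(urls: List[str], fetch: Callable[[str], object],
                    limit: Optional[int] = None, workers: int = 4) -> list:
    """Apply fetch to each URL in parallel; results keep the input order."""
    # Honour a --limit style cap: only the first `limit` URLs are processed
    selected = urls if limit is None else urls[:limit]
    if not selected:
        return []
    with ThreadPool(processes=workers) as pool:
        # map blocks until every task has finished
        return pool.map(fetch, selected)


# A stand-in for a real download function, so that the sketch runs offline
def fake_fetch(url: str) -> int:
    return len(url)


print(run_in_parallel(['https://example.com/a.zip', 'https://example.com/b.zip'],
                      fake_fetch, limit=1))
```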



Considerations

  • At present, data is always downloaded into a volume named data. This set-up might be changed such that data is downloaded into a volume whose name is declared in the parameters file.
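The consideration above could be handled with a small lookup that falls back to data. In this sketch the parameter name volume is hypothetical (no such parameter exists in the parameters file at present), and the de-archiving helper merely shows how a downloaded zip payload might be unpacked into the chosen directory with the standard zipfile module.

```python
import io
import zipfile
from pathlib import Path
from typing import List


def target_directory(params: dict, default: str = 'data') -> Path:
    """Return the directory to download into; 'volume' is a hypothetical key."""
    return Path(params.get('volume', default))


def dearchive(payload: bytes, directory: Path) -> List[str]:
    """Unpack the bytes of a zip archive into directory; return the member names."""
    directory.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(path=directory)
        return archive.namelist()


print(target_directory({}))                    # falls back to the current default
print(target_directory({'volume': 'images'}))  # a name declared in the parameters
```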


Environment

The local environment is

  • .../reader: conda create --prefix .../reader

and the requirements are summarised via filter.txt & requirements.txt

  • pip freeze -r docs/filter.txt > requirements.txt

Note:

  • python-graphviz can't be included in filter.txt/requirements.txt; the reason why GitHub Actions rejects it is unclear.

  • Always ensure that the dask entry in requirements.txt is dask[complete]; this avoids GitHub Actions errors.
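The dask[complete] note lends itself to an automated check. The helper below is a hypothetical sketch, not part of the repository: it scans the text of a requirements file and reports whether the dask entry carries the complete extra.

```python
import re


def dask_requirement_is_complete(requirements_text: str) -> bool:
    """Return True if the dask entry is written as dask[complete]."""
    for line in requirements_text.splitlines():
        # Ignore comments and surrounding whitespace
        entry = line.split('#', 1)[0].strip().lower()
        # Match dask itself, but not other packages such as dask-ml
        if re.match(r'^dask(\[|[=<>!~ ]|$)', entry):
            return entry.startswith('dask[complete]')
    # No dask entry at all
    return False


print(dask_requirement_is_complete('dask[complete]==2021.10.0\nrequests\n'))
print(dask_requirement_is_complete('dask==2021.10.0\nrequests\n'))
```

A check of this kind could run before committing requirements.txt, given that pip freeze does not write extras such as [complete] itself.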



Packages

The explicitly installed packages are listed in filter.txt. Foremost:

  conda activate reader
    
  conda install -c anaconda dask==2021.10.0
  conda install -c anaconda python==3.7.10
  conda install -c anaconda pytest coverage pytest-cov pylint
  conda install -c anaconda requests 
  
  # dotmap
  pip install dotmap==1.3.23
  
  # python-graphviz installs graphviz & python-graphviz
  conda install -c anaconda python-graphviz

A few points w.r.t. dask: it might install an old version of Pillow that will trigger a GitHub security alert, hence

  pip install Pillow

Additionally, dask might install an old version of Jinja2 that will trigger a GitHub security alert, hence

  pip install jinja2


About

Utilities: Unloading web files in parallel
