cache specification
- Java CLI interface code (Nobes)
- Review doc (All)
- Unit tests (All)
- Extract existing Java code (Jeremy)
- Paper that tests time series databases for IoT: https://arxiv.org/pdf/1901.08304.pdf
- Comment on caching by jeandet
- Possible locking mechanism: https://stackoverflow.com/questions/11787567/cache-locking-for-lots-of-processes
Usage
java -jar hapi-cache.jar --url "https://server/hapi/data?dataset=...&parameters=...&start=...&stop=...&format={csv,bin}"
java -jar hapi-cache.jar \
--server "https://server/hapi" --dataset=... --parameters=... --start=... --stop=... --format={csv,bin}
The response is CSV or binary, according to --format. When used as a client, the default behavior is to use HTTP headers and the existing cache to decide how to return data (use the cache or make a new request). When used as a server, the default behavior is to use file timestamps (or HTTP headers from the back-end server if used in pass-through mode).
Other options:
--cache-dir DIR
--write-cache {T,F} (write cache if not there)
--read-cache {T,F} (use cache if there)
--expire-after N{y,d,h,m,s} (use this word? Do not use the cache if it was written more than N{y,d,h,m,s} ago; this is a feature of the Python `requests_cache` library; default is never)
--cache-exact (only cache the exact request; leads to fewer cache hits, but a fast cache response if the exact request is made again)
Some code that implements this is located in https://github.com/hapi-server/cache-tools
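As an illustration of the --expire-after option listed above, the following is a minimal Python sketch of how the N{y,d,h,m,s} value could be parsed and applied. The function names `parse_expire_after` and `is_expired` are hypothetical, and treating a year as 365 days is an assumption, since the option list does not define it.

```python
import re
import time

# Seconds per unit. "y" is taken as 365 days, which is an assumption;
# the option description does not define the length of a year.
_UNIT_SECONDS = {"y": 365 * 86400, "d": 86400, "h": 3600, "m": 60, "s": 1}


def parse_expire_after(value):
    """Convert a string such as '12h' to seconds; None means never expire."""
    if value is None:
        return None
    match = re.fullmatch(r"(\d+)([ydhms])", value)
    if match is None:
        raise ValueError(
            f"expected N followed by one of y, d, h, m, s; got {value!r}"
        )
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def is_expired(written_at, expire_after, now=None):
    """Return True if a cache entry written at `written_at` is too old.

    `written_at` and `now` are epoch seconds; `expire_after` is a number
    of seconds, or None for the default of never expiring.
    """
    if expire_after is None:
        return False
    now = time.time() if now is None else now
    return (now - written_at) > expire_after
```

In practice `written_at` could come from the cache file's modification time (for example, `os.path.getmtime`), though whether to use file timestamps or stored HTTP headers is one of the open questions in this document.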
Issues:
- Should metadata (http headers) be cached as well?
- Should the scientist be able to lock the cache so that updates will not occur?
The following is a description of a recommended directory and file schema for programs that cache HAPI data.
HAPI_DATA should be the environment variable that specifies the HAPI cache directory. If it is not set, the logic of Python's tempfile module should be used to find the system temporary directory, and hapi_data should be appended to it; /tmp/hapi_data will be a common default.
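The lookup described above can be sketched in a few lines of Python. The function name `hapi_cache_dir` and its optional `environ` argument are hypothetical and exist only to make the example easy to test; `tempfile.gettempdir()` is the standard-library call that implements the tempfile module's directory logic.

```python
import os
import tempfile


def hapi_cache_dir(environ=None):
    """Return the cache directory: $HAPI_DATA if set, else <system tmp>/hapi_data."""
    environ = os.environ if environ is None else environ
    configured = environ.get("HAPI_DATA")
    if configured:
        return configured
    # tempfile.gettempdir() checks TMPDIR, TEMP, and TMP before falling
    # back to platform defaults such as /tmp.
    return os.path.join(tempfile.gettempdir(), "hapi_data")
```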
Data directory naming: If cadence is given
- cadence < PT1S: files should contain 1 hour of data and be in subdirectory DATASET_ID/$Y/$m/$d/. File names should be $Y$m$dT$H.VARIABLE.EXT.
- PT1S <= cadence <= PT1H: files should contain 1 day of data and be in subdirectory DATASET_ID/$Y/$m/. File names should be $Y$m$d.VARIABLE.EXT.
- cadence > PT1H: files should contain 1 month of data and be in subdirectory DATASET_ID/$Y/. File names should be $Y$m.VARIABLE.EXT.
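The three cadence rules above could be applied as in the following Python sketch. The names `cadence_seconds` and `cache_path` are hypothetical. As a simplifying assumption, the duration parser handles only the day, hour, minute, and second designators of ISO 8601 durations; cadences given in years or months are rejected rather than guessed at.

```python
import re
from datetime import datetime

# Subset of ISO 8601 durations: P[nD][T[nH][nM][n(.n)S]].
_DURATION = re.compile(
    r"P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?)?"
)


def cadence_seconds(cadence):
    """Convert a cadence such as 'PT1M' or 'PT0.5S' to seconds."""
    match = _DURATION.fullmatch(cadence)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported cadence: {cadence!r}")
    days, hours, minutes, seconds = (
        float(group) if group else 0.0 for group in match.groups()
    )
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def cache_path(dataset_id, variable, ext, start, cadence):
    """Return the relative cache file path for the file containing `start`."""
    seconds = cadence_seconds(cadence)
    if seconds < 1:
        # cadence < PT1S: one file per hour.
        directory, stem = f"{start:%Y/%m/%d}", f"{start:%Y%m%dT%H}"
    elif seconds <= 3600:
        # PT1S <= cadence <= PT1H: one file per day.
        directory, stem = f"{start:%Y/%m}", f"{start:%Y%m%d}"
    else:
        # cadence > PT1H: one file per month.
        directory, stem = f"{start:%Y}", f"{start:%Y%m}"
    return f"{dataset_id}/{directory}/{stem}.{variable}.{ext}"
```

For example, `cache_path("A1_K0_MPA", "sc_pot", "csv", datetime(2008, 1, 3), "PT1M")` returns `A1_K0_MPA/2008/01/20080103.sc_pot.csv`.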
If cadence is not given, the caching software should (use the rule ... always do daily (Jeremy)? Or something better defined (Nobes)?) and choose the appropriate directory structure. Because other software writing to the cache may have used different logic, software reading the cache should check all resolutions.
Files should contain only data for the parameter, e.g., 19991201.Time.csv will contain a single column with just the timestamps that are common to all parameters in the dataset. The file 19991201.Parameter1.csv would not contain timestamps. If a user requests Parameter1, a program reading the cache will need to read two files, the Time file and the Parameter1 file, to return the required data for Parameter1.
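The two-file read described above might look like the following Python sketch for CSV files. The function names `read_rows` and `read_parameter` are hypothetical, and the sketch assumes uncompressed files in which the Time file and the parameter file have the same number of records in the same order.

```python
import csv
from pathlib import Path


def read_rows(path):
    """Read a cached CSV file into a list of rows (each a list of strings)."""
    with open(path, newline="") as handle:
        return list(csv.reader(handle))


def read_parameter(directory, stem, parameter, ext="csv"):
    """Join the Time file with one parameter file, record by record.

    `stem` is the date part of the file name, e.g. '19991201'.
    """
    directory = Path(directory)
    times = read_rows(directory / f"{stem}.Time.{ext}")
    values = read_rows(directory / f"{stem}.{parameter}.{ext}")
    if len(times) != len(values):
        raise ValueError("Time and parameter files have different record counts")
    # Each output row is the timestamp followed by the parameter's column(s).
    return [time_row + value_row for time_row, value_row in zip(times, values)]
```

A call such as `read_parameter(cache_subdir, "19991201", "Parameter1")` would then return rows with the timestamp first and the Parameter1 value after it.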
Directory structure for PT1S <= cadence <= PT1H:
hapi_data/
  # http://hapi-server.org/servers/SSCWeb/hapi
  http/
    hapi-server.org/
      servers/
        SSCWeb/
          hapi/
            capabilities.json
            catalog.json
            data/
            info/
  # https://cdaweb.gsfc.nasa.gov/hapi
  https/
    cdaweb.gsfc.nasa.gov/
      hapi/
        capabilities.json
        catalog.json
        data/
          A1_K0_MPA/2008/01/
            20080103.csv{.gz} # All parameters
            20080103.binary{.gz} # All parameters
            20080103.Time.csv{.gz} # Single column
            20080103.Time.binary{.gz}
            20080103.sc_pot.csv{.gz} # Single column
            20080103.sc_pot.binary{.gz}
            ...
          AC_AT_DEF/2009/02/
            ...
        info/
          A1_K0_MPA.json
          AC_AT_DEF.json
          ...
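The mapping from a server URL to its directory in the tree above (scheme, then host, then each path segment) can be sketched in Python as follows. The function name `server_cache_dir` is hypothetical, and the sketch does not address hosts with port numbers or characters that are invalid in file names, which the schema above does not cover.

```python
import os
from urllib.parse import urlsplit


def server_cache_dir(cache_root, server_url):
    """Map a HAPI server URL to its directory under the cache root."""
    parts = urlsplit(server_url)
    # Drop empty segments produced by leading or trailing slashes.
    segments = [segment for segment in parts.path.split("/") if segment]
    return os.path.join(cache_root, parts.scheme, parts.netloc, *segments)
```

With this mapping, `https://cdaweb.gsfc.nasa.gov/hapi` is stored under `hapi_data/https/cdaweb.gsfc.nasa.gov/hapi`, matching the layout shown above.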
Thread safety. As we develop, continue to ask if this can be added later without complication.
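One technique that could probably be layered on later without changing the directory layout is to write each cache file to a temporary name and then rename it into place, so that readers never see a partially written file. The Python sketch below shows the idea; it is not a decision of this specification, the function name `write_cache_file_atomically` is hypothetical, and it does not by itself prevent two writers from duplicating work.

```python
import os
import tempfile


def write_cache_file_atomically(path, data):
    """Write bytes to `path` so that readers never observe a partial file."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # Create the temporary file in the target directory so that the final
    # rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # os.replace overwrites any existing file at `path` in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```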