Wanted to get your thoughts.
In my use case, I actually need two usage modes: one is like yours, which I affectionately term Upsert/Clobber mode, and the other I term Upsert/Storage mode. I'm open to other names as well.
The current behavior is the Clobber mode. It watches S3 buckets that match the patterns in the config and bucket manifest, performs the upsert when file changes are found, and then deletes the S3 data files, leaving the manifest untouched.
Storage mode would be designed for S3 buckets that hold semi-permanent data. It would also watch S3 buckets that match the patterns in the config and bucket manifest and perform the upsert, but instead of deleting the S3 data files it would delete the bucket's manifest (upon successful data load into Redshift), so that the bucket is no longer watched and the data is not loaded again. This is useful for buckets that contain dated subfolders (e.g. 2015-01-13) holding the files for that date: the data is loaded into Redshift without being deleted, and need not be loaded again.
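To make the difference concrete, here is a rough Python sketch of what each mode would remove after a successful load. Nothing here is Blueshift code: the `Mode` enum, the `keys_to_delete` function, and the example key names are all hypothetical, and the function only computes which keys to remove rather than talking to S3.

```python
from enum import Enum


class Mode(Enum):
    CLOBBER = "clobber"  # current behavior
    STORAGE = "storage"  # proposed behavior


def keys_to_delete(mode, data_keys, manifest_key):
    """Return the S3 keys to remove after a successful Redshift load.

    Hypothetical helper for illustration only.
    """
    if mode is Mode.CLOBBER:
        # Remove the loaded data files; the manifest stays, so the
        # location keeps being watched for new files.
        return list(data_keys)
    if mode is Mode.STORAGE:
        # Keep the data files; remove only the manifest, so the
        # location is no longer watched and is not loaded again.
        return [manifest_key]
    raise ValueError(f"unknown mode: {mode!r}")


# Example with made-up keys for a dated subfolder.
data = ["events/2015-01-13/a.tsv", "events/2015-01-13/b.tsv"]
manifest = "events/2015-01-13/manifest.edn"
clobber_result = keys_to_delete(Mode.CLOBBER, data, manifest)
storage_result = keys_to_delete(Mode.STORAGE, data, manifest)
```

In both modes the cleanup would only run after the load succeeds; the upsert itself is the same, and only the set of deleted keys differs.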
Other modes could potentially be created if necessary or desired.
I haven't implemented this yet, but am considering it.
My questions are:
1. What are your thoughts on this in general?
2. Do you desire this kind of behavior (or something similar)?
3. Would you eventually want something like this merged back into Blueshift?
4. What are your thoughts on how these modes should be specified? Config file semantics? Manifest file semantics?
Thanks for your thoughts.
-Avram