Issues while updating a dataset (re-ingesting updated data files) #80
Description
Hi AGDC Team,
While testing WOfS ingestion, I found an issue.
I downloaded some WOfS files from http://dapds00.nci.org.au/thredds/catalog/fk4/wofs/current/extents into a directory on my machine.
Then I ran the ingest command for the first time, e.g.:
agdc/ingest/wofs.py --source /home/adminprod/data1/rs0/tiles/wofs/
Ingestion of the data files is processed successfully.
Then I want to test ingestion of data that already exists in the Data Cube (the source files have been updated and I want to update my Data Cube accordingly).
To do this, I change the modification date of the source files (with the Linux 'touch' command).
The datetime of the data files is now greater than the datetime of the dataset in the database.
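The "touch" step above can be sketched in Python; os.utime is the standard-library equivalent of the 'touch' command, and the path in the comment is only illustrative:

```python
import os
import time


def mark_as_updated(path):
    # Bump the file's access and modification times to 'now',
    # simulating an updated source file (same effect as 'touch').
    now = time.time()
    os.utime(path, (now, now))


# Illustrative use on one of the downloaded WOfS tiles:
# mark_as_updated('/home/adminprod/data1/rs0/tiles/wofs/'
#                 'LS7_ETM_WATER_115_-035_2011-01-10T01-59-19.155557.tif')
```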
I run agdc/ingest/wofs.py --source /home/adminprod/data1/rs0/tiles/wofs/ again and get the following exception:
2015-08-04 11:56:02,123 agdc.ingest.tile_contents INFO Tile already in place: '/home/adminprod/data1/rs0/tiles/wofs/LS7_ETM_WATER_115_-035_2011-01-10T01-59-19.155557.tif'
2015-08-04 11:56:02,217 agdc.ingest.core INFO Ingestion complete for dataset '/home/adminprod/data1/rs0/tiles/wofs/LS7_ETM_WATER_115-035_2011-01-10T01-59-19.155557.tif' in 0:00:00.197192.
Traceback (most recent call last):
File "/home/adminprod/agdc-develop/agdc/ingest/wofs.py", line 97, in
agdc.ingest.run_ingest(WofsIngester)
File "/home/adminprod/agdc-develop/agdc/ingest/_core.py", line 586, in run_ingest
ingester.ingest(ingester.args.source_dir)
File "/home/adminprod/agdc-develop/agdc/ingest/_core.py", line 186, in ingest
self.ingest_individual_dataset(dataset_path)
File "/home/adminprod/agdc-develop/agdc/ingest/core.py", line 207, in ingest_individual_dataset
self.tile(dataset_record, dataset)
File "/home/adminprod/agdc-develop/agdc/ingest/pretiled.py", line 312, in tile
dataset_record.store_tiles([tile_contents])
File "/home/adminprod/agdc-develop/agdc/ingest/dataset_record.py", line 238, in store_tiles
return [self.create_tile_record(tile_contents) for tile_contents in tile_list]
File "/home/adminprod/agdc-develop/agdc/ingest/dataset_record.py", line 320, in create_tile_record
size_mb=tile_contents.get_output_size_mb(),
File "/home/adminprod/agdc-develop/agdc/ingest/tile_contents.py", line 174, in get_output_size_mb
return get_file_size_mb(path)
File "/home/adminprod/agdc-develop/agdc/cube_util.py", line 109, in get_file_size_mb
return os.path.getsize(path) // (1024 * 1024)
File "/usr/lib/python2.7/genericpath.py", line 49, in getsize
return os.stat(filename).st_size
OSError: [Errno 2] No such file or directory: '/home/adminprod/data1/rs0/tiles/wofs/LS7_ETM_WATER_115-035_2011-02-27T01-59-34.560472.tif'
2015-08-04 11:56:02,352 agdc.ingest.core ERROR Unexpected error during path '/home/adminprod/data1/rs0/tiles/wofs/LS7_ETM_WATER_115-035_2011-02-27T01-59-34.560472.tif'
After some investigation, I think the issue is due to the fact that the data file is removed in the '__commit' function of the 'collection.py' module, i.e.:
# Remove tile files just after the commit, to avoid removing
# tile files when the deletion of a tile record has been rolled
# back. Again, tile files without records are possible if there
# is an exception or crash just after the commit.
#
# The tile remove list is filtered against the tile create list
# to avoid removing a file that has just been re-created. It is
# a bad idea to overwrite a tile file in this way (in a single
# transaction), because it will be overwritten just before the
# commit (above) and the wrong file will be in place if the
# transaction is rolled back.
tile_create_set = {t.get_output_path()
                   for t in self.tile_create_list}
for tile_pathname in self.tile_remove_list:
    if tile_pathname not in tile_create_set:
        if os.path.isfile(tile_pathname):
            os.remove(tile_pathname)
To be able to ingest the updated source data files again, I have commented out the 'os.remove' instruction above.
Note that if the source data has not been updated (i.e. the date of the source file equals the date of the database dataset), there is no issue.
Note that if I run the ingestion again, the issue does not always occur on the same file: sometimes it is the first file, sometimes the nth file.
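As an alternative to commenting out 'os.remove' entirely, the crash site in the traceback ('get_file_size_mb' in 'cube_util.py') could be made tolerant of a file that disappears between the commit cleanup and the size lookup. A minimal sketch, assuming the callers can handle a None size:

```python
import os


def get_file_size_mb(path):
    # Return the file size in whole megabytes, or None if the file
    # no longer exists (e.g. it was removed by the __commit cleanup
    # while a re-ingestion of the same dataset was in progress).
    try:
        return os.path.getsize(path) // (1024 * 1024)
    except OSError:
        return None
```

This is only a defensive workaround for the symptom; the underlying race between the remove list and re-created tile files would still need a proper fix.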