5 changes: 3 additions & 2 deletions README.md
@@ -83,10 +83,10 @@ You will first need to create a Cornell lab account to [ask for access to and do
- Run the `sbatch filter2.awk raw_ebd_data_file.txt` script to group observations by locality. The script in the repository is the one used to create the USA dataset, where we kept checklists from June, July, December, and January to prepare the USA-summer and USA-winter datasets. For Kenya, no month filter was applied.
This will create one file per hotspot; the files are organized by hotspot name into `split-XXX/` folders, where each folder should only contain .csv files for hotspots `L*XXX`.
- Run `data_processing/ebird/find_checklists.py` to filter the checklists you want to keep based on the EBD sampling event data. Typically this keeps only complete checklists and creates (for the USA dataset) two .csv files with the hotspots to keep, corresponding to observations made in summer and winter (see the sketch after this list).
- To get the targets, run `data_processing/ebird/get_winter_targets.py`. Change the paths to the ones where you created the `split-XXX/` folders.
- To get the targets, run `data_processing/ebird/get_targets.py`. Change the paths to the ones where you created the `split-XXX/` folders.
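
A minimal sketch of the kind of filtering `find_checklists.py` performs is shown below. The column names follow the eBird sampling event file, but the input and output file names (`ebd_sampling_events.txt`, `summer_hotspots.csv`, `winter_hotspots.csv`) are placeholders, not the names used by the script:

```python
import pandas as pd

# Hedged sketch: keep only complete checklists and split hotspots into summer / winter lists.
# Verify the column names against your EBD sampling event download.
sampling = pd.read_csv("ebd_sampling_events.txt", sep="\t", low_memory=False)

complete = sampling[sampling["ALL SPECIES REPORTED"] == 1].copy()
complete["month"] = pd.to_datetime(complete["OBSERVATION DATE"]).dt.month

summer = complete[complete["month"].isin([6, 7])]
winter = complete[complete["month"].isin([12, 1])]

summer["LOCALITY ID"].drop_duplicates().to_csv("summer_hotspots.csv", index=False)
winter["LOCALITY ID"].drop_duplicates().to_csv("winter_hotspots.csv", index=False)
```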


- Download the **Sentinel-2 data** using `data_processing/satellite/download_rasters_from_planetary_computer.py`. To reproduce all experiments, you will have to run it twice, specifying `BANDS` as `["B02", "B03", "B04", "B08"]` for the BGRNIR reflectance data and `"visual"` for the RGB visual component. A useful function is `process_row`, which extracts the least cloudy image (with your specified bands) over the specified period of time with a maximum of 10% cloud cover. For some hotspots, you may be able to extract the visual component but find incomplete items or no items at all for the R, B, G, NIR reflectance data with the 10% cloud cover criterion. For those hotspots, you can replace `process_row` with the `process_row_mosaic` function to allow the extracted image to be a mosaic of the images found over the specified period of time.
- Download the **Sentinel-2 data** using `data_processing/satellite/download_rasters_from_planetary_computer.py`. To reproduce all experiments, you will have to run it twice, specifying `BANDS` as `["B02", "B03", "B04", "B08"]` for the BGRNIR reflectance data and `"visual"` for the RGB visual component. A useful function is `process_row`, which extracts the least cloudy image (with your specified bands) over the specified period of time with a maximum of 10% cloud cover. For some hotspots, you may be able to extract the visual component but find incomplete items or no items at all for the R, B, G, NIR reflectance data with the 10% cloud cover criterion. For those hotspots, you can replace `process_row` with the `process_row_mosaic` function to allow the extracted image to be a mosaic of the images found over the specified period of time. We set the cloud cover criterion to 20% in this second phase.
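
As a rough illustration of the least-cloudy query behind `process_row`, the sketch below searches the Planetary Computer STAC API for one hotspot; the point coordinates, the date range, and the handling of the returned assets are placeholders and may differ from the actual script:

```python
import planetary_computer
from pystac_client import Client

# Hedged sketch: find the least cloudy Sentinel-2 L2A item over one hotspot (placeholder lon/lat).
catalog = Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
search = catalog.search(
    collections=["sentinel-2-l2a"],
    intersects={"type": "Point", "coordinates": [-76.5, 42.4]},
    datetime="2022-06-01/2022-07-31",
    query={"eo:cloud_cover": {"lt": 10}},  # relax to 20 for the mosaic pass
)
items = list(search.items())
if items:
    best = min(items, key=lambda item: item.properties["eo:cloud_cover"])
    hrefs = {band: best.assets[band].href for band in ["B02", "B03", "B04", "B08"]}
```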

- You can further clean the dataset using the functions in `data_processing/ebird/clean_hotspots.py` and filter out:
  - hotspots that are not within the bounds of a given shapefile geometry. This was used for the USA dataset to filter out hotspots that fall in the middle of the ocean rather than in the contiguous USA (see the sketch below).
@@ -98,6 +98,7 @@ This will create one file per hotspot; the files are organized by hotspot name
- Get **range maps** using `data_processing/ebird/get_range_maps.py`. This requires shapefiles that you can obtain through [ebirdst](https://ebird.github.io/ebirdst/). You can then save a .csv of all combined range maps using `/satbird/data_processing/utils/save_range_maps_csv_v2.py`.
-
- For the environmental data variables, download the rasters of the country of interest from [WorldClim](https://www.worldclim.org/) (and [SoilGrids](https://soilgrids.org/) for the USA dataset).
For the USA dataset, we used the USA rasters that were available from the GeoLifeCLEF 2020 dataset, and for Kenya, we used the procedure described in `data_processing/environmental/extract_bioclimatic_variables_kenya.ipynb`.
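
For the shapefile-bounds cleaning step above, a minimal sketch with geopandas could look like the following; the file names and the `lon`/`lat` column names are assumptions rather than what `clean_hotspots.py` actually uses:

```python
import geopandas as gpd
import pandas as pd

# Hedged sketch: keep only hotspots whose point geometry falls inside the shapefile geometry.
hotspots = pd.read_csv("all_summer_hotspots.csv")               # assumed columns: hotspot_id, lon, lat
usa = gpd.read_file("contiguous_usa.shp").to_crs("EPSG:4326")

points = gpd.GeoDataFrame(
    hotspots,
    geometry=gpd.points_from_xy(hotspots["lon"], hotspots["lat"]),
    crs="EPSG:4326",
)
inside = gpd.sjoin(points, usa, predicate="within", how="inner")
inside.drop(columns="index_right").to_csv("hotspots_in_bounds.csv", index=False)
```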


- Use `data_processing/environmental/get_csv_env.py` to get point data for the environmental variables (rasters of size 1 centered on the hotspots). Note that you should not fill NaN values until you have done the train-validation-test split, so that you can fill them with the means over the training set. These point data variables are used for the mean encounter rate, environmental, and MOSAIKS baselines.
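
The sketch below illustrates the point-extraction idea (sampling each environmental raster at the hotspot coordinates) and the deferred NaN filling; the raster and .csv file names, the column names, and the split are placeholders, not the actual behaviour of `get_csv_env.py`:

```python
import numpy as np
import pandas as pd
import rasterio

# Hedged sketch: sample one environmental raster at each hotspot location (a "raster of size 1").
hotspots = pd.read_csv("hotspots_in_bounds.csv")              # assumed columns: hotspot_id, lon, lat
coords = list(zip(hotspots["lon"], hotspots["lat"]))

with rasterio.open("wc2.1_30s_bio_1.tif") as src:             # one WorldClim variable, as an example
    sampled = [values[0] for values in src.sample(coords)]
    hotspots["bio_1"] = [np.nan if v == src.nodata else v for v in sampled]

# Fill NaNs only after the train-validation-test split, using means computed on the training split.
train = hotspots.sample(frac=0.7, random_state=0)             # placeholder split
hotspots["bio_1"] = hotspots["bio_1"].fillna(train["bio_1"].mean(skipna=True))
```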
30 changes: 12 additions & 18 deletions data_processing/environmental/bound_data.py
@@ -1,5 +1,11 @@
"""
Post-processing of environmental data after extraction.
This assumes the rasters were originally extracted to a folder named "environmental"
and filled by interpolation using fill_env_nans.py, with the results saved to a folder named "environmental_filled".
The remaining NaN values in the rasters are filled with the mean of each environmental variable over the training set, and the filled rasters are saved to "environmental_temp".
The min and max ranges of the original (not filled) rasters are computed, used to bound the data, and the bounded rasters are saved to a folder named "environmental_temp_2".

We subsequently renamed "environmental_temp_2" to "environmental"; this is what is in the released dataset.
"""
import argparse
import functools
@@ -27,7 +33,7 @@ def bound_env_data(root_dir, mini, maxi):
bound env data after the interpolation
"""

rasters = glob.glob(root_dir + "/environmental/*.npy") # '/network/projects/_groups/ecosystem-
rasters = glob.glob(root_dir + "/environmental_temp/*.npy") # '/network/projects/_groups/ecosystem-

for raster_file in tqdm(rasters):
file_name = os.path.basename(raster_file)
@@ -47,7 +53,7 @@ def fill_nan_values(root_dir, dataframe_name="all_summer_hotspots_withnan.csv"):
"""
fill values that still have nans after interpolation with mean point values
"""
rasters = glob.glob(os.path.join(root_dir, "environmental_bounded_2", "*.npy"))
rasters = glob.glob(os.path.join(root_dir, "environmental_filled", "*.npy"))
dst = os.path.join(root_dir, "environmental_temp")

train_df = pd.read_csv(os.path.join(root_dir, dataframe_name))
@@ -83,7 +89,7 @@ def compute_min_max_ranges(root_dir):
"""
computes minimum and maximum of env data
"""
rasters = glob.glob(os.path.join(root_dir, "environmental_bounded_2", "*.npy"))
rasters = glob.glob(os.path.join(root_dir, "environmental", "*.npy"))

nan_count = 0

Expand Down Expand Up @@ -133,24 +139,12 @@ def remove_files(root_dir):


if __name__ == '__main__':

root_dir = "/network/projects/ecosystem-embeddings/SatBird_data_v2/USA_summer"
fill_nan_values(root_dir=root_dir)
# move_missing_file(root_dir=root_dir)
# remove_files(root_dir=root_dir)
mini, maxi = compute_min_max_ranges(root_dir=root_dir)

# mini = [-7.0958333, 1., 21.74928093, 0., 0.5,
# -24.20000076, 1., -11.58333302, -15.30000019, -1.58333325,
# -15.41666698, 54., 9., 0., 5.00542307,
# 24., 1., 2., 13., 0.,
# 221., 2., 0., 0., 34.,
# 0., 0.]
# maxi = [2.56291656e+01, 2.22333336e+01, 1.00000000e+02, 1.36807947e+03,
# 4.62000008e+01, 1.87999992e+01, 5.16999969e+01, 3.36666679e+01,
# 3.36666679e+01, 3.63833351e+01, 2.18833332e+01, 3.40200000e+03,
# 5.59000000e+02, 1.75000000e+02, 1.10832680e+02, 1.55500000e+03,
# 5.50000000e+02, 6.52000000e+02, 1.46100000e+03, 1.12467000e+05,
# 1.81500000e+03, 2.50000000e+02, 8.10000000e+01, 5.24000000e+02,
# 9.80000000e+01, 8.30000000e+01, 9.90000000e+01]

bound_env_data(root_dir=root_dir, mini=mini, maxi=maxi)

