# Required Datasets
SABER requires GIS and discharge data. These data need to be collected and processed into a standard format that the `saber-hbc` functions know how to process. The hardest part of using SABER is preparing these datasets. You provide them to `saber-hbc` through a `config.yml` file. An example of this file is found in the `examples` directory of the source repository and in these docs. The datasets must be prepared independently of SABER; no preparation scripts are provided because there are too many possible source file formats to account for. The following descriptions of each input dataset should help you prepare your data.
```yaml
# file paths to input data
workdir: ''
cluster_data: ''
drain_table: ''
gauge_table: ''
regulate_table: ''
drain_gis: ''
gauge_gis: ''
gauge_data: ''
hindcast_zarr: ''

# options for processing data
n_processes: 1
```
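A sketch of loading and sanity-checking this file with PyYAML is shown below. The key names come from the example above; `load_config` is a hypothetical helper, not part of the `saber-hbc` API.

```python
import yaml  # PyYAML

REQUIRED_KEYS = (
    "workdir", "cluster_data", "drain_table", "gauge_table", "regulate_table",
    "drain_gis", "gauge_gis", "gauge_data", "hindcast_zarr", "n_processes",
)

def load_config(path: str) -> dict:
    """Read config.yml and verify every expected key is present."""
    with open(path) as f:
        config = yaml.safe_load(f)
    missing = [k for k in REQUIRED_KEYS if k not in config]
    if missing:
        raise KeyError(f"config.yml is missing keys: {missing}")
    return config
```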
## Required Datasets
### workdir

`workdir` should be a path to a directory on your computer where the results of the SABER process are cached. The `saber-hbc` package will create the necessary subdirectories and populate them with files as functions are executed. You need read/write access to this directory and roughly 1 GB of free space.
### cluster_data

`cluster_data` is a table of training data used to cluster the watersheds/subbasins, in parquet or csv format. The recommended features are z-score transformed (standard scaler) flow duration curve values for the exceedance probabilities from 100 to 0 in increments of 2.5, for a total of 41 features. Many other physical features can be included but were not as thoroughly investigated during the research of the SABER method. A mockup of the table structure and required properties is given below. You can find an example of this table in the zipped sample data.

- It should be a table in the usual machine learning shape of [n_samples, n_features]: one row per subbasin and one column per feature.
- Each subbasin (row) should be z-scaled individually, not each column of the combined dataset.
- The index is the `model_id` of each subbasin.
- The columns are the features of each subbasin.
- The data type should be float.
- The index should be unique.
| model_id | Q100 | Q97.5 | Q95 | ... | Q5 | Q2.5 | Q0 |
|---|---|---|---|---|---|---|---|
| 1 | 60 | 50 | 40 | ... | 10 | 5 | 0 |
| 2 | 60 | 50 | 40 | ... | 10 | 5 | 0 |
| 3 | 60 | 50 | 40 | ... | 10 | 5 | 0 |
| 4 | 60 | 50 | 40 | ... | 10 | 5 | 0 |
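The table above can be built from simulated discharge series with pandas and numpy. The sketch below is one way to do it, assuming one discharge series per subbasin; the toy series and variable names are illustrative, not part of `saber-hbc`.

```python
import numpy as np
import pandas as pd

def fdc_features(discharge: np.ndarray) -> np.ndarray:
    """Flow duration curve values at exceedance probabilities 100..0 in steps of 2.5."""
    exceedance = np.arange(100, -0.1, -2.5)  # 41 probabilities: 100, 97.5, ..., 0
    # the flow exceeded p% of the time is the (100 - p)th percentile
    return np.percentile(discharge, 100 - exceedance)

def z_scale(row: np.ndarray) -> np.ndarray:
    """Z-score one subbasin's FDC individually (standard scaler per row)."""
    return (row - row.mean()) / row.std()

# toy discharge series keyed by model_id
series = {1: np.random.default_rng(0).gamma(2, 10, 1000),
          2: np.random.default_rng(1).gamma(2, 10, 1000)}
columns = [f"Q{p:g}" for p in np.arange(100, -0.1, -2.5)]  # Q100, Q97.5, ..., Q0
cluster_data = pd.DataFrame(
    {mid: z_scale(fdc_features(q)) for mid, q in series.items()},
    index=columns).T
cluster_data.index.name = "model_id"
# cluster_data.to_parquet('cluster_data.parquet')
```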
### drain_table

`drain_table` is a table of properties of the stream/catchment network used in the hydrologic model, in parquet or csv format. The table should have exactly the following features and names:

- `model_id`: A unique identifier/ID; any alphanumeric UTF-8 string will suffice.
- `downstream_model_id`: The ID of the next downstream reach, used to trace the network programmatically.
- `model_area`: The area of each subbasin in km^2.
- `strahler_order`: The Strahler stream order of each reach.
- `x`: The x coordinate of the centroid of each feature (precalculated for faster results later).
- `y`: The y coordinate of the centroid of each feature (precalculated for faster results later).
| downstream_model_id | model_id | model_area | strahler_order | x | y |
|---|---|---|---|---|---|
| unique_stream_# | unique_stream_# | area in km^2 | stream_order | ## | ## |
| ... | ... | ... | ... | ... | ... |
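The `downstream_model_id` column is what makes programmatic network tracing possible. A minimal sketch of walking a reach downstream to the outlet is shown below; the toy table and the terminal ID of -1 are assumptions, so use whatever outlet convention your model uses.

```python
import pandas as pd

# toy drain_table: reach 3 drains to 2, 2 to 1, 1 to the outlet (-1)
drain_table = pd.DataFrame({
    "model_id": [1, 2, 3],
    "downstream_model_id": [-1, 1, 2],
    "model_area": [12.0, 8.0, 5.0],
    "strahler_order": [3, 2, 1],
    "x": [0.0, 1.0, 2.0],
    "y": [0.0, 1.0, 2.0],
})

def trace_downstream(table: pd.DataFrame, start_id, terminal_id=-1) -> list:
    """Return the chain of model_ids from start_id down to the outlet."""
    downstream = table.set_index("model_id")["downstream_model_id"]
    path, current = [], start_id
    while current != terminal_id:
        path.append(current)
        current = downstream[current]
    return path
```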
### gauge_table

`gauge_table` is a table of properties of the available river gauges, in parquet or csv format. Each gauge needs a unique ID and must be paired with the unique ID of the model subbasin it sits on. The table must have exactly the following features and names:

- `gauge_id`: A unique identifier/ID; any alphanumeric UTF-8 string will suffice.
- `model_id`: The ID of the stream segment which corresponds to that gauge.
- `latitude`: The latitude of the gauge.
- `longitude`: The longitude of the gauge.
| model_id | gauge_id | latitude | longitude |
|---|---|---|---|
| unique_stream_num | unique_gauge_num | 50 | -20 |
| unique_stream_num | unique_gauge_num | 40 | -70 |
| ... | ... | ... | ... |
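Since every `model_id` in the gauge table must match a reach in the drain table, it is worth checking the pairing before running SABER. A minimal sketch with toy IDs:

```python
import pandas as pd

drain_table = pd.DataFrame({"model_id": ["r1", "r2", "r3"]})
gauge_table = pd.DataFrame({
    "gauge_id": ["g1", "g2"],
    "model_id": ["r2", "r9"],  # r9 does not exist in the drain table
    "latitude": [50.0, 40.0],
    "longitude": [-20.0, -70.0],
})

# gauges whose model_id has no match in the stream network
unmatched = gauge_table[~gauge_table["model_id"].isin(drain_table["model_id"])]
assert gauge_table["gauge_id"].is_unique
```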
### regulate_table

`regulate_table` is a table describing the locations of dams, reservoirs, or other regulatory structures, in parquet or csv format. Each structure needs a unique ID and must be paired with the unique ID of the model subbasin that contains it. The table must have exactly the following features and names:

- `regulate_id`: A unique identifier/ID; any alphanumeric UTF-8 string will suffice.
- `model_id`: The ID of the stream segment which corresponds to this regulatory structure.
| model_id | regulate_id |
|---|---|
| unique_stream_num | unique_regulate_num |
| unique_stream_num | unique_regulate_num |
| ... | ... |
### drain_gis

`drain_gis` is a geopackage or equivalent of the stream network used in the hydrologic model. It should have exactly the same properties listed in the `drain_table` above. The information is the same, but in a spatial format it can be used to make maps of the network and the SABER results.
### gauge_gis

`gauge_gis` is a geopackage or equivalent of the river gauges used for validation. It should have exactly the same properties listed in the `gauge_table` above. The information is the same, but in a spatial format it can be used to make maps of the gauges and the SABER results.
### gauge_data

`gauge_data` is a directory that contains the observed discharge data. It should contain a subdirectory called `observed_data` which holds the csv files of the river gauge data. Each file should be named with the `gauge_id` of the gauge it corresponds to. Other extra GIS datasets and the gauge table may be included here as well.
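This layout can be produced with pandas and `pathlib`. The sketch below writes one csv per gauge into the `observed_data` subdirectory; the toy data and column names are illustrative.

```python
from pathlib import Path
import tempfile

import pandas as pd

gauge_data = Path(tempfile.mkdtemp())  # in practice, the gauge_data path from config.yml
obs_dir = gauge_data / "observed_data"
obs_dir.mkdir()

# toy observed discharge keyed by gauge_id
observations = {
    "g1": pd.DataFrame({"datetime": ["2000-01-01", "2000-01-02"], "flow": [10.0, 12.0]}),
    "g2": pd.DataFrame({"datetime": ["2000-01-01", "2000-01-02"], "flow": [3.0, 4.0]}),
}
for gauge_id, df in observations.items():
    # one file per gauge, named with its gauge_id
    df.to_csv(obs_dir / f"{gauge_id}.csv", index=False)
```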
### hindcast_zarr

`hindcast_zarr` is a zarr file (directory). The SABER code expects this to be a series of zarr files which cover the same historical time frame but in separate chunks. SABER was developed to use zarr files converted from RAPID netCDF outputs. Each file should have globally unique IDs, and the zarrs are concatenated along the `rivid` dimension.
## Processing Options
### n_processes

`n_processes` is an integer that specifies the number of processes to use for parallel processing. SABER computations are operations on dataframes which are easily parallelizable. This is the number of workers used in a Python multiprocessing `Pool`. It should probably be less than or equal to the number of cores on your machine.
## FAQ, Tips, Troubleshooting
### GIS Datasets

Be sure that all GIS datasets:

- Are in the same projected coordinate system.
- Only contain gauges and reaches within the area of interest. Clip/delete everything else.

You may find it helpful to also have the catchments, adjoint catchments, and a watershed boundary polygon for visualization purposes.
### Discharge Datasets

Be sure that all the discharge datasets (simulated and observed):

- Are in the same units (e.g. m^3/s).
- Are in the same time zone (e.g. UTC).
- Use the same time step (e.g. daily average).
- Do not contain any non-numeric values (e.g. "ICE", "none").
- Do not contain any negative values (e.g. -9999). If negative values are fill values for null, commonly -9999, delete those rows from the dataset.
- Do not contain rows with missing values (e.g. NaN or blank cells).
- Have been cleaned of any other incorrect values (e.g. cells containing letters).
- Do not contain any duplicate rows.
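This checklist can be applied mechanically with pandas. A minimal sketch follows; the -9999 fill value and the column names are illustrative.

```python
import pandas as pd

# toy raw gauge file with an ice flag, fill values, and a duplicate row
raw = pd.DataFrame({
    "datetime": ["2000-01-01", "2000-01-02", "2000-01-03", "2000-01-03", "2000-01-04"],
    "flow": ["10.0", "ICE", "-9999", "-9999", "12.5"],
})

df = raw.copy()
# coerce non-numeric entries (e.g. 'ICE') to NaN
df["flow"] = pd.to_numeric(df["flow"], errors="coerce")
# drop negative fill values (NaN rows fail the comparison and are dropped too)
df = df[df["flow"] >= 0]
# drop any remaining missing values and duplicate rows
df = df.dropna(subset=["flow"]).drop_duplicates()
```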