SABER requires GIS and discharge data. These data need to be collected and processed into a standard format that the
functions know how to read. Preparing these datasets is the hardest part of using SABER. You provide the datasets to
saber-hbc through a config.yml file. An example of this file is found in the examples directory of the source
repository and in these docs. The datasets need to be prepared independently of SABER, and no scripts are provided
because there are too many possible file formats to account for. An explanation of what each input dataset should look
like is provided to assist in preparing your data.
    # file paths to input data
    workdir: ''
    cluster_data: ''
    drain_table: ''
    gauge_table: ''
    regulate_table: ''
    drain_gis: ''
    gauge_gis: ''
    gauge_data: ''
    hindcast_zarr: ''

    # options for processing data
    n_processes: 1
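Before running anything, it can help to confirm that every entry of the config has been filled in. The following is a minimal sketch, not part of saber-hbc: it assumes the config has already been loaded into a Python dict (for example with a YAML parser), and it treats every path key as required, which is an assumption. The function name check_config and the example paths are hypothetical.

```python
# Keys shown in the example config.yml.
PATH_KEYS = [
    "workdir", "cluster_data", "drain_table", "gauge_table", "regulate_table",
    "drain_gis", "gauge_gis", "gauge_data", "hindcast_zarr",
]


def check_config(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = []
    for key in PATH_KEYS:
        value = config.get(key)
        # Assumption: every path entry must be a non-empty string.
        if not isinstance(value, str) or not value:
            problems.append(f"{key}: expected a non-empty path string")
    n_processes = config.get("n_processes")
    is_int = isinstance(n_processes, int) and not isinstance(n_processes, bool)
    if not is_int or n_processes < 1:
        problems.append("n_processes: expected an integer >= 1")
    return problems


# Hypothetical values; replace with paths on your machine.
example = {key: f"/data/saber/{key}" for key in PATH_KEYS}
example["n_processes"] = 1
print(check_config(example))  # an empty list means no problems were found
```

This only checks that values are present and of the expected type. It does not check that the paths exist or that the files have the right contents.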
workdir should be a path to a directory on your computer where the results of the SABER process are cached. The
saber-hbc package will create the necessary subfolders and populate them with files as the functions that produce them
are executed. You will need read/write access to this directory and up to about 1 GB of free space.
cluster_data is a table of training data used to cluster the watersheds/subbasins, in parquet or csv format. The
recommended features are z-score transformed (standard scaler) flow duration curve values for exceedance probabilities
from 100 to 0 in increments of 2.5, for a total of 41 features. Many other physical features can be included, but they
were not as thoroughly investigated during the research of the SABER method. The required table structure is described
below. You can find an example of this table in the zipped sample data.
- It should be a table in the usual machine learning shape of [n_samples, n_features]: 1 row per subbasin and 1 column per feature.
- Each subbasin (row) should be z-scaled individually, not each column of the combined dataset.
- The index is the model_id of each subbasin.
- The columns are the features of each subbasin.
- The data type should be float.
- The index should be unique.
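The recommended features can be sketched for a single subbasin as follows. This is an illustration under stated assumptions, not the code used by saber-hbc: the helper names (percentile, fdc_features), the linear interpolation between sorted flows, the use of the population standard deviation, and the flow values are all assumptions made for the example.

```python
import statistics

# Exceedance probabilities from 100 to 0 in steps of 2.5 (41 values).
EXCEEDANCE = [100 - 2.5 * i for i in range(41)]


def percentile(sorted_values, pct):
    """Linearly interpolated percentile (0-100) of an ascending sorted list."""
    if len(sorted_values) == 1:
        return float(sorted_values[0])
    pos = (len(sorted_values) - 1) * pct / 100
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * frac


def fdc_features(flows):
    """Return 41 z-scored flow duration curve values for one subbasin."""
    ordered = sorted(flows)
    # The flow exceeded p% of the time is the (100 - p)th percentile.
    fdc = [percentile(ordered, 100 - p) for p in EXCEEDANCE]
    # Z-score this subbasin's own values (row-wise scaling), not a column.
    mean = statistics.fmean(fdc)
    std = statistics.pstdev(fdc)
    if std == 0:
        raise ValueError("a constant flow series cannot be z-scored")
    return [(value - mean) / std for value in fdc]


# Hypothetical simulated flows keyed by model_id (made-up values).
simulated = {
    "1001": [1.0, 2.0, 4.0, 8.0, 16.0, 3.0, 5.0],
    "1002": [10.0, 12.0, 9.0, 30.0, 25.0, 11.0, 14.0],
}
# One row of 41 features per subbasin, indexed by model_id.
cluster_rows = {model_id: fdc_features(q) for model_id, q in simulated.items()}
```

Each entry of cluster_rows corresponds to one row of the table, with the model_id as the index and one column per exceedance probability.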
drain_table is a table of properties from the stream/catchment network used in the hydrologic model, in parquet or csv
format. The table should have exactly the following features:
- model_id: A unique identifier/ID; any alphanumeric UTF-8 string will suffice.
- downstream_model_id: The ID of the next downstream reach, used to trace the network programmatically.
- strahler_order: The Strahler stream order of each reach.
- x: The x coordinate of the centroid of each feature (precalculated for faster results later).
- y: The y coordinate of the centroid of each feature (precalculated for faster results later).
Mockup of a table row:
| unique_stream_# | unique_stream_# | area in km^2 | stream_order | ## | ## |
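The downstream_model_id column is what makes it possible to trace the network programmatically. A minimal sketch of that idea is shown here, assuming the table has been reduced to a mapping from model_id to downstream_model_id. The function name trace_downstream, the example IDs, and the use of None to mark an outlet are assumptions for illustration; check how your own network marks outlets.

```python
# Hypothetical network: model_id -> downstream_model_id.
# Assumption: an outlet (no downstream reach) is marked with None.
downstream = {"A": "C", "B": "C", "C": "D", "D": None}


def trace_downstream(model_id, downstream):
    """Return the list of reach IDs from model_id down to the outlet."""
    path = [model_id]
    seen = {model_id}
    current = downstream.get(model_id)
    while current is not None:
        if current in seen:
            # A repeated ID means the table describes a loop, which is invalid.
            raise ValueError(f"cycle detected at reach {current}")
        path.append(current)
        seen.add(current)
        current = downstream.get(current)
    return path


print(trace_downstream("A", downstream))  # reaches visited, upstream to outlet
```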
gauge_table is a table of properties from the available river gauges, in parquet or csv format. Each gauge needs to
have a unique ID and needs to be paired with the unique ID of the model subbasin. The table must have exactly the
following features:
- gauge_id: A unique identifier/ID; any alphanumeric UTF-8 string will suffice.
- model_id: The ID of the stream segment which corresponds to that gauge.
- latitude: The latitude of the gauge.
- longitude: The longitude of the gauge.
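The two requirements on the gauge table (unique gauge IDs, each paired with a model ID that exists in the network) can be checked before running SABER. The sketch below reads a csv version of the table with the standard library; the function name check_gauge_table and all IDs and coordinates are made up for the example.

```python
import csv
import io

GAUGE_COLUMNS = {"gauge_id", "model_id", "latitude", "longitude"}


def check_gauge_table(csv_text, known_model_ids):
    """Return a list of problems found in gauge table CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    if set(reader.fieldnames or []) != GAUGE_COLUMNS:
        problems.append(f"columns must be exactly {sorted(GAUGE_COLUMNS)}")
        return problems
    seen = set()
    for row in reader:
        if row["gauge_id"] in seen:
            problems.append(f"duplicate gauge_id: {row['gauge_id']}")
        seen.add(row["gauge_id"])
        if row["model_id"] not in known_model_ids:
            problems.append(
                f"gauge {row['gauge_id']} has unknown model_id: {row['model_id']}"
            )
    return problems


# Made-up example rows; the IDs and coordinates are hypothetical.
gauge_csv = """gauge_id,model_id,latitude,longitude
G1,A,4.61,-74.08
G2,C,4.70,-74.20
G2,Z,4.80,-74.30
"""
print(check_gauge_table(gauge_csv, known_model_ids={"A", "B", "C", "D"}))
```

The known_model_ids set would normally come from the model_id column of the drain_table.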
regulate_table is a table of the locations of dams, reservoirs, or other regulatory structures, in parquet or csv
format. Each structure needs to have a unique ID and must be paired with the unique ID of the model subbasin that
contains it. The table must have exactly the following features and names:
- regulate_id: A unique identifier/ID; any alphanumeric UTF-8 string will suffice.
- model_id: The ID of the stream segment which corresponds to this regulatory structure.
drain_gis is a GeoPackage or equivalent of the stream network used in the hydrologic model. It should have exactly the
same properties listed for the drain_table above. The information is the same, but the spatial format can be used to
make maps of the network and the SABER results.
gauge_gis is a GeoPackage or equivalent of the river gauges used for validation. It should have exactly the same
properties listed for the gauge_table above. The information is the same, but the spatial format can be used to make
maps of the gauges and the SABER results.
gauge_data is a directory that contains the observed discharge information. It should contain a subdirectory called
observed_data which contains the csv files of the river gauge data. Each file should be named with the ID of the gauge
it corresponds to. Other GIS datasets and the gauge table may be included here as well.
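A quick way to confirm that the directory matches the gauge table is to compare the file names in observed_data with the list of gauge IDs. The sketch below assumes each file is named exactly <gauge_id>.csv, which is an interpretation of the naming rule rather than something confirmed here. The function name find_missing_gauge_files, the gauge IDs, and the csv column names in the demonstration file are hypothetical.

```python
import tempfile
from pathlib import Path


def find_missing_gauge_files(gauge_data_dir, gauge_ids):
    """Return gauge IDs that have no <gauge_id>.csv file in observed_data."""
    observed_dir = Path(gauge_data_dir) / "observed_data"
    available = {path.stem for path in observed_dir.glob("*.csv")}
    return sorted(set(gauge_ids) - available)


# Demonstration in a temporary directory with hypothetical gauge IDs.
with tempfile.TemporaryDirectory() as tmp:
    observed = Path(tmp) / "observed_data"
    observed.mkdir()
    # Column names here are placeholders, not a required format.
    (observed / "G1.csv").write_text("datetime,flow\n2020-01-01,1.5\n")
    missing = find_missing_gauge_files(tmp, ["G1", "G2"])

print(missing)  # gauge IDs without a matching csv file
```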
hindcast_zarr is a zarr file (directory). The SABER code expects this to be a series of zarr files which cover the same
historical time frame but in separate chunks. SABER was developed to use zarr files that were converted from RAPID
netCDF outputs. Each file should have globally unique IDs, and the zarr files are concatenated along the rivid
dimension.
n_processes is an integer that specifies the number of processes to use for parallel processing. SABER computations are
operations on dataframes which are easily parallelizable. This is the number of workers used in a Python
multiprocessing Pool. This number should generally be <= the number of cores on your machine.
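One way to pick a safe value is to cap the requested number at the core count reported by the operating system. The sketch below is an illustration of that rule and of a generic multiprocessing Pool call; it does not reproduce how saber-hbc uses its pool internally, and the function names choose_n_processes and map_in_pool are hypothetical.

```python
import os
from multiprocessing import Pool


def choose_n_processes(requested):
    """Clamp the requested process count to between 1 and the CPU count."""
    available = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, min(int(requested), available))


def map_in_pool(func, items, n_processes):
    """Apply func to each item using a multiprocessing Pool of workers."""
    with Pool(choose_n_processes(n_processes)) as pool:
        return pool.map(func, items)
```

When calling map_in_pool from a script, pass a function defined at module level and make the call inside an `if __name__ == "__main__":` block so that worker processes can import the module safely.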
FAQ, Tips, Troubleshooting
Be sure that all GIS datasets:
- Are in the same projected coordinate system
- Only contain gauges and reaches within the area of interest. Clip/delete anything else.

You may find it helpful to also have the catchments, adjoint catchments, and a watershed boundary polygon for visualization purposes.
Be sure that all the discharge datasets (simulated and observed):
- Are in the same units (e.g. m3/s)
- Are in the same time zone (e.g. UTC)
- Are in the same time step (e.g. daily average)
- Do not contain any non-numeric values (e.g. ICE, none)
- Do not contain any negative values (e.g. -9999)
  - If the negative values are fill values for null, commonly -9999, delete them from the dataset
- Do not contain rows with missing values (e.g. NaN or blank cells)
- Have been cleaned of any other incorrect values (e.g. values that contain letters)
- Do not contain any duplicate rows
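Several of the checks in this list can be applied in a single pass over a discharge series. The following is a minimal sketch using only the standard library, assuming each record is a (timestamp, value) pair read from a csv file. The function name clean_discharge and the sample records are made up, and it does not address units, time zones, or time steps, which still need to be checked separately.

```python
import math


def clean_discharge(rows):
    """Drop non-numeric, missing, negative, and duplicate discharge records.

    rows is an iterable of (timestamp, value) pairs; values may be strings.
    """
    cleaned = []
    seen = set()
    for timestamp, raw in rows:
        try:
            value = float(raw)
        except (TypeError, ValueError):
            continue  # non-numeric or blank, e.g. "ICE", "none", ""
        if not math.isfinite(value) or value < 0:
            continue  # NaN, infinite, or negative (including -9999 fill values)
        record = (timestamp, value)
        if record in seen:
            continue  # exact duplicate row
        seen.add(record)
        cleaned.append(record)
    return cleaned


# Made-up records showing each kind of problem.
raw_rows = [
    ("2020-01-01", "1.5"),
    ("2020-01-02", "ICE"),
    ("2020-01-03", "-9999"),
    ("2020-01-04", ""),
    ("2020-01-05", "nan"),
    ("2020-01-01", "1.5"),
    ("2020-01-06", "2"),
]
print(clean_discharge(raw_rows))
```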