bakaano.streamflow_trainer¶
Training pipeline for regional streamflow models.
Role: Build training datasets and train the TCN-based streamflow model.
- bakaano.streamflow_trainer.asym_laplace_plus_mse_sqrt(y_true, params, scale_min=0.0001, mse_weight=0.05)¶
- bakaano.streamflow_trainer.asym_laplace_nll(y_true, params, r_clip=5.0, scale_clip=(0.001, 5.0), peak_weight=0.3)¶
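The documentation does not spell out the parameterization behind these loss functions. A minimal NumPy sketch of a standard asymmetric Laplace negative log-likelihood, assuming `params` unpacks to a location `mu`, a scale, and an asymmetry parameter `tau` (all names here are assumptions, not the module's actual internals):

```python
import numpy as np

def ald_nll(y_true, mu, scale, tau, scale_clip=(1e-3, 5.0)):
    """Asymmetric Laplace negative log-likelihood (sketch).

    rho_tau(r) = r * (tau - 1{r < 0}) is the pinball (quantile) loss;
    the per-sample NLL is rho_tau(y - mu) / scale - log(tau * (1 - tau) / scale).
    """
    scale = np.clip(scale, *scale_clip)         # keep the scale in a safe range
    r = y_true - mu
    rho = r * (tau - (r < 0).astype(float))     # pinball / quantile loss
    return rho / scale - np.log(tau * (1.0 - tau) / scale)
```

With `tau > 0.5` the loss penalizes under-prediction more than over-prediction, which is one way to bias a streamflow model toward capturing peaks.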
- class bakaano.streamflow_trainer.DataPreprocessor(working_dir, study_area, grdc_streamflow_nc_file, train_start, train_end, routing_method, catchment_size_threshold)[source]¶
Bases: object
- _extract_station_rowcol(lat, lon)[source]¶
Extract the row and column indices for a given latitude and longitude from a given raster file.
- Parameters:
lat (float) – The latitude of the station.
lon (float) – The longitude of the station.
- Returns:
row (int) – The row index corresponding to the given latitude and longitude.
col (int) – The column index corresponding to the given latitude and longitude.
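For a north-up raster, this lookup reduces to applying the inverse of the raster's affine geotransform. A minimal sketch, assuming the raster origin and cell sizes are known (the function name and parameters here are illustrative, not the class's actual signature):

```python
def latlon_to_rowcol(lat, lon, x_origin, y_origin, pixel_width, pixel_height):
    """Convert geographic coordinates to raster row/col indices (sketch).

    Assumes a north-up raster: x increases eastward, y decreases southward,
    with (x_origin, y_origin) at the top-left corner and positive cell sizes.
    """
    col = int((lon - x_origin) / pixel_width)    # columns count eastward
    row = int((y_origin - lat) / pixel_height)   # rows count southward
    return row, col
```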
- _snap_coordinates(lat, lon)[source]¶
Snap the given latitude and longitude to the nearest river segment based on a river grid.
- Parameters:
lat (float) – The latitude to be snapped.
lon (float) – The longitude to be snapped.
- Returns:
snapped_lat (float) – The latitude of the nearest river segment.
snapped_lon (float) – The longitude of the nearest river segment.
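Snapping a gauge to the river network amounts to a nearest-neighbour search over the river cells. A minimal sketch in degree space (adequate for small neighbourhoods; the helper name and inputs are assumptions):

```python
import numpy as np

def snap_to_river(lat, lon, river_lats, river_lons):
    """Return the river-cell coordinates closest to (lat, lon) (sketch).

    river_lats / river_lons are parallel 1-D arrays of river-cell centres.
    Uses squared Euclidean distance in degrees, which suffices for snapping
    within a small local window.
    """
    d2 = (river_lats - lat) ** 2 + (river_lons - lon) ** 2
    i = int(np.argmin(d2))
    return float(river_lats[i]), float(river_lons[i])
```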
- load_observed_streamflow(grdc_streamflow_nc_file)[source]¶
Load and filter observed GRDC streamflow data in a schema-robust way. Works for single- and multi-station NetCDFs.
- Parameters:
grdc_streamflow_nc_file (str) – Path to GRDC NetCDF file.
- Returns:
Filtered GRDC subset for the study area.
- Return type:
xarray.Dataset
- _open_grdc_dataset(grdc_streamflow_nc_file)[source]¶
Open GRDC NetCDF with backend fallback for Colab/Drive compatibility.
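The fallback pattern itself is simple: try each backend in order and keep the first that succeeds. A generic sketch of that control flow (the exact engines the class tries are not documented here; this helper is illustrative):

```python
def open_with_fallback(path, openers):
    """Try each (name, opener) pair in order; return the first dataset
    that opens successfully (sketch).

    Mirrors a backend-fallback strategy where, e.g., one NetCDF engine
    fails on Drive-mounted files in Colab but another works.
    """
    errors = {}
    for name, opener in openers:
        try:
            return opener(path)
        except Exception as err:   # engine missing or file unreadable
            errors[name] = err
    raise RuntimeError(f"All backends failed for {path!r}: {errors}")
```

With xarray, the `openers` list would typically wrap `xr.open_dataset(path, engine=...)` for engines such as `"netcdf4"` and `"h5netcdf"`.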
- load_observed_streamflow_from_csv_dir(csv_dir, lookup_csv, id_col='id', lat_col='latitude', lon_col='longitude', date_col='date', discharge_col='discharge', file_pattern='{id}.csv')[source]¶
Load observed streamflow from per-station CSV files using a lookup table.
The lookup table must include station identifiers and coordinates. The method filters stations to the study area, then loads per-station CSVs by ID.
- Parameters:
csv_dir (str) – Directory containing per-station CSV files.
lookup_csv (str) – CSV file with station ids and coordinates.
id_col (str) – Station id column in lookup CSV.
lat_col (str) – Latitude column in lookup CSV.
lon_col (str) – Longitude column in lookup CSV.
date_col (str) – Date column in station CSVs.
discharge_col (str) – Discharge column in station CSVs.
file_pattern (str) – Pattern for station CSV filenames (e.g., "{id}.csv").
- Returns:
Mapping of station_id to observed discharge DataFrame.
- Return type:
dict
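The lookup-then-load flow can be sketched with pandas, assuming the lookup table is already filtered to the study area (names and defaults below mirror the documented parameters but the implementation is illustrative, including silently skipping stations whose CSV is missing):

```python
import os
import pandas as pd

def load_station_csvs(csv_dir, lookup, id_col="id", date_col="date",
                      discharge_col="discharge", file_pattern="{id}.csv"):
    """Load per-station discharge CSVs listed in a lookup DataFrame (sketch).

    Returns a dict mapping station id -> DataFrame indexed by date,
    skipping stations whose file does not exist.
    """
    out = {}
    for sid in lookup[id_col]:
        path = os.path.join(csv_dir, file_pattern.format(id=sid))
        if not os.path.exists(path):
            continue                                       # missing station file
        df = pd.read_csv(path, parse_dates=[date_col]).set_index(date_col)
        out[sid] = df[[discharge_col]]
    return out
```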
- get_data()[source]¶
Extract and preprocess predictor and response variables for each station based on its coordinates.
- Returns:
A list containing two elements:
- self.data_list: a list of tuples, each containing predictors (DataFrame) and response (DataFrame).
- self.catchment: a list of tuples, each containing catchment data (accumulation and slope values).
- Return type:
list
- class bakaano.streamflow_trainer.StreamflowModel(working_dir, batch_size, num_epochs, learning_rate=0.0001, loss_function='huber', train_start=None, train_end=None, seed=100, area_normalize=True, lr_schedule=None, warmup_epochs=3, min_learning_rate=1e-05)[source]¶
Bases: object
Role: Define and train the multi-scale TCN streamflow model.
Full-materialization training variant of the regional streamflow model.
Key characteristics (actual behavior):
- Prepares per-station scaled series using area normalization (optional).
- Materializes all valid 365-day sliding windows in memory.
- Trains directly with in-memory NumPy arrays.
- Enables XLA globally via tf.config.optimizer.set_jit(True).
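The units used by `area_normalize` are not documented here; a common convention in regional models is to convert discharge to specific discharge (mm/day), which makes stations with very different catchment sizes comparable. A sketch under that assumption:

```python
import numpy as np

def area_normalize(q_m3s, area_km2):
    """Convert discharge (m^3/s) to specific discharge (mm/day) (sketch).

    1 m^3/s over 1 km^2 = 86400 m^3/day over 1e6 m^2
                        = 0.0864 m/day = 86.4 mm/day.
    """
    return np.asarray(q_m3s) / area_km2 * 86.4
```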
- prepare_data(data_list)[source]¶
Prepare the data for training the streamflow prediction model.
This materializes all 365-day sliding windows, filters NaNs once, and concatenates across stations.
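The window-materialization step can be sketched with NumPy's `sliding_window_view`, assuming each station contributes a (time, features) predictor array and an aligned response series, with the target taken as the last day of each window (the pairing of window to target is an assumption; only the 365-day window length and the NaN filtering are documented):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_windows(predictors, response, window=365):
    """Materialize all length-`window` sliding windows and drop any window
    whose inputs or target contain NaN (sketch).

    predictors: (T, F) float array; response: (T,) float array, aligned.
    Returns X of shape (N, window, F) and y of shape (N,), where y[i]
    is the response on the last day of window i.
    """
    X = sliding_window_view(predictors, window, axis=0)  # (T-window+1, F, window)
    X = np.transpose(X, (0, 2, 1))                       # -> (N, window, F)
    y = response[window - 1:]                            # target = last day of window
    ok = ~np.isnan(X).any(axis=(1, 2)) & ~np.isnan(y)    # filter NaN windows once
    return X[ok], y[ok]
```

Per-station outputs would then be concatenated along the first axis before training, at the memory cost noted above.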