Data Manipulation#

Dataset manipulation#

Functions for interacting with data, either pandas DataFrames or xarray DataArrays and Datasets.

watershed_workflow.data.computeAverageYear(data: Dataset, time_column: str, start_year: int, output_nyears: int) Dataset[source]#
watershed_workflow.data.computeAverageYear(data: DataArray, time_column: str, start_year: int, output_nyears: int) DataArray
watershed_workflow.data.computeAverageYear(data: DataFrame, time_column: str, start_year: int, output_nyears: int) DataFrame

Average data values across years and repeat for specified number of years.

This function automatically selects the appropriate implementation based on the input data type: DataFrame, DataArray, or Dataset.

Parameters:
  • data (pandas.DataFrame, xr.DataArray, or xr.Dataset) – Input data with cftime noleap calendar dates at 1- or 5-day intervals.

  • time_column (str, optional) – For DataFrame: Name of the time column (required). For Dataset: Name of the time dimension (default: ‘time’). For DataArray: Ignored (always uses ‘time’ dimension).

  • start_year (int) – Start year for the output time series.

  • output_nyears (int, optional) – Number of years to repeat the averaged pattern. Default is 1.

Returns:

Same type as input with averaged values repeated for the specified number of years, starting from start_year.

Return type:

pandas.DataFrame, xr.DataArray, or xr.Dataset

Raises:
  • TypeError – If data is not a DataFrame, DataArray, or Dataset. If DataFrame is provided without time_column.

  • ValueError – If time column/dimension is not found or contains invalid data.

Notes

The function computes the average value for each day of year (1-365) across all years in the input data. For 5-day intervals, it averages values at days 1, 6, 11, etc. The resulting pattern is then repeated for output_nyears starting from start_year.

Missing values (NaN) are ignored in the averaging process.

For DataFrames, only numeric columns are included in the output.

For DataArrays, all attributes are preserved.

For Datasets:
  • Variables without the time dimension are preserved unchanged

  • All attributes and encodings are preserved
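The day-of-year averaging at the core of these functions can be sketched with plain pandas. This is a minimal illustration, not the library's implementation: the real functions expect cftime noleap dates and also rebuild the output time axis starting at start_year.

```python
import numpy as np
import pandas as pd

def average_year_sketch(df: pd.DataFrame, time_column: str,
                        output_nyears: int = 1) -> pd.DataFrame:
    """Average each numeric column by day of year, then tile the 365-day pattern.

    Assumes a noleap daily record: every year contributes days 1-365 and
    leap days have already been removed (see filterLeapDay).
    """
    doy = df[time_column].dt.dayofyear.to_numpy()
    # One row per day of year, averaged across all input years; NaNs are
    # ignored by pandas' mean, matching the documented behavior.
    pattern = df.select_dtypes("number").groupby(doy).mean()
    # Repeat the averaged annual cycle for the requested number of years.
    return pd.concat([pattern] * output_nyears, ignore_index=True)
```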

watershed_workflow.data.computeAverageYear_DataArray(da: DataArray, time_dim: str = 'time', start_date: str | datetime | cftime.datetime = '2020-1-1', output_nyears: int = 2) DataArray[source]#

Average DataArray values across years and repeat for specified number of years.

Parameters:
  • da (xr.DataArray) – Input DataArray with cftime noleap calendar dates at 1- or 5-day intervals.

  • start_date (str, datetime, or cftime.datetime) – Start date for the output time series. If string, should be ‘YYYY-MM-DD’ format.

  • output_nyears (int) – Number of years to repeat the averaged pattern.

  • time_dim (str, optional) – Name of the time dimension. Default is ‘time’.

Returns:

DataArray with averaged values repeated for the specified number of years, starting from start_date. All attributes are preserved.

Return type:

xr.DataArray

Raises:

ValueError – If time dimension is not found.

Notes

The function computes the average value for each day of year (1-365) across all years in the input data. The resulting 365-day pattern is then repeated for output_nyears starting from start_date.

This is particularly useful for creating climatological datasets or for generating synthetic time series based on historical patterns.

watershed_workflow.data.computeAverageYear_DataFrame(df: DataFrame, time_column: str = 'time', start_date: str | datetime | cftime.datetime = '2020-1-1', output_nyears: int = 2) DataFrame[source]#

Average DataFrame values across years and repeat for specified number of years.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame with cftime noleap calendar dates at 1- or 5-day intervals.

  • time_column (str) – Name of the column containing cftime datetime objects.

  • start_date (str, datetime, or cftime.datetime) – Start date for the output time series. If string, should be ‘YYYY-MM-DD’ format.

  • output_nyears (int) – Number of years to repeat the averaged pattern.

Returns:

DataFrame with averaged values repeated for the specified number of years, starting from start_date. Only includes the time column and averaged numeric columns.

Return type:

pandas.DataFrame

Raises:

ValueError – If time_column is not found or contains invalid data.

Notes

The function computes the average value for each day of year (1-365) across all years in the input data. For 5-day intervals, it averages values at days 1, 6, 11, etc. The resulting pattern is then repeated for output_nyears starting from start_date.

Missing values (NaN) are ignored in the averaging process. Non-numeric columns are excluded from the output.

watershed_workflow.data.computeAverageYear_Dataset(ds: Dataset, time_dim: str = 'time', start_date: str | datetime | cftime.datetime = '2020-1-1', output_nyears: int = 2, variables: List[str] | None = None) Dataset[source]#

Average Dataset values across years and repeat for specified number of years.

Parameters:
  • ds (xr.Dataset) – Input Dataset with cftime noleap calendar dates at 1- or 5-day intervals.

  • start_date (str, datetime, or cftime.datetime) – Start date for the output time series. If string, should be ‘YYYY-MM-DD’ format.

  • output_nyears (int) – Number of years to repeat the averaged pattern.

  • time_dim (str, optional) – Name of the time dimension. Default is ‘time’.

  • variables (list of str, optional) – List of variables to average. If None, averages all variables with the time dimension.

Returns:

Dataset with averaged values repeated for the specified number of years, starting from start_date. All attributes are preserved.

Return type:

xr.Dataset

Raises:

ValueError – If time dimension is not found or if specified variables don’t exist.

Notes

Variables without the time dimension are preserved unchanged in the output.

For each variable with the time dimension, the function computes the average value for each day of year across all years in the input data.

watershed_workflow.data.computeMode(da: DataArray, time_dim: str = 'time') DataArray[source]#

Compute the mode along the time dimension of a DataArray.

Parameters:
  • da (xr.DataArray) – Input DataArray. Can contain any data type that scipy.stats.mode can handle.

  • time_dim (str, optional) – Name of the time dimension along which to compute the mode. Default is ‘time’.

Returns:

DataArray with the mode computed along the time dimension. The time dimension is removed from the output. All other dimensions, coordinates, and attributes are preserved. In case of multiple modes, returns the smallest value.

Return type:

xr.DataArray

Raises:

ValueError – If the specified time dimension is not found in the DataArray.

Notes

For continuous data, the mode may not be meaningful. This function is most useful for discrete or categorical data.

When multiple values have the same highest frequency (multiple modes), scipy.stats.mode returns the smallest of these values.

NaN values are ignored in the mode calculation.
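Since the docstring names scipy.stats.mode, the behavior maps directly onto a call along the time axis; a minimal numpy sketch with an illustrative (time, y, x) stack:

```python
import numpy as np
from scipy import stats

# A (time, y, x) stack: compute the per-pixel mode along the time axis,
# ignoring NaN, with ties resolving to the smallest value.
arr = np.array([
    [[1.0, 2.0], [3.0, np.nan]],
    [[1.0, 2.0], [4.0, 5.0]],
    [[2.0, 7.0], [4.0, 5.0]],
])
result = stats.mode(arr, axis=0, nan_policy="omit", keepdims=False)
mode2d = result.mode  # time axis removed, shape (2, 2)
```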

watershed_workflow.data.convertTimesToCFTime(time_values: Sequence[datetime | Timestamp | datetime | datetime64]) ndarray[tuple[Any, ...], dtype[datetime]][source]#

Convert an iterable of datetime objects to cftime objects.

This function accepts various datetime types and converts them to cftime Gregorian calendar.

Parameters:

time_values – Iterable of datetime objects (numpy datetime64, pandas Timestamp, Python datetime, or cftime objects). All elements must be the same type.

Returns:

Array of cftime objects in the Gregorian calendar.

Return type:

numpy.ndarray of cftime.datetime objects

watershed_workflow.data.convertTimesToCFTimeNoleap(time_values: Sequence[datetime]) ndarray[tuple[Any, ...], dtype[DatetimeNoLeap]][source]#

Convert an iterable of datetime objects on any calendar to the cftime DatetimeNoLeap calendar.

This function accepts various datetime types and converts them to the cftime NoLeap calendar. Raises an error if any input date represents the 366th day of a year (the leap day in DayMet convention).

Parameters:

time_values – Sequence of datetime objects (numpy datetime64, pandas Timestamp, Python datetime, or cftime objects). All elements must be the same type.

Return type:

numpy.ndarray of cftime.DatetimeNoLeap objects.

Raises:

ValueError – If any date in the input represents the 366th day of a year (leap day).

watershed_workflow.data.createNoleapMask(time_values: Sequence[datetime | Timestamp | datetime | datetime64]) Tuple[ndarray[tuple[Any, ...], dtype[DatetimeNoLeap]], ndarray[tuple[Any, ...], dtype[bool]]][source]#

Create a noleap time array and a boolean mask that is True for every time value that is not a leap day (day 366).

Parameters:

time_values (Sequence[ValidTime]) – Time values to filter for leap days.

Returns:

  • Sequence[cftime.DatetimeNoLeap] – Time values converted to cftime format with leap days filtered.

  • List[bool] – Boolean mask where True indicates non-leap days.

watershed_workflow.data.filterLeapDay(data: Dataset, time_column: str = 'time') Dataset[source]#
watershed_workflow.data.filterLeapDay(data: DataArray, time_column: str = 'time') DataArray
watershed_workflow.data.filterLeapDay(data: DataFrame, time_column: str = 'time') DataFrame

Remove day 366 (Dec 31) from leap years and convert time to CFTime noleap calendar.

This function automatically selects the appropriate implementation based on the input data type: DataFrame, DataArray, or Dataset.

Parameters:
  • data (pandas.DataFrame, xr.DataArray, or xr.Dataset) – Input data containing time series information to filter.

  • time_column (str, optional) – For DataFrame: Name of the column containing datetime data (required). For Dataset: Name of the time dimension (defaults to ‘time’). For DataArray: Ignored (always uses ‘time’ dimension).

Returns:

Same type as input with day 366 of leap years removed and time converted to cftime noleap calendar format.

Return type:

pandas.DataFrame, xr.DataArray, or xr.Dataset

Raises:
  • TypeError – If data is not a DataFrame, DataArray, or Dataset. If DataFrame is provided without time_column.

  • ValueError – If time column/dimension is not found or cannot be converted to datetime.

Notes

Day 366 only occurs on December 31st in leap years. The function assumes that the input time data is not already in cftime noleap format, as noleap calendars by definition do not have day 366.

For DataFrames:
  • time_column parameter is required

  • DataFrame index is reset after filtering

For DataArrays & Datasets:
  • time_column parameter specifies the time dimension (default: ‘time’)

  • All attributes including rasterio-specific ones are preserved

See also

filterLeapDay_DataFrame

DataFrame-specific implementation

filterLeapDay_DataArray

DataArray-specific implementation

filterLeapDay_Dataset

Dataset-specific implementation

watershed_workflow.data.filterLeapDay_DataFrame(df: DataFrame, time_column: str = 'time') DataFrame[source]#

Remove day 366 (Dec 31) from leap years and convert time column to CFTime noleap calendar.

Parameters:
  • df – Input DataFrame containing time series data.

  • time_column – Name of the column containing datetime data. Must be convertible to pandas datetime.

Returns:

DataFrame with day 366 of leap years removed and time column converted to cftime noleap calendar format.

Return type:

pandas.DataFrame

Raises:

ValueError – If time_column is not found in the DataFrame. If the time column cannot be converted to datetime format.

Notes

Day 366 only occurs on December 31st in leap years. The function assumes that the input time column is not already in cftime noleap format, as noleap calendars by definition do not have day 366.

The DataFrame index is reset after filtering to ensure continuous indexing.
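The filtering step can be sketched with plain pandas; the cftime noleap conversion that the real function also performs is omitted here, and the function name is hypothetical.

```python
import pandas as pd

def drop_day_366(df: pd.DataFrame, time_column: str = "time") -> pd.DataFrame:
    """Drop rows on day 366 of leap years (Dec 31) and reset the index.

    Sketch only: filterLeapDay_DataFrame additionally converts the time
    column to cftime noleap objects.
    """
    t = pd.to_datetime(df[time_column])
    keep = t.dt.dayofyear != 366
    return df[keep].reset_index(drop=True)
```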

watershed_workflow.data.filterLeapDay_xarray(da: DataArray | Dataset, time_dim: str = 'time') DataArray | Dataset[source]#

Remove day 366 (Dec 31) from leap years and convert time dimension to CFTime noleap calendar.

Parameters:
  • da (xr.DataArray or xr.Dataset) – Input data with a time dimension. The time dimension must contain datetime-like values that can be converted to pandas datetime.

  • time_dim (str, optional) – Name of the time dimension. Default is ‘time’.

Returns:

Same type as input with day 366 of leap years removed and the time dimension converted to cftime noleap calendar format. All attributes, including rasterio-specific attributes like ‘nodata’ and ‘crs’, are preserved.

Return type:

xr.DataArray or xr.Dataset

Raises:

ValueError – If the input does not have the specified time dimension. If the time dimension cannot be converted to datetime format.

Notes

Day 366 only occurs on December 31st in leap years. The function assumes that the input time dimension is not already in cftime noleap format, as noleap calendars by definition do not have day 366.

For rasterio-based DataArrays, this function preserves the coordinate reference system (CRS) and nodata value attributes. All other attributes are also preserved.

The time dimension name is preserved, but its values are replaced with cftime noleap calendar objects.

watershed_workflow.data.imputeHoles2D(arr: DataArray, nodata: Any = nan, method: str = 'cubic') DataArray[source]#

Interpolate values for missing data in rasters using scipy griddata.

Parameters:
  • arr (xarray.DataArray) – Input raster data with missing values to interpolate.

  • nodata (Any, optional) – Value representing missing data. Default is numpy.nan.

  • method (str, optional) – Interpolation method for scipy.interpolate.griddata. Valid options: ‘linear’, ‘nearest’, ‘cubic’. Default is ‘cubic’.

Returns:

Interpolated array with missing values filled.

Return type:

xarray.DataArray

Notes

This function may raise an error if there are holes on the boundary. The interpolation is performed using scipy.interpolate.griddata with the specified method.
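A minimal sketch of the griddata-based hole filling on a plain numpy array; the real function operates on an xarray.DataArray and honors its nodata value, and the helper name here is hypothetical.

```python
import numpy as np
from scipy.interpolate import griddata

def fill_holes_2d(arr: np.ndarray, method: str = "cubic") -> np.ndarray:
    """Fill NaN holes by interpolating from valid neighbors with griddata.

    Holes touching the boundary may remain NaN, since griddata does not
    extrapolate outside the convex hull of the valid points.
    """
    yy, xx = np.indices(arr.shape)
    valid = ~np.isnan(arr)
    filled = arr.copy()
    filled[~valid] = griddata(
        (yy[valid], xx[valid]), arr[valid],
        (yy[~valid], xx[~valid]), method=method)
    return filled
```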

watershed_workflow.data.interpolate(data: Dataset, time_values: Sequence[datetime | Timestamp | datetime | datetime64] = Ellipsis, time_dim: str = 'time', method: Literal['linear', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'quintic', 'polynomial', 'barycentric', 'krogh', 'pchip', 'spline', 'akima', 'makima'] = 'linear') Dataset[source]#
watershed_workflow.data.interpolate(data: DataArray, time_values: Sequence[datetime | Timestamp | datetime | datetime64] = Ellipsis, time_dim: str = 'time', method: Literal['linear', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'quintic', 'polynomial', 'barycentric', 'krogh', 'pchip', 'spline', 'akima', 'makima'] = 'linear') DataArray
watershed_workflow.data.interpolate(data: DataFrame, time_values: Sequence[datetime | Timestamp | datetime | datetime64] = Ellipsis, time_dim: str = 'time', method: Literal['linear', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'quintic', 'polynomial', 'barycentric', 'krogh', 'pchip', 'spline', 'akima', 'makima'] = 'linear') DataFrame

Interpolate data to new times.

This function automatically selects the appropriate implementation based on the input data type: DataFrame, DataArray, or Dataset.

Parameters:
  • data (pandas.DataFrame, xr.DataArray, or xr.Dataset) – Input data containing time series with cftime calendar.

  • time_values (Sequence[ValidTime]) – Time values to interpolate to.

  • time_dim (str, optional) – For DataFrame: Name of the time column (required). For Dataset/DataArray: Name of the time dimension (default: ‘time’).

  • method (str, optional) – Interpolation method. Default is ‘linear’.

Returns:

Same type as input with regular time intervals and interpolated values.

Return type:

pandas.DataFrame, xr.DataArray, or xr.Dataset

Raises:
  • TypeError – If data is not a DataFrame, DataArray, or Dataset.

  • ValueError – If required parameters are missing or invalid.

See also

interpolate_DataFrame

DataFrame-specific implementation

interpolate_Dataset

Dataset-specific implementation

watershed_workflow.data.interpolateToRegular(data: DataFrame | DataArray | Dataset, interval: int = 1, time_dim: str = 'time', method: Literal['linear', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'quintic', 'polynomial', 'barycentric', 'krogh', 'pchip', 'spline', 'akima', 'makima'] = 'linear') DataFrame | DataArray | Dataset[source]#

Interpolate data to regularly spaced times at a fixed interval in days.

This function automatically selects the appropriate implementation based on the input data type: DataFrame, DataArray, or Dataset.

Parameters:
  • data (pandas.DataFrame, xr.DataArray, or xr.Dataset) – Input data containing time series with cftime calendar.

  • interval (int, optional) – Spacing between output times, in days. Default is 1.

  • time_dim (str, optional) – For DataFrame: Name of the time column (required). For Dataset/DataArray: Name of the time dimension (default: ‘time’).

  • method (str, optional) – Interpolation method. Default is ‘linear’.

Returns:

Same type as input with regular time intervals and interpolated values.

Return type:

pandas.DataFrame, xr.DataArray, or xr.Dataset

Raises:
  • TypeError – If data is not a DataFrame, DataArray, or Dataset.

  • ValueError – If required parameters are missing or invalid.

See also

interpolate_DataFrame

DataFrame-specific implementation

interpolate_Dataset

Dataset-specific implementation

watershed_workflow.data.interpolateValues(points: ndarray, points_crs: CRS | None, data: DataArray, method: Literal['linear', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'quintic', 'polynomial', 'barycentric', 'krogh', 'pchip', 'spline', 'akima', 'makima'] = 'nearest') ndarray[source]#

Interpolate values from a 2D grid-based DataArray at given x, y or lat, lon points.

Parameters:
  • points (np.ndarray) – A (N, 2) array of coordinates. Each row should contain an (x, y) or (lon, lat) pair.

  • points_crs (CRS) – A coordinate system for the points.

  • data (xr.DataArray) – A DataArray with coordinates either (‘x’, ‘y’) or (‘lon’, ‘lat’).

  • method ({'linear', 'nearest', 'cubic'}, optional) – Interpolation method to use. Default is ‘nearest’.

Returns:

A 1D array of interpolated values with length N.

Return type:

np.ndarray

Raises:

ValueError – If DataArray does not have suitable coordinates for interpolation.
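Setting aside the CRS reprojection step, the core operation is interpolating gridded values at scattered points; a sketch with scipy's RegularGridInterpolator on a hypothetical grid, not the library's exact implementation:

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# A small grid standing in for a DataArray with ('y', 'x') coordinates.
x = np.linspace(0.0, 4.0, 5)
y = np.linspace(0.0, 4.0, 5)
values = np.add.outer(y, x)  # values[i, j] = y[i] + x[j]

interp = RegularGridInterpolator((y, x), values, method="linear")

# (N, 2) points as (x, y) pairs, as interpolateValues expects; the grid
# interpolator wants (y, x) order, so the columns are swapped.
points = np.array([[1.0, 2.0], [3.5, 2.5]])
result = interp(points[:, ::-1])  # 1D array of length N
```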

watershed_workflow.data.interpolate_DataFrame(df: DataFrame, time_values: Sequence[datetime | Timestamp | datetime | datetime64], time_column: str = 'time', method: Literal['linear', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'quintic', 'polynomial', 'barycentric', 'krogh', 'pchip', 'spline', 'akima', 'makima'] = 'linear') DataFrame[source]#

Interpolate DataFrame to arbitrary times.

NOTE: this is not the same as pandas.interpolate(), but more like pandas.reindex(time_values).interpolate() with scipy-based interpolation options.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame with a time column containing cftime objects.

  • time_values (Sequence[ValidTime]) – Time values to interpolate to.

  • time_column (str) – Name of the column containing cftime datetime objects.

  • method (str, optional) – Interpolation method. Default is ‘linear’.

Returns:

DataFrame with regular time intervals and interpolated values.

Return type:

pandas.DataFrame

Raises:

ValueError – If time_column is not found or contains invalid data.
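In the spirit of the reindex-then-interpolate note above, a minimal numeric sketch with numpy.interp on ordinary datetimes; the real function handles cftime axes and scipy's full set of interpolation methods, and this helper name is hypothetical.

```python
import numpy as np
import pandas as pd

def interpolate_df_sketch(df: pd.DataFrame, new_times,
                          time_column: str = "time") -> pd.DataFrame:
    """Linearly interpolate each numeric column of df onto new_times."""
    # Work in nanoseconds since the epoch so np.interp sees monotone floats.
    t = df[time_column].to_numpy().astype("int64")
    tq = pd.DatetimeIndex(new_times).to_numpy().astype("int64")
    out = pd.DataFrame({time_column: new_times})
    for col in df.select_dtypes("number").columns:
        out[col] = np.interp(tq, t, df[col].to_numpy())
    return out
```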

watershed_workflow.data.interpolate_Dataset(ds: Dataset | DataArray, time_values: Sequence[datetime | Timestamp | datetime | datetime64], time_dim: str = 'time', method: Literal['linear', 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'quintic', 'polynomial', 'barycentric', 'krogh', 'pchip', 'spline', 'akima', 'makima'] = 'linear') Dataset[source]#

Interpolate Dataset to arbitrary times.

Parameters:
  • ds (xr.Dataset) – Input Dataset with a time dimension containing cftime objects.

  • time_values (Sequence[ValidTime]) – Time values to interpolate to.

  • time_dim (str) – Name of the time dimension. Default is ‘time’.

  • method (str, optional) – Interpolation method. Default is ‘linear’.

Returns:

Dataset with regular time intervals and interpolated values.

Return type:

xr.Dataset

Raises:

ValueError – If the time dimension is not found.

watershed_workflow.data.rasterizeGeoDataFrame(gdf: GeoDataFrame, column: str, resolution: float, bounds: Tuple[float, float, float, float] | None = None, nodata: int | float | None = None) DataArray[source]#

Convert a GeoDataFrame to a rasterized DataArray based on a column’s values.

Parameters:
  • gdf (geopandas.GeoDataFrame) – Input GeoDataFrame containing geometries and data.

  • column (str) – Name of the column containing values to rasterize. Must be a numeric type.

  • resolution (float) – Spatial resolution of the output raster in the units of the GeoDataFrame’s CRS. This defines the size of each pixel.

  • bounds (tuple of float, optional) – Bounding box as (minx, miny, maxx, maxy). If None, bounds are computed from the GeoDataFrame’s total bounds.

  • nodata (int or float, optional) – Value to use for pixels not covered by any geometry. If None, defaults to NaN for float columns and -999 for integer columns.

Returns:

Rasterized data with dimensions (‘y’, ‘x’) and coordinates defined by the spatial extent and resolution. Areas outside geometries are set to the nodata value. The data type matches the column’s data type.

Return type:

xarray.DataArray

Raises:

ValueError – If column is not found in the GeoDataFrame. If column is not numeric type. If GeoDataFrame is empty. If resolution is not positive.

Notes

The function uses rasterio’s rasterization capabilities to burn geometries into a raster. When geometries overlap, the value from the last geometry in the GeoDataFrame is used.

The output DataArray includes the CRS information in its attributes if the GeoDataFrame has a CRS defined.

The dtype of the output array matches the dtype of the input column.

watershed_workflow.data.smooth2D(data: DataArray, dim1: str | None = None, dim2: str | None = None, method: Literal['uniform', 'gaussian', 'box'] = 'gaussian', variables: None = None, **kwargs) DataArray[source]#
watershed_workflow.data.smooth2D(data: Dataset, dim1: str | None = None, dim2: str | None = None, method: Literal['uniform', 'gaussian', 'box'] = 'gaussian', variables: list[str] | None = None, **kwargs) Dataset

Apply 2D spatial smoothing to data.

This function automatically selects the appropriate implementation based on the input data type: DataArray or Dataset.

Parameters:
  • data (xr.DataArray or xr.Dataset) – Input data with at least 2 spatial dimensions. The data to be smoothed must not contain NaN values.

  • dim1 (str, optional) – First spatial dimension. If None, will try to find ‘x’ or ‘lon’.

  • dim2 (str, optional) – Second spatial dimension. If None, will try to find ‘y’ or ‘lat’.

  • method ({'uniform', 'gaussian', 'box'}, optional) – Smoothing method. Default is ‘gaussian’.

  • variables (list of str, optional) – For Dataset: Variables to smooth (default: all with both spatial dims). For DataArray: Ignored.

  • **kwargs (dict) –

    Method-specific parameters:

    For ‘uniform’:
    • size : int or tuple of int (default: 3) Filter size in pixels.

    For ‘gaussian’:
    • sigma : float or tuple of float (default: 1.0) Standard deviation of Gaussian kernel.

    • truncate : float (default: 4.0) Truncate filter at this many standard deviations.

    For ‘box’:
    • kernel_size : int or tuple of int (default: 3) Size of box filter.

Returns:

Same type as input with spatially smoothed data.

Return type:

xr.DataArray or xr.Dataset

Raises:
  • TypeError – If data is not a DataArray or Dataset.

  • ValueError – If spatial dimensions are not found or data contains NaN values.

Examples

Smooth a DataArray with Gaussian filter:

>>> da = xr.DataArray(data, dims=['time', 'lat', 'lon'])
>>> smoothed = smooth2D(da, method='gaussian', sigma=2.0)

Smooth specific variables in a Dataset:

>>> ds = xr.Dataset({'temp': da1, 'pressure': da2})
>>> smoothed = smooth2D(ds, variables=['temp'], method='uniform', size=5)

Use custom dimension names:

>>> smoothed = smooth2D(data, dim1='x_coord', dim2='y_coord')

watershed_workflow.data.smooth2D_DataArray(da: DataArray, dim1: str | None = None, dim2: str | None = None, method: Literal['uniform', 'gaussian', 'box'] = 'gaussian', **kwargs) DataArray[source]#

Apply 2D spatial smoothing to a DataArray.

Parameters:
  • da (xr.DataArray) – Input DataArray with at least 2 spatial dimensions. Must not contain NaN values.

  • dim1 (str, optional) – First spatial dimension. If None, will try to find ‘x’ or ‘lon’.

  • dim2 (str, optional) – Second spatial dimension. If None, will try to find ‘y’ or ‘lat’.

  • method ({'uniform', 'gaussian', 'box'}, optional) – Smoothing method. Default is ‘gaussian’.

  • **kwargs (dict) –

    Method-specific parameters passed to smoothing function.

    For ‘uniform’:
    • size : int or tuple of int (default: 3)

    For ‘gaussian’:
    • sigma : float or tuple of float (default: 1.0)

    • truncate : float (default: 4.0)

    For ‘box’:
    • kernel_size : int or tuple of int (default: 3)

Returns:

DataArray with smoothed spatial data. All attributes and coordinates are preserved.

Return type:

xr.DataArray

Raises:

ValueError – If spatial dimensions are not found or data contains NaN values.

Notes

The smoothing is applied in 2D to each slice along non-spatial dimensions. For example, if the data has dimensions (time, lat, lon), smoothing is applied to each time slice independently.
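The slice-by-slice behavior described in the Notes can be sketched on a plain numpy stack with scipy.ndimage; the DataArray version applies the same filters while preserving coordinates and attributes.

```python
import numpy as np
from scipy import ndimage

# A (time, y, x) stack; Gaussian smoothing is applied to each time
# slice independently, leaving the time axis untouched.
stack = np.random.default_rng(0).random((4, 32, 32))
smoothed = np.stack([ndimage.gaussian_filter(s, sigma=2.0) for s in stack])
```

Equivalently, `ndimage.gaussian_filter(stack, sigma=(0.0, 2.0, 2.0))` smooths only the spatial axes in a single call.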

watershed_workflow.data.smooth2D_Dataset(ds: Dataset, dim1: str | None = None, dim2: str | None = None, method: Literal['uniform', 'gaussian', 'box'] = 'gaussian', variables: list[str] | None = None, **kwargs) Dataset[source]#

Apply 2D spatial smoothing to variables in a Dataset.

Parameters:
  • ds (xr.Dataset) – Input Dataset with at least 2 spatial dimensions. Variables to be smoothed must not contain NaN values.

  • dim1 (str, optional) – First spatial dimension. If None, will try to find ‘x’ or ‘lon’.

  • dim2 (str, optional) – Second spatial dimension. If None, will try to find ‘y’ or ‘lat’.

  • method ({'uniform', 'gaussian', 'box'}, optional) – Smoothing method. Default is ‘gaussian’.

  • variables (list of str, optional) – Variables to smooth. If None, smooths all variables that have both spatial dimensions.

  • **kwargs (dict) – Method-specific parameters passed to smoothing function.

Returns:

Dataset with smoothed spatial data. Variables without both spatial dimensions are preserved unchanged. All attributes are preserved.

Return type:

xr.Dataset

Raises:

ValueError – If spatial dimensions are not found, specified variables don’t exist, or any variable to be smoothed contains NaN values.

Notes

Only variables that contain both spatial dimensions are smoothed. Other variables are copied unchanged to the output.

watershed_workflow.data.smoothTimeSeries(data: DataArray, time_dim: str = 'time', method: Literal['savgol', 'rolling_mean'] = 'savgol', **kwargs) DataArray[source]#
watershed_workflow.data.smoothTimeSeries(data: Dataset, time_dim: str = 'time', method: Literal['savgol', 'rolling_mean'] = 'savgol', **kwargs) Dataset
watershed_workflow.data.smoothTimeSeries(data: DataFrame, time_dim: str = 'time', method: Literal['savgol', 'rolling_mean'] = 'savgol', **kwargs) DataFrame

Smooth time series data using specified method.

This function automatically selects the appropriate implementation based on the input data type: DataFrame, DataArray, or Dataset.

Parameters:
  • data (pandas.DataFrame, xr.DataArray, or xr.Dataset) – Input data with time series. The data to be smoothed must not contain NaN values.

  • time_dim (str, optional) – For DataFrame: Name of the time column. For DataArray/Dataset: Name of the time dimension. Default is ‘time’.

  • method ({'savgol', 'rolling_mean'}, optional) – Smoothing method. Default is ‘savgol’.

  • **kwargs (dict) –

    Method-specific parameters:

    For ‘savgol’:
    • window_length : int, odd number (default: 7)

    • polyorder : int (default: 3)

    • mode : {‘mirror’, ‘constant’, ‘nearest’, ‘wrap’, ‘interp’} (default: ‘interp’)

    For ‘rolling_mean’:
    • window : int (default: 5)

    • center : bool (default: True)

Returns:

Same type as input with smoothed data.

Return type:

pandas.DataFrame, xr.DataArray, or xr.Dataset

Raises:
  • TypeError – If data is not a DataFrame, DataArray, or Dataset. If DataFrame is provided without time_dim.

  • ValueError – If time column/dimension is not found. If data contains NaN values. If smoothing parameters are invalid.

watershed_workflow.data.smoothTimeSeries_Array(data: ndarray, method: Literal['savgol', 'rolling_mean'] = 'savgol', axis: int = -1, **kwargs) ndarray[source]#

Smooth time series data using specified method.

Parameters:
  • data (numpy.ndarray) – Array of data to smooth. Must not contain NaN values.

  • method ({'savgol', 'rolling_mean'}) – Smoothing method to use.

  • axis (int, optional) – Axis along which to smooth. Default is -1 (last axis).

  • **kwargs (dict) –

    Method-specific parameters.

    For ‘savgol’:
    • window_length : int, odd number (default: 7)

    • polyorder : int (default: 3)

    • mode : str (default: ‘interp’)

    For ‘rolling_mean’:
    • window : int (default: 5)

    • center : bool (default: True)

Returns:

Smoothed data array.

Return type:

numpy.ndarray

Raises:

ValueError – If method is not recognized, parameters are invalid, or data contains NaN.

watershed_workflow.data.smoothTimeSeries_DataArray(da: DataArray, time_dim: str = 'time', method: Literal['savgol', 'rolling_mean'] = 'savgol', **kwargs) DataArray[source]#

Smooth time series data in a DataArray along the time dimension.

Parameters:
  • da (xr.DataArray) – Input DataArray with time series data. Must not contain NaN values.

  • time_dim (str, optional) – Name of the time dimension. Default is ‘time’.

  • method ({'savgol', 'rolling_mean'}, optional) – Smoothing method. Default is ‘savgol’.

  • **kwargs (dict) – Method-specific parameters passed to smoothing function.

Returns:

DataArray with smoothed data. All attributes and coordinates are preserved.

Return type:

xr.DataArray

Raises:

ValueError – If time dimension is not found or data contains NaN values.

Notes

For multidimensional arrays, smoothing is applied along the time dimension for each combination of other dimensions (e.g., each spatial point).

watershed_workflow.data.smoothTimeSeries_DataFrame(df: DataFrame, time_column: str = 'time', method: Literal['savgol', 'rolling_mean'] = 'savgol', **kwargs) DataFrame[source]#

Smooth time series data in a DataFrame along the time dimension.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame with time series data. Must not contain NaN values in columns to be smoothed.

  • time_column (str) – Name of the time column.

  • method ({'savgol', 'rolling_mean'}, optional) – Smoothing method. Default is ‘savgol’.

  • **kwargs (dict) –

    Method-specific parameters passed to smoothing function.

    For ‘savgol’:
    • window_length : int, odd number (default: 7)

    • polyorder : int (default: 3)

    • mode : {‘mirror’, ‘constant’, ‘nearest’, ‘wrap’, ‘interp’} (default: ‘interp’)

    For ‘rolling_mean’:
    • window : int (default: 5)

    • center : bool (default: True)

Returns:

DataFrame with smoothed data.

Return type:

pandas.DataFrame

Raises:

ValueError – If any column to be smoothed contains NaN values.

Notes

The Savitzky-Golay filter is useful for smoothing noisy data while preserving important features like peaks. The rolling mean provides simple moving average smoothing.

Data is sorted by time before smoothing to ensure correct temporal ordering.
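A sketch of the default Savitzky-Golay smoothing on a single noisy column, calling scipy directly with the defaults documented above:

```python
import numpy as np
from scipy.signal import savgol_filter

# Noisy sine wave, smoothed with the documented savgol defaults
# (window_length=7, polyorder=3, mode='interp').
t = np.linspace(0.0, 2.0 * np.pi, 200)
noisy = np.sin(t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
smooth = savgol_filter(noisy, window_length=7, polyorder=3, mode="interp")
```

A larger window_length smooths more aggressively but can flatten the peaks that the Savitzky-Golay filter is chosen to preserve.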

watershed_workflow.data.smoothTimeSeries_Dataset(ds: Dataset, time_dim: str = 'time', method: Literal['savgol', 'rolling_mean'] = 'savgol', variables: List[str] | None = None, **kwargs) Dataset[source]#

Smooth time series data in a Dataset along the time dimension.

Parameters:
  • ds (xr.Dataset) – Input Dataset with time series data. Variables to be smoothed must not contain NaN values.

  • time_dim (str, optional) – Name of the time dimension. Default is ‘time’.

  • method ({'savgol', 'rolling_mean'}, optional) – Smoothing method. Default is ‘savgol’.

  • variables (list of str, optional) – Variables to smooth. If None, smooths all variables with the time dimension.

  • **kwargs (dict) – Method-specific parameters passed to smoothing function.

Returns:

Dataset with smoothed data. Variables without the time dimension are preserved unchanged. All attributes are preserved.

Return type:

xr.Dataset

Raises:

ValueError – If time dimension is not found, specified variables don’t exist, or any variable to be smoothed contains NaN values.

Notes

Variables without the time dimension are copied unchanged to the output.

Soil properties data manipulation#

Functions for manipulating soil properties.

Computes soil properties such as permeability, porosity, and van Genuchten parameters given texture properties using the Rosetta model.

Also provides functions for gap filling soil data via clustering, dataframe manipulations to merge soil type regions with shared values, etc.

watershed_workflow.soil_properties.cluster(rasters: ndarray, nbins: int) Tuple[ndarray, ndarray, Tuple[float, ndarray]][source]#

Given a set of raster bands, cluster them into nbins clusters.

Returns the coloring map of the clusters. This is used to fill in missing soil property data.

Parameters:
  • rasters (np.ndarray((nx,ny,nbands))) – nbands rasters providing spatial information on which to be clustered.

  • nbins (int) – Number of bins to cluster into.

Returns:

  • codebook (np.ndarray((nbins,nbands))) – The nbins centroids of the clusters.

  • codes (np.ndarray((nx, ny), int)) – Which cluster each point belongs to.

  • distortion (Tuple[float, np.ndarray((nx*ny,))]) – The distortion of the k-means clustering, and the distance between each observation and its nearest code.
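
The shapes above can be illustrated with SciPy's k-means utilities directly (a sketch of the general approach on hypothetical random bands, not this function's internals):

```python
import numpy as np
from scipy.cluster import vq

# Three hypothetical raster bands on a 20 x 30 grid.
nx, ny, nbands = 20, 30, 3
rasters = np.random.default_rng(1).random((nx, ny, nbands))

# Flatten the spatial dimensions into one observation per pixel.
obs = rasters.reshape(-1, nbands)

# Cluster into nbins; each pixel gets the label of its nearest centroid.
nbins = 4
codebook, labels = vq.kmeans2(obs, nbins, minit="++")
codes = labels.reshape(nx, ny)
```

Reshaping the labels back to (nx, ny) gives the per-pixel cluster map used to fill gaps in the soil property rasters.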

watershed_workflow.soil_properties.computeVGAlphaFromPermeability(perm: ndarray, poro: ndarray) ndarray[source]#

Compute van Genuchten alpha from permeability and porosity.

Uses the relationship from Guarracino WRR 2007.

Parameters:
  • perm (array(double)) – Permeability, in [m^2]

  • poro (array(double)) – Porosity, [-]

Returns:

alpha – van Genuchten alpha, in [Pa^-1]

Return type:

array(double)

watershed_workflow.soil_properties.computeVanGenuchtenModelFromSSURGO(df: DataFrame) DataFrame[source]#

Get van Genuchten model parameters using Rosetta v3.

Parameters:

df (pd.DataFrame) – SSURGO properties dataframe, from manager_nrcs.FileManagerNRCS().get_properties()

Returns:

df with new properties defining the van Genuchten model. Note that this may be smaller than df, as entries with NaN values in soil composition (for which a van Genuchten model cannot be calculated) are dropped.

Return type:

pd.DataFrame

watershed_workflow.soil_properties.computeVanGenuchtenModel_Rosetta(data: ndarray) DataFrame[source]#

Return van Genuchten model parameters using Rosetta v3 model.

(Zhang and Schaap, 2017 WRR)

Parameters:

data (numpy.ndarray(nvar, nsamples)) – Input data.

Returns:

van Genuchten model parameters

Return type:

pd.DataFrame

watershed_workflow.soil_properties.convertRosettaToATS(df: DataFrame) DataFrame[source]#

Converts aggregated Rosetta standard parameters to ATS units and naming conventions.

Parameters:

df (pd.DataFrame) – DataFrame with Rosetta parameters to convert.

Returns:

DataFrame with parameters converted to ATS units and naming conventions.

Return type:

pd.DataFrame

watershed_workflow.soil_properties.dropDuplicates(df: DataFrame) DataFrame[source]#

Search for duplicate soils which differ only by ID, and rename them, returning a new df.

Parameters:

df (pd.DataFrame) – A data frame that contains only properties (e.g. permeability, porosity, WRM) and is indexed by some native ID.

Returns:

df_new – After this is called, df_new will:

  1. have a new column, named by df’s index name, containing a tuple of all of the original indices that had the same properties.

  2. be reduced in number of rows relative to df such that soil properties are now unique

Return type:

pd.DataFrame
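
The described behavior can be sketched with a plain pandas groupby (hypothetical mukey index and property values, purely for illustration):

```python
import pandas as pd

# Two soils (101, 102) share identical properties; 103 differs.
df = pd.DataFrame(
    {"permeability": [1e-12, 1e-12, 5e-13], "porosity": [0.3, 0.3, 0.25]},
    index=pd.Index([101, 102, 103], name="mukey"),
)

# Group rows with identical properties; keep a tuple of the original IDs.
prop_cols = list(df.columns)
df_new = (
    df.reset_index()
      .groupby(prop_cols, as_index=False)
      .agg(mukey=("mukey", tuple))
)
# df_new has 2 rows; the duplicated soils collapse to mukey == (101, 102)
```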

watershed_workflow.soil_properties.getDefaultBedrockProperties() DataFrame[source]#

Simple helper function to get a one-row dataframe with bedrock properties.

Returns:

Sane default bedrock soil properties.

Return type:

pd.DataFrame

watershed_workflow.soil_properties.mangleGLHYMPSProperties(shapes: GeoDataFrame, min_porosity: float = 0.01, max_permeability: float = inf, max_vg_alpha: float = inf, residual_saturation: float = 0.01, van_genuchten_n: float = 2.0) GeoDataFrame[source]#

GLHYMPS properties need their units converted and variables renamed.

Parameters:
  • shapes (gpd.GeoDataFrame)

  • min_porosity (float, optional) – Some GLHYMPS entries have 0 porosity; this sets a floor on that value. Default is 0.01.

  • max_permeability (float, optional) – If provided, sets a ceiling on the permeability.

  • max_vg_alpha (float, optional) – If provided, sets a ceiling on the vG alpha.

  • residual_saturation (float, optional) – Residual saturation to assign. Default is 0.01.

  • van_genuchten_n (float, optional) – van Genuchten n to assign. Default is 2.0.

Returns:

The resulting properties in standard form, names, and units.

Return type:

gpd.GeoDataFrame
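
The floor and ceiling arguments behave like pandas clips. A minimal sketch with hypothetical values (the real function also handles unit conversion and renaming):

```python
import pandas as pd

props = pd.DataFrame({"porosity": [0.0, 0.4], "permeability": [1e-10, 1e-14]})

min_porosity = 0.01        # floor on porosity (some GLHYMPS entries are 0)
max_permeability = 1e-12   # ceiling on permeability

props["porosity"] = props["porosity"].clip(lower=min_porosity)
props["permeability"] = props["permeability"].clip(upper=max_permeability)
# porosity -> [0.01, 0.4], permeability -> [1e-12, 1e-14]
```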

Meteorology data manipulation#

Manipulate DayMet data structures.

DayMet is downloaded in box mode based on watershed bounds; it can then be converted to HDF5 files that models can read.

watershed_workflow.meteorology.allocatePrecipitation(precip: DataArray, air_temp: DataArray, transition_temperature: float) Tuple[DataArray, DataArray][source]#

Allocates precipitation between rain and snow based on temperature.

Parameters:
  • precip (xr.DataArray) – Total precipitation data.

  • air_temp (xr.DataArray) – Air temperature data.

  • transition_temperature (float) – Temperature threshold for rain/snow transition. If < 100, assumed to be in Celsius; otherwise Kelvin.

Returns:

  • rain (xr.DataArray) – Rain precipitation (when temp >= transition_temperature).

  • snow (xr.DataArray) – Snow precipitation (when temp < transition_temperature).
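
The rule is a simple threshold. A minimal numpy sketch (a hypothetical re-implementation, assuming the temperature and threshold are already in the same units):

```python
import numpy as np

def allocate_precipitation(precip, air_temp, transition_temperature):
    """Split total precip into rain (temp >= threshold) and snow (temp < threshold)."""
    is_rain = air_temp >= transition_temperature
    rain = np.where(is_rain, precip, 0.0)
    snow = np.where(is_rain, 0.0, precip)
    return rain, snow

precip = np.array([1.0, 2.0, 3.0])
air_temp_c = np.array([-5.0, 0.0, 10.0])
rain, snow = allocate_precipitation(precip, air_temp_c, 0.0)
# rain -> [0., 2., 3.], snow -> [1., 0., 0.]
```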

watershed_workflow.meteorology.convertAORCToATS(dat: Dataset, transition_temperature: float = 0.0, resample_interval: int = 1, remove_leap_day: bool = False) Dataset[source]#

Convert xarray.Dataset AORC datasets to standard ATS format output.

  • converts specific humidity and surface pressure to vapor pressure

  • computes total wind speed from component wind speeds

  • converts precip units to m/s

  • allocates precip to snow and rain based on transition temp

Parameters:
  • dat (xr.Dataset) – Input including AORC raw data.

  • transition_temperature (float) – Temperature to transition from snow to rain [C]. Default is 0 C.

  • resample_interval (int, optional) – Resample the 1-hourly data to n-hourly intervals to reduce data volume. Default is 1 (no resampling).

  • remove_leap_day (bool) – If True, removes day 366 of any leap year (not Feb 29!). Default is False.

Returns:

Dataset with ATS-standard names/units met forcing.

Return type:

xr.Dataset
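
Two of the listed conversions follow standard meteorological formulas; a sketch (vapor pressure from specific humidity via the standard epsilon = 0.622 relation, and wind speed magnitude from components):

```python
import numpy as np

EPS = 0.622  # ratio of molar masses of water vapor to dry air

def vapor_pressure(specific_humidity, surface_pressure):
    # e = q * p / (eps + (1 - eps) * q), the inversion of the standard
    # relation q = eps * e / (p - (1 - eps) * e)
    q = specific_humidity
    return q * surface_pressure / (EPS + (1.0 - EPS) * q)

def total_wind_speed(u, v):
    # magnitude of the component wind speeds
    return np.hypot(u, v)

speed = total_wind_speed(3.0, 4.0)   # -> 5.0
e = vapor_pressure(0.01, 101325.0)   # roughly 1619 Pa
```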

watershed_workflow.meteorology.convertDayMetToATS(dat: Dataset, transition_temperature: float = 0.0) Dataset[source]#

Convert xarray.Dataset Daymet datasets to daily average data in standard form.

This:

  • takes tmin and tmax to compute a mean air temperature

  • splits precip into rain and snow based on the mean air temp relative to transition_temperature [C]

  • standardizes units and names for ATS

Parameters:
  • dat (xr.Dataset) – Input Daymet dataset with variables: tmin, tmax, prcp, srad, dayl, vp.

  • transition_temperature (float, optional) – Temperature threshold for rain/snow split in Celsius. Default is 0.

Returns:

Dataset with ATS-compatible variable names and units.

Return type:

xr.Dataset