Utilities Package#

from diveplane.utilities import ...

This module contains various utilities for the Diveplane clients.

class diveplane.utilities.FeatureType#

Bases: Enum

Feature type enum.

class diveplane.utilities.ProgressTimer#

Bases: Timer

Monitor progress of a task.

Parameters:
  • total_ticks (int, default 100) – The total number of ticks in the progress meter.

  • start_tick (int, default 0) – The starting tick.

__init__(total_ticks=100, *, start_tick=0)#

Initialize a new ProgressTimer instance.

Parameters:
  • total_ticks (int) –

  • start_tick (int) –

property is_complete: bool#

If progress has reached completion.

property progress: float#

The current progress percentage.

reset()#

Reset the progress timer.

Return type:

None

start()#

Start the progress timer.

Return type:

ProgressTimer

property tick_duration: timedelta | None#

The duration since the last tick.

Returns:

The duration since the last tick, or None if not yet started.

Return type:

timedelta or None

property time_remaining: timedelta#

The estimated time remaining.

Returns:

The time estimated to be remaining.

Return type:

timedelta

Raises:

ValueError – If timer not yet started.

update(ticks=1)#

Update the progress by given ticks.

Parameters:

ticks (int, default 1) – The number of ticks to increment/decrement by.

Return type:

None
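
As a rough illustration of how a remaining-time estimate can be derived from tick counts, the sketch below extrapolates linearly from the elapsed time. The function name `sketch_time_remaining` and the linear-extrapolation rule are assumptions made for this example; the documentation above does not state how ProgressTimer computes its estimate.

```python
from datetime import timedelta


def sketch_time_remaining(elapsed, ticks_done, total_ticks=100):
    """Hypothetical helper: linearly extrapolate the time left.

    This is NOT the library's implementation, only one plausible way
    to turn tick counts into an estimate.
    """
    if ticks_done <= 0:
        # Without any completed ticks there is nothing to extrapolate from.
        raise ValueError("No progress yet; cannot extrapolate.")
    remaining_ticks = max(total_ticks - ticks_done, 0)
    # Average time per completed tick, multiplied by the ticks still to go.
    return elapsed * (remaining_ticks / ticks_done)


estimate = sketch_time_remaining(timedelta(seconds=10), ticks_done=25)
print(estimate)
```

With 25 of 100 ticks done after 10 seconds, this rule predicts three times the elapsed time for the remainder.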

exception diveplane.utilities.StopExecution#

Bases: Exception

Raise a StopExecution as this is a cleaner exit() for Notebooks.

class diveplane.utilities.Timer#

Bases: object

Simple context manager to capture run duration of the inner context.

Usage:

with Timer() as my_timer:
    # perform time-consuming task here...
print(f"The task took {my_timer.duration}.")

Results in:

"The task took 1:30:10.454419."

__init__()#

Initialize a new Timer instance.

property duration: timedelta | None#

The total computed duration of the timer.

Returns:

The total duration of the timer. When the timer has not yet ended, the duration between now and when the timer started will be returned. If the timer has not yet started, returns None.

Return type:

timedelta or None

end()#

End the timer.

Return type:

None

property has_ended: bool#

If the timer has ended.

property has_started: bool#

If the timer has started.

reset()#

Reset the timer.

Return type:

None

start()#

Start the timer.

Return type:

Timer
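
The behaviour documented for `duration` (None before start, a running value until the timer ends, then a fixed value) can be pictured with a small stand-in class. `SketchTimer` below is a hypothetical, stdlib-only illustration written for this page; it is not the library's Timer and only mirrors the context-manager usage and the `duration` semantics described above.

```python
import time
from datetime import timedelta


class SketchTimer:
    """Hypothetical stand-in for illustration; not diveplane's Timer."""

    def __init__(self):
        self._start = None
        self._end = None

    def __enter__(self):
        # Entering the context starts the clock.
        self._start = time.monotonic()
        self._end = None
        return self

    def __exit__(self, exc_type, exc, tb):
        # Leaving the context stops the clock; exceptions propagate.
        self._end = time.monotonic()
        return False

    @property
    def duration(self):
        if self._start is None:
            return None  # not yet started
        end = self._end if self._end is not None else time.monotonic()
        return timedelta(seconds=end - self._start)


with SketchTimer() as my_timer:
    sum(range(1000))  # stand-in for a time-consuming task
print(f"The task took {my_timer.duration}.")
```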

class diveplane.utilities.UserFriendlyExit#

Bases: object

Return a callable that, when called, simply prints msg and cleanly exits.

Parameters:

verbose (bool) – If True, emit more information.

__init__(verbose=False)#

Construct a UserFriendlyExit instance.

diveplane.utilities.align_data(x, y=None)#

Check and fix type problems with the data and reshape it.

x is a Matrix and y is a vector.

Parameters:
  • x (numpy.ndarray) – Feature values ndarray.

  • y (numpy.ndarray, default None) – Target values ndarray.

Return type:

(numpy.ndarray, numpy.ndarray) or numpy.ndarray

diveplane.utilities.date_format_is_iso(f)#

Check if datetime format is ISO8601.

Checks whether the format matches the ISO8601 set that can be handled by the C parser. Formats are generally of the form YYYY-MM-DDTHH:MM:SS; the date separator can differ but must be consistent. Leading 0s in dates and times are optional.

Sourced from Pandas: pandas-dev/pandas

diveplane.utilities.date_to_epoch(date_obj, time_format)#

Convert a date into epoch time (i.e., seconds counted from Jan 1st 1970).

Note

If date_obj is None or NaN, it will be returned as is.

Parameters:
  • date_obj (str or datetime.date or datetime.time or datetime.datetime) – Time object.

  • time_format (str) – Specify format of the time. Ex: %a %b %d %H:%M:%S %Y

Returns:

The epoch date as a floating point value or ‘np.nan’, et al.

Return type:

Union[str, float]
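
The conversion described above can be approximated with the standard library alone. The sketch below handles only string input and treats naive datetimes as UTC; both are simplifying assumptions, and `sketch_date_to_epoch` is a hypothetical name, not the library function.

```python
import math
from datetime import datetime, timezone


def sketch_date_to_epoch(date_str, time_format):
    """Hypothetical stdlib-only approximation of the documented behaviour."""
    # Missing values are passed through unchanged, as the note above describes.
    if date_str is None:
        return date_str
    if isinstance(date_str, float) and math.isnan(date_str):
        return date_str
    parsed = datetime.strptime(date_str, time_format)
    if parsed.tzinfo is None:
        # Assumption: interpret naive datetimes as UTC for reproducibility.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.timestamp()


epoch = sketch_date_to_epoch("Wed May 21 00:00:00 2008", "%a %b %d %H:%M:%S %Y")
print(epoch)  # 1211328000.0
```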

diveplane.utilities.deserialize_cases(data, columns, features=None)#

Deserialize case data into a DataFrame.

If feature attributes contain original typing information, columns will be converted to the same data type as original training cases.

Parameters:
  • data (list of list or list of dict) – The context data.

  • columns (list of str) –

    The case column mapping.

    The order corresponds to how the data will be mapped to columns in the output. Ignored for list of dict where the dict key is the column name.

  • features (dict, default None) –

    (Optional) The dictionary of feature name to feature attributes.

    If not specified, no column typing will be attempted.

Returns:

The deserialized data.

Return type:

pandas.DataFrame

diveplane.utilities.dprint(debug, *argc, **kwargs)#

Print based on debug levels.

Parameters:
  • debug (bool or int) – The debug level. If True, the level is treated as 1. Possible levels: 1, 2, 3 (print all).

  • kwargs –

    default_priority (int, default 1)

    The message is printed only if debug >= default_priority.

Examples

>>> dprint(True, "hello", "diveplane", priority=1)
`hello diveplane`

diveplane.utilities.epoch_to_date(epoch, time_format, tzinfo=None)#

Convert epoch to a date string. If epoch is None or NaN, it is returned as is.

Parameters:
  • epoch (Union[str, float]) – The epoch date as a floating point value (or str if np.nan, et al)

  • time_format (str) – Specify format of the time. Ex: %a %b %d %H:%M:%S %Y

  • tzinfo (datetime.tzinfo, optional) – Time zone information to include in datetime.

Returns:

A date string in the format similar to “Wed May 21 00:00:00 2008”

Return type:

str
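
The reverse conversion, including the effect of the tzinfo argument, can be sketched with `datetime.fromtimestamp` and `strftime`. The name `sketch_epoch_to_date` is hypothetical, and defaulting to UTC when no time zone is given is an assumption of this example rather than documented library behaviour.

```python
import math
from datetime import datetime, timedelta, timezone


def sketch_epoch_to_date(epoch, time_format, tzinfo=None):
    """Hypothetical stdlib-only approximation; not the library function."""
    # Missing values are returned unchanged.
    if epoch is None or (isinstance(epoch, float) and math.isnan(epoch)):
        return epoch
    # Assumption: fall back to UTC when no time zone is supplied.
    tz = tzinfo if tzinfo is not None else timezone.utc
    return datetime.fromtimestamp(epoch, tz=tz).strftime(time_format)


fmt = "%a %b %d %H:%M:%S %Y"
print(sketch_epoch_to_date(1211328000.0, fmt))
print(sketch_epoch_to_date(1211328000.0, fmt, timezone(timedelta(hours=2))))
```

The second call renders the same instant two hours later on the wall clock, which is the kind of shift a non-UTC tzinfo introduces.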

diveplane.utilities.format_dataframe(df, features)#

Format DataFrame columns to original type using feature attributes.

Note

Modifies DataFrame in place.

Parameters:
  • df (pandas.DataFrame) – The DataFrame to format columns of.

  • features (Dict) – The dictionary of feature name to feature attributes.

Returns:

The formatted data.

Return type:

pandas.DataFrame

diveplane.utilities.guess_feature_attributes(*args, **kwargs)#

Shim the deprecated guess_feature_attributes to raise a warning.

diveplane.utilities.infer_feature_attributes(data, *, tables=None, time_feature_name=None, **kwargs)#

Return a dict-like feature attributes object with useful accessor methods.

The returned object is a subclass of FeatureAttributesBase that is appropriate for the provided data type.

Parameters:
  • data (Any) – The data source to infer feature attributes from. Must be a supported data type.

  • tables (Iterable of TableNameProtocol) –

    (Optional, required for datastores) An Iterable of table names to infer feature attributes for.

    If included, feature attributes will be generated in the form {table_name: {feature_attribute: value}}.

  • time_feature_name (str, default None) – (Optional, required for time series) The name of the time feature.

  • features (dict or None, default None) –

    (Optional) A partially filled features dict. If partially filled attributes for a feature are passed in, those parameters will be retained as is and the rest of the attributes will be inferred.

    For example:
    >>> from pprint import pprint
    >>> df.head(2)
    ... sepal-length  sepal-width  petal-length  petal-width  target
    ... 0           6.7          3.0           5.2          2.3       2
    ... 1           6.0          2.2           5.0          1.5       2
    >>> # Partially filled features dict
    >>> partial_features = {
    ...     "sepal-length": {
    ...         "type": "continuous",
    ...         'bounds': {
    ...             'min': 2.72,
    ...             'max': 3,
    ...             'allow_null': True
    ...         },
    ...     },
    ...     "sepal-width": {
    ...         "type": "continuous"
    ...     }
    ... }
    >>> # Infer rest of the attributes
    >>> features = infer_feature_attributes(
    ...     df, features=partial_features
    ... )
    >>> # Inferred Feature dictionary
    >>> pprint(features)
    ... {
    ...     'sepal-length': {
    ...         'bounds': {
    ...             'allow_null': True, 'max': 3, 'min': 2.72
    ...         },
    ...         'type': 'continuous'
    ...     },
    ...     'sepal-width': {
    ...         'bounds': {
    ...             'allow_null': True, 'max': 7.38905609893065,
    ...             'min': 1.0
    ...         },
    ...         'type': 'continuous'
    ...     },
    ...     'petal-length': {
    ...         'bounds': {
    ...             'allow_null': True, 'max': 7.38905609893065,
    ...             'min': 1.0
    ...         },
    ...         'type': 'continuous'
    ...     },
    ...     'petal-width': {
    ...         'bounds': {
    ...             'allow_null': True, 'max': 2.718281828459045,
    ...             'min': 0.049787068367863944
    ...         },
    ...         'type': 'continuous'
    ...     },
    ...     'target': {
    ...         'bounds': {'allow_null': True},
    ...         'type': 'nominal'
    ...     }
    ... }
    

  • infer_bounds (bool, default True) – (Optional) If True, bounds will be inferred for the features if the feature column has at least one non-NaN value.

  • datetime_feature_formats (dict, default None) –

    (Optional) Dict defining a custom (non-ISO8601) datetime format and an optional locale for features with datetimes. By default datetime features are assumed to be in ISO8601 format. Non-English datetimes must have locales specified. If locale is omitted, the default system locale is used. The keys are the feature name, and the values are a tuple of date time format and locale string.

    Example:

    {
        "start_date": ("%Y-%m-%d %A %H.%M.%S", "es_ES"),
        "end_date": "%Y-%m-%d"
    }
    

  • delta_boundaries (dict, default None) –

    (Optional) For time series, specify the delta boundaries in the form {“feature” : {“min|max” : {order : value}}}. Works with partial values by specifying only particular order of derivatives you would like to overwrite. Invalid orders will be ignored.

    Examples:

    {
        "stock_value": {
            "min": {
                '0' : 0.178,
                '1': 3.4582e-3,
                '2': None
            }
        }
    }
    

  • derived_orders (dict, default None) – (Optional) Dict of features to the number of orders of derivatives that should be derived instead of synthesized. For example, for a feature with a 3rd order of derivative, setting its derived_orders to 2 will synthesize the 3rd order derivative value, and then use that synthesized value to derive the 2nd and 1st order.

  • dropna (bool, default False) – (Optional) If True, all features will be populated with the ‘dropna’: True parameter, so rows containing NaNs will be automatically dropped during training.

  • lags (list or dict, default None) –

    (Optional) A list containing the specific indices of the desired lag features to derive for each feature (not including the series time feature). Specifying derived lag features for the feature specified by time_feature_name must be done using a dictionary. A dictionary can be used to specify a list of specific lag indices for specific features. For example: {“feature1”: [1, 3, 5]} would derive three different lag features for feature1. The resulting lag features hold values 1, 3, and 5 timesteps behind the current timestep respectively.

    Note

    Using the lags parameter will override the num_lags parameter per feature

    Note

    A lag feature is a feature that provides a “lagging value” to a case by holding the value of a feature from a previous timestep. These lag features allow for cases to hold more temporal information.

  • num_lags (int or dict, default None) –

    (Optional) An integer specifying the number of lag features to derive for each feature (not including the series time feature). Specifying derived lag features for the feature specified by time_feature_name must be done using a dictionary. A dictionary can be used to specify numbers of lags for specific features. Features that are not specified will default to 1 lag feature.

    Note

    The num_lags parameter will be overridden by the lags parameter per feature.

  • orders_of_derivatives (dict, default None) – (Optional) Dict of features and their corresponding order of derivatives for the specified type (delta/rate). If provided will generate the specified number of derivatives and boundary values. If set to 0, will not generate any delta/rate features. By default all continuous features have an order value of 1.

  • rate_boundaries (dict, default None) –

    (Optional) For time series, specify the rate boundaries in the form {“feature” : {“min|max” : {order : value}}}. Works with partial values by specifying only particular order of derivatives you would like to overwrite. Invalid orders will be ignored.

    Examples:

    {
        "stock_value": {
            "min": {
                '0' : 0.178,
                '1': 3.4582e-3,
                '2': None
            }
        }
    }
    

  • tight_bound_features (list of str, default None) – (Deprecated) Explicit list of feature names that should have tight bounds. This argument is deprecated; please use tight_bounds instead.

  • tight_bounds (bool or Iterable of str, default False) – (Optional) Set tight min and max bounds on either the features specified in the Iterable, or on all continuous features if True, or none if False.

  • tight_time_bounds (bool, default False) – (Optional) If True, sets the tight_bounds flag on the time feature to True. This causes the bounds for the start and end times to be set to the same bounds as observed in the original data.

  • time_series_type_default (str, default 'rate') – (Optional) Type specifying how time series is generated. One of ‘rate’ or ‘delta’, default is ‘rate’. If ‘rate’, it uses the difference of the current value from its previous value divided by the change in time since the previous value. When ‘delta’ is specified, just uses the difference of the current value from its previous value regardless of the elapsed time.

  • time_series_types_override (dict, default None) – (Optional) Dict of features and their corresponding time series type, one of ‘rate’ or ‘delta’, used to override time_series_type_default for the specified features.

  • mode_bound_features (list of str, default None) – (Optional) Explicit list of feature names to use mode bounds for when inferring loose bounds. If None, assumes all features. A mode bound is used instead of a loose bound when the mode for the feature is the same as an original bound, as it may represent an application-specific min/max.

  • id_feature_name (str or list of str, default None) – (Optional) The name(s) of the ID feature(s).

  • time_invariant_features (list of str, default None) – (Optional) Names of time-invariant features.

  • attempt_infer_extended_nominals (bool, default False) –

    (Optional) If set to True, detections of extended nominals will be attempted. If the detection fails, the categorical variables will be set to int-id subtype.

    Note

    Please refer to kwargs for other parameters related to extended nominals.

  • nominal_substitution_config (dict of dicts, default None) – (Optional) Configuration of the nominal substitution engine and the nominal generators and detectors.

  • include_extended_nominal_probabilities (bool, default False) – (Optional) If true, extended nominal probabilities will be appended as metadata into the feature object.

  • ordinal_feature_values (dict, default None) –

    (Optional) Dict for ordinal string features defining an ordered list of string values for each feature, ordered low to high. If specified, ‘type’ will be set to ‘ordinal’ for all features in this map.

    Example:

    {
        "grade" : [ "F", "D", "C", "B", "A" ],
        "size" : [ "small", "medium", "large", "huge" ]
    }
    

  • dependent_features (dict, default None) –

    (Optional) Dict of features with their respective lists of features that either the feature depends on or are dependent on them. Should be used when there are multi-type value features that tightly depend on values based on other multi-type value features.

    Examples:

    If there’s a feature named ‘measurement’ that contains measurements such as BMI, heart rate and weight, while the feature ‘measurement_amount’ contains the numerical values corresponding to the measurement, dependent features could be passed in as follows:

    {
        "measurement": [ "measurement_amount" ]
    }
    

    Since dependence directionality is not important, this will also work:

    {
        "measurement_amount": [ "measurement" ]
    }
    

Returns:

A subclass of FeatureAttributesBase (Single/MultiTableFeatureAttributes) that extends dict, thus providing dict-like access to feature attributes and useful accessor methods.

Return type:

FeatureAttributesBase
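
To make the ‘rate’ versus ‘delta’ distinction and the notion of a lag feature concrete, here is a plain-Python sketch of the arithmetic described in the time_series_type_default and lags parameters. The helper names are hypothetical and the code only illustrates the definitions; it does not reproduce how the library derives these features internally.

```python
def first_order_differences(times, values, kind="rate"):
    """Hypothetical helper: first-order 'delta' or 'rate' per timestep."""
    if kind not in ("rate", "delta"):
        raise ValueError("kind must be 'rate' or 'delta'")
    result = []
    for i in range(1, len(values)):
        diff = values[i] - values[i - 1]
        if kind == "rate":
            # 'rate' divides the change by the elapsed time.
            diff = diff / (times[i] - times[i - 1])
        result.append(diff)
    return result


def lag_feature(values, k):
    """Hypothetical helper: the value from k timesteps earlier, or None."""
    n = len(values)
    return [None] * min(k, n) + values[: max(n - k, 0)]


times = [0.0, 2.0, 3.0]
values = [10.0, 14.0, 15.0]
print(first_order_differences(times, values, "delta"))  # [4.0, 1.0]
print(first_order_differences(times, values, "rate"))   # [2.0, 1.0]
print(lag_feature([1, 2, 3, 4], 1))                     # [None, 1, 2, 3]
```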

Examples

# 'data' is a DataFrame
>> attrs = infer_feature_attributes(data, dropna=True)
# Can access feature attributes like a dict
>> attrs
    {
        "feature_one": {
            "type": "continuous",
            "bounds": {"allow_null": True},
        },
        "feature_two": {
            "type": "nominal",
        }
    }
>> attrs["feature_one"]
    {
        "type": "continuous",
        "bounds": {"allow_null": True}
    }
# Or can call methods to do other stuff
>> attrs.get_parameters()
    {'dropna': True}

# Now 'data' is an object that implements SQLRelationalDatastoreProtocol
>> attrs = infer_feature_attributes(data, tables=tables, tight_bounds=False)
>> attrs
    {
        "table_1": {
            "feature_one": {
                "type": "continuous",
                "bounds": {"allow_null": True},
            },
            "feature_two": {
                "type": "nominal",
            }
        },
        "table_2" : {...},
    }
>> attrs.to_json()
    '{"table_1" : {...}}'

diveplane.utilities.is_valid_uuid(value, version=4)#

Check if a given string is a valid uuid.

Parameters:
  • value (str or UUID) – The value to test

  • version (int, optional) – The uuid version (Default: 4)

Returns:

True if value is a valid uuid string

Return type:

bool
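
A check of this kind can be written with the standard `uuid` module. The sketch below is one possible approach under stated assumptions (round-trip comparison of the canonical string form); `sketch_is_valid_uuid` is a hypothetical name and may differ from the library's actual logic.

```python
import uuid


def sketch_is_valid_uuid(value, version=4):
    """Hypothetical stdlib-only check; not the library implementation."""
    if isinstance(value, uuid.UUID):
        return value.version == version
    try:
        # Passing `version` forces the version bits, so a string of a
        # different version will no longer round-trip to itself.
        parsed = uuid.UUID(str(value), version=version)
    except ValueError:
        return False
    return str(parsed) == str(value).lower()


print(sketch_is_valid_uuid(str(uuid.uuid4())))  # True
print(sketch_is_valid_uuid("not-a-uuid"))       # False
```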

diveplane.utilities.num_list_dimensions(lst)#

Return number of dimensions for a list.

Assumes that the nested elements of the input are also lists, or that the input is a list of DataFrames.

Parameters:

lst (list) – The nested list of objects.

Returns:

The number of dimensions in the passed in list.

Return type:

int
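
One simple way to count nesting depth is to descend through the first element at each level. The sketch below does exactly that for plain lists; the name `sketch_num_list_dimensions`, the first-element shortcut, and the treatment of an empty list as one dimension are assumptions of this example, and DataFrame handling is left out.

```python
def sketch_num_list_dimensions(lst):
    """Hypothetical depth counter for nested lists; illustration only."""
    depth = 0
    current = lst
    while isinstance(current, list):
        depth += 1
        if not current:
            break  # an empty list still counts as one dimension
        # Assumption: every sibling has the same depth as the first element.
        current = current[0]
    return depth


print(sketch_num_list_dimensions([1, 2, 3]))         # 1
print(sketch_num_list_dimensions([[1, 2], [3, 4]]))  # 2
```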

diveplane.utilities.replace_doublemax_with_infinity(dat)#

Replace values of Double.MAX_VALUE (1.79769313486232E+308) with Infinity.

For use when retrieving data from Diveplane.

Parameters:

dat (A dict, list, number, or string) –

Return type:

A dict, list, number, or string - same as passed in for translation
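
The substitution can be pictured as a recursive walk over nested dicts and lists. The sketch below is a hypothetical stdlib-only version that uses `sys.float_info.max` as the Double.MAX_VALUE sentinel and only handles the positive value, which is an assumption based on the one-line description above.

```python
import sys


def sketch_replace_doublemax_with_infinity(dat):
    """Hypothetical recursive replacement; not the library implementation."""
    if isinstance(dat, dict):
        return {k: sketch_replace_doublemax_with_infinity(v) for k, v in dat.items()}
    if isinstance(dat, list):
        return [sketch_replace_doublemax_with_infinity(v) for v in dat]
    if isinstance(dat, float) and dat == sys.float_info.max:
        return float("inf")
    # Strings, other numbers, and None are returned unchanged.
    return dat


payload = {"bounds": [sys.float_info.max, 1.0], "name": "x"}
print(sketch_replace_doublemax_with_infinity(payload))
```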

diveplane.utilities.replace_nan_with_none(dat)#

Replace NaN values with None values.

For use when feeding data to Diveplane from the scikit module to account for the different ways Diveplane and sklearn represent missing values.

Parameters:

dat (list of list of object) – A 2d list of values.

Return type:

list[list[object]]

diveplane.utilities.replace_none_with_nan(dat)#

Replace None values with NaN values.

For use when retrieving data from Diveplane via the scikit module to conform to sklearn convention on missing values.

Parameters:

dat (list of dict of key-values) –

Return type:

list[dict]
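
The two missing-value conversions are mirror images of each other: one operates on a 2d list, the other on a list of dicts. The sketch below shows both directions with hypothetical helper names and the standard library only; it illustrates the documented input shapes, not the library's code.

```python
import math


def sketch_nan_to_none(rows):
    """Hypothetical: replace float NaN with None in a 2d list."""
    return [
        [None if isinstance(v, float) and math.isnan(v) else v for v in row]
        for row in rows
    ]


def sketch_none_to_nan(records):
    """Hypothetical: replace None with float NaN in a list of dicts."""
    return [
        {k: (float("nan") if v is None else v) for k, v in rec.items()}
        for rec in records
    ]


print(sketch_nan_to_none([[1.0, float("nan")], ["a", 2]]))  # [[1.0, None], ['a', 2]]
print(sketch_none_to_nan([{"a": None, "b": 2}]))
```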

diveplane.utilities.reshape_data(x, y)#

Reshape x as a matrix and y as a vector.

Parameters:
  • x (np.ndarray) – Feature values ndarray.

  • y (np.ndarray) – Target values ndarray.

Returns:

X, y

Return type:

np.ndarray, np.ndarray

diveplane.utilities.seconds_to_time(seconds, *, tzinfo=None)#

Convert seconds to a time object.

Parameters:
  • seconds (int or float) – The seconds to convert to time.

  • tzinfo (datetime.tzinfo, optional) – Time zone to use for resulting time object.

Returns:

The time object.

Return type:

datetime.time

diveplane.utilities.serialize_cases(data, columns, features, *, warn=False)#

Serialize case data into list of lists.

Parameters:
  • data (pandas.DataFrame or numpy.ndarray or list of list) – The data to serialize.

  • columns (list of str) – The case column mapping. The order corresponds to the order of cases in output.

  • features (dict) – The dictionary of feature name to feature attributes.

  • warn (bool, default False) – If warnings should be raised by serializer.

Returns:

The serialized data from DataFrame.

Return type:

list of list or None

diveplane.utilities.time_to_seconds(time)#

Convert a time object to seconds since midnight.

Parameters:

time (datetime.time) – The time to convert.

Returns:

Seconds since midnight.

Return type:

float
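
The relationship between seconds_to_time and time_to_seconds is a simple round trip over seconds since midnight. The sketch below reproduces that arithmetic with hypothetical helper names; wrapping values of 24 hours or more back into a single day is an assumption of this example.

```python
from datetime import datetime, time, timedelta


def sketch_time_to_seconds(t):
    """Hypothetical: seconds since midnight for a datetime.time."""
    return t.hour * 3600 + t.minute * 60 + t.second + t.microsecond / 1_000_000


def sketch_seconds_to_time(seconds, *, tzinfo=None):
    """Hypothetical: build a datetime.time from seconds since midnight."""
    # Assumption: wrap around at 24 hours so the result is always valid.
    moment = datetime.min + timedelta(seconds=seconds % 86400)
    return moment.time().replace(tzinfo=tzinfo)


print(sketch_time_to_seconds(time(1, 30, 10)))  # 5410.0
print(sketch_seconds_to_time(5410))             # 01:30:10
```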

diveplane.utilities.trainee_from_df(df, features=None, action_features=None, name=None, persistence='allow', trainee_metadata=None)#

Create a Trainee from a dataframe.

Assumes floats are continuous and all other values are nominal.

Parameters:
  • df (pandas.DataFrame) – A pandas DataFrame with column names corresponding to feature names. Features that are considered to be continuous should have a dtype of float.

  • features (Optional[Mapping[str, Mapping]]) – (Optional) A dictionary of feature names to a dictionary of parameters.

  • action_features (list of str, default None) – (Optional) List of action features. Anything that’s not in this list will be treated as a context feature. For example, if no action feature is specified, the trainee won’t have a target.

  • name (str or None, default None) – (Optional) The name of the trainee.

  • persistence (str, default "allow") – The persistence setting to use for the trainee. Valid values: “always”, “allow”, “never”.

  • trainee_metadata (Mapping, optional) – (Optional) Mapping of key/value pairs of metadata for the trainee.

Returns:

A trainee object

Return type:

diveplane.openapi.models.Trainee

diveplane.utilities.validate_features(features, extended_feature_types=None)#

Validate the feature types in features.

Parameters:
  • features (dict) –

    The dict of feature name to feature attributes.

    The valid feature types are:

    1. “nominal”

    2. “continuous”

    3. “ordinal”

    4. Any types passed in via extended_feature_types

  • extended_feature_types (list of str, optional) – (Optional) If a list is passed in, the feature types specified in the list will be considered as valid features.

Return type:

None
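
The validation rule above amounts to checking each feature's ‘type’ against an allowed set. The sketch below is a hypothetical version; in particular, raising ValueError on an unknown type is an assumption, since the documentation above does not say how invalid features are reported.

```python
def sketch_validate_features(features, extended_feature_types=None):
    """Hypothetical type check; error behaviour is an assumption."""
    valid_types = {"nominal", "continuous", "ordinal"}
    if extended_feature_types:
        valid_types.update(extended_feature_types)
    for name, attributes in features.items():
        feature_type = attributes.get("type")
        if feature_type not in valid_types:
            raise ValueError(
                f"Feature {name!r} has unsupported type {feature_type!r}."
            )


features = {
    "sepal-length": {"type": "continuous"},
    "target": {"type": "nominal"},
}
sketch_validate_features(features)  # passes silently
```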

diveplane.utilities.validate_list_shape(values, dimensions, variable_name, var_types, allow_none=True)#

Validate the shape of a list.

Raise a ValueError if it does not match expected number of dimensions.

Parameters:
  • values (Collection or None) – A single or multidimensional list.

  • dimensions (int) – The number of dimensions the list should be.

  • variable_name (str) – The variable name for output.

  • var_types (str) – The expected type of the data.

  • allow_none (bool, default True) – If None should be allowed.

Return type:

None