# Recipe 1: Diveplane <span style="color:orange">REACTOR™</span> Intro 

## Overview 

 
Diveplane <span style="color:orange">REACTOR™</span> is a generalized Machine Learning (ML) and Artificial Intelligence platform that creates powerful decision-making models that are fully explainable, auditable, and editable. <span style="color:orange">REACTOR™</span> uses Instance-Based Machine Learning which stores instances, i.e., data points, in memory and makes predictions about new instances given their relationship to existing instances. This technology harnesses a fast spatial query system and information theory for performance and accuracy. 

 

<span style="color:orange">REACTOR™</span> is unique in that it utilizes a one-model approach (the data is the model) that enables the user to accomplish: 

 

- "Targetless Learning" - allows the user to predict and characterize any set of variables 

- Supervised Learning 

- Unsupervised Learning 

- Reinforcement Learning 

- Online Learning 

- Discriminative and Generative Modeling 

- Model Interpretation and Explanation 

- Data Imputation 

 

In this notebook we will explore the most basic workflow for training <span style="color:orange">REACTOR™</span> and making predictions. This will enable us to become familiar with basic <span style="color:orange">REACTOR™</span> terminologies, including: 

 

- Action Features 

- Context Features 

- Cases 

- Trainee 

- Train 

- Analyze 

- React 

## Recipe Goals:

This notebook will provide a demonstration of a basic Diveplane <span style="color:orange">REACTOR™</span> workflow.

In [1]:
import pandas as pd
from pmlb import fetch_data

from diveplane.reactor import Trainee
from diveplane.utilities import infer_feature_attributes

# Step 1: Import Data

Our example dataset for this recipe is the well known `Adult` dataset. This dataset consists of 14 Context Features and 1 Action Feature. The Action Feature in this version of the `Adult` dataset has been renamed to `target` and it takes the form of a binary indicator for whether a person in the data makes more than $50,000/year (*target*=1) or less (*target*=0).

**`Definitions`:**

**`Action Features`:** A set of Features that is used as a schema to describe values for the output of a Trainee when it Reacts. In traditional ML these are often referred to as targets, target features, dependent features, or labels.

**`Context Features`:** A set of Features that is used as a schema to describe values for input, observation, or other computation. In traditional ML these are often referred to as inputs, predictors, or independent features.



In [2]:
df = fetch_data('adult', local_cache_dir="data/adult")

# Subsample the data to ensure the example runs quickly
df = df.sample(2000)

df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
48141,30.0,4,186932.0,15,10.0,2,12,0,4,1,0.0,0.0,75.0,39,1
2685,27.0,1,409815.0,15,10.0,4,1,4,2,0,0.0,0.0,40.0,39,1
8016,20.0,4,154779.0,15,10.0,4,12,2,3,0,0.0,0.0,40.0,39,1
25896,42.0,4,54611.0,15,10.0,0,5,1,4,1,0.0,0.0,40.0,39,1
48226,21.0,4,103031.0,15,10.0,2,10,0,4,1,0.0,0.0,20.0,39,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1280,52.0,4,186785.0,11,9.0,4,12,1,4,1,0.0,1876.0,50.0,39,1
8326,20.0,4,33644.0,11,9.0,4,1,3,4,0,0.0,0.0,30.0,39,1
39101,52.0,4,123989.0,9,13.0,2,4,0,4,1,0.0,0.0,40.0,39,0
5821,53.0,4,157069.0,7,12.0,2,7,0,4,1,0.0,0.0,40.0,39,0


# Step 2: Feature Mapping

Typically, an exploratory analysis is done on the data to get a general feel of the descriptive statistics and data attributes. 

Methods like `describe` from a Pandas dataframe often automatically present these types of information of interest to a user, as shown below. While informative, these descriptive statistics are often used as a sanity check pre- and post-modeling and a model typically doesn't actually use any of these feature attributes.

In [3]:
df.describe()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,38.5325,3.8615,185865.9,10.154,9.98,2.6745,6.5525,1.5025,3.646,0.6615,816.0695,87.503,40.4465,36.8835,0.7585
std,14.088619,1.462322,103418.5,3.997408,2.585332,1.495884,4.275206,1.635032,0.878106,0.473318,5563.120359,398.288413,12.830528,7.679676,0.4281
min,17.0,0.0,19410.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
25%,27.0,4.0,116130.8,9.0,9.0,2.0,3.0,0.0,4.0,0.0,0.0,0.0,40.0,39.0,1.0
50%,37.0,4.0,175925.0,11.0,10.0,2.0,7.0,1.0,4.0,1.0,0.0,0.0,40.0,39.0,1.0
75%,48.0,4.0,233220.8,12.0,12.0,4.0,10.0,3.0,4.0,1.0,0.0,0.0,45.0,39.0,1.0
max,90.0,7.0,1125613.0,15.0,16.0,6.0,14.0,5.0,4.0,1.0,99999.0,2444.0,99.0,41.0,1.0


In the Diveplane <span style="color:orange">REACTOR™</span> workflow, feature attributes are an essential part of model building and usage. By incorporating certain feature attributes into training process itself, <span style="color:orange">REACTOR™</span> gains another layer of information that will help in fine-tuning the results. 

In order to assist the user with defining the feature attributes, Diveplane has an `infer_feature_attributes` tool that automatically processes the dataset for the user.

In <span style="color:orange">REACTOR™</span>, these feature attributes are model infrastructure feature parameters that are based on the existing data, rather than exact descriptive statistics. This is why, for example, the min and max bounds on continuous features are not the exact min and max values of the dataset, but rather an expanded version of those min and max values to allow for some variation.

In [4]:
# Infer features attributes
features = infer_feature_attributes(df)
features

{'age': {'type': 'continuous',
  'decimal_places': 0,
  'original_type': {'data_type': 'numeric', 'size': 8},
  'bounds': {'min': 7.38905609893065, 'max': 148.4131591025766}},
 'workclass': {'type': 'nominal',
  'data_type': 'number',
  'decimal_places': 0,
  'original_type': {'data_type': 'integer', 'size': 8},
  'bounds': {'allow_null': False}},
 'fnlwgt': {'type': 'continuous',
  'decimal_places': 0,
  'original_type': {'data_type': 'numeric', 'size': 8},
  'bounds': {'min': 8103.083927575384, 'max': 1202604.2841647768}},
 'education': {'type': 'nominal',
  'data_type': 'number',
  'decimal_places': 0,
  'original_type': {'data_type': 'integer', 'size': 8},
  'bounds': {'allow_null': False}},
 'education-num': {'type': 'continuous',
  'decimal_places': 0,
  'original_type': {'data_type': 'numeric', 'size': 8},
  'bounds': {'min': 1.0, 'max': 20.085536923187668}},
 'marital-status': {'type': 'nominal',
  'data_type': 'number',
  'decimal_places': 0,
  'original_type': {'data_type': '

In [5]:
# Specify Context and Action Features
action_features = ['target']
context_features = features.get_names(without=action_features)


> **Note:** Train Test Split

To gauge model performance, train-test splits are often used in traditional machine learning workflows. Diveplane <span style="color:orange">REACTOR™</span> does not require the use of train-test split for validation. Please see recipe `6-validation.ipynb` for further explanation. Therefore, we will not use train-test splits our recipes unless the test set serves a specific purpose.

# Step 3: Create Trainee

To begin the <span style="color:orange">REACTOR™</span> workflow, a Trainee is created to act as a base for all of our ML needs. In all subsequent notebooks and jupyter cells, we will refer to Diveplane <span style="color:orange">REACTOR™</span> 's model as Trainee.

**`Definitions`:**

**`Trainee`:** A collection of Cases that comprise knowledge. May include metadata and parameters. In traditional ML this is referred to as a model.

**`Cases`:** A set of feature values representing a situation observed.  In traditional ML, a Case is sometimes referred to as an "observation", "record", or "data point". In database terms, a Case would be a row of values. For supervised learning a Case is a set of Context Features and Action Features and for unsupervised learning a Case is just a set of features. 

In [6]:
# Create the Trainee
t = Trainee(
    features=features,  
    overwrite_existing=True
)

# Step 4: Preprocessing and Training

One benefit of Diveplane <span style="color:orange">REACTOR™</span> is that most standard forms of data pre-processing such as one hot encoding and standardizing are `NOT` needed, which is in contrast to many traditional ML models. This does not include more sophisticated forms of pre-processing such as feature selection or feature engineering, which may still be useful. Fitting is also done in two steps in Diveplane <span style="color:orange">REACTOR™</span>.

**`Definitions`:**

**`Train`:** Exposing a Trainee to a Case which may cause the ML algorithm to update the Trainee. This is a single training step; training may happen at each decision, at a certain sampling rate of observations per second, or at certain events.

**`Analyze`:** Tune internal parameters to improve performance and accuracy of predictions and metrics. Analysis may be targeted or targetless.  Targetless analysis provides the best balanced set of hyperparameters if an Action Feature is not specified, along with a performance boost while targeted analysis provides a boost to accuracy. 


In [7]:
# Train
t.train(df)

# Analyze the Trainee
t.analyze(context_features=context_features, action_features=action_features)

# Step 5: Results

Once <span style="color:orange">REACTOR™</span> is trained and analyzed, it provides the user with a variety of ML capabilities. At this stage in the <span style="color:orange">REACTOR™</span> workflow, a typical use case would be to evaluate the accuracy of the Trainee, which is performed by the `react` method. This is equivalent to `predict` in many traditional Machine Learning workflows, although the `react` method is not solely used for supervised predictions as detailed in subsequent recipes. 

Since we are not using a train-test split, we will use the `react_into_trainee` method, which performs a `react` on each of the cases that is trained into the model. Alternatively, a standard `react` call may be used on unseen data for prediction.
The accuracy is calculated internally as shown in the code below and this is the recommended accuracy metric. Further explanations are available in recipe `6-validation`.

**`Definitions`:**

**`React`:** Exposing a Trainee to a new Case's Context Features and an Action Feature for that case is returned. In traditional ML this is often referred to predicting or labeling.

In [8]:
# Recommended metrics
t.react_into_trainee(action_feature=action_features[0], residuals=True)
stats = t.get_prediction_stats().target.round(2)
print(f'Diveplane Prediction Results - Accuracy: {stats["accuracy"]}, Precision: {stats["recall"]}, and Recall: {stats["precision"]}')

Diveplane Prediction Results - Accuracy: 0.86, Precision: 0.81, and Recall: 0.79


In [9]:
stats_matrix = t.get_prediction_stats(stats=["confusion_matrix"])
print("Diveplane Prediction Results - Confusion Matrix")
matrix = pd.DataFrame(stats_matrix.loc["confusion_matrix", "target"])
matrix.index.name = "Predicted"
matrix.columns.name = "Actual"
display(matrix)

Diveplane Prediction Results - Confusion Matrix


Actual,0,1
Predicted,Unnamed: 1_level_1,Unnamed: 2_level_1
0,158,60
1,80,702


Another way to use the `get_prediction_stats` method is to utilize the `condition` parameter. Specifying a condition allows the user to get prediction stats on the subset of cases that matches the specified condition.

For example, we can use the `condition` parameter to find the accuracy, precision, and recall on the subsets of the dataset with each value for Sex.

In [10]:
sex_1_stats = t.get_prediction_stats(condition={'sex': 1})['target'].round(2)
print(f'Diveplane Prediction Results On Cases with Sex=1 - Accuracy: {sex_1_stats["accuracy"]}, Precision: {sex_1_stats["recall"]}, and Recall: {sex_1_stats["precision"]}')

sex_0_stats = t.get_prediction_stats(condition={'sex': 0})['target'].round(2)
print(f'Diveplane Prediction Results On Cases with Sex=0 - Accuracy: {sex_0_stats["accuracy"]}, Precision: {sex_0_stats["recall"]}, and Recall: {sex_0_stats["precision"]}')

Diveplane Prediction Results On Cases with Sex=1 - Accuracy: 0.83, Precision: 0.79, and Recall: 0.79
Diveplane Prediction Results On Cases with Sex=0 - Accuracy: 0.93, Precision: 0.81, and Recall: 0.78


# Conclusion and Next Steps

<span style="color:orange">REACTOR™</span> provide users with powerful tools to perform a variety of data analysis and prediction tasks. However, making predictions may not suffice if these tools are used for decision-making. As an example with this specific `Adult` dataset, if the user were a bank predicting whether an individual will make over or under $50k in the future for loan decisions, the outcome of prediction efforts might greatly affect the livelihood of that individual as well as the bank itself, if the prediction was inaccurate or biased. Ideally, a predictive tool should provide both accurate results and interpretability so that it can be held accountable for its predictions. Unfortunately, traditional ML techniques often leave a lot to be desired in terms of interpretability. Because of its instance-based ML nature, <span style="color:orange">REACTOR™</span> is inherently interpretable.

Thus, the next step to fully utilizing <span style="color:orange">REACTOR™</span> is to take advantage of its interpretability capabilities, which will be introduced in recipe `2-interpretability.ipynb`.

