# Recipe 4: Auditing and Editing
## Overview 

Diveplane  <span style="color:orange">REACTOR™</span>'s instance-based machine learning approach enables unique capabilities in addition to the interpretability we explored in the previous recipes. There, we detected possible anomalies and investigated Influential Cases and features that may be of concern. Because Diveplane  <span style="color:orange">REACTOR™</span> gives us dynamic control over our Trainee, we can act on that information without incurring the expense of retraining the Trainee. This is in contrast to most machine learning models which, once trained, are difficult to update without retraining the entire model.

In this notebook we demonstrate the editability of a Diveplane  <span style="color:orange">REACTOR™</span> Trainee by acting on the Trainee and data diagnostics shown in `2-interpretability.ipynb`.

This can be done on a small scale, where a single Case is edited or removed to modify the behavior of the Trainee, or on a large scale: a  <span style="color:orange">REACTOR™</span> Session allows us to add or remove entire batches of training data. This is very useful if we are continuously adding data to our Trainee and discover that certain batches are undesirable.

### REACTOR™ Session

A Trainee Session is associated with each modification to a Trainee, which is useful for auditability. A Session consists of the following information:  

- Unique identifier  

- The user for which the Session was created  

- Date the Session was created 

- Name, given by the user (Optional) 

- Metadata for the user to store information (Optional) 

When working with Trainees, a default Session will be automatically started for you unless you explicitly start (or create) your own. This Session will be used for all interactions with the Trainee, unless a new Session is explicitly started, for as long as your Client is running. Additionally, each instance of a Diveplane Client will use its own unique active Session. Starting a new Session explicitly is useful if you want to give it a name and/or metadata for your own reference later, or if you wish to use separate Sessions for different modifications of the Trainee. For example, using a unique Session each time you train would allow you to later reference the specific cases that were trained by a certain Session. 
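As a rough mental model, the Session information listed above can be pictured as a small record. The sketch below is plain Python and is not part of the Reactor API: the `SessionRecord` class and its field names are hypothetical and only mirror the bullet points.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional
from uuid import uuid4


@dataclass
class SessionRecord:
    """Hypothetical stand-in for the information a Session carries."""
    user: str
    name: Optional[str] = None                               # optional, user-supplied
    metadata: Dict[str, Any] = field(default_factory=dict)   # optional
    id: str = field(default_factory=lambda: str(uuid4()))
    created_date: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


# One record per batch of modifications makes each batch easy to find later.
first = SessionRecord(user="analyst", name="train_session_1",
                      metadata={"data": "original data"})
second = SessionRecord(user="analyst", name="train_session_2")

# Each record receives its own identifier.
print(first.id != second.id)
```

Giving every training run its own named record is what later makes it possible to refer to "everything trained in that Session" as a unit.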



## Recipe Goals:

This notebook will show how to edit Cases in a Diveplane  <span style="color:orange">REACTOR™</span> Trainee, either individually or in batches through the use of Sessions. This allows the user to act on Cases flagged by the interpretability and auditing tools shown in previous recipes.

In [1]:
import pandas as pd
from pmlb import fetch_data

from diveplane import reactor
from diveplane.utilities import infer_feature_attributes

# Section 1: Train and Analyze a Trainee

### 1. Load Data

Our example dataset for this recipe continues to be the well-known `Adult` dataset. This dataset consists of 14 Context Features and 1 Action Feature. The Action Feature in this version of the `Adult` dataset has been renamed to `target`, and it is a binary indicator for whether a person in the data makes over $50,000/year (*target*=1) or not (*target*=0).

In [2]:
df = fetch_data('adult', local_cache_dir="data/adult")

# subsample the data to ensure the example runs quickly
df = df.sample(1001, random_state=0)

df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
38113,41.0,4,151856.0,11,9.0,2,11,0,4,1,0.0,0.0,40.0,39,1
39214,57.0,6,87584.0,10,16.0,0,10,1,4,0,0.0,0.0,25.0,39,1
44248,31.0,2,220669.0,9,13.0,4,10,3,4,0,6849.0,0.0,40.0,39,1
10283,55.0,4,171355.0,8,11.0,2,7,0,4,1,0.0,0.0,20.0,39,1
26724,59.0,6,148626.0,0,6.0,2,5,0,4,1,0.0,0.0,40.0,39,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4721,60.0,0,204486.0,9,13.0,2,0,0,4,1,0.0,0.0,8.0,39,0
40113,48.0,2,93449.0,14,15.0,2,10,0,1,1,99999.0,0.0,40.0,30,0
17827,25.0,4,114838.0,14,15.0,4,10,1,4,0,0.0,0.0,8.0,22,1
35120,22.0,4,202125.0,11,9.0,2,12,0,4,1,0.0,0.0,50.0,39,1


### 2. Train Trainee with Sessions

In this section we will perform all of the steps needed to train Diveplane  <span style="color:orange">REACTOR™</span>'s Trainee.


In [3]:
# Infer feature attributes
features = infer_feature_attributes(df)

# Specify Context and Action Features
action_features = ['target']
context_features = features.get_names(without=action_features)

# We extract one row for demonstrative purposes later
test_case = df.iloc[0]  
df = df.iloc[1:]  

# Split the data into Context Features (X) and Action Feature (y)
dfX = df[context_features]
dfy = df[action_features]

#### 2a. Experiment

To demonstrate how to edit Cases, we will break the training into two sessions. 

1. The first session is for the original dataset 
2. The second session is a modified dataset containing a target feature that is flipped from the true value
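The effect of this two-session setup can be previewed on toy data. The sketch below is plain Python with made-up labels (not the `Adult` data); it uses the same `int(not x)` flip as the notebook and shows that only half of the trained labels still agree with the truth.

```python
# Toy binary labels standing in for the `target` column.
labels = [0, 1, 1, 0, 1, 0, 0, 1]

half = len(labels) // 2
session_1_labels = labels[:half]                         # kept as-is
session_2_labels = [int(not y) for y in labels[half:]]   # every label flipped

trained_labels = session_1_labels + session_2_labels

# Fraction of trained labels that still agree with the truth.
agreement = sum(t == y for t, y in zip(trained_labels, labels)) / len(labels)
print(agreement)
```

With half of the labels inverted, a model that learns them faithfully should score close to chance on a binary target, which is the behavior we expect from the Trainee below.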

In [4]:
# Split the row index into two halves, one per session
half = dfX.shape[0] // 2
ind_session_1 = dfX.index[:half]
ind_session_2 = dfX.index[half:]

X_train_1 = dfX.loc[ind_session_1]
y_train_1 = dfy.loc[ind_session_1]

X_train_2 = dfX.loc[ind_session_2]

# Simulate a "random inferer" that is always wrong by flipping every target value
y_train = dfy['target']
y_train_2 = pd.Series([int(not x) for x in y_train.loc[ind_session_2]], name=action_features[0], index=ind_session_2)

In [5]:
# Create the Trainee
t = reactor.Trainee(
    features=features, 
    overwrite_existing=True
)

session = reactor.Session('train_session_1', metadata={'data': 'original data'})
t.train(X_train_1.join(y_train_1))

session = reactor.Session('train_session_2', metadata={'data': 'modified data ("random inferer"'})
t.train(X_train_2.join(y_train_2))

# Analyze the Trainee
t.analyze()

### 3. React

In [6]:
t.react_into_trainee(residuals=True)

### 4. Results

In [7]:
accuracy = t.get_prediction_stats(stats=['accuracy'])['target'][0]

print("Test set prediction accuracy: {acc}".format(acc=accuracy))

Test set prediction accuracy: 0.517



Half of the training data had its target values flipped, which greatly reduces the accuracy compared to the accuracy shown in recipe `1-reactor-intro.ipynb`.

While we would rarely know this ahead of time in a real-world setting, we will use these stark results to clearly demonstrate the effect of Trainee editing.


# Section 2. Audit & Edit Trainee

There are many reasons to audit and edit a Trainee. In previous recipes, we highlighted several training Cases that may be anomalous and therefore candidates for removal. In this recipe, we have an entire chunk of training data that is incorrect. Diveplane  <span style="color:orange">REACTOR™</span> has the ability to edit data at different scales.

What sets Diveplane  <span style="color:orange">REACTOR™</span> apart from other machine learning models is that there is no need for retraining. For example, if we use `Scikit-Learn`'s logistic regression as shown in Recipe `1-reactor-intro.ipynb` and discover that our training data contains Cases we would like to remove, we would have to go back to the beginning of the workflow, remove the Cases from the training data, and completely retrain the model.

In  <span style="color:orange">REACTOR™</span>, this is unnecessary unless a very large portion of the training data is altered. If that occurs, re-analyzing the Trainee may improve results, although it is not strictly necessary.
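To see why an instance-based model can be edited without a refit step, consider a toy 1-nearest-neighbour model. This is only an illustration under simplifying assumptions: it is not Reactor's algorithm, and `ToyInstanceModel` and its methods are hypothetical names.

```python
from typing import Dict, List, Tuple

Case = Tuple[float, int]  # (single context feature, target)


class ToyInstanceModel:
    """Toy 1-nearest-neighbour model: the stored cases *are* the model."""

    def __init__(self) -> None:
        self.cases: Dict[str, List[Case]] = {}  # batch label -> cases

    def train(self, batch: str, cases: List[Case]) -> None:
        self.cases.setdefault(batch, []).extend(cases)

    def remove_batch(self, batch: str) -> None:
        self.cases.pop(batch, None)

    def predict(self, x: float) -> int:
        stored = [c for batch in self.cases.values() for c in batch]
        nearest = min(stored, key=lambda c: abs(c[0] - x))
        return nearest[1]


model = ToyInstanceModel()
model.train("good", [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)])
model.train("bad", [(1.5, 1), (8.5, 0)])   # labels deliberately wrong

before = model.predict(1.4)   # nearest stored case is the mislabeled (1.5, 1)
model.remove_batch("bad")     # no refit step: the cases are simply gone
after = model.predict(1.4)    # nearest stored case is now (1.0, 0)

print(before, after)
```

Because predictions are computed from the stored cases at query time, removing or changing a case takes effect on the very next prediction. A parametric model such as logistic regression has no equivalent shortcut, since its coefficients were fitted to the full training set.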

## A. Editing Individual Cases

In this first part of Section 2, we will demonstrate how to edit the Trainee on a Case-by-Case basis. Editing a Case allows the user to modify or "fix" the behavior of the Trainee through targeted edits to a few select Cases. The user has complete control over all data in the Trainee, making it dynamic and quickly adjustable. Minor mistakes in the training data can be corrected this way after training.

In our use case, the anomalous Cases identified for the `Adult` dataset in Recipe `3-anomaly_detection.ipynb` are possible candidates for editing. We noticed certain Cases with unusual values for `capital-gain`, such as 99999, that look like placeholder values standing in for something else, such as blanks. Editing Cases allows us to easily correct these minor issues in an otherwise valid Case after training.

If we believe that the Case is entirely invalid and warrants removal, Diveplane  <span style="color:orange">REACTOR™</span> can also remove the Case entirely.

### A1. React
To demonstrate the effect, we `react` to a single Case to compare the predictions before and after an edit.

In [8]:
test_case_X = test_case[context_features]
test_case_y = test_case[action_features]

details = {
    'influential_cases':True,
}

new_result = t.react(
    [test_case_X.values.tolist()], 
    context_features=context_features, 
    action_features=action_features,
    details=details
)

In [9]:
print('prediction: {}'.format(new_result['action']['target'][0]))
print('actual: {}'.format(test_case_y.iloc[0]))

prediction: 0
actual: 1.0


#### Results 

We can see that the predicted value is incorrect. If we want to artificially correct this prediction using our Trainee, we can edit the Cases around it. This is for demonstrative purposes only; we do not recommend editing Influential Cases without fully investigating them.

### A2. Influential Cases

To determine which Cases we want to edit, we identify the Influential Cases using the techniques from `2-interpretability.ipynb`.

In [10]:
influence_df = pd.DataFrame(new_result['explanation']['influential_cases'][0])
influence_df.head(3)

Unnamed: 0,race,relationship,hours-per-week,native-country,capital-loss,fnlwgt,workclass,target,occupation,age,.session_training_index,education-num,marital-status,capital-gain,education,.session,sex,.influence_weight
0,4,0,40,39,0,288679,4,0,3,41,145,9,2,0,11,c99a6b9f-9c48-4d8a-9953-8e1d2c755650,1,0.798903
1,4,0,40,39,0,171351,4,0,6,42,410,9,2,0,11,c99a6b9f-9c48-4d8a-9953-8e1d2c755650,1,0.060183
2,4,0,40,39,0,154087,4,1,14,32,405,9,2,0,11,2f0ed4c4-4185-4c81-811a-261bf3dbf0b8,1,0.054215


We can see that two out of the three Influential Cases have a target value of 0, but we want the Case prediction to be 1.

### A3. Case Modification

We will modify those two Influential Cases, which have a different target value than the one we want to predict, by changing their target value from 0 to 1. Having more Influential Cases with a target value of 1 will increase the chance of that Case being predicted to have a target value of 1.
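The intuition can be checked with a simplified weighted vote over the three rows shown above. This is an assumption made for illustration: Reactor may consider more Cases and combine them differently, and `weighted_vote` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_vote(cases: List[Tuple[int, float]]) -> int:
    """Return the target value with the largest total influence weight."""
    totals: Dict[int, float] = defaultdict(float)
    for target, weight in cases:
        totals[target] += weight
    return max(totals, key=totals.get)


# (target, .influence_weight) for the three rows shown above.
influential = [(0, 0.798903), (0, 0.060183), (1, 0.054215)]
print(weighted_vote(influential))

# Flip the target of the two heaviest cases, as the edits below do.
edited = [(1, w) for _, w in influential[:2]] + influential[2:]
print(weighted_vote(edited))
```

Under this simplification, the first Case alone carries most of the influence weight, so its target value largely decides the outcome.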

In [11]:
# Modify case 1
session_id = influence_df.iloc[0]['.session']
session_training_index = influence_df.iloc[0]['.session_training_index']

# Retrieve the case's current target value
cases = t.get_cases(session=session_id, features=['.session_training_index', 'target'])
orig_target = cases.set_index('.session_training_index').loc[session_training_index, 'target']

# Flip the binary target (0 -> 1, 1 -> 0)
flipped = 1 if int(orig_target) == 0 else 0

t.edit_cases(feature_values=[flipped], 
             case_indices=[(session_id, session_training_index.item())], 
             features=['target'])

print("Modifying training index {ind} of Session {session_id} target value to {tar}".format(ind=session_training_index, session_id=session_id, tar=flipped))

Modifying training index 145 of Session c99a6b9f-9c48-4d8a-9953-8e1d2c755650 target value to 1


In [12]:
# Modify case 2 (take both the session and the index from the second row)
session_id = influence_df.iloc[1]['.session']
session_training_index = influence_df.iloc[1]['.session_training_index']

# Retrieve the case's current target value
cases = t.get_cases(session=session_id, features=['.session_training_index', 'target'])
orig_target = cases.set_index('.session_training_index').loc[session_training_index, 'target']

# Flip the binary target (0 -> 1, 1 -> 0)
flipped = 1 if int(orig_target) == 0 else 0

t.edit_cases(feature_values=[flipped], 
             case_indices=[(session_id, session_training_index.item())], 
             features=['target'])

print("Modifying training index {ind} of Session {session_id} target value to {tar}".format(ind=session_training_index, session_id=session_id, tar=flipped))

Modifying training index 410 of Session c99a6b9f-9c48-4d8a-9953-8e1d2c755650 target value to 1


### A4. Verification

We can audit one of the updated Cases to make sure the Case has been edited and to demonstrate how to retrieve the Case history. The Case edit history provides another layer of auditability and accountability to the Trainee.
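To make the idea of an edit history concrete, here is a minimal sketch of an audit trail kept alongside a case. The entry layout mirrors the `.case_edit_history` output of the following cell, but the `edit_with_history` helper and the in-memory dictionaries are hypothetical and say nothing about how Reactor stores this information.

```python
from typing import Any, Dict, List

EditHistory = Dict[str, List[Dict[str, Any]]]  # session id -> list of edits


def edit_with_history(case: Dict[str, Any], history: EditHistory,
                      session_id: str, feature: str, value: Any) -> None:
    """Change one feature of a case and append an audit entry."""
    history.setdefault(session_id, []).append({
        "type": "edit",
        "feature": feature,
        "previous_value": case[feature],
        "value": value,
    })
    case[feature] = value


case = {"age": 42, "target": 0}          # toy case, not real data
history: EditHistory = {}
edit_with_history(case, history, "session-a", "target", 1)

print(case["target"])
print(history)
```

Recording the previous value with each edit means a change can be reviewed, and if necessary reversed, long after it was made.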

In [13]:
updated_case = t.get_cases(
    case_indices=[(session_id, session_training_index.item())],
    features=df.columns.tolist() + ['.case_edit_history']
)

# Audit the edit history
updated_case.loc[0, '.case_edit_history']

{'c99a6b9f-9c48-4d8a-9953-8e1d2c755650': [{'value': 1,
   'type': 'edit',
   'previous_value': 0,
   'feature': 'target'}]}

### A5. Predict

We will re-run the prediction and examine the Influential Cases to demonstrate the flipped value.

In [14]:
new_result = t.react(
    [test_case_X.values.tolist()], 
    context_features=context_features, 
    action_features=action_features, 
    details=details
)

### A6. Results

We can see that by editing those two Cases, we flipped the prediction for our original test Case without re-training or re-analyzing our Trainee. If done correctly, this provides a user with a surgical tool for Trainee corrections.

In [15]:
print('prediction: {}'.format(new_result['action']['target'][0]))
print('actual: {}'.format(test_case_y.iloc[0]))

prediction: 1
actual: 1.0


### A7. Case Deletion

In addition to editing a Case, Reactor can also delete a complete Case and remove it from the decision making process. This workflow is the same as the edit example in the section above, except we use `remove_cases` instead of `edit_cases`. 
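Conceptually, the removal in the next cell selects Cases by a condition and caps how many are removed. The sketch below imitates that with a list of dictionaries, assuming exact-match semantics for every key in the condition. `remove_matching` is a hypothetical helper, and how Reactor itself evaluates conditions is not covered here.

```python
from typing import Any, Dict, List


def remove_matching(cases: List[Dict[str, Any]],
                    condition: Dict[str, Any],
                    num_cases: int) -> int:
    """Remove up to `num_cases` cases whose fields equal every condition value.

    Mutates `cases` in place and returns how many were removed.
    """
    removed = 0
    kept = []
    for case in cases:
        matches = all(case.get(k) == v for k, v in condition.items())
        if matches and removed < num_cases:
            removed += 1
        else:
            kept.append(case)
    cases[:] = kept
    return removed


cases = [
    {".session": "s1", ".session_training_index": 0, "target": 0},
    {".session": "s2", ".session_training_index": 0, "target": 1},
    {".session": "s2", ".session_training_index": 1, "target": 1},
]
count = remove_matching(
    cases, {".session": "s2", ".session_training_index": 1}, num_cases=1
)
print(count, len(cases))
```

Combining the Session id with the training index pins the condition to exactly one Case, which is why the returned count is a useful sanity check.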

In [16]:
# remove cases using ".session_training_index"
t.remove_cases(num_cases=1, condition={'.session': session_id, '.session_training_index': session_training_index.item()})

1

The dynamic editing and deleting of individual Cases allows the user to perform targeted modification of the Trainee and provides the user with unparalleled control over their data and Trainee. This should not be done lightly, and we recommend that all Cases be investigated before performing this action.

## B. Editing Sessions

At the beginning of this notebook, we trained the data in two Sessions. The first Session used a normal sample of training data, while the second Session artificially flipped the target variable. This reduced the performance of our Trainee by introducing a large portion of incorrect data.

Diveplane  <span style="color:orange">REACTOR™</span> has the capability to add or remove entire Sessions. In this situation, if we discovered that one of our Sessions had very poor quality data, like our example, we can easily remove that entire Session's data without having to individually alter Cases.

### B1. Session Info

Let's first see how many Sessions are in this Trainee along with some details of each Session.

In [17]:
sessions = t.get_sessions()
sessions

[{'id': '2f0ed4c4-4185-4c81-811a-261bf3dbf0b8', 'name': 'train_session_1'},
 {'id': 'c99a6b9f-9c48-4d8a-9953-8e1d2c755650', 'name': 'train_session_2'}]

In [18]:
display(reactor.get_session(sessions[0]['id']))
display(reactor.get_session(sessions[1]['id']))

{'created_date': datetime.datetime(2023, 7, 28, 19, 19, 36, 453806),
 'id': '2f0ed4c4-4185-4c81-811a-261bf3dbf0b8',
 'metadata': {'data': 'original data',
              'trainee_id': 'cdb51591-ffd2-49f1-b2be-58cb66963715'},
 'modified_date': datetime.datetime(2023, 7, 28, 19, 19, 36, 453806),
 'name': 'train_session_1'}

{'created_date': datetime.datetime(2023, 7, 28, 19, 19, 36, 479483),
 'id': 'c99a6b9f-9c48-4d8a-9953-8e1d2c755650',
 'metadata': {'data': 'modified data ("random inferer"'},
 'modified_date': datetime.datetime(2023, 7, 28, 19, 19, 36, 479483),
 'name': 'train_session_2'}

We can see the two different Sessions which we trained earlier.

### B2. Session Deletion

Deleting an entire Session is performed in one easy step once we retrieve the Session ID of the Session we want to delete.
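The cell below picks the Session by its position in the list. When Sessions were given names, looking the id up by name is less fragile. The sketch below works on the same list-of-dictionaries shape as the `t.get_sessions()` output shown above. `session_id_by_name` is a hypothetical helper, not a Reactor function.

```python
from typing import Dict, List


def session_id_by_name(sessions: List[Dict[str, str]], name: str) -> str:
    """Return the id of the single session called `name`, or raise."""
    matches = [s["id"] for s in sessions if s.get("name") == name]
    if len(matches) != 1:
        raise LookupError(
            f"expected one session named {name!r}, found {len(matches)}"
        )
    return matches[0]


# Same shape as the `t.get_sessions()` output shown above.
sessions = [
    {"id": "2f0ed4c4-4185-4c81-811a-261bf3dbf0b8", "name": "train_session_1"},
    {"id": "c99a6b9f-9c48-4d8a-9953-8e1d2c755650", "name": "train_session_2"},
]
print(session_id_by_name(sessions, "train_session_2"))
```

Raising when the name is missing or ambiguous avoids deleting the wrong batch of training data by accident.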

In [19]:
# Delete a session

session_id = sessions[1]['id']
t.delete_session(session_id)

# Re-analyze the Trainee
t.analyze()

### B3. React

We then run `react_into_trainee` again to recompute the prediction statistics now that the faulty training data from our Session has been removed.

In [20]:
t.react_into_trainee(residuals=True)

### B4. Results

In [21]:
accuracy_new = t.get_prediction_stats(stats=['accuracy'])['target'][0]

print("Original accuracy: {acc}".format(acc=accuracy))
print("New accuracy: {acc}".format(acc=accuracy_new))

Original accuracy: 0.517
New accuracy: 0.838


In [22]:
# Check to make sure there is only 1 session
t.get_sessions()

[{'id': '2f0ed4c4-4185-4c81-811a-261bf3dbf0b8', 'name': 'train_session_1'}]

### B5. Results

We can clearly see the difference in accuracy results once the faulty Session data is removed. The new accuracy is much more in line with the results we saw from recipe `1-reactor-intro.ipynb`. 

# Conclusions and Next Steps

We can see that by getting rid of the Session with the faulty data, our Trainee performance improved dramatically, as expected. This capability provides the user with a very efficient way to maintain control over a continuously evolving Trainee if the user is constantly adding training data.

The tools shown in this and previous recipes allow the user to find, diagnose, and act at a level of ease and precision that other machine learning models cannot match. This opens the door to possibilities for the user and provides a flexible platform that can adjust to any type of machine learning needs.

The next recipe, `5-bias_mitigation.ipynb`, will demonstrate an interesting and specific use case that shows how taking advantage of Diveplane's interpretability can provide an opportunity to perform ethical and responsible machine learning.