Summary of current progress:

In the previous article we preprocessed our absenteeism dataset by performing the following steps:

  • Dropped the ID column (it contained no useful information for our upcoming analysis)
  • Performed some exploratory analysis on the Reason for absence column, which contained integers describing the reasons for absenteeism. We dummy-encoded it, grouped the dummy variables into 4 classes and replaced the original column with the four resulting reason columns:
    • Various diseases
    • Pregnancy-related reasons
    • Poisoning
    • Light reasons
  • Split the date column into month and weekday columns
  • Grouped the education column into two classes representing High School and Tertiary-level graduates respectively
  • Finally, saved our preprocessed dataset as absenteeism_data_preprocessed.csv in our working directory

So, What’s next?

Now, we would like to use our preprocessed data to build a logistic regression classifier, or Logit model, to help us predict whether or not a given employee will exhibit excessive absenteeism, based on information encoded in the predictors we preprocessed.

Let us begin by importing the usual libraries, loading and taking a preliminary look at our preprocessed data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set('notebook')
from IPython.display import display, Image, SVG, Math

%matplotlib inline

dataset ='absenteeism_data_preprocessed.csv'
raw = pd.read_csv(dataset)
raw.sample(5)
Reason_1 Reason_2 Reason_3 Reason_4 Month Value Day of the Week Transportation Expense Distance to Work Age Daily Work Load Average Body Mass Index Education Children Pets Absenteeism Time in Hours
529 1 0 0 0 10 6 248 25 47 284.853 32 0 2 1 8
418 0 0 0 1 4 2 179 51 38 239.409 31 0 0 0 8
130 0 0 1 0 1 1 289 36 33 308.593 30 0 2 1 8
422 0 0 0 1 4 4 260 50 36 239.409 23 0 4 0 4
484 0 0 0 1 8 1 289 36 33 249.797 30 0 2 1 8

The Absenteeism Time in Hours column

This is our target column: the variable whose values we want to predict (whether or not excessive absenteeism occurs), based on our predictors.

print('\033[1m' + 'Basic Stats' + '\033[0m')
abstimestats = raw['Absenteeism Time in Hours'].describe()
abstimestats

Basic Stats 

count    700.000000
mean       6.761429
std       12.670082
min        0.000000
25%        2.000000
50%        3.000000
75%        8.000000
max      120.000000
Name: Absenteeism Time in Hours, dtype: float64

Creating our targets

Let’s create our targets for the logistic regression. We would like a binary classification telling us whether or not an employee is excessively absent. Thus, we can transform this column into a binary variable: 1 if excessively absent and 0 if not.

One way to achieve this is by mapping all values above a certain threshold amount of Absenteeism hours to 1 and the rest to 0.

The median can be used as our threshold, as it automatically balances our data into 2 roughly equal classes.

Perfectly balanced

med = abstimestats['50%'] # median = 50 percentile!
print('The median is %d'%med)
targets = np.where(raw['Absenteeism Time in Hours']>med, 1, 0)
print('%5.2f%% of the targets are excessively absent. \nA 60/40 split\
 still counts as balanced!' %(100*targets.sum()/targets.shape[0]))
The median is 3
45.57% of the targets are excessively absent.
A 60/40 split still counts as balanced!
Targs = pd.DataFrame(data=targets, columns=['Excessive Absenteeism'])
Targs.sample(6)
Excessive Absenteeism
579 1
83 1
311 0
406 0
31 1
213 0
plt.rcParams['figure.figsize'] = (14.0, 8.0)
plt.xlabel('Absenteeism in Hours')
sns.countplot(x=raw['Absenteeism Time in Hours'],
              hue=Targs['Excessive Absenteeism'].map({0:'Moderate', 1:'Excessive'}))
plt.show()

Excessive absenteeism countplot


Thus, using logistic regression, we will classify employees into 2 categories:

class 1: excessively absent (\(> median\)) and class 0: moderately to non-absent (\(\le median\)).

i.e. We’ve decided to classify an instance where more than 3 hours are taken off work as excessive absenteeism.


plt.pie(Targs['Excessive Absenteeism'].value_counts(),  explode=(0,0.1),
        labels=['Moderate to none', 'Excessive'], startangle=80, autopct='%1.2f%%', shadow=True)

plt.title('Excessive Absenteeism in the workplace', fontweight='bold')
plt.show()

Excessive absenteeism classes pie chart

Selecting the inputs for regression:

raw_inputs = raw.drop(['Absenteeism Time in Hours'], axis=1)

We could alternatively use iloc to slice the DataFrame, like so:

raw_inputs = raw.iloc[:, :-1]

where the first argument selects all rows and after the comma we are selecting all columns but the last one.

Train/ Test Split

We divide our dataset into two parts:

  • The Training set, on which we will train our model. This chunk is typically 70-80% of the entire dataset.

  • The Validation/Test set, also known as the holdout set. This is the remainder of the data, the ‘unseen’ data on which we will test our model.

It is always a good idea to perform this step early on, specifically before any scaling, to prevent any leakage of information between the train and test sets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(raw_inputs, Targs, test_size=0.2, random_state=0)

Standardizing the data:

Why do we need to standardize our data?

Standardizing allows us to use one distribution (the Normal distribution) when comparing data with different units, ranges or other attributes (i.e. multivariate data). It also ensures that the data, and the results inferred from them, are comparable with other datasets. Standardization assumes that the (random) variable \(X\) is normally distributed with mean \(\mu\) and variance \(\sigma^{2}\): \(X \sim N(\mu, \sigma^{2})\).

The result of standardizing X is a zero-mean (\(\mu = 0\)) data-set with a standard deviation of 1 (\(\sigma =1\)). towardsdatascience

Each observation’s standardized score \(z_{i} = \frac{x_{i} - \mu}{\sigma}\) (a.k.a z-score) tells us, based on the sign of \(z (\pm)\), if the datapoint is below or above the mean, while the magnitude of \(z\) tells us by how much.

365datascience

The above two points basically allow us to easily say how far from the norm a single observation is, i.e: whether it’s 1 \(\sigma\) (read 1 sigma, when z=1) or 3 \(\sigma\) (z=3), etc. above or below the mean, ultimately allowing us to use a single model to evaluate the likelihood of any observation.

Neglecting to standardize our data can have the effect of skewing the results of a ML algorithm towards, say, a feature with a large range but very little importance to the model in reality!
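
To make the z-score formula above concrete, here is a minimal sketch (on a small, made-up array of ages) comparing a manual calculation with scikit-learn's StandardScaler; both should give a zero-mean, unit-variance result:

import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([[27.], [33.], [38.], [47.], [58.]])  # made-up example values

# Manual z-scores: subtract the mean, divide by the standard deviation
z_manual = (ages - ages.mean()) / ages.std()

# StandardScaler does the same thing, column by column
z_sklearn = StandardScaler().fit_transform(ages)

print(np.allclose(z_manual, z_sklearn))            # True
print(z_sklearn.mean().round(10), z_sklearn.std()) # ~0.0 and 1.0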


To Standardize or not to standardize?

Shakespeare

In the context of ML Engineering, standardization helps with model accuracy. Statisticians, BI analysts and other business folk, on the other hand, prioritise model interpretability, since they are more concerned with the driving forces behind the phenomena covered by our models, and thus will often opt for no standardization.

The decision to standardize or not ultimately rests with the data scientist/analyst, informed by what they need from the data. Some machine learning algorithms’ solvers also penalise unscaled data, so in the interest of accuracy one may opt to either standardize or to create analyses covering both scenarios, which would be simplified by the use of a pipeline.
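
For the record, such a pipeline is only a few lines with scikit-learn. This is a minimal sketch, assuming we simply scale every column (our actual code below is slightly fancier and leaves the binary dummy columns unscaled):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Chaining the scaler and the classifier ensures the scaler is only ever
# fit on training data and then re-applied, unchanged, to unseen data.
pipe = Pipeline([('scale', StandardScaler()),
                 ('logreg', LogisticRegression(solver='newton-cg'))])

# Usage would look like: pipe.fit(X_train, y_train); pipe.score(X_test, y_test)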


This topic could be an entire post on its own, but this is the gist of it.

TLDR: Standardization expresses each observation as its distance from the feature’s mean, in units of the standard deviation, which makes different features and datasets directly comparable.

TLDR for the TLDR: 🍏🍊 \(\underrightarrow{\mbox{standardization}}\) 🍎🍏.

See this Quora question and its answers, this Humans of data article or One of a ton of Medium articles on the topic.


Now, since we are building a binary classifier, some columns in our dataset are already binary (0, 1) and will not need scaling:

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

#exclude = [i for i in X_train.columns if np.all(X_train[i].unique() in np.array([0,1])) or np.all(X_train[i].unique() in np.array([1,0]))]
exclude = [i for i in X_train.columns if X_train[i].nunique()==2 ]

print('Standardize all inputs except:\n', exclude, '\n\n')

#The columns we WILL be scaling are:

to_scale = [col for col in X_train.columns if col not in exclude]
#print(to_scale, '\n')

coltrans = ColumnTransformer(
      remainder = 'passthrough', #ignore the unspecified columns, without dropping them
      transformers=[('scale', StandardScaler(), to_scale)]
      )  

X_train = X_train.astype(float) # StandardScaler expects all data to have dtype=float64  
# ColumnTransformer still moves the unscaled columns to the end. We must recover the original column order
X_train_scaled = pd.DataFrame(data=coltrans.fit_transform(X_train), columns=to_scale+exclude)[X_train.columns]

print('\033[1m'+'\nOur Predictors, after scaling:\n'+'\033[0m' )
X_train_scaled.head()
Standardize all inputs except:
 ['Reason_1', 'Reason_2', 'Reason_3', 'Reason_4', 'Education']


*Our Predictors, after scaling:*
Reason_1 Reason_2 Reason_3 Reason_4 Month Value Day of the Week Transportation Expense Distance to Work Age Daily Work Load Average Body Mass Index Education Children Pets
0 0.0 0.0 0.0 1.0 0.772843 2.682111 -1.563951 -1.372249 0.100137 -0.773124 0.280561 0.0 -0.912335 -0.585103
1 0.0 0.0 0.0 0.0 0.772843 0.647824 0.178263 -0.698398 1.045247 0.556297 2.587672 0.0 0.003270 -0.585103
2 0.0 0.0 0.0 1.0 1.056269 -0.708368 -0.655617 1.390539 0.257655 -0.470922 0.972694 0.0 -0.912335 -0.585103
3 0.0 0.0 0.0 1.0 -1.211142 1.325919 -0.655617 1.390539 0.257655 -0.512437 0.972694 0.0 -0.912335 -0.585103
4 0.0 1.0 0.0 0.0 -0.644289 -0.708368 0.178263 -0.967938 -0.687455 -0.651830 -0.411572 1.0 -0.912335 -0.585103

The Logistic Regression

We can finally perform our logistic regression!

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

Let’s define and train our model

logreg = LogisticRegression(solver='newton-cg')
logreg.fit(X_train_scaled, y_train.to_numpy().ravel())
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='newton-cg',
          tol=0.0001, verbose=0, warm_start=False)

Predictions and interpretability

Predictions:

Let’s see what our model’s predictions are, by running model.predict on our holdout set predictors.

NB: Let’s not forget to standardize our test set first!

NB: We fit our scaler on the training set only, so that it learns the training set’s \(\mu\) and \(\sigma\): \(X\_train\_scaled = \frac{X_{i}- \mu_{train}}{\sigma_{train}}\). The .fit_transform method immediately transforms (scales) our training set after fitting: X_train_scaled = scaler.fit_transform(X_train). Learning MUST only take place within the training set.

Once the scaler has learned the values of \(\mu\) and \(\sigma\) from the training set, we use the fitted scaler to transform the holdout set using: \(X\_test\_scaled = \frac{X_{i}- \mu_{train}}{\sigma_{train}}\)

or X_test_scaled = scaler.transform(X_test)

for each column.

See, for example, discussions from sebastianraschka.com, Data Science Stack Exchange and Matthew Drury’s answer on Statistics Stack Exchange for more info on the subject

y_pred = logreg.predict(X_test_scaled) gives us our model’s predictions

X_test_scaled = pd.DataFrame(data=coltrans.transform(X_test.astype(float)), columns=to_scale+exclude)[X_test.columns]
y_pred = pd.DataFrame(data=(logreg.predict(X_test_scaled)), columns=y_test.columns.values)
y_pred.sample(5)

Excessive Absenteeism
138 0
84 1
73 0
16 1
48 1

Model Accuracy

The model accuracy is calculated by comparing the predictions in y_pred with the holdout target column.

Let’s see how our model performs on the training and testing predictors respectively. A good model doesn’t do significantly worse on the test set; if it does, we’ve overfit.

The reverse (a test score well above the training score) doesn’t make much sense either, although a marginal ‘improvement’ in test accuracy is possible due to randomness.

print('\033[1m'+'Train Accuracy:'+'\033[0m', logreg.score(X_train_scaled, y_train)) # Train score
print('\033[1m'+'Test Accuracy:'+'\033[0m', logreg.score(X_test_scaled, y_test)) # Test score
# Let's Manually calculate the test accuracy....
manual_test_accuracy = (sum(y_pred.to_numpy()==y_test.to_numpy())/np.shape(y_pred.to_numpy()))[0]
print('\033[1m'+'Manual Test Accuracy:'+'\033[0m', manual_test_accuracy)
**Train Accuracy:** 0.7589285714285714
**Test Accuracy:** 0.7642857142857142
**Manual Test Accuracy:** 0.7642857142857142

The Weights and Bias

The objective of Regression analysis is to determine the weights (coefficients) and bias, which we then apply to the predictors (inputs) to obtain our predictions (the final result).

\[\gamma = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + ... + \beta_{n}x_{n} + \epsilon\]

where \(\gamma\) is the predicted output

\(\beta_{0}\) is the bias (intercept in math)

\(\beta_{k}, 1 \leq k \leq n\) are the weights

logreg.coef_ # Each predictor's coefficient
array([[ 2.6719901 ,  0.46065513,  3.09032814,  0.87137457,  0.07028897,
        -0.17395212,  0.62522164,  0.05633776, -0.1554372 ,  0.0393792 ,
         0.20286036,  0.18514474,  0.49220729, -0.33276522]])
logreg.intercept_[0] # bias
-1.7037491273882583
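
As a sanity check on how these numbers are used, here is a small sketch that rebuilds the predicted probability for a single (scaled) training observation from the bias, the weights and the sigmoid function; it should agree with the model's own predict_proba:

# Reconstruct the model's probability for one observation by hand
x0 = X_train_scaled.iloc[[0]]  # a single scaled observation (as a one-row DataFrame)
log_odds = logreg.intercept_[0] + np.dot(x0.to_numpy(), logreg.coef_.ravel())[0]
prob_excessive = 1 / (1 + np.exp(-log_odds))  # the sigmoid maps log-odds to a probability

print(prob_excessive, logreg.predict_proba(x0)[0, 1])  # these two should match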

Interpretation of the weights and bias

The way these coefficients are displayed here makes it quite hard to match up the inputs to their coefficients.

Also, since we are dealing with a logistic regression model the equation actually looks like this:

\[\gamma = log(odds) = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + ... + \beta_{n}x_{n} + \epsilon\]

And since we are interested in finding the odds of excessive absenteeism occurring:

\[odds = e^{log(odds)}\]

Let’s create a summary table:

summary_table = pd.DataFrame(data=X_train_scaled.columns.values, columns=['Feature name'])
summary_table['Weight'] = np.transpose(logreg.coef_) # Convert the coefficients into columns
#display(summary_table)

# To add the intercept to the beginning of the summary table:
summary_table.index +=1 # shift the indices down by one
#display(summary_table) # Space has been created for the intercept to be prepended
summary_table.loc[0] = ['Bias', logreg.intercept_[0]]
summary_table = summary_table.sort_index()

summary_table['Odds_ratio'] = np.exp(summary_table['Weight'])
# Rank features by how far each odds ratio is from 1, as a percentage of the largest such distance
summary_table['importance'] = 100*abs(1 - abs(summary_table['Odds_ratio'])) / max(abs(1 - abs(summary_table['Odds_ratio'])))
summary_table.sort_values('importance', ascending=False)

Feature name Weight Odds_ratio importance
3 Reason_3 3.090328 21.984291 100.000000
1 Reason_1 2.671990 14.468735 64.184847
4 Reason_4 0.871375 2.390194 6.624928
7 Transportation Expense 0.625222 1.868660 4.139573
0 Bias -1.703749 0.182000 3.898155
13 Children 0.492207 1.635923 3.030473
2 Reason_2 0.460655 1.585112 2.788334
14 Pets -0.332765 0.716938 1.348921
11 Body Mass Index 0.202860 1.224901 1.071761
12 Education 0.185145 1.203393 0.969261
6 Day of the Week -0.173952 0.840337 0.760869
9 Age -0.155437 0.856041 0.686033
5 Month Value 0.070289 1.072818 0.347013
8 Distance to Work 0.056338 1.057955 0.276183
10 Daily Work Load Average 0.039379 1.040165 0.191404

Now we have our coefficients sorted from most to least important. A weight \(\beta_{k}\) of zero (or close to 0) \(\implies\) the feature will not impact the model by much. Conversely, a larger weight means that the model depends more heavily on that feature. This is intuitive.

The odds ratio, on the other hand, works multiplicatively: \(Odds \times odds\_ratio = \mbox{new odds}\) for a unit change (a change of 1) in the predictor. For example, if the odds are 3:1 and \(odds\_ratio = 2\), then the new odds are 6:1.

If \(odds\_ratio = 1\), the odds do not change.

Thus, the closer a predictor’s odds ratio is to 1, the lower its significance to the model. We would like to know how important each of our predictors is to the model and its predictions. The weights / odds ratios offer a crude way of evaluating this. However, for a quick ranking of the predictors, I included an arbitrarily defined importance column, with no meaning beyond being a ranking aid. In this column, the importance is given by the absolute difference between each feature’s odds ratio and 1, displayed as a percentage of the same quantity for the highest-importance predictor.

Feature Importance provides a much more straightforward and less fluffy way of achieving this…

Feature Importance

The Permutation Importance of a feature is calculated by randomly shuffling the feature’s rows and checking how much the model’s prediction accuracy on the evaluation set deteriorates as a result. This is done for each feature in turn, without changing any of the other columns. The shuffling and performance evaluation (measuring how much the loss function suffers per shuffle) is repeated multiple times for each feature, to account for randomness. The final value reported is the average importance weight \(\pm\) the standard deviation, i.e. the range of variation of the importance weights across shuffles.

Now, shuffling a predictor’s rows should result in less accurate model predictions, for obvious reasons. Thus, the feature importances are calculated by how negatively shuffling them affects the model.
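
Before reaching for a library, here is a minimal sketch of that procedure for a single feature, written against the objects we already have (logreg, X_train_scaled, y_train); the helper name is my own:

def single_permutation_importance(model, X, y, feature, n_repeats=5, seed=0):
    """Average drop in accuracy (and its spread) after shuffling one column."""
    rng = np.random.RandomState(seed)
    baseline = model.score(X, y)
    drops = []
    for _ in range(n_repeats):
        X_shuffled = X.copy()
        X_shuffled[feature] = rng.permutation(X_shuffled[feature].to_numpy())
        drops.append(baseline - model.score(X_shuffled, y))
    return np.mean(drops), np.std(drops)

# e.g. single_permutation_importance(logreg, X_train_scaled, y_train, 'Reason_1')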

eli5 does this very thing in no more than 4 lines of code!

!pip install eli5
import eli5
from eli5.sklearn import PermutationImportance
perm = PermutationImportance(logreg, random_state=0).fit(X_train_scaled, y_train)

eli5.show_weights(perm, feature_names=X_train_scaled.columns.tolist())
Weight Feature
0.1446 ± 0.0168 Reason_1
0.0771 ± 0.0167 Reason_3
0.0493 ± 0.0080 Transportation Expense
0.0382 ± 0.0062 Children
0.0200 ± 0.0142 Reason_4
0.0132 ± 0.0105 Pets
0.0082 ± 0.0048 Age
0.0025 ± 0.0062 Education
0.0018 ± 0.0078 Daily Work Load Average
0.0018 ± 0.0023 Reason_2
-0.0007 ± 0.0017 Month Value
-0.0007 ± 0.0129 Day of the Week
-0.0011 ± 0.0066 Distance to Work
-0.0029 ± 0.0161 Body Mass Index

Interpretation:

Permutation importances are displayed in descending order.

The weights are reported with an uncertainty denoting how much the base weight varied per shuffle.

Some of the features’ reported weights are negative. Does that mean the model’s loss function improved after random shuffling of the column?

Yes and no. Rather, this is a result of happenstance: such features have very low permutation importances, and due to random noise the predictions on the shuffled data happen to be slightly more accurate than those on the unshuffled validation data. This issue is less common with larger datasets, where there is less room for chance. (Recall: this dataset has only 700 observations!)


Anyway, we see from our permutation importances that Reasons 1 and 3, followed by Transportation Expense and number of Children, have the biggest impact on Absenteeism.

Backward Elimination

is where we simplify our model by removing the low-importance features (with weights \(\approx 0\) or odds ratio \(\approx 1\))

The Day of the Week, Distance to Work, Body Mass Index and Month Value appear to be four such features, according to the Permutation Importance table.


Let’s re-train our model without these features and see if a simplified model performs any better.

# Drop unimportant features from a checkpoint variable. Let's remove variables with importances < 10% of highest importance
eliminate = ['Distance to Work', 'Month', 'Body Mass Index', 'Day of the Week']
#`Day of the Week`, `Distance to work`, `Body Mass Index` and `Month`
#eliminate = ['Absenteeism Time in Hours','Day of the Week', 'Daily Work Load Average','Distance to Work']
feat = raw_inputs.drop([i for i in eliminate if i in raw_inputs], axis=1)

# Train/test split
X_tr, X_te, y_tr, y_te = train_test_split(feat, Targs, test_size=0.2, random_state=0, stratify=Targs)

# Select the columns to standardize
#incl = [i for i in X_tr.columns if np.all(X_tr[i].unique() != np.array([0,1])) and np.all(X_tr[i].unique() != np.array([1,0]))]
incl = [i for i in X_tr.columns if X_tr[i].nunique()!=2 ]
excl = [i for i in X_tr.columns if i not in incl]
#standardize
coltransformer = ColumnTransformer(remainder = 'passthrough', transformers=[('scale', StandardScaler(), incl)])
X_tr_scaled = pd.DataFrame(data=coltransformer.fit_transform(X_tr.astype(float)), columns=incl+excl)[X_tr.columns] # scale train predictors
X_te_scaled = pd.DataFrame(data=coltransformer.transform(X_te.astype(float)), columns=incl+excl)[X_te.columns] #scale test predictors

### Retrain the model with less features:
logreg_new = LogisticRegression(solver='newton-cg')
logreg_new.fit(X_tr_scaled, y_tr.to_numpy().ravel())

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='newton-cg',
          tol=0.0001, verbose=0, warm_start=False)

# Summary Table Comparison
summary_table2 = pd.DataFrame(data=X_tr_scaled.columns.values, columns=['Feature name'])
summary_table2['New Weight'] = np.transpose(logreg_new.coef_) # Convert the coefficients into columns


# To add the intercept to the beginning of the summary table:
summary_table2.index +=1 # shift the indices down by one
# display(summary_table2) # Space has been created for the intercept to be prepended
summary_table2.loc[0] = ['New Bias', logreg_new.intercept_[0]]
summary_table2 = summary_table2.sort_index()
summary_table2['New Odds_ratio'] = np.exp(summary_table2['New Weight'])


summary_table2['New importance'] = [(abs(1-abs(i))/max(abs(1-abs(summary_table2['New Odds_ratio'])))) for i in summary_table2['New Odds_ratio']]


from IPython.display import display_html
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html()
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)

"""
# credit: https://stackoverflow.com/questions/38783027/jupyter-notebook-display-two-pandas-tables-side-by-side
display_side_by_side(summary_table.sort_values('importance', ascending=False),
                     summary_table2.sort_values('New importance', ascending=False))
"""

perm = PermutationImportance(logreg, random_state=0).fit(X_train_scaled, y_train)
eli5.show_weights(perm, feature_names=X_train_scaled.columns.tolist())

perm2 = PermutationImportance(logreg_new, random_state=0).fit(X_tr_scaled, y_tr)
eli5.show_weights(perm2, feature_names=X_tr.columns.tolist())


Weight Feature
0.1686 ± 0.0262 Reason_1
0.0771 ± 0.0156 Reason_3
0.0236 ± 0.0069 Transportation Expense
0.0225 ± 0.0137 Children
0.0104 ± 0.0089 Reason_4
0.0039 ± 0.0116 Pets
0.0036 ± 0.0045 Reason_2
0.0000 ± 0.0064 Daily Work Load Average
-0.0004 ± 0.0061 Month Value
-0.0014 ± 0.0057 Age
-0.0043 ± 0.0029 Education
print('\n------------------------------------------------------------------------------------------------------\n')
plt.rcParams['figure.figsize'] = (18.0, 9.0)
f = plt.figure(figsize=(20,8))
f.suptitle('Feature Importances pre- and post-Backward elimination', fontweight='bold')
ax = f.add_subplot(121)
ax.set_title('All predictors')
ax.pie(summary_table['Odds_ratio'], labels=summary_table['Feature name'], shadow=True)

ax2 = f.add_subplot(122)
ax2.set_title('Important predictors')
ax2.pie(summary_table2['New Odds_ratio'], labels=summary_table2['Feature name'], shadow=True)
plt.show()

------------------------------------------------------------------------------------------------------

Feature Importances before and after feature selection

trsc = 100*logreg.score(X_train_scaled, y_train)
tsst = 100*logreg.score(X_test_scaled, y_test)
trsc_new = 100*logreg_new.score(X_tr_scaled, y_tr)
tsst_new = 100*logreg_new.score(X_te_scaled, y_te)
print('\033[1m'+'New Accuracies:'+'\033[0m'+'\nTrain Accuracy= %.2f%%\t Test Accuracy = %5.3f%%' %(trsc_new, tsst_new))
print('\033[1m'+'Old Accuracies:'+'\033[0m'+'\nTrain Accuracy= %.2f%%\t Test Accuracy = %5.3f%%\
      \n --------------------------------------------------------------\n' %(trsc, tsst))

score_table = pd.DataFrame(data=[trsc, tsst], columns=['old score'])
score_table['new score'] = [trsc_new, tsst_new]
score_table.index = ['train accuracy', 'test accuracy']
score_table
**New Accuracies:**
Train Accuracy= 74.64%	 Test Accuracy = 77.143%
**Old Accuracies:**
Train Accuracy= 75.89%	 Test Accuracy = 76.429%      
 --------------------------------------------------------------
old score new score
train accuracy 75.892857 74.642857
test accuracy 76.428571 77.142857

Productionalising the model

Based on one’s needs, the trained model may be saved in a variety of ways for future use.

  • One could write a python module that can be emailed to colleagues and simply imported to make further predictions, for static, batch predictions
  • Alternatively, a Docker container could, apparently, be most helpful. DON’T ask me about Docker Containers, yet 🙃
  • Web services such as Amazon SageMaker accept Jupyter Notebooks.
    This is one of the most commonly used options for industrial use, as Amazon’s AWS provides the infrastructure for the model deployment
  • I have also recently come across Flask, which may also be worth checking out…

Håkon Hapnes Strand covered this subject succinctly in this Quora Answer.

Anyway, we can easily save our final model as a pickle file

Pickle Rick

import pickle

with open('model', 'wb') as file:
    pickle.dump(logreg_new, file)

Let’s also save the scaler we used to standardize our data!

with open('scaler', 'wb') as file:
    pickle.dump(coltransformer, file)
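
Later, in a fresh session (or on a colleague’s machine), the two artifacts can be loaded back and applied to newly preprocessed data. A rough sketch, where new_data stands in for a hypothetical DataFrame with the same predictor columns we trained on:

import pickle
import pandas as pd

with open('model', 'rb') as file:
    model = pickle.load(file)
with open('scaler', 'rb') as file:
    scaler = pickle.load(file)

# new_data: a hypothetical DataFrame with the same predictor columns as X_tr.
# The ColumnTransformer reorders columns, so we restore the original order the
# same way we did for the training set (in practice the incl/excl column lists
# would need to be saved alongside the scaler):
# new_scaled = pd.DataFrame(scaler.transform(new_data.astype(float)),
#                           columns=incl+excl)[new_data.columns]
# predictions = model.predict(new_scaled)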

And that’s it for now…

Send your comments and suggestions to my email!