
Model Design and Logistic Regression in Python

Justin Napolitano

2022-06-17




Model Design and Logistic Regression in Python

I recently modeled customer churn in Julia with a logistic regression model. It was interesting, to be sure, but I want to extend my analysis skill set by modeling biostatistics data. In this post, I design a logistic regression model of health predictors.

Imports

# load some default Python modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')

# scikit-learn pieces used below
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Data

Data Description

Chinese Longitudinal Healthy Longevity Survey (CLHLS), Biomarkers Datasets, 2009, 2012, 2014 (ICPSR 37226). Principal Investigators: Yi Zeng, Duke University and Peking University; James W. Vaupel, Max Planck Institutes and Duke University.

filename = "/Users/jnapolitano/Projects/biostatistics/data/37226-0003-Data.tsv"

# read the data into a pandas dataframe
df = pd.read_csv(filename, sep='\t')

# list first few rows (datapoints)
df.head()

ID TRUEAGE A1 ALB GLU BUN CREA CHO TG GSP ... RBC HGB HCT MCV MCH MCHC PLT MPV PDW PCT
0 32160008 95 2 30.60000038147 4.230000019073 6.860000133514 64.90000152588 3.5 .8799999952316 232.89999389649 ... 3.5 104 29.14999961853 82.69999694825 29.5 357 394 8.60000038147 14.30000019074 .33000001311
1 32161008 95 2 39.09999847413 6.94000005722 16.190000534058 152.39999389649 4.619999885559 1.2799999713898 264.20001220704 ... 3.2999999523163 101.3000030518 28.930000305176 88.90000152588 31.10000038147 350 149 9.10000038147 15 .12999999523
2 32162608 87 2 44.79999923707 5.550000190735 5.679999828339 78.5 5.199999809265 2.3900001049042 276.20001220704 ... 3.5999999046326 111.3000030518 31.159999847412 87.59999847413 31.29999923707 357 201 8.30000019074 12 .15999999642
3 32163008 90 2 41.29999923707 5.269999980927 5.949999809265 75.80000305176 4.25 1.5499999523163 264.20001220704 ... 3.7000000476837 113.9000015259 32.900001525879 89.69999694825 31.10000038147 346 150 9.89999961854 16.79999923707 .1400000006
4 32164908 94 2 39.90000152588 7.05999994278 6.039999961853 90.80000305176 7.139999866486 2.3399999141693 237.69999694825 ... 4.1999998092651 131.1999969483 36.689998626709 88.5 31.60000038147 358 163 9.69999980927 17.79999923707 .15000000596

5 rows × 33 columns

df.describe()

ID
count 2.546000e+03
mean 4.069177e+07
std 4.367164e+06
min 3.216001e+07
25% 3.743344e+07
50% 4.135976e+07
75% 4.430106e+07
max 4.611231e+07

The float columns were not interpreted correctly by pandas. I'll fix that.

df.columns
Index(['ID', 'TRUEAGE', 'A1', 'ALB', 'GLU', 'BUN', 'CREA', 'CHO', 'TG', 'GSP',
       'CRPHS', 'UA', 'HDLC', 'SOD', 'MDA', 'VD3', 'VITB12', 'UALB', 'UCR',
       'UALBBYUCR', 'WBC', 'LYMPH', 'LYMPH_A', 'RBC', 'HGB', 'HCT', 'MCV',
       'MCH', 'MCHC', 'PLT', 'MPV', 'PDW', 'PCT'],
      dtype='object')
# check datatypes
df.dtypes
ID            int64
TRUEAGE      object
A1           object
ALB          object
GLU          object
BUN          object
CREA         object
CHO          object
TG           object
GSP          object
CRPHS        object
UA           object
HDLC         object
SOD          object
MDA          object
VD3          object
VITB12       object
UALB         object
UCR          object
UALBBYUCR    object
WBC          object
LYMPH        object
LYMPH_A      object
RBC          object
HGB          object
HCT          object
MCV          object
MCH          object
MCHC         object
PLT          object
MPV          object
PDW          object
PCT          object
dtype: object

Everything was read in as an object. I'll cast everything to numeric… thank you, pandas.

# replace empty space with na
df = df.replace(" ", np.nan)

Just to be safe, I replaced all blank spaces with np.nan.
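If the file had cells containing more than a single space, a regex replacement would catch those too. A minimal sketch, assuming any whitespace-only cell should be treated as missing:

# assumption: whitespace-only cells of any width should become NaN
df = df.replace(r'^\s*$', np.nan, regex=True)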

# convert numeric objects to numeric data types. I checked the codebook; there should not be any false positives
df = df.apply(pd.to_numeric, errors='raise')
# recheck dtypes
df.dtypes
ID             int64
TRUEAGE      float64
A1           float64
ALB          float64
GLU          float64
BUN          float64
CREA         float64
CHO          float64
TG           float64
GSP          float64
CRPHS        float64
UA           float64
HDLC         float64
SOD          float64
MDA          float64
VD3          float64
VITB12       float64
UALB         float64
UCR          float64
UALBBYUCR    float64
WBC          float64
LYMPH        float64
LYMPH_A      float64
RBC          float64
HGB          float64
HCT          float64
MCV          float64
MCH          float64
MCHC         float64
PLT          float64
MPV          float64
PDW          float64
PCT          float64
dtype: object
# check statistics of the features
df.describe()

ID TRUEAGE A1 ALB GLU BUN CREA CHO TG GSP ... RBC HGB HCT MCV MCH MCHC PLT MPV PDW PCT
count 2.546000e+03 2542.000000 2542.000000 2499.000000 2499.000000 2499.000000 2499.000000 2499.000000 2499.000000 2499.000000 ... 2497.000000 2497.000000 2497.000000 2497.000000 2497.000000 2497.000000 2497.000000 2487.000000 2483.000000 1711.000000
mean 4.069177e+07 85.584972 1.543273 42.363345 5.364794 6.661321 82.805642 4.770340 1.251369 253.726811 ... 4.165012 127.684902 38.664654 94.532295 31.033849 323.221826 195.033440 9.322951 16.114692 0.244237
std 4.367164e+06 12.061941 0.498222 4.367372 1.802363 2.355459 29.246926 1.010844 0.757557 38.658243 ... 0.602123 33.852642 7.163306 7.624568 11.041158 20.995983 76.322382 4.468129 4.264532 2.679986
min 3.216001e+07 47.000000 1.000000 21.900000 1.960000 2.090000 30.500000 0.070000 0.030000 139.899994 ... 1.910000 13.000000 0.280000 54.799999 15.900000 3.900000 9.000000 0.000000 5.500000 0.020000
25% 3.743344e+07 76.000000 1.000000 40.000000 4.400000 5.150000 66.599998 4.090000 0.800000 232.600006 ... 3.780000 115.000000 35.400002 91.300003 29.500000 317.000000 150.000000 8.200000 15.300000 0.140000
50% 4.135976e+07 86.000000 2.000000 42.799999 5.020000 6.380000 77.000000 4.690000 1.050000 248.800003 ... 4.160000 127.000000 39.000000 95.400002 31.200001 325.000000 189.000000 9.300000 16.000000 0.170000
75% 4.430106e+07 95.000000 2.000000 45.000000 5.790000 7.695000 92.099998 5.370000 1.470000 266.899994 ... 4.540000 140.000000 42.700001 98.900002 32.500000 333.000000 229.000000 10.300000 16.799999 0.210000
max 4.611231e+07 113.000000 2.000000 130.000000 22.000000 39.860001 585.099976 13.070000 8.150000 778.000000 ... 7.210000 1116.000000 70.199997 125.800003 371.000000 429.000000 1514.000000 107.000000 153.000000 111.000000

8 rows × 33 columns

It is kind of odd that some columns have higher counts than others. I'll remove all rows with missing values.

Checking for negative values and anything else I missed from the initial SQL clean:
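Here is a quick sketch of what those checks might look like before dropping anything (the exact checks are illustrative, not from the codebook):

# columns with the most missing values
print(df.isna().sum().sort_values(ascending=False).head(10))

# count of negative entries across the numeric columns (these biomarkers should be non-negative)
numeric = df.select_dtypes(include='number')
print((numeric < 0).sum().sum())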

df = df.dropna()
df.describe()

ID TRUEAGE A1 ALB GLU BUN CREA CHO TG GSP ... RBC HGB HCT MCV MCH MCHC PLT MPV PDW PCT
count 1.561000e+03 1561.000000 1561.000000 1561.000000 1561.000000 1561.000000 1561.000000 1561.000000 1561.000000 1561.000000 ... 1561.000000 1561.000000 1561.000000 1561.000000 1561.000000 1561.000000 1561.000000 1561.000000 1561.000000 1561.000000
mean 3.951369e+07 84.782191 1.534914 42.445292 5.334414 6.898834 84.107880 4.806336 1.243748 254.537604 ... 4.060587 126.936387 38.154004 94.416284 31.500461 329.142217 190.152082 10.006560 15.777220 0.251365
std 4.702774e+06 12.056596 0.498939 4.659920 1.707653 2.292577 29.500260 0.999193 0.775910 39.858537 ... 0.589037 39.956758 5.261258 7.701819 11.723908 16.978577 71.839722 4.453105 5.142069 2.805686
min 3.216101e+07 48.000000 1.000000 21.900000 1.960000 2.140000 30.500000 0.340000 0.070000 139.899994 ... 2.100000 13.000000 13.200000 56.000000 17.500000 35.000000 25.000000 0.200000 5.500000 0.020000
25% 3.736671e+07 76.000000 1.000000 39.900002 4.400000 5.370000 67.000000 4.120000 0.780000 231.600006 ... 3.680000 114.000000 34.799999 91.199997 30.100000 320.000000 147.000000 8.800000 15.400000 0.140000
50% 3.745741e+07 85.000000 2.000000 42.900002 4.990000 6.610000 77.599998 4.730000 1.040000 249.800003 ... 4.060000 126.000000 38.200001 95.400002 31.400000 328.000000 185.000000 9.600000 15.900000 0.170000
75% 4.332561e+07 94.000000 2.000000 45.099998 5.780000 7.960000 93.199997 5.420000 1.460000 268.899994 ... 4.400000 137.000000 41.599998 98.900002 32.700001 337.000000 227.000000 10.600000 16.299999 0.210000
max 4.581641e+07 113.000000 2.000000 130.000000 20.760000 23.549999 392.000000 8.490000 8.150000 778.000000 ... 7.210000 1116.000000 70.199997 125.800003 371.000000 408.000000 1302.000000 107.000000 153.000000 111.000000

8 rows × 33 columns

We remove about two-fifths of our data (2,546 rows down to 1,561). The counts are now equivalent, and everything is in the correct data type.

Visualizing Age Distribution

I am curious what the age spread looks like. An even spread would make the data more useful for modeling health outcomes.

# plot histogram of age
df.TRUEAGE.hist(figsize=(14,3))
plt.xlabel('Age')
plt.title('Histogram');

[Figure: histogram of TRUEAGE (age distribution)]

Unfortunately, the ages are not evenly distributed.
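To quantify that impression rather than eyeballing the histogram, a short sketch:

# skewness of the age distribution; a value far from 0 indicates an uneven spread
print(df.TRUEAGE.skew())
print(df.TRUEAGE.quantile([0.1, 0.5, 0.9]))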

Visualizing Age to Triglyceride Levels

A predictive model relating health factors to longevity is probably possible. Certain statistical conditions would have to be met, but I'll assume they are for the sake of this mockup.

plt.scatter(df.TRUEAGE, df.TG)
plt.xlabel('True Age')
plt.ylabel('Triglyceride, mmol/L')
plt.show()

[Figure: scatter plot of age against triglyceride level]

Filter Examples

The data above doesn't really need to be filtered. To demonstrate how it could be, and to fit the conditions described in the training video, I add some randomized columns that are then filtered according to specific eligibility conditions.

n = df.shape[0]

# randomized demo columns: name -> exclusive upper bound (lower bound is 0, inclusive)
demo_columns = {
    "EMERGENCY": 2,      # 0 = no, 1 = yes
    "CANCER_TYPE": 100,  # arbitrary cancer-type codes
    "ICPI_HIST": 3,      # 0 = neither, 1 = ICPI, 2 = mono
    "LANG": 2,           # 0 = Spanish, 1 = English (arbitrarily chosen)
    "FOLLOW_UP": 2,      # 0 = no, 1 = yes
    "CONSENT": 2,        # 0 = no, 1 = yes
    "PREGNANT": 2,       # 0 = no, 1 = yes
}

for name, upper_bound in demo_columns.items():
    df[name] = np.random.randint(low=0, high=upper_bound, size=n)

df

ID TRUEAGE A1 ALB GLU BUN CREA CHO TG GSP ... MPV PDW PCT EMERGENCY CANCER_TYPE ICPI_HIST LANG FOLLOW_UP CONSENT PREGNANT
1 32161008 95.0 2.0 39.099998 6.94 16.190001 152.399994 4.62 1.28 264.200012 ... 9.1 15.000000 0.13 1 20 2 1 1 1 1
2 32162608 87.0 2.0 44.799999 5.55 5.680000 78.500000 5.20 2.39 276.200012 ... 8.3 12.000000 0.16 1 2 2 0 0 1 1
3 32163008 90.0 2.0 41.299999 5.27 5.950000 75.800003 4.25 1.55 264.200012 ... 9.9 16.799999 0.14 0 36 2 1 0 0 0
6 32166108 89.0 2.0 45.000000 8.80 13.170000 147.000000 3.19 1.72 336.399994 ... 8.2 12.300000 0.12 1 84 1 0 0 0 0
7 32167608 100.0 2.0 40.099998 4.34 5.950000 76.000000 5.67 1.44 223.300003 ... 10.8 16.400000 0.20 1 0 1 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2265 45816014 98.0 2.0 37.000000 6.04 5.010000 59.299999 3.84 0.95 195.300003 ... 9.9 16.200001 0.26 0 6 0 1 0 0 0
2266 45816114 69.0 1.0 46.299999 5.99 5.030000 85.500000 4.43 1.44 224.000000 ... 10.4 16.000000 0.23 1 31 0 0 0 1 1
2267 45816214 93.0 2.0 42.599998 5.53 6.320000 85.500000 4.03 0.92 249.800003 ... 11.1 16.299999 0.14 1 39 1 1 1 1 0
2268 45816314 91.0 2.0 43.400002 5.82 7.770000 72.099998 4.29 1.08 259.299988 ... 10.0 15.900000 0.20 1 57 1 1 0 1 0
2269 45816414 93.0 2.0 42.900002 5.10 5.010000 59.799999 4.94 1.82 236.399994 ... 8.7 16.299999 0.21 1 32 2 1 0 1 0

1561 rows × 40 columns

Writing the Filter

Writing a quick filter to ensure eligibility. This could, and probably should, be written functionally, but so it goes.

# Age is greater than 18
# Cancer type is not a non-melanoma skin cancer (ie 5, arbitrarily chosen)
# Patient is seeking care in the emergency department
# ICPI history equals 0, ie the patient is receiving neither therapy. Depending on how the data are provided, this could also be written as equal to 1 or 2
# Language is either English or Spanish
# Patient agrees to follow up
# Patient consents
# Patient is not pregnant

idx_spanish = (df.TRUEAGE > 18) & (df.CANCER_TYPE != 5) & (df.EMERGENCY == 1) & \
        (df.ICPI_HIST == 0) & (df.LANG == 0) & (df.FOLLOW_UP == 1) & (df.CONSENT == 1) & (df.PREGNANT == 0)

# Ideally the English and Spanish speakers would have been filtered prior to this, but for the sake of exploration this will work.
idx_english = (df.TRUEAGE > 18) & (df.CANCER_TYPE != 5) & (df.EMERGENCY == 1) & \
        (df.ICPI_HIST == 0) & (df.LANG == 1) & (df.FOLLOW_UP == 1) & (df.CONSENT == 1) & (df.PREGNANT == 0)

I created Spanish and English masks for the sake of data manipulation. It is not really necessary, but it would permit modifying and recoding the data if it were formatted differently.


filtered_df = pd.concat([df[idx_english], df[idx_spanish]], ignore_index=True)

The filtered dataframe is a concatenation of the English and Spanish filtered data.
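Since the two masks differ only in the LANG condition, the same rows could be selected in one pass with a boolean OR. A sketch of that alternative (the row order will differ from the concat version, but the selected rows are the same):

# equivalent single-pass filter; idx_spanish and idx_english are disjoint
filtered_df = df[idx_spanish | idx_english].reset_index(drop=True)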

filtered_df.describe()

ID TRUEAGE A1 ALB GLU BUN CREA CHO TG GSP ... MPV PDW PCT EMERGENCY CANCER_TYPE ICPI_HIST LANG FOLLOW_UP CONSENT PREGNANT
count 3.300000e+01 33.000000 33.000000 33.000000 33.000000 33.000000 33.000000 33.000000 33.000000 33.000000 ... 33.000000 33.000000 33.000000 33.0 33.000000 33.0 33.000000 33.0 33.0 33.0
mean 4.023138e+07 85.000000 1.636364 41.563636 5.139697 6.289394 83.718182 4.893030 1.168788 244.539395 ... 9.715151 14.945455 0.202424 1.0 49.969697 0.0 0.575758 1.0 1.0 0.0
std 4.523359e+06 12.080459 0.488504 5.748577 1.516953 1.695724 28.692055 1.158084 0.510090 30.776044 ... 1.466953 1.866161 0.080856 0.0 26.886898 0.0 0.501890 0.0 0.0 0.0
min 3.244411e+07 64.000000 1.000000 29.600000 3.350000 3.550000 43.700001 3.160000 0.410000 196.600006 ... 7.400000 9.800000 0.070000 1.0 0.000000 0.0 0.000000 1.0 1.0 0.0
25% 3.744541e+07 75.000000 1.000000 36.900002 4.360000 4.660000 64.500000 3.940000 0.800000 226.800003 ... 8.800000 15.100000 0.160000 1.0 33.000000 0.0 0.000000 1.0 1.0 0.0
50% 4.222571e+07 85.000000 2.000000 42.700001 4.750000 6.310000 78.900002 4.730000 1.150000 235.399994 ... 9.600000 15.700000 0.180000 1.0 55.000000 0.0 1.000000 1.0 1.0 0.0
75% 4.460501e+07 94.000000 2.000000 46.200001 5.210000 7.250000 100.199997 5.800000 1.410000 273.600006 ... 10.600000 16.100000 0.240000 1.0 69.000000 0.0 1.000000 1.0 1.0 0.0
max 4.581561e+07 102.000000 2.000000 50.299999 11.350000 10.000000 151.699997 7.220000 2.770000 310.899994 ... 14.300000 16.799999 0.440000 1.0 91.000000 0.0 1.000000 1.0 1.0 0.0

8 rows × 40 columns

filtered_df.shape[0]
#only 33 left following the filter. 
33

Following the filter, only 33 records are left in the set. A workflow similar to this could be used to identify possible survey recruits from aggregated chart data.

Logistic Regression Sample

I am surprised by how few samples are left after the filter. To avoid a small n, I will return to the initial dataset.

filename = "/Users/jnapolitano/Projects/biostatistics/data/37226-0003-Data.tsv"
df = pd.read_csv(filename, sep='\t')

# replace empty space with na
df = df.replace(" ", np.nan)

# convert numeric objects to numeric data types. I checked the codebook; there should not be any false positives
df = df.apply(pd.to_numeric, errors='raise')
df = df.dropna()
df

ID TRUEAGE A1 ALB GLU BUN CREA CHO TG GSP ... RBC HGB HCT MCV MCH MCHC PLT MPV PDW PCT
1 32161008 95.0 2.0 39.099998 6.94 16.190001 152.399994 4.62 1.28 264.200012 ... 3.30 101.300003 28.930000 88.900002 31.100000 350.0 149.0 9.1 15.000000 0.13
2 32162608 87.0 2.0 44.799999 5.55 5.680000 78.500000 5.20 2.39 276.200012 ... 3.60 111.300003 31.160000 87.599998 31.299999 357.0 201.0 8.3 12.000000 0.16
3 32163008 90.0 2.0 41.299999 5.27 5.950000 75.800003 4.25 1.55 264.200012 ... 3.70 113.900002 32.900002 89.699997 31.100000 346.0 150.0 9.9 16.799999 0.14
6 32166108 89.0 2.0 45.000000 8.80 13.170000 147.000000 3.19 1.72 336.399994 ... 3.00 92.599998 26.340000 88.500000 31.100000 352.0 157.0 8.2 12.300000 0.12
7 32167608 100.0 2.0 40.099998 4.34 5.950000 76.000000 5.67 1.44 223.300003 ... 3.76 114.000000 35.400002 94.099998 30.299999 322.0 193.0 10.8 16.400000 0.20
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2265 45816014 98.0 2.0 37.000000 6.04 5.010000 59.299999 3.84 0.95 195.300003 ... 4.31 122.000000 38.900002 90.300003 28.299999 313.0 267.0 9.9 16.200001 0.26
2266 45816114 69.0 1.0 46.299999 5.99 5.030000 85.500000 4.43 1.44 224.000000 ... 4.46 133.000000 42.200001 94.599998 29.799999 315.0 230.0 10.4 16.000000 0.23
2267 45816214 93.0 2.0 42.599998 5.53 6.320000 85.500000 4.03 0.92 249.800003 ... 4.60 137.000000 43.799999 95.199997 29.799999 313.0 129.0 11.1 16.299999 0.14
2268 45816314 91.0 2.0 43.400002 5.82 7.770000 72.099998 4.29 1.08 259.299988 ... 4.14 122.000000 39.000000 94.300003 29.500000 312.0 200.0 10.0 15.900000 0.20
2269 45816414 93.0 2.0 42.900002 5.10 5.010000 59.799999 4.94 1.82 236.399994 ... 4.50 128.000000 40.900002 90.800003 28.400000 313.0 240.0 8.7 16.299999 0.21

1561 rows × 33 columns

The Model

There is a strong suspicion that biomarkers can determine whether a patient should be admitted for emergency care. In this simplified model, I randomly distribute the proper disposition across the dataset.

n = df.shape[0]
lower_bound = 0 #inclusive
upper_bound = 2 # exclusive
# 0 = no
# 1 = Yes
tmp =  np.random.randint(low=lower_bound , high = upper_bound, size = n)
df["PROP_DISPOSITION"] = tmp

Create Test and Train Set

This could be randomly sampled as well…

Random Sample

# copy in memory to avoid errors.  This could be done from files or in other ways if memory is limited.  
master_table = df.copy()

Test Sample Set with 20,000 Rows Randomly Selected from the Master with Replacement

test_sample = master_table.sample(n=20000,replace=True)
targets = test_sample.pop("PROP_DISPOSITION")
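For a repeatable run, pandas' sample accepts a random_state. A variant of the step above, where the seed 0 is an arbitrary choice:

# assumption: seed 0 is arbitrary, chosen only to make the bootstrap sample reproducible
test_sample = master_table.sample(n=20000, replace=True, random_state=0)
targets = test_sample.pop("PROP_DISPOSITION")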

Separate Train and Test Sets

x_train, x_test, y_train, y_test = train_test_split(test_sample, targets, test_size=0.2, random_state=0)

Data Standardization

Calculate the mean and standard deviation for each column. Subtract the corresponding mean from each element. Divide the obtained difference by the corresponding standard deviation.

Thankfully, this is built into scikit-learn.

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
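For reference, a minimal sketch of the same transform done by hand on the unscaled training frame (x_train_raw is a hypothetical name for x_train before the fit_transform call); StandardScaler uses the population standard deviation, i.e. ddof=0:

# manual equivalent of scaler.fit_transform
means = x_train_raw.mean(axis=0)
stds = x_train_raw.std(axis=0, ddof=0)
x_train_scaled = (x_train_raw - means) / stds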

Create the Model

model = LogisticRegression(solver='liblinear', C=0.05, multi_class='ovr',
                           random_state=0)
model.fit(x_train, y_train)
LogisticRegression(C=0.05, multi_class='ovr', random_state=0,
                   solver='liblinear')
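With the model fit, the learned parameters can be inspected directly; the coefficient order matches the columns of test_sample:

# inspect fitted parameters
print(model.intercept_)   # shape (1,) for a binary problem
print(model.coef_)        # shape (1, n_features)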

Evaluate Model

x_test = scaler.transform(x_test)
y_pred = model.predict(x_test)

Model Scoring

With a completely randomized target, the score should be about 50%. If it is significantly greater, there is probably a problem with the model, such as data leakage.

model.score(x_train, y_train)
0.5676875
model.score(x_test, y_test)
0.55825
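One way to sanity-check that claim is to compare against a trivial baseline. A sketch using scikit-learn's DummyClassifier, which ignores the features entirely:

from sklearn.dummy import DummyClassifier

# a majority-class baseline should also score near 0.5 on a balanced random target
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(x_train, y_train)
print(dummy.score(x_test, y_test))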

The results are as expected.

Confusion matrix

cm = confusion_matrix(y_test, y_pred)
font_size = 10

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.set_xlabel('Predicted outputs', color='purple')
ax.set_ylabel('Actual outputs', color='purple')
ax.xaxis.set(ticks=range(len(cm)))
ax.yaxis.set(ticks=range(len(cm)))
for i in range(len(cm)):
    for j in range(len(cm)):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='purple')
plt.show()

[Figure: confusion matrix heatmap of predicted vs. actual outputs]

Because the target is randomized, the model is accurate only about 50% of the time.

Printing the Classification Report

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.55      0.47      0.51      1927
           1       0.56      0.64      0.60      2073

    accuracy                           0.56      4000
   macro avg       0.56      0.56      0.55      4000
weighted avg       0.56      0.56      0.55      4000
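The class-1 precision and recall can be derived directly from the confusion matrix, since scikit-learn's 2×2 matrix unpacks as (tn, fp, fn, tp):

# precision and recall for class 1, computed from the confusion matrix above
tn, fp, fn, tp = cm.ravel()
print(tp / (tp + fp))   # precision
print(tp / (tp + fn))   # recall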
