Hands-on Implementation of a Logistic Regression Model

2019-01-31

Hands-on implementation of a logistic regression model using the scikit-learn 'iris' dataset as an example.

Figures, practice code, and other learning materials from: https://datascienceschool.net

%matplotlib inline
%config InlineBackend.figure_formats = {'png', 'retina'}
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from patsy import *
import statsmodels.api as sm
import scipy as sp
import seaborn as sns

[Exercise 1]

1) In the iris classification problem, use only the data whose class is setosa or versicolor (setosa=0, versicolor=1), use only sepal length (Sepal Length) and a constant term as independent variables, fit a logistic regression model with the StatsModels package, predict the results, and print the summary report.

from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
sy = pd.Series(iris.target, dtype="category")
sy = sy.cat.rename_categories(iris.target_names)
df['species'] = sy
df.tail()
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)    species
145                6.7               3.0                5.2               2.3  virginica
146                6.3               2.5                5.0               1.9  virginica
147                6.5               3.0                5.2               2.0  virginica
148                6.2               3.4                5.4               2.3  virginica
149                5.9               3.0                5.1               1.8  virginica
df = df[["sepal length (cm)", "species"]]
df = df.loc[df["species"] != 'virginica']    # keep only setosa and versicolor (rows 0-99)

df.rename(columns={'sepal length (cm)': "sepal_length"}, inplace=True)
df['type'] = iris.target[:100]               # encode the target: setosa=0, versicolor=1
model = sm.Logit.from_formula("type ~ 1 + sepal_length", df)
result = model.fit()
print(result.summary())
Optimization terminated successfully.
         Current function value: 0.321056
         Iterations 8
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                   type   No. Observations:                  100
Model:                          Logit   Df Residuals:                       98
Method:                           MLE   Df Model:                            1
Date:                Tue, 04 Dec 2018   Pseudo R-squ.:                  0.5368
Time:                        19:16:22   Log-Likelihood:                -32.106
converged:                       True   LL-Null:                       -69.315
                                        LLR p-value:                 6.320e-18
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept      -27.8315      5.434     -5.122      0.000     -38.481     -17.182
sepal_length     5.1403      1.007      5.107      0.000       3.168       7.113
================================================================================
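As a side note beyond the original exercise, the fitted coefficients can be read as a decision boundary on sepal length: the predicted probability crosses 0.5 where the linear predictor is zero, i.e. at sepal_length = -intercept/slope. A minimal sketch using the fitted result object:

b0 = result.params["Intercept"]
b1 = result.params["sepal_length"]
print(-b0 / b1)   # ≈ 27.83 / 5.14 ≈ 5.41 cm; longer sepals are classified as versicolor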

2) Express the above result as a confusion matrix and a classification report.

df['prediction'] = result.predict(df).map(lambda x: 1 if x >= 0.5 else 0)   # threshold the predicted probability at 0.5
from sklearn.metrics import confusion_matrix

y_true = df["type"]
y_pred = df['prediction']
confusion_matrix(y_true, y_pred)
array([[45,  5],
       [ 6, 44]], dtype=int64)
from sklearn.metrics import classification_report
print(classification_report(y_true, y_pred,
                            target_names=['setosa', 'versicolor']))
              precision    recall  f1-score   support

      setosa       0.88      0.90      0.89        50
  versicolor       0.90      0.88      0.89        50

   micro avg       0.89      0.89      0.89       100
   macro avg       0.89      0.89      0.89       100
weighted avg       0.89      0.89      0.89       100
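As a quick cross-check (not part of the original notebook), the per-class numbers in the report can be recomputed directly from the confusion matrix; for the positive class (versicolor), precision = TP/(TP+FP) and recall = TP/(TP+FN):

cm = confusion_matrix(y_true, y_pred)   # rows = true class, columns = predicted class
tn, fp, fn, tp = cm.ravel()             # binary layout: [[TN, FP], [FN, TP]]
print(tp / (tp + fp))                   # precision for versicolor: 44/49 ≈ 0.90
print(tp / (tp + fn))                   # recall for versicolor:    44/50 = 0.88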

3) Draw the ROC curve for this model and compute the AUC. Here, do not use Scikit-Learn's LogisticRegression; use the model fitted above with StatsModels.

recall = 44 / (44 + 6)      # TP / (TP + FN), from the versicolor row of the confusion matrix
fallout = 5 / (5 + 45)      # FP / (FP + TN), from the setosa row of the confusion matrix
print("recall =", recall)
print("fallout =", fallout)
recall = 0.88
fallout = 0.1
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(df['type'], result.predict(df))
fpr, tpr, thresholds
(array([0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.02, 0.06, 0.06, 0.1 , 0.2 , 0.22, 0.28, 0.44, 0.6 , 0.68, 0.78,
        0.82, 0.9 , 0.92, 0.98, 1.  ]),
 array([0.  , 0.02, 0.06, 0.12, 0.16, 0.18, 0.22, 0.28, 0.32, 0.48, 0.52,
        0.58, 0.68, 0.78, 0.88, 0.9 , 0.9 , 0.92, 0.94, 0.98, 1.  , 1.  ,
        1.  , 1.  , 1.  , 1.  , 1.  ]),
 array([1.99971161, 0.99971161, 0.99919417, 0.99865337, 0.99775044,
        0.99624436, 0.9937363 , 0.98957085, 0.9826836 , 0.95304918,
        0.92389836, 0.87894726, 0.81282396, 0.72200549, 0.60835381,
        0.48159935, 0.35716977, 0.24942093, 0.16579367, 0.1062368 ,
        0.06637193, 0.04078357, 0.02479825, 0.01498061, 0.00901385,
        0.00541059, 0.00324301]))
plt.plot(fpr, tpr, label="Logistic Regression")
plt.plot([0, 1], [0, 1], 'k--', label="random guess")
plt.plot([fallout], [recall], 'ro', ms=10)
plt.xlabel('False Positive Rate (Fall-Out)')
plt.ylabel('True Positive Rate (Recall)')
plt.title('Receiver operating characteristic example')
plt.legend()
plt.show()

[Figure: ROC curve for the sepal-length logistic regression model]

from sklearn.metrics import auc
auc(fpr, tpr)
0.9326
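The same value can also be obtained in a single call with roc_auc_score, which should match auc(fpr, tpr) computed from roc_curve above:

from sklearn.metrics import roc_auc_score
roc_auc_score(df['type'], result.predict(df))   # ≈ 0.9326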

[Exercise 2]

  • In the iris classification problem, use only the data whose class is versicolor or virginica (versicolor=0, virginica=1), fit a logistic regression model, predict the results, and print the report. Use all independent variables. (Check the results as in Exercise 1.)
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()

df = pd.DataFrame(iris.data, columns=iris.feature_names)
sy = pd.Series(iris.target, dtype="category")
sy = sy.cat.rename_categories(iris.target_names)
df['species'] = sy
df.tail()
     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)    species
145                6.7               3.0                5.2               2.3  virginica
146                6.3               2.5                5.0               1.9  virginica
147                6.5               3.0                5.2               2.0  virginica
148                6.2               3.4                5.4               2.3  virginica
149                5.9               3.0                5.1               1.8  virginica
df.rename(columns={'sepal length (cm)' : "sepal_length", 'sepal width (cm)' : "sepal_width", 
                  'petal length (cm)' : "petal_length", 'petal width (cm)' : "petal_width" }, inplace = True)

df = df[["sepal_length","sepal_width","petal_length","petal_width", 'species']]

df = df.loc[df["species"] != 'setosa']

df['type'] = df['species'].map(lambda x : 0 if x == 'versicolor' else 1)
model = sm.Logit.from_formula("type ~ 1 + sepal_length + sepal_width + petal_length + petal_width", df)
result = model.fit()
print(result.summary())
Optimization terminated successfully.
         Current function value: 0.059493
         Iterations 12
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                   type   No. Observations:                  100
Model:                          Logit   Df Residuals:                       95
Method:                           MLE   Df Model:                            4
Date:                Tue, 04 Dec 2018   Pseudo R-squ.:                  0.9142
Time:                        19:16:23   Log-Likelihood:                -5.9493
converged:                       True   LL-Null:                       -69.315
                                        LLR p-value:                 1.947e-26
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept      -42.6378     25.708     -1.659      0.097     -93.024       7.748
sepal_length    -2.4652      2.394     -1.030      0.303      -7.158       2.228
sepal_width     -6.6809      4.480     -1.491      0.136     -15.461       2.099
petal_length     9.4294      4.737      1.990      0.047       0.145      18.714
petal_width     18.2861      9.743      1.877      0.061      -0.809      37.381
================================================================================

Possibly complete quasi-separation: A fraction 0.60 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
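The quasi-separation warning means that some combination of the four predictors separates a large fraction of the two classes almost perfectly, so the maximum-likelihood coefficients are poorly identified and the standard errors above are inflated. If this becomes a problem, one common workaround, sketched here with an arbitrarily chosen penalty strength, is an L1-penalized fit via fit_regularized:

# A hedged sketch: the same formula-based Logit model, but with an L1 penalty.
# alpha=1.0 is an illustrative value for the penalty strength, not a tuned choice.
result_l1 = model.fit_regularized(method='l1', alpha=1.0)
print(result_l1.params)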
df['prediction'] = result.predict(df).map(lambda x: 1 if x >= 0.5 else 0)   # threshold the predicted probability at 0.5
y_true = df["type"]
y_pred = df['prediction']
confusion_matrix(y_true, y_pred)
array([[49,  1],
       [ 1, 49]], dtype=int64)
recall = 49 / (49 + 1)      # TP / (TP + FN), from the virginica row of the confusion matrix
fallout = 1 / (1 + 49)      # FP / (FP + TN), from the versicolor row of the confusion matrix
print("recall =", recall)
print("fallout =", fallout)
recall = 0.98
fallout = 0.02
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(df['type'], result.predict(df))
fpr, tpr, thresholds
(array([0.  , 0.  , 0.  , 0.  , 0.  , 0.02, 0.02, 0.08, 0.08, 1.  ]),
 array([0.  , 0.02, 0.62, 0.66, 0.92, 0.92, 0.98, 0.98, 1.  , 1.  ]),
 array([2.00000000e+00, 1.00000000e+00, 9.99718850e-01, 9.99613908e-01,
        8.90812327e-01, 8.67629892e-01, 6.69142464e-01, 2.24833801e-01,
        2.04874060e-01, 6.16382618e-11]))
plt.plot(fpr, tpr, label="Logistic Regression")
plt.plot([0, 1], [0, 1], 'k--', label="random guess")
plt.plot([fallout], [recall], 'ro', ms=10)
plt.xlabel('False Positive Rate (Fall-Out)')
plt.ylabel('True Positive Rate (Recall)')
plt.title('Receiver operating characteristic example')
plt.legend()
plt.show()

[Figure: ROC curve for the four-feature logistic regression model]

from sklearn.metrics import auc
auc(fpr, tpr)
0.9972000000000001
Additionally, the same ROC analysis can be done per class in a one-vs-rest fashion; below, a GaussianNB classifier is fitted separately for each of the three iris classes and its ROC curve is drawn.

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.preprocessing import label_binarize

iris = load_iris()
X = iris.data
y = label_binarize(iris.target, classes=[0, 1, 2])   # one 0/1 indicator column per class

fpr = [None] * 3
tpr = [None] * 3
thr = [None] * 3

for i in range(3):
    # fit a one-vs-rest GaussianNB for class i and compute its ROC curve
    model = GaussianNB().fit(X, y[:, i])
    fpr[i], tpr[i], thr[i] = roc_curve(y[:, i], model.predict_proba(X)[:, 1])
    plt.plot(fpr[i], tpr[i])

plt.xlabel('False Positive Rate (Fall-Out)')
plt.ylabel('True Positive Rate (Recall)')
plt.show()

[Figure: one-vs-rest ROC curves for the three iris classes (GaussianNB)]

model.predict_proba(X)[:, 1]
array([1.45863603e-23, 2.28937303e-23, 2.65425827e-24, 3.24649201e-23,
       7.60546022e-24, 2.06632327e-20, 4.44339079e-23, 5.53645131e-23,
       5.66645865e-24, 7.42229928e-24, 5.74011573e-23, 1.12145653e-22,
       1.66193994e-24, 6.56958940e-27, 6.59204705e-25, 1.21559733e-22,
       1.17675519e-22, 1.36052840e-22, 9.69412856e-21, 1.42342424e-22,
       2.15564837e-21, 1.92365623e-21, 9.75337168e-27, 7.09066340e-19,
       4.36218603e-21, 4.03638400e-22, 1.55824557e-20, 7.15669480e-23,
       2.64845499e-23, 1.33665634e-22, 2.06958890e-22, 1.35130179e-20,
       3.41389837e-25, 1.11975711e-24, 7.67915240e-23, 1.60191791e-24,
       1.19100560e-23, 5.52498155e-25, 1.42954649e-24, 7.36113455e-23,
       2.69261234e-23, 7.68614972e-24, 1.11739007e-24, 5.90809615e-19,
       1.67070525e-19, 1.60380228e-22, 5.47609398e-23, 7.56526483e-24,
       4.32640418e-23, 1.94107528e-23, 7.99994821e-01, 4.82779949e-01,
       9.50001463e-01, 4.23773565e-04, 7.10849173e-01, 2.14795419e-02,
       7.58000933e-01, 1.83654537e-08, 2.89814385e-01, 6.70976282e-04,
       2.54583613e-08, 6.71810275e-02, 2.48457433e-05, 3.11374497e-01,
       1.08269632e-04, 3.70369288e-01, 1.14276916e-01, 7.79176035e-05,
       1.74975616e-01, 4.27548146e-05, 8.86530601e-01, 5.38735184e-03,
       7.44087778e-01, 4.33178688e-02, 5.82908896e-02, 3.30032543e-01,
       8.14378171e-01, 9.88627404e-01, 2.91647794e-01, 1.17740619e-06,
       1.38511165e-05, 1.78279579e-06, 3.68651931e-04, 8.79860044e-01,
       6.84244367e-02, 3.20194646e-01, 8.46728621e-01, 3.03017669e-02,
       2.27336269e-03, 7.05645836e-04, 2.00742532e-03, 2.27232225e-01,
       5.89146974e-04, 1.84288908e-08, 3.72735599e-03, 1.54929021e-03,
       5.32597410e-03, 3.44122739e-02, 1.42528886e-08, 3.09955575e-03,
       9.98975444e-01, 9.69765008e-01, 9.99885062e-01, 9.95018313e-01,
       9.99526899e-01, 9.99950707e-01, 6.33093382e-02, 9.99704987e-01,
       9.97969358e-01, 9.99772406e-01, 9.96557269e-01, 9.96291740e-01,
       9.99632738e-01, 9.56545556e-01, 9.92248974e-01, 9.98700764e-01,
       9.96583087e-01, 9.99657780e-01, 9.99945041e-01, 4.23550754e-01,
       9.99814013e-01, 9.46286657e-01, 9.99936310e-01, 9.76036701e-01,
       9.99399284e-01, 9.99546046e-01, 9.59524854e-01, 9.60427661e-01,
       9.99064537e-01, 9.98380974e-01, 9.99885248e-01, 9.99763828e-01,
       9.99281165e-01, 8.81093847e-01, 7.70448744e-01, 9.99984976e-01,
       9.98404067e-01, 9.95055174e-01, 9.29605897e-01, 9.99639057e-01,
       9.99711325e-01, 9.99548271e-01, 9.69765008e-01, 9.99782675e-01,
       9.99610167e-01, 9.99446025e-01, 9.86475233e-01, 9.97890340e-01,
       9.97033527e-01, 9.61151009e-01])
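As a follow-up to the one-vs-rest curves above (my addition, not in the original), the per-class AUC values can be computed from the fpr/tpr arrays already stored in the loop:

for i, name in enumerate(iris.target_names):
    print(name, auc(fpr[i], tpr[i]))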