
[Deep Learning - Outside the Course] DNN - [Hands-on] Heart Disease

by 육츠 2024. 8. 1.

Dataset

https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset

 


 

Loading Packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Keras imports are taken from tensorflow.keras throughout for consistency
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score, precision_score

 

Hyperparameter Settings

INPUT_DIM = 13   # number of feature columns
MY_EPOCH = 100   # maximum number of training epochs
MY_BATCH = 32    # mini-batch size
MY_SPLIT = 0.4   # fraction held out for validation + test

 

Loading the Data

data = pd.read_excel('./dataset/heart.xls')
print(data.shape)
display(data.head())
data.describe()
data.info()
print(data.isna().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB

 

Data Scaling - Standardization

from sklearn.preprocessing import StandardScaler

X = data.drop('target', axis = 1)
y = data['target']

scaler = StandardScaler()
scaled_data = scaler.fit_transform(X)
scaled_data = pd.DataFrame(scaled_data, columns= X.columns)
print(scaled_data.describe())

boxplot = scaled_data.boxplot(figsize = (10,7), showmeans = True)
plt.show()

Box plot of the standardized features
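As a quick sanity check (my own sketch, not in the original post): StandardScaler standardizes each column as z = (x - mean) / std, using the population standard deviation (ddof=0), so its output can be reproduced by hand:

# reproduce StandardScaler manually: z = (x - mean) / std (population std, ddof=0)
manual = (X - X.mean()) / X.std(ddof=0)
print(np.allclose(manual.values, scaled_data.values))  # True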

 

Train/Validation/Test Split

The data is split twice: first 60/40 into a training set and a temporary holdout, then the holdout is split in half into validation and test sets.

X_train, X_test, y_train, y_test = train_test_split(scaled_data, y, test_size = MY_SPLIT, random_state = 10)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

X_val, X_test, y_val, y_test = train_test_split(X_test,y_test, test_size = 0.5, random_state = 0)
X_val.shape, X_test.shape, y_val.shape, y_test.shape
((61, 13), (61, 13), (61,), (61,))
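Only the shapes from the second split are shown above. The first split holds out ceil(303 * 0.4) = 122 rows, leaving 181 for training; the 122 rows are then halved into 61 validation and 61 test rows, roughly a 60/20/20 split:

# full picture of all three subsets (assuming the random seeds above)
print(X_train.shape, X_val.shape, X_test.shape)
# (181, 13) (61, 13) (61, 13)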

 

Building the Model

Output Shape (None, 1000): None means the batch size is not fixed when the model is built.

from tensorflow.keras.layers import Dropout
from tensorflow.keras import regularizers

model = Sequential()
model.add(Dense(1000, activation='tanh', input_dim=INPUT_DIM, kernel_regularizer=regularizers.l2(0.02)))
model.add(Dense(1000, activation='tanh', kernel_regularizer=regularizers.l2(0.1)))
model.add(Dropout(rate=0.5))               # randomly drop half of the units during training
model.add(Dense(1, activation='sigmoid'))  # single sigmoid unit for binary classification
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 1000)              14000     
                                                                 
 dense_1 (Dense)             (None, 1000)              1001000   
                                                                 
 dropout (Dropout)           (None, 1000)              0         
                                                                 
 dense_2 (Dense)             (None, 1)                 1001      
                                                                 
=================================================================
Total params: 1016001 (3.88 MB)
Trainable params: 1016001 (3.88 MB)
Non-trainable params: 0 (0.00 Byte)
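The parameter counts in the summary can be verified by hand: a Dense layer has (inputs + 1) * units parameters, the extra 1 being the bias term.

# Dense layer parameter count: (inputs + 1) * units
print((13 + 1) * 1000)    # 14000   -> dense
print((1000 + 1) * 1000)  # 1001000 -> dense_1
print((1000 + 1) * 1)     # 1001    -> dense_2, total 1016001 as in the summary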

 

Compiling and Training the Model

from tensorflow.keras.callbacks import TensorBoard
import datetime

# log_dir: directory where the TensorBoard logs are written (the path must not contain Korean characters)
log_dir = 'c:\\Logs\\' + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard = TensorBoard(log_dir=log_dir, histogram_freq=1)

# stop training once val_loss has failed to improve for 3 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', mode='min', patience=3)
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=MY_BATCH, epochs=MY_EPOCH,
          validation_data=(X_val, y_val), verbose=1,
          callbacks=[tensorboard, early_stop])
model.save('heart-disease.h5')
Epoch 1/100
6/6 [==============================] - 0s 35ms/step - loss: 75.1489 - accuracy: 0.7403 - val_loss: 51.0454 - val_accuracy: 0.7869
Epoch 2/100
6/6 [==============================] - 0s 20ms/step - loss: 42.6342 - accuracy: 0.8398 - val_loss: 31.9832 - val_accuracy: 0.7869

(output truncated)

Training ran 20 of the 100 epochs and then stopped early, because val_loss had not decreased for 3 consecutive epochs.
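To confirm exactly where early stopping kicked in, the return value of model.fit (a History object) can be captured; this is a sketch assuming the same fit call as above, with history a name introduced here, not in the original post. Passing restore_best_weights=True to EarlyStopping would additionally roll the model back to its best epoch, and the logs can be browsed by running tensorboard --logdir c:\Logs in a shell.

# sketch: capture the History object returned by model.fit
history = model.fit(X_train, y_train, batch_size=MY_BATCH, epochs=MY_EPOCH,
                    validation_data=(X_val, y_val), verbose=0,
                    callbacks=[tensorboard, early_stop])
print(len(history.history['val_loss']))   # number of epochs actually run (20 in the original run)
print(min(history.history['val_loss']))   # best validation loss observed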

 

Prediction and Model Evaluation

y_pred_prob = model.predict(X_test)   # predicted probabilities
y_pred = (y_pred_prob > 0.5)          # threshold at 0.5 to get class labels
print('\n == CONFUSION MATRIX ==')
print(confusion_matrix(y_test, y_pred))

score = model.evaluate(X_test, y_test, verbose=1)
print('Loss: ', score[0])
print('Accuracy: ', score[1])
print('Precision: ', precision_score(y_test, y_pred))
print('AUC: ', roc_auc_score(y_test, y_pred_prob))   # AUC is computed from the raw probabilities
2/2 [==============================] - 0s 1ms/step

 == CONFUSION MATRIX ==
[[18 13]
 [ 1 29]]
2/2 [==============================] - 0s 2ms/step - loss: 0.5433 - accuracy: 0.7705
Loss:  0.5433093309402466
Accuracy:  0.7704917788505554
Precision:  0.6904761904761905
AUC:  0.9193548387096774
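The precision figure can be read straight off the confusion matrix: 29 true positives and 13 false positives give 29 / (29 + 13) ≈ 0.69, while recall is 29 / 30 ≈ 0.97, so the model catches almost every positive case at the cost of a fair number of false alarms. A quick check (my sketch):

from sklearn.metrics import recall_score

# unpack the confusion matrix: [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(tp / (tp + fp))                 # 0.6904... matches the precision above
print(recall_score(y_test, y_pred))   # 0.9666... = 29 / 30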