๋™์•„๋ฆฌ,ํ•™ํšŒ/GDGoC

[AI ์Šคํ„ฐ๋””] Section 5 : ์ „ํ†ต์ ์ธ ๋จธ์‹ ๋Ÿฌ๋‹ - ์ง€๋„ํ•™์Šต ๋ชจ๋ธ part 2

egahyun 2024. 12. 26. 05:31

Logistic Regression

Introduction to Logistic Regression

: Despite the name "regression," it solves classification problems → specifically binary classification.

๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€ ๋ถ„๋ฅ˜๊ธฐ

01. ๋ฐฉ๋ฒ•

: ๋…๋ฆฝ๋ณ€์ˆ˜์™€ ์ข…์†๋ณ€์ˆ˜ ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ์ฐพ์•„๋‚ด๊ณ  ํ‰๊ฐ€

  1. ํšŒ๊ท€ ์‹์„ ๊ตฌํ•ด์„œ ๊ฐ’์„ ์˜ˆ์ธก ⇒ ํ™•๋ฅ ๊ฐ’์„ ๋ฐ˜ํ™˜
  2. 0.5 ๋ณด๋‹ค ์ž‘์œผ๋ฉด 0. 0.5 ๋ณด๋‹ค ํฌ๋ฉด 1๋กœ ๋ถ„๋ฅ˜

02. ํŠน์ง• ๋ฐ ํ™œ์šฉ

(1) ๊ตฌํ˜„์ด ๊ฐ„๋‹จ

(2) ๋ชจ๋“  ๋ถ„๋ฅ˜ ๋ฌธ์ œ์˜ ๊ธฐ์ดˆ๊ฐ€ ๋จ

(3) ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€์˜ ๊ธฐ๋ณธ ๊ฐœ๋…์ด ๋”ฅ๋Ÿฌ๋‹์— ์ ์šฉ๋จ

(4) ๋ฌธ์ œ์ 

  1. ๋ถ„๋ฅ˜ ๋ฌธ์ œ์— ์ ํ•ฉํ•œ ํ•จ์ˆ˜์ธ๊ฐ€
  2. ๋ชจํ˜ธํ•œ ๊ฒฝ๊ณ„ ⇒ ํ•ด๊ฒฐ : ์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜

03. ์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜ (= ๋กœ์ง€์Šคํ‹ฑ ํ•จ์ˆ˜)

(1) ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€์— ์ ํ•ฉํ•œ ์ด์œ 

  1. ๊ฒฝ๊ณ„๊ฐ€ ๋ช…ํ™•ํžˆ ๋ณด์—ฌ์ง
  2. 0 ~ 1 ์‚ฌ์ด์˜ ๊ฐ’์„ ๊ฐ€์ง€๋ฏ€๋กœ ํ™•๋ฅ  ๊ฐ’์œผ๋กœ ํ•ด์„ ๊ฐ€๋Šฅํ•จ
  3. ๋ฏธ๋ถ„ ๊ฐ€๋Šฅํ•จ

(2) ๊ณต์‹

$$ f(z) = \frac{1}{1+e^{-z}} $$

  • z : logit = $\theta X$⇒ ๊ทธ๋ž˜์„œ ์ด๋ฆ„์ด Logistic regression
  • ⇒ logit์•ˆ์— regression์ด ๋“ค์–ด๊ฐ€์žˆ๋Š” ํ˜•ํƒœ (wx +b)

04. ๋ชจ๋ธ ์‚ฌ์šฉ ์ฝ”๋“œ

: sklearn.linear_model.LogisticRegression(solver=‘lbfgs’)

→ solver : optimization algorithm

๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€ ์‹ค์Šต : ์ด์ง„๋ถ„๋ฅ˜

๋ฐ์ดํ„ฐ ์†Œ๊ฐœ

  1. ์นผ๋Ÿผ : ์‚ฌ์šฉ์ž id, ์„ฑ๋ณ„, ์—ฐ๋ น, ์ถ”์ •๊ธ‰์—ฌ, ๊ตฌ๋งค์—ฌ๋ถ€
  2. ์˜ˆ์ธก : ํŠน์ • ์‚ฌ์šฉ์ž๊ฐ€ ๊ตฌ๋งค ํ• ์ง€ ์—ฌ๋ถ€ → ๊ตฌ๋งค: 1, ๊ตฌ๋งค ์•Š์Œ: 0

     User ID   Gender  Age  EstimatedSalary  Purchased
395  15691863  Female  46   41000
396  15706071  Male    51   23000
397  15654296  Female  50   20000
398  15755018  Male    36   33000
399  15594041  Female  49   36000

1. ๋ฐ์ดํ„ฐ ํ™•์ธ ๋ฐ ๋ณ€์ˆ˜ ์„ค์ •

# ๋ฐ์ดํ„ฐ ํ™•์ธ
df.head() # ์ฒ˜์Œ๋ถ€ํ„ฐ ํ™•์ธ
df.tail() # ๋์—์„œ ํ™•์ธ
df['purchased'].value_counts() # ์˜ˆ์ธก ๋ ˆ์ด๋ธ”์ด ํŽธํ–ฅ ๋˜์—ˆ๋Š”์ง€ ํ™•์ธํ•˜๊ธฐ ์œ„ํ•ด
# 50 50 ์œผ๋กœ ๋˜์–ด์žˆ์œผ๋ฉด ์ข‹๊ฒ ์ง€๋งŒ, 2๋Œ€ 1 ์ •๋„๋ฉด ๋งŽ์ด ํŽธํ–ฅ ๋œ๊ฒŒ ์•„๋‹˜

# ๋ฐ์ดํ„ฐ ๋ถ„๋ฅ˜
# int๋กœ ๋„ฃ์–ด์ค˜๋„ ๋˜์ง€๋งŒ float ํ˜•ํƒœ๊ฐ€ ์ผ๋ฐ˜์ ์ด๋ฏ€๋กœ ๋ณ€ํ™˜
X = df.iloc[:, [2,3]].values.astype("float32") # age, estimatedSalary๋งŒ ์‚ฌ์šฉ
y = df.iloc[:, 4].values.astype("float32") # purchased

X.shape, y.shape # 400,2 / 400,

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape  # (320, 2), (80, 2), (320,), (80,)

2. Feature Scaling

# Age is on the order of tens, EstimatedSalary on the order of tens of thousands
# => with such different scales the larger numbers dominate, so scaling is needed
sc = StandardScaler()                # create an instance
X_train = sc.fit_transform(X_train)  # fit and transform (they can also be done separately)
# The test set must only be transformed:
# -> fitting again would recompute the mean and std on that set,
#    i.e., use future data in advance, which must not happen
X_test = sc.transform(X_test)

print(X_train.shape)  # (320, 2)

3. ๋ชจ๋ธ

# fit
lr_clf = LogisticRegression(solver='lbfgs', random_state=0)
lr_clf.fit(X_train, y_train)
# predict: returns the predicted class
y_pred = lr_clf.predict(X_test)

# evaluation
print("actual number of true labels in the test set = ", sum(y_test))  # 22
print("number of true labels predicted by the model = ", sum(y_pred))  # 18
print("accuracy = {:.2f}".format(accuracy_score(y_test, y_pred)))      # 0.925
print("precision = {:.2f}".format(precision_score(y_test, y_pred)))    # 0.94
print("recall = {:.2f}".format(recall_score(y_test, y_pred)))          # 0.77
# predict_proba: returns the probability of each class
y_pred_proba = lr_clf.predict_proba(X_test)

print(y_pred_proba[:, 1][:5])  # [0.12602436 0.17691062 0.2077208  0.10091478 0.10701443]

# predict๋Š” 0.5๋ฅผ threshold๋กœ ํ–ˆ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋Š” ์ฝ”๋“œ
sum((y_pred_proba[:, 1] > 0.5) == y_test)/len(y_test) # predict์˜ accuracy๊ฐ’๊ณผ ๋™์ผ
sum((y_pred_proba[:, 1] > 0.5) == y_test) # ๋ช‡๊ฐœ๋ฅผ ๋งž์ท„๋Š”์ง€ ๊ฐœ์ˆ˜ ํ™•์ธ

# threshold = 0.5 -> ๋ณดํ†ต ๊ธฐ๋ณธ์ ์œผ๋กœ ์ด ๊ฐ’์ด ์ง€์ •๋จ
threshold = 0.5
y_pred_proba = lr_clf.predict_proba(X_test)
y_pred_proba_1 = y_pred_proba[:, 1] > threshold
sum(y_pred_proba_1 == y_test) / len(y_test) 

# threshold = 0.4
threshold = 0.4
y_pred_1 = y_pred_proba[:, 1] > threshold
# Lowering the threshold: accuracy down, precision down, recall up
print("number of samples classified as 1 at threshold {}: ".format(threshold), sum(y_pred_1))
print("precision = {:.2f}".format(precision_score(y_test, y_pred_1)))
print("recall = {:.2f}".format(recall_score(y_test, y_pred_1)))
print("f1 score = ", f1_score(y_test, y_pred_1))

# threshold = 0.6
threshold = 0.6
y_pred_2 = y_pred_proba[:, 1] > threshold
# Raising the threshold: accuracy down, precision up, recall down
print("number of samples classified as 1 at threshold {}: ".format(threshold), sum(y_pred_2))
print("precision = {:.2f}".format(precision_score(y_test, y_pred_2)))
print("recall = {:.2f}".format(recall_score(y_test, y_pred_2)))
print("f1 score = ", f1_score(y_test, y_pred_2))

04. ํ‰๊ฐ€

# ํ˜ผ๋™ํ–‰๋ ฌ
cm  = confusion_matrix(y_test, y_pred, labels=[1, 0])

ax = sns.heatmap(cm, annot=True, fmt='d')
ax.set_ylabel('true') # ๊ฐ€๋กœ ์„ธ๋กœ ์ฃผ์˜
ax.set_xlabel('predicted')
ax.set_title('Confusion Matirx')

# ROC curve
y_proba = lr_clf.predict_proba(X_test)
y_scores = y_proba[:, 1]  # probability of class 1 for every sample

# roc_curve also returns the thresholds; ignore them with _
fpr, tpr, _ = roc_curve(y_test, y_scores)  # fpr: false positive rate, tpr: true positive rate
auc = roc_auc_score(y_test, y_scores)

plt.plot(fpr, tpr, label="auc=" + "{:.2f}".format(auc))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()

Ensemble Learning

What Is Ensemble Learning?

: Combine multiple weak learners to obtain higher performance than any single learner.

์•™์ƒ๋ธ” ํ•™์Šต ๋ฐฉ๋ฒ•

01. Bagging (๋ฐฐ๊น…) (Bootstrap Aggregating)

(1) ๋ฐฐ๊น…์ด๋ž€? : ๋ณต์› ์ถ”์ถœ์„ ํ†ตํ•ด ์—ฌ๋Ÿฌ ๋ชจ๋ธ์„ ๊ฒฐํ•ฉํ•˜์—ฌ ๋‹จ์ผ ๋ชจ๋ธ๋ณด๋‹ค ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋‚ด๋Š” ๊ฒƒ

 

(2) ๋ถ„๋ฅ˜๊ธฐ ํ›ˆ๋ จ๋ฐฉ๋ฒ• : ํŠธ๋ ˆ์ด๋‹ ์ƒ˜ํ”Œ์˜ subset์„ ๋ฌด์ž‘์œ„ ์ถ”์ถœ ํ›„, ๋ถ„๋ฅ˜๊ธฐ ํ›ˆ๋ จ

 

(3) ํšจ๊ณผ : ํŽธํ–ฅ(๋ชจ๋ธ ๋ณต์žก๋„)์ด ์ปค์ง€์ง€ ์•Š์œผ๋ฉด์„œ ๋ถ„์‚ฐ์„ ์ค„์ผ ์ˆ˜ ์žˆ์Œ

    → ์ด์œ  : ๋žœ๋ค ์ƒ˜ํ”Œ๋ง์— ์˜ํ•œ ๋ชจ๋ธ fit

 

(4) Bootstrap (sampling with replacement): draw samples so that a previously drawn sample can be drawn again

    → Sampling without replacement: draw samples so that none are duplicated

    → Effect (when averaging B models)

  • variance decreases to $\frac{\sigma^2}{B}$
  • mean stays the same

(5) The bagging tree algorithm (a minimal sketch follows this list)

  • Randomly sample X_b, y_b from the training data (e.g., sample 100 out of 1,000 data points)
  • Build a decision tree from X_b, y_b using the ID3 algorithm
  • Repeat steps 1 and 2 until B trees have been built
  • Classify with all B trees and decide by majority vote
  • This keeps the bias while reducing the variance.

(6) ๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ

  • ๋ฌธ์ œ์  : ๋ช‡ ๊ฐœ์˜ ํŠธ๋ฆฌ๋ฅผ ๋งŒ๋“ค๋“  ์ฒซ ๋ฒˆ์งธ ๋…ธ๋“œ์—๋Š” ๊ฐ™์€ ์—ด์ด ์‚ฌ์šฉ๋  ๊ฒƒ์ž„ ⇒ ๋‚˜๋ฌด๊ฐ€ ๋‹ค ๋น„์Šทํ•  ๊ฒƒ, ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์— ๋„์›€์ด ๋˜์ง€ ์•Š์Œ
  • ํ•ด๊ฒฐ ๋ฐฉ๋ฒ• : Decision Tree ์— ํฌํ•จ๋  attribute ๋“ค์„ random ํ•˜๊ฒŒ ์ผ๋ถ€๋งŒ ์„ ์ •
    โž” ๋ชจ๋“  attribute ๋ฅผ ๊ฐ€์ง€๊ณ  Tree ๋ฅผ ๋งŒ๋“ค ๊ฒฝ์šฐ ๋งค์šฐ ๊ฐ•ํ•œ attribute ๊ฐ€ ๋ชจ๋“  tree ์— ํ•ญ์ƒ ํฌํ•จ๋จ
  •  ์žฅ์ 
    • ๋” Random ํ•˜๊ณ  ๋…๋ฆฝ์ ์ธ classifier ์ƒ์„ฑ๊ฐ€๋Šฅ
    • ๋‹ค์–‘ํ•œ ๋‹ค๋ฅธ ํŠน์„ฑ์˜ ๋‚˜๋ฌด๋“ค์ด ์ƒ์„ฑ๋˜๋ฏ€๋กœ ์ด ๊ฒฐ๊ณผ๋ฅผ voting ํ•ด์„œ ์ตœ์ข…์„ ๋‚ด๋ฉด ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ƒ„
    • Tree-based model ์ด๋ฏ€๋กœ white box ํŠน์ง•์„ ์œ ์ง€
    • High prediction accuracy
    • ๋ณ‘๋ ฌ์ ์œผ๋กœ ์ƒ์„ฑ ๊ฐ€๋Šฅํ•˜๋ฏ€๋กœ ์†๋„๊ฐ€ ๋น ๋ฆ„
    • ๊ฐ tree ๋Š” ๋งค์šฐ deep ํ•˜๊ฒŒ ์ƒ์„ฑ

02. Boosting

(1) ๋ถ€์ŠคํŒ… ๋ฐฉ๋ฒ• 2๊ฐ€์ง€

  • ์ž˜๋ชป ๋ถ„๋ฅ˜๋œ ๋ฐ์ดํ„ฐ์— ๋” ๋†’์€ ๊ฐ€์ค‘์น˜๋ฅผ ๋ถ€์—ฌํ•˜์—ฌ ๋‹ค์Œ ๋ชจ๋ธ์˜ ์ƒ˜ํ”Œ๋ง์— ํฌํ•จ๋  ํ™•๋ฅ ์„ ๋†’์ด๋Š” ๋ฐฉ๋ฒ•
  • ๋ชจ๋ธ ์ž์ฒด๋ฅผ ๋ถ€์ŠคํŒ… : ๋‹ค์ˆ˜์˜ ์•ฝํ•œ ํ•™์Šต๊ธฐ๋ฅผ ํ›ˆ๋ จ์‹œ์ผœ majority voting (→ ๋ฐฐ๊น…๊ณผ ๋™์ผํ•œ ๋ฐฉ๋ฒ•)

(2) ๋Œ€ํ‘œ์ ์ธ ๋ฉ”์†Œ๋“œ : AdaBoost, Gradient Boost (XGBoost)

 

(3) Gradient Boost (XGBoost)

  • Method: boost the model using weak learners (a residual-fitting sketch follows this list)
    • Weak Learner: a model that performs only slightly better than random guessing
      ⇒ decision trees are used as the weak learners
    • Keep adding trees that reduce the residuals left by the existing weak learners, until a new tree no longer reduces them
    • The residuals produced by the previous decision tree are used as the labels for the inputs of the next decision tree
    • First tree: produces the residuals / later trees: trained to predict how much residual remains
      ⇒ summing the outputs of all the trees recovers the value of y
      ⇒ as more trees are created, the residual approaches 0
  • Optimization: the loss function is optimized by gradient descent
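A minimal sketch of the residual-fitting idea for regression, using small DecisionTreeRegressor trees; this illustrates the concept only and is not scikit-learn's actual implementation:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1):
    base = y.mean()                   # initial prediction: the mean of y
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residual = y - pred           # what the current ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)  # fit the residual
        pred += learning_rate * tree.predict(X)                     # add the new tree's contribution
        trees.append(tree)
    return base, trees

def gradient_boost_predict(X, base, trees, learning_rate=0.1):
    # the sum of all tree outputs (plus the base value) approximates y
    return base + learning_rate * sum(t.predict(X) for t in trees)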

(4) AdaBoost (Adaptive Boosting)

  • Method: boosts the data (a scikit-learn sketch follows this list)
  • The first model created is a stump (a single-split tree with two leaves, i.e., a weak, inaccurate binary classifier).
    Data misclassified by the previous stump get a higher weight, so they are more likely to be included in the next sampling step.
    ⇒ The classifications from all the models are combined to produce the final result.

๋žœ๋คํฌ๋ ˆ์ŠคํŠธ & gradient boosting ์‹ค์Šต

๋ฐ์ดํ„ฐ ์†Œ๊ฐœ

  1. ์นผ๋Ÿผ : ์‚ฌ์šฉ์ž id, ์„ฑ๋ณ„, ์—ฐ๋ น, ์ถ”์ •๊ธ‰์—ฌ, ๊ตฌ๋งค์—ฌ๋ถ€
  2. ์˜ˆ์ธก : ํŠน์ • ์‚ฌ์šฉ์ž๊ฐ€ ๊ตฌ๋งค ํ• ์ง€ ์—ฌ๋ถ€ → ๊ตฌ๋งค: 1, ๊ตฌ๋งค ์•Š์Œ: 0
User ID   Gender  Age  EstimatedSalary
15706071  Male    51   23000
15654296  Female  50   20000
15755018  Male    36   33000
15594041  Female  49   36000

1. ๋ณ€์ˆ˜ ์„ค์ •

X = df.iloc[:, [2,3]].values.astype("float32") # age, estimatedSalary
y = df.iloc[:, 4].values.astype("float32") # purchased

# train, test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

2. feature scaling

# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test  = sc.transform(X_test)

3. Random Forest model

from sklearn.ensemble import RandomForestClassifier

# fit
# bagging with 10 trees
rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X_train, y_train)

# predict
y_pred = rf.predict(X_test)

# accuracy
accuracy_score(y_test, y_pred)
print("accuracy = {:.2f}".format(sum(y_pred == y_test) / len(y_test)))

# confusion matrix
print("confusion matrix\n", 
      confusion_matrix(y_test, y_pred, labels=[1, 0]))
# f1 score
print("f1 score\n", f1_score(y_test, y_pred, labels=[1, 0]))

5. Gradient Boosting Classifier

- min_samples_split : minimum number of samples required to split a node => prevents overfitting
- max_depth : controls the tree depth => prevents overfitting
- learning_rate : scales each tree's contribution; trades off against n_estimators
- n_estimators : number of sequential trees

from sklearn.ensemble import GradientBoostingClassifier
# Model
# Each tree is chained to predict the errors left by the trees before it,
# so many trees are needed -> 500
# Because the model chains weak, inaccurate trees that each predict the previous residuals,
# keep max_depth small
gb = GradientBoostingClassifier(learning_rate=0.1, n_estimators=500, max_depth=5)

# fit
gb.fit(X_train, y_train)

# predict
y_pred = gb.predict(X_test)

# ๊ฒฐ๊ณผ ํ™•์ธ
print("Test set true counts = ", sum(y_test))
print("predicted true counts = ", sum(y_pred))
print("accuracy = {:.2f}".format(
            sum(y_pred == y_test) / len(y_test)))
            
# ํ˜ผ๋™ํ–‰๋ ฌ
print("confution matrix\\n", 
      confusion_matrix(y_test, y_pred, labels=[1, 0]))

# f1 score
print("f1 score\\n", f1_score(y_test, y_pred, labels=[1, 0]))

6. ํ›ˆ๋ จ ๊ฒฐ๊ณผ ์‹œ๊ฐํ™”

์‹œ๊ฐํ™”๋ฅผ ์œ„ํ•œ ๊ฐ€์ƒ์˜ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ

# ๊ฐ€์ƒ์˜ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ : ์ „์ฒด ๋ ˆ์ฝ”๋“œ์— ๋Œ€ํ•ด์„œ ์ตœ์†Œ ์ตœ๋Œ€ ์‚ฌ์šฉ
x1_min, x1_max = X_test[:, 0].min() - 1, X_test[:, 0].max() + 1       
x2_min, x2_max = X_test[:, 1].min() - 1, X_test[:, 1].max() + 1  

# x1, x2๊ฐ€ ์„œ๋กœ ๊ต์ฐจ๋˜๋Š” ์  ๋งˆ๋‹ค ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•ด์คŒ -> meshgrid
X1, X2 = np.meshgrid(np.arange(x1_min, x1_max, 0.1), 
                     np.arange(x2_min, x2_max, 0.1))
X1.shape, X2.shape # (60, 61) / (60, 61)

# X1, X2๋ฅผ 1์ฐจ์›์œผ๋กœ ํŽผ์ณ์ค˜์„œ ๊ฐ ์ถ•์˜ ๊ฐ’์„ ๋งŒ๋“ฌ                                 
XX = np.column_stack([X1.ravel(), X2.ravel()])
XX.shape # (3360,2)๋งŒํผ์˜ ๊ฐ€์ƒ๋ฐ์ดํ„ฐ๊ฐ€ ์ƒ์„ฑ๋จ

๊ฐ€์ƒ์˜ ๋ฐ์ดํ„ฐ๋กœ ๋ชจ๋ธ ์˜ˆ์ธก

# random forest๋กœ ์˜ˆ์ธกํ•œ ๊ฐ’
Y_rf = np.array(rf.predict(XX))
# gradient boost๋กœ ์˜ˆ์ธกํ•œ ๊ฐ’
Y_gb = np.array(gb.predict(XX))

์‹œ๊ฐํ™”

# ๋‘๊ฐ€์ง€ ์ƒ‰ ๋งŒ๋“ฌ
from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA']) # ์—ฐํ•œ ๊ฐ’
cmap_bold = ListedColormap(['#FF0000', '#00FF00'])  # ์ง„ํ•œ ๊ฐ’  

# ์‹œ๊ฐํ™”
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Random Forest ์‹œ๊ฐํ™”
# ๊ฒฐ์ •๊ฒฝ๊ณ„ ์ƒ์„ฑ
ax1.pcolormesh(X1, X2, Y_rf.reshape(X1.shape),cmap=cmap_light, shading='auto') 
# y_test๊ฐ€ 0์ผ๋•Œ, 1์ผ๋•Œ ๋‚˜๋ˆ ์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ ์œผ๋กœ ํ‘œํ˜„ -> ์ƒ‰์„ ๋‹ค๋ฅด๊ฒŒ ๊ทธ๋ฆฌ๋„๋ก ํ•จ
for i in range(2):
    ax1.scatter(X_test[y_test == i, 0], X_test[y_test == i, 1], s=20, color=cmap_bold(i), label=i, edgecolor='k')
ax1.set_title('Random Forest')
ax1.set_xlabel('Age')
ax1.set_ylabel('Estimated Salary')
ax1.legend()

# Gradient Boosting plot
ax2.pcolormesh(X1, X2, Y_gb.reshape(X1.shape), cmap=cmap_light, shading='auto') 
for i in range(2):
    ax2.scatter(X_test[y_test == i, 0], X_test[y_test == i, 1], s=20, color=cmap_bold(i), label=i, edgecolor='k')
ax2.set_title('Gradient Boosting')
ax2.set_xlabel('Age')
ax2.legend()
plt.tight_layout()

# Importance of Age and EstimatedSalary
feature_imp = pd.Series(gb.feature_importances_, 
            ['Age', 'EstimatedSalary']).sort_values(ascending=False)
feature_imp.plot(kind='bar', title='feature importance')

 


Feature Engineering

์ข‹์€ ํ”ผ์ฒ˜์˜ ์กฐ๊ฑด

  1. Target ๊ณผ์˜ ๋†’์€ ๊ด€๋ จ์„ฑ
  2. prediction ์‹œ์ ์— ์•Œ ์ˆ˜ ์žˆ์Œ
    • ์›” ์ค‘์—๋Š” ์•Œ ์ˆ˜ ์—†๊ณ , ์›”๋ง์—๋งŒ ์•Œ ์ˆ˜ ์žˆ๋Š” ๋ฐ์ดํ„ฐ๋“ค → ex. sales data ๋Š” ์ต์›”์— ์ง‘๊ณ„
    • ์•„์ง ๋ฐœ์ƒํ•˜์ง€๋„ ์•Š์€ ๋ฐ์ดํ„ฐ๋กœ ํ›ˆ๋ จํ•˜๊ฒŒ ๋˜๋Š” ์ƒํ™ฉ
  3. numeric
  4. ์ถฉ๋ถ„ํ•œ ๋ฐ์ดํ„ฐ ์ˆ˜
  5. ์ธ๊ฐ„ ์ „๋ฌธ๊ฐ€์˜ ๋„๋ฉ”์ธ ์ง€์‹ ํ™œ์šฉ ๊ฐ€๋Šฅ : ๊ด€์ ์˜ ์ฐจ์ด๊ฐ€ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Œ
    • Example ) ๊ตํ†ต ์ •๋ณด ๋ฐ์ดํ„ฐ : ๋ฐœ์ƒ ์‹œ์ 
    • → ์š”์ผ๋ณ„๋กœ ๋‚˜๋ˆ„๋Š”๊ฒŒ ์ข‹์„ ๊ฑฐ์•ผ! ๋ผ๋Š” ์ง€์‹์ด ์žˆ์„ ๊ฒฝ์šฐ, ์˜ˆ์ธก์— ๋„์›€์ด ๋  ์ˆ˜๋„ ์žˆ์Œ

๋จธ์‹ ๋Ÿฌ๋‹์„ ์œ„ํ•œ ํ”ผ์ฒ˜ ์—”์ง€๋‹ˆ์–ด๋ง

  1. ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ
  2. Data formatting : ๋ฐ์ดํ„ฐ ํ†ต์ผ
  3. ํŽธํ–ฅ ์ฒ˜๋ฆฌ
  4. Normalization (= feature scaling) : scale์ด ๋‹ค๋ฅผ ๋•Œ, ๋น„์Šทํ•œ ์Šค์ผ€์ผ๋กœ ํ•ด์ฃผ๋Š”๊ฒŒ ์˜ˆ์ธก์ด ๋” ์ž˜๋จ
  5. Binning : ์—ฐ์†๋œ ์ˆซ์ž๋ฅผ ์–ด๋– ํ•œ ๊ธฐ์ค€์œผ๋กœ ๊ทธ๋ฃนํ•‘ ํ•˜๊ธฐ ์œ„ํ•ด ๋ฒ”์œ„๋ฅผ ์ •ํ•ด์ฃผ๋Š” ๊ฒƒ
  6. Categorical ๋ณ€์ˆ˜ → ์ˆ˜์น˜ํ˜•์œผ๋กœ
    1. ordinal category (์ˆœ์„œ/ํฌ๊ธฐ๊ฐ€ ์žˆ๋Š” feature)
      • ์ˆซ์ž๋กœ ํฌ๊ธฐ(์ˆœ์„œ) ํ‘œ์‹œ
      • ex) L > M > S → 3, 2, 1
    2. nominal category (์ˆœ์„œ/ํฌ๊ธฐ๊ฐ€ ์—†๋Š” feature)
      • ์ด๋Ÿฐ ๊ฒฝ์šฐ๊ฐ€ ํ›จ์”ฌ ๋งŽ์Œ
      • ๋ฐฉ๋ฒ• : one-hot encoding

Feature Engineering & Random Forest Practice

Dataset: the Titanic data → records of whether each Titanic passenger survived

01. ๋ฐ์ดํ„ฐ ํ™•์ธ

# ํ•„์š”์—†๋Š” ์นผ๋Ÿผ drop
df.drop(['PassengerId', 'Name', 'Cabin', 'Ticket'], axis=1, inplace=True)

# ๋ฐ์ดํ„ฐ ํ™•์ธ
df_titanic.info()
df_titanic.isnull().sum()
df_titanic.describe()

02. Feature Analysis

→ Check the correlations between features, and between Survived and each feature

# How strongly are the features related to each other?
# negative: inverse correlation, 0: almost no relationship, positive: positive correlation
# correlation = the covariance (cov) normalized to the range [-1, 1]
df_titanic.corr(numeric_only=True)

# Visualize the correlations
g = sns.heatmap(df_titanic.corr(numeric_only=True), annot=True, cmap="coolwarm")

03. ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ

# Age -> median์œผ ์ฒ˜๋ฆฌ : ์—†์• ๊ธฐ์—” ๋„ˆ๋ฌด ๋งŽ๊ธฐ ๋•Œ๋ฌธ
df_titanic['Age'].fillna(df_titanic['Age'].median(), inplace=True)

# Embarked -> ์–ผ๋งˆ ์—†์–ด์„œ ์‚ญ์ œ
df_titanic.dropna(inplace=True)

04. ํŽธํ–ฅ ํ™•์ธ

# ํžˆ์Šคํ† ๊ทธ๋žจ์„ ์ด์šฉํ•ด ๋ถ„ํฌ๋ฅผ ํ™•์ธํ•ด ํŽธํ–ฅ ํ™•์ธ 
df_titanic.hist(bins=30, figsize=(8, 8));

 

⇒ ๊ฒฐ๊ณผ : Fare๊ฐ€ ์—ฐ์†๋œ ์ˆซ์ž๊ฐ’์ธ๋ฐ ์‹ฌํ•˜๊ฒŒ ์น˜์šฐ์นจ ⇒ log๊ฐ’์œผ๋กœ ๋ฐ”๊ฟˆ

(log๋กœ ํ•˜๋Š” ์ด์œ  : ๊ฐ€์šด๋ฐ ์ชฝ์œผ๋กœ ์ˆซ์ž๊ฐ€ ๋ชฐ๋ฆฌ๊ฒŒ ํ•ด์คŒ / ์ž‘์€ ๊ฐ’์€ ์ปค์ง€๊ฒŒ, ํฐ๊ฐ’์€ ์ž‘๊ฒŒ ํ•ด์ฃผ๋Š”๋ฐ ๋Œ€์†Œ๊ด€๊ณ„๋Š” ๋ฐ”๋€Œ์ง€ ์•Š์Œ)

05. ํŽธํ–ฅ ์ฒ˜๋ฆฌ

# ๋กœ๊ทธ๋กœ ๋ฐ”๊ฟ”
df_titanic['Fare'] = df_titanic['Fare'].map(lambda x: np.log(x) if x > 0 else 0)
# ๋‹ค์‹œ ํ™•์ธ
df_titanic.hist(bins=30, figsize=(8, 8));

06. ์นดํ…Œ๊ณ ๋ฆฌ ๋ณ€์ˆ˜ → ์ˆ˜์น˜ํ˜• ๋ณ€์ˆ˜

# ์›ํ•ซ ์ธ์ฝ”๋”ฉ : ์นดํ…Œ๊ณ ๋ฆฌ์ปฌ์ธ ์นผ๋Ÿผ์„ ์ž๋™์œผ๋กœ ๊ณจ๋ผ์„œ ๋ฐ”๊ฟ”์คŒ
df_titanic = pd.get_dummies(df_titanic)

 

Survived  Pclass  Age   SibSp  Parch  Fare      Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
0         3       22.0  1      0      1.981001  False       True      False       False       True
1         1       38.0  1      0      4.266662  True        False     True        False       False
1         3       26.0  0      0      2.070022  True        False     False       False       True
1         1       35.0  1      0      3.972177  True        False     False       False       True
0         3       35.0  0      0      2.085672  False       True      False       False       True

07. ๋ฐ์ดํ„ฐ ๋ถ„ํ•  / ์Šค์ผ€์ผ๋ง

# train, test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape # 711,10 / 178,10 / 711, / 178,

# standart scaling
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

08. ๋ชจ๋ธ๋ง ๋ฐ ํ‰๊ฐ€ : ๋žœ๋คํฌ๋ ˆ์ŠคํŠธ

from sklearn.ensemble import RandomForestClassifier

# ๋ชจ๋ธ ์ƒ์„ฑ
rf = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
# fit
rf.fit(X_train_scaled, y_train)
# ์˜ˆ์ธก : predict
y_pred = rf.predict(X_test_scaled)
# ์ •ํ™•๋„ 
print("accuracy = {:.2f}".format(sum(y_pred == y_test) / len(y_test))) #0.85