๋™์•„๋ฆฌ,ํ•™ํšŒ/GDGoC

[AI ์Šคํ„ฐ๋””] Section 5 : ์ „ํ†ต์ ์ธ ๋จธ์‹ ๋Ÿฌ๋‹ - ์ง€๋„ํ•™์Šต ๋ชจ๋ธ part 2

egahyun 2024. 12. 26. 05:31

Logistic Regression

Introduction to Logistic Regression

: Despite the name "regression," it solves classification problems → specifically binary classification.

๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€ ๋ถ„๋ฅ˜๊ธฐ

01. ๋ฐฉ๋ฒ•

: ๋…๋ฆฝ๋ณ€์ˆ˜์™€ ์ข…์†๋ณ€์ˆ˜ ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ์ฐพ์•„๋‚ด๊ณ  ํ‰๊ฐ€

  1. ํšŒ๊ท€ ์‹์„ ๊ตฌํ•ด์„œ ๊ฐ’์„ ์˜ˆ์ธก ⇒ ํ™•๋ฅ ๊ฐ’์„ ๋ฐ˜ํ™˜
  2. 0.5 ๋ณด๋‹ค ์ž‘์œผ๋ฉด 0. 0.5 ๋ณด๋‹ค ํฌ๋ฉด 1๋กœ ๋ถ„๋ฅ˜

02. ํŠน์ง• ๋ฐ ํ™œ์šฉ

(1) ๊ตฌํ˜„์ด ๊ฐ„๋‹จ

(2) ๋ชจ๋“  ๋ถ„๋ฅ˜ ๋ฌธ์ œ์˜ ๊ธฐ์ดˆ๊ฐ€ ๋จ

(3) ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€์˜ ๊ธฐ๋ณธ ๊ฐœ๋…์ด ๋”ฅ๋Ÿฌ๋‹์— ์ ์šฉ๋จ

(4) ๋ฌธ์ œ์ 

  1. ๋ถ„๋ฅ˜ ๋ฌธ์ œ์— ์ ํ•ฉํ•œ ํ•จ์ˆ˜์ธ๊ฐ€
  2. ๋ชจํ˜ธํ•œ ๊ฒฝ๊ณ„ ⇒ ํ•ด๊ฒฐ : ์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜

03. ์‹œ๊ทธ๋ชจ์ด๋“œ ํ•จ์ˆ˜ (= ๋กœ์ง€์Šคํ‹ฑ ํ•จ์ˆ˜)

(1) ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€์— ์ ํ•ฉํ•œ ์ด์œ 

  1. ๊ฒฝ๊ณ„๊ฐ€ ๋ช…ํ™•ํžˆ ๋ณด์—ฌ์ง
  2. 0 ~ 1 ์‚ฌ์ด์˜ ๊ฐ’์„ ๊ฐ€์ง€๋ฏ€๋กœ ํ™•๋ฅ  ๊ฐ’์œผ๋กœ ํ•ด์„ ๊ฐ€๋Šฅํ•จ
  3. ๋ฏธ๋ถ„ ๊ฐ€๋Šฅํ•จ

(2) ๊ณต์‹

$$ f(z) = \frac{1}{1+e^{-z}} $$

  • z : logit = $\theta X$⇒ ๊ทธ๋ž˜์„œ ์ด๋ฆ„์ด Logistic regression
  • ⇒ logit์•ˆ์— regression์ด ๋“ค์–ด๊ฐ€์žˆ๋Š” ํ˜•ํƒœ (wx +b)

04. ๋ชจ๋ธ ์‚ฌ์šฉ ์ฝ”๋“œ

: sklearn.linear_model.LogisticRegression(solver=‘lbfgs’)

→ solver : optimization algorithm

๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€ ์‹ค์Šต : ์ด์ง„๋ถ„๋ฅ˜

๋ฐ์ดํ„ฐ ์†Œ๊ฐœ

  1. ์นผ๋Ÿผ : ์‚ฌ์šฉ์ž id, ์„ฑ๋ณ„, ์—ฐ๋ น, ์ถ”์ •๊ธ‰์—ฌ, ๊ตฌ๋งค์—ฌ๋ถ€
  2. ์˜ˆ์ธก : ํŠน์ • ์‚ฌ์šฉ์ž๊ฐ€ ๊ตฌ๋งค ํ• ์ง€ ์—ฌ๋ถ€ → ๊ตฌ๋งค: 1, ๊ตฌ๋งค ์•Š์Œ: 0

     User ID   Gender  Age  EstimatedSalary  Purchased
395  15691863  Female  46   41000
396  15706071  Male    51   23000
397  15654296  Female  50   20000
398  15755018  Male    36   33000
399  15594041  Female  49   36000

1. ๋ฐ์ดํ„ฐ ํ™•์ธ ๋ฐ ๋ณ€์ˆ˜ ์„ค์ •

# ๋ฐ์ดํ„ฐ ํ™•์ธ
df.head() # ์ฒ˜์Œ๋ถ€ํ„ฐ ํ™•์ธ
df.tail() # ๋์—์„œ ํ™•์ธ
df['purchased'].value_counts() # ์˜ˆ์ธก ๋ ˆ์ด๋ธ”์ด ํŽธํ–ฅ ๋˜์—ˆ๋Š”์ง€ ํ™•์ธํ•˜๊ธฐ ์œ„ํ•ด
# 50 50 ์œผ๋กœ ๋˜์–ด์žˆ์œผ๋ฉด ์ข‹๊ฒ ์ง€๋งŒ, 2๋Œ€ 1 ์ •๋„๋ฉด ๋งŽ์ด ํŽธํ–ฅ ๋œ๊ฒŒ ์•„๋‹˜

# ๋ฐ์ดํ„ฐ ๋ถ„๋ฅ˜
# int๋กœ ๋„ฃ์–ด์ค˜๋„ ๋˜์ง€๋งŒ float ํ˜•ํƒœ๊ฐ€ ์ผ๋ฐ˜์ ์ด๋ฏ€๋กœ ๋ณ€ํ™˜
X = df.iloc[:, [2,3]].values.astype("float32") # age, estimatedSalary๋งŒ ์‚ฌ์šฉ
y = df.iloc[:, 4].values.astype("float32") # purchased

X.shape, y.shape # 400,2 / 400,

# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape  # (320, 2), (80, 2), (320,), (80,)

2. Feature Scaling

# Age is on the order of tens, EstimatedSalary on the order of tens of thousands
# => with such different scales the larger numbers dominate, so scaling is needed
sc = StandardScaler()                # create an instance
X_train = sc.fit_transform(X_train)  # fit and transform (they can also be done separately)
# The test set must only be transformed:
# -> fitting again would recompute the mean and std on that set,
#    i.e., use future data in advance, which must not happen
X_test = sc.transform(X_test)

print(X_train.shape)  # (320, 2)

3. ๋ชจ๋ธ

# fit
lr_clf = LogisticRegression(solver='lbfgs', random_state=0)
lr_clf.fit(X_train, y_train)
# predict: returns the predicted class
y_pred = lr_clf.predict(X_test)

# evaluation
print("actual number of true labels in the test set = ", sum(y_test))  # 22
print("number of true labels predicted by the model = ", sum(y_pred))  # 18
print("accuracy = {:.2f}".format(accuracy_score(y_test, y_pred)))      # 0.925
print("precision = {:.2f}".format(precision_score(y_test, y_pred)))    # 0.94
print("recall = {:.2f}".format(recall_score(y_test, y_pred)))          # 0.77
# predict_proba: returns the probability of each class
y_pred_proba = lr_clf.predict_proba(X_test)

print(y_pred_proba[:, 1][:5])  # [0.12602436 0.17691062 0.2077208  0.10091478 0.10701443]

# predict๋Š” 0.5๋ฅผ threshold๋กœ ํ–ˆ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋Š” ์ฝ”๋“œ
sum((y_pred_proba[:, 1] > 0.5) == y_test)/len(y_test) # predict์˜ accuracy๊ฐ’๊ณผ ๋™์ผ
sum((y_pred_proba[:, 1] > 0.5) == y_test) # ๋ช‡๊ฐœ๋ฅผ ๋งž์ท„๋Š”์ง€ ๊ฐœ์ˆ˜ ํ™•์ธ

# threshold = 0.5 -> ๋ณดํ†ต ๊ธฐ๋ณธ์ ์œผ๋กœ ์ด ๊ฐ’์ด ์ง€์ •๋จ
threshold = 0.5
y_pred_proba = lr_clf.predict_proba(X_test)
y_pred_proba_1 = y_pred_proba[:, 1] > threshold
sum(y_pred_proba_1 == y_test) / len(y_test) 

# threshold = 0.4
threshold = 0.4
y_pred_1 = y_pred_proba[:, 1] > threshold
# Lowering the threshold: accuracy down, precision down, recall up
print("number of samples classified as 1 at threshold {}: ".format(threshold), sum(y_pred_1))
print("precision = {:.2f}".format(precision_score(y_test, y_pred_1)))
print("recall = {:.2f}".format(recall_score(y_test, y_pred_1)))
print("f1 score = ", f1_score(y_test, y_pred_1))

# threshold = 0.6
threshold = 0.6
y_pred_2 = y_pred_proba[:, 1] > threshold
# Raising the threshold: accuracy down, precision up, recall down
print("number of samples classified as 1 at threshold {}: ".format(threshold), sum(y_pred_2))
print("precision = {:.2f}".format(precision_score(y_test, y_pred_2)))
print("recall = {:.2f}".format(recall_score(y_test, y_pred_2)))
print("f1 score = ", f1_score(y_test, y_pred_2))

04. ํ‰๊ฐ€

# ํ˜ผ๋™ํ–‰๋ ฌ
cm  = confusion_matrix(y_test, y_pred, labels=[1, 0])

ax = sns.heatmap(cm, annot=True, fmt='d')
ax.set_ylabel('true') # ๊ฐ€๋กœ ์„ธ๋กœ ์ฃผ์˜
ax.set_xlabel('predicted')
ax.set_title('Confusion Matirx')

# ROC curve
y_proba = lr_clf.predict_proba(X_test)
y_scores = y_proba[:, 1]  # probability of class 1 for every sample

# roc_curve also returns the thresholds; ignore them with _
fpr, tpr, _ = roc_curve(y_test, y_scores)  # fpr: false positive rate, tpr: true positive rate
auc = roc_auc_score(y_test, y_scores)

plt.plot(fpr, tpr, label="auc=" + "{:.2f}".format(auc))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()

Ensemble Learning

What Is Ensemble Learning?

: Combine multiple weak learners to obtain higher performance than any single learner.

์•™์ƒ๋ธ” ํ•™์Šต ๋ฐฉ๋ฒ•

01. Bagging (๋ฐฐ๊น…) (Bootstrap Aggregating)

(1) ๋ฐฐ๊น…์ด๋ž€? : ๋ณต์› ์ถ”์ถœ์„ ํ†ตํ•ด ์—ฌ๋Ÿฌ ๋ชจ๋ธ์„ ๊ฒฐํ•ฉํ•˜์—ฌ ๋‹จ์ผ ๋ชจ๋ธ๋ณด๋‹ค ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋‚ด๋Š” ๊ฒƒ

 

(2) ๋ถ„๋ฅ˜๊ธฐ ํ›ˆ๋ จ๋ฐฉ๋ฒ• : ํŠธ๋ ˆ์ด๋‹ ์ƒ˜ํ”Œ์˜ subset์„ ๋ฌด์ž‘์œ„ ์ถ”์ถœ ํ›„, ๋ถ„๋ฅ˜๊ธฐ ํ›ˆ๋ จ

 

(3) ํšจ๊ณผ : ํŽธํ–ฅ(๋ชจ๋ธ ๋ณต์žก๋„)์ด ์ปค์ง€์ง€ ์•Š์œผ๋ฉด์„œ ๋ถ„์‚ฐ์„ ์ค„์ผ ์ˆ˜ ์žˆ์Œ

    → ์ด์œ  : ๋žœ๋ค ์ƒ˜ํ”Œ๋ง์— ์˜ํ•œ ๋ชจ๋ธ fit

 

(4) Bootstrap (sampling with replacement): draw samples so that a previously drawn sample can be drawn again

    → Sampling without replacement: draw samples so that none are duplicated

    → Effect (when averaging B models)

  • variance decreases to $\frac{\sigma^2}{B}$
  • mean stays the same

(5) The bagging tree algorithm (a minimal sketch follows this list)

  • Randomly sample X_b, y_b from the training data (e.g., sample 100 out of 1,000 data points)
  • Build a decision tree from X_b, y_b using the ID3 algorithm
  • Repeat steps 1 and 2 until B trees have been built
  • Classify with all B trees and decide by majority vote
  • This keeps the bias while reducing the variance.

(6) ๋žœ๋ค ํฌ๋ ˆ์ŠคํŠธ

  • ๋ฌธ์ œ์  : ๋ช‡ ๊ฐœ์˜ ํŠธ๋ฆฌ๋ฅผ ๋งŒ๋“ค๋“  ์ฒซ ๋ฒˆ์งธ ๋…ธ๋“œ์—๋Š” ๊ฐ™์€ ์—ด์ด ์‚ฌ์šฉ๋  ๊ฒƒ์ž„ ⇒ ๋‚˜๋ฌด๊ฐ€ ๋‹ค ๋น„์Šทํ•  ๊ฒƒ, ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์— ๋„์›€์ด ๋˜์ง€ ์•Š์Œ
  • ํ•ด๊ฒฐ ๋ฐฉ๋ฒ• : Decision Tree ์— ํฌํ•จ๋  attribute ๋“ค์„ random ํ•˜๊ฒŒ ์ผ๋ถ€๋งŒ ์„ ์ •
    โž” ๋ชจ๋“  attribute ๋ฅผ ๊ฐ€์ง€๊ณ  Tree ๋ฅผ ๋งŒ๋“ค ๊ฒฝ์šฐ ๋งค์šฐ ๊ฐ•ํ•œ attribute ๊ฐ€ ๋ชจ๋“  tree ์— ํ•ญ์ƒ ํฌํ•จ๋จ
  •  ์žฅ์ 
    • ๋” Random ํ•˜๊ณ  ๋…๋ฆฝ์ ์ธ classifier ์ƒ์„ฑ๊ฐ€๋Šฅ
    • ๋‹ค์–‘ํ•œ ๋‹ค๋ฅธ ํŠน์„ฑ์˜ ๋‚˜๋ฌด๋“ค์ด ์ƒ์„ฑ๋˜๋ฏ€๋กœ ์ด ๊ฒฐ๊ณผ๋ฅผ voting ํ•ด์„œ ์ตœ์ข…์„ ๋‚ด๋ฉด ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ƒ„
    • Tree-based model ์ด๋ฏ€๋กœ white box ํŠน์ง•์„ ์œ ์ง€
    • High prediction accuracy
    • ๋ณ‘๋ ฌ์ ์œผ๋กœ ์ƒ์„ฑ ๊ฐ€๋Šฅํ•˜๋ฏ€๋กœ ์†๋„๊ฐ€ ๋น ๋ฆ„
    • ๊ฐ tree ๋Š” ๋งค์šฐ deep ํ•˜๊ฒŒ ์ƒ์„ฑ

02. Boosting

(1) ๋ถ€์ŠคํŒ… ๋ฐฉ๋ฒ• 2๊ฐ€์ง€

  • ์ž˜๋ชป ๋ถ„๋ฅ˜๋œ ๋ฐ์ดํ„ฐ์— ๋” ๋†’์€ ๊ฐ€์ค‘์น˜๋ฅผ ๋ถ€์—ฌํ•˜์—ฌ ๋‹ค์Œ ๋ชจ๋ธ์˜ ์ƒ˜ํ”Œ๋ง์— ํฌํ•จ๋  ํ™•๋ฅ ์„ ๋†’์ด๋Š” ๋ฐฉ๋ฒ•
  • ๋ชจ๋ธ ์ž์ฒด๋ฅผ ๋ถ€์ŠคํŒ… : ๋‹ค์ˆ˜์˜ ์•ฝํ•œ ํ•™์Šต๊ธฐ๋ฅผ ํ›ˆ๋ จ์‹œ์ผœ majority voting (→ ๋ฐฐ๊น…๊ณผ ๋™์ผํ•œ ๋ฐฉ๋ฒ•)

(2) ๋Œ€ํ‘œ์ ์ธ ๋ฉ”์†Œ๋“œ : AdaBoost, Gradient Boost (XGBoost)

 

(3) Gradient Boost (XGBoost)

  • Method: boost the model using weak learners (a residual-fitting sketch follows this list)
    • Weak Learner: a model that performs only slightly better than random guessing
      ⇒ decision trees are used as the weak learners
    • Keep adding trees that reduce the residuals left by the existing weak learners, until a new tree no longer reduces them
    • The residuals produced by the previous decision tree are used as the labels for the inputs of the next decision tree
    • First tree: produces the residuals / later trees: trained to predict how much residual remains
      ⇒ summing the outputs of all the trees recovers the value of y
      ⇒ as more trees are created, the residual approaches 0
  • Optimization: the loss function is optimized by gradient descent
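A minimal sketch of the residual-fitting idea for regression, using small DecisionTreeRegressor trees; this illustrates the concept only and is not scikit-learn's actual implementation:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1):
    base = y.mean()                   # initial prediction: the mean of y
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residual = y - pred           # what the current ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)  # fit the residual
        pred += learning_rate * tree.predict(X)                     # add the new tree's contribution
        trees.append(tree)
    return base, trees

def gradient_boost_predict(X, base, trees, learning_rate=0.1):
    # the sum of all tree outputs (plus the base value) approximates y
    return base + learning_rate * sum(t.predict(X) for t in trees)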

(4) AdaBoost (Adaptive Boosting)

  • Method: boosts the data (a scikit-learn sketch follows this list)
  • The first model created is a stump (a single-split tree with two leaves, i.e., a weak, inaccurate binary classifier).
    Data misclassified by the previous stump get a higher weight, so they are more likely to be included in the next sampling step.
    ⇒ The classifications from all the models are combined to produce the final result.

๋žœ๋คํฌ๋ ˆ์ŠคํŠธ & gradient boosting ์‹ค์Šต

๋ฐ์ดํ„ฐ ์†Œ๊ฐœ

  1. ์นผ๋Ÿผ : ์‚ฌ์šฉ์ž id, ์„ฑ๋ณ„, ์—ฐ๋ น, ์ถ”์ •๊ธ‰์—ฌ, ๊ตฌ๋งค์—ฌ๋ถ€
  2. ์˜ˆ์ธก : ํŠน์ • ์‚ฌ์šฉ์ž๊ฐ€ ๊ตฌ๋งค ํ• ์ง€ ์—ฌ๋ถ€ → ๊ตฌ๋งค: 1, ๊ตฌ๋งค ์•Š์Œ: 0
User ID   Gender  Age  EstimatedSalary
15706071  Male    51   23000
15654296  Female  50   20000
15755018  Male    36   33000
15594041  Female  49   36000

1. ๋ณ€์ˆ˜ ์„ค์ •

X = df.iloc[:, [2,3]].values.astype("float32") # age, estimatedSalary
y = df.iloc[:, 4].values.astype("float32") # purchased

# train, test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

2. feature scaling

# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test  = sc.transform(X_test)

3. Random Forest model

from sklearn.ensemble import RandomForestClassifier

# fit
# bagging with 10 trees
rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X_train, y_train)

# predict
y_pred = rf.predict(X_test)

# accuracy
accuracy_score(y_test, y_pred)
print("accuracy = {:.2f}".format(sum(y_pred == y_test) / len(y_test)))

# confusion matrix
print("confusion matrix\n", 
      confusion_matrix(y_test, y_pred, labels=[1, 0]))
# f1 score
print("f1 score\n", f1_score(y_test, y_pred, labels=[1, 0]))

5. Gradient Boosting Classifier

- min_samples_split : minimum number of samples required to split a node => prevents overfitting
- max_depth : controls the tree depth => prevents overfitting
- learning_rate : scales each tree's contribution; trades off against n_estimators
- n_estimators : number of sequential trees

from sklearn.ensemble import GradientBoostingClassifier
# Model
# Each tree is chained to predict the errors left by the trees before it,
# so many trees are needed -> 500
# Because the model chains weak, inaccurate trees that each predict the previous residuals,
# keep max_depth small
gb = GradientBoostingClassifier(learning_rate=0.1, n_estimators=500, max_depth=5)

# fit
gb.fit(X_train, y_train)

# predict
y_pred = gb.predict(X_test)

# ๊ฒฐ๊ณผ ํ™•์ธ
print("Test set true counts = ", sum(y_test))
print("predicted true counts = ", sum(y_pred))
print("accuracy = {:.2f}".format(
            sum(y_pred == y_test) / len(y_test)))
            
# ํ˜ผ๋™ํ–‰๋ ฌ
print("confution matrix\\n", 
      confusion_matrix(y_test, y_pred, labels=[1, 0]))

# f1 score
print("f1 score\\n", f1_score(y_test, y_pred, labels=[1, 0]))

6. ํ›ˆ๋ จ ๊ฒฐ๊ณผ ์‹œ๊ฐํ™”

์‹œ๊ฐํ™”๋ฅผ ์œ„ํ•œ ๊ฐ€์ƒ์˜ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ

# ๊ฐ€์ƒ์˜ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ : ์ „์ฒด ๋ ˆ์ฝ”๋“œ์— ๋Œ€ํ•ด์„œ ์ตœ์†Œ ์ตœ๋Œ€ ์‚ฌ์šฉ
x1_min, x1_max = X_test[:, 0].min() - 1, X_test[:, 0].max() + 1       
x2_min, x2_max = X_test[:, 1].min() - 1, X_test[:, 1].max() + 1  

# x1, x2๊ฐ€ ์„œ๋กœ ๊ต์ฐจ๋˜๋Š” ์  ๋งˆ๋‹ค ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•ด์คŒ -> meshgrid
X1, X2 = np.meshgrid(np.arange(x1_min, x1_max, 0.1), 
                     np.arange(x2_min, x2_max, 0.1))
X1.shape, X2.shape # (60, 61) / (60, 61)

# X1, X2๋ฅผ 1์ฐจ์›์œผ๋กœ ํŽผ์ณ์ค˜์„œ ๊ฐ ์ถ•์˜ ๊ฐ’์„ ๋งŒ๋“ฌ                                 
XX = np.column_stack([X1.ravel(), X2.ravel()])
XX.shape # (3360,2)๋งŒํผ์˜ ๊ฐ€์ƒ๋ฐ์ดํ„ฐ๊ฐ€ ์ƒ์„ฑ๋จ

๊ฐ€์ƒ์˜ ๋ฐ์ดํ„ฐ๋กœ ๋ชจ๋ธ ์˜ˆ์ธก

# random forest๋กœ ์˜ˆ์ธกํ•œ ๊ฐ’
Y_rf = np.array(rf.predict(XX))
# gradient boost๋กœ ์˜ˆ์ธกํ•œ ๊ฐ’
Y_gb = np.array(gb.predict(XX))

์‹œ๊ฐํ™”

# ๋‘๊ฐ€์ง€ ์ƒ‰ ๋งŒ๋“ฌ
from matplotlib.colors import ListedColormap
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA']) # ์—ฐํ•œ ๊ฐ’
cmap_bold = ListedColormap(['#FF0000', '#00FF00'])  # ์ง„ํ•œ ๊ฐ’  

# ์‹œ๊ฐํ™”
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Random Forest ์‹œ๊ฐํ™”
# ๊ฒฐ์ •๊ฒฝ๊ณ„ ์ƒ์„ฑ
ax1.pcolormesh(X1, X2, Y_rf.reshape(X1.shape),cmap=cmap_light, shading='auto') 
# y_test๊ฐ€ 0์ผ๋•Œ, 1์ผ๋•Œ ๋‚˜๋ˆ ์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ ์œผ๋กœ ํ‘œํ˜„ -> ์ƒ‰์„ ๋‹ค๋ฅด๊ฒŒ ๊ทธ๋ฆฌ๋„๋ก ํ•จ
for i in range(2):
    ax1.scatter(X_test[y_test == i, 0], X_test[y_test == i, 1], s=20, color=cmap_bold(i), label=i, edgecolor='k')
ax1.set_title('Random Forest')
ax1.set_xlabel('Age')
ax1.set_ylabel('Estimated Salary')
ax1.legend()

# Gradient Boosting plot
ax2.pcolormesh(X1, X2, Y_gb.reshape(X1.shape), cmap=cmap_light, shading='auto') 
for i in range(2):
    ax2.scatter(X_test[y_test == i, 0], X_test[y_test == i, 1], s=20, color=cmap_bold(i), label=i, edgecolor='k')
ax2.set_title('Gradient Boosting')
ax2.set_xlabel('Age')
ax2.legend()
plt.tight_layout()

# Importance of Age and EstimatedSalary
feature_imp = pd.Series(gb.feature_importances_, 
            ['Age', 'EstimatedSalary']).sort_values(ascending=False)
feature_imp.plot(kind='bar', title='feature importance')

 


Feature Engineering

์ข‹์€ ํ”ผ์ฒ˜์˜ ์กฐ๊ฑด

  1. Target ๊ณผ์˜ ๋†’์€ ๊ด€๋ จ์„ฑ
  2. prediction ์‹œ์ ์— ์•Œ ์ˆ˜ ์žˆ์Œ
    • ์›” ์ค‘์—๋Š” ์•Œ ์ˆ˜ ์—†๊ณ , ์›”๋ง์—๋งŒ ์•Œ ์ˆ˜ ์žˆ๋Š” ๋ฐ์ดํ„ฐ๋“ค → ex. sales data ๋Š” ์ต์›”์— ์ง‘๊ณ„
    • ์•„์ง ๋ฐœ์ƒํ•˜์ง€๋„ ์•Š์€ ๋ฐ์ดํ„ฐ๋กœ ํ›ˆ๋ จํ•˜๊ฒŒ ๋˜๋Š” ์ƒํ™ฉ
  3. numeric
  4. ์ถฉ๋ถ„ํ•œ ๋ฐ์ดํ„ฐ ์ˆ˜
  5. ์ธ๊ฐ„ ์ „๋ฌธ๊ฐ€์˜ ๋„๋ฉ”์ธ ์ง€์‹ ํ™œ์šฉ ๊ฐ€๋Šฅ : ๊ด€์ ์˜ ์ฐจ์ด๊ฐ€ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Œ
    • Example ) ๊ตํ†ต ์ •๋ณด ๋ฐ์ดํ„ฐ : ๋ฐœ์ƒ ์‹œ์ 
    • → ์š”์ผ๋ณ„๋กœ ๋‚˜๋ˆ„๋Š”๊ฒŒ ์ข‹์„ ๊ฑฐ์•ผ! ๋ผ๋Š” ์ง€์‹์ด ์žˆ์„ ๊ฒฝ์šฐ, ์˜ˆ์ธก์— ๋„์›€์ด ๋  ์ˆ˜๋„ ์žˆ์Œ

๋จธ์‹ ๋Ÿฌ๋‹์„ ์œ„ํ•œ ํ”ผ์ฒ˜ ์—”์ง€๋‹ˆ์–ด๋ง

  1. ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ
  2. Data formatting : ๋ฐ์ดํ„ฐ ํ†ต์ผ
  3. ํŽธํ–ฅ ์ฒ˜๋ฆฌ
  4. Normalization (= feature scaling) : scale์ด ๋‹ค๋ฅผ ๋•Œ, ๋น„์Šทํ•œ ์Šค์ผ€์ผ๋กœ ํ•ด์ฃผ๋Š”๊ฒŒ ์˜ˆ์ธก์ด ๋” ์ž˜๋จ
  5. Binning : ์—ฐ์†๋œ ์ˆซ์ž๋ฅผ ์–ด๋– ํ•œ ๊ธฐ์ค€์œผ๋กœ ๊ทธ๋ฃนํ•‘ ํ•˜๊ธฐ ์œ„ํ•ด ๋ฒ”์œ„๋ฅผ ์ •ํ•ด์ฃผ๋Š” ๊ฒƒ
  6. Categorical ๋ณ€์ˆ˜ → ์ˆ˜์น˜ํ˜•์œผ๋กœ
    1. ordinal category (์ˆœ์„œ/ํฌ๊ธฐ๊ฐ€ ์žˆ๋Š” feature)
      • ์ˆซ์ž๋กœ ํฌ๊ธฐ(์ˆœ์„œ) ํ‘œ์‹œ
      • ex) L > M > S → 3, 2, 1
    2. nominal category (์ˆœ์„œ/ํฌ๊ธฐ๊ฐ€ ์—†๋Š” feature)
      • ์ด๋Ÿฐ ๊ฒฝ์šฐ๊ฐ€ ํ›จ์”ฌ ๋งŽ์Œ
      • ๋ฐฉ๋ฒ• : one-hot encoding

Feature Engineering & Random Forest Practice

Dataset: the Titanic data → records of whether each Titanic passenger survived

01. ๋ฐ์ดํ„ฐ ํ™•์ธ

# ํ•„์š”์—†๋Š” ์นผ๋Ÿผ drop
df.drop(['PassengerId', 'Name', 'Cabin', 'Ticket'], axis=1, inplace=True)

# ๋ฐ์ดํ„ฐ ํ™•์ธ
df_titanic.info()
df_titanic.isnull().sum()
df_titanic.describe()

02. Feature Analysis

→ Check the correlations between features, and between Survived and each feature

# How strongly are the features related to each other?
# negative: inverse correlation, 0: almost no relationship, positive: positive correlation
# correlation = the covariance (cov) normalized to the range [-1, 1]
df_titanic.corr(numeric_only=True)

# Visualize the correlations
g = sns.heatmap(df_titanic.corr(numeric_only=True), annot=True, cmap="coolwarm")

03. ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ

# Age -> median์œผ ์ฒ˜๋ฆฌ : ์—†์• ๊ธฐ์—” ๋„ˆ๋ฌด ๋งŽ๊ธฐ ๋•Œ๋ฌธ
df_titanic['Age'].fillna(df_titanic['Age'].median(), inplace=True)

# Embarked -> ์–ผ๋งˆ ์—†์–ด์„œ ์‚ญ์ œ
df_titanic.dropna(inplace=True)

04. ํŽธํ–ฅ ํ™•์ธ

# ํžˆ์Šคํ† ๊ทธ๋žจ์„ ์ด์šฉํ•ด ๋ถ„ํฌ๋ฅผ ํ™•์ธํ•ด ํŽธํ–ฅ ํ™•์ธ 
df_titanic.hist(bins=30, figsize=(8, 8));

 

⇒ ๊ฒฐ๊ณผ : Fare๊ฐ€ ์—ฐ์†๋œ ์ˆซ์ž๊ฐ’์ธ๋ฐ ์‹ฌํ•˜๊ฒŒ ์น˜์šฐ์นจ ⇒ log๊ฐ’์œผ๋กœ ๋ฐ”๊ฟˆ

(log๋กœ ํ•˜๋Š” ์ด์œ  : ๊ฐ€์šด๋ฐ ์ชฝ์œผ๋กœ ์ˆซ์ž๊ฐ€ ๋ชฐ๋ฆฌ๊ฒŒ ํ•ด์คŒ / ์ž‘์€ ๊ฐ’์€ ์ปค์ง€๊ฒŒ, ํฐ๊ฐ’์€ ์ž‘๊ฒŒ ํ•ด์ฃผ๋Š”๋ฐ ๋Œ€์†Œ๊ด€๊ณ„๋Š” ๋ฐ”๋€Œ์ง€ ์•Š์Œ)

05. ํŽธํ–ฅ ์ฒ˜๋ฆฌ

# ๋กœ๊ทธ๋กœ ๋ฐ”๊ฟ”
df_titanic['Fare'] = df_titanic['Fare'].map(lambda x: np.log(x) if x > 0 else 0)
# ๋‹ค์‹œ ํ™•์ธ
df_titanic.hist(bins=30, figsize=(8, 8));

06. ์นดํ…Œ๊ณ ๋ฆฌ ๋ณ€์ˆ˜ → ์ˆ˜์น˜ํ˜• ๋ณ€์ˆ˜

# ์›ํ•ซ ์ธ์ฝ”๋”ฉ : ์นดํ…Œ๊ณ ๋ฆฌ์ปฌ์ธ ์นผ๋Ÿผ์„ ์ž๋™์œผ๋กœ ๊ณจ๋ผ์„œ ๋ฐ”๊ฟ”์คŒ
df_titanic = pd.get_dummies(df_titanic)

 

Survived  Pclass  Age   SibSp  Parch  Fare      Sex_female  Sex_male  Embarked_C  Embarked_Q  Embarked_S
0         3       22.0  1      0      1.981001  False       True      False       False       True
1         1       38.0  1      0      4.266662  True        False     True        False       False
1         3       26.0  0      0      2.070022  True        False     False       False       True
1         1       35.0  1      0      3.972177  True        False     False       False       True
0         3       35.0  0      0      2.085672  False       True      False       False       True

07. ๋ฐ์ดํ„ฐ ๋ถ„ํ•  / ์Šค์ผ€์ผ๋ง

# train, test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape # 711,10 / 178,10 / 711, / 178,

# standart scaling
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

08. ๋ชจ๋ธ๋ง ๋ฐ ํ‰๊ฐ€ : ๋žœ๋คํฌ๋ ˆ์ŠคํŠธ

from sklearn.ensemble import RandomForestClassifier

# ๋ชจ๋ธ ์ƒ์„ฑ
rf = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
# fit
rf.fit(X_train_scaled, y_train)
# ์˜ˆ์ธก : predict
y_pred = rf.predict(X_test_scaled)
# ์ •ํ™•๋„ 
print("accuracy = {:.2f}".format(sum(y_pred == y_test) / len(y_test))) #0.85