
[AI Study] Section 3 : Traditional Machine Learning - Supervised Learning Models, Part 1

egahyun 2024. 12. 26. 04:30

๋จธ์‹ ๋Ÿฌ๋‹ end-to-end process

1. ๋ฌธ์ œ ์ •์˜ : ๋ญ˜ ํ•ด์•ผ ๊ฒ ๋‹ค! -> ๋ถ„๋ฅ˜, ํšŒ๊ท€ ๋“ฑ ์–ด๋–ค ๊ฒƒ์ธ์ง€

2. ๋ฐ์ดํ„ฐ ์ค€๋น„

  < ์ „ํ†ต์ ์ธ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ์˜ ๋ฐ์ดํ„ฐ ์ค€๋น„ >

  a. ์œ ์ €์˜ ์š”๊ตฌ์‚ฌํ•ญ, ๋น„์ฆˆ๋‹ˆ์Šค ๋ฃฐ, ํ”„๋กœ๊ทธ๋žจ ์ŠคํŒฉ, ์„ค๊ณ„์„œ ์ •๋ฆฌ

  b. ์ „ํ†ต์ ์ธ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ํ”„๋กœ๊ทธ๋žจ์— ์–ด๋–ค ๊ทœ์น™์„ ๋„ฃ์„์ง€ ์กฐ์‚ฌ

 

 < ๋จธ์‹ ๋Ÿฌ๋‹ ํ”„๋กœ์ ํŠธ์—์„œ์˜ ๋ฐ์ดํ„ฐ ์ค€๋น„ >

  ⇒ ์œ„์™€ ๊ฐ™์€ ์ „ํ†ต์ ์ธ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ์˜ ๋‹จ๊ณ„ ํ•„์š” ์—†์Œ

  a. ์ „๋‹ฌ๋ฐ›์€ ๋น„์ฆˆ๋‹ˆ์Šค ์š”๊ตฌ์‚ฌํ•ญ ํ™•์ธ

  b. ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋Š”์ง€, ์–ด๋–ป๊ฒŒ ์ค€๋น„ํ•  ๊ฒƒ์ธ์ง€, ์–ด๋””์„œ ๋ชจ์„๊ฒƒ์ธ์ง€ ์—ฐ๊ตฌ

  c. ํ™•๋ณดํ•œ ๋ฐ์ดํ„ฐ ์ •์ œ (80%์˜ ์‹œ๊ฐ„์„ ์ฃผ๋กœ ์—ฌ๊ธฐ์— ์‚ฌ์šฉ)

 

3. ๋ชจ๋ธ ์„ ํƒ : ๋ฐ์ดํ„ฐ์™€ ๋ฌธ์ œ ์ •์˜์— ๋งž๋Š” ์—ฌ๋Ÿฌ ๋ชจ๋ธ ์ค‘ ํ•˜๋‚˜ ์„ ํƒ

4. ๋ชจ๋ธ ์ž‘์„ฑ : ์„ ํƒํ•œ ๋ชจ๋ธ์„ ์ž‘์„ฑ

5. ๋ชจ๋ธ ํ‰๊ฐ€ : ์ž‘์„ฑํ•œ ๋ชจ๋ธ ํ‰๊ฐ€

6. ๋ชจ๋ธ ๊ฐœ์„  : ํ‰๊ฐ€ ๊ฒฐ๊ณผ๊ฐ€ ๊ธฐ๋Œ€์น˜์— ๋ฏธ์น˜๋Š”์ง€ ํ™•์ธ

  a. ๊ธฐ๋Œ€์น˜์— ๋ฏธ์นจ ⇒ ๊ฒฐ๊ณผ ๋ณด๊ณ 

  b. ๊ธฐ๋Œ€์น˜์— ๋ฏธ์น˜์ง€ ๋ชปํ•จ ⇒ ๋ชจ๋ธ ์„ ํƒ ๋‹จ๊ณ„๋ถ€ํ„ฐ ๋‹ค์‹œ ์‚ฌ์ดํด ์‹œ์ž‘

 

๋ชจ๋ธ ์ž‘์„ฑ ์ˆœ์„œ

: ํ”„๋กœ๊ทธ๋žจ ์•ˆ์—์„œ ํ”„๋กœ๊ทธ๋žจ์ด ์ž‘์„ฑ๋˜๋Š” ์ˆœ์„œ (ํ‹ฐํ”ผ์ปฌํ•จ, ํ‹€์ด ๋ฐ•ํ˜€์žˆ์Œ)

  1. ํ•„์š”ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์ž„ํฌํŠธ
    • ๋Œ€ํ‘œ์ ์ธ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ : sklearn, numpy, pandas, matplotlib ๋“ฑ
  2. ๋ฐ์ดํ„ฐ ๋กœ๋“œ
    • ์‚ฌ์ „์— ์ค€๋น„๋œ csv ํŒŒ์ผ ์‚ฌ์šฉ
    • Sklearn์— ๋‚ด์žฅ๋œ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉํ•ด ์‚ฌ์šฉ
  3. ๋ฐ์ดํ„ฐ ๋‚ด์šฉ ํŒŒ์•…
    • ๋ชฉ์  : ์ง๊ด€์„ ์–ป๊ธฐ ์œ„ํ•ด & ๋‚ด๊ฐ€ ์‚ฌ์šฉํ•˜๋ ค๋Š” ๋ชจ๋ธ์˜ input ์‚ฌ์–‘๊ณผ ๋ฐ์ดํ„ฐ๊ฐ€ ๋™์ผํ•œ์ง€ ํ™•์ธํ•˜๊ธฐ ์œ„ํ•ด
    • Shape : input ์‚ฌ์–‘ ํ™•์ธ
    • Pandas์˜ describe(), ๊ธฐ์ˆ  ํ†ต๊ณ„ : ๋ฐ์ดํ„ฐ ๋‚ด์šฉ ํŒŒ์•…
    • Matplotlib ๋“ฑ : ์‹œ๊ฐํ™”๋ฅผ ํ†ตํ•œ ๋ฐ์ดํ„ฐ ํŒŒ์•…
  4. Train, test set ๋ถ„ํ• 
    • ๋ฐฉ๋ฒ• : sklearn.train_test_split ํ•จ์ˆ˜ / numpy ์Šฌ๋ผ์ด์‹ฑ ๋“ฑ
    • Train set : ๋ชจ๋ธ ํ›ˆ๋ จ์— ์‚ฌ์šฉํ•˜๋Š” ๋ฐ์ดํ„ฐ ์…‹
    • Test set : ํ›ˆ๋ จ๋œ ๋ชจ๋ธ์„ ํ…Œ์ŠคํŠธํ•˜๋Š”๋ฐ ์‚ฌ์šฉํ•˜๋Š” ๋ฐ์ดํ„ฐ ์…‹
  5. Feature Scaling (= ๋ฐ์ดํ„ฐ ์ •๊ทœํ™”)
    • : ์ „์ฒด ๋ฐ์ดํ„ฐ๋“ค์ด ํฐ ์ˆซ์ž, ์ž‘์€ ์ˆซ์ž ๋“ค์ด ํ”ผ์ฒ˜๋ณ„๋กœ ์„ž์—ฌ ์žˆ๋Š” ๊ฒƒ์„ ์ „๋ถ€ ๋น„์Šทํ•œ ํฌ๊ธฐ๋กœ ๋งž์ถฐ์ฃผ๋Š” ์ž‘์—…
  6. Model object creation⇒ ๋ชจ๋ธ ์•ˆ์— ์–ด๋ ค์šด ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ๋‚ด์žฅ๋˜์–ด ์žˆ์–ด ํ•จ์ˆ˜์ฒ˜๋Ÿผ ๋ชจ๋ธ์„ ๋ถˆ๋Ÿฌ์„œ ์‚ฌ์šฉํ•  ๊ฒƒ
    • : ๋ชจ๋ธ ์˜ค๋ธŒ์ ํŠธ (๋ชจ๋ธ ์ธ์Šคํ„ด์Šค) ํ•˜๋‚˜ ์ƒ์„ฑ
  7. Model train
    • : sklearn์˜ fit()์ด๋ผ๋Š” ๋ฉ”์†Œ๋“œ๋กœ ํ›ˆ๋ จ ์ง„ํ–‰
  8. Model ํ‰๊ฐ€
    • : ํ‰๊ฐ€ ์ง€ํ‘œ ์ถœ๋ ฅ ๋ฐ ์‹œ๊ฐํ™”
  9. Best model ์„ ํƒ
    • : ํ›Œ๋ฅญํ•œ ๋ชจ๋ธ, ๊ฐ€์žฅ ๋ฐ์ดํ„ฐ์— ์ ํ•ฉํ•œ ๋ชจ๋ธ ์„ ํƒํ•˜๋Š” ๊ณผ์ •์˜ ๋ฐ˜๋ณต ์ง„ํ–‰
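
For reference, here is a minimal sketch of the workflow above condensed into one runnable script. It uses the built-in diabetes dataset purely for illustration; the variable names and the choice of StandardScaler are assumptions of mine, not from the notes.

# 1. import the required libraries
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

data = datasets.load_diabetes()          # 2. load the data
X, y = data.data, data.target
print(X.shape, y.shape)                  # 3. understand the data : (442, 10) (442,)

# 4. split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5. feature scaling (fit the scaler on the train set, reuse its statistics on the test set)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = linear_model.LinearRegression()  # 6. model object creation
model.fit(X_train, y_train)              # 7. model training
print(r2_score(y_test, model.predict(X_test)))  # 8. model evaluation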

 

์„ ํ˜• ํšŒ๊ท€ ๋ชจ๋ธ ์„ค๋ช…

๋‹จ๋ณ€์ˆ˜ ์„ ํ˜•ํšŒ๊ท€ (Univariate Linear Regression)

: ํ•œ ๊ฐœ์˜ ๋ณ€์ˆ˜๋กœ ๊ฒฐ๊ณผ ์˜ˆ์ธก

  1. ํ•˜๋‚˜๋งŒ ๊ฐ€์ง€๊ณ  ํ•˜๋Š” ์ด์œ ๋Š” ?
    • ์‹œ๊ฐํ™”๋ฅผ ํ•˜๊ธฐ ์œ„ํ•ด (๋ณ€์ˆ˜๊ฐ€ ์—ฌ๋Ÿฌ๊ฐœ ๋˜๋ฉด ์‹œ๊ฐํ™” ๋ถˆ๊ฐ€)
    • ๋ณ€์ˆ˜๊ฐ€ ์•„๋ฌด๋ฆฌ ๋งŽ์•„์ ธ๋„, ๋ณ€์ˆ˜ ํ•˜๋‚˜๊ฐ€ ์ฆ๋ช…๋˜๋ฉด ๋˜‘๊ฐ™์ด ๊ณต์‹ ์ ์šฉ ๊ฐ€๋Šฅ
    • ⇒ ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— ๋ณ€์ˆ˜ ํ•˜๋‚˜๋งŒ์œผ๋กœ ์›๋ฆฌ๋ฅผ ํŒŒ์•…ํ›„, ํ™•์žฅํ•ด๋‚˜๊ฐ€๋ฉด ๋จ
  2. ๋ฐฉ๋ฒ•
    • $y=wx+b$
      • X, y (์ž…๋ ฅ๋ฐ์ดํ„ฐ, ๋ ˆ์ด๋ธ”)๊ฐ€ ์ฃผ์–ด์ง + w, b๋Š” ๋ฏธ์ง€์ˆ˜
      • W, b ๋ฅผ ์ถ”์ •ํ•ด์•ผํ•จ
    • Linear regression ๊ณผ์ • : ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋“ค์„ ์‹œ๊ฐํ™”ํ•˜์—ฌ ๊ทธ ํฌ์ธํŠธ๋“ค์„ ๊ฐ€์žฅ ์ž˜ ํ‘œํ˜„ํ•˜๋Š” ์„ ์„ ์ฐพ์•„๋‚ด๊ธฐ

                ⇒ ๋ฌด์ˆ˜ํ•œ ๋ผ์ธ์„ ๊ทธ์„ ์ˆ˜ ์žˆ๋Š”๋ฐ ์ด ์ค‘์—์„œ ์ตœ์„ ์˜ ๋ผ์ธ์„ ์ฐพ์•„์•ผํ•˜๋Š” ๊ฒƒ

 

3. Cost Function

 : the way to measure which line is best

  • Purpose : measure how wrong the hypothesis's (= model's) output is (= find the line that minimizes the error)
  • Example
    • At x1, the model prediction is the value obtained by plugging x1 into the model equation; the actual data is y1
    • Error = actual value y1 - the prediction obtained by plugging x1 into the model = $y - \hat{y}$
  • Formula
    • $$ \text{Minimize} \sum_{i=1}^n{(\text{true}-\text{prediction})^2} $$
    • Why square the errors? They come out both positive and negative, and squaring removes the sign
    • Another option? Take absolute values ⇒ why squaring is preferred : the absolute value is not differentiable everywhere
  • MSE (Mean Squared Error) (a quick numeric check follows the list)
    • $$ MSE = \frac{1}{n} \sum_{i=1}^n(\hat{Y}_i - Y_i)^2 $$
    • Square the difference between the predicted and actual value for every data point, then average
    • Find the w, b that minimize MSE ⇒ the point where w and b are jointly minimized is the point where MSE is minimized
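
A quick worked check of the MSE formula, with made-up numbers (not from the notes), computed by hand and against sklearn's mean_squared_error:

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.0])   # actual values
y_pred = np.array([2.5, 5.0, 8.0])   # model predictions

errors = y_true - y_pred             # [ 0.5  0.  -1. ]
mse_manual = np.mean(errors ** 2)    # (0.25 + 0 + 1) / 3 = 0.4166...
print(mse_manual, mean_squared_error(y_true, y_pred))  # both print 0.4166...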

4. ์„ ํ˜• ํšŒ๊ท€์˜ ์ •ํ™•๋„ ์ธก์ • : R2 score (๊ฒฐ์ • ๊ณ„์ˆ˜)

  • ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ํ•จ์ˆ˜ : sklearn.metrics.r2_score
  • 0 (๋ถ€์ •ํ™•ํ•จ) ≤ R2 ≤ 1 (์ •ํ™•ํ•จ)  ⇒ ๋ถ„๋ชจ์™€ ๋ถ„์ž๊ฐ€ ๊ฐ™๊ธฐ ๋•Œ๋ฌธ์— 0
  • $$1-\frac{SSE}{SST}$$
  • $$= 1 - \frac{์˜ˆ์ธก๊ฐ’์— ๋Œ€ํ•œ ๋ถ„์‚ฐ์˜ ํ•ฉ}{๋ถ„์‚ฐ์˜ ํ•ฉ}$$
  • (SST : ํ‰๊ท ์œผ๋กœ ๋ถ€ํ„ฐ ์–ผ๋งˆ๋‚˜ ๋–จ์–ด์ ธ ์žˆ๋Š”์ง€ ⇒ ๋ถ„์‚ฐ)
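
To make the definition concrete, here is a small sketch (with made-up numbers) computing R2 by hand from SSE and SST and comparing it with sklearn's r2_score:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 6.5, 9.4])

sse = np.sum((y_true - y_pred) ** 2)             # sum of squared prediction errors
sst = np.sum((y_true - y_true.mean()) ** 2)      # total sum of squares about the mean
print(1 - sse / sst, r2_score(y_true, y_pred))   # the two values agree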

 

Multivariate Linear Regression

: several variables (X1, X2, X3 … )

 

1. Dimensionality

  ⇒ The data points are scattered in space, and the task is to find the plane among them for which the MSE is minimized

  ⇒ With three or more dimensions, the fit is represented by a plane rather than a line
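
In equation form (a standard formulation written out here for completeness, not taken from the notes), the single weight w simply generalizes to one weight per feature:

$$ \hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b $$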

 

Practice - Linear Regression : Predicting Diabetes Progression

Univariate linear regression

# required imports for this practice
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.metrics import r2_score, mean_squared_error

dia = datasets.load_diabetes() # load the data
print(dia.DESCR) # print what data the set contains and how it is structured

dia.feature_names # print the feature names
dia.data.shape # check the shape : (442, 10)
dia.target.shape # check the target shape : (442,) one answer mapped to each of the 442 records

# convert to a DataFrame to inspect
df = pd.DataFrame(dia.data, columns=dia.feature_names)
df.head() # quick look at the first 5 rows, all columns

๋ฐ์ดํ„ฐ ์ƒ์„ฑ

# ์‹œ๊ฐํ™”๋ฅผ ์œ„ํ•ด ๋‹จ๋ณ€์ˆ˜ ์„ ํ˜•ํšŒ๊ท€๋ฅผ ํ•ด์•ผํ•˜๋Š”๋ฐ, ๊ทธ๋Ÿฌ๋ ค๋ฉด ํ”ผ์ฒ˜๋ฅผ ํ•˜๋‚˜๋งŒ ๊ณ ๋ฆ„
# bmi๋กœ ์‹ค์Šต
dia_X = df["bmi"].values
dia_X.shape # (442,) : ๋ฒกํ„ฐ ํ˜•ํƒœ์ž„ => ์‚ฌ์ดํ‚ท๋Ÿฐ์˜ ๋ฐ์ดํ„ฐ๋Š” 2์ฐจ์› ๋ฐฐ์—ด ํ˜•ํƒœ๋กœ ๋งŒ๋“ค์–ด์ค˜์•ผํ•จ => reshape
# reshape : 2์ฐจ์› ๋ฐฐ์—ด ํ˜•ํƒœ๋กœ ์ƒ์„ฑ
dia_X = df["bmi"].values.reshape(-1,1) #reshape(442,1) (o)

# ํ›ˆ๋ จ์…‹, ๊ฒ€์ฆ์…‹ ๋ถ„ํ™œ : ์Šฌ๋ผ์ด์‹ฑ ์ด์šฉ
dia_X_train = dia_X[:-20] # ๋’ค์—์„œ 20๋ฒˆ์งธ ๊นŒ์ง€, 
dia_X_test = dia_X[-20:] # ๋’ค์—์„œ ๋ถ€ํ„ฐ 20๋ฒˆ์งธ
dia_X_train.shape, dia_X_test.shape # (422,1),(20,1)

dia_y_train = dia.target[:-20] # (422,)
dia_y_test = dia.target[20:] #(20,)

Using a sklearn model

regr = linear_model.LinearRegression() # the parentheses are required; without them no instance is created

regr.fit(dia_X_train, dia_y_train) # train the model
regr.coef_ # slope
regr.intercept_ # intercept

y_pred = regr.predict(dia_X_test) # predictions on the test set

# visualize the predictions against the true values to check how well the model predicted
plt.scatter(dia_X_test, dia_y_test, label = "True Value")
plt.plot(dia_X_test, y_pred, color='r', label = "Predict")
plt.xlabel("bmi")
plt.ylabel("Progress")
plt.legend()

# compute R2
r2_score(dia_y_test, y_pred) # 0.4725... : not that high
# MSE
mean_squared_error(dia_y_test, y_pred)

Multivariate linear regression : using the two variables bmi and bp

# take only the bmi and bp features
dia_X = df[["bmi", "bp"]].values # no reshape needed; with 2 features it is already a matrix
dia_X.shape #(442,2)

# split into train and test sets : slicing, same as above
dia_X_train = dia_X[:-20] # everything except the last 20
dia_X_test = dia_X[-20:] # the last 20
dia_X_train.shape, dia_X_test.shape # (422,2),(20,2)

dia_y_train = dia.target[:-20] # (422,)
dia_y_test = dia.target[-20:] # (20,)

# train the model and predict
regr = linear_model.LinearRegression() # parentheses required, as before

regr.fit(dia_X_train, dia_y_train) # train the model
regr.coef_ # coefficients
regr.intercept_ # intercept

y_pred = regr.predict(dia_X_test) # predictions on the test set

# compute R2
r2_score(dia_y_test, y_pred) # 0.465... : slightly worse
# MSE
mean_squared_error(dia_y_test, y_pred)

Multivariate linear regression : using all features

# take all 10 features
dia_X = df.values
dia_X.shape # (442,10) : no reshape needed

# split into train and test sets : slicing, same as above
dia_X_train = dia_X[:-20] # everything except the last 20
dia_X_test = dia_X[-20:] # the last 20
dia_X_train.shape, dia_X_test.shape # (422,10),(20,10)

dia_y_train = dia.target[:-20] # (422,)
dia_y_test = dia.target[-20:] # (20,)

# train the model and predict
regr = linear_model.LinearRegression() # parentheses required, as before

regr.fit(dia_X_train, dia_y_train) # train the model
regr.coef_ # coefficients
regr.intercept_ # intercept

y_pred = regr.predict(dia_X_test) # predictions on the test set

# compute R2
r2_score(dia_y_test, y_pred) # 0.58.... : better than before
# MSE
mean_squared_error(dia_y_test, y_pred)

The KNN Algorithm

KNN (K-Nearest Neighbors)

  1. How it works
    • Choose a value of k
    • Compute the distance between the data point to predict and every data point
    • From the training set, select the K data points closest to the point to predict
    • Predict a value depending on whether the task is classification or regression
      • Classification : simply take the most common label among the nearest neighbors
      • Regression : average the neighbors' values
  2. Characteristics
    • Advantage : a simple, easy-to-understand model
    • Disadvantages
      • Gets slower as the data grows
      • ⇒ Reason : in higher dimensions, the amount of distance computation and classification work keeps growing
      • Strongly affected by outliers and missing values (= a higher chance of an odd classification)
    • The value of K matters a great deal
  3. Distance computation : Euclidean distance, from the Pythagorean theorem (a small numeric check follows the formula)

$$ d(p,q) = d(q,p) = \sqrt{(q_1-p_1)^2 + (q_2-p_2)^2 + \cdots + (q_n-p_n)^2} = \sqrt{\sum_{i=1}^n (q_i - p_i)^2} $$
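
A minimal numeric check of the distance formula (the points are made up for illustration):

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

dist = np.sqrt(np.sum((q - p) ** 2))  # sqrt((4-1)^2 + (6-2)^2 + 0^2) = 5.0
print(dist, np.linalg.norm(q - p))    # numpy's built-in norm gives the same value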

 

Practice - KNN : Classifying the Iris Data

Iris data overview

  1. Dataset : sklearn's iris dataset
  2. Features : sepal length, sepal width, petal length, petal width
  3. Rows : 150
  4. Target : the species of flower (3 kinds : setosa, versicolor, virginica)

Model code

neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
  1. n_neighbors : the number of neighbors
  2. weights : how the neighbors' votes are weighted (a quick comparison of the two options follows this list)
    • uniform : all neighbors get the same weight → distance does not matter
    • distance : weights are inversely proportional to the neighbor's distance → farther neighbors get less weight
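
A hedged sketch comparing the two weighting options on the iris data; the exact accuracies depend on the split and are not from the notes:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

# same k, two weighting schemes; only the vote weighting differs
for w in ("uniform", "distance"):
    clf = KNeighborsClassifier(n_neighbors=15, weights=w)
    clf.fit(X_train, y_train)
    print(w, accuracy_score(y_test, clf.predict(X_test)))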

๋ฐ์ดํ„ฐ ์…‹

from sklearn.datasets import load_iris

iris = load_iris() # ๋”•์…”๋„ˆ๋ฆฌ ํ˜•ํƒœ๋กœ ํ‚ค์— ๊ฐ ๋ฐ์ดํ„ฐ๋“ค์ด ์žˆ์Œ
# ์ฃผ์˜ : ํƒ€๊ฒŸ ๋ฐ์ดํ„ฐ๊ฐ€ 0 1 2 ์ˆœ์œผ๋กœ ์„ž์ด์ง€ ์•Š๊ณ  ๋˜์–ด์žˆ์Œ -> ์„ž์–ด์•ผํ•จ
# ์•ˆ์„ž์œผ๋ฉด ๊ฒ€์ฆ์€ 2๋ฒˆ๋งŒ ํ•˜๊ณ  ํ›ˆ๋ จ์€ 0 1๋กœ๋งŒ ํ›ˆ๋ จ๋จ

iris.data.shape #(150,4)
iris.feature_names # ํ”ผ์ฒ˜ ์ด๋ฆ„
iris.target_names # ํƒ€๊ฒŸ ์ข…๋ฅ˜

X = iris.data[:,:2] # sepal length, sepal width ๋‘๊ฐœ๋งŒ ์‚ฌ์šฉ
y = iris.target

X.shape, y.shape #(150,2) (150,)

# train, test ์…‹์œผ๋กœ ๋ถ„๋ฅ˜
from sklearn.model_selection import train_test_split
# test set : 20% / random state : 0
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size =0.2, random_state = 0)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
# (120,2) (30,2) (120,) (30,)
# ์‚ฌ์ดํ‚ท๋Ÿฐ์˜ ๋ฐ์ดํ„ฐ์…‹์€ ์ž˜ ์ •์ œ๋˜์–ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ์—๋Ÿฌ๋‚  ์ผ์ด ์—†๋Š”๋ฐ
# ์—๋Ÿฌ๊ฐ€ ๋‚œ๋‹ค๋ฉด ๊ฑฐ์˜ ๋ฌด์กฐ๊ฑด shape๊ฐ€ ์ž˜๋ชป ๋˜์–ด์žˆ์–ด์„œ์ด๋ฏ€๋กœ shape๋ฅผ ์ž˜ ํ™•์ธํ•˜์ž

Creating and training the KNN object

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=15, weights='uniform')
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

์˜ˆ์ธก์˜ ์ •ํ™•๋„ ํ‰๊ฐ€

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred) #0.666666

์‹œ๊ฐํ™”

# ๋ถ„๋ฅ˜๋œ๊ฒƒ์„ ์ƒ‰์œผ๋กœ ํ‘œ์‹œํ•˜์—ฌ 2์ฐจ์› ๊ณต๊ฐ„์—์„œ 3๊ฐœ๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ
import matplotlib.pyplot as plt
X_train[y_train==0] # ํƒ€๊ฒŸ์ด 0์ธ๊ฒƒ๋งŒ ๊ณจ๋ผ์คŒ

# ๋ฐฉ๋ฒ• 1
plt.scatter(X_train[y_train==0,0], X_train[y_train==0,1])
plt.scatter(X_train[y_train==1,0], X_train[y_train==1,1])
plt.scatter(X_train[y_train==2,0], X_train[y_train==2,1])

# ๋ฐฉ๋ฒ• 2
for i in range(3):
	plt.scatter(X_train[y_train==i,0], X_train[y_train==i,1])
	
plt.legend()

# ํ•˜๋‚˜ ์„ ํƒํ•ด์„œ ์ด๊ฒŒ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์˜ˆ์ธก์ด ๋˜๋Š”๊ฑด์ง€ ํ™•์ธํ•˜๋Š” ์‹œ๊ฐํ™”
# 20๋ฒˆ์งธ ๋ฐ์ดํ„ฐ๋ฅผ ๊ณจ๋ผ์„œ ํ•ด๋‹น ์œ„์น˜์— x ํ‘œ์‹œ๋กœ ๊ทธ๋ ค์คŒ
# 20๋ฒˆ์งธ ๋ฐ์ดํ„ฐ ์ฃผ์œ„์— ์–ด๋–ค ๋ ˆ์ด๋ธ”์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ๋งŽ์€์ง€ ํ™•์ธ ๊ฐ€๋Šฅ => 2๋ฒˆ
plt.plot(X_test[20,0], X_test[20,1], cr='r', marker='x', markersize = 20)

clf.predict(X_test[20:21]) # ์˜ˆ์ธกํ•œ ๊ฐ’์ด ์ถœ๋ ฅ๋จ => 2๋ฒˆ

Confusion Matrix

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

import seaborn as sns
plt.figure(figsize=(5,4))
ax = sns.heatmap(cm, annot=True, fmt='d')
ax.set_title("confusion matrix")
ax.set_ylabel("True")
ax.set_xlabel("Predicted")
# shows how many were predicted correctly for each class, and how many were confused with which class

The Decision Tree Algorithm

Why this model matters

: many good models are being built on top of it

Decision tree form and characteristics

  1. Form : a binary tree
    • Why "tree" : it looks like a tree pulled up and placed roots-up
  2. Structure
    • Top : root
    • Bottom (the classification values) : leaf nodes
    • The questions (tests) : nodes
    • The test outcomes : branches
  3. Advantages
    • A white-box model : you can see why the model made a given prediction
    • → Most other models are hard to explain in this way
    • Needs little data preprocessing
  4. Disadvantages
    • Prone to overfitting (: predicts the training data well but does poorly on validation and real data)
    • Very sensitive to small changes in the training data

The decision tree method

: find conditions that split the data, tree-fashion, so that entropy goes from a high state to a low state

  1. Answer yes or no to some condition or range
  2. Narrow the conditions and ranges step by step to classify the data

What the questions are based on

  • This could be written with if-else, but it would be too complex, so it is not hand-coded
  • The model learns to place the optimal conditions by itself

A good decision tree model

: one in which the most optimal conditions have been placed

์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์ข…๋ฅ˜

→ ๋ณ€์ข…์ด ๋งŽ์Œ

  1. ID3
    • ๊ฐ€์žฅ ๊ธฐ๋ณธ์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜
    • ์ •๋ณด์ด๋“ ์ด์šฉํ•ด ํŠธ๋ฆฌ ๊ตฌ์„ฑ
    • ์‚ฌ์ดํ‚ท๋Ÿฐ์— ๋‚ด์žฅ๋จ
  2. CART
    • ID3์™€ ๊ฑฐ์˜ ๋น„์Šทํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜
    • ์ง€๋‹ˆ๋ถˆ์ˆœ๋„๋ฅผ ์ด์šฉํ•œ ํŠธ๋ฆฌ ๊ตฌ์„ฑ
    • ์‚ฌ์ดํ‚ท๋Ÿฐ์— ๋‚ด์žฅ๋จ
  3. C4.5, C5.0
    • ID3๋ฅผ ๊ฐœ์„ ํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜
  4. CHAID, MARS

Splitting criteria

[ Entropy ]

  1. Meaning : the impurity of a given dataset, how mixed and unclassified it is ⇒ the amount of information we do not yet have
  2. Characteristics
    • High entropy : the dataset mixes records of different classes (= not classified)
    • Low entropy : the dataset mostly holds records of the same class
    • Range : 0 (least mixed) ~ 1 (most mixed, in the two-class case)
    • Impurity and information content are proportional : if the data is already cleanly classified, there is nothing left to learn
  3. Formula : $$ Entropy = -\sum_{i=1}^{m} p_i \log_2 (p_i), \qquad p_i = \frac{freq(C_i, S)}{|S|} $$

  (S : the given set of data, C : the set of class values, freq(Ci, S) : the number of records in S belonging to Ci, |S| : the number of records in S)

 

   4. ์ •๋ณด ์ด๋“(information gain) : ์šฐ๋ฆฌ๊ฐ€ ์‹œ์Šคํ…œ์˜ ํ†ต๊ณ„๋ฅผ ์•Œ๊ฒŒ๋˜์–ด ๊ฐ์†Œํ•˜๋Š” ์—”ํŠธ๋กœํ”ผ

$$Information Gain = Entropy(Parent) – (weight) * Entropy(Child)$$
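
A small worked example (made-up class counts, not from the notes) of the two formulas above, for a parent node with 10 records split into two children:

import numpy as np

def entropy(counts):
    # entropy of a node, given the per-class record counts freq(Ci, S)
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()            # class probabilities p_i
    return -np.sum(p * np.log2(p))

parent = [5, 5]                        # 5 records of each class => entropy = 1.0
left, right = [4, 1], [1, 4]           # a candidate split into two children
gain = entropy(parent) - (5/10) * entropy(left) - (5/10) * entropy(right)
print(entropy(parent), gain)           # 1.0, about 0.278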

Practice - Building and Visualizing a Decision Tree Model : Classifying the Iris Data

Loading and splitting the data

# load the data
from sklearn.datasets import load_iris
iris = load_iris()

# train_test_split
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)

decision tree ๋ชจ๋ธ : max_depth = 2

from sklearn import tree

clf = tree.DecisionTreeClassifier(max_depth=2, criterion='entropy') # ๊ธฐ๋ณธ์€ gini 
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy_score(y_test, y_pred) #0.9555555555555556

# visualize the binary tree -> a white-box model
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(25,20))
# why assign to _ (underscore) : plot_tree returns a value, and if you do not
# capture it in a variable it gets printed messily, so the return value is suppressed
# the tree module provides the plot_tree method
_ = tree.plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)

 

When max_depth is not set

# model setup
# setting it to None places no limit on the depth
clf = tree.DecisionTreeClassifier(max_depth=None)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred) # 0.9333333333333333

# visualization
fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(clf, 
                   feature_names=iris.feature_names,  
                   class_names=iris.target_names,
                   filled=True)