
[AI Study] Section 7: Neural Networks and Deep Learning

egahyun 2024. 12. 26. 19:35

Neural Network

์ „ํ†ต์ ์ธ ML ํ•™์Šต ๋ฐฉ๋ฒ•

  1. ๋ฐ์ดํ„ฐ ๊ตฌ์„ฑ
    1. ์ปดํ“จํ„ฐ๊ฐ€ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋„๋ก ๋„๋ฉ”์ธ ์ง€์‹ ๋ฐ ํ†ต๊ณ„ํ•™์  ์ง€์‹์„ ๋ฐ”ํƒ•์œผ๋กœ ํ”ผ์ฒ˜๋ฅผ ๊ตฌ์„ฑํ•ด ๋ฐ์ดํ„ฐ๋ฅผ ๊ตฌ์„ฑ
      ⇒ ๋„๋ฉ”์ธ ์ง€์‹์„ ๊ฐ€์ง„ ์„ ๋ฐ•์‚ฌ๊ธ‰ ์ธ์žฌ ํ•„์š”
  2. ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํ•™์Šต

๋”ฅ๋Ÿฌ๋‹

  1. ํŠน์ง•
    • ์ค‘์š”ํ•œ Feature ๋ฅผ ์Šค์Šค๋กœ ๊ตฌ๋ถ„ํ•˜์—ฌ weight ๋ฅผ ๋ถ€์—ฌ
      ⇒ ์‚ฌ๋žŒ์ด ์ง€์ •ํ•œ ํ”ผ์ฒ˜ : over-specified, incomplete ์œ„ํ—˜์„ฑ + ์ž‘์„ฑ์— ๋งŽ์€ ์‹œ๊ฐ„ ์†Œ์š”
    • ์—ฌ๋Ÿฌ ์ธต์— ๊ฑธ์นœ ๋‚ด๋ถ€ parameter ๋ฅผ ์Šค์Šค๋กœ ํ•™์Šต • ์ ์šฉํ•˜๊ธฐ ์‰ฝ๊ณ  ๋น ๋ฅด๋‹ค.
    • Raw data ๋ฅผ ๊ฑฐ์˜ ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉ – computer vision, ์–ธ์–ด์ฒ˜๋ฆฌ ๋“ฑ (ex, image, sound, characters, words)
    • Unsupervised, supervised learning ๋ชจ๋‘ ๊ฐ€๋Šฅ
    • ์ด๋ฏธ์ง€ ์ธ์‹, ๋Œ€ํ™”/์–ธ์–ด ๋ฌธ์ œ์— ํƒ์›”ํ•œ ์„ฑ๋Šฅ

Artificial Neural Network

  1. Artificial Neuron (perceptron)

    • Shape: modeled after the neural network of the human brain → the actual mechanism differs from the brain's
    • Structure: Pre-Activation part (front) + Activation part (back)
    • Pre-Activation: linear regression enters as-is
      (x: the individual features, output: what we want to predict, w: what must be learned)
      $$a(x) = b + \sum_i w_i x_i = b + w^T x$$
    • Activation: an activation function such as sigmoid is applied
      (w: connection weights, b: bias, g: activation function)
      $$h(x) = g(a(x)) = g\left(b + \sum_i w_i x_i\right)$$
  2. Activation functions:
    • Sigmoid: \( \sigma(x) = \dfrac{1}{1+e^{-x}} \)
      → outputs values between 0 and 1
    • Tanh: \( \tanh(x) \)
      → outputs values between -1 and 1
    • ReLU: \( \max(0, x) \)
      → used in most modern deep learning
    • Leaky ReLU: \( \max(0.1x, x) \)
      → ReLU has zero gradient for negative inputs, so this variant gives them a small slope
    • ELU: \( f(x) = x \) for \( x \ge 0 \); \( f(x) = \alpha(e^x - 1) \) for \( x < 0 \)
    • Softmax: \( \sigma(z)_j = \dfrac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \)  (for j = 1, …, K)
      → for multi-class classification, normalizes the outputs into a probability distribution (total = 1)
      Linear: pre-activation, Softmax: activation

      → \( \dfrac{e^{2.0}}{e^{2.0} + e^{1.0} + e^{0.1}} \approx 0.66 \) (the probability that y is class 0 is about 66%)

     
  3. ์ž‘๋™ ์›๋ฆฌ

  • ๊ตฌ์„ฑ
    • input feature (2 : ์ฒด์ค‘, 9 : ํ˜ˆ์••), output : ๋‹น๋‡จ๋ณ‘ ์ง„ํ–‰๋ฅ  (1๊ฐœ)
    • ํžˆ๋“ ๋ ˆ์ด์–ด 1๊ฐœ๊ฐ€ 3๊ฐœ์˜ ๋‰ด๋Ÿฐ์œผ๋กœ ๊ตฌ์„ฑ๋œ ์‹ ๊ฒฝ๋ง
    • ๊ฐ€์ค‘์น˜ : ํ•™์Šต์„ ํ†ตํ•ด ์Šค์Šค๋กœ ๋ถ€์—ฌ๋จ
  • ์ฒด์ค‘์ด๋ผ๊ณ  ํ•˜๋Š” ํ”ผ์ฒ˜๋Š” ํžˆ๋“ ๋ ˆ์ด์–ด์˜ ๋‰ด๋Ÿฐ 3๊ฐœ์™€ ์—ฐ๊ฒฐ๋จ → ์—ฐ๊ฒฐ์‹œ, ๊ฐ€์ค‘์น˜๊ฐ€ ๋ถ€์—ฌ๋จ
  • pre-activation : 7.6 (2 * 0.2 + 9 * 0.8)
  • activation : 0.9994 (ํ™œ์„ฑํ™” ํ•จ์ˆ˜์— 7.6์„ ๋„ฃ์–ด์„œ ๋‚˜์˜จ ๊ฐ’)
  • ํžˆ๋“  ๋ ˆ์ด์–ด → output : 1.79 (0.4 * 0.9994 + 0.5 * 1.000 + 0.9 * 0.9984)
  • output ํ™œ์„ฑํ™” ํ•จ์ˆ˜
    • linear regression → ๊ทธ๋Œ€๋กœ
    • ์ด์ง„ ๋ถ„๋ฅ˜ → sigmoid
    • ๋‹ค์ค‘ ๋ถ„๋ฅ˜ → softmaxNeural Network
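The pre-activation/activation steps above can be sketched in a few lines of NumPy, reusing the numbers from the notes (inputs 2 and 9, weights 0.2 and 0.8, and the softmax scores [2.0, 1.0, 0.1]); the bias is omitted, as in the worked example.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Normalize a score vector into a probability distribution (sums to 1)."""
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

# One neuron with the inputs from the notes: x = [2 (weight), 9 (blood pressure)]
x = np.array([2.0, 9.0])
w = np.array([0.2, 0.8])   # connection weights
z = w @ x                  # pre-activation: 2*0.2 + 9*0.8 = 7.6
a = sigmoid(z)             # activation: very close to 1

print(z, round(a, 4))

# Softmax over the example scores [2.0, 1.0, 0.1]
p = softmax(np.array([2.0, 1.0, 0.1]))
print(np.round(p, 2))
```

The first softmax entry comes out to about 0.66, which the notes round up to 0.7.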

How Neural Networks are trained

Gradient descent

  1. Goal: find the parameters (θ) that minimize the difference between actual and predicted values
  2. Method: define a loss function and adjust the parameters (θ) so that its value converges toward 0
    → follow the slope downward via differentiation, searching for the minimum
  3. Derivative (slope of the tangent line)

    • Meaning: how much y changes when x increases by a given amount
    • J(w) (= L(w)): the loss function with parameter w → e.g., MSE
  4. Optimization: the w and b that minimize the loss function
    • Direction: gradient (derivative of the cost function)
      : when searching for the w that minimizes the loss, the sign of the gradient tells us which way to move
      • gradient < 0: increase the current w
      • gradient > 0: decrease the current w
    • Step size: learning rate
      → must be small enough not to overshoot the optimal weights
    • Weight update rule: new w = old w − (learning rate) × (gradient)
  5. Parameter updates in linear regression → repeat the steps below to approach the optimum
    • \( y = \theta_0 + \theta_1 x \) (= y = b + wx)

    • Loss function → the mean of (actual − predicted)²: MSE
      \( L(\theta_0, \theta_1) = \frac{1}{m}\sum^{m}_{i=1} \left(y_i - (\theta_0 + \theta_1 x_i)\right)^2 \) ( \( y_i \): the ground-truth label)

    • Gradient: partial derivatives of the loss function
      \( \frac{\partial L(\theta_0, \theta_1)}{\partial \theta_1} = -\frac{2}{m}\sum^{m}_{i=1} x_i \left(y_i - (\theta_0 + \theta_1 x_i)\right) \)
      \( \frac{\partial L(\theta_0, \theta_1)}{\partial \theta_0} = -\frac{2}{m}\sum^{m}_{i=1} \left(y_i - (\theta_0 + \theta_1 x_i)\right) \)

    • Update: update b and w
      \( \theta_1 := \theta_1 - \alpha \frac{\partial L(\theta_0, \theta_1)}{\partial \theta_1} \)
      \( \theta_0 := \theta_0 - \alpha \frac{\partial L(\theta_0, \theta_1)}{\partial \theta_0} \)
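A minimal sketch of the update loop above: fit y = θ0 + θ1·x with plain gradient descent on MSE. The data points (generated from y = 1 + 2x) and the learning rate are made up for illustration.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x            # ground truth: theta0 = 1, theta1 = 2

theta0, theta1 = 0.0, 0.0    # start from zero
alpha = 0.05                 # learning rate
m = len(x)

for _ in range(2000):
    err = y - (theta0 + theta1 * x)       # residuals
    grad1 = -2.0 / m * np.sum(x * err)    # dL/d(theta1)
    grad0 = -2.0 / m * np.sum(err)        # dL/d(theta0)
    theta1 -= alpha * grad1               # update rule: new = old - alpha * gradient
    theta0 -= alpha * grad0

print(round(theta0, 3), round(theta1, 3))  # converges toward (1.0, 2.0)
```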

Loss function (= cost function, objective function)

: the function that gradient descent and backpropagation operate on / must be differentiable

→ You can write your own or use an existing one (in practice, an existing one is used almost always).

→ Depending on the problem, there are roughly three choices:

  1. Linear Regression: MSE (Mean Squared Error)

    $$MSE = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2$$
    → optimize the w and b that minimize the MSE

  2. Binary Classification (Logistic Regression): Binary Cross-Entropy
    $$J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \left( (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) + y^{(i)} \log(h_\theta(x^{(i)})) \right) \right]$$
    Loss as a function of the prediction, by ground-truth label:
    1. \( \text{If } y^{(i)} = 1: \ J(\theta) = -\log h_\theta(x^{(i)}) \)
      where \( h_\theta(x^{(i)}) \) should be close to 1
    2. \( \text{If } y^{(i)} = 0: \ J(\theta) = -\log\left(1 - h_\theta(x^{(i)})\right) \)
      where \( h_\theta(x^{(i)}) \) should be close to 0

  3. Multi-Class Classification: Categorical Cross-Entropy
    (\( t_i \): the non-zero target, C: the number of classes)
    • softmax: the activation function f
    • CE = the negated sum over classes of t times the log of the probability distribution output by softmax
    • Example

      - True value: one-hot — a single 1, the rest 0
      - Prediction: the values predicted by the model
      - Currently: the probability of class "7" is 60% → eventual goal: push the probability of "7" toward 100%
      - L_CE = -log 0.6
      - Entries that are not the true class are multiplied by 0
      - This loss measures the gap between the model's current prediction and the true value
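A small sketch of both cross-entropy losses above; the probability values are made up to match the notes' example (true class predicted at 0.6).

```python
import numpy as np

def binary_cross_entropy(y_true, p):
    """BCE for one example: -[y*log(p) + (1-y)*log(1-p)]."""
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(t, p):
    """CCE: -sum(t_i * log(p_i)); with a one-hot t, only the true class survives."""
    return -np.sum(t * np.log(p))

# One-hot target: the middle class (the "7" in the notes) is the true class
t = np.array([0.0, 1.0, 0.0])
p = np.array([0.3, 0.6, 0.1])   # model's predicted distribution

cce = categorical_cross_entropy(t, p)
bce = binary_cross_entropy(1, 0.6)

print(round(cce, 4))   # -log 0.6 ≈ 0.5108
print(round(bce, 4))   # same value for y=1, p=0.6
```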

Backpropagation

: a technique for updating all of the network's parameters in the direction that minimizes the loss function

→ the method by which neural networks are trained

→ the weights across the hidden layers must be adjusted so that the final prediction approximates the label

Forward Propagation

: identical to how the neural network normally operates

 

Backward Propagation

  1. Idea: the network is a composite function, so the chain rule applies to its derivative.
  2. Chain rule

      - We can compute how much p changes when w1 is changed ⇒ the rate of change of p with respect to w1 ⇒ i.e., a gradient
      - A quantity computed once can often be reused many times (dramatically reducing computation)
      - This is what makes gradient descent feasible
  3. EXAMPLE) 2 features / 1 hidden layer / sigmoid activation

  <Forward pass>

  • Compute the pre-activation value z from w1 and x1
  • Feed z into the sigmoid activation function
  • The result is the output → treat it as the prediction a
  • Compute the loss: feed a and y (the label) into the cross-entropy loss function

  <Backward pass>
  : adjust w to reduce the loss (gradient < 0 ⇒ add by the learning rate / gradient > 0 ⇒ subtract)

  1. Derivative of the binary cross-entropy loss
    $$\frac{dL}{da} = -\left[y\frac{1}{a} - (1-y)\frac{1}{1-a}\right]$$
  2. Derivative of the sigmoid function
    $$\frac{da}{dz} = \sigma(z)(1-\sigma(z))$$
    ⇒ from these two, \( \frac{dL}{dz} \) can be computed

  3. Computing \( \frac{dz}{dw_1} \):
    differentiate \( z = w_1x_1 + w_2x_2 + b \) with respect to w1 ⇒ x1

    $$\frac{dL}{dw_1} = \frac{dL}{dz}\,\frac{dz}{dw_1} = -\left[y\frac{1}{a} - (1-y)\frac{1}{1-a}\right] \cdot \sigma(z)(1-\sigma(z)) \cdot x_1$$
  4. Procedure
    • For each input, when there are multiple hidden layers, compute the forward-pass output layer by layer
      - How: at each neuron, compute wx + b (pre-activation) ⇒ pass it through the activation function ⇒ emit the output
    • On reaching the output layer, make a prediction
    • Compute the cost function measuring the gap between prediction and ground truth
    • Via backpropagation, pass the loss back to the earlier layers and compute their gradients
    • Update each layer's weights according to the error term
      ⇒ update according to whether the gradient is < 0 or > 0
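A minimal sketch of the chain rule above for one sigmoid neuron with binary cross-entropy. All the numbers are made up. The analytic product (dL/da)·(da/dz)·(dz/dw1) simplifies to (a − y)·x1, and we verify it against a numerical finite-difference gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w1, w2, b, x1, x2, y):
    a = sigmoid(w1 * x1 + w2 * x2 + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

w1, w2, b = 0.3, -0.2, 0.1     # made-up weights
x1, x2, y = 1.5, 2.0, 1.0      # made-up input and label

z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)

dL_da = -(y / a - (1 - y) / (1 - a))    # derivative of BCE w.r.t. a
da_dz = a * (1 - a)                     # derivative of sigmoid
dz_dw1 = x1                             # derivative of z w.r.t. w1
grad_analytic = dL_da * da_dz * dz_dw1  # chain rule; equals (a - y) * x1

eps = 1e-6                              # numerical check via central differences
grad_numeric = (loss(w1 + eps, w2, b, x1, x2, y) -
                loss(w1 - eps, w2, b, x1, x2, y)) / (2 * eps)

print(grad_analytic, grad_numeric)      # the two agree closely
```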

Global minimum, learning rate, optimizer

Data distributions in high-dimensional space

- A local minimum looks like the smallest value in its neighborhood, but a separate global minimum exists.
    ⇒ if the search gets stuck in a local minimum, the ideal weights cannot be found

- optimizer: its role is to guide the search toward the global minimum

 

[ Learning rate ]

: how large a step to take when adjusting the weights

  1. Default: 0.01
  2. If the learning rate is too high: training is fast → problem: it oscillates back and forth and cannot reach the global minimum
    If the learning rate is too low: descent is far too slow
  3. Remedy: adaptive learning-rate schemes → start large, then shrink gradually
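A minimal sketch of the "start large, then shrink gradually" idea: an exponentially decaying learning-rate schedule. The initial rate and decay factor (0.1 and 0.96) are illustrative choices, not values from the notes.

```python
# learning rate for epoch t: lr(t) = initial_lr * decay^t
initial_lr = 0.1
decay = 0.96

lrs = [initial_lr * decay ** epoch for epoch in range(10)]
print([round(lr, 4) for lr in lrs])  # starts at 0.1 and shrinks each epoch
```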

[Optimization methods: SGD (Stochastic Gradient Descent)]

  1. Method
    • Compute the loss one data point at a time
    • The gradient alternates between + and − as the whole dataset moves toward the minimum
  2. Problems
    • With large datasets, processing one point at a time takes far too long
    • The update trajectory oscillates widely
  3. Remedy 1: batch gradient descent
    • Compute the gradient over the entire dataset at once → average the gradients
      → moving in the averaged direction updates the weights toward the global minimum
    • Problem: the entire dataset must fit in memory (limited-memory issue)
    • Advantage: a smooth update trajectory
  4. Remedy 2: mini-batch gradient descent
    • Choose a batch size → apply gradient descent to small samples at a time → move toward the global minimum
    • An update trajectory between the two extremes
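A minimal sketch of mini-batch gradient descent on a linear model (y = w·x + b, MSE loss): shuffle each epoch, slice the data into batches, and average the gradient over each batch. The data and batch size are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 0.5                 # ground truth: w = 3, b = 0.5

w, b = 0.0, 0.0
alpha, batch_size = 0.1, 32

for epoch in range(200):
    idx = rng.permutation(len(x))          # shuffle each epoch
    for start in range(0, len(x), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = x[batch], y[batch]
        err = yb - (w * xb + b)
        w -= alpha * (-2.0 * np.mean(xb * err))  # gradient averaged over the batch
        b -= alpha * (-2.0 * np.mean(err))

print(round(w, 2), round(b, 2))   # approaches (3.0, 0.5)
```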

Momentum

: keep accelerating along the maintained direction → in order to reach the global minimum faster!

  • Adjust the parameters so that there is little vertical change and large horizontal change
  • When applying gradient descent, keeping some of the previous update direction prevents the amplitude from growing, so the minimum is reached faster
  1. Escaping local minima in high-dimensional space becomes possible
    • Why plain gradient descent gets stuck
      : relying only on where the gradient = 0 ignores the direction the search had been moving toward the global minimum,
        so it ends up oscillating around the local minimum
    • How momentum escapes
      : the previous direction of motion is kept mathematically and built into the update formula
  2. Escaping saddle points becomes possible
    • Before: the gradient vanishes → it hits 0 and the search stops at that point
    • With momentum: inertia is preserved, so the search keeps descending along its direction
  3. It is a parameter set by a human!! → it is not found automatically

Optimizers: finding minimum or maximum values

→ Here we study optimizers that find the minimum

→ The trajectory toward the global minimum differs by algorithm

→ Each optimizer implements its own algorithm for escaping saddle points

→ Optimizer performance depends on what kind of high-dimensional landscape the data forms

 

  1. Kinds
    • Stochastic Gradient Descent Optimizer
    • RMSProp Optimizer
    • Adagrad Optimizer
    • Adam Optimizer, etc.

ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ์™€ ๊ณผ์ ํ•ฉ ๋ฐฉ์ง€ ๊ธฐ๋ฒ•

Epoch

: ์ „์ฒด dataset ์ด neural network ์„ ํ†ตํ•ด ํ•œ๋ฒˆ ์ฒ˜๋ฆฌ๋œ ๊ฒƒ

  1. ํŠน์ง•
    • Epoch ์€ model ์˜ training ์‹œ์— hyper parameter ๋กœ ํšŸ์ˆ˜ ์ง€์ •
    • ํ•˜๋‚˜์˜ epoch ์€ ํ•œ๋ฒˆ์— ์ฒ˜๋ฆฌํ•˜๊ธฐ ์–ด๋ ค์šด size ์ด๋ฏ€๋กœ ์—ฌ๋Ÿฌ ๊ฐœ์˜ batch ๋กœ ๋‚˜๋ˆ„์–ด ์ฒ˜๋ฆฌ (๋ฉ”๋ชจ๋ฆฌ ์ด์Šˆ)
      ⇒ ์ „์ฒด ๋ฐ์ดํ„ฐ (1 ์—ํญ)๋ฅผ ๋ฐฐ์น˜ ์‚ฌ์ด์ฆˆ๋กœ ๋‚˜๋ˆ„์–ด ์‚ฌ์šฉ (๋ฏธ๋‹ˆ๋ฐฐ์น˜)
    • Parameter training ์„ ์œ„ํ•ด์„œ๋Š” ์—ฌ๋Ÿฌ ๋ฒˆ epoch ์„ ๋ฐ˜๋ณตํ•ด์•ผ ํ•œ๋‹ค.
      ⇒ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋ฐ˜๋ณตํ•˜์—ฌ ์—…๋ฐ์ดํŠธ๋ฅผ ํ•จ
    • One epoch ๋‚ด์—์„œ์˜ iteration ํšŸ์ˆ˜ : total sample size / batch size
      ⇒ Ex) 1 epoch = 2000 training example / 500 batches = 4 iterations
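The iteration arithmetic above can be checked directly:

```python
# iterations per epoch = total sample size / batch size
total_samples = 2000
batch_size = 500
iterations_per_epoch = total_samples // batch_size
print(iterations_per_epoch)  # 4
```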

Hyper-parameters

  1. Parameters: w, b → learned automatically
  2. Hyperparameters: everything other than the parameters → specified by a human
    • Learning rate: \( \alpha \)
    • Momentum
    • Number of layers: the number of hidden layers
      ⇒ too few: the model cannot learn enough of the data's detail
      ⇒ more is not always better / the number of weights explodes and far more data becomes necessary

    • Dropout rate: the fraction of neurons temporarily switched off and back on to prevent overfitting
      ⇒ prevents the network from becoming strongly dependent on particular neurons
      ⇒ guards against losing generalization ability / overfitting / depending on specific features
      ⇒ Dropout Regularization: preventing overfitting through dropout, which selects neurons at random
    • Number of epochs
    • Batch size
  3. How to choose hyperparameter values: there are no fixed rules
    • Refer to similar models
    • Guess from experience
    • Grid search: try every combination
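A minimal grid-search sketch for the "try every combination" idea: enumerate the full Cartesian product of a few candidate values and keep the best. `evaluate` here is a made-up stand-in for training a model and returning a validation score.

```python
from itertools import product

def evaluate(lr, batch_size, n_layers):
    # Placeholder score; in practice this would train and validate a model.
    # This toy score peaks at lr=0.01, batch_size=32, n_layers=2.
    return -abs(lr - 0.01) - abs(batch_size - 32) / 100 - abs(n_layers - 2)

grid = {
    "lr": [0.1, 0.01, 0.001],
    "batch_size": [16, 32, 64],
    "n_layers": [1, 2, 3],
}

best_score, best_params = float("-inf"), None
for lr, bs, nl in product(grid["lr"], grid["batch_size"], grid["n_layers"]):
    score = evaluate(lr, bs, nl)
    if score > best_score:
        best_score, best_params = score, (lr, bs, nl)

print(best_params)  # (0.01, 32, 2) for this toy score
```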

 

Introduction to TensorFlow and the principle of regression with a neural network

What is TensorFlow?

: the deep learning framework Google used internally, released as open source

  1. Installing TensorFlow
# On a computer without a GPU
# run in the Anaconda Prompt
pip install --upgrade tensorflow

# type `python` to enter the Python interpreter
import tensorflow as tf
tf.__version__ # check the version currently in use

# On a computer with a supported GPU,
# see the GPU support guide for how to install the CUDA libraries before use

 

  2. Keras

 

# For beginners: build the model with the Sequential API
# covers most models
model = tf.keras.models.Sequential([
	tf.keras.layers.Flatten()
])

# For experts: the Subclassing API -> uses Python's class syntax
# subclass tf.keras.Model
class MyModel(tf.keras.Model):
	def __init__(self):
		super(MyModel, self).__init__()
		self.conv1 = Conv2D(32, 3, activation='relu')
		# ... more layers
	# forward pass
	def call(self, x):
		x = self.conv1(x)
		# ... more layers
		return x
model = MyModel()
# backward pass
with tf.GradientTape() as tape:
	logits = model(images)
	loss_value = loss(logits, labels)
grads = tape.gradient(loss_value, model.trainable_variables)

 

[ Traditional linear regression vs. linear regression built as a neural network ]

  1. Traditional linear regression
    1. One layer
      : a single-layer neural network with the input connected directly to the output, with no hidden layer
      ⇒ with one layer, the network has the same shape as traditional linear regression (mathematically identical)
      It captures only linear characteristics.
  2. Neural network
    1. Multiple hidden layers
      • Hidden-layer nodes: each neuron splits into a linear pre-activation part and a nonlinear activation part
        ⇒ within the hidden layers, linear and nonlinear patterns can both be captured together

Practice: linear regression with a neural network

Boston house price prediction

  • Structure: 13 independent variables (features) + 1 dependent variable (median house price)

Loading + cleaning the data

# load the data
df_boston = pd.read_csv("boston_house.csv", index_col=0)
boston = df_boston.drop('MEDV', axis=1)
target = df_boston.pop('MEDV')

X = boston.values
y = target.values

print(X.shape)  # (506, 13)
print(y.shape)  # (506,)

# train / test split
X_train, X_test, y_train, y_test = train_test_split(X, y)

# feature scaling
sc = MinMaxScaler()
X_train = sc.fit_transform(X_train)
X_test  = sc.transform(X_test)

Model design, model compilation, and training → this is where the difference lies

๋ชจ๋ธ ์„ค๊ณ„

# sequntial ๋ชจ๋ธ ์‚ฌ์šฉ : ๊ณ„์† addํ•ด๊ฐ€๋ฉด์„œ ๋ชฏ๋ชจ๋ธ์„ ๋งŒ๋“ค๋ฉด 
model = Sequential()
# ํžˆ๋“  ๋ ˆ์ด์–ด ๊ฐœ์ˆ˜ : 64 / input shape : tuple ํ˜•ํƒœ๋กœ ํ•ด์•ผํ•จ -> (13, ) / ํ™œ์„ฑํ™”ํ•จ์ˆ˜ = ReLU
model.add(Dense(64, input_shape=(13,), activation='relu'))
# ์ธํ’‹์„ ์ง€์ •์•ˆํ•ด๋„ ๋จ (์ฒซ๋ฒˆ์งธ ๋ ˆ์ด์–ด์˜ ์ถœ๋ ฅ์ด ์ธํ’‹์ด ๋  ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์—) / ํ™œ์„ฑํ™”ํ•จ์ˆ˜ = ReLU
model.add(Dense(32, activation='relu'))
# 1๊ฐœ๋กœ ์—ฐ๊ฒฐ / ํ™œ์„ฑํ™”ํ•จ์ˆ˜ = ์—†์Œ (์ด์œ  : linear regression ๋ชจ๋ธ์„ ๋งŒ๋“ค ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ)
model.add(Dense(1))

model.summary()
# Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 64)                896  (13 * 64+1)     
                                                                 
 dense_1 (Dense)             (None, 32)                2080 (64 * 32+1)    
                                                                 
 dense_2 (Dense)             (None, 1)                 33   (32 * 1 + 1)     
                                                                 
=================================================================
Total params: 3,009
Trainable params: 3,009
Non-trainable params: 0
_________________________________________________________________

# ๋ชจ๋ธ ์ปดํŒŒ์ผ 
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae', 'mse'])

๋ชจ๋ธ ํ›ˆ๋ จ ๋ฐ ์˜ˆ์ธก

# ํ›ˆ๋ จ : loss, metrics ๊ฐ’์ด ์ €์žฅ
# batch_size : ํด์ˆ˜๋ก ์ข‹์œผ๋‚˜, ๋ฉ”๋ชจ๋ฆฌ ์ฐจ์ง€๊ฐ€ ๋Š˜์–ด๋‚จ (ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ)
# epoch : ๋ฐ์ดํ„ฐ๋ฅผ ๋ช‡๋ฒˆ ๋ณด์—ฌ์ค„ ๊ฒƒ์ธ๊ฐ€
# validation ๋ฐ์ดํ„ฐ ์ง€์ • : X_test, y_test๋กœ
# verbose : ํ›ˆ๋ จ ์‹œ, ์ถœ๋ ฅ ์ •๋„ (๋ณดํ†ต 1, 2 ์‚ฌ)
history = model.fit(X_train, y_train, batch_size=32, 
                    epochs=500, validation_data=(X_test, y_test), verbose=1)

# ๋ชจ๋ธ ํ‰๊ฐ€ : ๋งˆ์ง€๋ง‰์— ์ฐํžŒ ๊ฒƒ์„ ํ•œ๋ฒˆ ๋” ๋ฐ˜๋ณต                   
model.evaluate(X_test, y_test, verbose=0)

# ๋ชจ๋ธ ์˜ˆ์ธก
y_pred = model.predict(X_test)

# ์˜ˆ์ธก ํ‰๊ฐ€
# MSE(mean squared error) ๊ณ„์‚ฐ : Mean squared error: 7.38
print("Mean squared error: {:.2f}".format(mean_squared_error(y_test, y_pred)))

# R2 ๊ณ„์‚ฐ : R2 score: 0.91 (1์— ๊ฐ€๊นŒ์šธ ์ˆ˜๋ก ์Œ)
print("R2 score: {:.2f}".format(r2_score(y_test, y_pred)))

# ๊ฒฐ๊ณผ ์‹œ๊ฐํ™”
plt.scatter(y_test, y_test, label='true')
plt.scatter(y_test, y_pred, label='predict')
plt.xlabel('y_test')
plt.ylabel('y')
plt.legend()
plt.title('Boston House Price Prediction ($1,000)')

# ๋ชจ๋ธ ํ›ˆ๋ จ ๊ณผ์ • ์‹œ๊ฐํ™”
plt.plot(history.history['mse'], label='Train error')
plt.plot(history.history['val_mse'], label='Test error')
plt.ylim([0, 50])
plt.legend()

 

Blue: actual values, orange: predicted values → the predictions track the actual values closely

Comparison with sklearn's LinearRegression

from sklearn.linear_model import LinearRegression

# model design
regr = LinearRegression()
# model training
regr.fit(X_train, y_train)
# prediction
y_pred = regr.predict(X_test)

# the coefficients:
print('Coefficients: \n', regr.coef_)
print('Intercept: \n', regr.intercept_)

# MSE (mean squared error)
print("Mean squared error: {:.2f}".format(mean_squared_error(y_test, y_pred)))

# R2
print("R2 score: {:.2f}".format(r2_score(y_test, y_pred)))

plt.scatter(y_test, y_pred)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], '--', c='r')
plt.xlabel('y_test')
plt.ylabel('y_pred')
plt.title('Boston House Price Prediction ($1,000)')


์ž๋™์ฐจ ์—ฐ๋น„ ๊ณ„์‚ฐ

๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ + ์ •์ œ

# ๋งํฌ๋กœ ๋ถˆ๋Ÿฌ์™€์„œ ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค์šด๋กœ๋“œ ๋ฐ›๊ธฐ
data_path = tf.keras.utils.get_file("auto-mpg.data", 
        "<https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data>")
column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model year', 'origin']
rawdata = pd.read_csv(data_path, names=column_names, na_values="?", comment="\\t", sep=" ", skipinitialspace=True)

# null ๋ฐ์ดํ„ฐ ์‚ญ์ œ
rawdata.dropna(inplace=True)
data = rawdata.copy()
# ์›ํ•ซ ์ธ์ฝ”๋”ฉ
data = pd.get_dummies(data, columns=['cylinders', 'origin'])
# ๋ ˆ์ด๋ธ” ์ง€์ •
label = data.pop('mpg')
# train / test split
X_train, X_test, y_train, y_test = train_test_split(data.values, label.values)
# ํ”ผ์ฒ˜ ์Šค์ผ€์ผ๋ง
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

Regression Model Build

# Sequential model / 13 features after one-hot encoding
model = Sequential()
model.add(Dense(64, input_shape=(13,), activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1))
model.summary()

# compile the model
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae', 'mse'])

train / predict

# training
history = model.fit(X_train, y_train, batch_size=32, 
                    epochs=500, validation_data=(X_test, y_test), verbose=1)
# model evaluation: repeats the last reported metrics once more                   
model.evaluate(X_test, y_test, verbose=0)
# model prediction
y_pred = model.predict(X_test)

Computing R2

# R2 (closer to 1 is better)
print("R2 score: {:.2f}".format(r2_score(y_test, y_pred)))

Visualization

# visualizing the results
plt.scatter(y_test, y_test, label='true')
plt.scatter(y_test, y_pred, label='predict')
plt.xlabel('y_test')
plt.ylabel('y')
plt.legend()
plt.title('Auto MPG Prediction')

# check the fuel efficiency as a function of displacement
plt.scatter(X_test[:, 0], y_test, label='true value')
plt.scatter(X_test[:, 0], y_pred, label='predicted value')
plt.xlabel('displacement')
plt.ylabel('mpg')
plt.legend()

 

The predictions matched the actual values closely.


Binary classification with a neural network

(Logistic Regression, binary classification)

The sigmoid function

$$ f(z) = \frac{1}{1+e^{-z}} \quad (z = \theta X) $$

  1. About the function
    • z: the logit (the linear-regression formula wx + b goes inside the logit)
    • Forms an S curve: changes rapidly around 0.5
    • Bounded to [0, 1] ⇒ interpretable as a probability
    • Easy to differentiate (since it is built from e: \( \sigma'(z) = \sigma(z)(1-\sigma(z)) \))
  2. Comparison with linear regression
    • In common: the logit is computed the same way
    • Difference: binary classification applies a sigmoid on top

Logistic regression with a neural net

The neural-network regression model with only an activation function added to the output layer

  1. Output
    1. Activation function \( \sigma(z) \): the sigmoid function
    2. Neurons: 1 (probability from the sigmoid > 0.5 ⇒ 1 / < 0.5 ⇒ 0)

Practice: Malware Detection

Loading and cleaning the data

df = pd.read_csv('datasets/malware.csv', index_col=0)
# set X and y (pop the label first so it is not included in X)
y = df.pop('legitimate').values
X = df.values

X.shape, y.shape # ((10000, 54), (10000,))

# train / test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test  = sc.transform(X_test)

๋ชจ๋ธ ์„ค๊ณ„

# ๋ชจ๋ธ ๊ตฌ์„ฑ
model = tf.keras.Sequential()
model.add(Dense(32, input_shape=(54,), activation="relu"))
model.add(Dense(16, activation="relu"))
model.add(Dense(1, **activation="sigmoid"**))

model.summary()
# Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 32)                1760  (32 * 54+1)    
                                                                 
 dense_1 (Dense)             (None, 16)                528        
                                                                 
 dense_2 (Dense)             (None, 1)                 17        
                                                                 
=================================================================
Total params: 2,305
Trainable params: 2,305
Non-trainable params: 0

# ๋ชจ๋ธ ์ปดํŒŒ์ผ : ์ด์ง„๋ถ„๋ฅ˜ ์ด๋ฏ€๋กœ ์†์‹ค์€ binary_crossentropy
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=["accuracy"])

๋ชจ๋ธ ํ›ˆ๋ จ + ํ‰๊ฐ€

# ๋ชจ๋ธ ํ›ˆ๋ จ 
history = model.fit(X_train, y_train, epochs=20, batch_size=32,
                    validation_data=(X_test, y_test))

score = model.evaluate(X_test, y_test, verbose=0)

print(model.metrics_names)
print("Test score : {:.2f}".format(score[0]))
print("Test accuracy : {:.2f}".format(score[1]))

๊ฒฐ๊ณผ ์‹œ๊ฐํ™”

# ๋ชจ๋ธ ๊ฒฐ๊ณผ ์ •ํ™•๋„ ์‹œํ™”
plt.figure(figsize=(12,4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend(['train', 'test'])
# ๋ชจ๋ธ ๊ฒฐ๊ณผ ์†์‹ค : ์‚ด์ง ๊ณผ์ ํ•ฉ๋จ -> ๊ฐˆ์ˆ˜๋ก ์ ์  ์†์‹ค์ด ์˜ค๋ฅด๋Š” ์ค‘
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend(['train', 'test'])

Prediction

# y_pred is the probability of class 1, so compare with 0.5 to convert to True/False
y_pred = model.predict(X_test) > 0.5
accuracy_score(y_test, y_pred) # 0.9905