
[Paper Summary] Implementation of DNN-based real-time voice conversion and its improvements by audio data augmentation and mask-shaped device

egahyun 2025. 9. 2. 00:00

Paper link

 

https://www.semanticscholar.org/paper/Implementation-of-DNN-based-real-time-voice-and-its-Arakawa-Takamichi/fd5d3be9e8e293cce5b56f55de37af3d9734e0f4

 


 

DNN-based voice conversion system

: Implemented as a three-stage system (Analysis → Conversion → Synthesis)

 

[ Detailed system structure ]

1. Analysis

  • Extracts Mel-Cepstral Coefficients, pitch (F0), and power from the input speech
  • FFT Mel-Cepstral Coefficients analysis ⇒ avoids the high computational cost and processing delay caused by using a high-quality spectrum analyzer
  • Trajectory smoothing applied : removes high-frequency components and improves speech prediction accuracy

2. Conversion

  • Goal : convert the source speaker's speech features into the target speaker's speech features
  • Process
    • Source FFT Mel-Cepstral Coefficients ⇒ target WORLD Mel-Cepstral Coefficients (using a DNN)
    • Source FFT Mel-Cepstral Coefficients ⇒ target band-averaged aperiodicity (using a DNN)
    • Source log-scale F0 ⇒ target log-scale F0
    • Source power ⇒ target power
  • Loss function : MSE
  • Planned quality improvement : apply a GAN-based training method
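
As a rough illustration of the F0 step listed above: this summary does not say how the log-scale F0 is converted, so the sketch below assumes the widely used linear mean-variance transformation in the log-F0 domain, and adds the MSE loss for reference. The function names (`logf0_stats`, `convert_f0`, `mse`) and the F0 values are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, pstdev


def logf0_stats(f0_values):
    """Mean and standard deviation of log F0 over voiced frames (F0 > 0)."""
    logs = [math.log(f) for f in f0_values if f > 0]
    return mean(logs), pstdev(logs)


def convert_f0(f0, src_stats, tgt_stats):
    """Map one source F0 value (Hz) into the target speaker's log-F0 range.

    Assumption: linear mean-variance transformation in the log domain.
    Unvoiced frames (F0 == 0) are returned as 0.0.
    """
    if f0 <= 0:
        return 0.0
    src_mean, src_std = src_stats
    tgt_mean, tgt_std = tgt_stats
    log_f0 = (math.log(f0) - src_mean) / src_std * tgt_std + tgt_mean
    return math.exp(log_f0)


def mse(predicted, target):
    """Mean squared error between two equal-length feature vectors."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Hypothetical F0 tracks in Hz; 0 marks an unvoiced frame.
src_stats = logf0_stats([100.0, 0.0, 110.0, 120.0, 0.0, 130.0])
tgt_stats = logf0_stats([200.0, 220.0, 0.0, 240.0, 260.0])
converted = [convert_f0(f, src_stats, tgt_stats) for f in [105.0, 0.0, 125.0]]
```

With this transformation, a source frame at the source speaker's mean log F0 is mapped exactly to the target speaker's mean log F0, and unvoiced frames stay unvoiced.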

3. Synthesis

  • Goal : synthesize speech from the converted speech features
  • Speech parameter generation algorithm : recursive maximum likelihood parameter generation (R-MLPG)
  • Final waveform output algorithm : WORLD’s recursive waveform generation algorithm

 

Audio data augmentation

: Uses three data augmentation techniques (pitch shift → time stretch → time shift)

 

[ Details of each technique ]

  1. Pitch Shift
    • Slightly varies the pitch of the speech
    • Algorithm used : WSOLA algorithm + waveform resampling
  2. Time Stretch
    • Stretches and compresses the speech signal in time
    • Algorithm used : WSOLA algorithm
    • Effect : helps the model handle variations in speaking rate
  3. Time Shift
    • Randomly changes the start time of the FFT Mel-Cepstral Coefficients analysis within the frame shift length
    • Effect : keeps the feature extraction from depending on the analysis start time
    • Intended to compensate for a weakness of FFT Mel-Cepstral Coefficients
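
The time shift idea above can be sketched as follows: the analysis simply starts at a random sample offset smaller than the frame shift, so the same utterance yields slightly different frames each time. This is a minimal sketch, not the paper's implementation. The helper names (`frame_signal`, `time_shift_frames`), the 16 kHz sampling rate, and the 5 ms frame shift are assumptions; this summary only states the 25 ms frame length.

```python
import random


def frame_signal(signal, frame_len, frame_shift, start=0):
    """Cut `signal` into overlapping frames, beginning at sample `start`."""
    frames = []
    pos = start
    while pos + frame_len <= len(signal):
        frames.append(signal[pos:pos + frame_len])
        pos += frame_shift
    return frames


def time_shift_frames(signal, frame_len, frame_shift, rng=random):
    """Frame the signal from a random start offset in [0, frame_shift)."""
    offset = rng.randrange(frame_shift)
    return offset, frame_signal(signal, frame_len, frame_shift, start=offset)


# Assumed values: 16 kHz audio, 25 ms frames (400 samples), 5 ms shift (80 samples).
rng = random.Random(0)
signal = list(range(16000))  # stand-in for one second of audio samples
offset, frames = time_shift_frames(signal, frame_len=400, frame_shift=80, rng=rng)
```

Each call draws a new offset, so features extracted from the resulting frames (for example FFT Mel-Cepstral Coefficients) cover different alignments of the same waveform.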

 

Experimental setup

  1. dataset : 2 sets of 100 Japanese utterances (12 minutes in total) → training 0.9 / test 0.1
  2. Frame length: 25 ms, FFT size: 512
  3. FFT Mel-Cepstral Coefficients ⇒ WORLD Mel-Cepstral Coefficients model
    • Model configuration
      • MLP
      • input layer : 195 unit (39*5)
      • hidden layer : 2 layers (500 units each, activation function : Leaky ReLU)
      • output layer : 78 unit (39*2)
  4. FFT Mel-Cepstral Coefficients ⇒ Band-Aperiodicity conversion model
    • Model configuration
      • Single-layer Perceptron
      • input layer : 195 unit
      • output layer : 1 unit (activation function : sigmoid)
  5. Mel-Cepstral Coefficients normalization : mean 0, variance 1
  6. Optimizer: Adam
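
To make the layer sizes above concrete, here is a dependency-free sketch of the forward pass of both networks, with randomly initialized weights in place of trained ones. Training with Adam and MSE is omitted. The Leaky ReLU slope (0.01), the weight initialization, and all function names are assumptions, and this summary does not state what the factors in 39*5 and 39*2 represent.

```python
import math
import random


def leaky_relu(x, slope=0.01):  # slope is an assumed value
    return x if x > 0 else slope * x


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_layer(n_in, n_out, rng):
    """Random weights (n_out rows of n_in values) and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, activation=None):
    """Fully connected layer: activation(W x + b)."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(o) for o in out] if activation else out


def standardize(x, means, stds):
    """Normalize each dimension to mean 0, variance 1 using given statistics."""
    return [(v - m) / s for v, m, s in zip(x, means, stds)]


rng = random.Random(0)
# MLP: 195 -> 500 -> 500 -> 78
mcep_net = [init_layer(195, 500, rng), init_layer(500, 500, rng), init_layer(500, 78, rng)]
# Single-layer perceptron: 195 -> 1
ap_net = init_layer(195, 1, rng)


def convert_mcep(x):
    h = dense(x, mcep_net[0], leaky_relu)
    h = dense(h, mcep_net[1], leaky_relu)
    return dense(h, mcep_net[2])  # linear output layer


def convert_aperiodicity(x):
    return dense(x, ap_net, sigmoid)[0]
```

Feeding a standardized 195-dimensional input through `convert_mcep` returns a 78-dimensional vector, and `convert_aperiodicity` returns a single value between 0 and 1 because of the sigmoid output.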
