beginner/audio_data_augmentation_tutorial translation (#581)
bub3690 authored Sep 11, 2022
1 parent 06a6f70 commit 5edf398
Showing 1 changed file with 76 additions and 85 deletions.
161 changes: 76 additions & 85 deletions beginner_source/audio_data_augmentation_tutorial.py
@@ -1,14 +1,16 @@
# -*- coding: utf-8 -*-
"""
Audio Data Augmentation
=======================
*Translator*: Lee Jong Bub <https://github.com/bub3690>

``torchaudio`` provides a variety of ways to augment audio data.

In this tutorial, we look at ways to apply effects, filters,
RIR (room impulse response), and codecs.

At the end, we synthesize noisy speech over a phone from clean speech.
"""

import torch
@@ -19,10 +21,10 @@
print(torchaudio.__version__)

######################################################################
# Preparation
# -----------
#
# First, we import the modules and download the audio assets we use in this tutorial.
#

import math
@@ -39,64 +41,59 @@


######################################################################
# Applying effects and filtering
# ------------------------------
#
# :py:func:`torchaudio.sox_effects` allows for directly applying filters similar to
# those available in ``sox`` to Tensor objects and file object audio sources.
#
# There are two functions for this:
#
# - :py:func:`torchaudio.sox_effects.apply_effects_tensor` for applying effects
#   to Tensor.
# - :py:func:`torchaudio.sox_effects.apply_effects_file` for applying effects to
#   other audio sources.
#
# Both functions accept effect definitions in the form
# ``List[List[str]]``.
# This is mostly consistent with how the ``sox`` command works, but one caveat is
# that ``sox`` adds some effects automatically, whereas ``torchaudio``’s
# implementation does not.
#
# For the list of available effects, please refer to `the sox
# documentation <http://sox.sourceforge.net/sox.html>`__.
#
# **Tip** If you need to load and resample your audio data on the fly,
# then you can use :py:func:`torchaudio.sox_effects.apply_effects_file`
# with effect ``"rate"``.
#
# **Note** :py:func:`torchaudio.sox_effects.apply_effects_file` accepts a
# file-like object or path-like object.
# Similar to :py:func:`torchaudio.load`, when the audio format cannot be
# inferred from either the file extension or header, you can provide
# argument ``format`` to specify the format of the audio source.
#
# **Note** This process is not differentiable.
#

# Load the data
waveform1, sample_rate1 = torchaudio.load(SAMPLE_WAV)

# Define effects
effects = [
    ["lowpass", "-1", "300"],  # apply single-pole lowpass filter
    ["speed", "0.8"],  # reduce the speed
    # This only changes sample rate, so it is necessary to
    # add `rate` effect with original sample rate after this.
    ["rate", f"{sample_rate1}"],
    ["reverb", "-w"],  # Reverberation gives some dramatic feeling
]

# Apply effects
waveform2, sample_rate2 = torchaudio.sox_effects.apply_effects_tensor(waveform1, sample_rate1, effects)

print(waveform1.shape, sample_rate1)
print(waveform2.shape, sample_rate2)
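
######################################################################
# As a minimal sketch of the ``List[List[str]]`` convention described
# above: the helper below (``speed_change_effects``, a hypothetical name
# that is not part of ``torchaudio``) builds a ``speed`` effect followed
# by the ``rate`` effect that restores the original sample rate, with
# every argument passed as a string. The frame-count estimate assumes
# that slowing playback by a factor ``speed`` stretches the signal by
# roughly ``1 / speed``; the exact count returned by ``sox`` may differ
# slightly.
#

```python
from typing import List


def speed_change_effects(sample_rate: int, speed: float) -> List[List[str]]:
    """Build a sox-style effect chain (hypothetical helper, not a torchaudio API).

    ``speed`` alone only relabels the sample rate, so a ``rate`` effect with
    the original sample rate is appended to resample back.
    """
    return [
        ["speed", str(speed)],       # every argument must be a string
        ["rate", str(sample_rate)],  # resample back to the original rate
    ]


def estimated_frames(num_frames: int, speed: float) -> int:
    """Rough frame count after the speed change (assumes a 1 / speed stretch)."""
    return round(num_frames / speed)


effects_sketch = speed_change_effects(16000, 0.8)
print(effects_sketch)
print(estimated_frames(16000, 0.8))
```

# The resulting list could then be passed as the ``effects`` argument of
# :py:func:`torchaudio.sox_effects.apply_effects_tensor`, as in the call above.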

######################################################################
# Note that the number of frames and number of channels are different from
# those of the original after the effects are applied. Let’s listen to the
# audio.
#

def plot_waveform(waveform, sample_rate, title="Waveform", xlim=None):
@@ -139,7 +136,7 @@ def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):
    plt.show(block=False)

######################################################################
# Original:
# ~~~~~~~~~
#

@@ -148,7 +145,7 @@ def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):
Audio(waveform1, rate=sample_rate1)

######################################################################
# Effects applied:
# ~~~~~~~~~~~~~~~~
#

@@ -157,24 +154,22 @@ def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):
Audio(waveform2, rate=sample_rate2)

######################################################################
# Doesn’t it sound more dramatic?
#

######################################################################
# Simulating room reverberation
# -----------------------------
#
# `Convolution
# reverb <https://en.wikipedia.org/wiki/Convolution_reverb>`__ is a
# technique that's used to make clean audio sound as though it has been
# produced in a different environment.
#
# Using Room Impulse Response (RIR), for instance, we can make clean speech
# sound as though it has been uttered in a conference room.
#
# For this process, we need RIR data. The following data are from the VOiCES
# dataset, but you can record your own: just turn on your microphone
# and clap your hands.
#

rir_raw, sample_rate = torchaudio.load(SAMPLE_RIR)
@@ -183,8 +178,8 @@ def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):
Audio(rir_raw, rate=sample_rate)

######################################################################
# First, we need to clean up the RIR. We extract the main impulse, normalize
# the signal power, then flip along the time axis.
#

rir = rir_raw[:, int(sample_rate * 1.01) : int(sample_rate * 1.3)]
@@ -194,7 +189,7 @@ def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):
plot_waveform(rir, sample_rate, title="Room Impulse Response")
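
######################################################################
# The three clean-up steps described above (extract, normalize, flip) can
# be sketched in plain Python on a list of samples. ``prepare_rir`` is a
# hypothetical helper, and dividing by the L2 norm is an assumption about
# what "normalize the signal power" means here; other normalizations are
# possible.
#

```python
import math
from typing import List


def prepare_rir(raw: List[float], sample_rate: int,
                start_s: float, end_s: float) -> List[float]:
    """Extract, normalize, and time-flip an impulse response (illustrative only)."""
    # 1. Keep only the segment that contains the main impulse.
    segment = raw[int(sample_rate * start_s): int(sample_rate * end_s)]
    # 2. Normalize to unit L2 norm (assumed normalization).
    norm = math.sqrt(sum(x * x for x in segment))
    if norm == 0.0:
        raise ValueError("selected segment is silent; cannot normalize")
    normalized = [x / norm for x in segment]
    # 3. Flip along the time axis.
    return normalized[::-1]


flipped = prepare_rir([0.0, 0.0, 3.0, 4.0, 0.0, 0.0], 2, 1.0, 2.0)
print(flipped)
```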

######################################################################
# Then, we convolve the speech signal with the RIR filter.
#

speech, _ = torchaudio.load(SAMPLE_SPEECH)
@@ -203,7 +198,7 @@ def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):
augmented = torch.nn.functional.conv1d(speech_[None, ...], RIR[None, ...])[0]
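
######################################################################
# Why flip the RIR before calling ``conv1d``? ``torch.nn.functional.conv1d``
# computes a cross-correlation (the kernel is not reversed), so flipping
# the kernel and left-padding the signal by ``len(kernel) - 1`` samples
# yields a causal convolution whose output has the same length as the
# input. The pure-Python sketch below illustrates this with hypothetical
# helper names; it is not the tutorial's implementation.
#

```python
from typing import List


def cross_correlate_valid(signal: List[float], kernel: List[float]) -> List[float]:
    """'Valid' cross-correlation: slide the kernel without reversing it."""
    n_out = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n_out)
    ]


def causal_convolve(signal: List[float], impulse_response: List[float]) -> List[float]:
    """Convolve via cross-correlation with a flipped kernel and left padding."""
    k = len(impulse_response)
    padded = [0.0] * (k - 1) + list(signal)      # pad only on the left
    flipped = list(reversed(impulse_response))   # flip along the time axis
    return cross_correlate_valid(padded, flipped)


# A unit impulse reproduces the impulse response itself.
print(causal_convolve([1.0, 0.0, 0.0, 0.0], [1.0, 0.5, 0.25]))
```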

######################################################################
# Original:
# ~~~~~~~~~
#

@@ -212,7 +207,7 @@ def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):
Audio(speech, rate=sample_rate)

######################################################################
# RIR applied:
# ~~~~~~~~~~~~
#

@@ -222,13 +217,12 @@ def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):


######################################################################
# Adding background noise
# -----------------------
#
# To add background noise to audio data, you can simply add a noise Tensor to
# the Tensor representing the audio data. A common method to adjust the
# intensity of noise is changing the Signal-to-Noise Ratio (SNR).
# [`wikipedia <https://en.wikipedia.org/wiki/Signal-to-noise_ratio>`__]
#
# $$ \\mathrm{SNR} = \\frac{P_{signal}}{P_{noise}} $$
#
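# The SNR definition above can be turned into a scale factor for the
# speech signal. The sketch below is one common convention, stated as an
# assumption: power is the mean of squared samples, and the target SNR is
# given in decibels, so the required power ratio is ``10 ** (snr_db / 10)``.
# The helper names are hypothetical.
#

```python
import math
from typing import List


def mean_power(samples: List[float]) -> float:
    """Average power of a signal: mean of squared samples."""
    return sum(x * x for x in samples) / len(samples)


def scale_for_snr(signal: List[float], noise: List[float], snr_db: float) -> float:
    """Gain to apply to ``signal`` so that signal/noise power ratio hits ``snr_db``."""
    target_ratio = 10 ** (snr_db / 10)  # dB -> linear power ratio
    return math.sqrt(target_ratio * mean_power(noise) / mean_power(signal))


def mix_at_snr(signal: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Scale the signal, then add the noise sample by sample."""
    scale = scale_for_snr(signal, noise, snr_db)
    return [scale * s + n for s, n in zip(signal, noise)]


speech_demo = [1.0, -1.0, 1.0, -1.0]
noise_demo = [0.5, 0.5, -0.5, -0.5]
print(mix_at_snr(speech_demo, noise_demo, 10.0))
```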
@@ -250,7 +244,7 @@ def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):
    noisy_speeches.append((scale * speech + noise) / 2)

######################################################################
# Background noise:
# ~~~~~~~~~~~~~~~~~
#

@@ -290,13 +284,12 @@ def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):


######################################################################
# Applying codec to Tensor object
# -------------------------------
#
# :py:func:`torchaudio.functional.apply_codec` can apply codecs to
# a Tensor object.
#
# **Note** This process is not differentiable.
#
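# To give a feel for what a lossy telephony-style codec does to samples,
# here is a small pure-Python sketch of 8-bit mu-law companding. This is
# an illustration of one classic codec only; it is not how
# ``apply_codec`` is implemented, and it is not the GSM codec. The
# function names are hypothetical.
#

```python
import math

MU = 255  # standard mu value for 8-bit mu-law


def mu_law_encode(x: float) -> int:
    """Compress a sample in [-1, 1] and quantize it to an integer in 0..255."""
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((compressed + 1) / 2 * MU))


def mu_law_decode(code: int) -> float:
    """Map an integer code in 0..255 back to an approximate sample in [-1, 1]."""
    compressed = 2 * code / MU - 1
    return math.copysign(math.expm1(abs(compressed) * math.log1p(MU)) / MU, compressed)


# The round trip is lossy: decoded values are close to, but not equal to,
# the inputs, which is also why such a step is not differentiable.
for sample in (-0.5, 0.0, 0.25, 1.0):
    print(sample, mu_law_decode(mu_law_encode(sample)))
```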


@@ -349,29 +342,27 @@ def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):
Audio(waveforms[2], rate=sample_rate)

######################################################################
# Simulating a phone recording
# ----------------------------
#
# Combining the previous techniques, we can simulate audio that sounds
# like a person talking over a phone in an echoey room with people talking
# in the background.
#

sample_rate = 16000
original_speech, sample_rate = torchaudio.load(SAMPLE_SPEECH)

plot_specgram(original_speech, sample_rate, title="Original")

# Apply RIR
speech_ = torch.nn.functional.pad(original_speech, (RIR.shape[1] - 1, 0))
rir_applied = torch.nn.functional.conv1d(speech_[None, ...], RIR[None, ...])[0]

plot_specgram(rir_applied, sample_rate, title="RIR Applied")

# Add background noise
# Because the noise is recorded in the actual environment, we assume that
# the noise already contains the acoustic features of the environment.
# Therefore, we add the noise after applying the RIR.
noise, _ = torchaudio.load(SAMPLE_NOISE)
noise = noise[:, : rir_applied.shape[1]]

@@ -381,7 +372,7 @@ def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):

plot_specgram(bg_added, sample_rate, title="BG noise added")

# Apply filtering and change sample rate
filtered, sample_rate2 = torchaudio.sox_effects.apply_effects_tensor(
    bg_added,
    sample_rate,
@@ -401,42 +392,42 @@ def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):

plot_specgram(filtered, sample_rate2, title="Filtered")

# Apply telephony codec
codec_applied = F.apply_codec(filtered, sample_rate2, format="gsm")

plot_specgram(codec_applied, sample_rate2, title="GSM Codec Applied")


######################################################################
# Original speech:
# ~~~~~~~~~~~~~~~~
#

Audio(original_speech, rate=sample_rate)

######################################################################
# RIR applied:
# ~~~~~~~~~~~~
#

Audio(rir_applied, rate=sample_rate)

######################################################################
# Background noise added:
# ~~~~~~~~~~~~~~~~~~~~~~~
#

Audio(bg_added, rate=sample_rate)

######################################################################
# Filtered:
# ~~~~~~~~~
#

Audio(filtered, rate=sample_rate2)

######################################################################
# Codec applied:
# ~~~~~~~~~~~~~~
#
