WaveNet Source Code

by allenlu2007

Reference: 谷歌WaveNet 源码详解 (a detailed walkthrough of Google's WaveNet source code)

Traditional Chinese version: https://zhuanlan.zhihu.com/p/24568596


WaveNet is the deep-learning-based audio generation model from Google DeepMind. It models raw audio waveforms directly and performs very well on text-to-speech and audio generation tasks (for details, see: How does Google's WaveNet generate sound with deep learning?). This article walks through the source code of a TensorFlow implementation of WaveNet, namely the one published on GitHub by ibab (GitHub link: ibab's WaveNet repository).

This article is organized as follows: 1. WaveNet architecture overview; 2. Source code walkthrough; 3. Summary.

 

1. WaveNet Architecture Overview

WaveNet uses dilated and causal convolutions, so the receptive field grows exponentially with network depth, which makes it feasible to model raw audio directly. (For details, see: How does Google's WaveNet generate sound with deep learning?)
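To make the receptive-field claim concrete, here is a minimal sketch. Filter width 2 is assumed and the ten-layer dilation schedule is only illustrative:

# Receptive field of stacked dilated causal convolutions with filter width 2.
# The dilation schedule below is illustrative.
dilations = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

receptive_field = 1
for d in dilations:
    # Each width-2 layer with dilation d extends the receptive field by d samples.
    receptive_field += d

print(receptive_field)  # 1024 samples, versus only 11 for ten undilated layers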

 

2. Source Code Walkthrough

2.1 Overview

The folder downloaded from GitHub is organized as shown in the figure:

The key pieces are train.py, generate.py, and the wavenet folder. train.py is the training code and generate.py is the generation code. The wavenet folder contains the model, the audio reader, and other utility classes and functions, as shown in the figure:

2.2 train.py Walkthrough

Let's officially begin the WaveNet journey. First, look at train.py. It contains a set of hyperparameters, model save (save()) / load (load()) functions, and the main() function.

2.2.1 Hyperparameters

BATCH_SIZE = 1  # batch size
DATA_DIRECTORY = './VCTK-Corpus'  # path to the training data
LOGDIR_ROOT = './logdir'  # log directory
CHECKPOINT_EVERY = 50  # save a checkpoint every this many steps
NUM_STEPS = 4000  # total number of training steps
LEARNING_RATE = 0.02  # learning rate
WAVENET_PARAMS = './wavenet_params.json'  # model hyperparameter file
STARTED_DATESTRING = "{0:%Y-%m-%dT%H-%M-%S}".format(datetime.now())  # timestamp used to name the run
SAMPLE_SIZE = 100000  # number of audio samples per training piece
L2_REGULARIZATION_STRENGTH = 0  # L2 regularization strength
SILENCE_THRESHOLD = 0.3  # energy threshold for trimming silence
EPSILON = 0.001
ADAM_OPTIMIZER = 'adam'  # Adam optimizer
SGD_OPTIMIZER = 'sgd'  # SGD optimizer
SGD_MOMENTUM = 0.9  # SGD momentum
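For reference, the hyperparameters of the network itself live in wavenet_params.json (WAVENET_PARAMS above). A rough illustration of its contents, written here as a Python dict: the key names match those read in main() below, while the values are only examples:

# Illustrative contents of wavenet_params.json (values are examples only;
# the keys correspond to those passed to WaveNetModel in main()).
wavenet_params = {
    "sample_rate": 16000,
    "dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
                  1, 2, 4, 8, 16, 32, 64, 128, 256, 512],
    "filter_width": 2,
    "residual_channels": 32,
    "dilation_channels": 32,
    "skip_channels": 512,
    "quantization_channels": 256,
    "use_biases": True,
    "scalar_input": False,
    "initial_filter_width": 32,
}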

2.2.2 Model Save/Load Functions

This part of the code is simple; the key calls are TensorFlow's functions for saving and restoring a model.

saver.save(sess, checkpoint_path, global_step=step)  # save the model parameters
saver.restore(sess, ckpt.model_checkpoint_path)  # restore the model
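For context, here is a simplified sketch of a restore helper in the spirit of train.py's load() (TensorFlow 1.x API; details of the real function may differ):

# A simplified sketch of a checkpoint-restore helper (not the literal load() from train.py).
import tensorflow as tf

def load_sketch(saver, sess, logdir):
    ckpt = tf.train.get_checkpoint_state(logdir)
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)
        # e.g. '.../model.ckpt-4000' -> 4000
        return int(ckpt.model_checkpoint_path.split('-')[-1])
    return None  # no checkpoint found; training starts from scratch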

2.2.3 The main() Function

main() covers the core of training: 1. read the WaveNet model parameters; 2. create a TensorFlow Coordinator; 3. build the inputs from the VCTK corpus; 4. build the WaveNet model; 5. train and save the model.

def main():
   # Parse command-line arguments
   args = get_arguments()
   …
   # (less important code omitted)

   # Read the WaveNet model parameters
   with open(args.wavenet_params, 'r') as f:
      wavenet_params = json.load(f)

   # Create the coordinator.
   coord = tf.train.Coordinator()

   # Build the inputs from the VCTK corpus.
   with tf.name_scope('create_inputs'):
       # Allow silence trimming to be skipped by specifying a threshold near
       # zero.
       silence_threshold = args.silence_threshold if args.silence_threshold > \
                           EPSILON else None
       # This uses the AudioReader class from audio_reader.py, detailed below
       reader = AudioReader(
              args.data_dir,
              coord,
              sample_rate=wavenet_params['sample_rate'],
              sample_size=args.sample_size,
              silence_threshold=args.silence_threshold)
       audio_batch = reader.dequeue(args.batch_size)  # dequeue a batch of audio
   # Build the network using the WaveNetModel class from model.py, detailed below.
   net = WaveNetModel(
         batch_size=args.batch_size,
         dilations=wavenet_params["dilations"],
         filter_width=wavenet_params["filter_width"],
         residual_channels=wavenet_params["residual_channels"],
         dilation_channels=wavenet_params["dilation_channels"],
         skip_channels=wavenet_params["skip_channels"],
         quantization_channels=wavenet_params["quantization_channels"],
         use_biases=wavenet_params["use_biases"],
         scalar_input=wavenet_params["scalar_input"],
         initial_filter_width=wavenet_params["initial_filter_width"])
   # Whether to use L2 regularization
   if args.l2_regularization_strength == 0:
        args.l2_regularization_strength = None
   # Compute the loss
   loss = net.loss(audio_batch, args.l2_regularization_strength)
   # Choose between the Adam and SGD optimizers
   if args.optimizer == ADAM_OPTIMIZER:
      optimizer = tf.train.AdamOptimizer(learning_rate=args.learning_rate)
   elif args.optimizer == SGD_OPTIMIZER:
      optimizer = tf.train.MomentumOptimizer(learning_rate=args.learning_rate,
                           momentum=args.sgd_momentum)
   else:
       # This shouldn't happen, given the choices specified in argument
       # specification.
       raise RuntimeError('Invalid optimizer option.')
   trainable = tf.trainable_variables()
   optim = optimizer.minimize(loss, var_list=trainable)  # minimize the loss

   # Write logs for TensorBoard.
   writer = tf.train.SummaryWriter(logdir)
   writer.add_graph(tf.get_default_graph())
   run_metadata = tf.RunMetadata()
   summaries = tf.merge_all_summaries()

   # Start the session
   sess = tf.Session(config=tf.ConfigProto(log_device_placement=False))
   init = tf.initialize_all_variables()
   sess.run(init)

   # Saver for storing checkpoints of the model.
   saver = tf.train.Saver(var_list=tf.trainable_variables())
 
   try:
      saved_global_step = load(saver, sess, restore_from)
      if is_overwritten_training or saved_global_step is None:
           # The first training step will be saved_global_step + 1,
           # therefore we put -1 here for new or overwritten trainings.
           saved_global_step = -1

   except:
         print("Something went wrong while restoring checkpoint. "
                  "We will terminate training to avoid accidentally overwriting "
                  "the previous model.")
         raise

   # This uses TensorFlow's threading and queue mechanism
   threads = tf.train.start_queue_runners(sess=sess, coord=coord)
   reader.start_threads(sess)

    …
    …
   # (part of the training loop omitted)
   finally:
       if step > last_saved_step:
            save(saver, sess, logdir, step)
       coord.request_stop()
       coord.join(threads)

main() uses classes from audio_reader.py and model.py, so let's dig into those next.

2.3 audio_reader.py Walkthrough

audio_reader.py contains four functions (find_files(), load_generic_audio(), load_vctk_audio(), trim_silence()) and one class, AudioReader.

Of the four functions, the one worth a closer look is trim_silence(), which removes the silent segments at the beginning and end of an audio clip.

# Remove silence at the beginning and end of the audio.
def trim_silence(audio, threshold):
   '''Removes silence at the beginning and end of a sample.'''
   energy = librosa.feature.rmse(audio)  # compute the RMS energy of the audio
   frames = np.nonzero(energy > threshold)  # frames whose energy exceeds the threshold
   indices = librosa.core.frames_to_samples(frames)[1]
 
   # Note: indices can be an empty array, if the whole audio was silence.
   return audio[indices[0]:indices[-1]] if indices.size else audio[0:0]
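A quick usage sketch, assuming trim_silence above is in scope (it lives in wavenet/audio_reader.py); the file name is hypothetical, and librosa.load returns mono float samples in [-1, 1] at the requested sample rate:

# Hypothetical usage of trim_silence on one VCTK-style wav file.
import librosa

audio, _ = librosa.load('p225_001.wav', sr=16000, mono=True)
trimmed = trim_silence(audio, threshold=0.3)
print(len(audio), len(trimmed))  # the trimmed clip is at most as long as the original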

Now look at the AudioReader class. It has four methods, and its job is to preprocess the audio files and feed them into a TensorFlow queue.

class AudioReader(object):
  '''Generic background audio reader that preprocesses audio files
  and enqueues them into a TensorFlow queue.'''

  def __init__(self,
             audio_dir,
             coord,
             sample_rate,
             sample_size=None,
             silence_threshold=None,
             queue_size=256):
     self.audio_dir = audio_dir  # path to the training files
     self.sample_rate = sample_rate  # sample rate
     self.coord = coord
     self.sample_size = sample_size  # number of samples per piece
     self.silence_threshold = silence_threshold  # energy threshold below which audio counts as silence
     self.threads = []  # worker threads
     self.sample_placeholder = tf.placeholder(dtype=tf.float32, shape=None)

     # Initialize the queue
     self.queue = tf.PaddingFIFOQueue(queue_size,
                                      ['float32'],
                                      shapes=[(None, 1)])

     # Enqueue op
     self.enqueue = self.queue.enqueue([self.sample_placeholder])

  # Dequeue a batch from the queue
  def dequeue(self, num_elements):
      output = self.queue.dequeue_many(num_elements)
      return output

  # Main worker thread
  def thread_main(self, sess):
     buffer_ = np.array([])  # buffer for accumulating samples
     stop = False
     # Go through the dataset multiple times
     while not stop:
        iterator = load_generic_audio(self.audio_dir, self.sample_rate)  # load the audio data
       for audio, filename in iterator:
         if self.coord.should_stop():
           stop = True
           break
         if self.silence_threshold is not None:
           # Remove silence
            audio = trim_silence(audio[:, 0], self.silence_threshold)  # trim leading/trailing silence
           if audio.size == 0:
             print("Warning: {} was ignored as it contains only "
                   "silence. Consider decreasing trim_silence "
                   "threshold, or adjust volume of the audio."
                   .format(filename))
     
         if self.sample_size:
           # Cut samples into fixed size pieces
           buffer_ = np.append(buffer_, audio)
           while len(buffer_) > self.sample_size:
             piece = np.reshape(buffer_[:self.sample_size], [-1, 1])
             sess.run(self.enqueue,
                feed_dict={self.sample_placeholder: piece})
             buffer_ = buffer_[self.sample_size:]
          else:
             sess.run(self.enqueue,
               feed_dict={self.sample_placeholder: audio})  # push the loaded audio into the queue

  # Start the worker threads
  def start_threads(self, sess, n_threads=1):
     for _ in range(n_threads):
       thread = threading.Thread(target=self.thread_main, args=(sess,))
       thread.daemon = True # Thread will close when parent quits.
       thread.start()
       self.threads.append(thread)
     return self.threads

The instructive part here is TensorFlow's threading and queue mechanism (see the official TensorFlow documentation). Using a queue involves three steps: 1. create the queue; 2. initialize it; 3. enqueue and dequeue elements. A minimal standalone sketch follows the figure note below.

(Figure omitted; it was taken from the official TensorFlow documentation.)
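A minimal standalone sketch of the three steps (TensorFlow 1.x API; a plain FIFOQueue is used here instead of the PaddingFIFOQueue in AudioReader):

import tensorflow as tf

# 1. Create the queue (here it holds float32 scalars).
queue = tf.FIFOQueue(capacity=32, dtypes=['float32'])

# 2. Define the enqueue op, fed from a placeholder.
sample = tf.placeholder(tf.float32)
enqueue = queue.enqueue([sample])

# 3. Enqueue and dequeue.
dequeued = queue.dequeue()
with tf.Session() as sess:
    sess.run(enqueue, feed_dict={sample: 1.0})
    print(sess.run(dequeued))  # 1.0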

Using threads involves: 1. create a Coordinator object; 2. create the threads; 3. join the threads through the Coordinator.

# Thread body: loop until the coordinator indicates that a stop was requested.
# If some condition becomes true, ask the coordinator to stop the other threads.
def MyLoop(coord):
  while not coord.should_stop():
    ...do something...
    if ...some condition...:
      coord.request_stop()

# Main thread: create a coordinator.
coord = tf.train.Coordinator()

# Create 10 threads that run 'MyLoop()'
threads = [threading.Thread(target=MyLoop, args=(coord,)) for i in xrange(10)]

# Start the threads and wait for all of them to stop.
for t in threads:
  t.start()
coord.join(threads)
 
(The code above is taken from the official TensorFlow documentation.)


2.4 model.py Walkthrough

This is the heart of the source code: the functions that build the network model and those related to audio generation. Because there is a lot of material, and the generation functions largely mirror the network-building functions, this article only details the network-building part.

model.py contains two functions and one class. The functions create_variable() and create_bias_variable() create and initialize weights and biases, respectively; they are straightforward.

The key methods of the WaveNetModel class are: _create_variables() (create the variables), _create_causal_layer() (build the causal convolution), _create_dilation_layer() (build a dilated convolution layer), _create_network() (assemble the network), and loss(). Some of these functions were tested in isolation to see how they work; in the code comments below, "in the test" refers to the values used in those standalone tests.

2.4.1 _create_variables(): Creating the Variables

This function creates all the variables the model needs (for the causal layer, the dilated layers, and the postprocessing layer) and stores them in a dictionary for later use.
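Based on how the dictionary is indexed in the layer-building code below (variables['dilated_stack'][layer_index]['filter'], self.variables['postprocessing']['postprocess1'], and so on), its structure looks roughly like this. This is a sketch of the shape of the dict, not the actual creation code, and the 'causal_layer' entry is assumed:

# Rough structure of the dict returned by _create_variables() (sketch only).
variables = {
    'causal_layer': {'filter': ...},        # initial causal convolution (assumed key)
    'dilated_stack': [                      # one entry per dilation layer
        {'filter': ..., 'gate': ..., 'dense': ..., 'skip': ...,
         'filter_bias': ..., 'gate_bias': ..., 'dense_bias': ..., 'skip_bias': ...},
        # ... repeated for every entry in self.dilations
    ],
    'postprocessing': {'postprocess1': ..., 'postprocess2': ...,
                       'postprocess1_bias': ..., 'postprocess2_bias': ...},
}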

2.4.2 _create_causal_layer(): Causal Convolution

This function builds the causal convolution by calling causal_conv() from ops.py. Let's see what causal_conv() actually does. As noted above, "in the test" in the comments below refers to the values used when testing the function in isolation. As the test shows, causality is achieved by shifting the output a few steps, which is implemented by padding the input with tf.pad().


def time_to_batch(value, dilation, name=None):
  with tf.name_scope('time_to_batch'):
    # In the test, value has shape (1, 9, 1) and dilation = 4
    shape = tf.shape(value)
    # pad_elements evaluates to 3
    pad_elements = dilation - 1 - (shape[1] + dilation - 1) % dilation
    # After padding, the shape is (1, 12, 1): 3 zeros appended along the second dimension
    padded = tf.pad(value, [[0, 0], [0, pad_elements], [0, 0]])
    # After reshaping, the shape is (3, 4, 1)
    reshaped = tf.reshape(padded, [-1, dilation, shape[2]])
    # After transposing, the shape is (4, 3, 1)
    transposed = tf.transpose(reshaped, perm=[1, 0, 2])
    # The returned tensor has shape (4, 3, 1)
    return tf.reshape(transposed, [shape[0] * dilation, -1, shape[2]])


def batch_to_time(value, dilation, name=None):
  with tf.name_scope('batch_to_time'):
    shape = tf.shape(value)
    prepared = tf.reshape(value, [dilation, -1, shape[2]])
    transposed = tf.transpose(prepared, perm=[1, 0, 2])
    # This inverts time_to_batch (up to the padding added there).
    # In the test the result has shape (1, 12, 1); the extra padded
    # samples are removed later in causal_conv.
    return tf.reshape(transposed,
        [tf.div(shape[0], dilation), -1, shape[2]])


def causal_conv(value, filter_, dilation, name='causal_conv'):
  with tf.name_scope(name):
    # Pad beforehand to preserve causality.
    # In the test, filter_width = 2
    filter_width = tf.shape(filter_)[0]
    # In the test, dilation = 4,
    # so padding = [[0, 0], [4, 0], [0, 0]]
    padding = [[0, 0], [(filter_width - 1) * dilation, 0], [0, 0]]
    # In the test, value has shape (1, 5, 1);
    # the padding prepends 4 zeros along the second dimension,
    # so padded has shape (1, 9, 1)
    padded = tf.pad(value, padding)
    if dilation > 1:
      # See the time_to_batch test above:
      # in the test, transformed has shape (4, 3, 1)
      transformed = time_to_batch(padded, dilation)

      conv = tf.nn.conv1d(transformed, filter_, stride=1, padding='SAME')
      # Restore the original layout; see batch_to_time above
      restored = batch_to_time(conv, dilation)
    else:
      restored = tf.nn.conv1d(padded, filter_, stride=1, padding='SAME')
    # Remove excess elements at the end.
    result = tf.slice(restored,
                 [0, 0, 0],
                 [-1, tf.shape(value)[1], -1])
    # The result has the same shape as the original input value: (1, 5, 1) in the test
    return result
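A small standalone check that mirrors the test values in the comments above (assuming TensorFlow 1.x and that ops.py is importable as wavenet.ops):

import numpy as np
import tensorflow as tf
from wavenet.ops import causal_conv

value = tf.constant(np.arange(1, 6, dtype=np.float32).reshape(1, 5, 1))  # shape (1, 5, 1)
filter_ = tf.constant(np.ones((2, 1, 1), dtype=np.float32))              # width-2 filter of ones
out = causal_conv(value, filter_, dilation=4)

with tf.Session() as sess:
    result = sess.run(out)
print(result.shape)      # (1, 5, 1): same shape as the input
print(result.flatten())  # [1. 2. 3. 4. 6.]: each output is x[t] + x[t-4]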

2.4.3 _create_dilation_layer(): Dilated Convolution Layer

This function implements a dilated convolution layer, together with the residual and skip connections that help the model converge faster. The layer structure is shown in the docstring below.

def _create_dilation_layer(self, input_batch, layer_index, dilation):
  '''Creates a single causal dilated convolution layer.

  The layer contains a gated filter that connects to dense output
  and to a skip connection:

         |-> [gate]   -|        |-> 1x1 conv -> skip output
         |             |-> (*) -|
  input -|-> [filter] -|        |-> 1x1 conv -|
         |                                    |-> (+) -> dense output
         |------------------------------------|

  Where `[gate]` and `[filter]` are causal convolutions with a
  non-linear activation at the output.
  '''
  variables = self.variables['dilated_stack'][layer_index]

  weights_filter = variables['filter']
  weights_gate = variables['gate']

  # Filter convolution
  conv_filter = causal_conv(input_batch, weights_filter, dilation)
  # Gate convolution
  conv_gate = causal_conv(input_batch, weights_gate, dilation)

  # Optionally add biases
  if self.use_biases:
    filter_bias = variables['filter_bias']
    gate_bias = variables['gate_bias']
    conv_filter = tf.add(conv_filter, filter_bias)
    conv_gate = tf.add(conv_gate, gate_bias)

  # Combine the filter and gate outputs (gated activation unit)
  out = tf.tanh(conv_filter) * tf.sigmoid(conv_gate)

  # The 1x1 conv to produce the residual output
  weights_dense = variables['dense']
  transformed = tf.nn.conv1d(
      out, weights_dense, stride=1, padding="SAME", name="dense")

  # The 1x1 conv to produce the skip output
  weights_skip = variables['skip']
  skip_contribution = tf.nn.conv1d(
    out, weights_skip, stride=1, padding="SAME", name="skip")

  if self.use_biases:
    dense_bias = variables['dense_bias']
    skip_bias = variables['skip_bias']
    transformed = transformed + dense_bias
    skip_contribution = skip_contribution + skip_bias

  layer = 'layer{}'.format(layer_index)
  # Add summaries for TensorBoard
  tf.histogram_summary(layer + '_filter', weights_filter)
  tf.histogram_summary(layer + '_gate', weights_gate)
  tf.histogram_summary(layer + '_dense', weights_dense)
  tf.histogram_summary(layer + '_skip', weights_skip)
  if self.use_biases:
    tf.histogram_summary(layer + '_biases_filter', filter_bias)
    tf.histogram_summary(layer + '_biases_gate', gate_bias)
    tf.histogram_summary(layer + '_biases_dense', dense_bias)
    tf.histogram_summary(layer + '_biases_skip', skip_bias)

  # Return the skip output and the residual output (input + transformed)
  return skip_contribution, input_batch + transformed

 

2.4.4 _create_network(): Building the Network

This function uses _create_dilation_layer() from above to assemble the network. After the stack of causal/dilated layers comes a postprocessing layer with the structure: (+) -> ReLU -> 1x1 conv -> ReLU -> 1x1 conv.

# Build the model
def _create_network(self, input_batch):
  '''Construct the WaveNet network.'''
  outputs = []
  current_layer = input_batch

  # Pre-process the input with a regular convolution
  if self.scalar_input:
    initial_channels = 1
  else:
    initial_channels = self.quantization_channels

  # Initial causal layer
  current_layer = self._create_causal_layer(current_layer)

  # Add all defined dilation layers.
  # (18 layers in total in the configuration used here)
  with tf.name_scope('dilated_stack'):
    for layer_index, dilation in enumerate(self.dilations):
      with tf.name_scope('layer{}'.format(layer_index)):
        output, current_layer = self._create_dilation_layer(
          current_layer, layer_index, dilation)
        outputs.append(output)

  # Postprocessing layer
  with tf.name_scope('postprocessing'):
    # Perform (+) -> ReLU -> 1x1 conv -> ReLU -> 1x1 conv to
    # postprocess the output.
    # Fetch the postprocessing-layer variables
    w1 = self.variables['postprocessing']['postprocess1']
    w2 = self.variables['postprocessing']['postprocess2']
    if self.use_biases:
      b1 = self.variables['postprocessing']['postprocess1_bias']
      b2 = self.variables['postprocessing']['postprocess2_bias']

    tf.histogram_summary('postprocess1_weights', w1)
    tf.histogram_summary('postprocess2_weights', w2)
    if self.use_biases:
      tf.histogram_summary('postprocess1_biases', b1)
      tf.histogram_summary('postprocess2_biases', b2)

    # We skip connections from the outputs of each layer, adding them
    # all up here.
    # Sum the skip-connection outputs of every layer
    total = sum(outputs)
    transformed1 = tf.nn.relu(total)
    conv1 = tf.nn.conv1d(transformed1, w1, stride=1, padding="SAME")
    if self.use_biases:
      conv1 = tf.add(conv1, b1)
    transformed2 = tf.nn.relu(conv1)
    conv2 = tf.nn.conv1d(transformed2, w2, stride=1, padding="SAME")
    if self.use_biases:
      conv2 = tf.add(conv2, b2)

  return conv2

 

2.4.5 loss()

This function first applies mu-law encoding (mu_law_encode()) to the input audio, then one-hot encodes it. The loss itself is tf.nn.softmax_cross_entropy_with_logits(). (A short sketch of the mu-law step follows the code.)

# Loss function
def loss(self,
     input_batch,
     l2_regularization_strength=None,
     name='wavenet'):
  '''Creates a WaveNet network and returns the autoencoding loss.

  The variables are all scoped to the given name.
  '''

  with tf.name_scope(name):

    # Apply mu-law encoding
    input_batch = mu_law_encode(input_batch,
                                self.quantization_channels)

    # Then apply one-hot encoding
    encoded = self._one_hot(input_batch)
    # If scalar input is used, feed the encoded values directly as scalars
    if self.scalar_input:
      network_input = tf.reshape(tf.cast(input_batch, tf.float32),
                                 [self.batch_size, -1, 1])
    else:
      network_input = encoded

    # The network's raw prediction
    raw_output = self._create_network(network_input)

    with tf.name_scope('loss'):
      # Shift original input left by one sample, which means that
      # each output sample has to predict the next input sample.
      # In the test, encoded has shape (1, 9, 1), e.g. [0, 0, 0, 0, 1~5];
      # after shifting, the shape is (1, 8, 1), e.g. [0, 0, 0, 1~5]
      shifted = tf.slice(encoded, [0, 1, 0],
                         [-1, tf.shape(encoded)[1] - 1, -1])

      # After padding with one zero, the shape is (1, 9, 1) again, e.g. [0, 0, 0, 1~5, 0]
      shifted = tf.pad(shifted, [[0, 0], [0, 1], [0, 0]])

      # Reshape the raw network output into per-sample predictions
      prediction = tf.reshape(raw_output, [-1, self.quantization_channels])

      # Cross-entropy loss
      loss = tf.nn.softmax_cross_entropy_with_logits(
           prediction,
           tf.reshape(shifted, [-1, self.quantization_channels]))
      reduced_loss = tf.reduce_mean(loss)

      tf.scalar_summary('loss', reduced_loss)

      if l2_regularization_strength is None:
        return reduced_loss
      else:
        # L2 regularization for all trainable parameters
        l2_loss = tf.add_n([tf.nn.l2_loss(v)
                            for v in tf.trainable_variables()
                            if not('bias' in v.name)])

        # Add the regularization term to the loss
        total_loss = (reduced_loss + l2_regularization_strength * l2_loss)
        tf.scalar_summary('l2_loss', l2_loss)
        tf.scalar_summary('total_loss', total_loss)

        return total_loss
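For completeness, mu-law companding maps the raw amplitudes non-linearly before quantization, so that small amplitudes get finer resolution. A minimal numpy sketch of that step (the repo's mu_law_encode works on tensors; quantization_channels = 256 is assumed here):

import numpy as np

def mu_law_encode_sketch(audio, quantization_channels=256):
    '''Map float audio in [-1, 1] to integers in [0, quantization_channels - 1].'''
    mu = quantization_channels - 1
    # Non-linear companding: f(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu)
    magnitude = np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    signal = np.sign(audio) * magnitude
    # Shift from [-1, 1] to [0, mu] and quantize.
    return ((signal + 1) / 2 * mu + 0.5).astype(np.int32)

print(mu_law_encode_sketch(np.array([-1.0, 0.0, 0.5, 1.0])))  # [  0 128 239 255]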


2.5 generate.py Walkthrough

This code generates audio from the trained model. Given the walkthrough above it is relatively straightforward, so we skip it. Note that GitHub also hosts Fast WaveNet, which addresses the main drawback of the generation procedure in the original WaveNet paper, namely that generation is very slow; it is worth a look if you are interested.

3. Summary

WaveNet combines causal and dilated convolutions so that the receptive field grows exponentially with model depth. The architecture is not limited to generating raw audio: it has also been applied to text generation (text-wavenet) and image generation (image-wavenet), making it a neural network model well worth studying in depth.

 
