Example: Training a Chatbot with TensorFlow

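This post was last edited by Oner on 2017-9-28 13:26
Reading guide:
1. What is seq2seq?
2. What is a recurrent neural network?
3. How is the training sample set chosen?
4. How is the data preprocessed?
5. How is the graph built?
6. How is the session created?
7. How is prediction done?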

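Preface

In real-world engineering, chatbots are rarely built end to end with deep learning alone. Still, it is instructive to see how a simple chatbot can be implemented with a deep learning seq2seq model. This article uses TensorFlow to train a seq2seq-based chatbot that learns to answer questions from a training corpus.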

seq2seq

For how seq2seq works, see the earlier article "深度学习的seq2seq模型" (The seq2seq Model in Deep Learning): http://blog.csdn.net/wangyangzhizhou/article/details/77883152

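Recurrent neural networks

The seq2seq model uses recurrent neural networks. The popular variants today are the plain RNN, LSTM, and GRU. For how each of them works, see the earlier articles "循环神经网络" (Recurrent Neural Networks, http://blog.csdn.net/wangyangzhizhou/article/details/76278375), "LSTM神经网络" (LSTM Neural Networks, http://blog.csdn.net/wangyangzhizhou/article/details/76651116), and "GRU神经网络" (GRU Neural Networks, http://blog.csdn.net/wangyangzhizhou/article/details/77332582).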

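Training sample set

The samples are mainly QA pairs. Plenty of open datasets are available for download; here only a small, arbitrarily chosen set of questions and answers is used. They are stored one per line: the first line is a question, the second line is its answer, the third line is the next question, the fourth line is its answer, and so on.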
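As a small illustration of this alternating-line format, the sketch below pairs up questions and answers from a list of lines. The function name `split_qa_lines` and the sample sentences are made up for this example and are not part of the original project.

```python
def split_qa_lines(lines):
    """Pair up alternating question/answer lines (question first)."""
    # Strip newlines and ignore empty lines.
    lines = [line.strip() for line in lines if line.strip()]
    # Drop a trailing question that has no answer.
    if len(lines) % 2 == 1:
        lines = lines[:-1]
    questions = lines[0::2]  # lines 1, 3, 5, ...
    answers = lines[1::2]    # lines 2, 4, 6, ...
    return list(zip(questions, answers))


# Hypothetical sample data in the format described above.
sample_lines = [
    "how are you\n",
    "fine thank you\n",
    "what is your name\n",
    "my name is bot\n",
]
qa_pairs = split_qa_lines(sample_lines)
for question, answer in qa_pairs:
    print(question, "->", answer)
```

In practice the lines would come from reading the corpus file rather than from an in-memory list.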

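Data preprocessing

To train the model, the data must first be converted to numbers. Each word in the vocabulary is represented by an integer index, and VOCAB_SIZE defines how many words the vocabulary holds. We also define the minimum and maximum lengths of questions and of answers. In addition, four special symbols are needed: UNK, GO, EOS, and PAD. UNK stands for an unknown word, that is, any word that falls outside the VOCAB_SIZE most frequent words. GO marks the start of the decoder input, EOS marks the end of an answer, and PAD is used for padding. Because all QA pairs are fed through the same seq2seq model, every input must have the same length and every output must have the same length, so shorter questions and answers are padded with PAD.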

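[mw_shl_code=python,true]# Length limits, counted in words, for questions (q) and answers (a).
limit = {
    'maxq': 10,
    'minq': 0,
    'maxa': 8,
    'mina': 3
}

# Special symbols.
UNK = 'unk'      # unknown word
GO = '<go>'      # start of decoder input
EOS = '<eos>'    # end of answer
PAD = '<pad>'    # padding
VOCAB_SIZE = 1000[/mw_shl_code]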

Filter the QA pairs according to the length limits.

[mw_shl_code=python,true]def filter_data(sequences):
    """Keep only the QA pairs whose word counts fall within `limit`."""
    filtered_q, filtered_a = [], []
    raw_data_len = len(sequences) // 2

    # Even indices hold questions, odd indices hold the matching answers.
    for i in range(0, len(sequences) - 1, 2):
        qlen, alen = len(sequences[i].split(' ')), len(sequences[i + 1].split(' '))
        if limit['minq'] <= qlen <= limit['maxq']:
            if limit['mina'] <= alen <= limit['maxa']:
                filtered_q.append(sequences[i])
                filtered_a.append(sequences[i + 1])
    filt_data_len = len(filtered_q)
    # Percentage of QA pairs that were dropped.
    filtered = int((raw_data_len - filt_data_len) * 100 / raw_data_len)
    print(str(filtered) + '% filtered from original data')

    return filtered_q, filtered_a[/mw_shl_code]

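We also need word-frequency statistics for the whole corpus, and we keep the n most frequent words as the vocabulary, where n is the VOCAB_SIZE defined earlier. We further need two lookups: one from index to word, and one from word to index.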

[mw_shl_code=python,true]import itertools
import nltk

def index_(tokenized_sentences, vocab_size):
    # Count how often each word occurs across the whole corpus.
    freq_dist = nltk.FreqDist(itertools.chain(*tokenized_sentences))
    vocab = freq_dist.most_common(vocab_size)
    # Special symbols take indices 0-3, followed by the most frequent words.
    index2word = [GO] + [EOS] + [UNK] + [PAD] + [x[0] for x in vocab]
    word2index = dict([(w, i) for i, w in enumerate(index2word)])
    return index2word, word2index, freq_dist[/mw_shl_code]
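To make the two lookups concrete, here is a minimal sketch of the same idea using only the standard library (`collections.Counter` instead of `nltk.FreqDist`). The names `build_vocab` and `sentence_to_ids` and the toy sentences are assumptions made for this example.

```python
from collections import Counter

GO, EOS, UNK, PAD = '<go>', '<eos>', 'unk', '<pad>'


def build_vocab(tokenized_sentences, vocab_size):
    """Return (index2word, word2index) for the vocab_size most frequent words."""
    counts = Counter(word for sent in tokenized_sentences for word in sent)
    most_common = [word for word, _ in counts.most_common(vocab_size)]
    # The four special symbols occupy indices 0-3.
    index2word = [GO, EOS, UNK, PAD] + most_common
    word2index = {word: i for i, word in enumerate(index2word)}
    return index2word, word2index


def sentence_to_ids(tokens, word2index):
    """Map words to indices, falling back to UNK for out-of-vocabulary words."""
    return [word2index.get(word, word2index[UNK]) for word in tokens]


sentences = [['how', 'are', 'you'], ['fine', 'thank', 'you']]
index2word, word2index = build_vocab(sentences, vocab_size=3)
ids = sentence_to_ids(['you', 'zebra'], word2index)
print(index2word)
print(ids)
```

Note that the total number of symbols is `vocab_size + 4`, which is presumably why the prediction code uses 1004 symbols with VOCAB_SIZE = 1000.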

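As mentioned above, questions fed to the encoder of our seq2seq model vary in length, so the shorter ones must be padded with PAD. For example, if the length is set to 10, the question "how are you" is padded to "how are you pad pad pad pad pad pad pad". The decoder input must start with GO, end with EOS, and be padded as well, so "fine thank you" becomes "go fine thank you eos pad pad pad pad pad". The third thing to prepare is the target. The target is the same as the decoder input but shifted by exactly one position: drop the leading go and add one more pad at the end, giving "fine thank you eos pad pad pad pad pad pad".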

[mw_shl_code=python,true]import numpy as np

def zero_pad(qtokenized, atokenized, w2idx):
    data_len = len(qtokenized)
    # +2 due to '<go>' and '<eos>'
    idx_q = np.zeros([data_len, limit['maxq']], dtype=np.int32)
    idx_a = np.zeros([data_len, limit['maxa'] + 2], dtype=np.int32)
    idx_o = np.zeros([data_len, limit['maxa'] + 2], dtype=np.int32)

    for i in range(data_len):
        # flag 1: encoder input, flag 2: decoder input, flag 3: target
        q_indices = pad_seq(qtokenized[i], w2idx, limit['maxq'], 1)
        a_indices = pad_seq(atokenized[i], w2idx, limit['maxa'], 2)
        o_indices = pad_seq(atokenized[i], w2idx, limit['maxa'], 3)
        idx_q[i] = np.array(q_indices)
        idx_a[i] = np.array(a_indices)
        idx_o[i] = np.array(o_indices)

    return idx_q, idx_a, idx_o

def pad_seq(seq, lookup, maxlen, flag):
    if flag == 1:
        indices = []
    elif flag == 2:
        # The decoder input starts with GO.
        indices = [lookup[GO]]
    elif flag == 3:
        indices = []
    for word in seq:
        if word in lookup:
            indices.append(lookup[word])
        else:
            indices.append(lookup[UNK])
    if flag == 1:
        # Encoder input: words + PAD, total length maxlen.
        return indices + [lookup[PAD]] * (maxlen - len(seq))
    elif flag == 2:
        # Decoder input: GO + words + EOS + PAD, total length maxlen + 2.
        return indices + [lookup[EOS]] + [lookup[PAD]] * (maxlen - len(seq))
    elif flag == 3:
        # Target: words + EOS + PAD, total length maxlen + 2.
        return indices + [lookup[EOS]] + [lookup[PAD]] * (maxlen - len(seq) + 1)[/mw_shl_code]
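The relationship between the three sequences is easier to see with words instead of indices. The sketch below reproduces the "how are you" / "fine thank you" example with the limits used in this post (maxq = 10, maxa = 8). The helper names `pad_encoder`, `pad_decoder`, and `pad_target` are made up for this illustration.

```python
GO, EOS, PAD = '<go>', '<eos>', '<pad>'


def pad_encoder(tokens, maxlen):
    # Encoder input: words followed by PAD, total length maxlen.
    return tokens + [PAD] * (maxlen - len(tokens))


def pad_decoder(tokens, maxlen):
    # Decoder input: GO + words + EOS + PAD, total length maxlen + 2.
    return [GO] + tokens + [EOS] + [PAD] * (maxlen - len(tokens))


def pad_target(tokens, maxlen):
    # Target: the decoder input shifted left by one, total length maxlen + 2.
    return tokens + [EOS] + [PAD] * (maxlen - len(tokens) + 1)


question = 'how are you'.split()
answer = 'fine thank you'.split()

enc = pad_encoder(question, 10)
dec = pad_decoder(answer, 8)
tgt = pad_target(answer, 8)

print(' '.join(enc))
print(' '.join(dec))
print(' '.join(tgt))
```

At every position t, `tgt[t]` is the word the decoder should output after reading `dec[t]`, which is why the two sequences differ only by a one-step shift.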

Finally, persist all of the structures produced above so that they can be used during training.

Building the graph

[mw_shl_code=python,true]encoder_inputs = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])
decoder_inputs = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])
targets = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])
weights = tf.placeholder(dtype=tf.float32, shape=[batch_size, sequence_length])[/mw_shl_code]

This creates four placeholders: the encoder input, the decoder input, the decoder target, and the weights. batch_size is the number of samples in one batch, and sequence_length is the sequence length we defined.
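The post does not show how the weights are filled in. One common choice, shown here purely as an assumption, is to give every real token a weight of 1.0 and every PAD position a weight of 0.0, so that padding does not contribute to the loss. The sketch assumes PAD has index 3, which follows from index2word = [GO, EOS, UNK, PAD, ...]; the function name `make_loss_weights` and the sample ids are hypothetical.

```python
import numpy as np

PAD_ID = 3  # assumption: PAD is index 3, as in index2word = [GO, EOS, UNK, PAD, ...]


def make_loss_weights(target_ids, pad_id=PAD_ID):
    """Return 1.0 for real tokens and 0.0 for PAD positions."""
    return (np.asarray(target_ids) != pad_id).astype(np.float32)


# Two toy target sequences: word ids, then EOS (1), then PAD (3).
targets_batch = np.array([[10, 11, 12, 1, 3, 3],
                          [20, 21, 1, 3, 3, 3]], dtype=np.int32)
weights_batch = make_loss_weights(targets_batch)
print(weights_batch)
```

Using all-ones weights instead would also run, but the loss would then reward the model for predicting padding.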

[mw_shl_code=python,true]cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers)[/mw_shl_code]

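This creates the recurrent network, here built from LSTM cells. hidden_size is the number of hidden units in each LSTM layer, and num_layers is the number of LSTM layers; MultiRNNCell is used because we want a deeper, more complex network. Note that `[cell] * num_layers` repeats the same cell object: from TensorFlow 1.2 on, this makes the layers share parameters, so create one cell per layer (for example with a list comprehension) if each layer should have its own weights.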

[mw_shl_code=python,true]results, states = tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq(
    tf.unstack(encoder_inputs, axis=1),
    tf.unstack(decoder_inputs, axis=1),
    cell,
    num_encoder_symbols,
    num_decoder_symbols,
    embedding_size,
    feed_previous=False
)[/mw_shl_code]

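The seq2seq structure is built with the embedding_rnn_seq2seq function that TensorFlow provides. We could also start from LSTM cells and build the encoder and decoder ourselves, but using embedding_rnn_seq2seq directly is more convenient. tf.unstack is used to split encoder_inputs and decoder_inputs into lists with one tensor per time step.

num_encoder_symbols and num_decoder_symbols correspond to the size of our vocabulary, and embedding_size is the dimensionality of the embedding layer. The feed_previous argument is important. Setting it to False means this is the training stage, in which decoder_inputs is used as one of the decoder's inputs. Setting it to True means this is the prediction stage: there are no real decoder_inputs at prediction time, so the decoder can only use its own output from the previous time step as the input for the current one.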
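The effect of feed_previous can be illustrated without TensorFlow. The toy loop below is only a conceptual sketch: `step_fn` is a hypothetical stand-in for one decoder step (it maps an input id to a predicted id), and the hidden state is omitted for brevity.

```python
def decode(step_fn, decoder_inputs, feed_previous):
    """Toy decoder loop showing which input each time step receives."""
    outputs = []
    for t, given in enumerate(decoder_inputs):
        if feed_previous and t > 0:
            # Prediction: feed the previous output back in.
            current = outputs[-1]
        else:
            # Training (or the first step): use the supplied decoder input.
            current = given
        outputs.append(step_fn(current))
    return outputs


def toy_step(token_id):
    # Hypothetical "model": always predicts the next id modulo 10.
    return (token_id + 1) % 10


decoder_inputs = [0, 5, 5, 5]  # 0 plays the role of GO

training_outputs = decode(toy_step, decoder_inputs, feed_previous=False)
prediction_outputs = decode(toy_step, decoder_inputs, feed_previous=True)
print(training_outputs)    # every step reads the supplied input
print(prediction_outputs)  # only the first supplied input (GO) is used
```

With feed_previous=True only the first decoder input matters, which is why the prediction code can pass a placeholder array whose first element is GO.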

[mw_shl_code=python,true]logits = tf.stack(results, axis=1)
loss = tf.contrib.seq2seq.sequence_loss(logits, targets=targets, weights=weights)
pred = tf.argmax(logits, axis=2)
train_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)[/mw_shl_code]

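Next, sequence_loss creates the loss, computed from the output of embedding_rnn_seq2seq. The same output is also used for prediction: at each position, the index of the largest value is the index of the predicted word in the vocabulary. The optimizer is AdamOptimizer.

Creating the session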
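To show what this loss and prediction step compute, here is a NumPy sketch of a weighted softmax cross-entropy averaged over all weighted positions, plus the argmax prediction. It assumes the loss is averaged over both batch and time steps; the function name `sequence_loss_np` and the toy numbers are made up for this example, and this is not TensorFlow's implementation.

```python
import numpy as np


def sequence_loss_np(logits, targets, weights):
    """Weighted cross-entropy over a [batch, time, vocab] logits array."""
    logits = np.asarray(logits, dtype=np.float64)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the target word at each position.
    batch_idx, time_idx = np.indices(targets.shape)
    token_nll = -log_probs[batch_idx, time_idx, targets]
    # Positions with weight 0 (for example padding) do not contribute.
    return (token_nll * weights).sum() / weights.sum()


# One sequence, three time steps, vocabulary of four words.
logits = np.zeros((1, 3, 4))
logits[0, 0, 2] = 5.0  # step 0 strongly prefers word 2
logits[0, 1, 1] = 5.0  # step 1 strongly prefers word 1
targets = np.array([[2, 1, 0]])
weights = np.array([[1.0, 1.0, 0.0]])  # the last step is ignored

loss = sequence_loss_np(logits, targets, weights)
pred = logits.argmax(axis=2)  # index of the largest value at each step
print(loss)
print(pred)
```

The loss is small here because the two weighted steps already favor the correct words; the third step has no influence because its weight is zero.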

[mw_shl_code=python,true]saver = tf.train.Saver()
# Note: train_weights (the per-token loss weights) is not defined in this
# snippet; it is assumed to be prepared elsewhere.

with tf.Session() as sess:
    # Resume from the latest checkpoint if one exists, otherwise start fresh.
    ckpt = tf.train.get_checkpoint_state(model_dir)
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)
    else:
        sess.run(tf.global_variables_initializer())
    # The data does not change between epochs, so load it once.
    train_x, train_y, train_target = loadQA()
    epoch = 0
    while epoch < 5000000:
        epoch = epoch + 1
        print("epoch:", epoch)
        # Only one batch per epoch, because the sample set is small.
        for step in range(0, 1):
            print("step:", step)
            start = step * batch_size
            train_encoder_inputs = train_x[start:start + batch_size, :]
            train_decoder_inputs = train_y[start:start + batch_size, :]
            train_targets = train_target[start:start + batch_size, :]
            feed = {encoder_inputs: train_encoder_inputs, targets: train_targets,
                    weights: train_weights, decoder_inputs: train_decoder_inputs}
            # Run the update and fetch the loss in a single call.
            _, cost = sess.run([train_op, loss], feed_dict=feed)
            print(cost)
        # Save a checkpoint every 100 epochs.
        if epoch % 100 == 0:
            saver.save(sess, model_dir + '/model.ckpt', global_step=epoch + 1)[/mw_shl_code]

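After creating the session, training starts. A tf.train.Saver object is used to save and restore the model. To be safe, the model is saved at regular intervals, so that after a restart training continues where it left off instead of starting from scratch. Because this is only an example with a small number of QA pairs, all of them are fed in as a single batch rather than being split into several batches.

Prediction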
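For a larger sample set, the data would need to be split into several batches. The sketch below shows one way to do that with NumPy slicing. Since the placeholders in this post have a fixed batch_size, it drops a final incomplete batch. The generator name `iter_batches` and the toy arrays are assumptions for this example.

```python
import numpy as np


def iter_batches(encoder_data, decoder_data, target_data, batch_size):
    """Yield aligned (encoder, decoder, target) slices of exactly batch_size rows."""
    n = len(encoder_data)
    # Stop before a final batch that would be smaller than batch_size.
    for start in range(0, n - batch_size + 1, batch_size):
        end = start + batch_size
        yield encoder_data[start:end], decoder_data[start:end], target_data[start:end]


# Toy data: 5 samples, sequence length 4.
train_x = np.arange(20).reshape(5, 4)
train_y = train_x + 100
train_target = train_x + 200

batches = list(iter_batches(train_x, train_y, train_target, batch_size=2))
for enc_batch, dec_batch, tgt_batch in batches:
    print(enc_batch.shape, dec_batch.shape, tgt_batch.shape)
```

Each yielded triple could then be placed in the feed_dict of one training step.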

[mw_shl_code=python,true]import numpy as np
import tensorflow as tf

with tf.device('/cpu:0'):
    # These values must match the ones used for training.
    batch_size = 1
    sequence_length = 10
    num_encoder_symbols = 1004
    num_decoder_symbols = 1004
    embedding_size = 256
    hidden_size = 256
    num_layers = 2

    encoder_inputs = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])
    decoder_inputs = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])

    targets = tf.placeholder(dtype=tf.int32, shape=[batch_size, sequence_length])
    weights = tf.placeholder(dtype=tf.float32, shape=[batch_size, sequence_length])

    cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
    cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers)

    # feed_previous=True: the decoder feeds its own previous output back in.
    results, states = tf.contrib.legacy_seq2seq.embedding_rnn_seq2seq(
        tf.unstack(encoder_inputs, axis=1),
        tf.unstack(decoder_inputs, axis=1),
        cell,
        num_encoder_symbols,
        num_decoder_symbols,
        embedding_size,
        feed_previous=True,
    )
    logits = tf.stack(results, axis=1)
    pred = tf.argmax(logits, axis=2)

    saver = tf.train.Saver()
    with tf.Session() as sess:
        module_file = tf.train.latest_checkpoint('./model/')
        saver.restore(sess, module_file)
        # Word_Id_Map is not defined in this post; it converts between words and indices.
        word_id_map = Word_Id_Map()
        encoder_input = word_id_map.sentence2ids(['you', 'want', 'to', 'turn', 'twitter', 'followers', 'into', 'blog', 'readers'])

        # Pad the question to length 10 with index 3 (PAD in index2word).
        encoder_input = encoder_input + [3 for i in range(0, 10 - len(encoder_input))]
        encoder_input = np.asarray([np.asarray(encoder_input)])
        # All zeros, i.e. index 0 (GO in index2word).
        decoder_input = np.zeros([1, 10])
        print('encoder_input : ', encoder_input)
        print('decoder_input : ', decoder_input)
        pred_value = sess.run(pred, feed_dict={encoder_inputs: encoder_input, decoder_inputs: decoder_input})
        print(pred_value)
        sentence = word_id_map.ids2sentence(pred_value[0])
        print(sentence)[/mw_shl_code]

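The prediction stage builds the same model, loads the weights saved during training, and then predicts the answer to a question. The CPU is sufficient for prediction, so we avoid using the GPU. Building the graph is essentially the same as in training, and the parameters must be kept identical. The difference is that the feed_previous argument of embedding_rnn_seq2seq is set to True, because there is no longer any real decoder input. The loss function and optimizer are not needed either; only the prediction op is required.

After the session is created, we first load the model from the model directory, then convert the test question into a vector of indices and run the prediction. The output is:
['how', 'do', 'you', 'do', 'this', '', '', '', '', ''].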
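Turning predicted indices back into a readable answer amounts to an index-to-word lookup that stops at EOS and skips PAD. The sketch below shows the idea with a tiny hand-written vocabulary. The function `ids_to_words` and the sample ids are hypothetical and are not the ids2sentence method used in the post.

```python
SPECIALS = ['<go>', '<eos>', 'unk', '<pad>']


def ids_to_words(ids, index2word, eos='<eos>', pad='<pad>'):
    """Convert predicted ids to words, stopping at EOS and skipping PAD."""
    words = []
    for i in ids:
        word = index2word[i]
        if word == eos:
            break
        if word != pad:
            words.append(word)
    return words


# Toy vocabulary: the four special symbols followed by a few words.
index2word = SPECIALS + ['how', 'do', 'you', 'this']
pred_ids = [4, 5, 6, 5, 7, 1, 3, 3, 3, 3]

answer_words = ids_to_words(pred_ids, index2word)
print(' '.join(answer_words))
```

Stopping at the first EOS keeps any tokens the decoder may emit afterwards out of the final answer.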

Example GitHub repository

https://github.com/sea-boat/seq2seq_chatbot.git

Source: http://blog.csdn.net/wangyangzhizhou/article/details/78119339
Author: 汪洋之舟---seaboat