
TensorFlow ML cookbook, Chapter 7 Sections 2 & 3: Implementing TF-IDF and Working with Skip-gram Embeddings

Guiding questions:
1. How should we understand TF-IDF?
2. How do we implement a TF-IDF embedding in TensorFlow?
3. How do we work with skip-gram embeddings?
4. How do we implement the skip-gram model on movie review data?




Previous: TensorFlow ML cookbook, Chapter 7 Section 1: Natural Language Processing, Working with Bag of Words

Implementing TF-IDF
Since we can choose the embedding for each word, we might decide to change the weighting on certain words. One such strategy is to upweight useful words and downweight overly common or too rare words. The embedding we will explore in this recipe is an attempt to achieve this.

Getting ready
TF-IDF is an acronym that stands for Text Frequency - Inverse Document Frequency. This term is essentially the product of the text frequency and the inverse document frequency for each word.
In the prior recipe, we introduced the bag of words methodology, which assigned a value of one for every occurrence of a word in a sentence. This is probably not ideal, because each category of sentence (spam and ham in the prior recipe's example) most likely has the same frequency of the, and, and other common words, whereas words such as viagra and sale should probably carry more weight in figuring out whether or not a text is spam.

We first want to take the word frequency into consideration. Here we consider how often a word occurs within an individual entry. The purpose of this part (TF) is to find terms that appear to be important in each entry:
2019-04-23_160355.jpg

But words such as the and and may appear very frequently in every entry. We want to downweight the importance of these words, so we can imagine that multiplying the above text frequency (TF) by the inverse of the whole-document frequency might help find the important words. But since a collection of texts (a corpus) may be quite large, it is common to take the logarithm of the inverse document frequency. This leaves us with the following formula for the TF-IDF of each word in each document entry:
w_tfidf = w_tf * log( N / w_df )
Here w_tf is the frequency of the word within a given document (entry), w_df is the frequency of that word across all documents, and N is the total number of documents. We can imagine that high values of TF-IDF might indicate words that are very important for determining what a document is about.
Creating the TF-IDF vectors requires us to load all the text into memory and count the occurrences of each word before we can start training our model. Because of this, it is not implemented fully in TensorFlow, so we will use scikit-learn to create our TF-IDF embedding, but use TensorFlow to fit the logistic model.
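As a quick illustration of the down-weighting effect (a minimal sketch on a made-up toy corpus, not part of the book's recipe), scikit-learn's TfidfVectorizer exposes the learned idf_ weights, and a word that appears in every document receives the smallest weight:
[mw_shl_code=python,true]from sklearn.feature_extraction.text import TfidfVectorizer
# Toy corpus: 'the' appears in every document, 'barked' in only one
toy_corpus = ['the cat sat', 'the dog sat', 'the dog barked']
toy_tfidf = TfidfVectorizer()
toy_tfidf.fit(toy_corpus)
# Inverse document frequencies per vocabulary word
# (use get_feature_names_out() on newer scikit-learn versions)
print(dict(zip(toy_tfidf.get_feature_names(), toy_tfidf.idf_)))
# 'the' gets the lowest idf, 'barked' the highest [/mw_shl_code]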

How to do it…
1.We start by loading the necessary libraries, and this time we are also loading the scikit-learn TF-IDF preprocessing library for our texts. Use the following code:
[mw_shl_code=python,true]import tensorflow as tf
import matplotlib.pyplot as plt
import csv
import numpy as np
import os
import string
import requests
import io
import nltk
from zipfile import ZipFile
from sklearn.feature_extraction.text import TfidfVectorizer [/mw_shl_code]

2.We start a graph session and declare our batch size and the maximum feature size for our vocabulary:
[mw_shl_code=python,true]sess = tf.Session()
batch_size = 200
max_features = 1000[/mw_shl_code]

3.Next we load the data, either from the Web or from our temp data folder if we have saved it before. Use the following code:
[mw_shl_code=python,true]save_file_name = os.path.join('temp','temp_spam_data.csv')
if os.path.isfile(save_file_name):
  text_data = []
  with open(save_file_name, 'r') as temp_output_file:
    reader = csv.reader(temp_output_file)
    for row in reader:
      text_data.append(row)
else:
  zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
  r = requests.get(zip_url)
  z = ZipFile(io.BytesIO(r.content))
  file = z.read('SMSSpamCollection')
  # Format Data
  text_data = file.decode()
  text_data = text_data.encode('ascii',errors='ignore')
  text_data = text_data.decode().split('\n')
  text_data = [x.split('\t') for x in text_data if len(x)>=1]
  # And write to csv
  with open(save_file_name, 'w') as temp_output_file:
    writer = csv.writer(temp_output_file)
    writer.writerows(text_data)
texts = [x[1] for x in text_data]
target = [x[0] for x in text_data]
# Relabel 'spam' as 1, 'ham' as 0
target = [1. if x=='spam' else 0. for x in target] [/mw_shl_code]
4.Just like in the prior recipe, we will decrease our vocabulary size by converting everything to lowercase, removing punctuation, and getting rid of numbers:
[mw_shl_code=python,true]# Lower case
texts = [x.lower() for x in texts]
# Remove punctuation
texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
# Remove numbers
texts = [''.join(c for c in x if c not in '0123456789') for x in texts]
# Trim extra whitespace
texts = [' '.join(x.split()) for x in texts][/mw_shl_code]

5.In order to use scikit-learn's TF-IDF processing functions, we have to tell it how to tokenize our sentences. By this, we just mean how to break up a sentence into its corresponding words. A great tokenizer is already built for us in the nltk package, and it does a fine job of breaking sentences into the corresponding words:
[mw_shl_code=python,true]def tokenizer(text):
  words = nltk.word_tokenize(text)
  return words
# Create TF-IDF of texts
tfidf = TfidfVectorizer(tokenizer=tokenizer, stop_words='english', max_features=max_features)
sparse_tfidf_texts = tfidf.fit_transform(texts) [/mw_shl_code]
6.Next we break up our dataset into a train and a test set. Use the following code:
[mw_shl_code=python,true]train_indices = np.random.choice(sparse_tfidf_texts.shape[0], round(0.8*sparse_tfidf_texts.shape[0]), replace=False)
test_indices = np.array(list(set(range(sparse_tfidf_texts.shape[0])) - set(train_indices)))
texts_train = sparse_tfidf_texts[train_indices]
texts_test = sparse_tfidf_texts[test_indices]
target_train = np.array([x for ix, x in enumerate(target) if ix in train_indices])
target_test = np.array([x for ix, x in enumerate(target) if ix in test_indices]) [/mw_shl_code]

7.Now we can declare our model variables for logistic regression and our data placeholders:
[mw_shl_code=python,true]A = tf.Variable(tf.random_normal(shape=[max_features,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
# Initialize placeholders
x_data = tf.placeholder(shape=[None, max_features], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32) [/mw_shl_code]

8.We can now declare the model operations and the loss function. Remember that the sigmoid part of the logistic regression is inside our loss function. Use the following code:
[mw_shl_code=python,true]model_output = tf.add(tf.matmul(x_data, A), b)
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(model_output, y_target))[/mw_shl_code]
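Note that this positional-argument call only works with the pre-1.0 TensorFlow API used throughout this recipe. On TensorFlow 1.x, the same loss has to be written with keyword arguments, roughly as follows:
[mw_shl_code=python,true]# TensorFlow 1.x form of the same loss (keyword arguments are required)
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y_target, logits=model_output))[/mw_shl_code]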
9.We add a prediction and an accuracy function to the graph so that we can see the accuracy of the train and test sets as our model trains:
[mw_shl_code=python,true]prediction = tf.round(tf.sigmoid(model_output))
predictions_correct = tf.cast(tf.equal(prediction, y_target), tf.float32)
accuracy = tf.reduce_mean(predictions_correct)[/mw_shl_code]
10.We declare an optimizer and initialize our graph variables next:
[mw_shl_code=python,true]my_opt = tf.train.GradientDescentOptimizer(0.0025)
train_step = my_opt.minimize(loss)
# Initialize Variables
init = tf.initialize_all_variables()
sess.run(init) [/mw_shl_code]
11.We now train our model over 10,000 generations, record the test/train loss and accuracy every 100 generations, and print out the status every 500 generations. Use the following code:
[mw_shl_code=python,true]train_loss = []
test_loss = []
train_acc = []
test_acc = []
i_data = []
for i in range(10000):
  rand_index = np.random.choice(texts_train.shape[0], size=batch_size)
  rand_x = texts_train[rand_index].todense()
  rand_y = np.transpose([target_train[rand_index]])
  sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
  # Only record loss and accuracy every 100 generations
  if (i+1)%100==0:
    i_data.append(i+1)
    train_loss_temp = sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})
    train_loss.append(train_loss_temp)
    test_loss_temp = sess.run(loss, feed_dict={x_data: texts_test.todense(), y_target: np.transpose([target_test])})
    test_loss.append(test_loss_temp)
    train_acc_temp = sess.run(accuracy, feed_dict={x_data: rand_x, y_target: rand_y})
    train_acc.append(train_acc_temp)
    test_acc_temp = sess.run(accuracy, feed_dict={x_data: texts_test.todense(), y_target: np.transpose([target_test])})
    test_acc.append(test_acc_temp)
  if (i+1)%500==0:
    acc_and_loss = [i+1, train_loss_temp, test_loss_temp, train_acc_temp, test_acc_temp]
    acc_and_loss = [np.round(x,2) for x in acc_and_loss]
    print('Generation # {}. Train Loss (Test Loss): {:.2f} ({:.2f}). Train Acc (Test Acc): {:.2f} ({:.2f})'.format(*acc_and_loss)) [/mw_shl_code]
12.This results in the following output:
[mw_shl_code=python,true]Generation # 500. Train Loss (Test Loss): 0.69 (0.73). Train Acc (Test Acc): 0.62 (0.57)
Generation # 1000. Train Loss (Test Loss): 0.62 (0.63). Train Acc (Test Acc): 0.68 (0.66)
...
Generation # 9500. Train Loss (Test Loss): 0.39 (0.45). Train Acc (Test Acc): 0.89 (0.85)
Generation # 10000. Train Loss (Test Loss): 0.48 (0.45). Train Acc (Test Acc): 0.84 (0.85) [/mw_shl_code]

13.And here is the code to plot the accuracy and loss for both the train and test sets:

2019-04-23_160939.jpg
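The screenshot above stands in for the plotting code, which is not reproduced in this post. A minimal sketch of what such matplotlib code might look like, using the i_data, loss, and accuracy lists collected during training:
[mw_shl_code=python,true]# Plot the cross entropy loss over generations
plt.plot(i_data, train_loss, 'k-', label='Train Loss')
plt.plot(i_data, test_loss, 'r--', label='Test Loss', linewidth=4)
plt.title('Cross Entropy Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('Cross Entropy Loss')
plt.legend(loc='upper right')
plt.show()

# Plot train and test accuracy
plt.plot(i_data, train_acc, 'k-', label='Train Set Accuracy')
plt.plot(i_data, test_acc, 'r--', label='Test Set Accuracy', linewidth=4)
plt.title('Train and Test Accuracy')
plt.xlabel('Generation')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show() [/mw_shl_code]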

How it works…
Using the TF-IDF values in the model has increased our prediction accuracy over the prior bag of words model from 80% to almost 90%. We achieved this by using scikit-learn's TF-IDF vocabulary processing functions and feeding those TF-IDF values into the logistic regression.

There's more…
While we might have addressed the issue of word importance, we have not addressed the issue of word ordering. Neither bag of words nor TF-IDF has any feature that takes the ordering of words in a sentence into account. We will attempt to address this in the next few sections, which will introduce us to Word2vec techniques.


Working with Skip-gram Embeddings
In the prior recipes, we dictated our textual embeddings before training the model. With neural networks, we can make the embedding values part of the training procedure. The first such method we will explore is called skip-gram embedding.

Getting ready
Prior to this recipe, we have not considered the order of words to be relevant in creating word embeddings. In early 2013, Tomas Mikolov and other researchers at Google authored a paper about creating word embeddings that addresses this issue (https://arxiv.org/abs/1301.3781), and they named their method Word2vec.

The basic idea is to create word embeddings that capture the relational aspect of words. We seek to understand how various words relate to each other. Some examples of how these embeddings might behave are as follows:
king - man + woman = queen
India pale ale - hops + malt = stout
We might achieve such a numerical representation of words if we only consider their positional relationships to each other. If we could analyze a large enough source of coherent documents, we might find that the words king, man, and queen are mentioned close to each other in our texts. If we also know that man and woman are related in a different way, then we might conclude that man is to king as woman is to queen, and so on.
To go about finding such an embedding, we will use a neural network that predicts the surrounding words given an input word. We could just as easily switch that around and try to predict a target word given a set of surrounding words, but we will start with the former method. Both are variations of the Word2vec procedure. The method of predicting the surrounding words (the context) from a target word is called the skip-gram model. In the next recipe, we will implement the other method, predicting the target word from the context, which is called the continuous bag of words (CBOW) method:


2019-04-23_161609.jpg
Figure 4: An illustration of the skip-gram implementation of Word2vec. The skip-gram predicts a window of context from the target word (window size of 1 on each side).

For this recipe, we will implement the skip-gram model on a set of movie review data from Cornell University (http://www.cs.cornell.edu/people/pabo/movie-review-data/). The CBOW method will be implemented in the next recipe.

How to do it…
For this recipe, we will create several helper functions: functions that load the data, normalize the text, generate the vocabulary, and generate data batches. Only after all this will we start training our word embeddings. To be clear, we are not predicting any target variable; we will be fitting the word embeddings instead:

1.We load the necessary libraries and start a graph session:
[mw_shl_code=python,true]import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import random
import os
import string
import requests
import collections
import io
import tarfile
import urllib.request
from nltk.corpus import stopwords
sess = tf.Session() [/mw_shl_code]
2.We declare some model parameters. We will look at 50 pairs of word embeddings at a time (the batch size). The embedding size of each word will be a vector of length 200, and we will only consider the 10,000 most frequent words (every other word will be classified as unknown). We will train for 50,000 generations and print out the loss every 500. Then we declare a num_sampled variable that we will use in the loss function (explained later), and we also declare our skip-gram window size. Here we set the window size to 2, so we will look at the two surrounding words on each side of the target. We set our stopwords from the Python package nltk. We also want a way to check how our word embeddings are performing, so we choose some common movie review words and will print out their nearest neighbor words every 2,000 iterations:
[mw_shl_code=python,true]batch_size = 50
embedding_size = 200
vocabulary_size = 10000
generations = 50000
print_loss_every = 500
num_sampled = int(batch_size/2)
window_size = 2

stops = stopwords.words('english')
print_valid_every = 2000
valid_words = ['cliche', 'love', 'hate', 'silly', 'sad'] [/mw_shl_code]
3.Next we declare our data loading function, which downloads the data only if we have not downloaded it before, or loads it from disk if we have saved it previously. Use the following code:
[mw_shl_code=python,true]def load_movie_data():
  save_folder_name = 'temp'
  pos_file = os.path.join(save_folder_name, 'rt-polarity.pos')
  neg_file = os.path.join(save_folder_name, 'rt-polarity.neg')
  # Check if files are already downloaded
  if os.path.exists(save_folder_name):
    pos_data = []
    with open(pos_file, 'r') as temp_pos_file:
      for row in temp_pos_file:
        pos_data.append(row)
    neg_data = []
    with open(neg_file, 'r') as temp_neg_file:
      for row in temp_neg_file:
        neg_data.append(row)
  else: # If not downloaded, download and save
    movie_data_url = 'http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz'
    stream_data = urllib.request.urlopen(movie_data_url)
    tmp = io.BytesIO()
    while True:
      s = stream_data.read(16384)
      if not s:
        break
      tmp.write(s)
    stream_data.close()
    tmp.seek(0)
    tar_file = tarfile.open(fileobj=tmp, mode="r:gz")
    pos = tar_file.extractfile('rt-polaritydata/rt-polarity.pos')
    neg = tar_file.extractfile('rt-polaritydata/rt-polarity.neg')
    # Save pos/neg reviews
    pos_data = []
    for line in pos:
      pos_data.append(line.decode('ISO-8859-1').encode('ascii',errors='ignore').decode())
    neg_data = []
    for line in neg:
      neg_data.append(line.decode('ISO-8859-1').encode('ascii',errors='ignore').decode())
    tar_file.close()
    # Write to file
    if not os.path.exists(save_folder_name):
      os.makedirs(save_folder_name)
    # Save files
    with open(pos_file, "w") as pos_file_handler:
      pos_file_handler.write(''.join(pos_data))
    with open(neg_file, "w") as neg_file_handler:
      neg_file_handler.write(''.join(neg_data))
  texts = pos_data + neg_data
  target = [1]*len(pos_data) + [0]*len(neg_data)
  return(texts, target)
texts, target = load_movie_data() [/mw_shl_code]
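A quick sanity check at this point (not part of the book's listing) can confirm the download: the Cornell rt-polarity set should contain 5,331 positive and 5,331 negative snippets:
[mw_shl_code=python,true]print(len(texts), sum(target))  # expected: 10662 5331[/mw_shl_code]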
4.Next we create a normalization function for the text. This function takes a list of strings and applies lowercasing, removes punctuation, removes numbers, trims extra whitespace, and removes stopwords. Use the following code:
[mw_shl_code=python,true]def normalize_text(texts, stops):
# Lower case
  texts = [x.lower() for x in texts]
# Remove punctuation
  texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
# Remove numbers
  texts = [''.join(c for c in x if c not in '0123456789') for x in texts]
# Remove stopwords
  texts = [' '.join([word for word in x.split() if word not in (stops)]) for x in texts]
# Trim extra whitespace
  texts = [' '.join(x.split()) for x in texts]
  return(texts)
texts = normalize_text(texts, stops) [/mw_shl_code]
5.To make sure that all our movie reviews are informative, we should make sure they are long enough to contain important word relationships. We arbitrarily set this to three or more words:
[mw_shl_code=python,true]target = [target[ix] for ix, x in enumerate(texts) if len(x.split()) > 2]
texts = [x for x in texts if len(x.split()) > 2][/mw_shl_code]
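Note that this extract jumps from step 5 to step 8: the steps that define the build_dictionary and text_to_numbers helpers used below are missing from this post. A minimal sketch of what such helpers might look like (a hypothetical implementation built on collections.Counter, with index 0 reserved for rare/unknown words):
[mw_shl_code=python,true]def build_dictionary(sentences, vocabulary_size):
  # Split sentences into words and count them
  words = [word for sentence in sentences for word in sentence.split()]
  # Reserve index 0 for rare/unknown words
  count = [['RARE', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
  # Map each word to an integer index in order of frequency
  word_dict = {}
  for word, _ in count:
    word_dict[word] = len(word_dict)
  return(word_dict)

def text_to_numbers(sentences, word_dict):
  # Convert each sentence into a list of word indices (0 = unknown)
  data = []
  for sentence in sentences:
    data.append([word_dict.get(word, 0) for word in sentence.split()])
  return(data) [/mw_shl_code]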

8.Now we can actually create our dictionary and transform our list of sentences into lists of word indices:
[mw_shl_code=python,true]word_dictionary = build_dictionary(texts, vocabulary_size)
word_dictionary_rev = dict(zip(word_dictionary.values(), word_dictionary.keys()))
text_data = text_to_numbers(texts, word_dictionary) [/mw_shl_code]
9.From the preceding word dictionary, we can look up the indices of the validation words we chose in step 2. Use the following code:
[mw_shl_code=python,true]valid_examples = [word_dictionary[x] for x in valid_words][/mw_shl_code]
10.We now create a function that will return our skip-gram batches. We want to train on pairs of words where one word is the training input (the target word at the center of our window) and the other word is selected from the window. For example, the sentence the cat in the hat may result in (input, output) pairs such as the following: (the, in), (cat, in), (the, in), (hat, in), if in is the target word and we have a window size of two in each direction:
[mw_shl_code=python,true]def generate_batch_data(sentences, batch_size, window_size, method='skip_gram'):
# Fill up data batch
  batch_data = []
  label_data = []
  while len(batch_data) < batch_size:
# select random sentence to start
    rand_sentence = np.random.choice(sentences)
# Generate consecutive windows to look at
    window_sequences = [rand_sentence[max((ix-window_size),0):(ix+window_size+1)] for ix, x in enumerate(rand_sentence)]
# Denote which element of each window is the center word of interest
    label_indices = [ix if ix<window_size else window_size for ix,x in enumerate(window_sequences)]
# Pull out center word of interest for each window and create a tuple for each window
    if method=='skip_gram':
      batch_and_labels = [(x[y], x[:y] + x[(y+1):]) for x,y in zip(window_sequences, label_indices)]
# Make it in to a big list of tuples (target word, surrounding word)
      tuple_data = [(x, y_) for x,y in batch_and_labels for y_ in y]
    else:
      raise ValueError('Method {} not implemented yet.'.format(method))
# extract batch and labels
    batch, labels = [list(x) for x in zip(*tuple_data)]
    batch_data.extend(batch[:batch_size])
    label_data.extend(labels[:batch_size])
# Trim batch and label at the end
  batch_data = batch_data[:batch_size]
  label_data = label_data[:batch_size]
# Convert to numpy array
  batch_data = np.array(batch_data)
  label_data = np.transpose(np.array([label_data]))
  return(batch_data, label_data) [/mw_shl_code]
11.We can now initialize our embedding matrix, and declare our placeholders and our embedding lookup function. Use the following code:
[mw_shl_code=python,true]embeddings = tf.Variable(tf.random_uniform([vocabulary_size,
embedding_size], -1.0, 1.0))
# Create data/target placeholders
x_inputs = tf.placeholder(tf.int32, shape=[batch_size])
y_target = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
# Lookup the word embedding:
embed = tf.nn.embedding_lookup(embeddings, x_inputs) [/mw_shl_code]
12.The loss function should be something such as a softmax, which calculates the loss when a wrong word category is predicted. But since our target has 10,000 different categories, it is very sparse. This sparsity causes problems for the model when fitting or converging. To tackle this, we use a loss function called noise-contrastive estimation (NCE). This NCE loss function turns our problem into a binary prediction by predicting the word class against random noise predictions. The num_sampled parameter specifies how much of the batch to turn into random noise:
[mw_shl_code=python,true]nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size,
embedding_size], stddev=1.0 / np.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
loss = tf.reduce_mean(tf.nn.nce_loss(nce_weights, nce_biases, embed,
y_target, num_sampled, vocabulary_size))[/mw_shl_code]

13.Now we need a way to find words that are near our validation words. We will do this by computing the cosine similarity between the validation set and all of our word embeddings; then we can print out the closest set of words for each validation word. Use the following code:
[mw_shl_code=python,true]norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)[/mw_shl_code]
14.We now declare our optimizer function and initialize our model variables:
[mw_shl_code=python,true]optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)
init = tf.initialize_all_variables()
sess.run(init) [/mw_shl_code]
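As with the previous recipe, this uses the pre-1.0 TensorFlow API. On later 1.x versions, tf.initialize_all_variables() is deprecated and the equivalent call would be:
[mw_shl_code=python,true]# TF >= 0.12 replacement for tf.initialize_all_variables()
init = tf.global_variables_initializer()
sess.run(init)[/mw_shl_code]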
15.Now we can train our embeddings and, during training, print out the loss and the words closest to our validation set. Use the following code:
[mw_shl_code=python,true]loss_vec = []
loss_x_vec = []
for i in range(generations):
  batch_inputs, batch_labels = generate_batch_data(text_data, batch_size, window_size)
  feed_dict = {x_inputs : batch_inputs, y_target : batch_labels}
# Run the train step
  sess.run(optimizer, feed_dict=feed_dict)
# Return the loss
  if (i+1) % print_loss_every == 0:
    loss_val = sess.run(loss, feed_dict=feed_dict)
    loss_vec.append(loss_val)
    loss_x_vec.append(i+1)
    print("Loss at step {} : {}".format(i+1, loss_val))
# Validation: Print some random words and top 5 related words
  if (i+1) % print_valid_every == 0:
    sim = sess.run(similarity, feed_dict=feed_dict)
    for j in range(len(valid_words)):
      valid_word = word_dictionary_rev[valid_examples[j]]
      top_k = 5 # number of nearest neighbors
      nearest = (-sim[j, :]).argsort()[1:top_k+1]
      log_str = "Nearest to {}:".format(valid_word)
      for k in range(top_k):
        close_word = word_dictionary_rev[nearest[k]]
        log_str = "%s %s," % (log_str, close_word)
      print(log_str) [/mw_shl_code]

16.This results in the following output:
[mw_shl_code=python,true]Loss at step 500 : 13.387781143188477
Loss at step 1000 : 7.240757465362549
Loss at step 49500 : 0.9395825862884521
Loss at step 50000 : 0.30323168635368347
Nearest to cliche: walk, intrigue, brim, eileen, dumber,
Nearest to love: plight, fiction, complete, lady, bartleby,
Nearest to hate: style, throws, players, fearlessness, astringent,
Nearest to silly: delivers, meow, regain, nicely, anger,
Nearest to sad: dizzying, variety, existing, environment, tunney, [/mw_shl_code]

How it works…
We have trained a Word2vec model, via the skip-gram method, on a corpus of movie review data. We downloaded the data, converted the words to indices with a dictionary, and used those index numbers in an embedding lookup, which we trained so that nearby words can be predictive of each other.

There's more…
At first glance, we might expect the set of words near the validation words to be synonyms. This is not quite the case, because synonyms rarely appear right next to each other in sentences. What we are really getting at is predicting which words appear in proximity to each other in our dataset. We hope that using embeddings like these will make prediction easier.
In order to use these embeddings, we must make them reusable and save them. We do this in the next recipe by implementing the CBOW embeddings.
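As a stopgap before that recipe, one simple way to persist the trained matrix (a minimal sketch, not the book's approach; the filename is arbitrary) is to evaluate it and save it with numpy:
[mw_shl_code=python,true]# Evaluate the normalized embedding matrix and save it for later reuse
final_embeddings = sess.run(normalized_embeddings)
np.save(os.path.join('temp', 'movie_vocab_embeddings.npy'), final_embeddings)
# word_dictionary must be saved too (e.g. with pickle) to look rows up by word [/mw_shl_code]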




