BERT Word Embeddings Tutorial

This article is translated from the BERT Word Embeddings Tutorial; I have condensed parts of it. Please credit the source when reposting.

1. Loading Pre-Trained BERT

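Install the PyTorch interface for BERT from Hugging Face. The library also includes interfaces for other pre-trained language models, such as OpenAI's GPT and GPT-2.

If you are running this code on Google Colab, you will have to install this library each time you reconnect.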

!pip install transformers

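BERT is a pre-trained model released by Google. It was trained on Wikipedia and Book Corpus, a dataset of more than 10,000 books of different genres. Google released several BERT variants; here we use the smaller of the two available sizes ("base" and "large"), in the uncased version that ignores letter case.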


Now let's import PyTorch, the pre-trained BERT model, and a BERT tokenizer.

import torch
from transformers import BertTokenizer, BertModel
# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
# logging.basicConfig(level=logging.INFO)
import matplotlib.pyplot as plt
%matplotlib inline
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

2. Input Formatting


  1. A special token, [SEP], to mark the end of a sentence, or the separation between two sentences
  2. A special token, [CLS], at the beginning of our text. This token is used for classification tasks, but BERT expects it no matter what your application is.
  3. Tokens that conform with the fixed vocabulary used in BERT
  4. The Token IDs for the tokens, from BERT’s tokenizer
  5. Mask IDs to indicate which elements in the sequence are tokens and which are padding elements
  6. Segment IDs used to distinguish different sentences
  7. Positional Embeddings used to show token position within the sequence
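
To make items 4 and 5 of the list above concrete, here is a minimal plain-Python sketch that adds the special tokens, looks up token IDs, and builds the matching mask IDs for a padded sequence. The helper `build_inputs` and the tiny `TOY_VOCAB` are hypothetical and exist only for this illustration. The `[PAD]` ID of 0 is an assumption, and the other IDs are copied from the token listing printed later in this article. In practice the IDs come from BERT's tokenizer.

```python
# Toy vocabulary for illustration only; a real BERT vocabulary has about 30,000 entries.
TOY_VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
             "the": 1996, "bank": 2924, "robber": 27307}


def build_inputs(tokens, vocab, max_len):
    """Return (token_ids, mask_ids) for one sentence, padded to max_len."""
    marked = ["[CLS]"] + tokens + ["[SEP]"]
    if len(marked) > max_len:
        raise ValueError("sentence is longer than max_len")
    n_pad = max_len - len(marked)
    # Real tokens are looked up in the vocabulary; padding uses the [PAD] ID.
    token_ids = [vocab[t] for t in marked] + [vocab["[PAD]"]] * n_pad
    # 1 marks a real token, 0 marks a padding position.
    mask_ids = [1] * len(marked) + [0] * n_pad
    return token_ids, mask_ids


token_ids, mask_ids = build_inputs(["the", "bank", "robber"], TOY_VOCAB, max_len=8)
print(token_ids)  # [101, 1996, 2924, 27307, 102, 0, 0, 0]
print(mask_ids)   # [1, 1, 1, 1, 1, 0, 0, 0]
```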



2.1 Special Tokens



[CLS] The man went to the store. [SEP] He bought a gallon of milk.


[CLS] The man went to the store. [SEP]

2.2 Tokenization


text = "Here is the sentence I want embeddings for."
marked_text = "[CLS] " + text + " [SEP]"
# Tokenize our sentence with the BERT tokenizer.
tokenized_text = tokenizer.tokenize(marked_text)
# Print out the tokens.
print (tokenized_text)
# ['[CLS]', 'here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.', '[SEP]']

Notice how the word "embeddings" is represented: ['em', '##bed', '##ding', '##s']



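  1. Whole words
  2. Subwords that occur at the start of a word or on their own (the "em" in "embeddings" gets the same vector as the standalone "em" in "go get em")
  3. Subwords that are not at the start of a word, which are prefixed with "##"
  4. Individual characters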


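So even though the word "embeddings" is not in the vocabulary, it is not treated as an unknown word. Instead it is split into the subword tokens ['em', '##bed', '##ding', '##s'], which retain some of the contextual meaning of the original word. We can even average the embedding vectors of these subwords to produce an approximate vector for the original word. For more on WordPiece, see the original paper.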
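
The splitting behaviour can be sketched as a greedy longest-match-first search over the vocabulary, which is how BERT's WordPiece tokenizer is generally understood to split a word at tokenization time. The function `wordpiece_tokenize` and the small `TOY_VOCAB` below are hypothetical and written only for this illustration; they are not the library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedily split one lowercase word into the longest matching sub-word pieces."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # pieces inside a word carry the "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matches, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary for illustration only.
TOY_VOCAB = {"em", "##bed", "##ding", "##s", "bank"}

print(wordpiece_tokenize("embeddings", TOY_VOCAB))  # ['em', '##bed', '##ding', '##s']
print(wordpiece_tokenize("bank", TOY_VOCAB))        # ['bank']
print(wordpiece_tokenize("xyz", TOY_VOCAB))         # ['[UNK]']
```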




# Define a new example sentence with multiple meanings of the word "bank"
text = "After stealing money from the bank vault, the bank robber was seen " \
       "fishing on the Mississippi river bank."
# Add the special tokens.
marked_text = "[CLS] " + text + " [SEP]"
# Split the sentence into tokens.
tokenized_text = tokenizer.tokenize(marked_text)
# Map the token strings to their vocabulary indices.
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Display the words with their indices.
for tup in zip(tokenized_text, indexed_tokens):
    print('{:<12} {:>6,}'.format(tup[0], tup[1]))
[CLS]           101
after         2,044
stealing     11,065
money         2,769
from          2,013
the           1,996
bank          2,924
vault        11,632
,             1,010
the           1,996
bank          2,924
robber       27,307
was           2,001
seen          2,464
fishing       5,645
on            2,006
the           1,996
mississippi   5,900
river         2,314
bank          2,924
.             1,012
[SEP]           102

2.3 Segment ID


# Mark each of the 22 tokens as belonging to sentence "1".
segments_ids = [1] * len(tokenized_text)
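
For a sentence pair, the usual BERT convention is 0 for every token of the first sentence (up to and including the first [SEP]) and 1 for every token of the second. The sketch below shows that rule in plain Python. The helper `make_segment_ids` is hypothetical, and the token list is split by hand for illustration rather than produced by the tokenizer.

```python
def make_segment_ids(tokens, sep_token="[SEP]"):
    """Assign 0 to tokens up to and including the first [SEP], and 1 to the rest."""
    segment_ids = []
    current = 0
    for tok in tokens:
        segment_ids.append(current)
        if tok == sep_token:
            current = 1  # everything after the first separator is sentence B
    return segment_ids


# Hand-tokenized sentence pair, for illustration only.
pair_tokens = ["[CLS]", "the", "man", "went", "to", "the", "store", ".", "[SEP]",
               "he", "bought", "a", "gallon", "of", "milk", ".", "[SEP]"]

segment_ids = make_segment_ids(pair_tokens)
print(segment_ids)
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
```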

3. Extracting Embeddings

3.1 Running BERT on our text

Next, we need to convert the data to PyTorch tensors.

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])


model.eval() puts the model in evaluation mode rather than training mode. In evaluation mode, the model turns off dropout regularization.

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )
# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()
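
To see why evaluation mode matters, here is a toy sketch of dropout in plain Python. During training, each value is zeroed with probability `p` and the survivors are scaled by 1 / (1 - p); in evaluation mode the values pass through unchanged, so repeated runs give the same embeddings. The function `dropout` and its parameters are hypothetical and only illustrate the idea; they are not PyTorch's implementation.

```python
import random


def dropout(values, p, training, rng=random):
    """Inverted dropout on a list of floats; a no-op unless `training` is True.

    Assumes 0 <= p < 1.
    """
    if not training or p == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - p)
    # Each value is dropped with probability p; kept values are scaled up.
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


activations = [1.0, 2.0, 3.0, 4.0]

# Evaluation mode: output equals input, so results are deterministic.
print(dropout(activations, p=0.5, training=False))  # [1.0, 2.0, 3.0, 4.0]

# Training mode: some values are zeroed at random, the rest are doubled (p=0.5).
print(dropout(activations, p=0.5, training=True, rng=random.Random(0)))
```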



# Run the text through BERT, and collect all of the hidden states produced
# from all 12 layers. 
with torch.no_grad():
    outputs = model(tokens_tensor, segments_tensors)
    # Evaluating the model will return a different number of objects based on
    # how it's configured in the `from_pretrained` call earlier. In this case,
    # because we set `output_hidden_states = True`, the third item will be the
    # hidden states from all layers. See the documentation for more details:
    # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
    hidden_states = outputs[2]

3.2 Understanding the Output


  1. The Layer number(13 layers)
  2. The batch number(1 sentence)
  3. The word / token number(22 tokens in our sentence)
  4. The hidden unit / feature number(768 features)

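Wait a moment, 13 layers? Didn't we say BERT has only 12? The first of the 13 is the word embedding layer, and the remaining 12 are the encoder layers.

The second dimension (batch size) is used when submitting several sentences to the model at once; here, though, we have just one sentence.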

print ("Number of layers:", len(hidden_states), " (initial embeddings + 12 BERT layers)")
layer_i = 0
print ("Number of batches:", len(hidden_states[layer_i]))
batch_i = 0
print ("Number of tokens:", len(hidden_states[layer_i][batch_i]))
token_i = 0
print ("Number of hidden units:", len(hidden_states[layer_i][batch_i][token_i]))
Number of layers: 13   (initial embeddings + 12 BERT layers)
Number of batches: 1
Number of tokens: 22
Number of hidden units: 768

A quick look at the range of values for a given token and layer shows that most of them fall within [-2, 2], with a few around -12.

# For the 5th token in our sentence, select its feature values from layer 5.
token_i = 5
layer_i = 5
vec = hidden_states[layer_i][batch_i][token_i]
# Plot the values as a histogram to show their distribution.
plt.hist(vec, bins=200)


Current dimensions: [layers, batches, tokens, features]

Desired dimensions: [tokens, layers, features]


# Concatenate the tensors for all layers. We use `stack` here to
# create a new dimension in the tensor.
token_embeddings = torch.stack(hidden_states, dim=0)
token_embeddings.size()
# torch.Size([13, 1, 22, 768])


# Remove dimension 1, the "batches".
token_embeddings = token_embeddings.squeeze(dim=1)
token_embeddings.size()
# torch.Size([13, 22, 768])


# Swap dimensions 0 and 1.
token_embeddings = token_embeddings.permute(1,0,2)
token_embeddings.size()
# torch.Size([22, 13, 768])
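
The same reordering can be pictured with nested Python lists in place of tensors. The sketch below is only an illustration with toy sizes (4 layers, 3 tokens, 2 features) and made-up values; it uses `zip` where the code above uses `permute`.

```python
# Toy data indexed as layers[layer][token] -> feature vector.
# The first feature encodes (layer, token) so the reordering is easy to check.
n_layers, n_tokens = 4, 3
layers = [[[layer * 10 + tok, 0.5] for tok in range(n_tokens)]
          for layer in range(n_layers)]

# Group by token instead of by layer: per_token[token][layer] -> feature vector.
per_token = [list(token_layers) for token_layers in zip(*layers)]

print(len(layers), len(layers[0]), len(layers[0][0]))           # 4 3 2
print(len(per_token), len(per_token[0]), len(per_token[0][0]))  # 3 4 2
print(per_token[2][1] == layers[1][2])                          # True
```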

3.3 Creating word and sentence vectors from hidden states


Word Vectors


# Stores the token vectors, with shape [22 x 3,072]
token_vecs_cat = []
# `token_embeddings` is a [22 x 13 x 768] tensor.
# For each token in the sentence...
for token in token_embeddings:
    # `token` is a [13 x 768] tensor
    # Concatenate the vectors (that is, append them together) from
    # the last four layers.
    # Each layer vector is 768 values, so `cat_vec` is length 3072.
    cat_vec = torch.cat((token[-1], token[-2], token[-3], token[-4]), dim=0)
    # Use `cat_vec` to represent `token`.
    token_vecs_cat.append(cat_vec)
print ('Shape is: %d x %d' % (len(token_vecs_cat), len(token_vecs_cat[0])))
# Shape is: 22 x 3072


# Stores the token vectors, with shape [22 x 768]
token_vecs_sum = []
# `token_embeddings` is a [22 x 13 x 768] tensor.
# For each token in the sentence...
for token in token_embeddings:
    # `token` is a [13 x 768] tensor
    # Sum the vectors from the last four layers.
    sum_vec = torch.sum(token[-4:], dim=0)
    # Use `sum_vec` to represent `token`.
    token_vecs_sum.append(sum_vec)
print ('Shape is: %d x %d' % (len(token_vecs_sum), len(token_vecs_sum[0])))
# Shape is: 22 x 768

Sentence Vectors


# `hidden_states` has shape [13 x 1 x 22 x 768]
# `token_vecs` is a tensor with shape [22 x 768]
token_vecs = hidden_states[-2][0]
# Calculate the average of all 22 token vectors.
sentence_embedding = torch.mean(token_vecs, dim=0)
print("Our final sentence embedding vector of shape:", sentence_embedding.size())
# Our final sentence embedding vector of shape: torch.Size([768])

3.4 Confirming contextually dependent vectors


“After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank.”

for i, token_str in enumerate(tokenized_text):
    print(i, token_str)
0 [CLS]
1 after
2 stealing
3 money
4 from
5 the
6 bank
7 vault
8 ,
9 the
10 bank
11 robber
12 was
13 seen
14 fishing
15 on
16 the
17 mississippi
18 river
19 bank
20 .
21 [SEP]


print('First 5 vector values for each instance of "bank".')
print("bank vault ", str(token_vecs_sum[6][:5]))
print("bank robber ", str(token_vecs_sum[10][:5]))
print("river bank ", str(token_vecs_sum[19][:5]))
First 5 vector values for each instance of "bank".
bank vault    tensor([ 3.3596, -2.9805, -1.5421,  0.7065,  ...])
bank robber   tensor([ 2.7359, -2.5577, -1.3094,  0.6797,  ...])
river bank    tensor([ 1.5266, -0.8895, -0.5152, -0.9298,  ...])


from scipy.spatial.distance import cosine
# Calculate the cosine similarity between the word bank
# in "bank robber" vs "bank vault" (same meaning).
same_bank = 1 - cosine(token_vecs_sum[10], token_vecs_sum[6])
# Calculate the cosine similarity between the word bank
# in "bank robber" vs "river bank" (different meanings).
diff_bank = 1 - cosine(token_vecs_sum[10], token_vecs_sum[19])
print('Vector similarity for *similar* meanings: %.2f' % same_bank) # 0.94
print('Vector similarity for *different* meanings: %.2f' % diff_bank) # 0.69
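
SciPy's `cosine` returns the cosine distance, which is why the code above subtracts it from 1 to get a similarity. For reference, the quantity being computed is the dot product of the two vectors divided by the product of their lengths. Below is a small plain-Python sketch of that formula; the function name `cosine_similarity` and the toy vectors are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]    # same direction as a
c = [3.0, 0.0, -1.0]   # orthogonal to a (dot product is 0)

print(round(cosine_similarity(a, b), 6))  # 1.0
print(round(cosine_similarity(a, c), 6))  # 0.0
```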

3.5 Pooling Strategy & Layer Choice

BERT Authors



Han Xiao’s BERT-as-service



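  1. The first layer is the embedding layer. Since it carries no contextual information, the same word gets the same vector in different contexts
  2. As you move deeper into the network, the word embeddings pick up more and more contextual information with each layer
  3. As you approach the final layer, however, the word embeddings start to pick up information specific to BERT's pre-training tasks (MLM and NSP)
  4. The word embeddings from the second-to-last layer are a reasonable choice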
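
The layer choices discussed in this article (second-to-last layer, sum of the last four layers, concatenation of the last four layers, and a sentence vector averaged over tokens) can be compared side by side on toy data. The sketch below uses nested Python lists with made-up values (2 tokens, 13 layers, 2 features) rather than real BERT output, and the helper names are hypothetical.

```python
def sum_last_four(token_layers):
    """Element-wise sum of the last four layer vectors of one token."""
    return [sum(values) for values in zip(*token_layers[-4:])]


def concat_last_four(token_layers):
    """Concatenate the last four layer vectors of one token, last layer first."""
    return [x for layer in reversed(token_layers[-4:]) for x in layer]


def mean_over_tokens(token_vectors):
    """Average equal-length token vectors into a single sentence vector."""
    n = len(token_vectors)
    return [sum(values) / n for values in zip(*token_vectors)]


# Toy input indexed as [token][layer][feature]: layer l holds [l, 2 * l],
# shifted by 100 for the second token.
token_embeddings = [
    [[float(l + offset), float(2 * l + offset)] for l in range(13)]
    for offset in (0, 100)
]

first = token_embeddings[0]
print(first[-2])                # second-to-last layer: [11.0, 22.0]
print(sum_last_four(first))     # [42.0, 84.0]
print(concat_last_four(first))  # [12.0, 24.0, 11.0, 22.0, 10.0, 20.0, 9.0, 18.0]

# Sentence vector: average the second-to-last layer over all tokens.
second_to_last = [tok[-2] for tok in token_embeddings]
print(mean_over_tokens(second_to_last))  # [61.0, 72.0]
```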