BERT

September 3, 2025

The introduction of the Transformer architecture in 2017 was a pivotal moment in NLP that enabled the widespread adoption of large language models (LLMs) like ChatGPT. These models rely on word embeddings: high-dimensional vector representations of words that capture their semantic meaning. Probably one of the most famous ways of generating these word embeddings is with a Transformer-based model called BERT, or Bidirectional Encoder Representations from Transformers.

This article gives an overview of the structure of BERT and how it is trained to generate these embeddings. I will assume familiarity with the Transformer architecture and its attention mechanism.

Input Tokenization

The first step in training is input tokenization, which is the process of breaking sequences of words into smaller, more meaningful tokens. For example, a sentence like

"Do you like Piña Coladas?"

could be tokenized as

["do", "you", "like", "pina", "coladas", "?"].

BERT uses a tokenization algorithm called WordPiece, as described by HuggingFace. WordPiece identifies subwords: for example, "hospitalization" would be tokenized as ["hospital", "##ization"], where the "##" prefix indicates a continuation of the previous word.

Every sequence also starts with a "[CLS]" token, whose final hidden state serves as an aggregate representation of the whole sequence for classification tasks. For example, the output embedding of "[CLS]" is what gets fed to a classifier for sentiment analysis or for next sentence prediction.

There are other special tokens as well: sentences are separated by a "[SEP]" token, unknown words are replaced by "[UNK]", and a "[MASK]" token is used when training BERT to predict masked tokens.

WordPiece starts by splitting words into characters, marking continuation characters with the "##" prefix, e.g. "hello" becomes ["h", "##e", "##l", "##l", "##o"]. It then learns merge rules to combine subwords into larger tokens. WordPiece computes a score for each pair to be merged using the following formula.

$$\mathrm{score} = \frac{\mathrm{freq\_of\_pair}}{\mathrm{freq\_of\_first\_elem} \times \mathrm{freq\_of\_second\_elem}}$$

This means that pairs that appear almost exclusively together in the corpus are merged faster than other pairs. In particular, even if "un" and "##able" appear very frequently together in the corpus, they might not be merged, since "un" also appears very frequently with other fragments. On the other hand, pairs like "stan" and "##ford" appear less frequently individually and hence are more likely to be merged.

The pair with the highest score is merged and added to the vocabulary, and this process repeats until the vocabulary reaches the desired size. Words are then tokenized by finding the largest subword that is in the vocabulary, splitting on it, and repeating with the remainder of the word until the whole word is tokenized. For example, if our word is "students" and we have "student" and "##s" in our vocabulary (but not "students"), we would tokenize it as ["student", "##s"].
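
To make this concrete, here is a minimal sketch of the greedy longest-match-first lookup described above, using a tiny made-up vocabulary (real BERT vocabularies have roughly 30k entries):

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first tokenization of a single word (sketch)."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        cur_token = None
        # Find the longest substring (prefixed with "##" if not word-initial) in the vocab.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                cur_token = piece
                break
            end -= 1
        if cur_token is None:
            return [unk_token]  # no subword matches: the whole word becomes [UNK]
        tokens.append(cur_token)
        start = end
    return tokens

# Toy vocabulary, for illustration only.
vocab = {"student", "##s", "hospital", "##ization"}
print(wordpiece_tokenize("students", vocab))         # ['student', '##s']
print(wordpiece_tokenize("hospitalization", vocab))  # ['hospital', '##ization']
```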

The tokens are then encoded into their IDs, e.g. "student" -> 1034. BERT has a vocabulary size of 30522, so each token in the vocab has an ID between 0 and 30521.
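
In practice, one rarely implements WordPiece by hand. As an illustration, the pretrained BERT tokenizer from the HuggingFace transformers library performs tokenization and ID conversion directly (the exact tokens and IDs depend on the checkpoint):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("Do you like Piña Coladas?")
ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)  # e.g. ['do', 'you', 'like', 'pina', 'coladas', '?']
print(ids)     # the corresponding vocabulary IDs
```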

Embedding Layer

To embed tokens into dense, high-dimensional vectors, we start by initializing three embedding matrices. The original BERT paper used a hidden size of 768, meaning tokens are embedded as vectors of this dimension. It also used a maximum sequence length of 512 tokens; shorter sequences are padded using the "[PAD]" token.

When training, each sequence consists of two sentences $A$ and $B$ separated by a "[SEP]" token. BERT is trained to determine whether sentence $B$ logically follows sentence $A$, a task called next sentence prediction (NSP). Suppose we have a sequence of tokens for sentence $A$ and sentence $B$.

["[CLS]", "a_1", ..., "a_n", "[SEP]", "b_1", ..., "b_m", "[SEP]"]

Three trainable embedding vectors are looked up for each token in the sequence.

  1. Token embedding $\mathbf{e}_{\mathrm{token},t}$ - taken from a trainable (30522, 768) matrix that maps the token ID to its hidden vector.
  2. Positional embedding $\mathbf{e}_{\mathrm{position},t}$ - taken from a trainable (512, 768) matrix which captures token order in the sequence.
  3. Segment embedding $\mathbf{e}_{\mathrm{segment},t}$ - taken from a trainable (2, 768) matrix that indicates which sentence a token belongs to in next sentence prediction, i.e. $A$ or $B$.

Then the input embedding for token $t$ in the sequence is given by the sum of the above embeddings.

$$\mathbf{h}^{(0)}_t = \mathbf{e}_{\mathrm{token},t} + \mathbf{e}_{\mathrm{position},t} + \mathbf{e}_{\mathrm{segment},t}$$

The input embeddings are then passed to the first encoder layer of BERT.
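
A minimal PyTorch sketch of such an embedding layer is shown below; the layer norm and dropout after the sum mirror common implementations, and the class and argument names are my own:

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    """Sum of token, position, and segment embeddings (sketch)."""
    def __init__(self, vocab_size=30522, max_len=512, hidden=768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)    # (30522, 768)
        self.position = nn.Embedding(max_len, hidden)    # (512, 768)
        self.segment = nn.Embedding(2, hidden)           # (2, 768)
        self.norm = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(0.1)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        h0 = self.token(token_ids) + self.position(positions) + self.segment(segment_ids)
        return self.dropout(self.norm(h0))
```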

Transformer Encoder Stack

BERT is a stack of Transformer encoder blocks (not decoders). Each encoder block contains a multi-head self-attention mechanism followed by a feed-forward network. The input to each block is the sequence of 512 hidden states from the previous block, so the first block takes $\mathbf{h}^{(0)}$ as input and outputs $\mathbf{h}^{(1)}$ of the same size, which is the input to the second block, and so forth.

For each attention head we have trainable query, key, and value matrices $W_Q, W_K, W_V$ which compute attention as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q = \mathbf{h}^{(\ell)} W_Q$, $K = \mathbf{h}^{(\ell)} W_K$, and $V = \mathbf{h}^{(\ell)} W_V$.

We then take a residual connection and layer norm: the output of the attention mechanism (with all heads concatenated and projected back to the hidden size) is added to the input and then normalized:

$$x = \mathrm{LayerNorm}\left(\mathbf{h}^{(\ell)} + \mathrm{Attention}(\mathbf{h}^{(\ell)})\right)$$

Then each position is passed independently through a 2-layer feedforward neural network.

$$\mathrm{FFN}(x_t) = \mathrm{GELU}(x_t W_1 + b_1) W_2 + b_2$$

Finally, another residual connection and layer norm are applied to get the output of the encoder block.

$$\mathbf{h}^{(\ell+1)} = \mathrm{LayerNorm}\left(x + \mathrm{FFN}(x)\right)$$
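
Putting these pieces together, a single encoder block can be sketched in PyTorch roughly as follows. This uses nn.MultiheadAttention for brevity and assumes the BERT-base sizes (hidden size 768, 12 heads, feed-forward size 3072); the naming is my own:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """A single BERT-style Transformer encoder block (sketch)."""
    def __init__(self, hidden=768, heads=12, ffn_dim=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden, heads, dropout=dropout, batch_first=True)
        self.ln1 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(
            nn.Linear(hidden, ffn_dim),
            nn.GELU(),
            nn.Linear(ffn_dim, hidden),
        )
        self.ln2 = nn.LayerNorm(hidden)
        self.drop = nn.Dropout(dropout)

    def forward(self, h, pad_mask=None):
        # Multi-head self-attention, then residual connection + layer norm.
        attn_out, _ = self.attn(h, h, h, key_padding_mask=pad_mask)
        x = self.ln1(h + self.drop(attn_out))
        # Position-wise feed-forward network, then residual connection + layer norm.
        return self.ln2(x + self.drop(self.ffn(x)))
```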

The original BERT paper trained two models, BERT-base and BERT-large which used 12 and 24 encoder blocks respectively (BERT-large also used a hidden size of 1024 instead of 768).

The output of the final block is referred to as the output embeddings. For the sake of convenience we assume 12 encoder blocks, so the output embeddings are $\mathbf{h}^{(12)}$.

Lastly, BERT has a prediction head which maps each output embedding back to a distribution over the vocabulary. The head applies a single GELU-activated linear layer followed by a layer norm.

$$\mathbf{z}_t = \mathrm{LayerNorm}\left(\mathrm{GELU}(\mathbf{h}^{(12)}_t W_1 + b_1)\right)$$

Then we project back to the vocabulary using the transpose of the (30522, 768) token embedding matrix $E_{\mathrm{token}}$ from the embedding layer, a technique referred to as weight tying, and take a softmax to get a probability distribution over the vocabulary.

$$\hat{y}_t = \mathrm{softmax}\left(\mathbf{z}_t E_{\mathrm{token}}^{\top} + b\right)$$
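
A sketch of this prediction head, with its output projection tied to the token embedding matrix, might look like the following (class and variable names are illustrative):

```python
import torch.nn as nn

class MLMHead(nn.Module):
    """Maps output embeddings to vocabulary logits, tying the projection
    weights to the token embedding matrix (sketch)."""
    def __init__(self, token_embedding: nn.Embedding, hidden=768):
        super().__init__()
        self.dense = nn.Linear(hidden, hidden)
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(hidden)
        # Weight tying: reuse the (30522, 768) token embedding matrix.
        self.decoder = nn.Linear(hidden, token_embedding.num_embeddings, bias=True)
        self.decoder.weight = token_embedding.weight

    def forward(self, h_final):
        z = self.norm(self.act(self.dense(h_final)))
        return self.decoder(z)  # vocabulary logits; softmax is applied in the loss
```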

Pre-training

BERT is initially trained on unlabeled pairs of sentences over two prediction tasks: masked language modeling (MLM) and next sentence prediction (NSP).

In masked language modeling, 15% of the tokens in a sequence are selected for prediction, and the model is trained to predict them from both left and right context, which is where the bidirectionality of the model comes from. The selected tokens are replaced according to the following rules.

  1. 80% of the selected tokens are replaced by "[MASK]".
  2. 10% are replaced by a random token.
  3. 10% are left unchanged.

The reason we don't replace every selected token with "[MASK]" is to reduce the mismatch between pre-training and fine-tuning, since the "[MASK]" token never appears in downstream tasks.
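
A rough sketch of this 80/10/10 masking scheme is shown below; it ignores details such as never masking special tokens like "[CLS]" and "[SEP]", and the helper name is my own:

```python
import torch

def mask_tokens(input_ids, mask_id, vocab_size, mlm_prob=0.15):
    """Apply the 80/10/10 MLM masking scheme to a batch of token IDs (sketch)."""
    labels = input_ids.clone()
    # Select 15% of positions for prediction.
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100  # ignored by PyTorch's cross-entropy loss

    # 80% of selected positions -> [MASK]
    mask_positions = selected & (torch.rand(input_ids.shape) < 0.8)
    input_ids[mask_positions] = mask_id

    # 10% of selected positions -> a random token (half of the remaining 20%)
    random_positions = selected & ~mask_positions & (torch.rand(input_ids.shape) < 0.5)
    input_ids[random_positions] = torch.randint(vocab_size, input_ids.shape)[random_positions]

    # The remaining 10% are left unchanged.
    return input_ids, labels
```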

The sequence is then fed to BERT, which generates output embeddings for each input token. The output embeddings corresponding to the selected tokens are passed through the prediction head, cross-entropy loss is computed by comparing the predicted distribution with the true token, and the model weights are updated via backpropagation.

In NSP, pairs of sentences are sampled from the corpus such that 50% of the time sentence $B$ is the sentence that actually follows sentence $A$, and 50% of the time it is a random sentence from elsewhere in the corpus. In this task, we utilize the output embedding of the "[CLS]" token at the beginning of the sequence to classify whether or not $B$ follows $A$.

For this task, we use a separate prediction head which outputs a binary probability distribution, and cross-entropy loss is again used for backpropagation. Both NSP and MLM are trained in tandem during the pre-training phase, with the total pre-training loss being the sum of the two losses.
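
The NSP head is essentially a binary classifier on the output embedding of "[CLS]" (the first position). A minimal sketch (the original implementation also applies a tanh "pooler" layer to the "[CLS]" embedding first, omitted here for brevity):

```python
import torch.nn as nn

class NSPHead(nn.Module):
    """Binary classifier over the [CLS] output embedding (sketch)."""
    def __init__(self, hidden=768):
        super().__init__()
        self.classifier = nn.Linear(hidden, 2)  # IsNext vs. NotNext

    def forward(self, h_final):
        cls_embedding = h_final[:, 0]           # [CLS] is always the first token
        return self.classifier(cls_embedding)   # logits for the two classes
```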

Fine-tuning

After pre-training, BERT is capable of generating word embeddings that capture the contextual semantic meaning of the words. The model can then be fine-tuned for specific downstream tasks.

For example, one can train BERT to solve sentence-pair classification, which involves understanding the relationship between a pair of sentences. Another popular task is question answering, where the goal is to find the answer to a given question within a paragraph of text. The wide range of downstream tasks BERT is capable of solving makes it a landmark model in the world of NLP.
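
As an illustration, a typical fine-tuning setup for a two-class sentence-pair task using the HuggingFace transformers library might look like the following sketch (the sentences, label, and training details are placeholders):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Passing two sentences makes the tokenizer insert [CLS] and [SEP]
# and set the segment IDs for us.
inputs = tokenizer("The cat sat on the mat.", "It was comfortable.", return_tensors="pt")
labels = torch.tensor([1])  # placeholder label

outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # then step an optimizer such as AdamW
```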