Model Architecture

[Figure 1: the Transformer model architecture]

Inputs: a paragraph of English consisting of B (i.e. batch_size) sentences, where each sentence has at most N (i.e. seq_length) words.

Outputs: a paragraph of Chinese translated from the Inputs, with shape (B, N).

Encoder output: a feature matrix containing positional, contextual, and semantic information.

Decoder: auto-regressive, consuming the previously generated symbols as additional input when generating the next one.

For example:

Inputs: I love you. (B=1)

  1. Learn features from the Inputs (Encoder)
  2. Send the features to the Decoder
  3. Generate the Chinese '我爱你' word by word (see the decoding sketch below)
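
A minimal greedy-decoding sketch of this encode-then-decode loop in PyTorch. The `model.encode` / `model.decode` methods and the `bos_id` / `eos_id` token ids are hypothetical placeholders, not part of the original description.

```python
import torch

def greedy_translate(model, src_ids, bos_id, eos_id, max_len=50):
    memory = model.encode(src_ids)                 # (B, N, d_model) encoder features
    ys = torch.full((src_ids.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = model.decode(ys, memory)          # (B, T, vocab) next-token scores
        next_tok = logits[:, -1, :].argmax(-1, keepdim=True)
        ys = torch.cat([ys, next_tok], dim=1)      # append the newly generated word
        if (next_tok == eos_id).all():             # stop once every sentence ends
            break
    return ys
```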

Training: given the Inputs (fed to the Encoder) and the Labels (target Outputs), the loss function is the cross-entropy between the predicted probability vectors and the ground truth.
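
A minimal training-step sketch of this cross-entropy objective, assuming a PyTorch model that maps (source ids, shifted target ids) to per-position vocabulary logits. The padding id 0 and the teacher-forcing shift are illustrative assumptions.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=0)      # 0 assumed to be the padding id

def training_step(model, src_ids, tgt_ids, optimizer):
    # Teacher forcing: the decoder sees the target shifted right by one position.
    logits = model(src_ids, tgt_ids[:, :-1])          # (B, N-1, vocab_size)
    loss = criterion(logits.reshape(-1, logits.size(-1)),
                     tgt_ids[:, 1:].reshape(-1))      # cross-entropy vs. ground truth
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```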

Attention Mechanism

[Figure 2: scaled dot-product attention and multi-head attention]

$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

$$\begin{aligned}\mathrm{MultiHead}(Q,K,V)&=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)W^O\\\mathrm{where~head}_i&=\text{Attention}(QW_i^Q,KW_i^K,VW_i^V)\end{aligned}$$

attention weights between query and key: $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$

modified values (weighted sum): $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
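
The formulas above translate fairly directly into code. Below is a sketch (not the reference implementation) of scaled dot-product attention and a multi-head wrapper in PyTorch; d_model = 512 and h = 8 are the defaults from the original paper.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # similarity between queries and keys
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                   # attention weights
    return weights @ v                                    # weighted sum of the values

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # W^Q, W^K, W^V for all heads at once, plus the output projection W^O
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        B = q.size(0)
        # Project and split into h heads: (B, N, d_model) -> (B, h, N, d_k)
        def split(x, w):
            return w(x).view(B, -1, self.h, self.d_k).transpose(1, 2)
        heads = scaled_dot_product_attention(split(q, self.w_q),
                                             split(k, self.w_k),
                                             split(v, self.w_v), mask)
        # Concat(head_1, ..., head_h), then apply W^O
        concat = heads.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.w_o(concat)
```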

  • Add & Norm: residual connection and Layer Norm (sketched below together with the FFN)
  • FFN: Linear layer, ReLU activation, Linear layer
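
A sketch of the position-wise FFN wrapped in the Add & Norm pattern (residual connection followed by LayerNorm); d_ff = 2048 and the dropout rate are assumed defaults for illustration.

```python
import torch
import torch.nn as nn

class FeedForwardBlock(nn.Module):
    """FFN (Linear -> ReLU -> Linear) followed by Add & Norm."""
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Add & Norm: residual connection, then Layer Norm
        return self.norm(x + self.dropout(self.ffn(x)))
```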

Embedding

Positional Embedding:

$$\begin{aligned}PE_{(pos,2i)}&=\sin\left(pos/10000^{2i/d_{\mathrm{model}}}\right)\\PE_{(pos,2i+1)}&=\cos\left(pos/10000^{2i/d_{\mathrm{model}}}\right)\end{aligned}$$

The Transformer cannot extract position information without the Positional Encoding, because self-attention itself is order-invariant.
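
A sketch of building the sinusoidal positional encoding table from the formula above; `max_len` and the usage line are illustrative assumptions, and an even d_model is assumed.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = torch.arange(max_len).unsqueeze(1)                  # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)                        # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)                        # odd dimensions
    return pe                                                 # added to the word embeddings

# Usage (hypothetical): x = word_embedding + sinusoidal_positional_encoding(N, d_model)[:x.size(1)]
```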