Model Architecture

[Figure 1: the Transformer model architecture]

Inputs: a paragraph of English consisting of B (i.e. batch_size) sentences, where each sentence has at most N (i.e. seq_length) words.

Outputs: a paragraph of Chinese translated from the Inputs, with shape (B, N).

Encoder output: a feature matrix containing positional, contextual, and semantic information.

Decoder: auto-regressive, consuming the previously generated symbols as additional input when generating the next one.

For example:

Inputs: I love you. (B=1)

  1. Learn features from the Inputs (Encoder)
  2. Send the features to the Decoder
  3. Generate the Chinese '我爱你' word by word (see the decoding sketch below)
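
A minimal greedy-decoding sketch of this encode-then-decode loop in PyTorch. The `model.encode` / `model.decode` methods and the `bos_id` / `eos_id` token ids are hypothetical placeholders, not part of the original description.

```python
import torch

def greedy_translate(model, src_ids, bos_id, eos_id, max_len=50):
    memory = model.encode(src_ids)                 # (B, N, d_model) encoder features
    ys = torch.full((src_ids.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = model.decode(ys, memory)          # (B, T, vocab) next-token scores
        next_tok = logits[:, -1, :].argmax(-1, keepdim=True)
        ys = torch.cat([ys, next_tok], dim=1)      # append the newly generated word
        if (next_tok == eos_id).all():             # stop once every sentence ends
            break
    return ys
```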

Training: given the Inputs (fed to the Encoder) and the Labels (target Outputs), the loss function is the cross-entropy between the predicted probability vectors and the ground truth.
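
A minimal training-step sketch of this cross-entropy objective, assuming a PyTorch model that maps (source ids, shifted target ids) to per-position vocabulary logits. The padding id 0 and the teacher-forcing shift are illustrative assumptions.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=0)      # 0 assumed to be the padding id

def training_step(model, src_ids, tgt_ids, optimizer):
    # Teacher forcing: the decoder sees the target shifted right by one position.
    logits = model(src_ids, tgt_ids[:, :-1])          # (B, N-1, vocab_size)
    loss = criterion(logits.reshape(-1, logits.size(-1)),
                     tgt_ids[:, 1:].reshape(-1))      # cross-entropy vs. ground truth
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```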

Attention Mechanism

[Figure 2: scaled dot-product attention and multi-head attention]

$$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

$$\begin{aligned}\mathrm{MultiHead}(Q,K,V)&=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)W^O\\\mathrm{where~head}_i&=\text{Attention}(QW_i^Q,KW_i^K,VW_i^V)\end{aligned}$$

attention weights between query and key: $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$

modified values (weighted sum): $\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
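
The formulas above translate fairly directly into code. Below is a sketch (not the reference implementation) of scaled dot-product attention and a multi-head wrapper in PyTorch; d_model = 512 and h = 8 are the defaults from the original paper.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # similarity between queries and keys
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                   # attention weights
    return weights @ v                                    # weighted sum of the values

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        # W^Q, W^K, W^V for all heads at once, plus the output projection W^O
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        B = q.size(0)
        # Project and split into h heads: (B, N, d_model) -> (B, h, N, d_k)
        def split(x, w):
            return w(x).view(B, -1, self.h, self.d_k).transpose(1, 2)
        heads = scaled_dot_product_attention(split(q, self.w_q),
                                             split(k, self.w_k),
                                             split(v, self.w_v), mask)
        # Concat(head_1, ..., head_h), then apply W^O
        concat = heads.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.w_o(concat)
```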

  • Add & Norm: residual connection and Layer Norm (sketched below together with the FFN)
  • FFN: Linear layer, ReLU activation, Linear layer
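
A sketch of the position-wise FFN wrapped in the Add & Norm pattern (residual connection followed by LayerNorm); d_ff = 2048 and the dropout rate are assumed defaults for illustration.

```python
import torch
import torch.nn as nn

class FeedForwardBlock(nn.Module):
    """FFN (Linear -> ReLU -> Linear) followed by Add & Norm."""
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Add & Norm: residual connection, then Layer Norm
        return self.norm(x + self.dropout(self.ffn(x)))
```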

Embedding

Positional Embedding:

$$\begin{aligned}PE_{(pos,2i)}&=\sin\left(pos/10000^{2i/d_{\mathrm{model}}}\right)\\PE_{(pos,2i+1)}&=\cos\left(pos/10000^{2i/d_{\mathrm{model}}}\right)\end{aligned}$$

The Transformer cannot extract position information without the Positional Encoding, because self-attention itself is order-invariant.
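
A sketch of building the sinusoidal positional encoding table from the formula above; `max_len` and the usage line are illustrative assumptions, and an even d_model is assumed.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = torch.arange(max_len).unsqueeze(1)                  # (max_len, 1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)                        # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)                        # odd dimensions
    return pe                                                 # added to the word embeddings

# Usage (hypothetical): x = word_embedding + sinusoidal_positional_encoding(N, d_model)[:x.size(1)]
```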