How to Use Qwen

Reference: Qwen-blog
Overall architecture of Qwen:

[Figure: framework]
In the figure:

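The tokenizer converts text into integer ids from the vocabulary.
Each id is passed through the embedding layer and mapped one-to-one to a vector.
attention_mask controls which positions a token can see: only the left, only the right, both directions, and so on.
The downstream task heads (causal LM, sequence classification, and so on) are basically the base model followed by a task-specific Linear layer; they also differ in the loss function.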
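The first three points can be sketched in plain Python. This is only a toy illustration: toy_vocab, embedding_table, causal_mask and bidirectional_mask are names made up here, not part of transformers, and the real library works with tensors rather than lists.

```python
# Toy vocabulary and embedding table (hypothetical values, for illustration only).
toy_vocab = {"hello": 0, "world": 1}
embedding_table = [[0.1, 0.2], [0.3, 0.4]]  # one row per token id


def tokenize(text):
    """Map each whitespace-separated word to its id in the toy vocabulary."""
    return [toy_vocab[word] for word in text.split()]


def embed(token_ids):
    """Look up one vector per token id."""
    return [embedding_table[i] for i in token_ids]


def causal_mask(seq_len):
    """Row i marks what token i may attend to: itself and everything to its left."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def bidirectional_mask(seq_len):
    """Every token may attend to every position."""
    return [[1] * seq_len for _ in range(seq_len)]


if __name__ == "__main__":
    ids = tokenize("hello world")
    print(ids, embed(ids))
    for row in causal_mask(3):
        print(row)
```

A causal mask is lower triangular, which is what a decoder-only model such as Qwen2 relies on for next-token prediction.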

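First, download the transformers repository from GitHub and rename it to transformers_213 (I downloaded it on 2/13). The rename keeps it from being confused with the installed transformers package.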

Create run_demo.ipynb to walk through qwen2 in detail.

from transformers_213.src.transformers.models.qwen2 import Qwen2Config, Qwen2Model
import torch

def run_qwen2():
    qwen2config = Qwen2Config(
        vocab_size=151936,
        hidden_size=4096//2,
        num_hidden_layers=32//2,
        num_attention_heads=32,
        intermediate_size=2048//2
    )
    qwen2model = Qwen2Model(qwen2config)
    input_ids = torch.randint(0, qwen2config.vocab_size, (4, 30))
    res = qwen2model(input_ids)
    print(res)

if __name__ == "__main__":
    run_qwen2()

1 Qwen2Config

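Qwen2Config holds the configurable hyperparameters, such as vocab_size, hidden_size, num_hidden_layers and num_attention_heads. Much like looking up a key in a dict, you can read any of them as an attribute, for example config.pad_token_id.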

1.1 Qwen2Model

1.1.1 Initialization

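Sets two attributes on the model: padding_idx (the index of the padding token) and vocab_size (the size of the vocabulary).
Initializes the model's embedding layer, decoder layers, and normalization layer:
Embedding layer (nn.Embedding): maps the input tokens to dense vector representations.
Decoder layers (nn.ModuleList()): the model contains several decoder layers, each defined by Qwen2DecoderLayer.
Normalization layer (Qwen2RMSNorm): uses Root Mean Square Layer Normalization.
Sets the gradient_checkpointing flag (False by default); gradient checkpointing is mainly used to save GPU memory.
Calls post_init() to run the remaining initialization and setup checks.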

class Qwen2Model(Qwen2PreTrainedModel):
    def __init__(self, config: Qwen2Config):
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size

        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
        self.layers = nn.ModuleList(
            [Qwen2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )
        self.norm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)

        self.gradient_checkpointing = False
        # Initialize weights and apply final processing
        self.post_init()
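Since the final normalization layer is an RMSNorm, a small plain-Python sketch of the computation may help. The function rms_norm below is written for this post and is not the library's Qwen2RMSNorm class (which operates on tensors); it only follows the same formula: divide by the root mean square of the vector, then multiply by a learned per-dimension weight. Unlike standard LayerNorm, no mean is subtracted.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a per-dimension weight."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


if __name__ == "__main__":
    hidden = [3.0, 4.0]
    # With unit weights, the output has a root mean square of roughly 1.
    print(rms_norm(hidden, weight=[1.0, 1.0]))
```

The eps term plays the same role as config.rms_norm_eps in the constructor above: it keeps the division stable when the input is close to zero.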

As for post_init:
it mainly initializes the weights and sets up gradient checkpointing (kept for backward compatibility).

def post_init(self):
    """
    A method executed at the end of each Transformer model initialization, to execute code that needs the model's
    modules properly initialized (such as weight initialization).
    """
    self.init_weights()
    self._backward_compatibility_gradient_checkpointing()

1.1.2 Forward

Only the core backbone of forward is covered here:

inputs_embeds = self.embed_tokens(input_ids)
# embed positions
hidden_states = inputs_embeds

for idx, decoder_layer in enumerate(self.layers):
    # collect every hidden_states into a tuple
    if output_hidden_states:
        all_hidden_states += (hidden_states,)
    # feed the hidden states into this decoder_layer
    layer_outputs = decoder_layer(
        hidden_states,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_value=past_key_value,
        output_attentions=output_attentions,
        use_cache=use_cache,
    )
    # take the hidden states output by this decoder layer and pass them on to the next layer
    # only the first element is needed; the second is a cache object
    hidden_states = layer_outputs[0]
    
# normalize the hidden_states output by the last layer
hidden_states = self.norm(hidden_states)
    
# append the last layer's hidden_states (after the final norm)
if output_hidden_states:
    all_hidden_states += (hidden_states,)

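If output_hidden_states is enabled, the saved tuple starts with the embedding of input_ids, continues with the output hidden states of the decoder layers up to layer n-1, and ends with the last layer's output after it has gone through norm. For n decoder layers that gives n+1 entries.
The final result is returned as a BaseModelOutputWithPast.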
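The bookkeeping of all_hidden_states is easy to get wrong, so here is a toy sketch of the same loop. run_layers is a hypothetical helper written for this post: plain integers stand in for tensors, and simple functions stand in for the decoder layers and the final norm.

```python
def run_layers(embeds, layers, norm, output_hidden_states=True):
    """Mimic the forward loop: record each layer's input, then the normed final output."""
    all_hidden_states = ()
    hidden_states = embeds
    for layer in layers:
        if output_hidden_states:
            all_hidden_states += (hidden_states,)  # input of this layer
        hidden_states = layer(hidden_states)
    hidden_states = norm(hidden_states)
    if output_hidden_states:
        all_hidden_states += (hidden_states,)  # last layer's output, after norm
    return hidden_states, all_hidden_states


if __name__ == "__main__":
    toy_layers = [lambda h: h + 1 for _ in range(3)]  # stand-ins for decoder layers
    toy_norm = lambda h: h * 10                       # stand-in for the final norm
    last, collected = run_layers(0, toy_layers, toy_norm)
    print(last, collected)
```

Note that the un-normed output of the last layer never appears in the tuple; only its normed version does.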

To be continued...