References: Qwen-blog
Qwen overall architecture:
Among them:
- The tokenizer converts text into IDs in the vocabulary.
- The IDs are transformed into the corresponding vectors through embedding.
- attention_mask is used to set visibility: left-to-right, right-to-left, bidirectional, etc.
- Various downstream tasks, such as CausalLM, SequenceClassification, etc., are basically the base model followed by the corresponding Linear head, with different loss functions.
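To make this pipeline concrete, here is a toy sketch of the token-IDs → embedding → causal-mask flow in plain torch (the sizes are made up for illustration and are not Qwen's real hyperparameters):
import torch
import torch.nn as nn

vocab_size, hidden_size, seq_len = 1000, 64, 6   # toy sizes, not Qwen's real values

# "tokenizer": here we just fake token IDs drawn from the vocabulary
input_ids = torch.randint(0, vocab_size, (1, seq_len))

# embedding: map each ID to a dense vector
embed = nn.Embedding(vocab_size, hidden_size)
hidden_states = embed(input_ids)                 # (1, seq_len, hidden_size)

# attention_mask: a causal (left-to-right) mask; position i may only attend to positions <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(hidden_states.shape, causal_mask)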
First, download the transformers repository from GitHub and rename it transformers_213 (I downloaded it on 2.13); this avoids confusion with the installed transformers package.
Create run_demo.ipynb to walk through qwen2 in detail.
from transformers_213.src.transformers.models.qwen2 import Qwen2Config, Qwen2Model
import torch

def run_qwen2():
    qwen2config = Qwen2Config(
        vocab_size=151936,
        hidden_size=4096//2,
        num_hidden_layers=32//2,
        num_attention_heads=32,
        intermediate_size=2048//2
    )
    qwen2model = Qwen2Model(qwen2config)
    input_ids = torch.randint(0, qwen2config.vocab_size, (4, 30))
    res = qwen2model(input_ids)
    print(res)

if __name__ == "__main__":
    run_qwen2()
1 Qwen2Config#
Qwen2Config contains some custom hyperparameters, such as vocab_size, hidden_size, num_hidden_layers, num_attention_heads, etc. Similar to a dict, you can access the hyperparameters on it directly, e.g. config.pad_token_id.
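As a quick illustration, a minimal sketch reusing the qwen2config object built in run_demo.ipynb above (the printed values follow from the halved demo settings and the config defaults):
print(qwen2config.vocab_size)         # 151936
print(qwen2config.hidden_size)        # 2048 (4096//2 in the demo)
print(qwen2config.num_hidden_layers)  # 16 (32//2 in the demo)
print(qwen2config.pad_token_id)       # None here, since the demo does not pass pad_token_id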
1.1 Qwen2Model#
1.1.1 Initialization#
- Sets two properties of the model: padding_idx (used to specify the index of the padding token) and vocab_size (the size of the vocabulary).
- Initializes the model's embedding layer, decoder layers, and normalization layer:
  - Embedding layer (nn.Embedding): the model uses it to map input token IDs to dense vector representations.
  - Decoder layers (nn.ModuleList()): the model contains multiple decoder layers, all defined by Qwen2DecoderLayer.
  - Normalization layer (Qwen2RMSNorm): uses Root Mean Square Layer Normalization.
- Sets whether to use gradient_checkpointing, mainly to save GPU memory.
- Calls post_init() to complete weight initialization and other preparation checks.
class Qwen2Model(Qwen2PreTrainedModel):
    def __init__(self, config: Qwen2Config):
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size

        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
        self.layers = nn.ModuleList(
            [Qwen2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
        )
        self.norm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)

        self.gradient_checkpointing = False
        # Initialize weights and apply final processing
        self.post_init()
The post_init function mainly initializes the model weights and sets up the gradient checkpointing compatibility logic:
def post_init(self):
    """
    A method executed at the end of each Transformer model initialization, to execute code that needs the model's
    modules properly initialized (such as weight initialization).
    """
    self.init_weights()
    self._backward_compatibility_gradient_checkpointing()
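init_weights walks the modules and applies the model's _init_weights to each. As a reference, the following is a sketch of the typical pattern for Qwen2PreTrainedModel, based on the common Hugging Face implementation; check the copy in transformers_213 for the exact code:
def _init_weights(self, module):
    # Linear and Embedding weights are drawn from a normal distribution with std = initializer_range
    std = self.config.initializer_range
    if isinstance(module, nn.Linear):
        module.weight.data.normal_(mean=0.0, std=std)
        if module.bias is not None:
            module.bias.data.zero_()
    elif isinstance(module, nn.Embedding):
        module.weight.data.normal_(mean=0.0, std=std)
        if module.padding_idx is not None:
            # Keep the padding token's embedding at zero
            module.weight.data[module.padding_idx].zero_()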
1.1.2 Forward#
Here, only the core backbone is explained:
inputs_embeds = self.embed_tokens(input_ids)

# embed positions
hidden_states = inputs_embeds

for idx, decoder_layer in enumerate(self.layers):
    # Save all hidden_states as a tuple
    if output_hidden_states:
        all_hidden_states += (hidden_states,)
    # Pass hidden_states into each decoder_layer
    layer_outputs = decoder_layer(
        hidden_states,
        attention_mask=attention_mask,
        position_ids=position_ids,
        past_key_value=past_key_value,
        output_attentions=output_attentions,
        use_cache=use_cache,
    )
    # Take the hidden_states output by this decoder layer and feed it to the next layer.
    # Only the first element is needed here; the second is the kv cache (if use_cache is set).
    hidden_states = layer_outputs[0]

# Normalize the hidden_states output by the last layer
hidden_states = self.norm(hidden_states)

# Add the hidden_states from the last layer
if output_hidden_states:
    all_hidden_states += (hidden_states,)
- If output_hidden_states is enabled, all_hidden_states first stores the embedding of input_ids, then the outputs of the first n-1 decoder_layers (each saved as it is fed into the next layer), and finally the last layer's output after norm.
- Finally, the results are returned in the form of BaseModelOutputWithPast.
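A quick sanity check of these shapes, as a minimal sketch reusing the qwen2model and input_ids from run_demo.ipynb above:
res = qwen2model(input_ids, output_hidden_states=True)
print(type(res).__name__)            # BaseModelOutputWithPast
print(res.last_hidden_state.shape)   # torch.Size([4, 30, 2048]) -> (batch, seq_len, hidden_size)
print(len(res.hidden_states))        # 17 -> embedding output + one entry per decoder layer (16 here)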
To be continued...