
The LLM Architecture Landscape: From Transformer to Modern Large Models

Published February 21, 2026. Updated February 22, 2026.

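In 2017, Google published "Attention Is All You Need", and the Transformer architecture has reshaped the NLP field ever since. Its core building block is multi-head self-attention: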

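import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads  # per-head dimension

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def _split_heads(self, t):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
        batch, seq_len, _ = t.shape
        return t.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape
        Q = self._split_heads(self.W_q(x))
        K = self._split_heads(self.W_k(x))
        V = self._split_heads(self.W_v(x))

        # Scaled dot-product attention, computed independently per head
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
        attn = torch.softmax(scores, dim=-1)
        out = torch.matmul(attn, V)  # (batch, num_heads, seq_len, d_k)

        # Merge the heads back together and apply the output projection
        out = out.transpose(1, 2).reshape(batch, seq_len, self.num_heads * self.d_k)
        return self.W_o(out)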

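Rotary Position Embedding (RoPE) encodes position by rotating the query and key vectors through position-dependent angles, so attention scores depend on the relative distance between tokens. Compared with traditional absolute position encodings, it extrapolates better to longer sequences.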
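To make the rotation concrete, here is a minimal sketch of RoPE in PyTorch. The function names `rope_angles` and `apply_rope`, the interleaved (even, odd) channel pairing, and the frequency base of 10000 are assumptions chosen for illustration; real implementations differ in how they lay out the channel pairs.

```python
import torch

def rope_angles(seq_len, head_dim, base=10000.0):
    # One rotation frequency per pair of channels (base=10000 is an assumed default).
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2)
    return angles.cos(), angles.sin()

def apply_rope(x, cos, sin):
    # x: (..., seq_len, head_dim). Rotate each (even, odd) channel pair
    # by the angle that belongs to its position.
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    rot_even = x_even * cos - x_odd * sin
    rot_odd = x_even * sin + x_odd * cos
    # Interleave the rotated pairs back into the original channel order.
    return torch.stack((rot_even, rot_odd), dim=-1).flatten(-2)

# Usage: rotate queries and keys before computing attention scores.
cos, sin = rope_angles(seq_len=8, head_dim=16)
q = torch.randn(1, 4, 8, 16)  # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 4, 8, 16)
q_rot, k_rot = apply_rope(q, cos, sin), apply_rope(k, cos, sin)
```

Because each channel pair is rotated by an angle proportional to its position, the dot product between a rotated query at position m and a rotated key at position n depends only on m - n, which is the property that makes the encoding relative.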

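Reducing the number of K/V heads in multi-head attention shrinks the KV cache, which substantially lowers GPU memory usage during inference:

MHA: the same number of Q, K, and V heads
MQA: a single K/V head shared by all Q heads
GQA: the number of K/V heads is 1/G of the number of Q heads, so each group of G query heads shares one K/V head (Llama 3 uses this scheme)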
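The memory effect of the three schemes can be estimated with simple arithmetic. The sketch below assumes a KV cache stored in 16-bit precision and uses a hypothetical model configuration (32 layers, 32 query heads, head dimension 128, 8192 cached tokens) chosen only for illustration; `kv_cache_bytes` is a helper introduced here, not a library function.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    # The leading 2 covers the separate K and V tensors;
    # bytes_per_value=2 assumes fp16/bf16 storage.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical configuration, for illustration only.
config = dict(num_layers=32, head_dim=128, seq_len=8192)
num_q_heads = 32

for name, kv_heads in [("MHA", num_q_heads),
                       ("GQA (G=4)", num_q_heads // 4),
                       ("MQA", 1)]:
    gib = kv_cache_bytes(num_kv_heads=kv_heads, **config) / 2**30
    print(f"{name}: {kv_heads} KV heads, {gib:.3f} GiB")
```

Under these assumptions the cache takes 4.000 GiB with MHA, 1.000 GiB with GQA at G=4, and 0.125 GiB with MQA: the cache size scales linearly with the number of K/V heads.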

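Gated activations (for example, SwiGLU) replace the traditional ReLU in the FFN layer, using a gating mechanism to improve the layer's expressiveness.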
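The gating idea can be sketched with SwiGLU, one widely used gated FFN variant. The text does not name a specific variant, so the choice of SwiGLU, the class name `SwiGLUFeedForward`, the layer names, and the bias-free projections are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        # Bias-free projections are a common choice, assumed here.
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        # SiLU(gate) scales the "up" projection elementwise,
        # then the result is projected back down to d_model.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Usage: the layer maps (batch, seq_len, d_model) to the same shape.
ffn = SwiGLUFeedForward(d_model=16, d_hidden=44)
y = ffn(torch.randn(2, 5, 16))
```

Compared with a plain ReLU FFN, which uses two weight matrices, this gated form uses three, so the hidden width is often reduced to keep the parameter count comparable.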

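Model	Parameters	Context length	Highlights
GPT-4	Undisclosed	128K	Strongest overall capability
Claude 3.5	Undisclosed	200K	Long-context understanding
Llama 3.1	405B	128K	Strongest open-source model
Gemini 1.5	Undisclosed	1M	Ultra-long context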

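LLM technology is advancing rapidly. Understanding the underlying principles is the key to making sound decisions at the application layer.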
