O Unsloth resolve o problema de inferência duplicada na versão quantificada do QwQ-32B

Tutoriais práticos de IAPublicado há 7 meses Círculo de compartilhamento de IA

14.4K 00

Recentemente, a equipe da Qwen lançou QwQ-32B um modelo de inferência que tem sido usado em muitas Referências É um show que rivaliza com o DeepSeek-R1 O desempenho do sistema é excelente. No entanto, muitos usuários se depararam com geração infinita, excesso de duplicatas, problemas de token e problemas de ajuste fino. Este artigo tem como objetivo fornecer um guia detalhado para ajudá-lo a depurar e resolver esses problemas e liberar todo o potencial do QwQ-32B.

Sem pano O modelo carregado pela equipe corrige os erros acima e permite melhor suporte a ferramentas e estruturas como ajuste fino, vLLM e Transformers. Para aqueles que usam llama.cpp e outros usuários com o llama.cpp como mecanismo de back-end, consulte a seção este link Obtenha orientação para corrigir o problema de geração infinita.

Remoção do modelo QwQ-32B (bug corrigido):

Configurações oficiais recomendadas

⚙️ Configurações oficiais recomendadas

Com base nas recomendações oficiais da Qwen, a seguir estão as configurações de parâmetros recomendadas para a inferência do modelo:

Temperatura: 0,6
Top_K: 40 (faixa recomendada de 20 a 40)
Min_P: 0,1 (opcional, mas funciona bem)
Top_P: 0,95
Penalidade de repetição: 1,0 (em llama.cpp e transformadores, 1,0 significa desativado)
Modelo de bate-papo:<|im_start|>user\nCreate a Flappy Bird game in Python.<|im_end|>\n<|im_start|>assistant\n<think>\n

Configurações recomendadas do llama.cpp

A equipe do Unsloth notou que muitos usuários preferem usar uma versão do Repetition Penalty No entanto, essa abordagem realmente interfere no mecanismo de amostragem do llama.cpp. A penalidade de duplicata tinha o objetivo de reduzir o número de duplicatas geradas, mas os experimentos mostraram que essa abordagem não tem o efeito desejado.

Dito isso, a desativação total da penalidade de repetição (definida como 1,0) também é uma opção. No entanto, a equipe do Unsloth descobriu que uma penalidade de repetição adequada ainda é eficaz na supressão da geração infinita.

Para usar a penalidade de repetição de forma eficaz, a ordem dos amostradores no llama.cpp deve ser ajustada para garantir que, ao aplicar a penalidade Repetition Penalty antes da amostragem, caso contrário, isso resultará em geração infinita. Para fazer isso, adicione o seguinte parâmetro:

--samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"

Por padrão, o llama.cpp usa a seguinte ordem de amostradores:

--samplers "dry;top_k;typ_p;top_p;min_p;xtc;temperature"

A ordem ajustada da equipe do Unsloth basicamente troca as posições de temperatura e seco, e move o min_p para frente. Isso significa que o amostrador será aplicado na seguinte ordem:

top_k=40 
top_p=0.95
min_p=0.1
temperature=0.6
dry
typ_p
xtc

Se o problema persistir, tente mover o --repeat-penalty O valor de 1,0 foi ligeiramente aumentado para 1,2 ou 1,3.

Agradecemos a @krist486 por nos alertar sobre o problema da direção de amostragem no llama.cpp.

Penalidade de Repetição Seca

☀️ Dry Repetição de penalidade

A equipe do Unsloth analisou as sugestões dry penalty e tentamos usar um valor de 0,8. No entanto, os resultados experimentais mostram que odry penalty É mais provável que cause erros de sintaxe, especialmente ao gerar código. Se o usuário ainda encontrar problemas, tente definir a opção dry penalty Aumentar para 0,8.

Se você optar por usar dry penaltyA ordem de amostragem ajustada pode ser igualmente útil.

Ollama executando tutoriais do QwQ-32B

🦙 Ollama Tutorial de execução do QwQ-32B

Se ainda não estiver instalado ollamaInstale-o primeiro!

apt-get update 
apt-get install pciutils -y
curl -fSSL [https://ollama.com/install.sh](https://www.google.com/url?sa=E&q=https%3A%2F%2Follama.com%2Finstall.sh) | sh

Execute o modelo! Se a execução falhar, tente executá-lo em outro terminal ollama serveA equipe do Unsloth incluiu todas as correções e parâmetros sugeridos (temperatura, etc.) no modelo de upload do Hugging Face. param Documentação!

ollama run hf.co/unsloth/QwQ-32B-GGUF:Q4_K_M

llama.cpp Tutorial para execução do QwQ-32B

llama.cpp executando os tutoriais do QwQ-32B

através de (uma lacuna) llama.cpp Obter a versão mais recente llama.cpp. Você pode consultar as instruções de compilação a seguir para compilar. Se você não tiver uma GPU ou quiser apenas fazer inferência na CPU, defina o parâmetro -DGGML_CUDA=ON Substituir por -DGGML_CUDA=OFF.

apt-get update 
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone [https://github.com/ggerganov/llama.cpp](https://www.google.com/url?sa=E&q=https%3A%2F%2Fgithub.com%2Fggerganov%2Fllama.cpp)
cmake llama.cpp -B llama.cpp/build
-DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

Faça o download do modelo (durante a instalação) pip install huggingface_hub hf_transfer (depois). Q4_K_M ou outras versões quantificadas (por exemplo, BF16 de precisão total) podem ser selecionadas. Mais versões estão disponíveis em: https://huggingface.co/unsloth/QwQ-32B-GGUF.

# !pip install huggingface_hub hf_transfer
import os 
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download( 
repo_id="unsloth/QwQ-32B-GGUF",
local_dir="unsloth-QwQ-32B-GGUF",
allow_patterns=[" Q4_K_M "],  # For Q4_K_M
)

Execute o script de teste do Flappy Bird fornecido pelo Unsloth e o resultado será salvo no arquivo Q4_K_M_yes_samplers.txt Documentação.
Ajuste os parâmetros de acordo com a situação real.--threads 32 Defina o número de threads da CPU.--ctx-size 16384 Definir o comprimento do contexto do--n-gpu-layers 99 Defina o número de camadas de descarregamento da GPU. Se a GPU estiver com pouca memória, tente ajustar o --n-gpu-layers valor. Remova esse parâmetro se apenas a inferência da CPU for usada.
--repeat-penalty 1.1 responder cantando --dry-multiplier 0.5 são os parâmetros de penalidade de repetição e penalidade seca, que podem ser ajustados pelo usuário conforme necessário.

./llama.cpp/llama-cli  
--model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf
--threads 32
--ctx-size 16384
--n-gpu-layers 99
--seed 3407
--prio 2
--temp 0.6
--repeat-penalty 1.1
--dry-multiplier 0.5
--min-p 0.1
--top-k 40
--top-p 0.95
-no-cnv
--samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"
--prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n <think> \n"
2>&1 | tee Q4_K_M_yes_samplers.txt

As dicas do jogo Flappy Bird acima foram retiradas do site Unsloth's DeepSeekR1-Dynamic 1.58bit Blogs. A palavra-chave completa está abaixo:

<|im_start|>user
Create a Flappy Bird game in Python. You must include these things:
1. You must use pygame.
2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
3. Pressing SPACE multiple times will accelerate the bird.
4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>
<|im_start|>assistant
<think>

Aqui estão as partes inicial e final do código Python gerado pelo modelo (com o processo de pensamento removido):

import pygame
import random
import sys
pygame.init()
### Continues
class Bird:
def __init__(self):
### Continues
def main():
best_score = 0
current_score = 0
game_over = False
pipes = []
first_time = True  # Track first game play
# Initial setup
background_color = (173, 216, 230)  # Light blue initially
land_color = random.choice(land_colors)
bird = Bird()
while True:
for event in pygame.event.get():
### Continues
if not game_over:
# Update bird and pipes
bird.update()
### Continues
# Drawing
### Continues
pygame.display.flip()
clock.tick(60)
if __name__ == "__main__":
main()

O modelo gerou com sucesso um jogo Flappy Bird funcional!

Em seguida, tente remover o parâmetro --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc" e execute o mesmo comando sem a correção do Unsloth. A saída será salva no arquivo Q4_K_M_no_samplers.txt.

./llama.cpp/llama-cli
--model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf
--threads 32
--ctx-size 16384
--n-gpu-layers 99
--seed 3407
--prio 2
--temp 0.6
--repeat-penalty 1.1
--dry-multiplier 0.5
--min-p 0.1
--top-k 40
--top-p 0.95
-no-cnv
--prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n<think>\n"
2>&1 | tee Q4_K_M_no_samplers.txt

Sem usar correções, o modelo gera resultados com loops e um grande número de erros de sintaxe do Python, além de vários outros problemas. Por exemplo, o seguinte trecho de código parece estar correto, mas na verdade está incorreto! Linha 39 pipes.clear() ### <<< NameError: name 'pipes' is not defined. Você se esqueceu de importar 'pipes'? Você se esqueceu de importar 'pipes'?

import pygame
import random
pygame.init()
# Constants
WIDTH, HEIGHT = 800, 600
GROUND_HEIGHT = 20
GRAVITY = 0.7
PIPE_SPEED = -3
BIRD_SIZE = 45
MIN_GAP = 130
MAX_GAP = 200
PIPE_COLORS = [(0, 96, 0), (205, 133, 63), (89, 97, 107)]
DARK_BROWN = (94, 72, 4)
YELLOW = (252, 228, 6)
screen = pygame.display.set_mode((WIDTH, HEIGHT))
clock = pygame.time.Clock()
def random_light_color():
return (
random.randint(180, 230),
random.randint(190, 300),
random.randint(250, 255)
)
def reset_game():
global bird_x, bird_y
global pipes, score
global background_color, land_color
global bird_shape, bird_color
# Bird properties
bird_x = WIDTH * 0.3
bird_y = HEIGHT // 2
bird_vel = -5  # Initial upward thrust
pipes.clear() ### <<< NameError: name 'pipes' is not defined. Did you forget to import 'pipes'?

Se você aumentar ainda mais --repeat-penalty para 1,5, a situação piora, com os erros de sintaxe se tornando mais óbvios e o código deixando de funcionar.

import pygame
from random import randint  # For generating colors/shapes/positions randomly 
pygame.init()
# Constants:
WIDTH, HEIGHT =456 ,702   #
BACKGROUND_COLOR_LIGHTS=['lightskyblue']
GAP_SIZE=189           #
BIRD_RADIUS=3.  
PIPE_SPEED=- ( )    ? 
class Game():
def __init__(self):
self.screen_size=( )
def reset_game_vars():
global current_scor e
# set to zero and other initial states.
# Main game loop:
while running :
for event in pygame.event.get() : 
if quit ... etc
pygame.quit()
print("Code is simplified. Due time constraints, full working version requires further implementation.")

Pode-se pensar que isso é um problema apenas com a versão quantificada do Q4_K_M... a versão de precisão total do BF16 deve funcionar bem, certo? Entretanto, esse não é o caso. Mesmo com o modelo de precisão total do BF16, se você não aplicar a correção -samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc" fornecida pela equipe do Unsloth e usar a penalidade de repetição, também ocorrerá falha na geração.

O token não é exibido?

🤔 token Não foi mostrado?

Há comentários de usuários de que alguns sistemas podem não ser capazes de gerar o processo de pensamento corretamente devido ao token adicionado por padrão ao modelo de bate-papo. Os usuários precisam editar manualmente o modelo Jinja para incluir:

{%- if tools %} {{- '<|im_start|>system\n' }} {%- if messages[0]['role'] == 'system' %} {{- messages[0]['content'] }} {%- else %} {{- '' }} {%- endif %} {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }} {%- for tool in tools %} {{- "\n" }} {{- tool | tojson }} {%- endfor %} {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }} {%- else %} {%- if messages[0]['role'] == 'system' %} {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }} {%- endif %} {%- endif %} {%- for message in messages %} {%- if (message.role == "user") or (message.role == "system" and not loop.first) %} {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }} {%- elif message.role == "assistant" and not message.tool_calls %} {%- set content = message.content.split('</think>')[-1].lstrip('\n') %} {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }} {%- elif message.role == "assistant" %} {%- set content = message.content.split('</think>')[-1].lstrip('\n') %} {{- '<|im_start|>' + message.role }} {%- if message.content %} {{- '\n' + content }} {%- endif %} {%- for tool_call in message.tool_calls %} {%- if tool_call.function is defined %} {%- set tool_call = tool_call.function %} {%- endif %} {{- '\n<tool_call>\n{"name": "' }} {{- tool_call.name }} {{- '", "arguments": ' }} {{- tool_call.arguments | tojson }} {{- '}\n</tool_call>' }} {%- endfor %} {{- '<|im_end|>\n' }} {%- elif message.role == "tool" %} {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %} {{- '<|im_start|>user' }} {%- endif %} {{- '\n<tool_response>\n' }} {{- message.content }} {{- '\n</tool_response>' }} {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %} {{- '<|im_end|>\n' }} {%- endif %} {%- endif %} {%- endfor %} {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n<think>\n' }} {%- endif %}

Modificado para remover o \n final. A modificação exige que os modelos adicionem manualmente \n durante a inferência, mas isso pode nem sempre funcionar. A equipe do DeepSeek também modificou todos os modelos para adicionar tokens por padrão para forçar o modelo a entrar no modo de inferência.

Portanto, altere {%- if add_generation_prompt %} {{- 'assistant\n\n' }} {%- endif %} para {%- if add_generation_prompt %} { {- 'assistant\n' }} {%- endif %}, ou seja, remover \n.

Modelo jinja completo com a parte \n excluída.

{%- if tools %} {{- '<|im_start|>system\n' }} {%- if messages[0]['role'] == 'system' %} {{- messages[0]['content'] }} {%- else %} {{- '' }} {%- endif %} {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }} {%- for tool in tools %} {{- "\n" }} {{- tool | tojson }} {%- endfor %} {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }} {%- else %} {%- if messages[0]['role'] == 'system' %} {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }} {%- endif %} {%- endif %} {%- for message in messages %} {%- if (message.role == "user") or (message.role == "system" and not loop.first) %} {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }} {%- elif message.role == "assistant" and not message.tool_calls %} {%- set content = message.content.split('</think>')[-1].lstrip('\n') %} {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }} {%- elif message.role == "assistant" %} {%- set content = message.content.split('</think>')[-1].lstrip('\n') %} {{- '<|im_start|>' + message.role }} {%- if message.content %} {{- '\n' + content }} {%- endif %} {%- for tool_call in message.tool_calls %} {%- if tool_call.function is defined %} {%- set tool_call = tool_call.function %} {%- endif %} {{- '\n<tool_call>\n{"name": "' }} {{- tool_call.name }} {{- '", "arguments": ' }} {{- tool_call.arguments | tojson }} {{- '}\n</tool_call>' }} {%- endfor %} {{- '<|im_end|>\n' }} {%- elif message.role == "tool" %} {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %} {{- '<|im_start|>user' }} {%- endif %} {{- '\n<tool_response>\n' }} {{- message.content }} {{- '\n</tool_response>' }} {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %} {{- '<|im_end|>\n' }} {%- endif %} {%- endif %} {%- endfor %} {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n' }} {%- endif %}

Notas adicionais

Inicialmente, a equipe do Unsloth supôs que o problema poderia ter origem no seguinte:

O comprimento do contexto do QwQ pode não ser os 128K nativos, mas 32K mais a extensão YaRN. Consulte, por exemplo, o arquivo readme em https://huggingface.co/Qwen/QwQ-32B:

{
...,
"rope_scaling": {
"factor": 4.0,
"original_max_position_embeddings": 32768,
"type": "yarn"
}
}

A equipe do Unsloth tentou reescrever o tratamento do YaRN no llama.cpp, mas o problema persistiu.

--override-kv qwen2.context_length=int:131072
--override-kv qwen2.rope.scaling.type=str:yarn
--override-kv qwen2.rope.scaling.factor=float:4
--override-kv qwen2.rope.scaling.original_context_length=int:32768
--override-kv qqwen2.rope.scaling.attn_factor=float:1.13862943649292 \

A equipe do Unsloth também suspeitou que o valor do RMS Layernorm epsilon poderia estar incorreto e talvez devesse ser 1e-6 em vez de 1e-5. Por exemplo. Este link. em rms_norm_eps=1e-06 e Este link. em rms_norm_eps=1e-05. A equipe do Unsloth também tentou reescrever esse valor, mas o problema ainda não foi resolvido:

--override-kv qwen2.attention.layer_norm_rms_epsilon=float:0.000001 \

Graças a @kalomaze, a equipe do Unsloth também testou os IDs do tokenizador entre o llama.cpp e o Transformers para ver se eles coincidiam. Os resultados mostram que eles correspondem, portanto, a incompatibilidade nos IDs do tokenizador não é a fonte do problema.

Aqui estão os resultados do experimento da equipe do Unsloth:

61KB file_BF16_no_samplers.txt

BF16 Precisão total, nenhum reparo de amostra aplicado

55KB file_BF16_yes_samplers.txt

BF16 Precisão total, reparo de amostragem aplicado

71KB final_Q4_K_M_no_samplers.txt

Q4_K_M Precisão, sem correção de amostra aplicada

65KB final_Q4_K_M_yes_samplers.txt

Q4_K_M Precisão, correção de amostragem aplicada

Correções de erros do tokenizador

✏️ Correção de bug do tokenizador

A equipe do Unsloth também encontrou alguns problemas específicos que afetam o ajuste fino. O token EOS está correto, mas uma escolha mais lógica para o token PAD seria "". A equipe do Unsloth atualizou a configuração em https://huggingface.co/unsloth/QwQ-32B/blob/ A equipe do Unsloth atualizou a configuração em main/tokenizer_config.json.

"eos_token": "<|im_end|>",
"pad_token": "<|endoftext|>",

Quantificação dinâmica de 4 bits

🛠️ Quantização dinâmica de 4 bits

A equipe do Unsloth também carregou um modelo de quantização dinâmica de 4 bits, que melhora significativamente a precisão do modelo em comparação com a quantização simples de 4 bits! A figura abaixo mostra a análise de erros dos valores e pesos de ativação do modelo QwQ durante o processo de quantificação:

![alt text](https://docs.unsloth.ai/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252F32wjrIWeUEQTMq9PhmbS%252FQwQ%2520quantization%2520errors.png%3Falt%3Dmedia%26token%3D0733fd33-9fe9-4aad-812c-75dbad00373f&width=768&dpr=4&quality=100&sign=aafe447c&sv=2)

A equipe do Unsloth fez o upload do modelo quantitativo dinâmico de 4 bits para: https://huggingface.co/unsloth/QwQ-32B-unsloth-bnb-4bit.

desde vLLM A partir da versão 0.7.3 (20 de fevereiro de 2024) https://github.com/vllm-project/vllm/releases/tag/v0.7.3, o vLLM começou a oferecer suporte ao carregamento de modelos quantitativos dinâmicos de 4 bits do Unsloth!

Todos os modelos de formato GGUF podem ser encontrados em https://huggingface.co/unsloth/QwQ-32B-GGUF!