Unsloth Fixes Repetitive Inference Issues in Quantized QwQ-32B


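The Qwen team recently released QwQ-32B, a reasoning model whose benchmark performance is comparable to DeepSeek-R1. However, many users have run into infinite generation, excessive repetition, <think> token issues, and fine-tuning problems. This article is a detailed guide to debugging and resolving these issues so you can get the most out of QwQ-32B.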

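The models uploaded by the Unsloth team fix the bugs above and improve support for tools and frameworks such as fine-tuning, vLLM, and Transformers. If you use llama.cpp, or an engine that uses llama.cpp as its backend, follow the guidance for fixing the infinite-generation problem.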

Unsloth QwQ-32B models (with bug fixes):

⚙️ Official recommended settings

Based on Qwen's official recommendations, the recommended inference settings are:

Temperature: 0.6
Top_K: 40 (recommended range is 20 to 40)
Min_P: 0.1 (optional, but works well)
Top_P: 0.95
Repetition Penalty: 1.0 (in llama.cpp and transformers, 1.0 means disabled)
Chat template: <|im_start|>user\nCreate a Flappy Bird game in Python.<|im_end|>\n<|im_start|>assistant\n<think>\n
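As a minimal sketch, the settings above can be kept in one place and the prompt assembled programmatically. `QWQ_SAMPLING` and `build_prompt` are hypothetical names chosen for illustration, not part of any library; the template string simply follows the chat template shown above.

```python
# Recommended QwQ-32B sampling settings from the list above.
# QWQ_SAMPLING and build_prompt are illustrative names, not a library API.
QWQ_SAMPLING = {
    "temperature": 0.6,
    "top_k": 40,                # recommended range: 20 to 40
    "min_p": 0.1,               # optional
    "top_p": 0.95,
    "repetition_penalty": 1.0,  # 1.0 means disabled
}


def build_prompt(user_message: str) -> str:
    """Wrap a user message in the chat template and open the <think> block."""
    return (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
        "<think>\n"
    )


if __name__ == "__main__":
    print(build_prompt("Create a Flappy Bird game in Python."))
```

Keeping the template in one function makes it harder to introduce stray spaces around `<think>` when prompts are written by hand.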

👍 llama.cpp recommended settings

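The Unsloth team noticed that many users set a Repetition Penalty greater than 1.0, but this actually interferes with llama.cpp's sampling mechanism. The repetition penalty is meant to reduce repeated output, yet experiments show that this approach does not have the expected effect.

Disabling the repetition penalty entirely (setting it to 1.0) is therefore one option. However, the Unsloth team found that a suitable repetition penalty is still effective at suppressing infinite generation.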
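To make "1.0 means disabled" concrete, here is a minimal sketch of the common repetition-penalty rule, in which the logit of an already-seen token is divided by the penalty when positive and multiplied by it when negative. It illustrates the general technique under that assumption and is not code taken from llama.cpp or transformers; `apply_repetition_penalty` is a hypothetical name.

```python
def apply_repetition_penalty(logits, seen_tokens, penalty):
    """Return a copy of logits with previously seen tokens made less likely."""
    adjusted = dict(logits)
    for token in set(seen_tokens):
        if token not in adjusted:
            continue
        value = adjusted[token]
        # Dividing a positive logit or multiplying a negative one both lower it
        # when penalty > 1.0; with penalty == 1.0 nothing changes.
        adjusted[token] = value / penalty if value > 0 else value * penalty
    return adjusted


if __name__ == "__main__":
    logits = {"the": 2.0, "bird": 1.0, "pipe": -0.5}
    print(apply_repetition_penalty(logits, ["the", "pipe"], 1.0))
    print(apply_repetition_penalty(logits, ["the", "pipe"], 1.3))
```

Because the penalty applies to every seen token, large values also push down tokens that code legitimately repeats, such as brackets and keywords.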

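To use a repetition penalty effectively, you also need to change the order of llama.cpp's samplers so that the truncation samplers and temperature run before the penalty. To do this, add the following parameter: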

--samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"

By default, llama.cpp applies the samplers in this order:

--samplers "dry;top_k;typ_p;top_p;min_p;xtc;temperature"

The Unsloth team's order essentially swaps the positions of temperature and dry and moves min_p forward. The samplers are therefore applied in this order:

top_k=40 
top_p=0.95
min_p=0.1
temperature=0.6
dry
typ_p
xtc
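The effect of the first four stages can be sketched on a toy distribution. This is a simplified illustration of a top_k, top_p, min_p, temperature chain written in plain Python; it is not llama.cpp's implementation, the helper names are hypothetical, and the dry, typ_p, and xtc stages are left out.

```python
import math


def softmax(logits):
    """Convert a {token: logit} dict into probabilities."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: val / total for tok, val in exps.items()}


def top_k(logits, k):
    """Keep the k highest-scoring tokens."""
    keep = sorted(logits, key=logits.get, reverse=True)[:k]
    return {tok: logits[tok] for tok in keep}


def top_p(logits, p):
    """Keep the smallest set of top tokens whose probability mass reaches p."""
    probs = softmax(logits)
    kept, mass = {}, 0.0
    for tok in sorted(probs, key=probs.get, reverse=True):
        kept[tok] = logits[tok]
        mass += probs[tok]
        if mass >= p:
            break
    return kept


def min_p(logits, p):
    """Drop tokens whose probability is below p times the top probability."""
    probs = softmax(logits)
    cutoff = p * max(probs.values())
    return {tok: val for tok, val in logits.items() if probs[tok] >= cutoff}


def temperature(logits, temp):
    """Rescale logits; values below 1.0 sharpen the distribution."""
    return {tok: val / temp for tok, val in logits.items()}


def run_chain(logits):
    """Apply the first four samplers in the adjusted order."""
    logits = top_k(logits, 40)
    logits = top_p(logits, 0.95)
    logits = min_p(logits, 0.1)
    return temperature(logits, 0.6)


if __name__ == "__main__":
    toy = {"a": 5.0, "b": 4.0, "c": 2.0, "d": 0.0, "e": -3.0}
    print(run_chain(toy))
```

In this ordering the truncation steps see the unscaled distribution, and temperature only reshapes the candidates that survive them.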

If the problem persists, you can raise --repeat-penalty slightly, from 1.0 to 1.2 or 1.3.

Thanks to @krist486 for pointing out the sampler-ordering issue in llama.cpp.

☀️ Dry repetition penalty

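The Unsloth team investigated the dry penalty, as had been suggested, and tried a value of 0.8. However, the experiments showed that the dry penalty tends to cause syntax errors, especially in code generation. If you still run into problems, you can raise the dry penalty to 0.8.

If you choose to use the dry penalty, the adjusted sampling order is useful as well.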

🦙 Tutorial: running QwQ-32B in Ollama

If you have not installed ollama yet, install it first!

apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh

Run the model! If it fails, try running ollama serve in a separate terminal. The Unsloth team has included all fixes and suggested parameters (temperature and so on) in the param file of the Hugging Face upload.

ollama run hf.co/unsloth/QwQ-32B-GGUF:Q4_K_M
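The upload already carries the suggested parameters, but they can also be passed explicitly over Ollama's local REST interface. The sketch below assumes the default server address `http://localhost:11434`, the `/api/generate` endpoint, and the option names as I understand them from Ollama's API; `build_payload` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local address
MODEL = "hf.co/unsloth/QwQ-32B-GGUF:Q4_K_M"


def build_payload(prompt: str) -> dict:
    """Build a non-streaming generate request with explicit sampling options."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.6,
            "top_k": 40,
            "top_p": 0.95,
            "min_p": 0.1,
            "repeat_penalty": 1.0,
        },
    }


def generate(prompt: str) -> str:
    """Send the request to a running `ollama serve` and return the generated text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("Create a Flappy Bird game in Python.")` requires the server to be running and the model to be pulled already.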

Tutorial: running QwQ-32B in llama.cpp

Get the latest llama.cpp from its GitHub repository and build it with the steps below. If you do not have a GPU, or only want CPU inference, change -DGGML_CUDA=ON to -DGGML_CUDA=OFF in the build command.

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=ON -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

Download the model (after running pip install huggingface_hub hf_transfer). You can choose Q4_K_M or another quantized version, or the BF16 full-precision version. Other versions are listed at https://huggingface.co/unsloth/QwQ-32B-GGUF.

# !pip install huggingface_hub hf_transfer
import os

# Enable the faster hf_transfer download backend before importing huggingface_hub
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/QwQ-32B-GGUF",
    local_dir="unsloth-QwQ-32B-GGUF",
    allow_patterns=["*Q4_K_M*"],  # For Q4_K_M
)
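`allow_patterns` takes glob-style patterns, so wildcards around the quantization name decide which files are fetched. As a rough sketch, Python's standard `fnmatch` module shows which filenames a pattern would select. The assumption here is that `huggingface_hub` matches patterns in the same fnmatch style, and every filename in the list other than `QwQ-32B-Q4_K_M.gguf` is made up for illustration.

```python
from fnmatch import fnmatch

# QwQ-32B-Q4_K_M.gguf appears in this guide; the other names are hypothetical.
FILES = [
    "QwQ-32B-Q4_K_M.gguf",
    "QwQ-32B-Q8_0.gguf",
    "README.md",
]


def select(files, patterns):
    """Return the files that match at least one glob pattern."""
    return [name for name in files if any(fnmatch(name, pat) for pat in patterns)]


if __name__ == "__main__":
    print(select(FILES, ["*Q4_K_M*"]))  # wildcards on both sides
    print(select(FILES, ["Q4_K_M"]))    # no wildcards: must equal the whole name
```

A pattern without wildcards only matches a file whose entire name equals the pattern, which would download nothing here.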

Run the Flappy Bird test command provided by Unsloth; the output is saved to the file Q4_K_M_yes_samplers.txt.
Adjust the parameters to your setup. --threads 32 sets the number of CPU threads, --ctx-size 16384 sets the context length, and --n-gpu-layers 99 sets the number of layers offloaded to the GPU. If your GPU runs out of memory, lower the --n-gpu-layers value. For CPU-only inference, remove this parameter.
--repeat-penalty 1.1 and --dry-multiplier 0.5 are the repetition penalty and dry penalty parameters; adjust them as needed.

./llama.cpp/llama-cli \
    --model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --temp 0.6 \
    --repeat-penalty 1.1 \
    --dry-multiplier 0.5 \
    --min-p 0.1 \
    --top-k 40 \
    --top-p 0.95 \
    -no-cnv \
    --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc" \
    --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n<think>\n" \
    2>&1 | tee Q4_K_M_yes_samplers.txt

The Flappy Bird prompt above is taken from Unsloth's DeepSeekR1-Dynamic 1.58bit blog. The full prompt is:

<|im_start|>user
Create a Flappy Bird game in Python. You must include these things:
1. You must use pygame.
2. The background color should be randomly chosen and is a light shade. Start with a light blue color.
3. Pressing SPACE multiple times will accelerate the bird.
4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.
5. Place on the bottom some land colored as dark brown or yellow chosen randomly.
6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.
7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.
8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.
The final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>
<|im_start|>assistant
<think>

Below are the beginning and end of the Python code generated by the model (with the thinking process removed):

import pygame
import random
import sys
pygame.init()
### Continues
class Bird:
    def __init__(self):
        ### Continues
def main():
    best_score = 0
    current_score = 0
    game_over = False
    pipes = []
    first_time = True  # Track first game play
    # Initial setup
    background_color = (173, 216, 230)  # Light blue initially
    land_color = random.choice(land_colors)
    bird = Bird()
    while True:
        for event in pygame.event.get():
            ### Continues
        if not game_over:
            # Update bird and pipes
            bird.update()
            ### Continues
        # Drawing
        ### Continues
        pygame.display.flip()
        clock.tick(60)
if __name__ == "__main__":
    main()

The model successfully generated a working Flappy Bird game!

Next, remove the --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc" parameter and run the same command without Unsloth's fix. The output is saved to the file Q4_K_M_no_samplers.txt.

./llama.cpp/llama-cli \
    --model unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf \
    --threads 32 \
    --ctx-size 16384 \
    --n-gpu-layers 99 \
    --seed 3407 \
    --prio 2 \
    --temp 0.6 \
    --repeat-penalty 1.1 \
    --dry-multiplier 0.5 \
    --min-p 0.1 \
    --top-k 40 \
    --top-p 0.95 \
    -no-cnv \
    --prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n<think>\n" \
    2>&1 | tee Q4_K_M_no_samplers.txt
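The two runs differ only in the --samplers argument and the output file. A small sketch of a helper that builds both argument lists makes that difference explicit; `build_llama_cli_args` is a hypothetical name, and the flags are copied from the commands in this guide rather than checked against any particular llama.cpp build.

```python
import shlex

SAMPLER_ORDER = "top_k;top_p;min_p;temperature;dry;typ_p;xtc"


def build_llama_cli_args(prompt: str, use_sampler_fix: bool) -> list:
    """Return the llama-cli argument list, with or without the sampler order."""
    args = [
        "./llama.cpp/llama-cli",
        "--model", "unsloth-QwQ-32B-GGUF/QwQ-32B-Q4_K_M.gguf",
        "--threads", "32",
        "--ctx-size", "16384",
        "--n-gpu-layers", "99",
        "--seed", "3407",
        "--prio", "2",
        "--temp", "0.6",
        "--repeat-penalty", "1.1",
        "--dry-multiplier", "0.5",
        "--min-p", "0.1",
        "--top-k", "40",
        "--top-p", "0.95",
        "-no-cnv",
    ]
    if use_sampler_fix:
        args += ["--samplers", SAMPLER_ORDER]
    return args + ["--prompt", prompt]


if __name__ == "__main__":
    # Shortened prompt; "\\n" keeps a literal backslash-n, as in the shell commands.
    prompt = (
        "<|im_start|>user\\nCreate a Flappy Bird game in Python.<|im_end|>\\n"
        "<|im_start|>assistant\\n<think>\\n"
    )
    for fix in (True, False):
        print(shlex.join(build_llama_cli_args(prompt, fix)))
```

The returned list can be handed to `subprocess.run` to execute the command and capture its output; the example only prints the commands.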

Without the fix, the model produces looping output, rampant Python syntax errors, and various other problems. For example, the following snippet looks correct but is not: the pipes.clear() call (line 39 of the generated file) fails with NameError: name 'pipes' is not defined. Did you forget to import 'pipes'?

import pygame
import random
pygame.init()
# Constants
WIDTH, HEIGHT = 800, 600
GROUND_HEIGHT = 20
GRAVITY = 0.7
PIPE_SPEED = -3
BIRD_SIZE = 45
MIN_GAP = 130
MAX_GAP = 200
PIPE_COLORS = [(0, 96, 0), (205, 133, 63), (89, 97, 107)]
DARK_BROWN = (94, 72, 4)
YELLOW = (252, 228, 6)
screen = pygame.display.set_mode((WIDTH, HEIGHT))
clock = pygame.time.Clock()
def random_light_color():
    return (
        random.randint(180, 230),
        random.randint(190, 300),
        random.randint(250, 255)
    )
def reset_game():
    global bird_x, bird_y
    global pipes, score
    global background_color, land_color
    global bird_shape, bird_color
    # Bird properties
    bird_x = WIDTH * 0.3
    bird_y = HEIGHT // 2
    bird_vel = -5  # Initial upward thrust
    pipes.clear() ### <<< NameError: name 'pipes' is not defined. Did you forget to import 'pipes'?

Raising --repeat-penalty further to 1.5 makes things even worse: the syntax errors become glaring and the code does not work at all.

import pygame
from random import randint  # For generating colors/shapes/positions randomly 
pygame.init()
# Constants:
WIDTH, HEIGHT =456 ,702   #
BACKGROUND_COLOR_LIGHTS=['lightskyblue']
GAP_SIZE=189           #
BIRD_RADIUS=3.  
PIPE_SPEED=- ( )    ? 
class Game():
def __init__(self):
self.screen_size=( )
def reset_game_vars():
global current_scor e
# set to zero and other initial states.
# Main game loop:
while running :
for event in pygame.event.get() : 
if quit ... etc
pygame.quit()
print("Code is simplified. Due time constraints, full working version requires further implementation.")
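Failures like these can be caught automatically, at least in part. The following is a minimal sketch, assuming the saved output closes its thinking section with `</think>` and ends with a fenced Python block; `final_python_block` and `has_valid_syntax` are hypothetical helper names. A syntax check of this kind flags output like the 1.5-penalty example, but it cannot detect a `NameError`, which only appears at run time.

```python
import ast
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays readable


def final_python_block(output: str):
    """Return the last fenced Python block after the thinking section, or None."""
    answer = output.split("</think>")[-1]
    pattern = FENCE + r"python\n(.*?)" + FENCE
    blocks = re.findall(pattern, answer, flags=re.DOTALL)
    return blocks[-1] if blocks else None


def has_valid_syntax(code: str) -> bool:
    """Parse the code without running it; True if Python accepts the syntax."""
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    return True


if __name__ == "__main__":
    good = "plan</think>\n" + FENCE + "python\nprint('ok')\n" + FENCE
    bad = "plan</think>\n" + FENCE + "python\nPIPE_SPEED=- ( )    ?\n" + FENCE
    for sample in (good, bad):
        print(has_valid_syntax(final_python_block(sample)))
```

Parsing rather than executing keeps the check safe to run on untrusted model output.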

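You might think this is only a problem with the Q4_K_M quantized version. It is not. Even with the BF16 full-precision model, generation fails if you use a Repetition Penalty without applying the Unsloth team's --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc" fix.

🤔 <think> token not showing?

Some users have reported that, because the <think> token is added to the chat template by default, some systems do not output the thinking process correctly. In that case, edit the Jinja template manually. The default template is: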

{%- if tools %} {{- '<|im_start|>system\n' }} {%- if messages[0]['role'] == 'system' %} {{- messages[0]['content'] }} {%- else %} {{- '' }} {%- endif %} {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }} {%- for tool in tools %} {{- "\n" }} {{- tool | tojson }} {%- endfor %} {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }} {%- else %} {%- if messages[0]['role'] == 'system' %} {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }} {%- endif %} {%- endif %} {%- for message in messages %} {%- if (message.role == "user") or (message.role == "system" and not loop.first) %} {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }} {%- elif message.role == "assistant" and not message.tool_calls %} {%- set content = message.content.split('</think>')[-1].lstrip('\n') %} {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }} {%- elif message.role == "assistant" %} {%- set content = message.content.split('</think>')[-1].lstrip('\n') %} {{- '<|im_start|>' + message.role }} {%- if message.content %} {{- '\n' + content }} {%- endif %} {%- for tool_call in message.tool_calls %} {%- if tool_call.function is defined %} {%- set tool_call = tool_call.function %} {%- endif %} {{- '\n<tool_call>\n{"name": "' }} {{- tool_call.name }} {{- '", "arguments": ' }} {{- tool_call.arguments | tojson }} {{- '}\n</tool_call>' }} {%- endfor %} {{- '<|im_end|>\n' }} {%- elif message.role == "tool" %} {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %} {{- '<|im_start|>user' }} {%- endif %} {{- '\n<tool_response>\n' }} {{- message.content }} {{- '\n</tool_response>' }} {%- if loop.last or 
(messages[loop.index0 + 1].role != "tool") %} {{- '<|im_end|>\n' }} {%- endif %} {%- endif %} {%- endfor %} {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n<think>\n' }} {%- endif %}

Change it so that the <think>\n at the end is removed. Note that the DeepSeek team also changed all of its models to add a <think> token by default, to force the model into reasoning mode.

In other words, change {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n<think>\n' }} {%- endif %} to {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n' }} {%- endif %}, that is, remove <think>\n.

The full template after the change:

{%- if tools %} {{- '<|im_start|>system\n' }} {%- if messages[0]['role'] == 'system' %} {{- messages[0]['content'] }} {%- else %} {{- '' }} {%- endif %} {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }} {%- for tool in tools %} {{- "\n" }} {{- tool | tojson }} {%- endfor %} {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }} {%- else %} {%- if messages[0]['role'] == 'system' %} {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }} {%- endif %} {%- endif %} {%- for message in messages %} {%- if (message.role == "user") or (message.role == "system" and not loop.first) %} {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }} {%- elif message.role == "assistant" and not message.tool_calls %} {%- set content = message.content.split('</think>')[-1].lstrip('\n') %} {{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }} {%- elif message.role == "assistant" %} {%- set content = message.content.split('</think>')[-1].lstrip('\n') %} {{- '<|im_start|>' + message.role }} {%- if message.content %} {{- '\n' + content }} {%- endif %} {%- for tool_call in message.tool_calls %} {%- if tool_call.function is defined %} {%- set tool_call = tool_call.function %} {%- endif %} {{- '\n<tool_call>\n{"name": "' }} {{- tool_call.name }} {{- '", "arguments": ' }} {{- tool_call.arguments | tojson }} {{- '}\n</tool_call>' }} {%- endfor %} {{- '<|im_end|>\n' }} {%- elif message.role == "tool" %} {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %} {{- '<|im_start|>user' }} {%- endif %} {{- '\n<tool_response>\n' }} {{- message.content }} {{- '\n</tool_response>' }} {%- if loop.last or 
(messages[loop.index0 + 1].role != "tool") %} {{- '<|im_end|>\n' }} {%- endif %} {%- endif %} {%- endfor %} {%- if add_generation_prompt %} {{- '<|im_start|>assistant\n' }} {%- endif %}
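If the template is stored as a string, the same edit can be scripted. This is a small sketch under the assumption that the template ends with exactly the generation prompt shown in this section; `remove_forced_think` is a hypothetical helper, not part of any tokenizer library.

```python
# "\\n" is a literal backslash-n, matching how the Jinja source is written.
FORCED_THINK = "{{- '<|im_start|>assistant\\n<think>\\n' }}"
PLAIN_START = "{{- '<|im_start|>assistant\\n' }}"


def remove_forced_think(template: str) -> str:
    """Replace the generation prompt that pre-opens <think> with a plain one."""
    if FORCED_THINK not in template:
        raise ValueError("template does not contain the forced <think> prompt")
    return template.replace(FORCED_THINK, PLAIN_START)


if __name__ == "__main__":
    tail = "{%- if add_generation_prompt %} " + FORCED_THINK + " {%- endif %}"
    print(remove_forced_think(tail))
```

Raising an error when the marker is absent avoids silently saving an unchanged template.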

Additional notes

The Unsloth team initially suspected that the problem might be caused by the following:

QwQ's context length may not be a native 128K, but rather 32K with a YaRN extension. See, for example, the readme at https://huggingface.co/Qwen/QwQ-32B:

{
...,
"rope_scaling": {
"factor": 4.0,
"original_max_position_embeddings": 32768,
"type": "yarn"
}
}

The Unsloth team tried overriding llama.cpp's YaRN handling, but the problem was not resolved.

--override-kv qwen2.context_length=int:131072 \
--override-kv qwen2.rope.scaling.type=str:yarn \
--override-kv qwen2.rope.scaling.factor=float:4 \
--override-kv qwen2.rope.scaling.original_context_length=int:32768 \
--override-kv qwen2.rope.scaling.attn_factor=float:1.13862943649292 \
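The numbers in these overrides follow from the rope_scaling block: 32768 × 4 gives 131072, and the attention factor matches the scaling commonly used with YaRN, 0.1 · ln(factor) + 1. The sketch below only reproduces that arithmetic; the function names are hypothetical.

```python
import math

ORIGINAL_CONTEXT = 32768  # original_max_position_embeddings
YARN_FACTOR = 4.0         # rope_scaling factor


def extended_context(original: int, factor: float) -> int:
    """Context length after YaRN scaling."""
    return int(original * factor)


def yarn_attention_factor(factor: float) -> float:
    """Attention scaling commonly used with YaRN: 0.1 * ln(factor) + 1."""
    return 0.1 * math.log(factor) + 1.0


if __name__ == "__main__":
    print(extended_context(ORIGINAL_CONTEXT, YARN_FACTOR))
    print(yarn_attention_factor(YARN_FACTOR))
```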

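The Unsloth team also wondered whether the RMS Layernorm epsilon was wrong, and whether it should be 1e-6 rather than 1e-5. For example, one of the configurations the team compared sets rms_norm_eps=1e-06. The team tried overriding this value as well, but the problem remained: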

--override-kv qwen2.attention.layer_norm_rms_epsilon=float:0.000001 \

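Thanks to @kalomaze, the Unsloth team also tested whether the tokenizer IDs match between llama.cpp and Transformers. They do match, so a tokenizer ID mismatch is not the cause of the problem.

The Unsloth team's experiment results are listed below:

file_BF16_no_samplers.txt (61KB): BF16 full precision, without the sampling fix
file_BF16_yes_samplers.txt (55KB): BF16 full precision, with the sampling fix
final_Q4_K_M_no_samplers.txt (71KB): Q4_K_M precision, without the sampling fix
final_Q4_K_M_yes_samplers.txt (65KB): Q4_K_M precision, with the sampling fix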

✏️ Tokenizer bug fixes

The Unsloth team also found a few specific issues that affect fine-tuning. The EOS token is correct, but a more logical choice for the PAD token would be "". The team updated the configuration at https://huggingface.co/unsloth/QwQ-32B/blob/main/tokenizer_config.json.

"eos_token": "<|im_end|>",
"pad_token": "<|endoftext|>",

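🛠️ Dynamic 4-bit quantization

The Unsloth team has also uploaded dynamic 4-bit quantized models, which significantly improve accuracy compared with plain 4-bit quantization! The figure below shows an error analysis of the activations and weights of the QwQ model during quantization: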

![alt text](https://docs.unsloth.ai/~gitbook/image?url=https%3A%2F%2F3215535692-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-x-prod.appspot.com%2Fo%2Fspaces%252FxhOjnexMCB3dmuQFQ2Zq%252Fuploads%252F32wjrIWeUEQTMq9PhmbS%252FQwQ%2520quantization%2520errors.png%3Falt%3Dmedia%26token%3D0733fd33-9fe9-4aad-812c-75dbad00373f&width=768&dpr=4&quality=100&sign=aafe447c&sv=2)

The Unsloth team uploaded the dynamic 4-bit quantized model to https://huggingface.co/unsloth/QwQ-32B-unsloth-bnb-4bit.

Since vLLM version 0.7.3 (February 20, 2025, https://github.com/vllm-project/vllm/releases/tag/v0.7.3), vLLM supports loading Unsloth dynamic 4-bit quantized models!

All models in GGUF format are available at https://huggingface.co/unsloth/QwQ-32B-GGUF!