I. Introduction
Word has it that KTransformers 0.2.4 now supports concurrency, which is a big step forward. From my earlier testing, the two things I most looked forward to in KTransformers were AMX instruction acceleration and concurrency support. Now that concurrency has landed, does it mean KT is finally more than a toy and could head toward being a real product? I decided to try it out and see.
TL;DR, the conclusion up front: with this version's approach, I still did not see the legendary leap from AMX instruction acceleration on the newer Xeon CPUs, and concurrency is still not viable (it runs, but the experience is unbearable). It is fine for personal tinkering and studying the technology, but not for productization or commercial use.
If you want to reproduce this, follow the steps below and you should have no trouble.
II. Hardware and Software Environment
1. Hardware and Software
Same environment as before: the tests ran on a rented AutoDL GPU server.
- Software
  - PyTorch 2.5.1, Python 3.12 (Ubuntu 22.04), CUDA 12.4
- Hardware
  - GPU: RTX 4090 (24 GB) × 2
  - CPU: 64 vCPU Intel(R) Xeon(R) Gold 6430
  - RAM: 480 GB (at least 382 GB required)
  - Disk: 1.8 TB (about 380 GB actually used)
2. Virtual Environment
To save time, I simply reused the previous v0.2.3 virtual environment: /root/autodl-tmp/jacky/envs/kt0.2.3
If you are starting from scratch, create a fresh one as follows:
- Create the conda environment
conda create --prefix=/root/autodl-tmp/jacky/envs/deepseekr1-671b python==3.12.3
conda activate /root/autodl-tmp/jacky/envs/deepseekr1-671b
- Install PyTorch, packaging, ninja, and flash-attn
pip install torch packaging ninja cpufeature numpy
pip install flash-attn
- Install libstdcxx-ng and verify the GLIBCXX versions it provides
conda install -c conda-forge libstdcxx-ng
strings /root/autodl-tmp/jacky/envs/deepseekr1-671b/lib/libstdc++.so.6 | grep GLIBCXX
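Before building, a quick sanity check that the environment is healthy can save time later. This check is my own addition rather than an official step:
# confirm PyTorch sees the GPUs and flash-attn imports cleanly
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"
python -c "import flash_attn; print(flash_attn.__version__)"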
III. Getting Started
Under test:
- KTransformers: v0.2.4.post1
- Model: DeepSeek-R1-GGUF
1. Download the KT code
Route the clone through the mirror https://ghfast.top/ to avoid download failures:
git clone https://ghfast.top/https://github.com/kvcache-ai/ktransformers ktransformers-new
cd ktransformers-new
2. Sync the submodules
First, point the submodule repository URLs at the mirror too:
vi .gitmodules
Add the mirror prefix to every submodule URL:
[submodule "third_party/llama.cpp"]
path = third_party/llama.cpp
url = https://ghfast.top/https://github.com/ggerganov/llama.cpp.git
[submodule "third_party/pybind11"]
path = third_party/pybind11
url = https://ghfast.top/https://github.com/pybind/pybind11.git
[submodule "third_party/spdlog"]
path = third_party/spdlog
url = https://ghfast.top/https://github.com/gabime/spdlog.git
[submodule "third_party/custom_flashinfer"]
path = third_party/custom_flashinfer
url = https://ghfast.top/https://github.com/kvcache-ai/custom_flashinfer.git
branch = fix-precision-mla-merge-main
[submodule "third_party/xxHash"]
path = third_party/xxHash
url = https://ghfast.top/https://github.com/Cyan4973/xxHash.git
[submodule "third_party/prometheus-cpp"]
path = third_party/prometheus-cpp
url = https://ghfast.top/https://github.com/jupp0r/prometheus-cpp
Then fetch the submodule code:
git submodule update --init --recursive
Note: v0.2.4 introduces new submodules, and some of those have submodules of their own, so the nested fetches fail from inside the GFW. That failure later surfaces as the build error ERROR: Directory 'third_party/custom_flashinfer/' is not installable (see Problem 1 below). The only workaround I found was to run the command once per level of recursion: after each level downloads successfully via ghfast.top, edit that level's .gitmodules to prefix every submodule URL with the mirror, then run the command again. A possible shortcut is sketched below.
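Alternatively, instead of patching every level's .gitmodules by hand, git's URL rewriting should have the same effect and applies to nested submodules automatically at fetch time. I did the edits manually, so treat this as an untested shortcut:
# rewrite all github.com URLs through the mirror for every clone/fetch
git config --global url."https://ghfast.top/https://github.com/".insteadOf "https://github.com/"
git submodule update --init --recursive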
3. Install the dependencies
export TORCH_CUDA_ARCH_LIST="8.9"
pip install -r requirements-local_chat.txt
pip install setuptools wheel packaging
4. Build KTransformers v0.2.4
1) Edit ./install.sh
vi install.sh and add:
export TORCH_CUDA_ARCH_LIST="8.9"
export MAX_JOBS=64
export CMAKE_BUILD_PARALLEL_LEVEL=64
2) Build
# Install single NUMA dependencies
USE_BALANCE_SERVE=1 bash ./install.sh
If you have two CPUs and 1 TB of RAM (dual NUMA), add USE_NUMA=1:
USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh
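If the build went through, pip should report the freshly installed package; confirming this now (my own habit, not an official step) saves chasing import errors later:
pip show ktransformers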
IV. Test Runs
1. Single 4090
Launch command:
python ktransformers/server/main.py \
--port 10002 \
--model_path /root/autodl-tmp/DeepSeek-R1 \
--gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF \
--model_name deepseek-r1-jacky \
--optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
--max_new_tokens 1024 \
--cache_lens 32768 \
--chunk_size 256 \
--max_batch_size 4 \
--backend_type balance_serve \
--force_think
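Once the server is up, a quick smoke test against its chat endpoint looks like this. I am assuming the usual OpenAI-compatible /v1/chat/completions route here; adjust if your build exposes a different path:
curl -s http://localhost:10002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-r1-jacky", "messages": [{"role": "user", "content": "hello"}]}'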
I have two 4090s, but only one was used and its utilization was low; as before, about 13 GB of VRAM was consumed. A single connection ran at roughly 3-5 tps; with two connections the server essentially locked up, at times emitting one token every few seconds (the think phase alone took minutes).
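For reference, a scripted way to reproduce the two-connection stall is simply to fire two requests in parallel (same assumed endpoint as in the smoke test above):
# launch two chat requests concurrently and wait for both to finish
for i in 1 2; do
  curl -s http://localhost:10002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-r1-jacky", "messages": [{"role": "user", "content": "hello"}]}' > /tmp/resp_$i.json &
done
wait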
Conclusion: with this version's approach, I still did not see the legendary leap from AMX instruction acceleration on the newer Xeon CPUs, and concurrency remains far from commercially usable.

2. Dual 4090
Launch command:
python ktransformers/server/main.py \
--port 10002 \
--model_path /root/autodl-tmp/DeepSeek-R1 \
--gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF \
--model_name deepseek-r1-jacky \
--optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml \
--max_new_tokens 1024 \
--cache_lens 32768 \
--chunk_size 256 \
--max_batch_size 4 \
--backend_type balance_serve \
--force_think
It crashes right away: the arguments are not even passed correctly, which looks like a basic bug. From the traceback, the balance_serve code path calls input_layernorm with three arguments (hidden_states, num_tokens_tensors, residual), while the DeepseekV3RMSNorm selected by the multi-GPU optimize rules accepts only hidden_states.
Traceback (most recent call last):
File "/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 246, in run_engine
engine.model_runner.warmup()
File "/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/site-packages/ktransformers/server/balance_serve/inference/model_runner.py", line 123, in warmup
self.outputs_buf[i] = self.model(self.input[i], self.features_buf[i], self.bsz_tensor_buf, self.num_tokens_tensor_buf, self.page_idx_buf[i], self.page_offset_buf[i])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/site-packages/ktransformers/models/custom_modeling_deepseek_v3.py", line 112, in forward
hidden_states, residual = decode_layer.input_layernorm(hidden_states, num_tokens_tensors, residual)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: DeepseekV3RMSNorm.forward() takes 2 positional arguments but 4 were given
V. Problems Encountered
1. Build error: ERROR: Directory 'third_party/custom_flashinfer/' is not installable
- Symptom
changing mode of /root/autodl-tmp/jacky/envs/kt0.2.3/bin/ktransformers to 755
Successfully installed ktransformers-0.2.4.post1+cu121torch26fancy
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
ERROR: Directory 'third_party/custom_flashinfer/' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
- Analysis
Looking inside third_party/custom_flashinfer, its own .gitmodules lists several more submodules whose URLs all point straight at github.com; inside the GFW these fail to download, which breaks the build.
- Fix
Go into third_party/custom_flashinfer, edit its .gitmodules, and prefix each submodule URL with ghfast.top so the downloads succeed, as sketched below.
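The manual edit can be condensed into one sed pass plus a re-sync; the pattern below is my own shorthand for the steps described above:
# prefix every github.com URL in the nested .gitmodules with the mirror
sed -i 's#https://github.com/#https://ghfast.top/https://github.com/#g' third_party/custom_flashinfer/.gitmodules
git -C third_party/custom_flashinfer submodule sync --recursive
git submodule update --init --recursive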
2. Build warning: TORCH_CUDA_ARCH_LIST is not set
- Symptom
/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
  warnings.warn(
Emitting ninja build file /root/autodl-tmp/jacky/ktransformers-new/build/temp.linux-x86_64-cpython-312/build.ninja...
- Fix
I am on 4090s, for which TORCH_CUDA_ARCH_LIST is 8.9. I exported it earlier, but it somehow got cleared along the way; adding the export directly to install.sh fixes it for good:
export TORCH_CUDA_ARCH_LIST="8.9"
3. Runtime error: No module named 'sched_ext'
This error appeared during a test run.
- Symptom
export TORCH_CUDA_ARCH_LIST="8.9"
python ./ktransformers/local_chat.py --model_path /root/autodl-tmp/DeepSeek-R1 --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF --cpu_infer 64 --max_new_tokens 1000 --force_think true | tee runlog_v0.2.4.log
The error:
Traceback (most recent call last):
File "/root/autodl-tmp/jacky/ktransformers-new/./ktransformers/local_chat.py", line 25, in <module>
from ktransformers.optimize.optimize import optimize_and_load_gguf
File "/root/autodl-tmp/jacky/ktransformers-new/./ktransformers/optimize/optimize.py", line 16, in <module>
from ktransformers.util.utils import set_module, load_weights
File "/root/autodl-tmp/jacky/ktransformers-new/./ktransformers/util/utils.py", line 17, in <module>
from ktransformers.models.custom_cache import StaticCache
File "/root/autodl-tmp/jacky/ktransformers-new/./ktransformers/models/custom_cache.py", line 15, in <module>
from ktransformers.server.balance_serve.settings import sched_ext
File "/root/autodl-tmp/jacky/ktransformers-new/./ktransformers/server/balance_serve/settings.py", line 13, in <module>
import sched_ext
ModuleNotFoundError: No module named 'sched_ext'
- Fix
This appears to be caused by building without USE_BALANCE_SERVE=1: my first build omitted the flag and hit this error; rebuilding with the flag set made it go away, as shown below.
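Concretely, rerun the build from Section III with the flag enabled:
# rebuild with the balance_serve backend compiled in
USE_BALANCE_SERVE=1 bash ./install.sh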