一、前言

听说KTransformers 0.2.4支持并发了，这可是个大进步，之前测试下来KTranformers最大的期待就是AMX指令加速和支持并发。现在可以支持并发了，是否意味着KT终于不再是一个玩具，有可能朝产品化的方向去走了，因此上手体验一下看看。

省流，直接看结论：这个版本的方案下，依然没有看到传说中的新版XEON CPU的amx指令加速带来的飞跃，并发依然不行（能并发，但体验无法忍受），个人玩玩，研究一下技术可以，但无法产品化、商业化使用。

有兴趣复现的可以照我这个步骤来走，基本不会有问题。

二、软硬件环境

1. 软硬件环境

还是原来的环境。租的AutoDL的GPU服务器做的测试

软件环境
- PyTorch 2.5.1、Python 3.12(ubuntu22.04)、Cuda 12.4
硬件环境
- GPU：RTX 4090(24GB) * 2
- CPU：64 vCPU Intel(R) Xeon(R) Gold 6430
- 内存：480G（至少需要382G）
- 硬盘：1.8T（实际使用需要380G左右）

2. 虚拟环境

我图省事，就直接复用了之前的v0.2.3的虚拟环境：/root/autodl-tmp/jacky/envs/kt0.2.3

重头开始的朋友可以重新创建一个新的虚拟环境，步骤如下

创建conda 环境

conda create --prefix=/root/autodl-tmp/jacky/envs/deepseekr1-671b python==3.12.3
conda activate /root/autodl-tmp/jacky/envs/deepseekr1-671b

安装 PyTorch、packaging、ninja、flash-attn

pip install torch packaging ninja cpufeature numpy
pip install flash-attn

安装libstdcxx-ng

conda install -c conda-forge libstdcxx-ng
strings ~/anaconda3/envs/ktransformers/lib/libstdc++.so.6 | grep GLIBCXX

三、开工

测试使用：

KTransformers版本: 0.2.4post,
Deepseek: DeepSeek-R1-GGUF

1. 下载KT代码

给挂个加速器https://ghfast.top/ ，避免下载代码失败。

git clone https://ghfast.top/https://github.com/kvcache-ai/ktransformers ktransformers-new
cd ktransformers-new

2. 同步子模块

先改下子模块的代码仓库路径，同样给加下加速。

vi .gitmodules

所有子模块地址给挂个加速

[submodule "third_party/llama.cpp"]
        path = third_party/llama.cpp
        url = https://ghfast.top/https://github.com/ggerganov/llama.cpp.git
[submodule "third_party/pybind11"]
        path = third_party/pybind11
        url = https://ghfast.top/https://github.com/pybind/pybind11.git
[submodule "third_party/spdlog"]
        path = third_party/spdlog
        url = https://ghfast.top/https://github.com/gabime/spdlog.git
[submodule "third_party/custom_flashinfer"]
        path = third_party/custom_flashinfer
        url = https://ghfast.top/https://github.com/kvcache-ai/custom_flashinfer.git
        branch = fix-precision-mla-merge-main
[submodule "third_party/xxHash"]
        path = third_party/xxHash
        url = https://ghfast.top/https://github.com/Cyan4973/xxHash.git
[submodule "third_party/prometheus-cpp"]
        path = third_party/prometheus-cpp
        url = https://ghfast.top/https://github.com/jupp0r/prometheus-cpp

然后下载子模块代码

git submodule update --init --recursive

注：这一步要注意，v0.2.4引入了一些新的子模块，并且这些子模块又有子模块，这样会导致下载子模块会失败，从而导致下面的：编译完有一个报错：ERROR: Directory ‘third_party/custom_flashinfer/’ is not installable 这个错误，这个现在在墙内没办法，只能跑两遍（有多少层递归就要跑多少遍），然后每一层的代码用ghfast.top加速下载成功后，再去改那一层的.gitmodules里的每个子模块的仓库地址，然后再跑。

3. 安装依赖

export TORCH_CUDA_ARCH_LIST="8.9"

pip install -r requirements-local_chat.txt
pip install setuptools wheel packaging

4. 编译KTransformers v0.2.4

1) 修改./install.sh，

vi install.sh 加入：

export TORCH_CUDA_ARCH_LIST="8.9"
export MAX_JOBS=64
export CMAKE_BUILD_PARALLEL_LEVEL=64

2）编译

# Install single NUMA dependencies
USE_BALANCE_SERVE=1  bash ./install.sh

如果你有1T内存，可以 USE_NUMA=1（# For those who have two cpu and 1T RAM（Dual NUMA））

USE_BALANCE_SERVE=1 USE_NUMA=1 bash ./install.sh

四、测试运行

1. 单4090

启动命令

python ktransformers/server/main.py \
 --port 10002 \
 --model_path /root/autodl-tmp/DeepSeek-R1 \
 --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF \
 --model_name deepseek-r1-jacky \
 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-serve.yaml \
 --max_new_tokens 1024 \
 --cache_lens 32768 \
 --chunk_size 256 \
 --max_batch_size 4 \
 --backend_type balance_serve \
 --force_think

我有两个4090，但只用了一个，且利用率不高，跟之前一样用了13G左右显存。单个连接速度大概在3~5tps左右，两个连接基本上卡死，甚至几秒钟出一个字（think等几分钟）。

结论：这个版本的方案下，依然没有看到传说中的新版XEON CPU的amx指令加速带来的飞跃，且并发依然无法商用。

2. 双4090

启动命令

python ktransformers/server/main.py \
 --port 10002 \
 --model_path /root/autodl-tmp/DeepSeek-R1 \
 --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF \
 --model_name deepseek-r1-jacky \
 --optimize_config_path ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml \
 --max_new_tokens 1024 \
 --cache_lens 32768 \
 --chunk_size 256 \
 --max_batch_size 4 \
 --backend_type balance_serve \
 --force_think

报错，参数都没填对，确认是一个低级bug。

Traceback (most recent call last):
  File "/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/site-packages/ktransformers/server/backend/interfaces/balance_serve.py", line 246, in run_engine
    engine.model_runner.warmup()
  File "/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/site-packages/ktransformers/server/balance_serve/inference/model_runner.py", line 123, in warmup
    self.outputs_buf[i] = self.model(self.input[i], self.features_buf[i], self.bsz_tensor_buf, self.num_tokens_tensor_buf, self.page_idx_buf[i], self.page_offset_buf[i])
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/site-packages/ktransformers/models/custom_modeling_deepseek_v3.py", line 112, in forward
    hidden_states, residual = decode_layer.input_layernorm(hidden_states, num_tokens_tensors, residual)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: DeepseekV3RMSNorm.forward() takes 2 positional arguments but 4 were given

五、碰到的问题

1。编译完有一个报错：ERROR: Directory ‘third_party/custom_flashinfer/’ is not installable

问题现象

  changing mode of /root/autodl-tmp/jacky/envs/kt0.2.3/bin/ktransformers to 755
Successfully installed ktransformers-0.2.4.post1+cu121torch26fancy
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
ERROR: Directory 'third_party/custom_flashinfer/' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.

问题分析

去third_party/custom_flashinfer目录下看了一下，发现这个模块.gitmodules里又有许多子模块，这些子模块的路径都是直接从github下载的代码，在墙内下载不下来，从而导致编译失败。

问题解决

进third_party/custom_flashinfer目录，修改里边的.gitmodules，将里边的各个子模块的仓库地址都给加一下ghfast.top的加速，避免下载代码失败。

2。编译中告警：TORCH_CUDA_ARCH_LIST is not set

问题现象 /root/autodl-tmp/jacky/envs/kt0.2.3/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ[‘TORCH_CUDA_ARCH_LIST’]. warnings.warn( Emitting ninja build file /root/autodl-tmp/jacky/ktransformers-new/build/temp.linux-x86_64-cpython-312/build.ninja…
问题解决

我用的是4090，对应 TORCH_CUDA_ARCH_LIST 是8.9，前面导出过，不知道为什么又被清了，可以直接加到install.sh里。

export TORCH_CUDA_ARCH_LIST=”8.9″

3。运行报错：No module named ‘sched_ext’

测试运行时报错

问题现象

export TORCH_CUDA_ARCH_LIST="8.9"
python ./ktransformers/local_chat.py --model_path /root/autodl-tmp/DeepSeek-R1 --gguf_path /root/autodl-tmp/DeepSeek-R1-GGUF --cpu_infer 64 --max_new_tokens 1000 --force_think true | tee runlog_v0.2.4.log

错误如下

Traceback (most recent call last):
  File "/root/autodl-tmp/jacky/ktransformers-new/./ktransformers/local_chat.py", line 25, in <module>
    from ktransformers.optimize.optimize import optimize_and_load_gguf
  File "/root/autodl-tmp/jacky/ktransformers-new/./ktransformers/optimize/optimize.py", line 16, in <module>
    from ktransformers.util.utils import set_module, load_weights
  File "/root/autodl-tmp/jacky/ktransformers-new/./ktransformers/util/utils.py", line 17, in <module>
    from ktransformers.models.custom_cache import StaticCache
  File "/root/autodl-tmp/jacky/ktransformers-new/./ktransformers/models/custom_cache.py", line 15, in <module>
    from ktransformers.server.balance_serve.settings import sched_ext
  File "/root/autodl-tmp/jacky/ktransformers-new/./ktransformers/server/balance_serve/settings.py", line 13, in <module>
    import sched_ext
ModuleNotFoundError: No module named 'sched_ext'

解决方案

这个问题应该是由于前面USE_BALANCE_SERVE=1没有启用导致的，第一次编译的时候没加，结果报这个错误，后面加了重新编译一遍这个问题就不见的。

亲测 KTransformers 0.2.4post+Deepseek r1 671B Q4：传说中的 amx 指令加速、并发究竟成色如何？