
At first it complained that FlashInfer was not installed
While starting vllm there was a warning:
[topk_topp_sampler.py:63] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
So let's install FlashInfer
pip install flashinfer-python
- After installing, it then reported that FlashInfer>=v0.2.3 is not backward compatible
There was still a warning, just a different one, and the warning didn't say which version should be used.
[topk_topp_sampler.py:38] Currently, FlashInfer top-p & top-k sampling sampler is disabled because FlashInfer>=v0.2.3 is not backward compatible. Falling back to the PyTorch-native implementation of top-p & top-k sampling.
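Since the warning doesn't say which version is acceptable, first check what pip actually installed (a quick sketch; flashinfer exposes __version__, which is exactly what the vllm check below reads):

import flashinfer

# The vllm sampler reads flashinfer.__version__; in this case it was >= 0.2.3,
# which is why the warning above fires.
print(flashinfer.__version__)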
- Check the code at line 38 of topk_topp_sampler.py,
File location: vllm/vllm/v1/sample/ops/topk_topp_sampler.py
class TopKTopPSampler(nn.Module):

    def __init__(self):
        super().__init__()
        if current_platform.is_cuda():
            if is_flashinfer_available:
                flashinfer_version = flashinfer.__version__
                if flashinfer_version >= "0.2.3":
                    # FIXME(DefTruth): Currently, we have errors when using
                    # FlashInfer>=v0.2.3 for top-p & top-k sampling. As a
                    # workaround, we disable FlashInfer for top-p & top-k
                    # sampling by default while FlashInfer>=v0.2.3.
                    # The sampling API removes the success return value
                    # of all sampling API, which is not compatible with
                    # earlier design.
                    # https://github.com/flashinfer-ai/flashinfer/releases/
                    # tag/v0.2.3
                    logger.info(
                        "Currently, FlashInfer top-p & top-k sampling sampler "
                        "is disabled because FlashInfer>=v0.2.3 is not "
                        "backward compatible. Falling back to the PyTorch-"
                        "native implementation of top-p & top-k sampling.")
                    self.forward = self.forward_native
                elif envs.VLLM_USE_FLASHINFER_SAMPLER is not False:
                    # NOTE(woosuk): The V0 sampler doesn't use FlashInfer for
                    # sampling unless VLLM_USE_FLASHINFER_SAMPLER=1 (i.e., by
                    # default it is unused). For backward compatibility, we set
                    # `VLLM_USE_FLASHINFER_SAMPLER` as None by default and
                    # interpret it differently in V0 and V1 samplers: In V0,
                    # None means False, while in V1, None means True. This is
                    # why we use the condition
                    # `envs.VLLM_USE_FLASHINFER_SAMPLER is not False` here.
                    logger.info("Using FlashInfer for top-p & top-k sampling.")
                    self.forward = self.forward_cuda
                else:
                    logger.warning(
                        "FlashInfer is available, but it is not enabled. "
                        "Falling back to the PyTorch-native implementation of "
                        "top-p & top-k sampling. For the best performance, "
                        "please set VLLM_USE_FLASHINFER_SAMPLER=1.")
                    self.forward = self.forward_native
            else:
                logger.warning(
                    "FlashInfer is not available. Falling back to the PyTorch-"
                    "native implementation of top-p & top-k sampling. For the "
                    "best performance, please install FlashInfer.")
                self.forward = self.forward_native
From this code, it looks like any version below 0.2.3 should work.
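As a quick sanity check, the guard above compares the version as a plain string, so 0.2.2 takes the FlashInfer path:

# Same comparison as the guard in TopKTopPSampler.__init__
print("0.2.2" >= "0.2.3")  # False -> the FlashInfer sampling path stays enabled
print("0.2.3" >= "0.2.3")  # True  -> falls back to the PyTorch-native implementation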
Uninstall flashinfer-python
pip uninstall flashinfer-python
Install an older version, 0.2.2
pip install flashinfer-python==0.2.2
Restart the server
FlashInfer is finally enabled!
(VllmWorker rank=0 pid=13950) INFO 03-26 15:01:52 [topk_topp_sampler.py:53] Using FlashInfer for top-p & top-k sampling.
But there was still a warning: TORCH_CUDA_ARCH_LIST is not set
(VllmWorker rank=0 pid=13950) /root/autodl-tmp/jacky/envs/vllm/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2059: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(VllmWorker rank=0 pid=13950) If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
Set TORCH_CUDA_ARCH_LIST to 8.9
I'm using a 4090, so TORCH_CUDA_ARCH_LIST should be 8.9
export TORCH_CUDA_ARCH_LIST="8.9"
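If you want to confirm the right value for your own card, you can query the compute capability from PyTorch (a minimal sketch; an RTX 4090 reports (8, 9)):

import torch

# Compute capability of GPU 0; (8, 9) on a 4090 (Ada) maps to
# TORCH_CUDA_ARCH_LIST="8.9".
major, minor = torch.cuda.get_device_capability(0)
print(f"{major}.{minor}")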
Restart the server
Success!