I. TL;DR: conclusions first
A) Parameters: two RTX 4090s, 1024-token input (roughly 1000), 128-token output (the vllm benchmark defaults)
1. Peak concurrent requests in the benchmark: 60+
INFO 03-26 15:19:59 [loggers.py:80] Avg prompt throughput: 3035.9 tokens/s, Avg generation throughput: 91.8 tokens/s, Running: 61 reqs, Waiting: 117 reqs, GPU KV cache usage: 93.6%, Prefix cache hit rate: 16.8%
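2. Comparison before and after enabling FlashInfer
With FlashInfer enabled, overall performance is about the same as in the default PyTorch-native mode; there is no significant gain.
Client-side statistics comparison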
============ Serving Benchmark Result ============
(FlashInfer1 and FlashInfer2: 128-token output; FlashInfer3: 1000-token output)
Metric                           | Native      | FlashInfer1 | FlashInfer2 | FlashInfer3
Successful requests:             | 1000        | 1000        | 1000        | 1000
Benchmark duration (s):          | 459.73      | 448.73      | 449.52      | 1423.79
Total input tokens:              | 1024000     | 1024000     | 1024000     | 1024000
Total generated tokens:          | 125604      | 125604      | 125600      | 967760
Request throughput (req/s):      | 2.18        | 2.23        | 2.22        | 0.70
Output token throughput (tok/s): | 273.22      | 279.91      | 279.41      | 679.70
Total Token throughput (tok/s):  | 2500.63     | 2561.90     | 2557.41     | 1398.91
---------------Time to First Token----------------
Mean TTFT (ms):                  | 227297.09   | 220243.30   | 221009.70   | 674006.33
Median TTFT (ms):                | 228497.44   | 219254.37   | 220160.59   | 695322.65
P99 TTFT (ms):                   | 452492.57   | 441558.31   | 442343.52   | 1384799.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                  | 135.27      | 211.42      | 212.00      | 77.41
Median TPOT (ms):                | 134.80      | 213.51      | 214.87      | 64.62
P99 TPOT (ms):                   | 163.09      | 250.43      | 249.78      | 118.74
---------------Inter-token Latency----------------
Mean ITL (ms):                   | 134.08      | 210.14      | 210.71      | 75.08
Median ITL (ms):                 | 34.00       | 50.69       | 50.77       | 39.88
P99 ITL (ms):                    | 655.67      | 666.94      | 667.86      | 664.38
==================================================
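A note on reading the TTFT rows: the benchmark was launched with --request-rate inf, so all 1000 requests arrive at once and most of each request's TTFT is time spent waiting in the queue. Under the simplifying assumption (mine, not stated in the source) that requests are admitted at a roughly steady rate over the run, mean TTFT should sit near half the benchmark duration and P99 TTFT near the full duration. A quick check against the table values:

```python
# Rough check: with every request queued at t=0 and a steady admission rate
# (an assumption), mean TTFT ~ duration / 2 and P99 TTFT ~ duration.
runs = {
    # name: (benchmark duration in s, mean TTFT in ms, P99 TTFT in ms), from the table
    "Native": (459.73, 227297.09, 452492.57),
    "FlashInfer1": (448.73, 220243.30, 441558.31),
    "FlashInfer2": (449.52, 221009.70, 442343.52),
    "FlashInfer3": (1423.79, 674006.33, 1384799.77),
}


def ttft_ratios(duration_s, mean_ttft_ms, p99_ttft_ms):
    """Return (mean TTFT / duration, P99 TTFT / duration) as fractions."""
    return mean_ttft_ms / 1000 / duration_s, p99_ttft_ms / 1000 / duration_s


ratios = {name: ttft_ratios(*vals) for name, vals in runs.items()}
for name, (mean_frac, p99_frac) in ratios.items():
    print(f"{name:12s} mean TTFT = {mean_frac:.2f} x duration, "
          f"P99 TTFT = {p99_frac:.2f} x duration")
```

All four runs land close to 0.5 and just under 1.0, so at this request rate the TTFT figures mostly reflect queue depth rather than prefill speed.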
Server-side statistics comparison
A chart was generated with pyplot from the three log files produced by these three test runs.
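The plotting script itself is not included in this report. As a minimal sketch of how such a chart could be produced, the code below parses the periodic stats lines in the format shown in the server logs and plots the Running count per run. The function names, the output file name and the choice of plotting only Running are my assumptions, not the author's script.

```python
import re

# Matches the periodic stats lines in the format shown in this report's server logs.
LOG_RE = re.compile(
    r"INFO (?P<ts>\d\d-\d\d \d\d:\d\d:\d\d) .*?"
    r"Avg prompt throughput: (?P<prompt>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs, Waiting: (?P<waiting>\d+) reqs, "
    r"GPU KV cache usage: (?P<kv>[\d.]+)%"
)


def parse_log(lines):
    """Turn stats lines into dicts; lines that do not match are skipped."""
    rows = []
    for line in lines:
        m = LOG_RE.search(line)
        if m:
            rows.append({
                "ts": m["ts"],
                "prompt_tps": float(m["prompt"]),
                "gen_tps": float(m["gen"]),
                "running": int(m["running"]),
                "waiting": int(m["waiting"]),
                "kv_pct": float(m["kv"]),
            })
    return rows


def plot_running(logs, out_path="running_reqs.png"):
    """Plot Running requests per run. `logs` maps a label to parsed rows.

    Requires matplotlib; out_path is a hypothetical file name.
    """
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for label, rows in logs.items():
        # The logs have one sample every 10 s, so the index serves as a time axis.
        ax.plot([10 * i for i in range(len(rows))],
                [r["running"] for r in rows], label=label)
    ax.set_xlabel("seconds since first sample")
    ax.set_ylabel("Running requests")
    ax.legend()
    fig.savefig(out_path)


sample = [
    "Starting initial single prompt test run...",
    "INFO 03-26 14:00:21 [loggers.py:80] Avg prompt throughput: 2478.0 tokens/s, Avg generation throughput: 23.1 tokens/s, Running: 23 reqs, Waiting: 976 reqs, GPU KV cache usage: 54.8%, Prefix cache hit rate: 3.9%",
    "INFO 03-26 15:19:59 [loggers.py:80] Avg prompt throughput: 3035.9 tokens/s, Avg generation throughput: 91.8 tokens/s, Running: 61 reqs, Waiting: 117 reqs, GPU KV cache usage: 93.6%, Prefix cache hit rate: 16.8%",
]
rows = parse_log(sample)
print(rows)
```

With real log files, something like `plot_running({"native": parse_log(open(path1)), "flashinfer": parse_log(open(path2))})` would draw one line per run, where `path1` and `path2` stand for wherever the server output was saved.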
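3. Conclusions
Across three rounds of 1000 requests each, the run without flashinfer took slightly longer in total (about 10 s more: 459 s vs 449 s).
With flashinfer enabled: concurrency reaches about 60 requests, but, limited by the hardware/GPU, per-token latency gets worse (mean TPOT 211 ms vs 135 ms, mean ITL 210 ms vs 134 ms). Mean TTFT is slightly lower (about 220 s vs 227 s).
Output tokens per second, run 1 = 125604/448.73 = 279.91 tok/s
Output tokens per second, run 2 = 125600/449.52 = 279.41 tok/s
Without flashinfer: concurrency stays around 40 requests, but TPOT and ITL are shorter.
Output tokens per second = 125604/459.73 ≈ 273.2 tok/s (the tool reports 273.22)
About flashinfer: judging from these results, enabling it did not reduce the total time for the 1000 requests by much, so is throughput ultimately still bounded by the hardware/GPU?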
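The throughput figures are simply total generated tokens divided by benchmark duration. A short script (variable names are mine) that recomputes them from the client-side numbers and expresses the flashinfer gain as a percentage of the native run:

```python
# name: (total generated tokens, benchmark duration in s) for the 128-token-output runs
runs = {
    "Native": (125604, 459.73),
    "FlashInfer1": (125604, 448.73),
    "FlashInfer2": (125600, 449.52),
}

throughput = {name: tokens / secs for name, (tokens, secs) in runs.items()}
baseline = throughput["Native"]
gain_pct = {name: (tps / baseline - 1) * 100 for name, tps in throughput.items()}

for name, tps in throughput.items():
    print(f"{name:12s} {tps:7.2f} tok/s ({gain_pct[name]:+.1f}% vs Native)")
```

The gain works out to roughly 2 to 3 percent in both flashinfer runs, which matches the observation that total time barely moved.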
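B) Parameters: two RTX 4090s, 1024-token input (roughly 1000), 1000-token output (typical for meeting-summary output)
1. Peak concurrent requests in the benchmark: roughly 40 to 50
2. Data with FlashInfer enabled
Merged into the table above; see the FlashInfer3 column.
3. Conclusions
Raising the requested output length from 128 to 1000 tokens does affect peak concurrency, but not by much. Output tokens per second = 967760/1423.79 ≈ 679.7 tok/s.
This run was done after 5 p.m., close to the end of the workday.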
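One way to tie the concurrency and latency numbers together: if N requests are decoding at once and each emits one token per TPOT, aggregate output throughput is about N / TPOT, so N is about throughput times TPOT. The sketch below applies that to the client-side numbers. It is an estimate under my assumption that running requests spend most of their time in decode; it is not a metric the benchmark reports.

```python
# name: (output token throughput in tok/s, mean TPOT in ms) from the client-side results
runs = {
    "Native": (273.22, 135.27),
    "FlashInfer1": (279.91, 211.42),
    "FlashInfer2": (279.41, 212.00),
    "FlashInfer3": (679.70, 77.41),
}


def estimated_concurrency(output_tps, mean_tpot_ms):
    """Estimate concurrently decoding requests as throughput x per-token latency."""
    return output_tps * mean_tpot_ms / 1000.0


estimates = {name: estimated_concurrency(*vals) for name, vals in runs.items()}
for name, n in estimates.items():
    print(f"{name:12s} ~{n:.1f} concurrent requests")
```

The estimates come out near 37, 59, 59 and 53. The first three agree with the Running counts in the server logs (high 30s without flashinfer, around 60 with it); the last is a little above the 40 to 50 quoted for the 1000-token-output run. Read this way, the longer TPOT with flashinfer is mostly the cost of sharing nearly the same aggregate throughput among more concurrent requests.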
II. Test hardware environment
•Software environment:
PyTorch 2.6.0, Python 3.12 (ubuntu22.04), Cuda 12.4
•Hardware environment:
○GPU: RTX 4090 (24GB) * 2
○CPU: 64 vCPU Intel(R) Xeon(R) Gold 6430
○Memory: 480G (DDR4)
○Disk: 1.8T
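III. Versions tested
vllm: v0.8.1
Model: QwQ-32B-AWQ
IV. Launch commands/parameters
The KIS product's business requirements call for at least two cards (with a single card, text responses get cut off halfway), so only the two-card configuration has been tested so far.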
V. flashinfer not enabled
1. Start the server
vllm serve /root/autodl-tmp/HF_download/hub/models--Qwen--QwQ-32B-AWQ/snapshots/4e95b98be0332075ac9e4eb144d402a5ea8ad4f0 \
--swap-space 16 \
--tensor-parallel-size 2 \
--disable-log-requests
2. Start the client
python benchmark_serving.py \
--backend vllm \
--model /root/autodl-tmp/HF_download/hub/models--Qwen--QwQ-32B-AWQ/snapshots/4e95b98be0332075ac9e4eb144d402a5ea8ad4f0 \
--num-prompts 1000 \
--dataset-name random \
--request-rate inf
3. Results
Client-side statistics
(vllm) root@autodl-container-63b64b9474-071b3374:~/autodl-tmp/jacky/vllm/benchmarks# python benchmark_serving.py --backend vllm --model /root/autodl-tmp/HF_download/hub/models--Qwen--QwQ-32B-AWQ/snapshots/4e95b98be0332075ac9e4eb144d402a5ea8ad4f0 --num-prompts 1000 --dataset-name random --request-rate inf
INFO 03-26 14:00:04 [__init__.py:256] Automatically detected platform cuda.
Namespace(backend='vllm', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='/root/autodl-tmp/HF_download/hub/models--Qwen--QwQ-32B-AWQ/snapshots/4e95b98be0332075ac9e4eb144d402a5ea8ad4f0', tokenizer=None, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [07:39<00:00, 2.18it/s]
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 459.73
Total input tokens: 1024000
Total generated tokens: 125604
Request throughput (req/s): 2.18
Output token throughput (tok/s): 273.22
Total Token throughput (tok/s): 2500.63
---------------Time to First Token----------------
Mean TTFT (ms): 227297.09
Median TTFT (ms): 228497.44
P99 TTFT (ms): 452492.57
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 135.27
Median TPOT (ms): 134.80
P99 TPOT (ms): 163.09
---------------Inter-token Latency----------------
Mean ITL (ms): 134.08
Median ITL (ms): 34.00
P99 ITL (ms): 655.67
==================================================
Server-side statistics
INFO 03-26 14:00:21 [loggers.py:80] Avg prompt throughput: 2478.0 tokens/s, Avg generation throughput: 23.1 tokens/s, Running: 23 reqs, Waiting: 976 reqs, GPU KV cache usage: 54.8%, Prefix cache hit rate: 3.9%
INFO 03-26 14:00:31 [loggers.py:80] Avg prompt throughput: 2159.3 tokens/s, Avg generation throughput: 449.6 tokens/s, Running: 37 reqs, Waiting: 960 reqs, GPU KV cache usage: 95.7%, Prefix cache hit rate: 25.9%
INFO 03-26 14:00:41 [loggers.py:80] Avg prompt throughput: 2605.8 tokens/s, Avg generation throughput: 58.9 tokens/s, Running: 34 reqs, Waiting: 932 reqs, GPU KV cache usage: 83.6%, Prefix cache hit rate: 21.9%
INFO 03-26 14:00:51 [loggers.py:80] Avg prompt throughput: 1926.5 tokens/s, Avg generation throughput: 458.2 tokens/s, Running: 37 reqs, Waiting: 914 reqs, GPU KV cache usage: 94.2%, Prefix cache hit rate: 24.1%
INFO 03-26 14:01:01 [loggers.py:80] Avg prompt throughput: 3117.5 tokens/s, Avg generation throughput: 61.4 tokens/s, Running: 38 reqs, Waiting: 885 reqs, GPU KV cache usage: 90.6%, Prefix cache hit rate: 21.9%
INFO 03-26 14:01:11 [loggers.py:80] Avg prompt throughput: 1807.1 tokens/s, Avg generation throughput: 464.3 tokens/s, Running: 37 reqs, Waiting: 868 reqs, GPU KV cache usage: 91.8%, Prefix cache hit rate: 16.2%
INFO 03-26 14:01:21 [loggers.py:80] Avg prompt throughput: 2970.6 tokens/s, Avg generation throughput: 58.2 tokens/s, Running: 38 reqs, Waiting: 840 reqs, GPU KV cache usage: 90.7%, Prefix cache hit rate: 15.3%
INFO 03-26 14:01:31 [loggers.py:80] Avg prompt throughput: 1576.8 tokens/s, Avg generation throughput: 460.8 tokens/s, Running: 37 reqs, Waiting: 825 reqs, GPU KV cache usage: 92.1%, Prefix cache hit rate: 16.2%
INFO 03-26 14:01:41 [loggers.py:80] Avg prompt throughput: 3231.4 tokens/s, Avg generation throughput: 61.9 tokens/s, Running: 39 reqs, Waiting: 796 reqs, GPU KV cache usage: 93.9%, Prefix cache hit rate: 15.5%
INFO 03-26 14:01:51 [loggers.py:80] Avg prompt throughput: 1660.5 tokens/s, Avg generation throughput: 459.2 tokens/s, Running: 37 reqs, Waiting: 780 reqs, GPU KV cache usage: 93.3%, Prefix cache hit rate: 14.8%
INFO 03-26 14:02:01 [loggers.py:80] Avg prompt throughput: 3214.9 tokens/s, Avg generation throughput: 85.6 tokens/s, Running: 41 reqs, Waiting: 751 reqs, GPU KV cache usage: 99.1%, Prefix cache hit rate: 14.2%
INFO 03-26 14:02:11 [loggers.py:80] Avg prompt throughput: 1641.6 tokens/s, Avg generation throughput: 434.1 tokens/s, Running: 36 reqs, Waiting: 735 reqs, GPU KV cache usage: 90.0%, Prefix cache hit rate: 14.4%
INFO 03-26 14:02:21 [loggers.py:80] Avg prompt throughput: 2656.5 tokens/s, Avg generation throughput: 224.9 tokens/s, Running: 39 reqs, Waiting: 712 reqs, GPU KV cache usage: 96.2%, Prefix cache hit rate: 14.3%
INFO 03-26 14:02:31 [loggers.py:80] Avg prompt throughput: 2148.3 tokens/s, Avg generation throughput: 283.1 tokens/s, Running: 37 reqs, Waiting: 691 reqs, GPU KV cache usage: 91.5%, Prefix cache hit rate: 14.2%
INFO 03-26 14:02:41 [loggers.py:80] Avg prompt throughput: 2229.9 tokens/s, Avg generation throughput: 392.1 tokens/s, Running: 38 reqs, Waiting: 674 reqs, GPU KV cache usage: 99.3%, Prefix cache hit rate: 17.1%
INFO 03-26 14:02:51 [loggers.py:80] Avg prompt throughput: 2528.1 tokens/s, Avg generation throughput: 116.6 tokens/s, Running: 37 reqs, Waiting: 648 reqs, GPU KV cache usage: 91.1%, Prefix cache hit rate: 17.0%
INFO 03-26 14:03:01 [loggers.py:80] Avg prompt throughput: 1834.1 tokens/s, Avg generation throughput: 455.6 tokens/s, Running: 36 reqs, Waiting: 633 reqs, GPU KV cache usage: 93.8%, Prefix cache hit rate: 18.3%
INFO 03-26 14:03:11 [loggers.py:80] Avg prompt throughput: 3006.4 tokens/s, Avg generation throughput: 59.9 tokens/s, Running: 37 reqs, Waiting: 603 reqs, GPU KV cache usage: 87.4%, Prefix cache hit rate: 18.1%
INFO 03-26 14:03:21 [loggers.py:80] Avg prompt throughput: 1738.1 tokens/s, Avg generation throughput: 459.4 tokens/s, Running: 37 reqs, Waiting: 587 reqs, GPU KV cache usage: 93.6%, Prefix cache hit rate: 16.8%
INFO 03-26 14:03:31 [loggers.py:80] Avg prompt throughput: 3026.4 tokens/s, Avg generation throughput: 55.0 tokens/s, Running: 35 reqs, Waiting: 559 reqs, GPU KV cache usage: 83.8%, Prefix cache hit rate: 16.7%
INFO 03-26 14:03:41 [loggers.py:80] Avg prompt throughput: 1626.0 tokens/s, Avg generation throughput: 454.0 tokens/s, Running: 36 reqs, Waiting: 544 reqs, GPU KV cache usage: 92.4%, Prefix cache hit rate: 19.1%
INFO 03-26 14:03:51 [loggers.py:80] Avg prompt throughput: 3090.3 tokens/s, Avg generation throughput: 61.4 tokens/s, Running: 37 reqs, Waiting: 516 reqs, GPU KV cache usage: 90.9%, Prefix cache hit rate: 18.9%
INFO 03-26 14:04:01 [loggers.py:80] Avg prompt throughput: 1877.7 tokens/s, Avg generation throughput: 452.4 tokens/s, Running: 37 reqs, Waiting: 499 reqs, GPU KV cache usage: 94.2%, Prefix cache hit rate: 17.4%
INFO 03-26 14:04:11 [loggers.py:80] Avg prompt throughput: 3088.2 tokens/s, Avg generation throughput: 113.2 tokens/s, Running: 40 reqs, Waiting: 472 reqs, GPU KV cache usage: 98.3%, Prefix cache hit rate: 17.1%
INFO 03-26 14:04:21 [loggers.py:80] Avg prompt throughput: 1846.1 tokens/s, Avg generation throughput: 398.9 tokens/s, Running: 36 reqs, Waiting: 454 reqs, GPU KV cache usage: 89.7%, Prefix cache hit rate: 16.8%
INFO 03-26 14:04:31 [loggers.py:80] Avg prompt throughput: 2459.3 tokens/s, Avg generation throughput: 263.0 tokens/s, Running: 40 reqs, Waiting: 432 reqs, GPU KV cache usage: 97.3%, Prefix cache hit rate: 15.0%
INFO 03-26 14:04:41 [loggers.py:80] Avg prompt throughput: 2273.4 tokens/s, Avg generation throughput: 254.3 tokens/s, Running: 37 reqs, Waiting: 410 reqs, GPU KV cache usage: 91.3%, Prefix cache hit rate: 17.6%
INFO 03-26 14:04:51 [loggers.py:80] Avg prompt throughput: 2010.7 tokens/s, Avg generation throughput: 441.6 tokens/s, Running: 38 reqs, Waiting: 393 reqs, GPU KV cache usage: 99.6%, Prefix cache hit rate: 17.5%
INFO 03-26 14:05:01 [loggers.py:80] Avg prompt throughput: 2979.2 tokens/s, Avg generation throughput: 71.3 tokens/s, Running: 37 reqs, Waiting: 364 reqs, GPU KV cache usage: 88.8%, Prefix cache hit rate: 16.1%
INFO 03-26 14:05:11 [loggers.py:80] Avg prompt throughput: 1514.0 tokens/s, Avg generation throughput: 459.9 tokens/s, Running: 37 reqs, Waiting: 350 reqs, GPU KV cache usage: 94.3%, Prefix cache hit rate: 15.1%
INFO 03-26 14:05:21 [loggers.py:80] Avg prompt throughput: 3255.3 tokens/s, Avg generation throughput: 60.9 tokens/s, Running: 38 reqs, Waiting: 320 reqs, GPU KV cache usage: 92.2%, Prefix cache hit rate: 14.6%
INFO 03-26 14:05:31 [loggers.py:80] Avg prompt throughput: 1611.5 tokens/s, Avg generation throughput: 458.0 tokens/s, Running: 36 reqs, Waiting: 305 reqs, GPU KV cache usage: 90.6%, Prefix cache hit rate: 14.5%
INFO 03-26 14:05:41 [loggers.py:80] Avg prompt throughput: 3013.0 tokens/s, Avg generation throughput: 56.1 tokens/s, Running: 36 reqs, Waiting: 277 reqs, GPU KV cache usage: 85.9%, Prefix cache hit rate: 14.3%
INFO 03-26 14:05:51 [loggers.py:80] Avg prompt throughput: 1906.0 tokens/s, Avg generation throughput: 464.7 tokens/s, Running: 36 reqs, Waiting: 259 reqs, GPU KV cache usage: 89.1%, Prefix cache hit rate: 15.8%
INFO 03-26 14:06:01 [loggers.py:80] Avg prompt throughput: 3138.7 tokens/s, Avg generation throughput: 63.7 tokens/s, Running: 41 reqs, Waiting: 231 reqs, GPU KV cache usage: 98.9%, Prefix cache hit rate: 15.6%
INFO 03-26 14:06:11 [loggers.py:80] Avg prompt throughput: 1599.6 tokens/s, Avg generation throughput: 452.7 tokens/s, Running: 38 reqs, Waiting: 215 reqs, GPU KV cache usage: 92.9%, Prefix cache hit rate: 16.5%
INFO 03-26 14:06:21 [loggers.py:80] Avg prompt throughput: 2788.7 tokens/s, Avg generation throughput: 222.0 tokens/s, Running: 41 reqs, Waiting: 190 reqs, GPU KV cache usage: 98.3%, Prefix cache hit rate: 15.8%
INFO 03-26 14:06:31 [loggers.py:80] Avg prompt throughput: 2046.4 tokens/s, Avg generation throughput: 301.5 tokens/s, Running: 36 reqs, Waiting: 170 reqs, GPU KV cache usage: 89.0%, Prefix cache hit rate: 16.7%
INFO 03-26 14:06:41 [loggers.py:80] Avg prompt throughput: 2338.4 tokens/s, Avg generation throughput: 349.7 tokens/s, Running: 39 reqs, Waiting: 149 reqs, GPU KV cache usage: 99.6%, Prefix cache hit rate: 16.5%
INFO 03-26 14:06:51 [loggers.py:80] Avg prompt throughput: 2695.0 tokens/s, Avg generation throughput: 175.1 tokens/s, Running: 38 reqs, Waiting: 123 reqs, GPU KV cache usage: 91.1%, Prefix cache hit rate: 16.4%
INFO 03-26 14:07:01 [loggers.py:80] Avg prompt throughput: 1738.7 tokens/s, Avg generation throughput: 460.5 tokens/s, Running: 36 reqs, Waiting: 109 reqs, GPU KV cache usage: 93.3%, Prefix cache hit rate: 15.9%
INFO 03-26 14:07:11 [loggers.py:80] Avg prompt throughput: 2890.5 tokens/s, Avg generation throughput: 54.8 tokens/s, Running: 36 reqs, Waiting: 81 reqs, GPU KV cache usage: 88.0%, Prefix cache hit rate: 15.7%
INFO 03-26 14:07:21 [loggers.py:80] Avg prompt throughput: 1845.4 tokens/s, Avg generation throughput: 455.0 tokens/s, Running: 37 reqs, Waiting: 64 reqs, GPU KV cache usage: 94.8%, Prefix cache hit rate: 16.3%
INFO 03-26 14:07:31 [loggers.py:80] Avg prompt throughput: 3026.6 tokens/s, Avg generation throughput: 56.3 tokens/s, Running: 36 reqs, Waiting: 36 reqs, GPU KV cache usage: 86.6%, Prefix cache hit rate: 15.7%
INFO 03-26 14:07:41 [loggers.py:80] Avg prompt throughput: 1663.0 tokens/s, Avg generation throughput: 454.9 tokens/s, Running: 36 reqs, Waiting: 21 reqs, GPU KV cache usage: 91.0%, Prefix cache hit rate: 16.6%
INFO 03-26 14:07:51 [loggers.py:80] Avg prompt throughput: 2428.9 tokens/s, Avg generation throughput: 310.9 tokens/s, Running: 28 reqs, Waiting: 0 reqs, GPU KV cache usage: 74.2%, Prefix cache hit rate: 15.6%
VI. flashinfer enabled (run 1)
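The report does not show how flashinfer was switched on for this run. In vLLM the attention backend can be selected through the VLLM_ATTENTION_BACKEND environment variable, so one plausible way (an assumption, not confirmed by the source) is to start the same server command as in the previous section with VLLM_ATTENTION_BACKEND=FLASHINFER, with the flashinfer package installed. A Python sketch that builds that launch; `build_server_launch` is a hypothetical helper name:

```python
import os
import subprocess

# Snapshot path taken from the commands shown earlier in this report.
MODEL_DIR = (
    "/root/autodl-tmp/HF_download/hub/models--Qwen--QwQ-32B-AWQ/"
    "snapshots/4e95b98be0332075ac9e4eb144d402a5ea8ad4f0"
)


def build_server_launch(model_dir, backend="FLASHINFER"):
    """Return (argv, env) for `vllm serve` with the given attention backend."""
    env = dict(os.environ)
    # Assumption: the backend was chosen via this environment variable.
    env["VLLM_ATTENTION_BACKEND"] = backend
    argv = [
        "vllm", "serve", model_dir,
        "--swap-space", "16",
        "--tensor-parallel-size", "2",
        "--disable-log-requests",
    ]
    return argv, env


argv, env = build_server_launch(MODEL_DIR)
print("VLLM_ATTENTION_BACKEND=" + env["VLLM_ATTENTION_BACKEND"], " ".join(argv))

# To actually start the server (blocks until it exits), uncomment:
# subprocess.run(argv, env=env, check=True)
```

The client command is unchanged between the native and flashinfer runs, as the pasted console output in each section shows.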
Results
Client-side statistics
(vllm) root@autodl-container-63b64b9474-071b3374:~/autodl-tmp/jacky/vllm/benchmarks# python benchmark_serving.py --backend vllm --model /root/autodl-tmp/HF_download/hub/models--Qwen--QwQ-32B-AWQ/snapshots/4e95b98be0332075ac9e4eb144d402a5ea8ad4f0 --num-prompts 1000 --dataset-name random --request-rate inf
INFO 03-26 15:13:19 [__init__.py:256] Automatically detected platform cuda.
Namespace(backend='vllm', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=None, model='/root/autodl-tmp/HF_download/hub/models--Qwen--QwQ-32B-AWQ/snapshots/4e95b98be0332075ac9e4eb144d402a5ea8ad4f0', tokenizer=None, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto', served_model_name=None, lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [07:28<00:00, 2.23it/s]
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 448.73
Total input tokens: 1024000
Total generated tokens: 125604
Request throughput (req/s): 2.23
Output token throughput (tok/s): 279.91
Total Token throughput (tok/s): 2561.90
---------------Time to First Token----------------
Mean TTFT (ms): 220243.30
Median TTFT (ms): 219254.37
P99 TTFT (ms): 441558.31
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 211.42
Median TPOT (ms): 213.51
P99 TPOT (ms): 250.43
---------------Inter-token Latency----------------
Mean ITL (ms): 210.14
Median ITL (ms): 50.69
P99 ITL (ms): 666.94
==================================================
Server-side statistics
INFO 03-26 15:13:39 [loggers.py:80] Avg prompt throughput: 3034.1 tokens/s, Avg generation throughput: 22.7 tokens/s, Running: 29 reqs, Waiting: 970 reqs, GPU KV cache usage: 42.8%, Prefix cache hit rate: 3.2%
INFO 03-26 15:13:49 [loggers.py:80] Avg prompt throughput: 3000.8 tokens/s, Avg generation throughput: 63.1 tokens/s, Running: 56 reqs, Waiting: 942 reqs, GPU KV cache usage: 83.4%, Prefix cache hit rate: 1.7%
INFO 03-26 15:13:59 [loggers.py:80] Avg prompt throughput: 1418.0 tokens/s, Avg generation throughput: 635.4 tokens/s, Running: 59 reqs, Waiting: 931 reqs, GPU KV cache usage: 95.4%, Prefix cache hit rate: 20.5%
INFO 03-26 15:14:09 [loggers.py:80] Avg prompt throughput: 2813.4 tokens/s, Avg generation throughput: 90.5 tokens/s, Running: 60 reqs, Waiting: 903 reqs, GPU KV cache usage: 93.3%, Prefix cache hit rate: 17.5%
INFO 03-26 15:14:19 [loggers.py:80] Avg prompt throughput: 2961.6 tokens/s, Avg generation throughput: 92.3 tokens/s, Running: 63 reqs, Waiting: 875 reqs, GPU KV cache usage: 94.5%, Prefix cache hit rate: 15.4%
INFO 03-26 15:14:29 [loggers.py:80] Avg prompt throughput: 1594.2 tokens/s, Avg generation throughput: 654.9 tokens/s, Running: 61 reqs, Waiting: 860 reqs, GPU KV cache usage: 95.7%, Prefix cache hit rate: 16.3%
INFO 03-26 15:14:39 [loggers.py:80] Avg prompt throughput: 2979.6 tokens/s, Avg generation throughput: 92.5 tokens/s, Running: 60 reqs, Waiting: 832 reqs, GPU KV cache usage: 90.8%, Prefix cache hit rate: 15.1%
INFO 03-26 15:14:49 [loggers.py:80] Avg prompt throughput: 2977.9 tokens/s, Avg generation throughput: 90.6 tokens/s, Running: 61 reqs, Waiting: 804 reqs, GPU KV cache usage: 89.7%, Prefix cache hit rate: 14.0%
INFO 03-26 15:14:59 [loggers.py:80] Avg prompt throughput: 1328.1 tokens/s, Avg generation throughput: 644.5 tokens/s, Running: 60 reqs, Waiting: 792 reqs, GPU KV cache usage: 94.9%, Prefix cache hit rate: 19.0%
INFO 03-26 15:15:09 [loggers.py:80] Avg prompt throughput: 2914.3 tokens/s, Avg generation throughput: 92.4 tokens/s, Running: 62 reqs, Waiting: 765 reqs, GPU KV cache usage: 96.5%, Prefix cache hit rate: 18.0%
INFO 03-26 15:15:19 [loggers.py:80] Avg prompt throughput: 2915.5 tokens/s, Avg generation throughput: 94.8 tokens/s, Running: 63 reqs, Waiting: 738 reqs, GPU KV cache usage: 95.6%, Prefix cache hit rate: 17.1%
INFO 03-26 15:15:29 [loggers.py:80] Avg prompt throughput: 1470.1 tokens/s, Avg generation throughput: 644.3 tokens/s, Running: 59 reqs, Waiting: 725 reqs, GPU KV cache usage: 94.2%, Prefix cache hit rate: 17.1%
INFO 03-26 15:15:39 [loggers.py:80] Avg prompt throughput: 3043.1 tokens/s, Avg generation throughput: 89.4 tokens/s, Running: 59 reqs, Waiting: 697 reqs, GPU KV cache usage: 91.3%, Prefix cache hit rate: 16.4%
INFO 03-26 15:15:49 [loggers.py:80] Avg prompt throughput: 2867.4 tokens/s, Avg generation throughput: 153.8 tokens/s, Running: 63 reqs, Waiting: 672 reqs, GPU KV cache usage: 96.9%, Prefix cache hit rate: 15.5%
INFO 03-26 15:15:59 [loggers.py:80] Avg prompt throughput: 1561.3 tokens/s, Avg generation throughput: 573.8 tokens/s, Running: 59 reqs, Waiting: 657 reqs, GPU KV cache usage: 94.0%, Prefix cache hit rate: 16.8%
INFO 03-26 15:16:09 [loggers.py:80] Avg prompt throughput: 2908.3 tokens/s, Avg generation throughput: 90.2 tokens/s, Running: 59 reqs, Waiting: 630 reqs, GPU KV cache usage: 92.1%, Prefix cache hit rate: 16.2%
INFO 03-26 15:16:19 [loggers.py:80] Avg prompt throughput: 2578.7 tokens/s, Avg generation throughput: 307.6 tokens/s, Running: 65 reqs, Waiting: 607 reqs, GPU KV cache usage: 99.1%, Prefix cache hit rate: 15.5%
INFO 03-26 15:16:29 [loggers.py:80] Avg prompt throughput: 1848.0 tokens/s, Avg generation throughput: 429.5 tokens/s, Running: 59 reqs, Waiting: 589 reqs, GPU KV cache usage: 92.6%, Prefix cache hit rate: 15.4%
INFO 03-26 15:16:39 [loggers.py:80] Avg prompt throughput: 2891.1 tokens/s, Avg generation throughput: 88.3 tokens/s, Running: 56 reqs, Waiting: 562 reqs, GPU KV cache usage: 86.1%, Prefix cache hit rate: 15.0%
INFO 03-26 15:16:49 [loggers.py:80] Avg prompt throughput: 2524.1 tokens/s, Avg generation throughput: 318.6 tokens/s, Running: 65 reqs, Waiting: 540 reqs, GPU KV cache usage: 99.6%, Prefix cache hit rate: 14.7%
INFO 03-26 15:16:59 [loggers.py:80] Avg prompt throughput: 1781.9 tokens/s, Avg generation throughput: 402.8 tokens/s, Running: 59 reqs, Waiting: 523 reqs, GPU KV cache usage: 94.3%, Prefix cache hit rate: 16.5%
INFO 03-26 15:17:09 [loggers.py:80] Avg prompt throughput: 2982.4 tokens/s, Avg generation throughput: 92.8 tokens/s, Running: 61 reqs, Waiting: 496 reqs, GPU KV cache usage: 94.9%, Prefix cache hit rate: 16.1%
INFO 03-26 15:17:19 [loggers.py:80] Avg prompt throughput: 2183.1 tokens/s, Avg generation throughput: 529.5 tokens/s, Running: 61 reqs, Waiting: 480 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 18.4%
INFO 03-26 15:17:29 [loggers.py:80] Avg prompt throughput: 2276.2 tokens/s, Avg generation throughput: 197.7 tokens/s, Running: 60 reqs, Waiting: 455 reqs, GPU KV cache usage: 94.3%, Prefix cache hit rate: 17.2%
INFO 03-26 15:17:39 [loggers.py:80] Avg prompt throughput: 2905.9 tokens/s, Avg generation throughput: 89.1 tokens/s, Running: 59 reqs, Waiting: 428 reqs, GPU KV cache usage: 90.0%, Prefix cache hit rate: 16.5%
INFO 03-26 15:17:49 [loggers.py:80] Avg prompt throughput: 1735.8 tokens/s, Avg generation throughput: 635.5 tokens/s, Running: 61 reqs, Waiting: 415 reqs, GPU KV cache usage: 98.5%, Prefix cache hit rate: 18.7%
INFO 03-26 15:17:59 [loggers.py:80] Avg prompt throughput: 2442.7 tokens/s, Avg generation throughput: 95.7 tokens/s, Running: 58 reqs, Waiting: 390 reqs, GPU KV cache usage: 91.5%, Prefix cache hit rate: 17.7%
INFO 03-26 15:18:09 [loggers.py:80] Avg prompt throughput: 3186.7 tokens/s, Avg generation throughput: 96.0 tokens/s, Running: 58 reqs, Waiting: 360 reqs, GPU KV cache usage: 88.4%, Prefix cache hit rate: 17.3%
INFO 03-26 15:18:19 [loggers.py:80] Avg prompt throughput: 1301.0 tokens/s, Avg generation throughput: 632.8 tokens/s, Running: 59 reqs, Waiting: 350 reqs, GPU KV cache usage: 95.3%, Prefix cache hit rate: 17.8%
INFO 03-26 15:18:29 [loggers.py:80] Avg prompt throughput: 2833.2 tokens/s, Avg generation throughput: 91.2 tokens/s, Running: 60 reqs, Waiting: 322 reqs, GPU KV cache usage: 93.0%, Prefix cache hit rate: 16.9%
INFO 03-26 15:18:39 [loggers.py:80] Avg prompt throughput: 3006.7 tokens/s, Avg generation throughput: 91.9 tokens/s, Running: 61 reqs, Waiting: 294 reqs, GPU KV cache usage: 91.9%, Prefix cache hit rate: 16.9%
INFO 03-26 15:18:49 [loggers.py:80] Avg prompt throughput: 1405.9 tokens/s, Avg generation throughput: 651.1 tokens/s, Running: 61 reqs, Waiting: 281 reqs, GPU KV cache usage: 96.9%, Prefix cache hit rate: 18.6%
INFO 03-26 15:18:59 [loggers.py:80] Avg prompt throughput: 2988.2 tokens/s, Avg generation throughput: 89.9 tokens/s, Running: 59 reqs, Waiting: 253 reqs, GPU KV cache usage: 90.4%, Prefix cache hit rate: 17.9%
INFO 03-26 15:19:09 [loggers.py:80] Avg prompt throughput: 2907.1 tokens/s, Avg generation throughput: 89.0 tokens/s, Running: 60 reqs, Waiting: 226 reqs, GPU KV cache usage: 90.5%, Prefix cache hit rate: 17.8%
INFO 03-26 15:19:19 [loggers.py:80] Avg prompt throughput: 1379.2 tokens/s, Avg generation throughput: 648.0 tokens/s, Running: 60 reqs, Waiting: 214 reqs, GPU KV cache usage: 95.6%, Prefix cache hit rate: 17.4%
INFO 03-26 15:19:29 [loggers.py:80] Avg prompt throughput: 2898.4 tokens/s, Avg generation throughput: 92.9 tokens/s, Running: 60 reqs, Waiting: 186 reqs, GPU KV cache usage: 92.8%, Prefix cache hit rate: 17.1%
INFO 03-26 15:19:39 [loggers.py:80] Avg prompt throughput: 3020.6 tokens/s, Avg generation throughput: 91.7 tokens/s, Running: 63 reqs, Waiting: 158 reqs, GPU KV cache usage: 94.2%, Prefix cache hit rate: 16.8%
INFO 03-26 15:19:49 [loggers.py:80] Avg prompt throughput: 1369.9 tokens/s, Avg generation throughput: 647.9 tokens/s, Running: 60 reqs, Waiting: 145 reqs, GPU KV cache usage: 95.2%, Prefix cache hit rate: 17.1%
INFO 03-26 15:19:59 [loggers.py:80] Avg prompt throughput: 3035.9 tokens/s, Avg generation throughput: 91.8 tokens/s, Running: 61 reqs, Waiting: 117 reqs, GPU KV cache usage: 93.6%, Prefix cache hit rate: 16.8%
INFO 03-26 15:20:09 [loggers.py:80] Avg prompt throughput: 2986.3 tokens/s, Avg generation throughput: 94.0 tokens/s, Running: 66 reqs, Waiting: 90 reqs, GPU KV cache usage: 99.3%, Prefix cache hit rate: 16.8%
INFO 03-26 15:20:19 [loggers.py:80] Avg prompt throughput: 1196.0 tokens/s, Avg generation throughput: 637.1 tokens/s, Running: 59 reqs, Waiting: 79 reqs, GPU KV cache usage: 94.1%, Prefix cache hit rate: 16.4%
INFO 03-26 15:20:29 [loggers.py:80] Avg prompt throughput: 2931.8 tokens/s, Avg generation throughput: 90.0 tokens/s, Running: 59 reqs, Waiting: 52 reqs, GPU KV cache usage: 92.2%, Prefix cache hit rate: 15.3%
INFO 03-26 15:20:39 [loggers.py:80] Avg prompt throughput: 2941.2 tokens/s, Avg generation throughput: 182.0 tokens/s, Running: 64 reqs, Waiting: 26 reqs, GPU KV cache usage: 97.6%, Prefix cache hit rate: 15.6%
INFO 03-26 15:20:49 [loggers.py:80] Avg prompt throughput: 1522.4 tokens/s, Avg generation throughput: 546.3 tokens/s, Running: 60 reqs, Waiting: 11 reqs, GPU KV cache usage: 94.9%, Prefix cache hit rate: 14.1%
INFO 03-26 15:20:59 [loggers.py:80] Avg prompt throughput: 1353.1 tokens/s, Avg generation throughput: 384.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 13.9%
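VII. flashinfer enabled (run 2)
Comparing the results of run 1 with the run without flashinfer, I found them a bit hard to believe, so I ran it again.
Results
Client-side statistics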
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 449.52
Total input tokens: 1024000
Total generated tokens: 125600
Request throughput (req/s): 2.22
Output token throughput (tok/s): 279.41
Total Token throughput (tok/s): 2557.41
---------------Time to First Token----------------
Mean TTFT (ms): 221009.70
Median TTFT (ms): 220160.59
P99 TTFT (ms): 442343.52
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 212.00
Median TPOT (ms): 214.87
P99 TPOT (ms): 249.78
---------------Inter-token Latency----------------
Mean ITL (ms): 210.71
Median ITL (ms): 50.77
P99 ITL (ms): 667.86
==================================================
Server-side statistics
INFO 03-26 15:36:59 [loggers.py:80] Avg prompt throughput: 3240.2 tokens/s, Avg generation throughput: 31.5 tokens/s, Running: 34 reqs, Waiting: 965 reqs, GPU KV cache usage: 51.1%, Prefix cache hit rate: 13.3%
INFO 03-26 15:37:09 [loggers.py:80] Avg prompt throughput: 3021.2 tokens/s, Avg generation throughput: 71.0 tokens/s, Running: 59 reqs, Waiting: 937 reqs, GPU KV cache usage: 88.8%, Prefix cache hit rate: 13.3%
INFO 03-26 15:37:19 [loggers.py:80] Avg prompt throughput: 1185.1 tokens/s, Avg generation throughput: 630.3 tokens/s, Running: 60 reqs, Waiting: 926 reqs, GPU KV cache usage: 95.4%, Prefix cache hit rate: 13.3%
INFO 03-26 15:37:29 [loggers.py:80] Avg prompt throughput: 3137.3 tokens/s, Avg generation throughput: 97.5 tokens/s, Running: 61 reqs, Waiting: 897 reqs, GPU KV cache usage: 94.3%, Prefix cache hit rate: 13.2%
INFO 03-26 15:37:39 [loggers.py:80] Avg prompt throughput: 2980.1 tokens/s, Avg generation throughput: 93.6 tokens/s, Running: 65 reqs, Waiting: 869 reqs, GPU KV cache usage: 97.4%, Prefix cache hit rate: 13.2%
INFO 03-26 15:37:49 [loggers.py:80] Avg prompt throughput: 1253.4 tokens/s, Avg generation throughput: 648.3 tokens/s, Running: 60 reqs, Waiting: 857 reqs, GPU KV cache usage: 94.6%, Prefix cache hit rate: 12.6%
INFO 03-26 15:37:59 [loggers.py:80] Avg prompt throughput: 2978.0 tokens/s, Avg generation throughput: 91.5 tokens/s, Running: 59 reqs, Waiting: 829 reqs, GPU KV cache usage: 90.1%, Prefix cache hit rate: 12.1%
INFO 03-26 15:38:09 [loggers.py:80] Avg prompt throughput: 2991.2 tokens/s, Avg generation throughput: 90.6 tokens/s, Running: 64 reqs, Waiting: 801 reqs, GPU KV cache usage: 95.3%, Prefix cache hit rate: 12.1%
INFO 03-26 15:38:19 [loggers.py:80] Avg prompt throughput: 1225.7 tokens/s, Avg generation throughput: 639.2 tokens/s, Running: 60 reqs, Waiting: 790 reqs, GPU KV cache usage: 95.7%, Prefix cache hit rate: 14.2%
INFO 03-26 15:38:29 [loggers.py:80] Avg prompt throughput: 3118.1 tokens/s, Avg generation throughput: 99.5 tokens/s, Running: 63 reqs, Waiting: 761 reqs, GPU KV cache usage: 96.6%, Prefix cache hit rate: 14.1%
INFO 03-26 15:38:39 [loggers.py:80] Avg prompt throughput: 2943.1 tokens/s, Avg generation throughput: 153.7 tokens/s, Running: 65 reqs, Waiting: 735 reqs, GPU KV cache usage: 99.0%, Prefix cache hit rate: 14.3%
INFO 03-26 15:38:49 [loggers.py:80] Avg prompt throughput: 1359.3 tokens/s, Avg generation throughput: 579.0 tokens/s, Running: 60 reqs, Waiting: 722 reqs, GPU KV cache usage: 95.4%, Prefix cache hit rate: 15.4%
INFO 03-26 15:38:59 [loggers.py:80] Avg prompt throughput: 2912.6 tokens/s, Avg generation throughput: 90.7 tokens/s, Running: 59 reqs, Waiting: 695 reqs, GPU KV cache usage: 92.3%, Prefix cache hit rate: 15.4%
INFO 03-26 15:39:09 [loggers.py:80] Avg prompt throughput: 2875.5 tokens/s, Avg generation throughput: 220.2 tokens/s, Running: 65 reqs, Waiting: 670 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 15.8%
INFO 03-26 15:39:19 [loggers.py:80] Avg prompt throughput: 1554.9 tokens/s, Avg generation throughput: 503.7 tokens/s, Running: 60 reqs, Waiting: 655 reqs, GPU KV cache usage: 95.1%, Prefix cache hit rate: 15.6%
INFO 03-26 15:39:29 [loggers.py:80] Avg prompt throughput: 2908.7 tokens/s, Avg generation throughput: 91.4 tokens/s, Running: 60 reqs, Waiting: 628 reqs, GPU KV cache usage: 93.4%, Prefix cache hit rate: 15.6%
INFO 03-26 15:39:39 [loggers.py:80] Avg prompt throughput: 2364.6 tokens/s, Avg generation throughput: 418.3 tokens/s, Running: 64 reqs, Waiting: 608 reqs, GPU KV cache usage: 99.9%, Prefix cache hit rate: 16.0%
INFO 03-26 15:39:49 [loggers.py:80] Avg prompt throughput: 2058.7 tokens/s, Avg generation throughput: 319.0 tokens/s, Running: 59 reqs, Waiting: 587 reqs, GPU KV cache usage: 91.9%, Prefix cache hit rate: 15.9%
INFO 03-26 15:39:59 [loggers.py:80] Avg prompt throughput: 2917.2 tokens/s, Avg generation throughput: 87.9 tokens/s, Running: 57 reqs, Waiting: 560 reqs, GPU KV cache usage: 87.0%, Prefix cache hit rate: 15.8%
INFO 03-26 15:40:09 [loggers.py:80] Avg prompt throughput: 2287.3 tokens/s, Avg generation throughput: 420.9 tokens/s, Running: 63 reqs, Waiting: 542 reqs, GPU KV cache usage: 99.4%, Prefix cache hit rate: 16.0%
INFO 03-26 15:40:19 [loggers.py:80] Avg prompt throughput: 2006.7 tokens/s, Avg generation throughput: 300.5 tokens/s, Running: 60 reqs, Waiting: 521 reqs, GPU KV cache usage: 95.6%, Prefix cache hit rate: 17.0%
INFO 03-26 15:40:29 [loggers.py:80] Avg prompt throughput: 2971.5 tokens/s, Avg generation throughput: 93.0 tokens/s, Running: 61 reqs, Waiting: 494 reqs, GPU KV cache usage: 94.5%, Prefix cache hit rate: 16.6%
INFO 03-26 15:40:39 [loggers.py:80] Avg prompt throughput: 1969.3 tokens/s, Avg generation throughput: 625.3 tokens/s, Running: 60 reqs, Waiting: 480 reqs, GPU KV cache usage: 98.7%, Prefix cache hit rate: 17.3%
INFO 03-26 15:40:49 [loggers.py:80] Avg prompt throughput: 2402.3 tokens/s, Avg generation throughput: 101.6 tokens/s, Running: 59 reqs, Waiting: 454 reqs, GPU KV cache usage: 93.8%, Prefix cache hit rate: 17.0%
INFO 03-26 15:40:59 [loggers.py:80] Avg prompt throughput: 2994.0 tokens/s, Avg generation throughput: 89.1 tokens/s, Running: 60 reqs, Waiting: 426 reqs, GPU KV cache usage: 91.2%, Prefix cache hit rate: 16.6%
INFO 03-26 15:41:09 [loggers.py:80] Avg prompt throughput: 1521.5 tokens/s, Avg generation throughput: 647.6 tokens/s, Running: 59 reqs, Waiting: 413 reqs, GPU KV cache usage: 94.6%, Prefix cache hit rate: 17.7%
INFO 03-26 15:41:19 [loggers.py:80] Avg prompt throughput: 2876.8 tokens/s, Avg generation throughput: 89.4 tokens/s, Running: 59 reqs, Waiting: 386 reqs, GPU KV cache usage: 92.4%, Prefix cache hit rate: 16.8%
INFO 03-26 15:41:29 [loggers.py:80] Avg prompt throughput: 2968.5 tokens/s, Avg generation throughput: 90.0 tokens/s, Running: 59 reqs, Waiting: 358 reqs, GPU KV cache usage: 89.6%, Prefix cache hit rate: 15.9%
INFO 03-26 15:41:39 [loggers.py:80] Avg prompt throughput: 1327.4 tokens/s, Avg generation throughput: 638.9 tokens/s, Running: 60 reqs, Waiting: 346 reqs, GPU KV cache usage: 96.0%, Prefix cache hit rate: 18.2%
INFO 03-26 15:41:49 [loggers.py:80] Avg prompt throughput: 3017.9 tokens/s, Avg generation throughput: 91.3 tokens/s, Running: 60 reqs, Waiting: 318 reqs, GPU KV cache usage: 92.1%, Prefix cache hit rate: 17.5%
INFO 03-26 15:41:59 [loggers.py:80] Avg prompt throughput: 2899.0 tokens/s, Avg generation throughput: 92.0 tokens/s, Running: 62 reqs, Waiting: 291 reqs, GPU KV cache usage: 94.1%, Prefix cache hit rate: 17.0%
INFO 03-26 15:42:09 [loggers.py:80] Avg prompt throughput: 1295.7 tokens/s, Avg generation throughput: 644.8 tokens/s, Running: 60 reqs, Waiting: 279 reqs, GPU KV cache usage: 94.8%, Prefix cache hit rate: 17.5%
INFO 03-26 15:42:19 [loggers.py:80] Avg prompt throughput: 3097.0 tokens/s, Avg generation throughput: 95.6 tokens/s, Running: 58 reqs, Waiting: 250 reqs, GPU KV cache usage: 89.6%, Prefix cache hit rate: 16.7%
INFO 03-26 15:42:29 [loggers.py:80] Avg prompt throughput: 3017.0 tokens/s, Avg generation throughput: 89.5 tokens/s, Running: 64 reqs, Waiting: 222 reqs, GPU KV cache usage: 96.0%, Prefix cache hit rate: 16.6%
INFO 03-26 15:42:39 [loggers.py:80] Avg prompt throughput: 1058.0 tokens/s, Avg generation throughput: 641.7 tokens/s, Running: 60 reqs, Waiting: 212 reqs, GPU KV cache usage: 95.4%, Prefix cache hit rate: 17.0%
INFO 03-26 15:42:49 [loggers.py:80] Avg prompt throughput: 3000.9 tokens/s, Avg generation throughput: 92.9 tokens/s, Running: 60 reqs, Waiting: 184 reqs, GPU KV cache usage: 92.4%, Prefix cache hit rate: 15.9%
INFO 03-26 15:42:59 [loggers.py:80] Avg prompt throughput: 3021.6 tokens/s, Avg generation throughput: 92.0 tokens/s, Running: 65 reqs, Waiting: 156 reqs, GPU KV cache usage: 97.0%, Prefix cache hit rate: 15.8%
INFO 03-26 15:43:09 [loggers.py:80] Avg prompt throughput: 1373.1 tokens/s, Avg generation throughput: 647.8 tokens/s, Running: 60 reqs, Waiting: 143 reqs, GPU KV cache usage: 94.9%, Prefix cache hit rate: 15.4%
INFO 03-26 15:43:19 [loggers.py:80] Avg prompt throughput: 3037.2 tokens/s, Avg generation throughput: 91.9 tokens/s, Running: 62 reqs, Waiting: 115 reqs, GPU KV cache usage: 94.5%, Prefix cache hit rate: 15.2%
INFO 03-26 15:43:29 [loggers.py:80] Avg prompt throughput: 2878.5 tokens/s, Avg generation throughput: 178.8 tokens/s, Running: 65 reqs, Waiting: 90 reqs, GPU KV cache usage: 98.9%, Prefix cache hit rate: 15.4%
INFO 03-26 15:43:39 [loggers.py:80] Avg prompt throughput: 1418.7 tokens/s, Avg generation throughput: 558.2 tokens/s, Running: 59 reqs, Waiting: 76 reqs, GPU KV cache usage: 94.9%, Prefix cache hit rate: 15.6%
INFO 03-26 15:43:49 [loggers.py:80] Avg prompt throughput: 3026.0 tokens/s, Avg generation throughput: 90.0 tokens/s, Running: 59 reqs, Waiting: 48 reqs, GPU KV cache usage: 91.1%, Prefix cache hit rate: 14.7%
INFO 03-26 15:43:59 [loggers.py:80] Avg prompt throughput: 2626.3 tokens/s, Avg generation throughput: 254.5 tokens/s, Running: 64 reqs, Waiting: 25 reqs, GPU KV cache usage: 97.6%, Prefix cache hit rate: 14.8%
INFO 03-26 15:44:09 [loggers.py:80] Avg prompt throughput: 1520.0 tokens/s, Avg generation throughput: 467.8 tokens/s, Running: 59 reqs, Waiting: 10 reqs, GPU KV cache usage: 94.6%, Prefix cache hit rate: 15.2%
INFO 03-26 15:44:19 [loggers.py:80] Avg prompt throughput: 1247.3 tokens/s, Avg generation throughput: 376.9 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 13.5%
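The server-side lines above all share one fixed layout, so they can be turned into numbers for plotting or averaging. The following is a minimal sketch, assuming the log format stays exactly as shown here; the names `LOG_RE` and `parse_stats` are hypothetical helpers, not part of vLLM. The three sample lines are copied from the end of the log above.

```python
import re
from statistics import mean

# Regex for the periodic stats line, written against the fields visible in the log above.
# LOG_RE and parse_stats are illustrative names, not vLLM APIs.
LOG_RE = re.compile(
    r"INFO (?P<date>\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) \[loggers\.py:\d+\] "
    r"Avg prompt throughput: (?P<prompt_tps>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<gen_tps>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs, Waiting: (?P<waiting>\d+) reqs, "
    r"GPU KV cache usage: (?P<kv_pct>[\d.]+)%, "
    r"Prefix cache hit rate: (?P<prefix_pct>[\d.]+)%"
)


def parse_stats(lines):
    """Return one dict per stats line; lines that do not match are skipped."""
    rows = []
    for line in lines:
        m = LOG_RE.search(line)
        if m is None:
            continue
        rows.append({
            "time": m["time"],
            "prompt_tps": float(m["prompt_tps"]),
            "gen_tps": float(m["gen_tps"]),
            "running": int(m["running"]),
            "waiting": int(m["waiting"]),
            "kv_pct": float(m["kv_pct"]),
            "prefix_pct": float(m["prefix_pct"]),
        })
    return rows


# Last three lines of the run-2 server log, copied verbatim.
sample = [
    "INFO 03-26 15:43:59 [loggers.py:80] Avg prompt throughput: 2626.3 tokens/s, Avg generation throughput: 254.5 tokens/s, Running: 64 reqs, Waiting: 25 reqs, GPU KV cache usage: 97.6%, Prefix cache hit rate: 14.8%",
    "INFO 03-26 15:44:09 [loggers.py:80] Avg prompt throughput: 1520.0 tokens/s, Avg generation throughput: 467.8 tokens/s, Running: 59 reqs, Waiting: 10 reqs, GPU KV cache usage: 94.6%, Prefix cache hit rate: 15.2%",
    "INFO 03-26 15:44:19 [loggers.py:80] Avg prompt throughput: 1247.3 tokens/s, Avg generation throughput: 376.9 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 13.5%",
]

rows = parse_stats(sample)
peak_running = max(r["running"] for r in rows)
mean_gen_tps = mean(r["gen_tps"] for r in rows)
print(f"lines parsed: {len(rows)}, peak running: {peak_running}, mean gen tok/s: {mean_gen_tps:.1f}")
```

In practice the `sample` list would be replaced by the lines of a saved log file; the resulting rows can then be passed to pyplot or summarized per run.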
VIII. FlashInfer enabled (run 3)
Runs 1 and 2 both capped the output at 128 tokens per request. This run raises the output length to see how the results change.
Command
python benchmark_serving.py --backend vllm --model /root/autodl-tmp/HF_download/hub/models--Qwen--QwQ-32B-AWQ/snapshots/4e95b98be0332075ac9e4eb144d402a5ea8ad4f0 --num-prompts 1000 --dataset-name random --request-rate inf --random_output_len 1024
Results
Client-side statistics
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 1423.79
Total input tokens: 1024000
Total generated tokens: 967760
Request throughput (req/s): 0.70
Output token throughput (tok/s): 679.70
Total Token throughput (tok/s): 1398.91
---------------Time to First Token----------------
Mean TTFT (ms): 674006.33
Median TTFT (ms): 695322.65
P99 TTFT (ms): 1384799.77
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 77.41
Median TPOT (ms): 64.62
P99 TPOT (ms): 118.74
---------------Inter-token Latency----------------
Mean ITL (ms): 75.08
Median ITL (ms): 39.88
P99 ITL (ms): 664.38
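The throughput figures in this report follow directly from the token counts and the benchmark duration. The sketch below recomputes them as a sanity check; the variable names are illustrative, and small differences in the last digit are expected because the duration is itself a rounded value.

```python
# Figures copied from the client-side report above.
duration_s = 1423.79
num_requests = 1000
input_tokens = 1024000
output_tokens = 967760

# Requests, output tokens, and all tokens divided by wall-clock duration.
request_throughput = num_requests / duration_s
output_throughput = output_tokens / duration_s
total_throughput = (input_tokens + output_tokens) / duration_s

# Average number of tokens actually generated per request (the cap was 1024).
mean_output_len = output_tokens / num_requests

print(f"request throughput: {request_throughput:.2f} req/s")
print(f"output throughput:  {output_throughput:.2f} tok/s")
print(f"total throughput:   {total_throughput:.2f} tok/s")
print(f"mean output length: {mean_output_len:.2f} tokens/request")
```

Note that the output throughput is the whole-run rate across all concurrent requests, not the speed seen by a single request; the per-request decoding speed is better read from the TPOT and ITL rows.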
Server-side statistics
INFO 03-26 17:03:09 [loggers.py:80] Avg prompt throughput: 2478.1 tokens/s, Avg generation throughput: 22.3 tokens/s, Running: 23 reqs, Waiting: 976 reqs, GPU KV cache usage: 34.3%, Prefix cache hit rate: 50.8%
INFO 03-26 17:03:19 [loggers.py:80] Avg prompt throughput: 3018.5 tokens/s, Avg generation throughput: 55.0 tokens/s, Running: 50 reqs, Waiting: 948 reqs, GPU KV cache usage: 75.0%, Prefix cache hit rate: 50.8%
INFO 03-26 17:03:29 [loggers.py:80] Avg prompt throughput: 2063.2 tokens/s, Avg generation throughput: 572.1 tokens/s, Running: 61 reqs, Waiting: 935 reqs, GPU KV cache usage: 98.5%, Prefix cache hit rate: 50.9%
INFO 03-26 17:03:39 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1299.4 tokens/s, Running: 52 reqs, Waiting: 943 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 50.5%
INFO 03-26 17:03:49 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1184.6 tokens/s, Running: 44 reqs, Waiting: 951 reqs, GPU KV cache usage: 98.7%, Prefix cache hit rate: 50.3%
INFO 03-26 17:03:59 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1044.3 tokens/s, Running: 38 reqs, Waiting: 956 reqs, GPU KV cache usage: 97.4%, Prefix cache hit rate: 50.1%
INFO 03-26 17:04:09 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 587.1 tokens/s, Running: 32 reqs, Waiting: 951 reqs, GPU KV cache usage: 85.8%, Prefix cache hit rate: 50.3%
INFO 03-26 17:04:19 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 45.1 tokens/s, Running: 29 reqs, Waiting: 929 reqs, GPU KV cache usage: 57.9%, Prefix cache hit rate: 50.3%
INFO 03-26 17:04:29 [loggers.py:80] Avg prompt throughput: 3026.0 tokens/s, Avg generation throughput: 132.0 tokens/s, Running: 56 reqs, Waiting: 902 reqs, GPU KV cache usage: 99.0%, Prefix cache hit rate: 50.3%
INFO 03-26 17:04:39 [loggers.py:80] Avg prompt throughput: 208.8 tokens/s, Avg generation throughput: 988.7 tokens/s, Running: 52 reqs, Waiting: 902 reqs, GPU KV cache usage: 99.7%, Prefix cache hit rate: 50.1%
INFO 03-26 17:04:49 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 966.9 tokens/s, Running: 45 reqs, Waiting: 906 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 50.1%
INFO 03-26 17:04:59 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 786.3 tokens/s, Running: 43 reqs, Waiting: 903 reqs, GPU KV cache usage: 99.8%, Prefix cache hit rate: 50.1%
INFO 03-26 17:05:09 [loggers.py:80] Avg prompt throughput: 321.6 tokens/s, Avg generation throughput: 702.8 tokens/s, Running: 42 reqs, Waiting: 897 reqs, GPU KV cache usage: 99.1%, Prefix cache hit rate: 50.0%
INFO 03-26 17:05:19 [loggers.py:80] Avg prompt throughput: 636.0 tokens/s, Avg generation throughput: 701.8 tokens/s, Running: 42 reqs, Waiting: 891 reqs, GPU KV cache usage: 99.7%, Prefix cache hit rate: 50.1%
INFO 03-26 17:05:29 [loggers.py:80] Avg prompt throughput: 1918.0 tokens/s, Avg generation throughput: 349.9 tokens/s, Running: 42 reqs, Waiting: 872 reqs, GPU KV cache usage: 79.8%, Prefix cache hit rate: 50.0%
INFO 03-26 17:05:39 [loggers.py:80] Avg prompt throughput: 2972.7 tokens/s, Avg generation throughput: 72.7 tokens/s, Running: 61 reqs, Waiting: 844 reqs, GPU KV cache usage: 97.3%, Prefix cache hit rate: 50.0%
INFO 03-26 17:05:49 [loggers.py:80] Avg prompt throughput: 422.8 tokens/s, Avg generation throughput: 1070.2 tokens/s, Running: 54 reqs, Waiting: 847 reqs, GPU KV cache usage: 96.6%, Prefix cache hit rate: 50.0%
INFO 03-26 17:05:59 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1080.0 tokens/s, Running: 48 reqs, Waiting: 851 reqs, GPU KV cache usage: 99.3%, Prefix cache hit rate: 50.0%
INFO 03-26 17:06:09 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1088.6 tokens/s, Running: 41 reqs, Waiting: 857 reqs, GPU KV cache usage: 98.0%, Prefix cache hit rate: 50.0%
INFO 03-26 17:06:19 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 636.8 tokens/s, Running: 38 reqs, Waiting: 853 reqs, GPU KV cache usage: 95.7%, Prefix cache hit rate: 50.2%
INFO 03-26 17:06:29 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 527.6 tokens/s, Running: 39 reqs, Waiting: 844 reqs, GPU KV cache usage: 97.0%, Prefix cache hit rate: 50.2%
INFO 03-26 17:06:39 [loggers.py:80] Avg prompt throughput: 2020.7 tokens/s, Avg generation throughput: 252.1 tokens/s, Running: 40 reqs, Waiting: 821 reqs, GPU KV cache usage: 76.4%, Prefix cache hit rate: 50.1%
INFO 03-26 17:06:49 [loggers.py:80] Avg prompt throughput: 3021.9 tokens/s, Avg generation throughput: 139.9 tokens/s, Running: 60 reqs, Waiting: 794 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 50.1%
INFO 03-26 17:06:59 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1136.9 tokens/s, Running: 54 reqs, Waiting: 797 reqs, GPU KV cache usage: 99.6%, Prefix cache hit rate: 50.2%
INFO 03-26 17:07:09 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 983.5 tokens/s, Running: 47 reqs, Waiting: 801 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 50.2%
INFO 03-26 17:07:19 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 946.1 tokens/s, Running: 43 reqs, Waiting: 802 reqs, GPU KV cache usage: 99.5%, Prefix cache hit rate: 50.2%
INFO 03-26 17:07:29 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 836.3 tokens/s, Running: 39 reqs, Waiting: 802 reqs, GPU KV cache usage: 99.9%, Prefix cache hit rate: 50.2%
INFO 03-26 17:07:39 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 577.7 tokens/s, Running: 35 reqs, Waiting: 797 reqs, GPU KV cache usage: 91.7%, Prefix cache hit rate: 50.1%
INFO 03-26 17:07:49 [loggers.py:80] Avg prompt throughput: 2706.0 tokens/s, Avg generation throughput: 56.9 tokens/s, Running: 36 reqs, Waiting: 768 reqs, GPU KV cache usage: 59.5%, Prefix cache hit rate: 50.0%
INFO 03-26 17:07:59 [loggers.py:80] Avg prompt throughput: 3014.1 tokens/s, Avg generation throughput: 74.9 tokens/s, Running: 63 reqs, Waiting: 740 reqs, GPU KV cache usage: 98.7%, Prefix cache hit rate: 49.9%
INFO 03-26 17:08:09 [loggers.py:80] Avg prompt throughput: 109.9 tokens/s, Avg generation throughput: 1258.4 tokens/s, Running: 54 reqs, Waiting: 748 reqs, GPU KV cache usage: 98.9%, Prefix cache hit rate: 49.8%
INFO 03-26 17:08:19 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1062.3 tokens/s, Running: 47 reqs, Waiting: 752 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 49.9%
INFO 03-26 17:08:29 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 951.6 tokens/s, Running: 41 reqs, Waiting: 755 reqs, GPU KV cache usage: 97.7%, Prefix cache hit rate: 49.9%
INFO 03-26 17:08:39 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 845.3 tokens/s, Running: 37 reqs, Waiting: 756 reqs, GPU KV cache usage: 97.6%, Prefix cache hit rate: 49.8%
INFO 03-26 17:08:49 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 446.1 tokens/s, Running: 34 reqs, Waiting: 746 reqs, GPU KV cache usage: 85.7%, Prefix cache hit rate: 49.8%
INFO 03-26 17:08:59 [loggers.py:80] Avg prompt throughput: 2242.1 tokens/s, Avg generation throughput: 52.6 tokens/s, Running: 38 reqs, Waiting: 719 reqs, GPU KV cache usage: 64.4%, Prefix cache hit rate: 49.7%
INFO 03-26 17:09:09 [loggers.py:80] Avg prompt throughput: 2803.9 tokens/s, Avg generation throughput: 239.5 tokens/s, Running: 61 reqs, Waiting: 694 reqs, GPU KV cache usage: 99.3%, Prefix cache hit rate: 49.5%
INFO 03-26 17:09:19 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1169.7 tokens/s, Running: 51 reqs, Waiting: 702 reqs, GPU KV cache usage: 98.3%, Prefix cache hit rate: 49.6%
INFO 03-26 17:09:29 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1064.1 tokens/s, Running: 45 reqs, Waiting: 706 reqs, GPU KV cache usage: 99.2%, Prefix cache hit rate: 49.2%
INFO 03-26 17:09:39 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 883.9 tokens/s, Running: 41 reqs, Waiting: 706 reqs, GPU KV cache usage: 99.4%, Prefix cache hit rate: 49.0%
INFO 03-26 17:09:49 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 697.1 tokens/s, Running: 37 reqs, Waiting: 704 reqs, GPU KV cache usage: 95.4%, Prefix cache hit rate: 48.9%
INFO 03-26 17:09:59 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 514.1 tokens/s, Running: 35 reqs, Waiting: 694 reqs, GPU KV cache usage: 85.7%, Prefix cache hit rate: 48.9%
INFO 03-26 17:10:09 [loggers.py:80] Avg prompt throughput: 2879.9 tokens/s, Avg generation throughput: 55.2 tokens/s, Running: 40 reqs, Waiting: 667 reqs, GPU KV cache usage: 66.7%, Prefix cache hit rate: 48.7%
INFO 03-26 17:10:19 [loggers.py:80] Avg prompt throughput: 2439.0 tokens/s, Avg generation throughput: 435.9 tokens/s, Running: 58 reqs, Waiting: 649 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 48.2%
INFO 03-26 17:10:29 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1114.8 tokens/s, Running: 50 reqs, Waiting: 655 reqs, GPU KV cache usage: 96.9%, Prefix cache hit rate: 48.6%
INFO 03-26 17:10:39 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1055.1 tokens/s, Running: 45 reqs, Waiting: 658 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 48.1%
INFO 03-26 17:10:49 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 957.7 tokens/s, Running: 39 reqs, Waiting: 662 reqs, GPU KV cache usage: 97.9%, Prefix cache hit rate: 48.0%
INFO 03-26 17:10:59 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 766.9 tokens/s, Running: 36 reqs, Waiting: 659 reqs, GPU KV cache usage: 97.0%, Prefix cache hit rate: 48.0%
INFO 03-26 17:11:09 [loggers.py:80] Avg prompt throughput: 433.7 tokens/s, Avg generation throughput: 230.8 tokens/s, Running: 34 reqs, Waiting: 641 reqs, GPU KV cache usage: 74.4%, Prefix cache hit rate: 47.1%
INFO 03-26 17:11:19 [loggers.py:80] Avg prompt throughput: 2988.6 tokens/s, Avg generation throughput: 57.8 tokens/s, Running: 49 reqs, Waiting: 613 reqs, GPU KV cache usage: 80.3%, Prefix cache hit rate: 46.9%
INFO 03-26 17:11:29 [loggers.py:80] Avg prompt throughput: 1630.4 tokens/s, Avg generation throughput: 741.9 tokens/s, Running: 58 reqs, Waiting: 603 reqs, GPU KV cache usage: 99.6%, Prefix cache hit rate: 47.5%
INFO 03-26 17:11:39 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1124.6 tokens/s, Running: 49 reqs, Waiting: 610 reqs, GPU KV cache usage: 99.6%, Prefix cache hit rate: 45.9%
INFO 03-26 17:11:49 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 995.9 tokens/s, Running: 42 reqs, Waiting: 614 reqs, GPU KV cache usage: 95.9%, Prefix cache hit rate: 44.9%
INFO 03-26 17:11:59 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 849.6 tokens/s, Running: 39 reqs, Waiting: 613 reqs, GPU KV cache usage: 97.1%, Prefix cache hit rate: 46.8%
INFO 03-26 17:12:09 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 652.9 tokens/s, Running: 38 reqs, Waiting: 609 reqs, GPU KV cache usage: 99.6%, Prefix cache hit rate: 44.4%
INFO 03-26 17:12:19 [loggers.py:80] Avg prompt throughput: 967.0 tokens/s, Avg generation throughput: 242.1 tokens/s, Running: 34 reqs, Waiting: 589 reqs, GPU KV cache usage: 69.2%, Prefix cache hit rate: 43.2%
INFO 03-26 17:12:29 [loggers.py:80] Avg prompt throughput: 3019.6 tokens/s, Avg generation throughput: 58.2 tokens/s, Running: 50 reqs, Waiting: 561 reqs, GPU KV cache usage: 80.6%, Prefix cache hit rate: 42.9%
INFO 03-26 17:12:39 [loggers.py:80] Avg prompt throughput: 1520.0 tokens/s, Avg generation throughput: 724.8 tokens/s, Running: 58 reqs, Waiting: 552 reqs, GPU KV cache usage: 99.1%, Prefix cache hit rate: 41.6%
INFO 03-26 17:12:49 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1152.5 tokens/s, Running: 49 reqs, Waiting: 560 reqs, GPU KV cache usage: 98.9%, Prefix cache hit rate: 41.4%
INFO 03-26 17:12:59 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1021.5 tokens/s, Running: 43 reqs, Waiting: 564 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 41.4%
INFO 03-26 17:13:09 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 804.9 tokens/s, Running: 40 reqs, Waiting: 563 reqs, GPU KV cache usage: 99.4%, Prefix cache hit rate: 41.5%
INFO 03-26 17:13:19 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 718.1 tokens/s, Running: 37 reqs, Waiting: 561 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 41.6%
INFO 03-26 17:13:29 [loggers.py:80] Avg prompt throughput: 875.9 tokens/s, Avg generation throughput: 170.9 tokens/s, Running: 37 reqs, Waiting: 539 reqs, GPU KV cache usage: 75.4%, Prefix cache hit rate: 41.6%
INFO 03-26 17:13:39 [loggers.py:80] Avg prompt throughput: 2989.0 tokens/s, Avg generation throughput: 62.4 tokens/s, Running: 52 reqs, Waiting: 512 reqs, GPU KV cache usage: 84.5%, Prefix cache hit rate: 41.5%
INFO 03-26 17:13:49 [loggers.py:80] Avg prompt throughput: 1106.6 tokens/s, Avg generation throughput: 824.2 tokens/s, Running: 56 reqs, Waiting: 507 reqs, GPU KV cache usage: 99.4%, Prefix cache hit rate: 41.3%
INFO 03-26 17:13:59 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1221.3 tokens/s, Running: 47 reqs, Waiting: 516 reqs, GPU KV cache usage: 99.1%, Prefix cache hit rate: 41.6%
INFO 03-26 17:14:09 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 946.4 tokens/s, Running: 41 reqs, Waiting: 519 reqs, GPU KV cache usage: 95.7%, Prefix cache hit rate: 41.7%
INFO 03-26 17:14:19 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 788.6 tokens/s, Running: 39 reqs, Waiting: 517 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 41.6%
INFO 03-26 17:14:29 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 636.2 tokens/s, Running: 37 reqs, Waiting: 513 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 41.7%
INFO 03-26 17:14:39 [loggers.py:80] Avg prompt throughput: 1419.1 tokens/s, Avg generation throughput: 138.7 tokens/s, Running: 35 reqs, Waiting: 489 reqs, GPU KV cache usage: 66.4%, Prefix cache hit rate: 41.5%
INFO 03-26 17:14:49 [loggers.py:80] Avg prompt throughput: 2987.3 tokens/s, Avg generation throughput: 66.6 tokens/s, Running: 57 reqs, Waiting: 462 reqs, GPU KV cache usage: 94.7%, Prefix cache hit rate: 41.4%
INFO 03-26 17:14:59 [loggers.py:80] Avg prompt throughput: 617.6 tokens/s, Avg generation throughput: 991.3 tokens/s, Running: 55 reqs, Waiting: 461 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 41.0%
INFO 03-26 17:15:09 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1144.9 tokens/s, Running: 46 reqs, Waiting: 469 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 41.4%
INFO 03-26 17:15:19 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 921.5 tokens/s, Running: 42 reqs, Waiting: 470 reqs, GPU KV cache usage: 98.7%, Prefix cache hit rate: 41.4%
INFO 03-26 17:15:29 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 779.1 tokens/s, Running: 38 reqs, Waiting: 470 reqs, GPU KV cache usage: 98.2%, Prefix cache hit rate: 41.4%
INFO 03-26 17:15:39 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 643.1 tokens/s, Running: 35 reqs, Waiting: 465 reqs, GPU KV cache usage: 94.5%, Prefix cache hit rate: 41.7%
INFO 03-26 17:15:49 [loggers.py:80] Avg prompt throughput: 2050.8 tokens/s, Avg generation throughput: 54.2 tokens/s, Running: 36 reqs, Waiting: 437 reqs, GPU KV cache usage: 61.2%, Prefix cache hit rate: 41.6%
INFO 03-26 17:15:59 [loggers.py:80] Avg prompt throughput: 2912.5 tokens/s, Avg generation throughput: 158.9 tokens/s, Running: 61 reqs, Waiting: 411 reqs, GPU KV cache usage: 98.9%, Prefix cache hit rate: 41.4%
INFO 03-26 17:16:09 [loggers.py:80] Avg prompt throughput: 236.0 tokens/s, Avg generation throughput: 1137.7 tokens/s, Running: 54 reqs, Waiting: 415 reqs, GPU KV cache usage: 98.7%, Prefix cache hit rate: 41.5%
INFO 03-26 17:16:19 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1110.6 tokens/s, Running: 47 reqs, Waiting: 420 reqs, GPU KV cache usage: 99.3%, Prefix cache hit rate: 41.6%
INFO 03-26 17:16:29 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 893.0 tokens/s, Running: 41 reqs, Waiting: 422 reqs, GPU KV cache usage: 96.7%, Prefix cache hit rate: 41.8%
INFO 03-26 17:16:39 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 820.1 tokens/s, Running: 38 reqs, Waiting: 421 reqs, GPU KV cache usage: 97.3%, Prefix cache hit rate: 41.4%
INFO 03-26 17:16:49 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 517.5 tokens/s, Running: 35 reqs, Waiting: 413 reqs, GPU KV cache usage: 88.7%, Prefix cache hit rate: 41.3%
INFO 03-26 17:16:59 [loggers.py:80] Avg prompt throughput: 2423.8 tokens/s, Avg generation throughput: 52.4 tokens/s, Running: 37 reqs, Waiting: 386 reqs, GPU KV cache usage: 62.0%, Prefix cache hit rate: 41.2%
INFO 03-26 17:17:09 [loggers.py:80] Avg prompt throughput: 2859.5 tokens/s, Avg generation throughput: 223.6 tokens/s, Running: 61 reqs, Waiting: 361 reqs, GPU KV cache usage: 98.6%, Prefix cache hit rate: 40.9%
INFO 03-26 17:17:19 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1157.1 tokens/s, Running: 52 reqs, Waiting: 368 reqs, GPU KV cache usage: 98.9%, Prefix cache hit rate: 41.2%
INFO 03-26 17:17:29 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1074.1 tokens/s, Running: 46 reqs, Waiting: 372 reqs, GPU KV cache usage: 99.9%, Prefix cache hit rate: 41.3%
INFO 03-26 17:17:39 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 921.2 tokens/s, Running: 41 reqs, Waiting: 374 reqs, GPU KV cache usage: 99.4%, Prefix cache hit rate: 41.5%
INFO 03-26 17:17:49 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 671.9 tokens/s, Running: 39 reqs, Waiting: 369 reqs, GPU KV cache usage: 98.9%, Prefix cache hit rate: 41.2%
INFO 03-26 17:17:59 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 544.9 tokens/s, Running: 36 reqs, Waiting: 360 reqs, GPU KV cache usage: 86.0%, Prefix cache hit rate: 41.5%
INFO 03-26 17:18:09 [loggers.py:80] Avg prompt throughput: 2845.1 tokens/s, Avg generation throughput: 56.0 tokens/s, Running: 43 reqs, Waiting: 333 reqs, GPU KV cache usage: 72.5%, Prefix cache hit rate: 41.3%
INFO 03-26 17:18:19 [loggers.py:80] Avg prompt throughput: 2265.2 tokens/s, Avg generation throughput: 451.2 tokens/s, Running: 59 reqs, Waiting: 315 reqs, GPU KV cache usage: 99.6%, Prefix cache hit rate: 41.0%
INFO 03-26 17:18:29 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1146.8 tokens/s, Running: 52 reqs, Waiting: 320 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 41.0%
INFO 03-26 17:18:39 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1051.5 tokens/s, Running: 45 reqs, Waiting: 325 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 41.5%
INFO 03-26 17:18:49 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1005.5 tokens/s, Running: 39 reqs, Waiting: 328 reqs, GPU KV cache usage: 96.7%, Prefix cache hit rate: 41.5%
INFO 03-26 17:18:59 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 688.1 tokens/s, Running: 37 reqs, Waiting: 325 reqs, GPU KV cache usage: 98.2%, Prefix cache hit rate: 41.7%
INFO 03-26 17:19:09 [loggers.py:80] Avg prompt throughput: 417.1 tokens/s, Avg generation throughput: 273.5 tokens/s, Running: 35 reqs, Waiting: 308 reqs, GPU KV cache usage: 77.7%, Prefix cache hit rate: 41.6%
INFO 03-26 17:19:19 [loggers.py:80] Avg prompt throughput: 3018.3 tokens/s, Avg generation throughput: 58.2 tokens/s, Running: 48 reqs, Waiting: 280 reqs, GPU KV cache usage: 78.6%, Prefix cache hit rate: 41.6%
INFO 03-26 17:19:29 [loggers.py:80] Avg prompt throughput: 1814.6 tokens/s, Avg generation throughput: 609.7 tokens/s, Running: 58 reqs, Waiting: 267 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 41.3%
INFO 03-26 17:19:39 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1175.7 tokens/s, Running: 49 reqs, Waiting: 275 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 41.7%
INFO 03-26 17:19:49 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 982.4 tokens/s, Running: 45 reqs, Waiting: 276 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 41.7%
INFO 03-26 17:19:59 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 835.6 tokens/s, Running: 40 reqs, Waiting: 277 reqs, GPU KV cache usage: 98.7%, Prefix cache hit rate: 41.8%
INFO 03-26 17:20:09 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 775.3 tokens/s, Running: 36 reqs, Waiting: 277 reqs, GPU KV cache usage: 96.4%, Prefix cache hit rate: 41.8%
INFO 03-26 17:20:19 [loggers.py:80] Avg prompt throughput: 519.2 tokens/s, Avg generation throughput: 231.8 tokens/s, Running: 34 reqs, Waiting: 258 reqs, GPU KV cache usage: 73.0%, Prefix cache hit rate: 42.0%
INFO 03-26 17:20:29 [loggers.py:80] Avg prompt throughput: 3239.4 tokens/s, Avg generation throughput: 61.4 tokens/s, Running: 50 reqs, Waiting: 228 reqs, GPU KV cache usage: 79.2%, Prefix cache hit rate: 41.8%
INFO 03-26 17:20:39 [loggers.py:80] Avg prompt throughput: 1595.5 tokens/s, Avg generation throughput: 676.7 tokens/s, Running: 59 reqs, Waiting: 218 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 41.8%
INFO 03-26 17:20:49 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1188.4 tokens/s, Running: 50 reqs, Waiting: 226 reqs, GPU KV cache usage: 99.1%, Prefix cache hit rate: 42.2%
INFO 03-26 17:20:59 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1104.5 tokens/s, Running: 43 reqs, Waiting: 232 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 42.3%
INFO 03-26 17:21:09 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 834.7 tokens/s, Running: 40 reqs, Waiting: 231 reqs, GPU KV cache usage: 99.4%, Prefix cache hit rate: 42.3%
INFO 03-26 17:21:19 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 657.4 tokens/s, Running: 38 reqs, Waiting: 227 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 42.0%
INFO 03-26 17:21:29 [loggers.py:80] Avg prompt throughput: 635.4 tokens/s, Avg generation throughput: 191.5 tokens/s, Running: 35 reqs, Waiting: 207 reqs, GPU KV cache usage: 70.4%, Prefix cache hit rate: 42.0%
INFO 03-26 17:21:39 [loggers.py:80] Avg prompt throughput: 3003.4 tokens/s, Avg generation throughput: 61.8 tokens/s, Running: 53 reqs, Waiting: 179 reqs, GPU KV cache usage: 86.4%, Prefix cache hit rate: 41.8%
INFO 03-26 17:21:49 [loggers.py:80] Avg prompt throughput: 1097.5 tokens/s, Avg generation throughput: 833.4 tokens/s, Running: 56 reqs, Waiting: 173 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 42.2%
INFO 03-26 17:21:59 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1059.8 tokens/s, Running: 49 reqs, Waiting: 178 reqs, GPU KV cache usage: 99.0%, Prefix cache hit rate: 42.6%
INFO 03-26 17:22:09 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 973.6 tokens/s, Running: 43 reqs, Waiting: 181 reqs, GPU KV cache usage: 98.5%, Prefix cache hit rate: 42.4%
INFO 03-26 17:22:19 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 856.1 tokens/s, Running: 39 reqs, Waiting: 180 reqs, GPU KV cache usage: 96.4%, Prefix cache hit rate: 42.4%
INFO 03-26 17:22:29 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 598.8 tokens/s, Running: 39 reqs, Waiting: 174 reqs, GPU KV cache usage: 99.6%, Prefix cache hit rate: 42.5%
INFO 03-26 17:22:39 [loggers.py:80] Avg prompt throughput: 1496.0 tokens/s, Avg generation throughput: 282.2 tokens/s, Running: 37 reqs, Waiting: 155 reqs, GPU KV cache usage: 73.1%, Prefix cache hit rate: 42.2%
INFO 03-26 17:22:49 [loggers.py:80] Avg prompt throughput: 3002.8 tokens/s, Avg generation throughput: 64.6 tokens/s, Running: 55 reqs, Waiting: 127 reqs, GPU KV cache usage: 87.9%, Prefix cache hit rate: 42.2%
INFO 03-26 17:22:59 [loggers.py:80] Avg prompt throughput: 1090.0 tokens/s, Avg generation throughput: 963.7 tokens/s, Running: 55 reqs, Waiting: 126 reqs, GPU KV cache usage: 99.0%, Prefix cache hit rate: 42.1%
INFO 03-26 17:23:09 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1088.8 tokens/s, Running: 50 reqs, Waiting: 129 reqs, GPU KV cache usage: 100.0%, Prefix cache hit rate: 42.1%
INFO 03-26 17:23:19 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1055.9 tokens/s, Running: 43 reqs, Waiting: 134 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 42.3%
INFO 03-26 17:23:29 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 864.9 tokens/s, Running: 39 reqs, Waiting: 135 reqs, GPU KV cache usage: 99.9%, Prefix cache hit rate: 42.3%
INFO 03-26 17:23:39 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 611.0 tokens/s, Running: 35 reqs, Waiting: 131 reqs, GPU KV cache usage: 94.5%, Prefix cache hit rate: 42.3%
INFO 03-26 17:23:49 [loggers.py:80] Avg prompt throughput: 1312.1 tokens/s, Avg generation throughput: 51.1 tokens/s, Running: 32 reqs, Waiting: 105 reqs, GPU KV cache usage: 56.7%, Prefix cache hit rate: 42.2%
INFO 03-26 17:23:59 [loggers.py:80] Avg prompt throughput: 3199.7 tokens/s, Avg generation throughput: 71.9 tokens/s, Running: 59 reqs, Waiting: 76 reqs, GPU KV cache usage: 96.1%, Prefix cache hit rate: 42.1%
INFO 03-26 17:24:09 [loggers.py:80] Avg prompt throughput: 332.6 tokens/s, Avg generation throughput: 1148.3 tokens/s, Running: 54 reqs, Waiting: 80 reqs, GPU KV cache usage: 99.5%, Prefix cache hit rate: 42.0%
INFO 03-26 17:24:19 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1129.4 tokens/s, Running: 46 reqs, Waiting: 87 reqs, GPU KV cache usage: 99.1%, Prefix cache hit rate: 42.5%
INFO 03-26 17:24:29 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 885.3 tokens/s, Running: 41 reqs, Waiting: 88 reqs, GPU KV cache usage: 98.4%, Prefix cache hit rate: 42.7%
INFO 03-26 17:24:39 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 774.0 tokens/s, Running: 38 reqs, Waiting: 87 reqs, GPU KV cache usage: 98.9%, Prefix cache hit rate: 42.9%
INFO 03-26 17:24:49 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 547.1 tokens/s, Running: 37 reqs, Waiting: 80 reqs, GPU KV cache usage: 97.1%, Prefix cache hit rate: 42.6%
INFO 03-26 17:24:59 [loggers.py:80] Avg prompt throughput: 2270.3 tokens/s, Avg generation throughput: 56.8 tokens/s, Running: 38 reqs, Waiting: 52 reqs, GPU KV cache usage: 63.9%, Prefix cache hit rate: 42.6%
INFO 03-26 17:25:09 [loggers.py:80] Avg prompt throughput: 2941.3 tokens/s, Avg generation throughput: 210.2 tokens/s, Running: 61 reqs, Waiting: 26 reqs, GPU KV cache usage: 99.2%, Prefix cache hit rate: 42.4%
INFO 03-26 17:25:19 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1220.8 tokens/s, Running: 51 reqs, Waiting: 35 reqs, GPU KV cache usage: 98.3%, Prefix cache hit rate: 42.4%
INFO 03-26 17:25:29 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1043.3 tokens/s, Running: 46 reqs, Waiting: 38 reqs, GPU KV cache usage: 99.9%, Prefix cache hit rate: 42.8%
INFO 03-26 17:25:39 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 911.4 tokens/s, Running: 41 reqs, Waiting: 40 reqs, GPU KV cache usage: 98.7%, Prefix cache hit rate: 42.7%
INFO 03-26 17:25:49 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 728.2 tokens/s, Running: 38 reqs, Waiting: 38 reqs, GPU KV cache usage: 98.3%, Prefix cache hit rate: 42.6%
INFO 03-26 17:25:59 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 476.0 tokens/s, Running: 34 reqs, Waiting: 28 reqs, GPU KV cache usage: 83.9%, Prefix cache hit rate: 42.6%
INFO 03-26 17:26:09 [loggers.py:80] Avg prompt throughput: 2875.4 tokens/s, Avg generation throughput: 62.2 tokens/s, Running: 41 reqs, Waiting: 0 reqs, GPU KV cache usage: 69.5%, Prefix cache hit rate: 42.4%
INFO 03-26 17:26:19 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1129.9 tokens/s, Running: 39 reqs, Waiting: 0 reqs, GPU KV cache usage: 76.6%, Prefix cache hit rate: 42.6%
INFO 03-26 17:26:29 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1059.8 tokens/s, Running: 37 reqs, Waiting: 0 reqs, GPU KV cache usage: 87.0%, Prefix cache hit rate: 42.7%
INFO 03-26 17:26:39 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 962.9 tokens/s, Running: 29 reqs, Waiting: 0 reqs, GPU KV cache usage: 75.6%, Prefix cache hit rate: 42.6%
INFO 03-26 17:26:49 [loggers.py:80] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 401.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 42.7%
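The server-side stats lines above share one fixed layout, so the per-interval metrics can be pulled out with a regular expression before plotting them (for example with pyplot). The following is a minimal sketch that assumes the exact line format shown in these logs; the names `LOG_PATTERN` and `parse_vllm_log_line` are hypothetical helpers, not part of vLLM.

```python
import re
from typing import Optional

# Matches the periodic stats line as it appears in the logs above.
# Assumption: the field order and wording are exactly as shown in this document.
LOG_PATTERN = re.compile(
    r"INFO (?P<date>\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) \[loggers\.py:\d+\] "
    r"Avg prompt throughput: (?P<prompt_tps>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<gen_tps>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs, "
    r"Waiting: (?P<waiting>\d+) reqs, "
    r"GPU KV cache usage: (?P<kv_usage>[\d.]+)%, "
    r"Prefix cache hit rate: (?P<prefix_hit>[\d.]+)%"
)


def parse_vllm_log_line(line: str) -> Optional[dict]:
    """Return the metrics of one stats line as a dict, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "time": match["time"],
        "prompt_tps": float(match["prompt_tps"]),
        "gen_tps": float(match["gen_tps"]),
        "running": int(match["running"]),
        "waiting": int(match["waiting"]),
        "kv_usage_pct": float(match["kv_usage"]),
        "prefix_hit_pct": float(match["prefix_hit"]),
    }


# One line copied from the log above, used as a sample input.
sample = (
    "INFO 03-26 17:25:09 [loggers.py:80] Avg prompt throughput: 2941.3 tokens/s, "
    "Avg generation throughput: 210.2 tokens/s, Running: 61 reqs, Waiting: 26 reqs, "
    "GPU KV cache usage: 99.2%, Prefix cache hit rate: 42.4%"
)
record = parse_vllm_log_line(sample)
print(record)
```

Applying `parse_vllm_log_line` to every line of a log file and dropping the `None` results yields a list of records whose `running`, `gen_tps`, and `kv_usage_pct` fields can be plotted against `time`.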
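Appendix 1: How to enable FlashInfer
When starting vllm, check the startup log carefully: if FlashInfer is not enabled, overall performance can be affected significantly.
If you see a warning like the one below, refer to this article: vllm 优化:flashinfer问题 (vLLM optimization: FlashInfer issues)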
[topk_topp_sampler.py:63] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
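A quick way to spot this situation is to check whether the FlashInfer module can be imported in the environment that runs vllm, and to scan a saved startup log for the fallback warning. The sketch below is only an illustration: it assumes the Python import name is `flashinfer` and that the warning text matches the line shown above; `flashinfer_importable` and `log_reports_fallback` are hypothetical helper names.

```python
import importlib.util

# Substring taken from the warning line shown above.
FALLBACK_MARKER = "FlashInfer is not available"


def flashinfer_importable() -> bool:
    """True if a module named 'flashinfer' (assumed import name) can be found."""
    return importlib.util.find_spec("flashinfer") is not None


def log_reports_fallback(log_text: str) -> bool:
    """True if any line of the startup log contains the fallback warning."""
    return any(FALLBACK_MARKER in line for line in log_text.splitlines())


# The warning line from this document, used as a sample startup log.
sample_log = (
    "[topk_topp_sampler.py:63] FlashInfer is not available. Falling back to the "
    "PyTorch-native implementation of top-p & top-k sampling. For the best "
    "performance, please install FlashInfer."
)

print("flashinfer importable:", flashinfer_importable())
print("log reports fallback:", log_reports_fallback(sample_log))
```

If the module is importable but the log still reports the fallback, the installed package may not match the running environment; the article referenced above covers that case.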