Welcome to vLLM!

Easy, fast, and cheap LLM serving for everyone
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:

- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache (see the usage sketch after this list)
- Optimized CUDA kernels
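To make these features concrete, here is a minimal offline-inference sketch, assuming vLLM is installed and a GPU is available. The AWQ-quantized checkpoint name and the sampling settings are illustrative placeholders, not anything prescribed by this page; PagedAttention, continuous batching, and CUDA/HIP graph execution are applied by the engine automatically, while quantization is opted into explicitly.

```python
# Minimal offline-inference sketch. The model checkpoint below is a placeholder;
# any supported AWQ-quantized HuggingFace model can be substituted.
from vllm import LLM, SamplingParams

# PagedAttention, continuous batching, and CUDA/HIP graph capture are handled
# internally by the engine; quantization is selected via the constructor.
llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["The capital of France is"], sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)
```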
vLLM is flexible and easy to use with:

- Seamless integration with popular HuggingFace models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server (see the client sketch after this list)
- Support for NVIDIA GPUs and AMD GPUs
- (Experimental) Prefix caching support
- (Experimental) Multi-LoRA support
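The OpenAI-compatible server listed above can be queried with the standard openai Python client. The sketch below assumes the server has already been started locally (for example with `python -m vllm.entrypoints.openai.api_server --model <model>`) and is listening on the default port 8000; the model name and prompt are illustrative placeholders.

```python
# Sketch of querying a locally running vLLM OpenAI-compatible server with the
# official openai client; port, model name, and prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Streaming completion: tokens are printed as the server emits them.
stream = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="vLLM is",
    max_tokens=32,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
```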
For more information, check out the following:

- vLLM announcing blog post (intro to PagedAttention)
- vLLM paper (SOSP 2023)
- How continuous batching enables 23x throughput in LLM inference while reducing p50 latency by Cade Daniel et al.
- vLLM Meetups