vLLM
vLLM is a high-performance inference and serving engine for large language models. It's designed to maximize throughput and memory efficiency when serving LLMs locally.
Prerequisites
- Install vLLM (e.g. `pip install vllm`).
- Start a vLLM server with the model you want Mem0 to use (see the launch sketch below).
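If you prefer to manage the server from Python, the following sketch starts an OpenAI-compatible vLLM server and waits for it to come up. It assumes vLLM is already installed and that the default model from the configuration table below fits on your hardware; the equivalent shell command is `vllm serve Qwen/Qwen2.5-32B-Instruct`.

```python
import subprocess
import time

import requests  # used only to poll readiness

# Launch the OpenAI-compatible server (equivalent to `vllm serve ...` in a shell).
server = subprocess.Popen(
    ["vllm", "serve", "Qwen/Qwen2.5-32B-Instruct", "--port", "8000"]
)

# Poll the /v1/models endpoint until the server answers (large models can take a while to load).
for _ in range(180):
    try:
        if requests.get("http://localhost:8000/v1/models", timeout=2).ok:
            print("vLLM server is ready")
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(5)
```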
Usage
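A minimal sketch of wiring vLLM into Mem0, assuming the standard `Memory.from_config` pattern. The `vllm` provider settings mirror the configuration table below; the OpenAI embedder key is only an illustrative assumption, since the LLM itself runs locally.

```python
import os

from mem0 import Memory

# Assumption: an embedder is still needed; the default OpenAI embedder is used here.
os.environ["OPENAI_API_KEY"] = "your-api-key"

config = {
    "llm": {
        "provider": "vllm",
        "config": {
            "model": "Qwen/Qwen2.5-32B-Instruct",
            "vllm_base_url": "http://localhost:8000/v1",
            "api_key": "vllm-api-key",  # dummy key for a local server
            "temperature": 0.1,
            "max_tokens": 2000,
        },
    }
}

m = Memory.from_config(config)
messages = [
    {"role": "user", "content": "I'm planning to watch a movie tonight. Any recommendations?"},
    {"role": "assistant", "content": "How about a thriller? They can be quite engaging."},
]
m.add(messages, user_id="alice", metadata={"category": "movies"})
```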
Configuration Parameters
| Parameter | Description | Default | Environment Variable |
|---|---|---|---|
| `model` | Model name running on the vLLM server | `"Qwen/Qwen2.5-32B-Instruct"` | - |
| `vllm_base_url` | vLLM server URL | `"http://localhost:8000/v1"` | `VLLM_BASE_URL` |
| `api_key` | API key (dummy value for local servers) | `"vllm-api-key"` | `VLLM_API_KEY` |
| `temperature` | Sampling temperature | `0.1` | - |
| `max_tokens` | Maximum number of tokens to generate | `2000` | - |
Environment Variables
You can set these environment variables instead of specifying them in the config:
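For example, from Python (the values shown are the defaults from the table above):

```python
import os

# Defaults from the configuration table above; adjust for your deployment.
os.environ["VLLM_BASE_URL"] = "http://localhost:8000/v1"
os.environ["VLLM_API_KEY"] = "vllm-api-key"
```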
Benefits
- High Performance: up to 24x higher throughput than standard Hugging Face Transformers-based serving
- Memory Efficient: Optimized memory usage with PagedAttention
- Local Deployment: Keep your data private and reduce API costs
- Easy Integration: Drop-in replacement for other LLM providers
- Flexible: Works with any model supported by vLLM
Troubleshooting
- Server not responding: Make sure the vLLM server is running (see the smoke test below)
- 404 errors: Ensure the base URL has the correct format, e.g. `http://localhost:8000/v1` including the `/v1` suffix
- Model not found: Check that the `model` value matches the model loaded by the server
- Out of memory: Try a smaller model or reduce `max_model_len` when starting the server
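As a quick smoke test for the first three issues, you can query the server directly with the OpenAI-compatible client. This is a sketch, not part of the Mem0 API; it assumes the `openai` package is installed and the default values from the configuration table above.

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (defaults from the table above).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm-api-key")

# "Model not found": list the models the server actually loaded.
print([m.id for m in client.models.list().data])

# "Server not responding" / "404 errors": a short chat completion round trip.
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5,
)
print(resp.choices[0].message.content)
```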
Config
All available parameters for the `vllm` config are listed in the Master List of All Params in Config.