vLLM is a high-performance inference engine for large language models that provides significant performance improvements for local inference. It’s designed to maximize throughput and memory efficiency for serving LLMs.Documentation Index
Fetch the complete documentation index at: https://docs.mem0.ai/llms.txt
Use this file to discover all available pages before exploring further.
Prerequisites
-
Install vLLM:
-
Start vLLM server:
Usage
Configuration Parameters
| Parameter | Description | Default | Environment Variable |
|---|---|---|---|
model | Model name running on vLLM server | "Qwen/Qwen2.5-32B-Instruct" | - |
vllm_base_url | vLLM server URL | "http://localhost:8000/v1" | VLLM_BASE_URL |
api_key | API key (dummy for local) | "vllm-api-key" | VLLM_API_KEY |
temperature | Sampling temperature | 0.1 | - |
max_tokens | Maximum tokens to generate | 2000 | - |
Environment Variables
You can set these environment variables instead of specifying them in config:Benefits
- High Performance: 2-24x faster inference than standard implementations
- Memory Efficient: Optimized memory usage with PagedAttention
- Local Deployment: Keep your data private and reduce API costs
- Easy Integration: Drop-in replacement for other LLM providers
- Flexible: Works with any model supported by vLLM
Troubleshooting
-
Server not responding: Make sure vLLM server is running
-
404 errors: Ensure correct base URL format
- Model not found: Check model name matches server
-
Out of memory: Try smaller models or reduce
max_model_len
Config
All available parameters for thevllm config are present in Master List of All Params in Config.