vLLM
vLLM is a high-performance inference and serving engine for large language models. It's designed to maximize throughput and memory efficiency when serving LLMs locally.
Prerequisites
- Install vLLM (e.g. `pip install vllm`).
- Start a vLLM server with the model you want Mem0 to use (see the launch sketch below).
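If you prefer to manage the server from Python, the following sketch starts an OpenAI-compatible vLLM server and waits for it to come up. It assumes vLLM is already installed and that the default model from the configuration table below fits on your hardware; the equivalent shell command is `vllm serve Qwen/Qwen2.5-32B-Instruct`.

```python
import subprocess
import time

import requests  # used only to poll readiness

# Launch the OpenAI-compatible server (equivalent to `vllm serve ...` in a shell).
server = subprocess.Popen(
    ["vllm", "serve", "Qwen/Qwen2.5-32B-Instruct", "--port", "8000"]
)

# Poll the /v1/models endpoint until the server answers (large models can take a while to load).
for _ in range(180):
    try:
        if requests.get("http://localhost:8000/v1/models", timeout=2).ok:
            print("vLLM server is ready")
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(5)
```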
Usage
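A minimal sketch of wiring vLLM into Mem0, assuming the standard `Memory.from_config` pattern. The `vllm` provider settings mirror the configuration table below; the OpenAI embedder key is only an illustrative assumption, since the LLM itself runs locally.

```python
import os

from mem0 import Memory

# Assumption: an embedder is still needed; the default OpenAI embedder is used here.
os.environ["OPENAI_API_KEY"] = "your-api-key"

config = {
    "llm": {
        "provider": "vllm",
        "config": {
            "model": "Qwen/Qwen2.5-32B-Instruct",
            "vllm_base_url": "http://localhost:8000/v1",
            "api_key": "vllm-api-key",  # dummy key for a local server
            "temperature": 0.1,
            "max_tokens": 2000,
        },
    }
}

m = Memory.from_config(config)
messages = [
    {"role": "user", "content": "I'm planning to watch a movie tonight. Any recommendations?"},
    {"role": "assistant", "content": "How about a thriller? They can be quite engaging."},
]
m.add(messages, user_id="alice", metadata={"category": "movies"})
```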
Configuration Parameters
| Parameter | Description | Default | Environment Variable |
|---|---|---|---|
| `model` | Model name running on the vLLM server | `"Qwen/Qwen2.5-32B-Instruct"` | - |
| `vllm_base_url` | vLLM server URL | `"http://localhost:8000/v1"` | `VLLM_BASE_URL` |
| `api_key` | API key (dummy value for local servers) | `"vllm-api-key"` | `VLLM_API_KEY` |
| `temperature` | Sampling temperature | `0.1` | - |
| `max_tokens` | Maximum number of tokens to generate | `2000` | - |
Environment Variables
You can set these environment variables instead of specifying them in the config:
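For example, from Python (the values shown are the defaults from the table above):

```python
import os

# Defaults from the configuration table above; adjust for your deployment.
os.environ["VLLM_BASE_URL"] = "http://localhost:8000/v1"
os.environ["VLLM_API_KEY"] = "vllm-api-key"
```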
Benefits
- High Performance: up to 24x higher throughput than standard Hugging Face Transformers-based serving
- Memory Efficient: Optimized memory usage with PagedAttention
- Local Deployment: Keep your data private and reduce API costs
- Easy Integration: Drop-in replacement for other LLM providers
- Flexible: Works with any model supported by vLLM
Troubleshooting
- Server not responding: Make sure the vLLM server is running (see the smoke test below)
- 404 errors: Ensure the base URL has the correct format, e.g. `http://localhost:8000/v1` including the `/v1` suffix
- Model not found: Check that the `model` value matches the model loaded by the server
- Out of memory: Try a smaller model or reduce `max_model_len` when starting the server
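As a quick smoke test for the first three issues, you can query the server directly with the OpenAI-compatible client. This is a sketch, not part of the Mem0 API; it assumes the `openai` package is installed and the default values from the configuration table above.

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (defaults from the table above).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm-api-key")

# "Model not found": list the models the server actually loaded.
print([m.id for m in client.models.list().data])

# "Server not responding" / "404 errors": a short chat completion round trip.
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5,
)
print(resp.choices[0].message.content)
```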
Config
All available parameters for the `vllm` config are listed in the Master List of All Params in Config.