The Sentence Transformer reranker provides local reranking with HuggingFace cross-encoder models, making it a good fit for privacy-focused deployments where data must stay on-premises.

Models

Any HuggingFace cross-encoder model can be used (a scoring sketch follows this list). Popular choices include:
  • cross-encoder/ms-marco-MiniLM-L-6-v2: Default, good balance of speed and accuracy
  • cross-encoder/ms-marco-TinyBERT-L-2-v2: Fastest, smaller model size
  • cross-encoder/ms-marco-electra-base: Higher accuracy, larger model
  • cross-encoder/stsb-distilroberta-base: Good for semantic similarity tasks
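
Under the hood, a cross-encoder scores each query-document pair jointly rather than embedding query and document separately. Here is a minimal sketch using the sentence-transformers CrossEncoder class directly; the query and documents are illustrative:
Python
from sentence_transformers import CrossEncoder

# Loads the default cross-encoder used by this reranker
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What books does the user like?"
documents = [
    "I love reading science fiction novels",
    "I also enjoy watching sci-fi movies",
    "The weather was nice yesterday",
]

# Each (query, document) pair is scored jointly; higher means more relevant
scores = model.predict([(query, doc) for doc in documents])
for doc, score in sorted(zip(documents, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {doc}")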

Installation

pip install sentence-transformers
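
To verify the installation and pre-download the default model (a one-time fetch from the HuggingFace Hub), you can load it directly. A minimal sketch:
Python
from sentence_transformers import CrossEncoder

# Downloads the model on first run; later runs use the local cache
CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print("sentence-transformers is installed and the model is cached")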

Configuration

Python
from mem0 import Memory

config = {
    "vector_store": {
        "provider": "chroma",
        "config": {
            "collection_name": "my_memories",
            "path": "./chroma_db"
        }
    },
    "llm": {
        "provider": "openai",
        "config": {
            "model": "gpt-4o-mini"
        }
    },
    "rerank": {
        "provider": "sentence_transformer",
        "config": {
            "model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
            "device": "cpu",  # or "cuda" for GPU
            "batch_size": 32,
            "show_progress_bar": False,
            "top_k": 5
        }
    }
}

memory = Memory.from_config(config)

GPU Acceleration

For better performance, use GPU acceleration:
Python
config = {
    "rerank": {
        "provider": "sentence_transformer",
        "config": {
            "model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
            "device": "cuda",  # Use GPU
            "batch_size": 64   # high batch size for high memory GPUs
        }
    }
}
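
If the same configuration must run on machines with and without a GPU, one option is to pick the device at runtime with PyTorch, which sentence-transformers already depends on. A sketch of that pattern:
Python
import torch

# Fall back to CPU automatically when no CUDA device is present
device = "cuda" if torch.cuda.is_available() else "cpu"

config = {
    "rerank": {
        "provider": "sentence_transformer",
        "config": {
            "model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
            "device": device,
            # Larger batches generally pay off on GPU; tune to your memory budget
            "batch_size": 64 if device == "cuda" else 32,
        }
    }
}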

Usage Example

Python
from mem0 import Memory

# Initialize memory with local reranker
config = {
    "vector_store": {"provider": "chroma"},
    "llm": {"provider": "openai", "config": {"model": "gpt-4o-mini"}},
    "rerank": {
        "provider": "sentence_transformer",
        "config": {
            "model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
            "device": "cpu"
        }
    }
}

memory = Memory.from_config(config)

# Add memories
messages = [
    {"role": "user", "content": "I love reading science fiction novels"},
    {"role": "user", "content": "My favorite author is Isaac Asimov"},
    {"role": "user", "content": "I also enjoy watching sci-fi movies"}
]

memory.add(messages, user_id="charlie")

# Search with local reranking
results = memory.search("What books does the user like?", user_id="charlie")

for result in results['results']:
    print(f"Memory: {result['memory']}")
    print(f"Vector Score: {result['score']:.3f}")
    print(f"Rerank Score: {result['rerank_score']:.3f}")
    print()

Custom Models

You can use any HuggingFace cross-encoder model:
Python
# Using a different model
config = {
    "rerank": {
        "provider": "sentence_transformer", 
        "config": {
            "model": "cross-encoder/stsb-distilroberta-base",
            "device": "cpu"
        }
    }
}

Configuration Parameters

| Parameter | Description | Type | Default |
|---|---|---|---|
| model | HuggingFace cross-encoder model name | str | "cross-encoder/ms-marco-MiniLM-L-6-v2" |
| device | Device to run the model on ("cpu", "cuda", etc.) | str | None |
| batch_size | Batch size for processing documents | int | 32 |
| show_progress_bar | Show a progress bar during processing | bool | False |
| top_k | Maximum number of documents to return | int | None |

Advantages

  • Privacy: Complete local processing, no external API calls
  • Cost: No per-token charges after initial model download
  • Customization: Use any HuggingFace cross-encoder model
  • Offline: Works without an internet connection once the model is downloaded (see the sketch below)
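
For fully offline operation, pre-download the model once, then set the HuggingFace offline environment variables so no network requests are attempted. A sketch, assuming the model is already in the local cache:
Python
import os

# Must be set before the model is loaded; forces use of the local cache only
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from sentence_transformers import CrossEncoder

# Succeeds only if the model was previously downloaded and cached
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")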

Performance Considerations

  • First Run: The initial model download from the HuggingFace Hub can take time
  • Memory Usage: Models occupy GPU/CPU memory while loaded
  • Batch Size: Tune the batch size to the memory you have available (see the timing sketch below)
  • Device: GPU acceleration significantly improves throughput
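
One way to tune the batch size empirically is to time CrossEncoder.predict at a few candidate sizes. A rough sketch with synthetic pairs; timings will vary by hardware:
Python
import time
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [("example query", f"candidate document {i}") for i in range(256)]

# Score the same pairs at several batch sizes and compare wall-clock time
for batch_size in (8, 32, 64, 128):
    start = time.perf_counter()
    model.predict(pairs, batch_size=batch_size, show_progress_bar=False)
    print(f"batch_size={batch_size}: {time.perf_counter() - start:.2f}s")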

Best Practices

  1. Model Selection: Choose a model based on your accuracy-versus-speed requirements
  2. Device Management: Use a GPU when available for better performance
  3. Batch Processing: Process multiple documents together for efficiency
  4. Memory Monitoring: Watch system and GPU memory usage with larger models (see the sketch after this list)
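
To gauge how much GPU memory a larger model occupies, you can compare free device memory before and after loading it. A sketch using torch.cuda.mem_get_info, which reports free and total device memory in bytes:
Python
import torch
from sentence_transformers import CrossEncoder

if torch.cuda.is_available():
    free_before, total = torch.cuda.mem_get_info()
    # One of the larger models listed above
    model = CrossEncoder("cross-encoder/ms-marco-electra-base", device="cuda")
    free_after, _ = torch.cuda.mem_get_info()
    print(f"Model uses ~{(free_before - free_after) / 1e6:.0f} MB "
          f"of {total / 1e9:.1f} GB total")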