Thursday, January 29, 2026

Using custom embedding models with LlamaIndex

To use a custom model with LlamaIndex, you configure the Settings object. This acts as a global configuration hub that tells LlamaIndex which "brain" (LLM) and "dictionary" (embedding model) to use for all operations.

Since you are working with CLI commands, I recommend a local embedding model (no API cost, full privacy) paired with a custom LLM (such as a Llama 3 variant served through Ollama).

1. Setup for Local Embedding & LLM

First, install the necessary integrations:

pip install llama-index-embeddings-huggingface llama-index-llms-ollama
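
If you plan to use the Ollama LLM shown below, pull the model ahead of time (this assumes Ollama itself is already installed and its server is running):

ollama pull llama3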


2. Configuration Code

Here is how you replace the default OpenAI models with custom local ones.

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama

# 1. Set a custom embedding model (local, from Hugging Face)
# BGE-Small is excellent for technical/CLI text retrieval
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

# 2. Set a custom LLM (local, via Ollama)
# This assumes you have Ollama running locally with 'llama3' pulled
Settings.llm = Ollama(model="llama3", request_timeout=60.0)

# 3. The standard LlamaIndex flow now uses these settings automatically
# index = VectorStoreIndex.from_documents(documents)
# query_engine = index.as_query_engine()
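
To sanity-check the configuration, you can build a small index and run a query. A minimal sketch; the two sample documents are hypothetical stand-ins for entries from your CLI JSON array:

from llama_index.core import VectorStoreIndex, Document

# Hypothetical CLI entries; in practice, load these from your JSON array
documents = [
    Document(text="tar -xzvf archive.tar.gz  # extract a gzipped tarball"),
    Document(text="grep -rn 'pattern' .  # recursive search with line numbers"),
]

# Indexing uses Settings.embed_model; answering uses Settings.llm
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("How do I extract a .tar.gz archive?"))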


Which Custom Models Should You Choose for CLI Text?

Depending on your hardware and specific needs, here are the strongest candidates for a JSON array of CLI commands:

| Type | Recommended Model | Why? |
|---|---|---|
| Embedding | BAAI/bge-base-en-v1.5 | Consistently tops benchmarks for retrieval accuracy. |
| Embedding | nomic-embed-text | Great for long examples and varied CLI syntax. |
| LLM | codellama or llama3 | Understands code structure and terminal syntax better than generic models. |
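
Note that nomic-embed-text can also be served locally through Ollama rather than Hugging Face. A minimal sketch, assuming you have run pip install llama-index-embeddings-ollama and ollama pull nomic-embed-text:

from llama_index.core import Settings
from llama_index.embeddings.ollama import OllamaEmbedding

# Route embedding calls through the local Ollama server
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text")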


Key Tips for Custom Models:

 * Device selection: If you have a GPU, the HuggingFace embedding will try to use it automatically. You can force it by passing device="cuda" (or device="mps" on a Mac) to HuggingFaceEmbedding, as shown in the sketch after this list.

 * Batch size: For large JSON arrays, you can speed up embedding by setting embed_batch_size=32 on the embedding model (also shown below).

 * Persistence: Once you embed your JSON array with a custom model, you must use that same model every time you load the index later. If you change models, the vectors won't match and retrieval will fail.
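
A minimal sketch combining the first two tips (assuming an NVIDIA GPU; swap in device="mps" on Apple Silicon):

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    device="cuda",        # force GPU inference; use "mps" on a Mac, "cpu" to opt out
    embed_batch_size=32,  # embed 32 text chunks per forward pass
)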

Finally, save (persist) the index to disk so you don't have to re-embed your CLI JSON every time you run the script.
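
A minimal sketch using LlamaIndex's storage context (the "./storage" directory name is an arbitrary choice):

from llama_index.core import StorageContext, VectorStoreIndex, load_index_from_storage

# First run: build the index and save it to disk
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="./storage")

# Later runs: reload the saved index instead of re-embedding
# (Settings.embed_model must be the same model used at build time)
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)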

