Friday, January 30, 2026

Create your own RAG evaluation strategy

 Building a RAG (Retrieval-Augmented Generation) evaluation from scratch is actually a great way to deeply understand where your pipeline is failing. While frameworks like Ragas or Arize Phoenix are popular, they are essentially just wrappers for specific prompts and math.

To do this manually, you need to evaluate the two distinct pillars of RAG: Retrieval (finding the right info) and Generation (using that info correctly).

1. The Evaluation Dataset

You can’t evaluate without a "Golden Dataset." Create a spreadsheet with 20–50 rows containing:

 * Question: What the user asks.

 * Context/Source: The specific document snippet that contains the answer.

 * Ground Truth: The ideal, "perfect" answer.
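To make this concrete, here is a minimal sketch of what a couple of golden-dataset rows could look like in code. The field names are illustrative and line up with the `golden_dataset` variable used in the loop in section 4.

golden_dataset = [
    {
        "question": "How do I list all containers, including stopped ones?",
        "context": "Command: docker ps. Description: List running containers. Examples: docker ps -a",
        "ground_truth": "Run 'docker ps -a' to list all containers, including stopped ones.",
    },
    {
        "question": "How do I create and switch to a new git branch?",
        "context": "Command: git checkout -b. Description: Create and switch to a new branch.",
        "ground_truth": "Run 'git checkout -b <branch-name>' to create and switch to a new branch.",
    },
]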

2. Evaluating Retrieval (The "Search" Part)

This measures if your vector database is actually finding the right documents. You don't need an LLM for this; you just need basic math.

 * Hit Rate (Recall@K): Did the correct document appear in the top k results?

   * Calculation: (Number of queries where the right doc was found) / (Total queries).

 * Mean Reciprocal Rank (MRR): Measures where the right document appeared. It rewards the system more for having the correct answer at rank 1 than rank 5.

   * Formula: MRR = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{rank_i}
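Both retrieval metrics can be computed with a few lines of plain Python. This sketch assumes you have already recorded, for each query, the rank at which the correct document was retrieved (None if it never appeared):

def hit_rate_and_mrr(ranks, k=5):
    """ranks: 1-based rank of the correct doc per query, or None if it was not retrieved."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    hit_rate = hits / len(ranks)
    mrr = sum(1.0 / r for r in ranks if r is not None) / len(ranks)
    return hit_rate, mrr

# Example: correct doc at rank 1, at rank 3, and missed entirely
print(hit_rate_and_mrr([1, 3, None], k=5))  # roughly (0.67, 0.44)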

3. Evaluating Generation (The "LLM-as-a-Judge" Method)

Since manual grading is slow, you can use a "Judge LLM" (like GPT-4o or Claude 3.5) to grade your RAG output. You feed the judge a specific prompt for each of these three metrics:

A. Faithfulness (Groundedness)

Does the answer stay true to the retrieved context, or did the LLM hallucinate?

 * The Prompt: "Given the following context and the generated answer, list every claim in the answer. For each claim, state if it is supported by the context. Score 1.0 if all claims are supported, 0.0 otherwise."

B. Answer Relevance

Does the answer actually address the user's question?

 * The Prompt: "On a scale of 1-5, how relevant is this response to the original user question? Ignore whether the facts are true for now; focus only on whether it addresses the user's intent."

C. Context Precision

Did the retrieval step provide "clean" information, or was it full of noise?

 * The Prompt: "Check the retrieved context. Is this information actually necessary to answer the user's question? Rate 1 for useful, 0 for irrelevant."

4. Simple Python Implementation Structure

You don't need a library; a simple loop will do:

results = []

for item in golden_dataset:

    # 1. Run your RAG pipeline

    retrieved_docs = retriever.get_relevant_documents(item['question'])

    response = rag_chain.invoke(item['question'])

    

    # 2. Manual/LLM Scoring

    score = call_judge_llm(

        system_prompt="You are a grader...",

        user_content=f"Question: {item['question']}\nContext: {retrieved_docs}\nAnswer: {response}"

    )

    

    results.append({"question": item['question'], "score": score})


# 3. Calculate the mean score

final_grade = sum(r['score'] for r in results) / len(results)
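The call_judge_llm helper above is left undefined; here is one possible sketch using the OpenAI Python SDK (an assumption, any chat-completion client would do), where the judge prompt should ask for a single number:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def call_judge_llm(system_prompt: str, user_content: str) -> float:
    """Ask the judge model for a single numeric score and parse it."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
        ],
        temperature=0,
    )
    reply = completion.choices[0].message.content.strip()
    try:
        return float(reply)  # works when the prompt says "respond with only a number"
    except ValueError:
        return 0.0  # fallback if the judge returned extra text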


Summary Table: What to Track

| Metric | What it tests | Success Criteria |

|---|---|---|

| Context Recall | Retrieval | Is the ground truth present in the chunks? |

| Faithfulness | Generation | Did the LLM make things up? |

| Answer Similarity | Generation | How close is the answer to the Ground Truth? (Use Semantic Similarity) |
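For the Answer Similarity row, one hedged option is semantic similarity with sentence-transformers (the model name below is just a common default):

from sentence_transformers import SentenceTransformer, util

similarity_model = SentenceTransformer("all-MiniLM-L6-v2")

def answer_similarity(answer: str, ground_truth: str) -> float:
    # Cosine similarity between the generated answer and the ground truth (roughly 0.0 to 1.0)
    embeddings = similarity_model.encode([answer, ground_truth], convert_to_tensor=True)
    return float(util.cos_sim(embeddings[0], embeddings[1]))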

Would you like me to write a specific "Judge Prompt" you can use to grade your RAG's faithfulness?


Arize Phoenix embedding visualization and observability

 Arize Phoenix is different from Ragas or DeepEval because it is an observability tool. Instead of just giving you a score, it launches a local web dashboard that lets you visually inspect your CLI embeddings and trace exactly how your RAG pipeline is performing in real-time.

For your CLI project, Phoenix is incredibly helpful for seeing "clusters" of commands and finding out why a specific query retrieved the wrong CLI command.

1. Prerequisites

pip install arize-phoenix llama-index-callbacks-arize-phoenix


2. Implementation Code

This script connects LlamaIndex to Phoenix. Once you run this, a browser window will open showing your RAG "traces."

import phoenix as px

import llama_index.core

from llama_index.core import VectorStoreIndex, Document, Settings

from llama_index.core.callbacks import CallbackManager

from llama_index.callbacks.arize_phoenix import ArizePhoenixCallbackHandler


# 1. Start the Phoenix Search & Trace server (launches a local web UI)

session = px.launch_app()


# 2. Setup LlamaIndex to send data to Phoenix

remote_callback_handler = ArizePhoenixCallbackHandler()

callback_manager = CallbackManager([remote_callback_handler])

Settings.callback_manager = callback_manager


# 3. Your CLI JSON Data

cli_data = [

    {"command": "git checkout -b", "description": "Create and switch to a new branch", "examples": ["git checkout -b feature-login"]},

    {"command": "git branch -d", "description": "Delete a local branch", "examples": ["git branch -d old-feature"]}

]


# 4. Standard LlamaIndex Ingestion

documents = [Document(text=f"{item['command']}: {item['description']}") for item in cli_data]

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()


# 5. Run a query

# After running this, check the Phoenix Dashboard!

response = query_engine.query("How do I make a new branch?")


print(f"Answer: {response}")

print(f"Phoenix Dashboard URL: {session.url}")


# Keep the script running so you can explore the UI

import time

time.sleep(1000)


What to look for in the Phoenix UI:

 * Traces: You will see a "timeline" of your query. You can click on it to see exactly what text was sent to the embedding model and what chunks were pulled from your JSON.

 * The Embedding Map: Phoenix can visualize your CLI commands as dots in a 3D space.

   * Example: You might see a cluster of "Docker" commands and a cluster of "Git" commands.

   * Insight: If "how do I delete a branch" pulls up a "Docker delete" command, you will see the query dot land in the wrong cluster, telling you that your embeddings need more technical context.

 * LLM Evaluation: Phoenix can run "Evals" in the background. It will flag queries that it thinks were "Unfaithful" or had "Poor Retrieval" based on its internal heuristics.

Comparison: When to use which?

| Use Case | Recommended Tool |

|---|---|

| "I want to know if my RAG is accurate." | Ragas |

| "I want to prevent breaking changes in my code." | DeepEval |

| "I want to see WHY my RAG is failing visually." | Arize Phoenix |

Would you like to know how to use Phoenix to find "Useless Commands" in your JSON (commands that never get retrieved or overlap too much with others)?


Using deepeval

 DeepEval is often called the "Pytest for LLMs" because it allows you to write evaluation scripts that feel exactly like standard software unit tests.

For your CLI JSON project, DeepEval is particularly useful because it provides Reasoning. If a command fails the test, it will tell you exactly why (e.g., "The model suggested the --force flag, but the JSON context only mentions --recursive").

1. Prerequisites

pip install deepeval


2. The DeepEval Test File (test_cli_rag.py)

This script uses the RAG Triad (Faithfulness, Answer Relevancy, and Contextual Precision) to test your CLI commands.

import pytest

from deepeval import assert_test

from deepeval.test_case import LLMTestCase

from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric, ContextualPrecisionMetric


# 1. Setup the metrics with passing thresholds

# Threshold 0.7 means the score must be > 0.7 to "Pass" the unit test

faithfulness_metric = FaithfulnessMetric(threshold=0.7)

relevancy_metric = AnswerRelevancyMetric(threshold=0.7)

precision_metric = ContextualPrecisionMetric(threshold=0.7)


def test_docker_ps_command():

    # --- SIMULATED RAG OUTPUT ---

    # In a real test, you would call your query_engine.query() here

    input_query = "How do I see all my containers, even stopped ones?"

    actual_output = "Use the command 'docker ps -a' to list all containers including stopped ones."

    retrieval_context = [

        "Command: docker ps. Description: List running containers. Examples: docker ps -a"

    ]

    

    # 2. Create the Test Case

    test_case = LLMTestCase(

        input=input_query,

        actual_output=actual_output,

        retrieval_context=retrieval_context

    )

    

    # 3. Assert the test with multiple metrics

    assert_test(test_case, [faithfulness_metric, relevancy_metric, precision_metric])


def test_non_existent_command():

    input_query = "How do I hack into NASA?"

    actual_output = "I'm sorry, I don't have information on that."

    retrieval_context = [] # Nothing found in your CLI JSON

    

    test_case = LLMTestCase(

        input=input_query,

        actual_output=actual_output,

        retrieval_context=retrieval_context

    )

    

    assert_test(test_case, [relevancy_metric])


3. Running the Test

You run this from your terminal just like a normal python test:

deepeval test run test_cli_rag.py


4. Why DeepEval is better than Ragas for CLI:

 * The Dashboard: If you run deepeval login, all your results are uploaded to a web dashboard where you can see how your CLI tool's accuracy changes over time as you add more commands to your JSON.

 * Strict Flags: You can create a custom GEval metric in DeepEval specifically to check for "Flag Accuracy", ensuring the LLM never hallucinates a CLI flag that isn't in your documentation (see the sketch after this list).

 * CI/CD Integration: You can block a GitHub Pull Request from merging if the "CLI Accuracy" score drops below 80%.
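Here is the kind of custom GEval metric the "Strict Flags" point refers to; the criteria wording and threshold below are assumptions, not a prescribed recipe:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

flag_accuracy_metric = GEval(
    name="Flag Accuracy",
    criteria=(
        "Check whether every CLI flag mentioned in the actual output "
        "also appears in the retrieval context. Penalize any invented flag."
    ),
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
    threshold=0.8,
)

# Pass it to assert_test alongside the other metrics in your test functions.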

Comparison: Ragas vs. DeepEval

| Feature | Ragas | DeepEval |

|---|---|---|

| Primary Use | Research / Bulk Data Eval | Engineering / Unit Testing |

| Output | Raw Scores (0.0 - 1.0) | Pass/Fail + Detailed Reasoning |

| Integration | Pandas / Notebooks | Pytest / GitHub Actions |

| UI | None (requires 3rd party) | Built-in Cloud Dashboard |

Would you like me to show you how to create a "Custom Flag Metric" to ensure the LLM never invents fake CLI arguments?


Thursday, January 29, 2026

Using custom embedding models with llamaindex

 To use a custom model with LlamaIndex, you use the Settings object. This acts as a global configuration hub that tells LlamaIndex which "brain" (LLM) and "dictionary" (Embedding Model) to use for all operations.

Since you are working with CLI commands, I recommend using a local embedding model (no API cost and high privacy) and a custom LLM (like a specific Llama 3 variant).

1. Setup for Local Embedding & LLM

First, install the necessary integrations:

pip install llama-index-embeddings-huggingface llama-index-llms-ollama


2. Configuration Code

Here is how you replace the default OpenAI models with custom local ones.

from llama_index.core import Settings

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

from llama_index.llms.ollama import Ollama


# 1. Set a Custom Embedding Model (Local from HuggingFace)

# BGE-Small is excellent for technical/CLI text retrieval

Settings.embed_model = HuggingFaceEmbedding(

    model_name="BAAI/bge-small-en-v1.5"

)


# 2. Set a Custom LLM (Local via Ollama)

# This assumes you have Ollama running locally with 'llama3' pulled

Settings.llm = Ollama(model="llama3", request_timeout=60.0)


# 3. Standard LlamaIndex flow now uses these settings automatically

# index = VectorStoreIndex.from_documents(documents)

# query_engine = index.as_query_engine()


Which Custom Model Should You Choose for CLI?

Depending on your hardware and specific needs, here are the best "custom" matches for your JSON array:

| Type | Recommended Model | Why? |

|---|---|---|

| Embedding | BAAI/bge-base-en-v1.5 | Consistently tops benchmarks for retrieval accuracy. |

| Embedding | nomic-embed-text | Great for long examples and varied CLI syntax. |

| LLM | codellama or llama3 | These models understand code structure and terminal syntax better than generic models. |

Visual Flow of Custom Settings

Key Tips for Custom Models:

 * Device selection: If you have a GPU, the HuggingFace embedding will automatically try to use it. You can force it by adding device="cuda" or device="mps" (for Mac) inside HuggingFaceEmbedding (see the sketch after these tips).

 * Batch Size: For large JSON arrays, you can speed up the "custom" embedding process by setting embed_batch_size=32 in the embedding settings.

 * Persistence: Once you embed your JSON array using a custom model, you must use that same model every time you load the index later. If you change models, the "numbers" won't match, and retrieval will fail.
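Putting the device and batch-size tips together, a minimal sketch (the values are illustrative; adjust them to your hardware):

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",
    device="cuda",         # assumed GPU; use "mps" on Apple Silicon or "cpu" otherwise
    embed_batch_size=32,   # larger batches speed up embedding big JSON arrays
)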

Would you like me to show you how to save (persist) the index to your disk so you don't have to re-embed your CLI JSON every time you run the script?


RAG for Technical Documentation

 Implementing a RAG (Retrieval-Augmented Generation) system for technical content like CLI commands requires a different approach than standard document RAG. Because CLI commands use specific syntax (e.g., --flags, grep, awk) and jargon, simple semantic search often fails to distinguish between similar-looking commands.

The best approach for your JSON dataset is a Hybrid Retrieval strategy combined with Context-Aware Metadata.

1. The Strategy: Hybrid Retrieval

Technical documentation usually requires two types of matching:

 * Semantic Search (Dense): Understands intent. If a user asks "How do I list hidden files?", the retriever finds ls -a even if the word "hidden" isn't in the description.

 * Keyword Match (Sparse/BM25): Handles exact technical terms. If a user types the specific flag --recursive, you want the retriever to prioritize commands that actually contain that exact string.

Recommended Pipeline:

 * Preprocessing: For each JSON object, create a "searchable string" that combines the command name, description, and examples.

 * Indexing: Use a vector database (like Qdrant, Pinecone, or Weaviate) that supports Hybrid Search.

 * Retrieval: Use Reciprocal Rank Fusion (RRF) to combine the results from the vector search and the keyword search.

2. Preparing Your JSON Data

Don't just embed the description. You need to structure the text to help the embedding model "see" the command.

Original JSON:

{

  "command": "docker ps",

  "description": "List running containers",

  "examples": ["docker ps -a", "docker ps --format '{{.ID}}'"]

}


Optimized Chunk for Embedding:

> Command: docker ps

> Description: List running containers.

> Examples: docker ps -a, docker ps --format '{{.ID}}'

> Keywords: docker, ps, list, containers, running, status
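A small sketch of producing that chunk from each JSON object; the keyword list here is a naive illustration (in practice you would likely curate it):

def build_chunk(item: dict) -> str:
    """Combine command, description, and examples into one searchable string."""
    keywords = sorted(set(item["command"].split() + item["description"].lower().split()))
    return (
        f"Command: {item['command']}\n"
        f"Description: {item['description']}\n"
        f"Examples: {', '.join(item['examples'])}\n"
        f"Keywords: {', '.join(keywords)}"
    )

docker_ps = {
    "command": "docker ps",
    "description": "List running containers",
    "examples": ["docker ps -a", "docker ps --format '{{.ID}}'"],
}
print(build_chunk(docker_ps))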

3. Implementation Steps

A. Embedding Model Choice

Since you are dealing with code-like structures, use a model trained on technical text or code.

 * Open Source: BAAI/bge-small-en-v1.5 (excellent for technical retrieval).

 * Proprietary: text-embedding-3-small (OpenAI) or text-embedding-004 (Gemini).

B. The Retrieval Logic

Use a "Multi-Vector" approach if your examples are very different from your descriptions:

 * Summary Vector: Embed just the command and description.

 * Example Vector: Embed each example separately but link them to the same parent command ID.

C. Handling "Short Query" Problems

CLI users often type short, ambiguous queries like "delete file." To solve this, use Query Expansion (Hypothetical Document Embeddings - HyDE):

 * Ask an LLM: "What is the technical description of the command to delete a file?"

 * Use that expanded description to search your JSON array instead of the user's two-word query.
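A minimal HyDE sketch, assuming the OpenAI SDK for the expansion step (any LLM client works; the prompt wording is an assumption):

from openai import OpenAI

client = OpenAI()

def expand_query(user_query: str) -> str:
    """Ask an LLM for a hypothetical documentation-style answer, then search with that."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Write a short technical description of the CLI command that does this: {user_query}",
        }],
        temperature=0,
    )
    return completion.choices[0].message.content

hypothetical_doc = expand_query("delete file")
# results = retriever.retrieve(hypothetical_doc)  # 'retriever' is whatever retriever you built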

4. Why this works for CLI

| Feature | Why it matters |

|---|---|

| Hybrid Search | Ensures git log is found for both "history" (semantic) and "log" (keyword). |

| Metadata Filtering | Allows you to filter by categories (e.g., only search git commands if the user is in a git context). |

| Example Indexing | Often, a user remembers a flag but not the command; indexing examples helps catch these. |

Would you like me to provide a Python code snippet using a specific library (like LangChain or LlamaIndex) to implement this hybrid search?


Query fusion retrieval

 To implement this, I recommend using LlamaIndex with its built-in QueryFusionRetriever. This is a powerful, production-ready way to perform Hybrid Search (BM25 + Vector) and then use Reciprocal Rank Fusion (RRF) to get the most accurate result.

1. Prerequisites

You will need to install the following libraries:

pip install llama-index llama-index-retrievers-bm25 llama-index-embeddings-openai llama-index-vector-stores-qdrant qdrant-client


2. Implementation Code

This script loads your JSON, prepares the documents, and sets up the hybrid retrieval pipeline.

import json

from llama_index.core import Document, VectorStoreIndex, StorageContext

from llama_index.core.retrievers import QueryFusionRetriever

from llama_index.retrievers.bm25 import BM25Retriever

from llama_index.vector_stores.qdrant import QdrantVectorStore

import qdrant_client


# 1. Load your JSON Data

cli_data = [

    {

        "command": "docker ps",

        "description": "List running containers",

        "examples": ["docker ps -a", "docker ps --format '{{.ID}}'"]

    },

    # ... more commands

]


# 2. Transform JSON to Documents for Indexing

documents = []

for item in cli_data:

    # We combine command, description, and examples into one text block

    # This ensures the model can "see" all parts during search

    content = f"Command: {item['command']}\nDescription: {item['description']}\nExamples: {', '.join(item['examples'])}"

    

    doc = Document(

        text=content,

        metadata={"command": item['command']} # Keep original command in metadata

    )

    documents.append(doc)


# 3. Setup Vector Storage (Dense Search)

client = qdrant_client.QdrantClient(location=":memory:") # Use local memory for this example

vector_store = QdrantVectorStore(client=client, collection_name="cli_docs")

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)


# 4. Initialize Retrievers

# Semantic (Vector) Retriever

vector_retriever = index.as_retriever(similarity_top_k=5)


# Keyword (BM25) Retriever

bm25_retriever = BM25Retriever.from_defaults(

    docstore=index.docstore, 

    similarity_top_k=5

)


# 5. The Hybrid "Fusion" Retriever

# This combines both and reranks using Reciprocal Rank Fusion (RRF)

hybrid_retriever = QueryFusionRetriever(

    [vector_retriever, bm25_retriever],

    similarity_top_k=2,

    num_queries=1,  # Set > 1 if you want the LLM to rewrite the query into multiple variations

    mode="reciprocal_rerank",

    use_async=True

)


# 6. Usage

query = "How do I see all my containers?"

nodes = hybrid_retriever.retrieve(query)


for node in nodes:

    print(f"Score: {node.score:.4f}")

    print(f"Content:\n{node.text}\n")


Why this works for CLI datasets:

 * Contextual Awareness: By putting the command, description, and examples into the document text, the vector search understands that "list containers" relates to docker ps.

 * Precise Flag Matching: If a user searches for a specific flag like -a or --format, the BM25 retriever will catch that exact string, which a standard vector search might ignore as "noise."

 * RRF Ranking: Reciprocal Rank Fusion is great because it doesn't require you to manually "weight" (e.g., 70% vector, 30% keyword). It mathematically finds the items that appear at the top of both lists.
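To make the RRF point concrete, here is a tiny sketch of the fusion formula itself (k=60 is the constant commonly used in the literature):

def rrf_scores(ranked_lists, k=60):
    """Each ranked list holds doc IDs, best first. Higher fused score = better."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

vector_results = ["docker ps", "docker images", "git log"]
bm25_results = ["docker ps", "git log", "docker rm"]
print(rrf_scores([vector_results, bm25_results]))  # "docker ps" tops both lists, so it wins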

Would you like me to show you how to add an LLM step to this so it generates a natural language explanation of the command it found?


Evaluating using Ragas

 Evaluating a RAG (Retrieval-Augmented Generation) pipeline with Ragas (Retrieval Augmented Generation Assessment) is a smart move. It moves you away from "vibes-based" testing and into actual metrics like Faithfulness, Answer Relevance, and Context Precision.

To get this running, you'll need an "Evaluation Dataset" consisting of Questions, Contexts, Answers, and (optionally) Ground Truths.

Prerequisites

First, install the necessary libraries:

pip install ragas langchain openai


Python Implementation

Here is a concise script to evaluate a set of RAG results using Ragas and OpenAI as the "LLM judge."

import os

from datasets import Dataset

from ragas import evaluate

from ragas.metrics import (

    faithfulness,

    answer_relevancy,

    context_precision,

    context_recall,

)


# 1. Setup your API Key

os.environ["OPENAI_API_KEY"] = "your-api-key-here"


# 2. Prepare your data

# 'contexts' should be a list of lists (strings retrieved from your vector db)

data_samples = {

    'question': ['When was the first iPhone released?', 'Who founded SpaceX?'],

    'answer': ['The first iPhone was released on June 29, 2007.', 'Elon Musk founded SpaceX in 2002.'],

    'contexts': [

        ['Apple Inc. released the first iPhone in mid-2007.', 'Steve Jobs announced it in January.'],

        ['SpaceX was founded by Elon Musk to reduce space transportation costs.']

    ],

    'ground_truth': ['June 29, 2007', 'Elon Musk']

}


dataset = Dataset.from_dict(data_samples)


# 3. Define the metrics you want to track

metrics = [

    faithfulness,

    answer_relevancy,

    context_precision,

    context_recall

]


# 4. Run the evaluation

score = evaluate(dataset, metrics=metrics)


# 5. Review the results

df = score.to_pandas()

print(df)


Key Metrics Explained

Understanding what these numbers mean is half the battle:

| Metric | What it measures |

|---|---|

| Faithfulness | Is the answer derived only from the retrieved context? (Prevents hallucinations). |

| Answer Relevancy | Does the answer actually address the user's question? |

| Context Precision | Are the truly relevant chunks ranked higher in your retrieval results? |

| Context Recall | Does the retrieved context actually contain the information needed to answer? |

Pro-Tip: Evaluation without Ground Truth

If you don't have human-annotated ground_truth data yet, you can still run Faithfulness and Answer Relevancy. Ragas is particularly powerful because it uses an LLM to "reason" through whether the retrieved context supports the generated answer.
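In that case, a minimal variation of the script above drops the ground_truth column and keeps only those reference-free metrics:

dataset_no_gt = Dataset.from_dict({
    'question': data_samples['question'],
    'answer': data_samples['answer'],
    'contexts': data_samples['contexts'],
})

score = evaluate(dataset_no_gt, metrics=[faithfulness, answer_relevancy])
print(score.to_pandas())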

Would you like me to show you how to integrate this directly with a LangChain or LlamaIndex retriever so you don't have to manually build the dataset?


What is DSSE-KMS in AWS?

 DSSE-KMS (Dual-Layer Server-Side Encryption with AWS Key Management Service) is an Amazon S3 encryption option that applies two layers of encryption to objects at rest, providing enhanced security. It helps meet strict compliance requirements (like CNSSP 15) by using AWS KMS keys to encrypt data twice, offering superior protection for highly sensitive workloads. 

Key Features and Benefits

Dual-Layer Protection: Uses two distinct cryptographic libraries and data keys to encrypt objects, providing a higher level of assurance than single-layer encryption.

KMS Key Management: Uses AWS KMS to manage the master keys, allowing users to define permissions and audit usage.

Compliance Ready: Designed to meet rigorous standards, including the National Security Agency (NSA) CNSSP 15 for two layers of Commercial National Security Algorithm (CNSA) encryption.

Easy Implementation: Can be configured as the default encryption for an S3 bucket or specified in PUT/COPY requests.

Enforceable Security: IAM and bucket policies can be used to enforce this encryption type, ensuring all uploaded data is encrypted. 

DSSE-KMS is particularly aimed at US Department of Defense (DoD) customers and other industries requiring top-secret data handling.
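As a hedged example of the "specify it in PUT requests" path using boto3 (the bucket, key, and KMS alias below are placeholders):

import boto3

s3 = boto3.client("s3")

# Upload an object with dual-layer server-side encryption (DSSE-KMS)
s3.put_object(
    Bucket="my-sensitive-bucket",         # placeholder bucket name
    Key="reports/quarterly.txt",          # placeholder object key
    Body=b"sensitive payload",
    ServerSideEncryption="aws:kms:dsse",  # the DSSE-KMS encryption value
    SSEKMSKeyId="alias/my-dsse-key",      # placeholder KMS key alias
)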


Tuesday, January 27, 2026

What Amazon Bedrock Flows Does

Amazon Bedrock Flows is a visual workflow authoring and execution feature within Amazon Bedrock that lets developers and teams build, test, and deploy generative AI workflows without writing traditional code. It provides an intuitive drag-and-drop interface (and APIs/SDKs) for orchestrating sequences of AI tasks — combining prompts, foundation models (FMs), agents, knowledge bases, logic, and other AWS services into a reusable and versioned workflow (called a flow).

🔹 What Amazon Bedrock Flows Does

1. Visual Workflow Builder

Bedrock Flows gives you a graphical interface to drag, drop, and connect nodes representing steps in a GenAI workflow — such as model invocations, conditional logic, or integration points with services like AWS Lambda or Amazon Lex. You can also construct and modify flows using APIs or the AWS Cloud Development Kit (CDK).

2. Orchestration of Generative AI Workloads

Flows make it easy to link together:

foundation model prompts

AI agents

knowledge bases (for RAG)

business logic

external AWS services

into a cohesive workflow that responds to input data and produces a desired output.

3. Serverless Testing and Deployment

You can test flows directly in the AWS console with built-in input/output traceability, accelerate iteration, version your workflows for release management or A/B testing, and deploy them via API endpoints — all without managing infrastructure.

4. Enhanced Logic and Control

Flows consist of nodes and connections:

Nodes represent steps/operations (like invoking a model or evaluating a condition).

Connections (data or conditional) define how outputs feed into next steps.

This enables branching logic and complex, multi-stage execution paths.

5. Integration with AWS Ecosystem

Flows let you integrate generative AI with broader AWS tooling — such as Lambda functions for custom code, Amazon Lex for conversational interfaces, S3 for data input/output, and more — for complete, production-ready solutions.

🔹 Why It Matters

No-code/low-code AI orchestration: Non-developers or subject-matter experts can build sophisticated workflows.

Faster iteration: Test, version, and deploy generative AI applications more quickly.

Reusable AI logic: Flows can be versioned and reused across applications.

Supports complex AI use cases: Including multi-turn interactions and conditional behaviors.

🔹 Key Concepts

A flow is the workflow construct with a name, permissions, and connected nodes.

Nodes are the steps in the flow (inputs, actions, conditions).

Connections are the data or conditional links between nodes defining execution sequences.

In summary:

Amazon Bedrock Flows is a serverless AI workflow orchestration tool within AWS Bedrock that simplifies creating, testing, and deploying complex generative AI applications through a visual interface or APIs — enabling integration of foundation models, logic, and AWS services into scalable GenAI workflows.


Monday, January 26, 2026

What is VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language

 VL-JEPA (Vision-Language Joint Embedding Predictive Architecture) is a new vision-language model architecture that represents a major shift away from the typical generative, token-by-token approach used in most large multimodal models (like GPT-4V, LLaVA, InstructBLIP, etc.). Instead of learning to generate text tokens one after another, VL-JEPA trains a model to predict continuous semantic embeddings in a shared latent space that captures the meaning of text and visual content. (arXiv)

🧠 Core Idea

  • Joint Embedding Predictive Architecture (JEPA): The model adopts the JEPA philosophy: don’t predict low-level data (e.g., pixels or tokens) — predict meaningful latent representations. VL-JEPA applies this idea to vision-language tasks. (arXiv)

  • Predict Instead of Generate: Traditionally, vision-language models are trained to generate text outputs autoregressively (one token at a time). VL-JEPA instead predicts the continuous embedding vector of the target text given visual inputs and a query. This embedding represents the semantic meaning rather than the specific tokens. (arXiv)

  • Focus on Semantics: By operating in an abstract latent space, the model focuses on task-relevant semantics and reduces wasted effort modeling surface-level linguistic variability. (arXiv)

⚙️ How It Works

  1. Vision and Text Encoders:

    • A vision encoder extracts visual embeddings from images or video frames.

    • A text encoder maps query text and target text into continuous embeddings.

  2. Predictor:

    • The model’s core component predicts target text embeddings based on the visual context and input query, without generating actual text tokens. (arXiv)

  3. Selective Decoding:

    • When human-readable text is needed, a lightweight decoder can translate predicted embeddings into tokens. VL-JEPA supports selective decoding, meaning it only decodes what’s necessary — significantly reducing computation compared to standard autoregressive decoding. (alphaxiv.org)

🚀 Advantages

  • Efficiency: VL-JEPA uses roughly 50 % fewer trainable parameters than comparable token-generative vision-language models while maintaining or exceeding performance on many benchmarks. (arXiv)

  • Non-Generative Focus: The model is inherently non-generative during training, focusing on predicting semantic meaning, which leads to faster inference and reduced latency in applications like real-time video understanding. (DEV Community)

  • Supports Many Tasks: Without architectural changes, VL-JEPA naturally handles tasks such as open-vocabulary classification, text-to-video retrieval, and discriminative visual question answering (VQA). (arXiv)

📊 Performance

In controlled comparisons:

  • VL-JEPA outperforms or rivals established methods like CLIP, SigLIP2, and Perception Encoder on classification and retrieval benchmarks. (OpenReview)

  • On VQA datasets, it achieves performance comparable to classical VLMs (e.g., InstructBLIP, QwenVL) despite using fewer parameters. (OpenReview)


In summary, VL-JEPA moves beyond token generation toward semantic embedding prediction in vision-language models, offering greater efficiency and real-time capabilities without sacrificing general task performance. (arXiv)

references:

https://arxiv.org/abs/2512.10942

Sunday, January 25, 2026

Multi version concurrency control postgres

 In PostgreSQL, Multiversion Concurrency Control (MVCC) is the secret sauce that allows the database to handle multiple users at once without everyone stepping on each other's toes.

The core philosophy is simple: Readers never block writers, and writers never block readers.

How It Works: The "Snapshot" Concept

Instead of locking a row when someone wants to update it (which would force everyone else to wait), Postgres keeps multiple versions of that row. When you start a transaction, Postgres gives you a snapshot of the data as it existed at that exact moment.

1. Hidden Columns

Every row in a Postgres table has hidden system columns used for MVCC:

 * xmin: The ID of the transaction that inserted the row.

 * xmax: The ID of the transaction that deleted or updated the row (initially set to 0).

2. The Update Process

When you update a row, Postgres doesn't actually overwrite the old data. It performs a "soft delete" and an "insert":

 * It marks the old row version as expired by setting its xmax to the current Transaction ID.

 * It creates a new version of the row with the updated data and sets its xmin to the current Transaction ID.

3. Visibility Rules

When a different transaction tries to read that row, Postgres compares the transaction's ID with the xmin and xmax of the available versions:

 * If a row's xmin is from a committed transaction and its xmax is either 0 or belongs to an uncommitted transaction, the row is visible.

 * This ensures you always see a consistent state of the database, even if someone else is mid-update.
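To see these visibility columns yourself, they can be selected explicitly (they are hidden from SELECT *); a quick sketch with psycopg2, where the connection string and table name are placeholders:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder connection string
cur = conn.cursor()

# xmin and xmax are system columns, so they must be listed explicitly
cur.execute("SELECT xmin, xmax, * FROM accounts LIMIT 5")  # 'accounts' is a placeholder table
for row in cur.fetchall():
    print(row)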

The Side Effect: Bloat and VACUUM

Because Postgres keeps those old versions of rows around (often called "dead tuples"), the database files will eventually grow—this is known as bloat.

To clean this up, Postgres uses a process called VACUUM:

 * It scans for rows where the xmax belongs to a transaction that is long finished.

 * It marks that space as available for new data.

 * Autovacuum is the built-in daemon that handles this automatically in the background so you don't have to.

Why This Matters

 * Performance: High concurrency. You can run massive reports (READs) while your application is constantly updating data (WRITEs) without them fighting for locks.

 * Consistency: You get "Snapshot Isolation," meaning your query results won't change halfway through just because another user committed a change.

> Note: While readers and writers don't block each other, writers still block writers if they attempt to update the exact same row at the same time.

Would you like me to show you how to query these hidden xmin and xmax columns on one of your existing tables to see this in action?


Thursday, January 22, 2026

What is Spec Driven Development

Instead of coding first and writing docs later, in spec-driven development, you start with a (you guessed it) spec. This is a contract for how your code should behave and becomes the source of truth your tools and AI agents use to generate, test, and validate code. The result is less guesswork, fewer surprises, and higher-quality code.

Spec Kit makes your specification the center of your engineering process. Instead of writing a spec and setting it aside, the spec drives the implementation, checklists, and task breakdowns.  Your primary role is to steer; the coding agent does the bulk of the writing.

It works in four phases with clear checkpoints. But here’s the key insight: each phase has a specific job, and you don’t move to the next one until the current phase is fully validated.

Here’s how the process breaks down:

Specify: You provide a high-level description of what you’re building and why, and the coding agent generates a detailed specification. This isn’t about technical stacks or app design. It’s about user journeys, experiences, and what success looks like. Who will use this? What problem does it solve for them? How will they interact with it? What outcomes matter? Think of it as mapping the user experience you want to create, and letting the coding agent flesh out the details. Crucially, this becomes a living artifact that evolves as you learn more about your users and their needs.

Plan: Now you get technical. In this phase, you provide the coding agent with your desired stack, architecture, and constraints, and the coding agent generates a comprehensive technical plan. If your company standardizes on certain technologies, this is where you say so. If you’re integrating with legacy systems, have compliance requirements, or have performance targets you need to hit … all of that goes here. You can also ask for multiple plan variations to compare and contrast different approaches. If you make your internal docs available to the coding agent, it can integrate your architectural patterns and standards directly into the plan. After all, a coding agent needs to understand the rules of the game before it starts playing.

Tasks: The coding agent takes the spec and the plan and breaks them down into actual work. It generates small, reviewable chunks that each solve a specific piece of the puzzle. Each task should be something you can implement and test in isolation; this is crucial because it gives the coding agent a way to validate its work and stay on track, almost like a test-driven development process for your AI agent. Instead of “build authentication,” you get concrete tasks like “create a user registration endpoint that validates email format.”

Implement: Your coding agent tackles the tasks one by one (or in parallel, where applicable). But here’s what’s different: instead of reviewing thousand-line code dumps, you, the developer, review focused changes that solve specific problems. The coding agent knows what it’s supposed to build because the specification told it. It knows how to build it because the plan told it. And it knows exactly what to work on because the task told it.

Crucially, your role isn’t just to steer. It’s to verify. At each phase, you reflect and refine. Does the spec capture what you actually want to build? Does the plan account for real-world constraints? Are there omissions or edge cases the AI missed? The process builds in explicit checkpoints for you to critique what’s been generated, spot gaps, and course correct before moving forward. The AI generates the artifacts; you ensure they’re right.

Wednesday, January 21, 2026

Best RAG Strategies

Implementation Roadmap: Start Simple, Scale Smart

Don’t try to implement everything at once. Here’s a practical roadmap:


Phase 1: Foundation (Week 1)

Context-aware chunking (replace fixed-size splitting)

Basic vector search with proper embeddings

Measure baseline accuracy


Phase 2: Quick Wins (Week 2–3)

Add re-ranking (biggest accuracy boost for effort)

Implement query expansion (handles vague queries)

Measure improvement


Phase 3: Advanced (Week 4–6)

Add multi-query or agentic RAG (choose based on use case)

Implement self-reflection for critical queries

Fine-tune and optimize


Phase 4: Specialization (Month 2+)

Add contextual retrieval for high-value documents

Consider knowledge graphs if relationships matter

Fine-tune embeddings for domain-specific accuracy


references

https://pub.towardsai.net/i-spent-3-months-building-ra-systems-before-learning-these-11-strategies-1a8f6b4278aa

A simple example of fine-tuning an embedding model

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

def prepare_training_data():
    """Domain-specific query-document pairs"""
    return [
        ("What is EBITDA?", "EBITDA (Earnings Before Interest, Taxes..."),
        ("Explain capital expenditure", "Capital expenditure (CapEx) refers to..."),
        # ... thousands more pairs
    ]

def fine_tune_model():
    """Fine-tune on domain data"""
    # Load base model
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Prepare training data (wrap each pair as an InputExample for the DataLoader)
    train_examples = [InputExample(texts=[query, doc]) for query, doc in prepare_training_data()]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

    # Define loss function
    train_loss = losses.MultipleNegativesRankingLoss(model)

    # Train
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=3,
        warmup_steps=100
    )

    model.save('./fine_tuned_financial_model')
    return model

# Use fine-tuned model
embedding_model = SentenceTransformer('./fine_tuned_financial_model')


What is Late Chunking

What it does: Processes the entire document through the transformer before chunking the token embeddings (not the text).


The problem it solves: Traditional chunking loses long-distance context. Late chunking preserves full document context in each chunk’s embedding.


Conceptual example:


def late_chunk(text: str, chunk_size=512) -> list:

    """Embed full document BEFORE chunking"""

  

    # Step 1: Embed entire document (8192 tokens max)

    full_doc_token_embeddings = transformer_embed(text)  # Token-level

  

    # Step 2: Define chunk boundaries

    tokens = tokenize(text)

    chunk_boundaries = range(0, len(tokens), chunk_size)

  

    # Step 3: Pool token embeddings for each chunk

    chunks_with_embeddings = []

    for start in chunk_boundaries:

        end = start + chunk_size

        chunk_text = detokenize(tokens[start:end])

  

        # Mean pool token embeddings (preserves full doc context!)

        chunk_embedding = mean_pool(full_doc_token_embeddings[start:end])

        chunks_with_embeddings.append((chunk_text, chunk_embedding))

  

    return chunks_with_embeddings




Late chunking in the context of embeddings for GenAI (Generative AI) is a strategy used when processing large documents or datasets for vector embeddings, particularly in RAG (Retrieval-Augmented Generation) workflows.


Here’s a clear breakdown:


Definition


Late chunking means delaying the splitting of content into smaller pieces (chunks) until after embedding generation has started or the content has been initially processed.


Instead of splitting a large document into chunks before generating embeddings (which is early chunking), the model or system first generates embeddings for larger units (like full documents or sections) and then splits or processes them further later in the pipeline if needed.


Why use Late Chunking?

 * Preserves context: Early chunking may break semantic context by splitting sentences or paragraphs arbitrarily. Late chunking allows embeddings to capture larger context, improving similarity searches.

 * Efficient processing: You can generate embeddings for larger units first and only create smaller chunks if retrieval or indexing requires it, reducing unnecessary computations.

 * Dynamic retrieval granularity: Allows flexible adjustment of chunk size later depending on how the embeddings will be queried or used in the application.


Comparison to Early Chunking

| Feature | Early Chunking | Late Chunking |
|---|---|---|
| When text is split | Before embedding | After embedding or during retrieval |
| Context retention | Lower (may lose semantic links across chunks) | Higher (larger context retained) |
| Processing efficiency | May generate more embeddings unnecessarily | Can reduce embedding count |
| Use case | Simple search or small documents | Large documents, long-context GenAI applications |


💡 Example Scenario:


A book with 1000 pages is to be used in a RAG application.


Early chunking: Split into 2-page chunks first → 500 embeddings.


Late chunking: Generate embeddings for each chapter first → 20 embeddings, then split chapters into smaller chunks later only if needed.


This approach balances context preservation and computational efficiency.

Tuesday, January 20, 2026

What is an example of Graphiti with Neo4J

from graphiti_core import Graphiti

from graphiti_core.nodes import EpisodeType


# Initialize Graphiti (connects to Neo4j)

graphiti = Graphiti("neo4j://localhost:7687", "neo4j", "password")

async def ingest_document(text: str, source: str):

    """Ingest into knowledge graph"""

    # Graphiti automatically extracts entities and relationships

    await graphiti.add_episode(

        name=source,

        episode_body=text,

        source=EpisodeType.text,

        source_description=f"Document: {source}"

    )

async def search_knowledge_graph(query: str) -> str:

    """Hybrid search: semantic + keyword + graph"""

    # Graphiti combines:

    # - Semantic similarity (embeddings)

    # - BM25 keyword search

    # - Graph structure traversal

    # - Temporal context

  

    results = await graphiti.search(query=query, num_results=5)

  

    # Format graph results

    formatted = []

    for result in results:

        formatted.append(

            f"Entity: {result.node.name}\n"

            f"Type: {result.node.type}\n"

            f"Relationships: {result.relationships}"

        )

  

    return "\n---\n".join(formatted)





A Cross Encoder Example

from sentence_transformers import CrossEncoder

# Initialize once
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

# Assumes an async `embedder` and `db` client are defined elsewhere
async def search_with_reranking(query: str, limit: int = 5) -> list:
    # Stage 1: Fast vector retrieval (get 4x candidates)
    candidate_limit = min(limit * 4, 20)
    query_embedding = await embedder.embed_query(query)

    candidates = await db.query(
        "SELECT content, metadata FROM chunks ORDER BY embedding <=> $1 LIMIT $2",  # pgvector distance operator (assumed)
        query_embedding, candidate_limit
    )

    # Stage 2: Re-rank with cross-encoder
    pairs = [[query, row['content']] for row in candidates]
    scores = reranker.predict(pairs)

    # Sort by reranker scores and return top N
    reranked = sorted(
        zip(candidates, scores),
        key=lambda x: x[1],
        reverse=True
    )[:limit]

    return [doc for doc, score in reranked]

Saturday, January 17, 2026

How AWS Config, AWS Inspector, AWS Audit Manager and AWS Artifact work together?

 Excellent question — these four AWS services (Audit Manager, Config, Inspector, and Artifact) all relate to security, compliance, and governance, but they serve very different purposes within that ecosystem.

Let’s break them down in a clear, structured way 👇


🧩 High-Level Summary

| Service | Primary Purpose | Type |
|---|---|---|
| AWS Audit Manager | Continuously collects evidence and automates audit reporting for compliance frameworks | Compliance reporting tool |
| AWS Config | Tracks configuration changes and checks AWS resources against compliance rules | Configuration monitoring tool |
| Amazon Inspector | Scans workloads for vulnerabilities and security issues | Security assessment tool |
| AWS Artifact | Provides on-demand access to AWS compliance reports and agreements | Compliance documentation portal |

🧠 1. AWS Audit Manager

🔹 Purpose:

Helps you audit your AWS environment automatically to simplify compliance with frameworks like ISO 27001, GDPR, PCI-DSS, SOC 2, etc.

⚙️ How It Works:

  • Continuously collects evidence (data points) from AWS services (like Config, CloudTrail, IAM).

  • Maps them to control sets defined by compliance frameworks.

  • Generates audit-ready reports automatically.

📋 Key Features:

  • Prebuilt compliance frameworks and control mappings.

  • Automated evidence collection (no manual screenshots or data gathering).

  • Integration with AWS Organizations (multi-account audits).

  • Custom frameworks for internal governance.

🧭 Best For:

  • Compliance teams or auditors.

  • Organizations preparing for certifications or audits.

🧩 Example:

“Show me all evidence that IAM users require MFA.”
Audit Manager automatically gathers this proof over time.


⚙️ 2. AWS Config

🔹 Purpose:

Tracks and records configuration changes of AWS resources to ensure they remain compliant with desired settings or internal policies.

⚙️ How It Works:

  • Continuously records resource configurations (EC2, IAM, S3, VPC, etc.).

  • Allows you to define Config Rules (managed or custom using Lambda).

  • Detects non-compliant resources and triggers alerts or remediation.

📋 Key Features:

  • Real-time configuration tracking and history.

  • Compliance evaluation against internal or AWS standards.

  • Integration with CloudTrail and Security Hub.

🧭 Best For:

  • DevOps, security, and compliance teams wanting configuration drift detection.

  • Maintaining continuous resource compliance posture.

🧩 Example:

“Alert me if any S3 bucket becomes public.”
AWS Config continuously monitors and flags such violations.


🛡️ 3. Amazon Inspector

🔹 Purpose:

An automated vulnerability management service that scans workloads for security issues.

⚙️ How It Works:

  • Automatically discovers EC2 instances, container images (ECR), and Lambda functions.

  • Continuously scans for:

    • CVEs (Common Vulnerabilities and Exposures)

    • Misconfigurations

    • Software package vulnerabilities

  • Prioritizes findings by severity (CVSS score, exploitability).

📋 Key Features:

  • Continuous vulnerability scanning.

  • Agentless scanning for EC2 and container images.

  • Integration with AWS Security Hub, EventBridge, and Inspector dashboard.

  • Automatic remediation support.

🧭 Best For:

  • Security operations and compliance monitoring.

  • Continuous vulnerability assessment of compute resources.

🧩 Example:

“Detect and alert if any EC2 instance has a vulnerable OpenSSL version.”


📄 4. AWS Artifact

🔹 Purpose:

A self-service portal that provides AWS compliance reports, certifications, and agreements (e.g., SOC, ISO, PCI, GDPR).

⚙️ How It Works:

  • You access it from the AWS Console (no setup required).

  • Download third-party audit reports of AWS infrastructure.

  • Accept compliance agreements (e.g., Business Associate Addendum (BAA) for HIPAA).

📋 Key Features:

  • Central access to AWS’s own compliance evidence.

  • No cost; just authentication required.

  • Up-to-date compliance documentation and certifications.

🧭 Best For:

  • Compliance and legal teams.

  • Customers needing AWS compliance proof for audits.

🧩 Example:

“I need AWS’s SOC 2 Type II report to show my auditor.”
You download it directly from AWS Artifact.


⚖️ 5. Key Differences

| Feature | AWS Audit Manager | AWS Config | Amazon Inspector | AWS Artifact |
|---|---|---|---|---|
| Purpose | Automate collection of audit evidence | Monitor resource configurations | Detect vulnerabilities | Provide AWS compliance reports |
| Focus Area | Compliance automation | Configuration compliance | Security posture & CVE detection | External compliance documentation |
| Scope | Organization-level audits | Resource-level state | Instance, container, Lambda-level scanning | AWS infrastructure compliance |
| Customization | Custom frameworks | Custom Config rules | Custom scan targets | None (read-only portal) |
| Output | Audit reports, control evidence | Compliance dashboard | Vulnerability findings | Downloadable reports (PDF) |
| Integration with Others | Uses Config, CloudTrail, IAM data | Feeds data to Audit Manager, Security Hub | Integrates with Security Hub | Standalone portal |
| User Role | Auditors & compliance officers | DevOps/SecOps | Security engineers | Compliance/legal staff |

🧩 6. How They Work Together

Here’s how they complement each other in a real compliance workflow:

  1. AWS Config → Monitors your resource configurations.

  2. Amazon Inspector → Scans for vulnerabilities in EC2, ECR, Lambda.

  3. AWS Audit Manager → Collects evidence from Config, Inspector, IAM, etc., and maps it to compliance controls (e.g., SOC 2).

  4. AWS Artifact → Provides the official AWS compliance documentation to share with auditors.

📊 Example Flow:

Config detects non-compliant S3 → Inspector detects a vulnerability → Audit Manager collects both as audit evidence → Artifact provides AWS’s ISO report for your compliance pack.


🧠 Simple Analogy

| Service | Analogy |
|---|---|
| Audit Manager | Your automated audit assistant (collects compliance evidence) |
| Config | Your compliance monitor (tracks changes and deviations) |
| Inspector | Your security scanner (finds vulnerabilities) |
| Artifact | Your compliance library (stores AWS certifications and reports) |

Would you like me to add a diagram showing how these four services connect in a compliance architecture (e.g., flow from Config → Inspector → Audit Manager → Artifact)? It visually clarifies their interaction.

Wednesday, January 14, 2026

goroutine 5 - Using github.com/Jeffail/tunny (Third-party Library)

package main

import (
	"fmt"
	"time"

	"github.com/Jeffail/tunny"
)

func main() {
	const numTasks = 50

	// Create pool with 10 workers
	pool := tunny.NewFunc(10, func(payload interface{}) interface{} {
		taskID := payload.(int)
		// Simulate work
		time.Sleep(time.Millisecond * time.Duration(100+taskID*10))
		return fmt.Sprintf("Task %d completed", taskID)
	})
	defer pool.Close()

	// Submit tasks
	results := make([]interface{}, numTasks)
	for i := 1; i <= numTasks; i++ {
		go func(taskID int) {
			results[taskID-1] = pool.Process(taskID)
		}(i)
	}

	// Wait for tasks (simple wait for demo)
	time.Sleep(3 * time.Second)

	// Count completed tasks
	completed := 0
	for _, result := range results {
		if result != nil {
			completed++
			fmt.Println(result.(string))
		}
	}
	fmt.Printf("\nSummary: %d out of %d tasks completed\n", completed, numTasks)
}

goroutine 4 - ErrGroup with Context and Worker Pool

package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/errgroup"
)

func workerPool(ctx context.Context, numWorkers, numTasks int) ([]string, error) {
	taskChan := make(chan int, numTasks)
	resultChan := make(chan string, numTasks)

	// Create worker pool
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(numWorkers)

	// Start workers
	for i := 0; i < numWorkers; i++ {
		g.Go(func() error {
			for {
				select {
				case <-ctx.Done():
					return ctx.Err()
				case taskID, ok := <-taskChan:
					if !ok {
						return nil
					}
					// Process task
					time.Sleep(time.Millisecond * time.Duration(100+taskID*10))
					resultChan <- fmt.Sprintf("Processed task %d", taskID)
				}
			}
		})
	}

	// Feed tasks
	go func() {
		for i := 1; i <= numTasks; i++ {
			taskChan <- i
		}
		close(taskChan)
	}()

	// Collect results
	var results []string
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for result := range resultChan {
			results = append(results, result)
		}
	}()

	// Wait for completion
	if err := g.Wait(); err != nil {
		return nil, err
	}
	close(resultChan)
	wg.Wait()

	return results, nil
}

func main() {
	ctx := context.Background()
	results, err := workerPool(ctx, 10, 50)
	if err != nil {
		fmt.Printf("Error: %v\n", err)
		return
	}
	fmt.Printf("Summary: Completed %d tasks\n", len(results))
}

goroutine - 3 - Buffered Channel as Semaphore (Classic Go Pattern)

package main

import (
	"fmt"
	"sync"
	"time"
)

func processTask(taskID int, sem chan struct{}, results chan<- string) {
	defer func() { <-sem }() // Release the semaphore
	// Simulate work
	time.Sleep(time.Millisecond * time.Duration(100+taskID*10))
	results <- fmt.Sprintf("Task %d completed", taskID)
}

func main() {
	const totalTasks = 50
	const maxConcurrency = 10

	sem := make(chan struct{}, maxConcurrency)
	results := make(chan string, totalTasks)
	var wg sync.WaitGroup

	// Launch tasks
	for i := 1; i <= totalTasks; i++ {
		wg.Add(1)
		go func(taskID int) {
			defer wg.Done()
			sem <- struct{}{} // Acquire semaphore
			processTask(taskID, sem, results)
		}(i)
	}

	// Wait for all tasks
	go func() {
		wg.Wait()
		close(results)
	}()

	// Collect results
	var completed []string
	for result := range results {
		completed = append(completed, result)
		fmt.Println(result)
	}

	fmt.Printf("\nSummary: Completed %d tasks with max %d concurrent workers\n",
		len(completed), maxConcurrency)
}