Tuesday, February 10, 2026

AWS Internet Gateway and NAT Gateways for Public and Private Subnets in VPC

 Excellent — this is a very important AWS networking topic 👏

Let’s break it down step-by-step, covering:

  • how public and private subnets work,

  • how Internet Gateway (IGW) and NAT Gateway (NGW) fit into the picture,

  • and how traffic flows between them.


🏗️ 1. VPC (Virtual Private Cloud) Recap

A VPC is your own private, isolated network in AWS.
You define:

  • IP range (e.g. 10.0.0.0/16)

  • Subnets (smaller slices of that range)

  • Route tables (traffic rules)

  • Gateways (for internet or private connectivity)

Everything — EC2, RDS, Load Balancer, etc. — lives inside the VPC.


🌍 2. Public Subnet

A Public Subnet is a subnet that has:

  1. A route to the Internet Gateway (IGW) in its route table.

  2. Instances with public IPs or Elastic IPs.

Result:
Instances in this subnet can send and receive traffic directly from the Internet.

Example:

  • Web servers

  • Bastion hosts

  • NAT gateways

Route Table Example (Public Subnet):

| Destination | Target |
|---|---|
| 10.0.0.0/16 | local |
| 0.0.0.0/0 | igw-xxxxxx |
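
A minimal boto3 sketch of wiring up that public route (the VPC and route table IDs below are placeholders for your own resources):

```python
import boto3

ec2 = boto3.client("ec2")

vpc_id = "vpc-0123456789abcdef0"        # placeholder
public_rt_id = "rtb-0aaa1111bbbb2222c"  # placeholder: the public subnet's route table

# Create an Internet Gateway and attach it to the VPC
igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

# Default route: send all non-local traffic to the IGW
ec2.create_route(
    RouteTableId=public_rt_id,
    DestinationCidrBlock="0.0.0.0/0",
    GatewayId=igw_id,
)
```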

🔒 3. Private Subnet

A Private Subnet has no direct route to the Internet Gateway.
It cannot be reached directly from outside the VPC.

Instead, if resources inside need to access the Internet (for updates, APIs, etc.), they go through a NAT Gateway in a Public Subnet.

Example:

  • Application servers

  • Databases

  • Internal microservices

Route Table Example (Private Subnet):

| Destination | Target |
|---|---|
| 10.0.0.0/16 | local |
| 0.0.0.0/0 | nat-xxxxxx |
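
A companion boto3 sketch for the private side (again, the subnet and route table IDs are placeholders):

```python
import boto3

ec2 = boto3.client("ec2")

public_subnet_id = "subnet-0pub0000000000000"  # placeholder: the NAT Gateway lives here
private_rt_id = "rtb-0priv000000000000"        # placeholder: the private subnet's route table

# A NAT Gateway needs an Elastic IP and must sit in a public subnet
eip_alloc = ec2.allocate_address(Domain="vpc")["AllocationId"]
nat_id = ec2.create_nat_gateway(
    SubnetId=public_subnet_id, AllocationId=eip_alloc
)["NatGateway"]["NatGatewayId"]

# Wait until it is available before pointing routes at it
ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])

# Private subnet default route: Internet-bound traffic goes to the NAT Gateway
ec2.create_route(
    RouteTableId=private_rt_id,
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId=nat_id,
)
```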

🌐 4. Internet Gateway (IGW)

The Internet Gateway is what connects your VPC to the public Internet.
It acts as a bridge that allows:

  • Outbound traffic from public instances to the Internet.

  • Inbound traffic (e.g. users accessing your public web servers).

Key facts:

  • One IGW per VPC (at most).

  • Must be attached to your VPC.

  • Only works with instances that have:

    • Public IP (or Elastic IP)

    • Subnet route to IGW

Command analogy:

IGW = door between your VPC and the Internet.


🛡️ 5. NAT Gateway (Network Address Translation Gateway)

The NAT Gateway allows private subnet instances to initiate outbound connections to the Internet —
but prevents inbound connections from the Internet.

Use Case:
You want your backend servers (in private subnets) to:

  • Download software updates

  • Call external APIs

  • Send telemetry data

—but not be reachable from outside.

How it works:

  • Deployed inside a Public Subnet

  • Has an Elastic IP

  • The private subnet route table sends Internet-bound traffic (0.0.0.0/0) to this NAT Gateway


🔁 6. How Traffic Flows

Let’s visualize two cases:


🌍 Public Subnet (with Internet Gateway)

User → Internet → IGW → Public Subnet → EC2 (Web Server)
  • Inbound traffic from Internet to EC2 works.

  • Outbound (e.g. software update) works too.


🔒 Private Subnet (with NAT Gateway)

EC2 (App Server in Private Subnet)
   │
   └──► Route (0.0.0.0/0) → NAT Gateway (in Public Subnet)
                                │
                                ▼
                            Internet Gateway → Internet
  • Outbound works (e.g., to fetch updates).

  • Inbound does not work (Internet cannot initiate a connection back).


📊 7. Architecture Diagram (Text Form)

                   ┌──────────────────────────┐
                   │        Internet          │
                   └───────────┬──────────────┘
                               │
                        ┌──────▼──────┐
                        │ Internet    │
                        │  Gateway    │
                        └──────┬──────┘
                               │
                 ┌─────────────┴──────────────┐
                 │         VPC (10.0.0.0/16)  │
                 │                            │
                 │   ┌────────────────────┐   │
                 │   │ Public Subnet      │   │
                 │   │ (10.0.1.0/24)      │   │
                 │   │ Route → IGW        │   │
                 │   │ [NAT Gateway]      │   │
                 │   │ [Web Server]       │   │
                 │   └────────────────────┘   │
                 │             │               │
                 │   ┌────────────────────┐    │
                 │   │ Private Subnet     │    │
                 │   │ (10.0.2.0/24)      │    │
                 │   │ Route → NAT Gateway│    │
                 │   │ [App / DB Server]  │    │
                 │   └────────────────────┘    │
                 │                            │
                 └────────────────────────────┘

⚡ 8. Summary Table

| Component | Purpose | Located In | Connects To | Allows Inbound? | Allows Outbound? |
|---|---|---|---|---|---|
| VPC | Isolated network | AWS Region | — | — | — |
| Public Subnet | Hosts Internet-facing resources | VPC | IGW | ✅ | ✅ |
| Private Subnet | Internal resources | VPC | NAT Gateway | ❌ | ✅ (via NAT) |
| Internet Gateway (IGW) | Connects VPC ↔ Internet | VPC level | Public Subnets | ✅ | ✅ |
| NAT Gateway (NGW) | Outbound Internet for private subnets | Public Subnet | IGW | ❌ | ✅ |

Would you like me to generate a visual diagram (image) showing this complete flow (VPC, Public/Private Subnets, IGW, NAT Gateway, EC2 instances, and traffic arrows)? It’ll make this concept instantly clear.

What is AWS VPC ?

 Perfect — these are foundational AWS networking concepts, especially if you’re working with EC2, RDS, or Lambda in a secure setup.

Let’s break it down simply and visually 👇


🏗️ 1. VPC (Virtual Private Cloud)

A VPC is your own isolated network within AWS.
It’s like your own private data center inside AWS where you define:

  • IP address range (CIDR), e.g., 10.0.0.0/16

  • Subnets (smaller networks)

  • Route tables (rules for traffic flow)

  • Security groups and Network ACLs

You can think of a VPC as your virtual network boundary that connects all your AWS resources.
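
As a rough boto3 sketch of carving out that boundary (the CIDRs match the examples below; the Availability Zone is an arbitrary choice):

```python
import boto3

ec2 = boto3.client("ec2")

# The VPC and its overall address range
vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]

# A public and a private subnet carved out of that range
public_subnet_id = ec2.create_subnet(
    VpcId=vpc_id, CidrBlock="10.0.1.0/24", AvailabilityZone="us-east-1a"
)["Subnet"]["SubnetId"]
private_subnet_id = ec2.create_subnet(
    VpcId=vpc_id, CidrBlock="10.0.2.0/24", AvailabilityZone="us-east-1a"
)["Subnet"]["SubnetId"]

# A route table per subnet tier (routes are added later: IGW for public, NAT for private)
public_rt_id = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.associate_route_table(RouteTableId=public_rt_id, SubnetId=public_subnet_id)
```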


🌐 2. Route Table

A Route Table defines where network traffic should go.

Each subnet in a VPC is associated with one route table.

  • It contains routes that tell traffic what to do:

    • Local route: for communication inside the VPC
      (e.g., 10.0.0.0/16 → local)

    • Internet route: via Internet Gateway
      (e.g., 0.0.0.0/0 → igw-xxxxx)

    • Private route: via NAT Gateway or VPN


🕸️ 3. Subnets

A subnet is a segment of your VPC’s IP range — basically a “smaller network” inside your VPC.
You create multiple subnets to separate resources logically and securely.

There are two main types:

🌍 Public Subnet

  • Has a route to the Internet Gateway (IGW).

  • EC2 instances in this subnet can have public IPs and are reachable from the internet.

  • Used for:

    • Load balancers

    • Bastion hosts

    • NAT gateways

🔒 Private Subnet

  • No direct route to the Internet Gateway.

  • Internet access happens via a NAT Gateway (optional).

  • Used for:

    • Databases (RDS)

    • Backend servers

    • Application instances not directly exposed to the internet


📊 4. Typical Architecture Diagram

Here’s a clear text-based diagram:

                  ┌────────────────────────────┐
                  │        Internet             │
                  └────────────┬───────────────┘
                               │
                        ┌──────▼───────┐
                        │ Internet     │
                        │  Gateway     │
                        └──────┬───────┘
                               │
                 ┌─────────────┴────────────────┐
                 │         VPC (10.0.0.0/16)    │
                 │                              │
                 │   ┌──────────────────────┐    │
                 │   │  Public Subnet       │    │
                 │   │  (10.0.1.0/24)       │    │
                 │   │   Route: 0.0.0.0/0→IGW│   │
                 │   │                      │    │
                 │   │ [EC2: Web Server]    │    │
                 │   └──────────────────────┘    │
                 │               │                │
                 │   ┌──────────────────────┐     │
                 │   │  Private Subnet      │     │
                 │   │  (10.0.2.0/24)       │     │
                 │   │ Route: 0.0.0.0/0→NAT │     │
                 │   │                      │     │
                 │   │ [EC2: App Server]    │     │
                 │   │ [RDS: Database]      │     │
                 │   └──────────────────────┘     │
                 │                              │
                 └──────────────────────────────┘

🧭 5. Summary Table

| Concept | Description | Example |
|---|---|---|
| VPC | Isolated virtual network in AWS | 10.0.0.0/16 |
| Route Table | Rules defining where traffic goes | 0.0.0.0/0 → igw-xxxx |
| Public Subnet | Subnet with a route to the Internet Gateway | For web servers |
| Private Subnet | Subnet without direct internet access | For databases, backend servers |
| Internet Gateway (IGW) | Enables communication between VPC and the internet | Outbound/inbound for public resources |
| NAT Gateway | Lets private subnet instances access the internet (outbound only) | For patch downloads, API calls |

Would you like me to generate a visual diagram (image) version of this architecture (Public + Private subnets, IGW, NAT Gateway, EC2, and RDS)? It’ll make the concept instantly clear.

Monday, February 9, 2026

What are Cognito User Pools and Cognito Identity Pools?

 

 1. Cognito User Pool

Purpose:
➡️ Manages user authentication (who you are).

Think of a User Pool as a user directory that stores user credentials and handles:

  • Sign-up and sign-in (username/password, email, phone, etc.)

  • MFA (Multi-Factor Authentication) and password policies

  • User profile attributes (name, email, etc.)

  • Token issuance:

    • ID Token (user identity)

    • Access Token (API access)

    • Refresh Token (to renew)

Example Use Case:

  • You want users to sign in directly to your app using email + password or Google login.

  • You want Cognito to handle authentication, user registration, password reset, etc.

→ Output: Authenticated user tokens (JWTs).
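
For example, a server-side sign-in with boto3 returns exactly those three tokens (the app client ID is a placeholder, and this assumes the USER_PASSWORD_AUTH flow is enabled on that client):

```python
import boto3

idp = boto3.client("cognito-idp", region_name="us-east-1")

resp = idp.initiate_auth(
    ClientId="your-app-client-id",  # placeholder
    AuthFlow="USER_PASSWORD_AUTH",
    AuthParameters={"USERNAME": "jane@example.com", "PASSWORD": "..."},
)

tokens = resp["AuthenticationResult"]
print(tokens["IdToken"])       # who the user is
print(tokens["AccessToken"])   # what APIs they may call
print(tokens["RefreshToken"])  # used to renew the other two
```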


🧭 2. Cognito Identity Pool

Purpose:
➡️ Provides AWS credentials (what you can access).

An Identity Pool gives your users temporary AWS credentials (STS tokens) so they can access AWS resources (like S3, DynamoDB, or Lambda) directly.

It can:

  • Accept identities from Cognito User Pools

  • Or from federated identity providers, like:

    • Google, Facebook, Apple, etc.

    • SAML / OpenID Connect providers

    • Even unauthenticated (guest) users

→ Output: AWS access key and secret key (temporary credentials).


🧩 3. How They Work Together

They can be used independently or together:

| Scenario | What You Use | Description |
|---|---|---|
| Only need user sign-up/sign-in (like a typical web app) | User Pool only | You don't need AWS resource access. |
| Need to allow users to access AWS services (like S3 upload, DynamoDB read, etc.) | Both User Pool + Identity Pool | Authenticate the user via the User Pool, then exchange the JWT for temporary AWS credentials from the Identity Pool. |
| Want to allow guest users or social logins to access AWS directly | Identity Pool only | No User Pool required; the Identity Pool issues temporary AWS credentials directly (including to unauthenticated guests). |
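
A hedged boto3 sketch of the middle row's token exchange (the Identity Pool ID, User Pool provider name, and ID token are placeholders):

```python
import boto3

identity = boto3.client("cognito-identity", region_name="us-east-1")

id_token = "eyJraWQiOi..."  # placeholder: the ID token issued by the User Pool
provider = "cognito-idp.us-east-1.amazonaws.com/us-east-1_EXAMPLE"  # placeholder

# 1. Map the User Pool identity to an Identity Pool identity
identity_id = identity.get_id(
    IdentityPoolId="us-east-1:11111111-2222-3333-4444-555555555555",  # placeholder
    Logins={provider: id_token},
)["IdentityId"]

# 2. Exchange it for temporary AWS credentials (STS)
creds = identity.get_credentials_for_identity(
    IdentityId=identity_id,
    Logins={provider: id_token},
)["Credentials"]

# 3. Use the temporary credentials against AWS services, e.g. S3
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretKey"],
    aws_session_token=creds["SessionToken"],
)
```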

Wednesday, February 4, 2026

How AppSync can be used with Lambda resolvers for Bedrock inferencing

Using **AWS AppSync with Lambda resolvers** is a flexible way to integrate GraphQL with **Amazon Bedrock**. While AppSync now supports direct integration with Bedrock (no-code), using a Lambda resolver is still preferred when you need to perform **data validation, prompt engineering, or complex post-processing** before returning the AI's response to the client.


### The Architectural Flow


1. **Client Request:** A user sends a GraphQL query or mutation (e.g., `generateSummary(text: String!)`) to the AppSync endpoint.

2. **AppSync Resolver:** AppSync identifies the field and triggers the associated **Lambda Data Source**.

3. **Lambda Function:** The function receives the GraphQL arguments, constructs a prompt, and calls the **Bedrock Runtime API**.

4. **Bedrock Inference:** Bedrock processes the prompt and returns a JSON response.

5. **Return to Client:** Lambda parses the result and returns it to AppSync, which maps it back to the GraphQL schema.


---


### Step-by-Step Implementation


#### 1. Define the GraphQL Schema


In the AppSync console, define the types and the mutation that will trigger the AI.


```graphql

type AIResponse {

  content: String

  usage: String

}


type Mutation {

  askBedrock(prompt: String!): AIResponse

}


```


#### 2. Create the Lambda Resolver (Node.js Example)


The Lambda function acts as the "middleman." It uses the `@aws-sdk/client-bedrock-runtime` to communicate with the foundation models.


```javascript

import { BedrockRuntimeClient, InvokeModelCommand } from "@aws-sdk/client-bedrock-runtime";


const client = new BedrockRuntimeClient({ region: "us-east-1" });


export const handler = async (event) => {

  // Extract the prompt from the AppSync 'arguments' object

  const { prompt } = event.arguments;


  const input = {

    modelId: "anthropic.claude-3-haiku-20240307-v1:0",

    contentType: "application/json",

    accept: "application/json",

    body: JSON.stringify({

      anthropic_version: "bedrock-2023-05-31",

      max_tokens: 500,

      messages: [{ role: "user", content: prompt }],

    }),

  };


  try {

    const command = new InvokeModelCommand(input);

    const response = await client.send(command);

    

    // Decode and parse the binary response body

    const responseBody = JSON.parse(new TextDecoder().decode(response.body));

    

    return {

      content: responseBody.content[0].text,

      usage: "Success"

    };

  } catch (error) {

    console.error(error);

    throw new Error("Failed to invoke Bedrock");

  }

};


```


#### 3. Configure IAM Permissions


Your Lambda function's execution role **must** have permission to call the specific Bedrock model.


```json

{

  "Version": "2012-10-17",

  "Statement": [

    {

      "Effect": "Allow",

      "Action": "bedrock:InvokeModel",

      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0"

    }

  ]

}


```


---


### Why use Lambda instead of AppSync's Direct Bedrock Integration?


While AppSync can now talk to Bedrock directly using specialized "Direct Data Source" resolvers, the **Lambda approach** is better for:


* **Prompt Orchestration:** You can fetch additional data from DynamoDB or a vector database (like Pinecone or OpenSearch) to augment the prompt (**RAG architecture**) before sending it to Bedrock.

* **Response Sanitization:** You can filter the AI's output for PII (Personally Identifiable Information) or toxic content before it reaches the user.

* **Logging & Auditing:** You can easily log exact input/output tokens to CloudWatch for cost tracking and performance monitoring.

* **Error Handling:** You can provide custom "fallback" responses if the AI service is throttled or the prompt violates safety filters.


### Handling Long-Running Queries


Standard Lambda-based GraphQL queries have a **30-second timeout**. If the model (like Claude 3 Opus) takes longer to generate a response, the query will fail. In those cases, it is recommended to use **AppSync Subscriptions** to stream the response back to the client token-by-token.
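
As a minimal sketch of the streaming call itself (shown here in Python with boto3 rather than the Node.js SDK used above; publishing each chunk to clients, e.g. via an AppSync mutation that fans out to subscribers, is omitted):

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

stream = bedrock.invoke_model_with_response_stream(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    contentType="application/json",
    accept="application/json",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 500,
        "messages": [{"role": "user", "content": "Summarize GraphQL in one paragraph."}],
    }),
)

for event in stream["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    # Anthropic models emit generated text as content_block_delta events
    if chunk.get("type") == "content_block_delta":
        token = chunk["delta"].get("text", "")
        print(token, end="", flush=True)  # in AppSync, publish this chunk instead
```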


How to use Amazon Kinesis Data Analytics for GraphQL?

Using **Amazon Kinesis Data Analytics** (now called **Amazon Managed Service for Apache Flink**) to parse GraphQL is unique because GraphQL queries are sent as **strings** inside a JSON payload. Unlike standard JSON, you cannot simply use a "dot" notation to access fields inside the query; you must parse the GraphQL DSL (Domain Specific Language) itself.


There are three main ways to achieve this, depending on how much detail you need from the query.


---


### 1. The "Robust" Path: Apache Flink with a Parser Library


If you need to extract specific fields (e.g., "how many times was the `email` field requested?"), you should use the **Managed Service for Apache Flink** with a custom Java or Python application.


* **How it works:** You write a Flink application that includes a GraphQL parsing library (like `graphql-java` for Java or `graphql-core` for Python).

* **The Logic:**

1. Flink consumes the JSON record from the Kinesis Stream.

2. A `MapFunction` extracts the `query` string from the JSON.

3. The parser library converts that string into an **AST (Abstract Syntax Tree)**.

4. You traverse the tree to find the operation name, fragments, or specific leaf fields.



* **Best for:** Deep security auditing, complexity analysis, or fine-grained usage billing.


### 2. The "Simple" Path: Kinesis SQL with Regex


If you only need to extract the **Operation Name** or verify the presence of a specific keyword, you can use the Legacy SQL runtime (or Flink SQL).


* **How it works:** Use the `REGEXP_EXTRACT` function to find patterns within the query string.

* **Example SQL:**

```sql

SELECT 

    STREAM_NAME,

    REGEXP_EXTRACT(query_payload, 'query\s+(\w+)') AS operation_name

FROM "SOURCE_SQL_STREAM_001";


```



* **Best for:** Real-time dashboards showing which queries (by name) are most popular.

* **Limitation:** This is very brittle. If a user changes their whitespace or uses aliases, the regex will likely fail.


### 3. The "Hybrid" Path: Lambda Pre-processing


The most common production pattern is to parse the GraphQL **before** it reaches Kinesis Analytics using a **Kinesis Data Firehose Transformation Lambda**.


1. **Ingest:** Data is sent to Kinesis Data Firehose.

2. **Transform:** Firehose triggers an AWS Lambda function.

3. **Parse:** The Lambda uses a standard GraphQL library to parse the query and flattens it into a standard JSON object (e.g., `{"operation": "GetUser", "fields": ["id", "name"]}`).

4. **Analyze:** The flattened JSON is sent to Kinesis Analytics, which can now use simple SQL to analyze the data because it is no longer a complex string.
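
A sketch of that transformation Lambda using `graphql-core` (the record shape follows the standard Firehose transformation event; the `query` field name inside the payload is an assumption about your log format):

```python
import base64
import json

from graphql import parse  # graphql-core

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        flattened = {"operation": None, "fields": []}
        try:
            ast = parse(payload.get("query", ""))
            for definition in ast.definitions:
                # Operation name, if the client supplied one
                if getattr(definition, "name", None):
                    flattened["operation"] = definition.name.value
                # Top-level fields requested by the operation
                for selection in definition.selection_set.selections:
                    name_node = getattr(selection, "name", None)
                    if name_node:
                        flattened["fields"].append(name_node.value)
        except Exception:
            flattened["operation"] = "PARSE_ERROR"

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode((json.dumps(flattened) + "\n").encode()).decode(),
        })
    return {"records": output}
```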


---


### Comparison of Methods


| Feature | Flink + Parser Library | SQL + Regex | Lambda Pre-processor |
| --- | --- | --- | --- |
| **Parsing Depth** | Full (AST level) | Shallow (Pattern matching) | Full (JSON flattening) |
| **Complexity** | High (Custom Code) | Low (Standard SQL) | Moderate (Simple Lambda) |
| **Performance** | Highest (Native) | High | Moderate (Lambda overhead) |
| **Use Case** | Advanced Analytics | Basic Dashboards | General Purpose ETL |


### Next Step


Would you like me to provide a **sample Python script** for a Lambda transformer that flattens a GraphQL query for Kinesis, or would you prefer a **Java snippet** for a Flink MapFunction?


What is the difference between Apollo and AppSync when integrating with Bedrock?

 


Integrating **Apollo GraphQL** with **AWS Bedrock** creates a powerful bridge between your frontend and generative AI models. While Apollo manages your "Data Graph," Bedrock provides the "Intelligence" layer.


In this architecture, Apollo acts as the **orchestrator**, translating GraphQL queries into Bedrock API calls and shaping the AI's response to match your application's schema.


---


### 1. The Architectural Flow


The most common way to integrate these is by hosting an **Apollo Server** (on AWS Lambda, ECS, or Fargate) that uses the **AWS SDK** to communicate with Bedrock.


1. **Client Query:** The frontend sends a GraphQL query (e.g., `askAI(prompt: "...")`).

2. **Apollo Resolver:** A specific function in your Apollo Server intercepts the query.

3. **Bedrock Runtime:** The resolver calls the `InvokeModel` or `Converse` API via the `@aws-sdk/client-bedrock-runtime`.

4. **Schema Mapping:** Apollo transforms the raw JSON response from the AI (like Claude or Llama) into the structured format defined in your GraphQL schema.


---


### 2. Implementation Patterns


#### A. The "Standard" Apollo Resolver


In this pattern, you define a `Mutation` or `Query` in your schema. The resolver is responsible for the "heavy lifting."


```javascript

// Example Resolver logic

const resolvers = {

  Mutation: {

    generateResponse: async (_, { prompt }, { bedrockClient }) => {

      const command = new InvokeModelCommand({

        modelId: "anthropic.claude-3-sonnet",

        body: JSON.stringify({

          prompt: `\n\nHuman: ${prompt}\n\nAssistant:`,

          max_tokens_to_sample: 300,

        }),

        contentType: "application/json",

      });


      const response = await bedrockClient.send(command);

      const resBody = JSON.parse(new TextDecoder().decode(response.body));

      return { text: resBody.content[0].text };

    },

  },

};


```


#### B. Streaming with Subscriptions


AI responses take time. To avoid timeouts and improve UX, you can use **GraphQL Subscriptions**.


* The client **subscribes** to a response channel.

* Apollo Server uses `InvokeModelWithResponseStream` to get tokens incrementally from Bedrock.

* As tokens arrive, Apollo "publishes" them to the subscription, appearing instantly on the user's screen.


---


### 3. Apollo vs. AWS AppSync for Bedrock


While you can build this manually with Apollo, AWS offers a managed GraphQL service called **AppSync** which has a native integration.


| Feature | Apollo Server (Self-Managed) | AWS AppSync (Managed) |
| --- | --- | --- |
| **Setup** | High control; requires hosting (Lambda/ECS). | Fully managed; serverless by default. |
| **Bedrock Integration** | Via **AWS SDK** in resolvers. | **Direct Bedrock Resolvers** (no code/Lambda needed). |
| **Streaming** | Requires WebSocket setup (Apollo Subscriptions). | Built-in via serverless WebSockets. |
| **Type Safety** | High (native GraphQL). | High (native GraphQL). |


---


### 4. Key Use Cases


* **Self-Documenting AI:** Bedrock Agents can use your Apollo GraphQL endpoint as an "Action Group." Because GraphQL is introspectable, the AI can "read" your schema to understand what data it can fetch.

* **Data Aggregation:** You can create a field like `aiSummary` on a `Product` type. When queried, Apollo fetches the product data from DynamoDB and simultaneously asks Bedrock to summarize it.


### Next Steps


Would you like me to **provide a full boilerplate for a Bedrock-enabled Apollo Server** or explain how to set up **AppSync’s direct Bedrock resolvers**?

Explain how API Gateway with VTL transformations works for GraphQL

In AWS, using **API Gateway with VTL (Velocity Template Language) for GraphQL** usually refers to one of two patterns: either using a REST API Gateway as a "proxy" or "translator" in front of a GraphQL backend, or utilizing the VTL engine within **AWS AppSync** (which is AWS’s dedicated managed GraphQL service).


VTL is a logic-heavy templating engine that allows you to manipulate data as it flows through the gateway without needing to spin up a compute resource like AWS Lambda.


### 1. The Core Role of VTL


VTL acts as a **data transformer**. In a typical request/response flow:


* **Request Mapping:** When a client sends a GraphQL query (which is essentially a JSON POST request), VTL intercepts it and reshapes it into a format the backend understands (e.g., a specific DynamoDB JSON structure or a specialized HTTP request).

* **Response Mapping:** When the backend returns data, VTL takes that raw output and reshapes it to match the GraphQL schema’s expected format before sending it back to the client.


### 2. Common Implementation Patterns


#### Pattern A: AppSync Resolvers (Native GraphQL)


AppSync is the primary service where VTL is used for GraphQL. It uses **Resolvers** to connect fields in your GraphQL schema to data sources.


* **Logic without Code:** You can write VTL to perform validation, authorization checks, or conditional logic (e.g., "if the user is an Admin, return the full object; otherwise, hide the email field").

* **Direct Integration:** VTL can translate a GraphQL `getPost(id: "123")` query directly into a DynamoDB `GetItem` request, bypassing the need for a Lambda function entirely. This reduces latency and cost.


#### Pattern B: API Gateway (REST) as a Proxy


If you have a REST API Gateway sitting in front of a GraphQL server (like an Apollo server running on Fargate):


* **Request Sanitization:** Use VTL to extract specific headers or JWT claims and inject them into the GraphQL query's `variables` or `context`.

* **Payload Wrapping:** API Gateway can take a simple REST-style input and wrap it into a valid GraphQL `query` string using VTL to send it to a downstream GraphQL endpoint.


### 3. Key VTL Features used in GraphQL


* **`$context` (or `$ctx`)**: An object containing all request details, including arguments (`$ctx.args`), identity/auth info (`$ctx.identity`), and previous results in a pipeline.

* **`$util`**: A library of helper functions for common tasks like generating UUIDs (`$util.autoId()`), converting strings to JSON, or escaping characters.

* **Directives**: VTL supports `#if`, `#foreach`, and `#set`, allowing you to loop through a list of items returned by a database and format them as a GraphQL array.


### Why use VTL instead of Lambda?


* **Performance:** VTL runs natively within the API Gateway or AppSync service. There is no "cold start" associated with it.

* **Cost:** You are not charged for the execution time of VTL templates (you pay for the API request), whereas Lambda incurs a cost per execution and duration.

* **Simplicity:** For simple data mapping (e.g., renaming a field or extracting an ID), a few lines of VTL is often easier to maintain than a full deployment package of code.


You can learn more about the mechanics of data transformations in this [AWS API Gateway Tutorial](https://www.youtube.com/watch?v=-_nYddYkd7M). This video explains how mapping templates function to restructure requests and responses, which is the foundational concept for handling GraphQL payloads.

Friday, January 30, 2026

Create own rag evaluation strategy

 Building a RAG (Retrieval-Augmented Generation) evaluation from scratch is actually a great way to deeply understand where your pipeline is failing. While frameworks like Ragas or Arize Phoenix are popular, they are essentially just wrappers for specific prompts and math.

To do this manually, you need to evaluate the two distinct pillars of RAG: Retrieval (finding the right info) and Generation (using that info correctly).

1. The Evaluation Dataset

You can’t evaluate without a "Golden Dataset." Create a spreadsheet with 20–50 rows containing:

 * Question: What the user asks.

 * Context/Source: The specific document snippet that contains the answer.

 * Ground Truth: The ideal, "perfect" answer.

2. Evaluating Retrieval (The "Search" Part)

This measures if your vector database is actually finding the right documents. You don't need an LLM for this; you just need basic math.

 * Hit Rate (Precision at K): Did the correct document appear in the top k results?

   * Calculation: (Number of queries where the right doc was found) / (Total queries).

 * Mean Reciprocal Rank (MRR): Measures where the right document appeared. It rewards the system more for having the correct answer at rank 1 than rank 5.

   * Formula: $\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$, where $|Q|$ is the number of queries and $\text{rank}_i$ is the rank of the first correct document for query $i$.
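
Both metrics are a few lines of Python once you know, for each query, the rank at which the correct document appeared (using None for a miss; the numbers here are made up for illustration):

```python
# Rank of the correct document per query (1 = best); None means it never appeared.
ranks = [1, 3, None, 2, 1]
k = 5

hit_rate = sum(1 for r in ranks if r is not None and r <= k) / len(ranks)
mrr = sum(1 / r for r in ranks if r is not None) / len(ranks)

print(f"Hit rate@{k}: {hit_rate:.2f}")  # 0.80
print(f"MRR: {mrr:.2f}")                # (1 + 1/3 + 0 + 1/2 + 1) / 5 ≈ 0.57
```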

3. Evaluating Generation (The "LLM-as-a-Judge" Method)

Since manual grading is slow, you can use a "Judge LLM" (like GPT-4o or Claude 3.5) to grade your RAG output. You feed the judge a specific prompt for each of these three metrics:

A. Faithfulness (Groundedness)

Does the answer stay true to the retrieved context, or did the LLM hallucinate?

 * The Prompt: "Given the following context and the generated answer, list every claim in the answer. For each claim, state if it is supported by the context. Score 1.0 if all claims are supported, 0.0 otherwise."

B. Answer Relevance

Does the answer actually address the user's question?

 * The Prompt: "On a scale of 1-5, how relevant is this response to the original user question? Ignore whether the facts are true for now; focus only on whether it addresses the user's intent."

C. Context Precision

Did the retrieval step provide "clean" information, or was it full of noise?

 * The Prompt: "Check the retrieved context. Is this information actually necessary to answer the user's question? Rate 1 for useful, 0 for irrelevant."

4. Simple Python Implementation Structure

You don't need a library; a simple loop will do:

```python
results = []

for item in golden_dataset:

    # 1. Run your RAG pipeline

    retrieved_docs = retriever.get_relevant_documents(item['question'])

    response = rag_chain.invoke(item['question'])

    

    # 2. Manual/LLM Scoring

    score = call_judge_llm(

        system_prompt="You are a grader...",

        user_content=f"Question: {item['question']}\nContext: {retrieved_docs}\nAnswer: {response}"

    )

    

    results.append({"question": item['question'], "score": score})


# 3. Calculate the mean score

final_grade = sum(r['score'] for r in results) / len(results)
```


Summary Table: What to Track

| Metric | What it tests | Success Criteria |
|---|---|---|
| Context Recall | Retrieval | Is the ground truth present in the chunks? |
| Faithfulness | Generation | Did the LLM make things up? |
| Answer Similarity | Generation | How close is the answer to the Ground Truth? (Use Semantic Similarity) |

Would you like me to write a specific "Judge Prompt" you can use to grade your RAG's faithfulness?


Arize Phoenix embedding visualization and observability

 Arize Phoenix is different from Ragas or DeepEval because it is an observability tool. Instead of just giving you a score, it launches a local web dashboard that lets you visually inspect your CLI embeddings and trace exactly how your RAG pipeline is performing in real-time.

For your CLI project, Phoenix is incredibly helpful for seeing "clusters" of commands and finding out why a specific query retrieved the wrong CLI command.

1. Prerequisites

pip install arize-phoenix llama-index-callbacks-arize-phoenix


2. Implementation Code

This script connects LlamaIndex to Phoenix. Once you run this, a browser window will open showing your RAG "traces."

```python
import phoenix as px

import llama_index.core

from llama_index.core import VectorStoreIndex, Document, Settings

from llama_index.core.callbacks import CallbackManager

from llama_index.callbacks.arize_phoenix import ArizePhoenixCallbackHandler


# 1. Start the Phoenix Search & Trace server (launches a local web UI)

session = px.launch_app()


# 2. Setup LlamaIndex to send data to Phoenix

remote_callback_handler = ArizePhoenixCallbackHandler()

callback_manager = CallbackManager([remote_callback_handler])

Settings.callback_manager = callback_manager


# 3. Your CLI JSON Data

cli_data = [

    {"command": "git checkout -b", "description": "Create and switch to a new branch", "examples": ["git checkout -b feature-login"]},

    {"command": "git branch -d", "description": "Delete a local branch", "examples": ["git branch -d old-feature"]}

]


# 4. Standard LlamaIndex Ingestion

documents = [Document(text=f"{item['command']}: {item['description']}") for item in cli_data]

index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()


# 5. Run a query

# After running this, check the Phoenix Dashboard!

response = query_engine.query("How do I make a new branch?")


print(f"Answer: {response}")

print(f"Phoenix Dashboard URL: {session.url}")


# Keep the script running so you can explore the UI

import time

time.sleep(1000)
```


What to look for in the Phoenix UI:

 * Traces: You will see a "timeline" of your query. You can click on it to see exactly what text was sent to the embedding model and what chunks were pulled from your JSON.

 * The Embedding Map: Phoenix can visualize your CLI commands as dots in a 3D space.

   * Example: You might see a cluster of "Docker" commands and a cluster of "Git" commands.

   * Insight: If "how do I delete a branch" pulls up a "Docker delete" command, you will see the query dot land in the wrong cluster, telling you that your embeddings need more technical context.

 * LLM Evaluation: Phoenix can run "Evals" in the background. It will flag queries that it thinks were "Unfaithful" or had "Poor Retrieval" based on its internal heuristics.

Comparison: When to use which?

| Use Case | Recommended Tool |
|---|---|
| "I want to know if my RAG is accurate." | Ragas |
| "I want to prevent breaking changes in my code." | DeepEval |
| "I want to see WHY my RAG is failing visually." | Arize Phoenix |

Would you like to know how to use Phoenix to find "Useless Commands" in your JSON (commands that never get retrieved or overlap too much with others)?


Using DeepEval

 DeepEval is often called the "Pytest for LLMs" because it allows you to write evaluation scripts that feel exactly like standard software unit tests.

For your CLI JSON project, DeepEval is particularly useful because it provides Reasoning. If a command fails the test, it will tell you exactly why (e.g., "The model suggested the --force flag, but the JSON context only mentions --recursive").

1. Prerequisites

pip install deepeval


2. The DeepEval Test File (test_cli_rag.py)

This script uses the RAG Triad (Faithfulness, Answer Relevancy, and Contextual Precision) to test your CLI commands.

```python
import pytest

from deepeval import assert_test

from deepeval.test_case import LLMTestCase

from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric, ContextualPrecisionMetric


# 1. Setup the metrics with passing thresholds

# Threshold 0.7 means the score must be > 0.7 to "Pass" the unit test

faithfulness_metric = FaithfulnessMetric(threshold=0.7)

relevancy_metric = AnswerRelevancyMetric(threshold=0.7)

precision_metric = ContextualPrecisionMetric(threshold=0.7)


def test_docker_ps_command():

    # --- SIMULATED RAG OUTPUT ---

    # In a real test, you would call your query_engine.query() here

    input_query = "How do I see all my containers, even stopped ones?"

    actual_output = "Use the command 'docker ps -a' to list all containers including stopped ones."

    retrieval_context = [

        "Command: docker ps. Description: List running containers. Examples: docker ps -a"

    ]

    

    # 2. Create the Test Case

    test_case = LLMTestCase(

        input=input_query,

        actual_output=actual_output,

        # ContextualPrecisionMetric also needs the expected (ideal) answer
        expected_output="Use 'docker ps -a' to list all containers, including stopped ones.",

        retrieval_context=retrieval_context

    )

    

    # 3. Assert the test with multiple metrics

    assert_test(test_case, [faithfulness_metric, relevancy_metric, precision_metric])


def test_non_existent_command():

    input_query = "How do I hack into NASA?"

    actual_output = "I'm sorry, I don't have information on that."

    retrieval_context = [] # Nothing found in your CLI JSON

    

    test_case = LLMTestCase(

        input=input_query,

        actual_output=actual_output,

        retrieval_context=retrieval_context

    )

    

    assert_test(test_case, [relevancy_metric])
```


3. Running the Test

You run this from your terminal just like a normal python test:

deepeval test run test_cli_rag.py


4. Why DeepEval is better than Ragas for CLI:

 * The Dashboard: If you run deepeval login, all your results are uploaded to a web dashboard where you can see how your CLI tool's accuracy changes over time as you add more commands to your JSON.

 * Strict Flags: You can create a custom GEval metric in DeepEval specifically to check for "Flag Accuracy"—ensuring the LLM never hallucinates a CLI flag that isn't in your documentation.

 * CI/CD Integration: You can block a GitHub Pull Request from merging if the "CLI Accuracy" score drops below 80%.

Comparison: Ragas vs. DeepEval

| Feature | Ragas | DeepEval |
|---|---|---|
| Primary Use | Research / Bulk Data Eval | Engineering / Unit Testing |
| Output | Raw Scores (0.0 - 1.0) | Pass/Fail + Detailed Reasoning |
| Integration | Pandas / Notebooks | Pytest / GitHub Actions |
| UI | None (requires 3rd party) | Built-in Cloud Dashboard |

Would you like me to show you how to create a "Custom Flag Metric" to ensure the LLM never invents fake CLI arguments?


Thursday, January 29, 2026

Using custom embedding models with LlamaIndex

 To use a custom model with LlamaIndex, you use the Settings object. This acts as a global configuration hub that tells LlamaIndex which "brain" (LLM) and "dictionary" (Embedding Model) to use for all operations.

Since you are working with CLI commands, I recommend using a local embedding model (no API cost and high privacy) and a custom LLM (like a specific Llama 3 variant).

1. Setup for Local Embedding & LLM

First, install the necessary integrations:

pip install llama-index-embeddings-huggingface llama-index-llms-ollama


2. Configuration Code

Here is how you replace the default OpenAI models with custom local ones.

```python
from llama_index.core import Settings

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

from llama_index.llms.ollama import Ollama


# 1. Set a Custom Embedding Model (Local from HuggingFace)

# BGE-Small is excellent for technical/CLI text retrieval

Settings.embed_model = HuggingFaceEmbedding(

    model_name="BAAI/bge-small-en-v1.5"

)


# 2. Set a Custom LLM (Local via Ollama)

# This assumes you have Ollama running locally with 'llama3' pulled

Settings.llm = Ollama(model="llama3", request_timeout=60.0)


# 3. Standard LlamaIndex flow now uses these settings automatically

# index = VectorStoreIndex.from_documents(documents)

# query_engine = index.as_query_engine()


Which Custom Model Should You Choose for CLI?

Depending on your hardware and specific needs, here are the best "custom" matches for your JSON array:

| Type | Recommended Model | Why? |
|---|---|---|
| Embedding | BAAI/bge-base-en-v1.5 | Consistently tops benchmarks for retrieval accuracy. |
| Embedding | nomic-embed-text | Great for long examples and varied CLI syntax. |
| LLM | codellama or llama3 | These models understand code structure and terminal syntax better than generic models. |

Visual Flow of Custom Settings

Key Tips for Custom Models:

 * Device selection: If you have a GPU, the HuggingFace embedding will automatically try to use it. You can force it by adding device="cuda" or device="mps" (for Mac) inside HuggingFaceEmbedding.

 * Batch Size: For large JSON arrays, you can speed up the "custom" embedding process by setting embed_batch_size=32 in the embedding settings.

 * Persistence: Once you embed your JSON array using a custom model, you must use that same model every time you load the index later. If you change models, the "numbers" won't match, and retrieval will fail.

Would you like me to show you how to save (persist) the index to your disk so you don't have to re-embed your CLI JSON every time you run the script?


RAG for Technical Documentation

 Implementing a RAG (Retrieval-Augmented Generation) system for technical content like CLI commands requires a different approach than standard document RAG. Because CLI commands use specific syntax (e.g., --flags, grep, awk) and jargon, simple semantic search often fails to distinguish between similar-looking commands.

The best approach for your JSON dataset is a Hybrid Retrieval strategy combined with Context-Aware Metadata.

1. The Strategy: Hybrid Retrieval

Technical documentation usually requires two types of matching:

 * Semantic Search (Dense): Understands intent. If a user asks "How do I list hidden files?", the retriever finds ls -a even if the word "hidden" isn't in the description.

 * Keyword Match (Sparse/BM25): Handles exact technical terms. If a user types the specific flag --recursive, you want the retriever to prioritize commands that actually contain that exact string.

Recommended Pipeline:

 * Preprocessing: For each JSON object, create a "searchable string" that combines the command name, description, and examples.

 * Indexing: Use a vector database (like Qdrant, Pinecone, or Weaviate) that supports Hybrid Search.

 * Retrieval: Use Reciprocal Rank Fusion (RRF) to combine the results from the vector search and the keyword search.

2. Preparing Your JSON Data

Don't just embed the description. You need to structure the text to help the embedding model "see" the command.

Original JSON:

{

  "command": "docker ps",

  "description": "List running containers",

  "examples": ["docker ps -a", "docker ps --format '{{.ID}}'"]

}


Optimized Chunk for Embedding:

> Command: docker ps

> Description: List running containers.

> Examples: docker ps -a, docker ps --format '{{.ID}}'

> Keywords: docker, ps, list, containers, running, status

3. Implementation Steps

A. Embedding Model Choice

Since you are dealing with code-like structures, use a model trained on technical text or code.

 * Open Source: BAAI/bge-small-en-v1.5 (excellent for technical retrieval).

 * Proprietary: text-embedding-3-small (OpenAI) or text-embedding-004 (Gemini).

B. The Retrieval Logic

Use a "Multi-Vector" approach if your examples are very different from your descriptions:

 * Summary Vector: Embed just the command and description.

 * Example Vector: Embed each example separately but link them to the same parent command ID.

C. Handling "Short Query" Problems

CLI users often type short, ambiguous queries like "delete file." To solve this, use Query Expansion (Hypothetical Document Embeddings - HyDE):

 * Ask an LLM: "What is the technical description of the command to delete a file?"

 * Use that expanded description to search your JSON array instead of the user's two-word query.
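
A hedged sketch of that expansion step using the OpenAI client (the model name is arbitrary, and `hybrid_retriever` is a stand-in for whatever retriever you use; any LLM would do):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_query(short_query: str) -> str:
    """Turn a terse query like 'delete file' into a fuller technical description."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Write a one-sentence technical description of the CLI "
                       f"command that would: {short_query}",
        }],
    )
    return resp.choices[0].message.content

expanded = expand_query("delete file")
# e.g. "Removes files or directories from the filesystem, optionally recursively."
# Search the index with the expanded text instead of the raw two-word query:
# nodes = hybrid_retriever.retrieve(expanded)
```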

4. Why this works for CLI

| Feature | Why it matters |
|---|---|
| Hybrid Search | Ensures git log is found for both "history" (semantic) and "log" (keyword). |
| Metadata Filtering | Allows you to filter by categories (e.g., only search git commands if the user is in a git context). |
| Example Indexing | Often, a user remembers a flag but not the command; indexing examples helps catch these. |

Would you like me to provide a Python code snippet using a specific library (like LangChain or LlamaIndex) to implement this hybrid search?


Query fusion retrieval

 To implement this, I recommend using LlamaIndex with its built-in QueryFusionRetriever. This is a powerful, production-ready way to perform Hybrid Search (BM25 + Vector) and then use Reciprocal Rank Fusion (RRF) to get the most accurate result.

1. Prerequisites

You will need to install the following libraries:

pip install llama-index llama-index-retrievers-bm25 llama-index-embeddings-openai llama-index-vector-stores-qdrant qdrant-client


2. Implementation Code

This script loads your JSON, prepares the documents, and sets up the hybrid retrieval pipeline.

```python
import json

from llama_index.core import Document, VectorStoreIndex, StorageContext

from llama_index.core.retrievers import QueryFusionRetriever

from llama_index.retrievers.bm25 import BM25Retriever

from llama_index.vector_stores.qdrant import QdrantVectorStore

import qdrant_client


# 1. Load your JSON Data

cli_data = [

    {

        "command": "docker ps",

        "description": "List running containers",

        "examples": ["docker ps -a", "docker ps --format '{{.ID}}'"]

    },

    # ... more commands

]


# 2. Transform JSON to Documents for Indexing

documents = []

for item in cli_data:

    # We combine command, description, and examples into one text block

    # This ensures the model can "see" all parts during search

    content = f"Command: {item['command']}\nDescription: {item['description']}\nExamples: {', '.join(item['examples'])}"

    

    doc = Document(

        text=content,

        metadata={"command": item['command']} # Keep original command in metadata

    )

    documents.append(doc)


# 3. Setup Vector Storage (Dense Search)

client = qdrant_client.QdrantClient(location=":memory:") # Use local memory for this example

vector_store = QdrantVectorStore(client=client, collection_name="cli_docs")

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)


# 4. Initialize Retrievers

# Semantic (Vector) Retriever

vector_retriever = index.as_retriever(similarity_top_k=5)


# Keyword (BM25) Retriever

bm25_retriever = BM25Retriever.from_defaults(

    docstore=index.docstore, 

    similarity_top_k=5

)


# 5. The Hybrid "Fusion" Retriever

# This combines both and reranks using Reciprocal Rank Fusion (RRF)

hybrid_retriever = QueryFusionRetriever(

    [vector_retriever, bm25_retriever],

    similarity_top_k=2,

    num_queries=1,  # Set > 1 if you want the LLM to rewrite the query into multiple variations

    mode="reciprocal_rerank",

    use_async=True

)


# 6. Usage

query = "How do I see all my containers?"

nodes = hybrid_retriever.retrieve(query)


for node in nodes:

    print(f"Score: {node.score:.4f}")

    print(f"Content:\n{node.text}\n")


Why this works for CLI datasets:

 * Contextual Awareness: By putting the command, description, and examples into the document text, the vector search understands that "list containers" relates to docker ps.

 * Precise Flag Matching: If a user searches for a specific flag like -a or --format, the BM25 retriever will catch that exact string, which a standard vector search might ignore as "noise."

 * RRF Ranking: Reciprocal Rank Fusion is great because it doesn't require you to manually "weight" (e.g., 70% vector, 30% keyword). It mathematically finds the items that appear at the top of both lists.

Would you like me to show you how to add an LLM step to this so it generates a natural language explanation of the command it found?


Evaluating using Ragas

 Evaluating a RAG (Retrieval-Augmented Generation) pipeline with Ragas (Retrieval Augmented Generation Assessment) is a smart move. It moves you away from "vibes-based" testing and into actual metrics like Faithfulness, Answer Relevance, and Context Precision.

To get this running, you'll need a "Evaluation Dataset" consisting of Questions, Contexts, Answers, and (optionally) Ground Truths.

Prerequisites

First, install the necessary libraries:

pip install ragas langchain openai


Python Implementation

Here is a concise script to evaluate a set of RAG results using Ragas and OpenAI as the "LLM judge."

```python
import os

from datasets import Dataset

from ragas import evaluate

from ragas.metrics import (

    faithfulness,

    answer_relevancy,

    context_precision,

    context_recall,

)


# 1. Setup your API Key

os.environ["OPENAI_API_KEY"] = "your-api-key-here"


# 2. Prepare your data

# 'contexts' should be a list of lists (strings retrieved from your vector db)

data_samples = {

    'question': ['When was the first iPhone released?', 'Who founded SpaceX?'],

    'answer': ['The first iPhone was released on June 29, 2007.', 'Elon Musk founded SpaceX in 2002.'],

    'contexts': [

        ['Apple Inc. released the first iPhone in mid-2007.', 'Steve Jobs announced it in January.'],

        ['SpaceX was founded by Elon Musk to reduce space transportation costs.']

    ],

    'ground_truth': ['June 29, 2007', 'Elon Musk']

}


dataset = Dataset.from_dict(data_samples)


# 3. Define the metrics you want to track

metrics = [

    faithfulness,

    answer_relevancy,

    context_precision,

    context_recall

]


# 4. Run the evaluation

score = evaluate(dataset, metrics=metrics)


# 5. Review the results

df = score.to_pandas()

print(df)
```


Key Metrics Explained

Understanding what these numbers mean is half the battle:

| Metric | What it measures |
|---|---|
| Faithfulness | Is the answer derived only from the retrieved context? (Prevents hallucinations). |
| Answer Relevancy | Does the answer actually address the user's question? |
| Context Precision | Are the truly relevant chunks ranked higher in your retrieval results? |
| Context Recall | Does the retrieved context actually contain the information needed to answer? |

Pro-Tip: Evaluation without Ground Truth

If you don't have human-annotated ground_truth data yet, you can still run Faithfulness and Answer Relevancy. Ragas is particularly powerful because it uses an LLM to "reason" through whether the retrieved context supports the generated answer.

Would you like me to show you how to integrate this directly with a LangChain or LlamaIndex retriever so you don't have to manually build the dataset?


What is DSSE-KMS in AWS?

 DSSE-KMS (Dual-Layer Server-Side Encryption with AWS Key Management Service) is an Amazon S3 encryption option that applies two layers of encryption to objects at rest, providing enhanced security. It helps meet strict compliance requirements (like CNSSP 15) by using AWS KMS keys to encrypt data twice, offering superior protection for highly sensitive workloads. 

Key Features and Benefits

  • Dual-Layer Protection: Uses two distinct cryptographic libraries and data keys to encrypt objects, providing a higher level of assurance than single-layer encryption.

  • KMS Key Management: Uses AWS KMS to manage the master keys, allowing users to define permissions and audit usage.

  • Compliance Ready: Designed to meet rigorous standards, including the National Security Agency (NSA) CNSSP 15 for two layers of Commercial National Security Algorithm (CNSA) encryption.

  • Easy Implementation: Can be configured as the default encryption for an S3 bucket or specified in PUT/COPY requests.

  • Enforceable Security: IAM and bucket policies can be used to enforce this encryption type, ensuring all uploaded data is encrypted.

DSSE-KMS is particularly aimed at US Department of Defense (DoD) customers and other industries requiring top-secret data handling.
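
For example, requesting dual-layer encryption on an upload with boto3 (bucket, key, and KMS key ARN are placeholders):

```python
import boto3

s3 = boto3.client("s3")

s3.put_object(
    Bucket="my-sensitive-bucket",   # placeholder
    Key="reports/q1.pdf",           # placeholder
    Body=b"...object bytes...",
    ServerSideEncryption="aws:kms:dsse",  # request DSSE-KMS for this object
    SSEKMSKeyId="arn:aws:kms:us-east-1:111122223333:key/placeholder-key-id",
)
```

Setting it as the bucket default works the same way through the bucket encryption configuration, with `aws:kms:dsse` as the SSE algorithm.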


Tuesday, January 27, 2026

What Amazon Bedrock Flows Does?

Amazon Bedrock Flows is a visual workflow authoring and execution feature within Amazon Bedrock that lets developers and teams build, test, and deploy generative AI workflows without writing traditional code. It provides an intuitive drag-and-drop interface (and APIs/SDKs) for orchestrating sequences of AI tasks — combining prompts, foundation models (FMs), agents, knowledge bases, logic, and other AWS services into a reusable and versioned workflow (called a flow).

🔹 What Amazon Bedrock Flows Does

1. Visual Workflow Builder

Bedrock Flows gives you a graphical interface to drag, drop, and connect nodes representing steps in a GenAI workflow — such as model invocations, conditional logic, or integration points with services like AWS Lambda or Amazon Lex. You can also construct and modify flows using APIs or the AWS Cloud Development Kit (CDK).

2. Orchestration of Generative AI Workloads

Flows make it easy to link together:

  • foundation model prompts

  • AI agents

  • knowledge bases (for RAG)

  • business logic

  • external AWS services

into a cohesive workflow that responds to input data and produces a desired output.

3. Serverless Testing and Deployment

You can test flows directly in the AWS console with built-in input/output traceability, accelerate iteration, version your workflows for release management or A/B testing, and deploy them via API endpoints — all without managing infrastructure.

4. Enhanced Logic and Control

Flows consist of nodes and connections:

  • Nodes represent steps/operations (like invoking a model or evaluating a condition).

  • Connections (data or conditional) define how outputs feed into next steps.

This enables branching logic and complex, multi-stage execution paths.

5. Integration with AWS Ecosystem

Flows let you integrate generative AI with broader AWS tooling — such as Lambda functions for custom code, Amazon Lex for conversational interfaces, S3 for data input/output, and more — for complete, production-ready solutions.

🔹 Why It Matters

  • No-code/low-code AI orchestration: Non-developers or subject-matter experts can build sophisticated workflows.

  • Faster iteration: Test, version, and deploy generative AI applications more quickly.

  • Reusable AI logic: Flows can be versioned and reused across applications.

  • Supports complex AI use cases: Including multi-turn interactions and conditional behaviors.

🔹 Key Concepts

  • A flow is the workflow construct with a name, permissions, and connected nodes.

  • Nodes are the steps in the flow (inputs, actions, conditions).

  • Connections are the data or conditional links between nodes defining execution sequences.

In summary:

Amazon Bedrock Flows is a serverless AI workflow orchestration tool within AWS Bedrock that simplifies creating, testing, and deploying complex generative AI applications through a visual interface or APIs — enabling integration of foundation models, logic, and AWS services into scalable GenAI workflows.


Monday, January 26, 2026

What is VL-JEPA (Joint Embedding Predictive Architecture for Vision-Language)?

 VL-JEPA (Vision-Language Joint Embedding Predictive Architecture) is a new vision-language model architecture that represents a major shift away from the typical generative, token-by-token approach used in most large multimodal models (like GPT-4V, LLaVA, InstructBLIP, etc.). Instead of learning to generate text tokens one after another, VL-JEPA trains a model to predict continuous semantic embeddings in a shared latent space that captures the meaning of text and visual content. (arXiv)

🧠 Core Idea

  • Joint Embedding Predictive Architecture (JEPA): The model adopts the JEPA philosophy: don’t predict low-level data (e.g., pixels or tokens) — predict meaningful latent representations. VL-JEPA applies this idea to vision-language tasks. (arXiv)

  • Predict Instead of Generate: Traditionally, vision-language models are trained to generate text outputs autoregressively (one token at a time). VL-JEPA instead predicts the continuous embedding vector of the target text given visual inputs and a query. This embedding represents the semantic meaning rather than the specific tokens. (arXiv)

  • Focus on Semantics: By operating in an abstract latent space, the model focuses on task-relevant semantics and reduces wasted effort modeling surface-level linguistic variability. (arXiv)

⚙️ How It Works

  1. Vision and Text Encoders:

    • A vision encoder extracts visual embeddings from images or video frames.

    • A text encoder maps query text and target text into continuous embeddings.

  2. Predictor:

    • The model’s core component predicts target text embeddings based on the visual context and input query, without generating actual text tokens. (arXiv)

  3. Selective Decoding:

    • When human-readable text is needed, a lightweight decoder can translate predicted embeddings into tokens. VL-JEPA supports selective decoding, meaning it only decodes what’s necessary — significantly reducing computation compared to standard autoregressive decoding. (alphaxiv.org)

🚀 Advantages

  • Efficiency: VL-JEPA uses roughly 50 % fewer trainable parameters than comparable token-generative vision-language models while maintaining or exceeding performance on many benchmarks. (arXiv)

  • Non-Generative Focus: The model is inherently non-generative during training, focusing on predicting semantic meaning, which leads to faster inference and reduced latency in applications like real-time video understanding. (DEV Community)

  • Supports Many Tasks: Without architectural changes, VL-JEPA naturally handles tasks such as open-vocabulary classification, text-to-video retrieval, and discriminative visual question answering (VQA). (arXiv)

📊 Performance

In controlled comparisons:

  • VL-JEPA outperforms or rivals established methods like CLIP, SigLIP2, and Perception Encoder on classification and retrieval benchmarks. (OpenReview)

  • On VQA datasets, it achieves performance comparable to classical VLMs (e.g., InstructBLIP, QwenVL) despite using fewer parameters. (OpenReview)


In summary, VL-JEPA moves beyond token generation toward semantic embedding prediction in vision-language models, offering greater efficiency and real-time capabilities without sacrificing general task performance. (arXiv)

references:

https://arxiv.org/abs/2512.10942

Sunday, January 25, 2026

Multiversion Concurrency Control in Postgres

 In PostgreSQL, Multiversion Concurrency Control (MVCC) is the secret sauce that allows the database to handle multiple users at once without everyone stepping on each other's toes.

The core philosophy is simple: Readers never block writers, and writers never block readers.

How It Works: The "Snapshot" Concept

Instead of locking a row when someone wants to update it (which would force everyone else to wait), Postgres keeps multiple versions of that row. When you start a transaction, Postgres gives you a snapshot of the data as it existed at that exact moment.

1. Hidden Columns

Every row in a Postgres table has hidden system columns used for MVCC:

 * xmin: The ID of the transaction that inserted the row.

 * xmax: The ID of the transaction that deleted or updated the row (initially set to 0).

2. The Update Process

When you update a row, Postgres doesn't actually overwrite the old data. It performs a "soft delete" and an "insert":

 * It marks the old row version as expired by setting its xmax to the current Transaction ID.

 * It creates a new version of the row with the updated data and sets its xmin to the current Transaction ID.

3. Visibility Rules

When a different transaction tries to read that row, Postgres compares the transaction's ID with the xmin and xmax of the available versions:

 * If a row's xmin belongs to a committed transaction and its xmax is either 0 or belongs to a transaction that has not committed (still in progress or rolled back), the row is visible.

 * This ensures you always see a consistent state of the database, even if someone else is mid-update.
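
You can see these hidden columns for yourself by selecting them explicitly. A minimal sketch using psycopg2; the connection string, the accounts table, and its id column are hypothetical:

import psycopg2

# Hypothetical connection string and table name, for illustration only.
conn = psycopg2.connect("dbname=mydb user=postgres password=postgres")
with conn, conn.cursor() as cur:
    # xmin/xmax are hidden system columns, so they must be listed explicitly.
    cur.execute("SELECT xmin, xmax, ctid, id FROM accounts LIMIT 5")
    for xmin, xmax, ctid, row_id in cur.fetchall():
        print(f"row {row_id}: inserted by tx {xmin}, expired by tx {xmax} (0 = live), at {ctid}")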

The Side Effect: Bloat and VACUUM

Because Postgres keeps those old versions of rows around (often called "dead tuples"), the database files will eventually grow—this is known as bloat.

To clean this up, Postgres uses a process called VACUUM:

 * It scans for rows where the xmax belongs to a transaction that is long finished.

 * It marks that space as available for new data.

 * Autovacuum is the built-in daemon that handles this automatically in the background so you don't have to.
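
To gauge how much dead-tuple bloat has built up, and to vacuum a table manually, you can query the standard pg_stat_user_tables statistics view. Again a sketch with psycopg2 and the same hypothetical accounts table:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres password=postgres")  # hypothetical
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 5
    """)
    for relname, live, dead, last_av in cur.fetchall():
        print(f"{relname}: {live} live, {dead} dead, last autovacuum: {last_av}")

# VACUUM cannot run inside a transaction block, so switch to autocommit first.
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("VACUUM (ANALYZE) accounts")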

Why This Matters

 * Performance: High concurrency. You can run massive reports (READs) while your application is constantly updating data (WRITEs) without them fighting for locks.

 * Consistency: You get "Snapshot Isolation," meaning your query results won't change halfway through just because another user committed a change.

> Note: While readers and writers don't block each other, writers still block writers if they attempt to update the exact same row at the same time.



Thursday, January 22, 2026

What is Spec Driven Development

Instead of coding first and writing docs later, in spec-driven development, you start with a (you guessed it) spec. This is a contract for how your code should behave and becomes the source of truth your tools and AI agents use to generate, test, and validate code. The result is less guesswork, fewer surprises, and higher-quality code.

Spec Kit makes your specification the center of your engineering process. Instead of writing a spec and setting it aside, the spec drives the implementation, checklists, and task breakdowns.  Your primary role is to steer; the coding agent does the bulk of the writing.

It works in four phases with clear checkpoints. But here’s the key insight: each phase has a specific job, and you don’t move to the next one until the current phase’s output is fully validated.

Here’s how the process breaks down:

Specify: You provide a high-level description of what you’re building and why, and the coding agent generates a detailed specification. This isn’t about technical stacks or app design. It’s about user journeys, experiences, and what success looks like. Who will use this? What problem does it solve for them? How will they interact with it? What outcomes matter? Think of it as mapping the user experience you want to create, and letting the coding agent flesh out the details. Crucially, this becomes a living artifact that evolves as you learn more about your users and their needs.

Plan: Now you get technical. In this phase, you provide the coding agent with your desired stack, architecture, and constraints, and the coding agent generates a comprehensive technical plan. If your company standardizes on certain technologies, this is where you say so. If you’re integrating with legacy systems, have compliance requirements, or have performance targets you need to hit … all of that goes here. You can also ask for multiple plan variations to compare and contrast different approaches. If you make your internal docs available to the coding agent, it can integrate your architectural patterns and standards directly into the plan. After all, a coding agent needs to understand the rules of the game before it starts playing.

Tasks: The coding agent takes the spec and the plan and breaks them down into actual work. It generates small, reviewable chunks that each solve a specific piece of the puzzle. Each task should be something you can implement and test in isolation; this is crucial because it gives the coding agent a way to validate its work and stay on track, almost like a test-driven development process for your AI agent. Instead of “build authentication,” you get concrete tasks like “create a user registration endpoint that validates email format.”

Implement: Your coding agent tackles the tasks one by one (or in parallel, where applicable). But here’s what’s different: instead of reviewing thousand-line code dumps, you, the developer, review focused changes that solve specific problems. The coding agent knows what it’s supposed to build because the specification told it. It knows how to build it because the plan told it. And it knows exactly what to work on because the task told it.

Crucially, your role isn’t just to steer. It’s to verify. At each phase, you reflect and refine. Does the spec capture what you actually want to build? Does the plan account for real-world constraints? Are there omissions or edge cases the AI missed? The process builds in explicit checkpoints for you to critique what’s been generated, spot gaps, and course correct before moving forward. The AI generates the artifacts; you ensure they’re right.

Wednesday, January 21, 2026

Best RAG Strategies

Implementation Roadmap: Start Simple, Scale Smart

Don’t try to implement everything at once. Here’s a practical roadmap:


Phase 1: Foundation (Week 1)

Context-aware chunking (replace fixed-size splitting)

Basic vector search with proper embeddings

Measure baseline accuracy (see the hit-rate sketch after this roadmap)


Phase 2: Quick Wins (Week 2–3)

Add re-ranking (biggest accuracy boost for effort)

Implement query expansion (handles vague queries)

Measure improvement


Phase 3: Advanced (Week 4–6)

Add multi-query or agentic RAG (choose based on use case)

Implement self-reflection for critical queries

Fine-tune and optimize


Phase 4: Specialization (Month 2+)

Add contextual retrieval for high-value documents

Consider knowledge graphs if relationships matter

Fine-tune embeddings for domain-specific accuracy


references

https://pub.towardsai.net/i-spent-3-months-building-ra-systems-before-learning-these-11-strategies-1a8f6b4278aa

A simple example of fine-tuning an embedding model

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader


def prepare_training_data():
    """Domain-specific query-document pairs, wrapped as InputExample objects"""
    pairs = [
        ("What is EBITDA?", "EBITDA (Earnings Before Interest, Taxes..."),
        ("Explain capital expenditure", "Capital expenditure (CapEx) refers to..."),
        # ... thousands more pairs
    ]
    # model.fit expects InputExample objects, not raw tuples
    return [InputExample(texts=[query, document]) for query, document in pairs]


def fine_tune_model():
    """Fine-tune on domain data"""
    # Load base model
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Prepare training data
    train_examples = prepare_training_data()
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

    # Define loss function (other in-batch documents act as negatives)
    train_loss = losses.MultipleNegativesRankingLoss(model)

    # Train
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=3,
        warmup_steps=100
    )

    model.save('./fine_tuned_financial_model')
    return model


# Use fine-tuned model
embedding_model = SentenceTransformer('./fine_tuned_financial_model')


What is Late Chunking

What it does: Processes the entire document through the transformer before chunking the token embeddings (not the text).


The problem it solves: Traditional chunking loses long-distance context. Late chunking preserves full document context in each chunk’s embedding.


Conceptual example:


def late_chunk(text: str, chunk_size: int = 512) -> list:
    """Embed full document BEFORE chunking"""
    # NOTE: transformer_embed, tokenize, detokenize and mean_pool are
    # placeholder helpers; in practice they come from a long-context
    # embedding model (one with an ~8192-token window) and its tokenizer.

    # Step 1: Embed entire document (token-level embeddings, full context)
    full_doc_token_embeddings = transformer_embed(text)

    # Step 2: Define chunk boundaries
    tokens = tokenize(text)
    chunk_boundaries = range(0, len(tokens), chunk_size)

    # Step 3: Pool token embeddings for each chunk
    chunks_with_embeddings = []
    for start in chunk_boundaries:
        end = start + chunk_size
        chunk_text = detokenize(tokens[start:end])

        # Mean pool token embeddings (preserves full doc context!)
        chunk_embedding = mean_pool(full_doc_token_embeddings[start:end])
        chunks_with_embeddings.append((chunk_text, chunk_embedding))

    return chunks_with_embeddings




Late chunking in the context of embeddings for GenAI (Generative AI) is a strategy used when processing large documents or datasets for vector embeddings, particularly in RAG (Retrieval-Augmented Generation) workflows.

Here’s a clear breakdown:

Definition

Late chunking means delaying the splitting of content into smaller pieces (chunks) until after embedding generation has started or the content has been initially processed. Instead of splitting a large document into chunks before generating embeddings (which is early chunking), the model or system first generates embeddings for larger units (like full documents or sections) and then splits or processes them further later in the pipeline if needed.

Why use Late Chunking?

 * Preserves context: Early chunking may break semantic context by splitting sentences or paragraphs arbitrarily. Late chunking allows embeddings to capture larger context, improving similarity searches.

 * Efficient processing: You can generate embeddings for larger units first and only create smaller chunks if retrieval or indexing requires it, reducing unnecessary computations.

 * Dynamic retrieval granularity: Allows flexible adjustment of chunk size later depending on how the embeddings will be queried or used in the application.

Comparison to Early Chunking

Feature               | Early Chunking                                | Late Chunking
When text is split    | Before embedding                              | After embedding or during retrieval
Context retention     | Lower (may lose semantic links across chunks) | Higher (larger context retained)
Processing efficiency | May generate more embeddings unnecessarily    | Can reduce embedding count
Use case              | Simple search or small documents              | Large documents, long-context GenAI applications

💡 Example Scenario:

A book with 1000 pages is to be used in a RAG application.

 * Early chunking: Split into 2-page chunks first → 500 embeddings.

 * Late chunking: Generate embeddings for each chapter first → 20 embeddings, then split chapters into smaller chunks later only if needed.

This approach balances context preservation and computational efficiency.

Tuesday, January 20, 2026

What is an example of Graphiti with Neo4J

from graphiti_core import Graphiti
from graphiti_core.nodes import EpisodeType

# Initialize Graphiti (connects to Neo4j)
graphiti = Graphiti("neo4j://localhost:7687", "neo4j", "password")

async def ingest_document(text: str, source: str):
    """Ingest into knowledge graph"""
    # Graphiti automatically extracts entities and relationships
    await graphiti.add_episode(
        name=source,
        episode_body=text,
        source=EpisodeType.text,
        source_description=f"Document: {source}"
    )

async def search_knowledge_graph(query: str) -> str:
    """Hybrid search: semantic + keyword + graph"""
    # Graphiti combines:
    # - Semantic similarity (embeddings)
    # - BM25 keyword search
    # - Graph structure traversal
    # - Temporal context
    results = await graphiti.search(query=query, num_results=5)

    # Format graph results
    formatted = []
    for result in results:
        formatted.append(
            f"Entity: {result.node.name}\n"
            f"Type: {result.node.type}\n"
            f"Relationships: {result.relationships}"
        )

    return "\n---\n".join(formatted)
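
Both helpers above are coroutines, so they need a small async driver. A hypothetical usage example, assuming a Neo4j instance is running locally with the credentials shown above:

import asyncio

async def main():
    # Sample text and source name are made up for illustration.
    await ingest_document(
        "Acme Corp acquired Widget Inc in March 2024 for $2B.",
        "press-release-001",
    )
    print(await search_knowledge_graph("Who acquired Widget Inc?"))

asyncio.run(main())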





A Cross Encoder Example

from sentence_transformers import CrossEncoder

# Initialize once
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

async def search_with_reranking(query: str, limit: int = 5) -> list:
    # `embedder` and `db` are assumed to be an embedding client and an async
    # database handle defined elsewhere; the SQL assumes a pgvector column.

    # Stage 1: Fast vector retrieval (get 4x candidates)
    candidate_limit = min(limit * 4, 20)
    query_embedding = await embedder.embed_query(query)

    candidates = await db.query(
        "SELECT content, metadata FROM chunks ORDER BY embedding <=> $1 LIMIT $2",
        query_embedding, candidate_limit
    )

    # Stage 2: Re-rank with cross-encoder
    pairs = [[query, row['content']] for row in candidates]
    scores = reranker.predict(pairs)

    # Sort by reranker scores and return top N
    reranked = sorted(
        zip(candidates, scores),
        key=lambda x: x[1],
        reverse=True
    )[:limit]

    return [doc for doc, score in reranked]