Monday, January 26, 2026

What is VL-JEPA: Joint Embedding Predictive Architecture for Vision-Language

 VL-JEPA (Vision-Language Joint Embedding Predictive Architecture) is a new vision-language model architecture that represents a major shift away from the typical generative, token-by-token approach used in most large multimodal models (like GPT-4V, LLaVA, InstructBLIP, etc.). Instead of learning to generate text tokens one after another, VL-JEPA trains a model to predict continuous semantic embeddings in a shared latent space that captures the meaning of text and visual content. (arXiv)

🧠 Core Idea

  • Joint Embedding Predictive Architecture (JEPA): The model adopts the JEPA philosophy: don’t predict low-level data (e.g., pixels or tokens) — predict meaningful latent representations. VL-JEPA applies this idea to vision-language tasks. (arXiv)

  • Predict Instead of Generate: Traditionally, vision-language models are trained to generate text outputs autoregressively (one token at a time). VL-JEPA instead predicts the continuous embedding vector of the target text given visual inputs and a query. This embedding represents the semantic meaning rather than the specific tokens. (arXiv)

  • Focus on Semantics: By operating in an abstract latent space, the model focuses on task-relevant semantics and reduces wasted effort modeling surface-level linguistic variability. (arXiv)

⚙️ How It Works

  1. Vision and Text Encoders:

    • A vision encoder extracts visual embeddings from images or video frames.

    • A text encoder maps query text and target text into continuous embeddings.

  2. Predictor:

    • The model’s core component predicts target text embeddings based on the visual context and input query, without generating actual text tokens. (arXiv)

  3. Selective Decoding:

    • When human-readable text is needed, a lightweight decoder can translate predicted embeddings into tokens. VL-JEPA supports selective decoding, meaning it only decodes what’s necessary — significantly reducing computation compared to standard autoregressive decoding. (alphaxiv.org)
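To make the prediction objective concrete, here is a minimal PyTorch-style sketch. The encoders, embedding dimension, concatenation-based predictor, and cosine loss are all illustrative assumptions, not the paper's exact setup:

import torch
import torch.nn as nn
import torch.nn.functional as F

class VLJEPASketch(nn.Module):
    """Illustrative: predict the target text embedding from visual context + query."""
    def __init__(self, vision_encoder, text_encoder, dim=768):
        super().__init__()
        self.vision_encoder = vision_encoder   # assumed: images/frames -> (batch, dim)
        self.text_encoder = text_encoder       # assumed: text -> (batch, dim)
        # Predictor maps [visual embedding; query embedding] -> predicted target embedding
        self.predictor = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, images, query_text, target_text):
        v = self.vision_encoder(images)         # visual embeddings
        q = self.text_encoder(query_text)       # query embedding
        with torch.no_grad():                   # simplification: target embedding taken without gradients
            y = self.text_encoder(target_text)  # semantic target; no tokens are generated
        y_hat = self.predictor(torch.cat([v, q], dim=-1))
        # Train by matching embeddings in latent space (cosine distance as a stand-in loss)
        return 1.0 - F.cosine_similarity(y_hat, y, dim=-1).mean()

At inference time only the predicted embedding is needed; the lightweight decoder from step 3 is attached only when readable text is required.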

🚀 Advantages

  • Efficiency: VL-JEPA uses roughly 50 % fewer trainable parameters than comparable token-generative vision-language models while maintaining or exceeding performance on many benchmarks. (arXiv)

  • Non-Generative Focus: The model is inherently non-generative during training, focusing on predicting semantic meaning, which leads to faster inference and reduced latency in applications like real-time video understanding. (DEV Community)

  • Supports Many Tasks: Without architectural changes, VL-JEPA naturally handles tasks such as open-vocabulary classification, text-to-video retrieval, and discriminative visual question answering (VQA). (arXiv)

📊 Performance

In controlled comparisons:

  • VL-JEPA outperforms or rivals established methods like CLIP, SigLIP2, and Perception Encoder on classification and retrieval benchmarks. (OpenReview)

  • On VQA datasets, it achieves performance comparable to classical VLMs (e.g., InstructBLIP, QwenVL) despite using fewer parameters. (OpenReview)


In summary, VL-JEPA moves beyond token generation toward semantic embedding prediction in vision-language models, offering greater efficiency and real-time capabilities without sacrificing general task performance. (arXiv)

references:

https://arxiv.org/abs/2512.10942

Sunday, January 25, 2026

Multiversion Concurrency Control (MVCC) in Postgres

 In PostgreSQL, Multiversion Concurrency Control (MVCC) is the secret sauce that allows the database to handle multiple users at once without everyone stepping on each other's toes.

The core philosophy is simple: Readers never block writers, and writers never block readers.

How It Works: The "Snapshot" Concept

Instead of locking a row when someone wants to update it (which would force everyone else to wait), Postgres keeps multiple versions of that row. When you start a transaction, Postgres gives you a snapshot of the data as it existed at that exact moment.

1. Hidden Columns

Every row in a Postgres table has hidden system columns used for MVCC:

 * xmin: The ID of the transaction that inserted the row.

 * xmax: The ID of the transaction that deleted or updated the row (initially set to 0).

2. The Update Process

When you update a row, Postgres doesn't actually overwrite the old data. It performs a "soft delete" and an "insert":

 * It marks the old row version as expired by setting its xmax to the current Transaction ID.

 * It creates a new version of the row with the updated data and sets its xmin to the current Transaction ID.

3. Visibility Rules

When a different transaction tries to read that row, Postgres compares the transaction's ID with the xmin and xmax of the available versions:

 * If a row's xmin belongs to a transaction that committed before your snapshot, and its xmax is either 0 or belongs to a transaction that hasn't committed (or was rolled back), the row is visible.

 * This ensures you always see a consistent state of the database, even if someone else is mid-update.
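A toy sketch of this visibility check (it ignores subtransactions, command IDs, hint bits, and the full snapshot bookkeeping real Postgres performs, but it shows the xmin/xmax logic):

def is_visible(row: dict, snapshot_xid: int, committed: set) -> bool:
    """Simplified MVCC visibility: the inserting transaction must be committed
    and within the snapshot; the deleting transaction must be absent,
    uncommitted, or newer than the snapshot."""
    inserted_ok = row["xmin"] in committed and row["xmin"] <= snapshot_xid
    not_deleted = row["xmax"] == 0 or row["xmax"] not in committed or row["xmax"] > snapshot_xid
    return inserted_ok and not_deleted

# Two versions of the same logical row after an UPDATE by transaction 205:
old_version = {"xmin": 100, "xmax": 205, "value": "old"}
new_version = {"xmin": 205, "xmax": 0, "value": "new"}

committed = {100}  # transaction 205 has not committed yet
print(is_visible(old_version, snapshot_xid=200, committed=committed))  # True
print(is_visible(new_version, snapshot_xid=200, committed=committed))  # False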

The Side Effect: Bloat and VACUUM

Because Postgres keeps those old versions of rows around (often called "dead tuples"), the database files will eventually grow—this is known as bloat.

To clean this up, Postgres uses a process called VACUUM:

 * It scans for rows where the xmax belongs to a transaction that is long finished.

 * It marks that space as available for new data.

 * Autovacuum is the built-in daemon that handles this automatically in the background so you don't have to.

Why This Matters

 * Performance: High concurrency. You can run massive reports (READs) while your application is constantly updating data (WRITEs) without them fighting for locks.

 * Consistency: You get "Snapshot Isolation," meaning your query results won't change halfway through just because another user committed a change.

> Note: While readers and writers don't block each other, writers still block writers if they attempt to update the exact same row at the same time.

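To see this in action, you can select the hidden xmin and xmax columns explicitly. A minimal sketch with psycopg2 (the connection string and the accounts table are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # hypothetical DSN
cur = conn.cursor()

# System columns only appear when you name them explicitly.
cur.execute("SELECT xmin, xmax, * FROM accounts LIMIT 5;")
for row in cur.fetchall():
    print(row)

# Update a row inside the open transaction, then inspect it again:
cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = 1;")
cur.execute("SELECT xmin, xmax, id, balance FROM accounts WHERE id = 1;")
print(cur.fetchone())  # the new row version carries this transaction's ID as its xmin

conn.rollback()
cur.close()
conn.close()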


Thursday, January 22, 2026

What is Spec Driven Development

Instead of coding first and writing docs later, in spec-driven development, you start with a (you guessed it) spec. This is a contract for how your code should behave and becomes the source of truth your tools and AI agents use to generate, test, and validate code. The result is less guesswork, fewer surprises, and higher-quality code.

Spec Kit makes your specification the center of your engineering process. Instead of writing a spec and setting it aside, the spec drives the implementation, checklists, and task breakdowns.  Your primary role is to steer; the coding agent does the bulk of the writing.

It works in four phases with clear checkpoints. But here’s the key insight: each phase has a specific job, and you don’t move to the next one until the current one is fully validated.

Here’s how the process breaks down:

Specify: You provide a high-level description of what you’re building and why, and the coding agent generates a detailed specification. This isn’t about technical stacks or app design. It’s about user journeys, experiences, and what success looks like. Who will use this? What problem does it solve for them? How will they interact with it? What outcomes matter? Think of it as mapping the user experience you want to create, and letting the coding agent flesh out the details. Crucially, this becomes a living artifact that evolves as you learn more about your users and their needs.

Plan: Now you get technical. In this phase, you provide the coding agent with your desired stack, architecture, and constraints, and the coding agent generates a comprehensive technical plan. If your company standardizes on certain technologies, this is where you say so. If you’re integrating with legacy systems, have compliance requirements, or have performance targets you need to hit … all of that goes here. You can also ask for multiple plan variations to compare and contrast different approaches. If you make your internal docs available to the coding agent, it can integrate your architectural patterns and standards directly into the plan. After all, a coding agent needs to understand the rules of the game before it starts playing.

Tasks: The coding agent takes the spec and the plan and breaks them down into actual work. It generates small, reviewable chunks that each solve a specific piece of the puzzle. Each task should be something you can implement and test in isolation; this is crucial because it gives the coding agent a way to validate its work and stay on track, almost like a test-driven development process for your AI agent. Instead of “build authentication,” you get concrete tasks like “create a user registration endpoint that validates email format.”
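To make "small and testable in isolation" concrete, a task like that could boil down to a change as narrow as the following sketch (the function, validation rule, and test are hypothetical illustrations, not Spec Kit output):

import re

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def register_user(email: str, users: dict) -> dict:
    """Hypothetical task: a registration entry point that validates email format."""
    if not EMAIL_RE.match(email):
        raise ValueError("invalid email format")
    users[email] = {"email": email}
    return users[email]

def test_register_user_rejects_bad_email():
    """Each task ships with a check the agent (or you) can run in isolation."""
    try:
        register_user("not-an-email", {})
    except ValueError:
        return
    raise AssertionError("expected ValueError for malformed email")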

Implement: Your coding agent tackles the tasks one by one (or in parallel, where applicable). But here’s what’s different: instead of reviewing thousand-line code dumps, you, the developer, review focused changes that solve specific problems. The coding agent knows what it’s supposed to build because the specification told it. It knows how to build it because the plan told it. And it knows exactly what to work on because the task told it.

Crucially, your role isn’t just to steer. It’s to verify. At each phase, you reflect and refine. Does the spec capture what you actually want to build? Does the plan account for real-world constraints? Are there omissions or edge cases the AI missed? The process builds in explicit checkpoints for you to critique what’s been generated, spot gaps, and course correct before moving forward. The AI generates the artifacts; you ensure they’re right.

Wednesday, January 21, 2026

Best RAG Strategies

Implementation Roadmap: Start Simple, Scale Smart

Don’t try to implement everything at once. Here’s a practical roadmap:


Phase 1: Foundation (Week 1)

Context-aware chunking (replace fixed-size splitting)

Basic vector search with proper embeddings

Measure baseline accuracy


Phase 2: Quick Wins (Week 2–3)

Add re-ranking (biggest accuracy boost for effort; see the cross-encoder sketch after this roadmap)

Implement query expansion (handles vague queries)

Measure improvement


Phase 3: Advanced (Week 4–6)

Add multi-query or agentic RAG (choose based on use case)

Implement self-reflection for critical queries

Fine-tune and optimize


Phase 4: Specialization (Month 2+)

Add contextual retrieval for high-value documents

Consider knowledge graphs if relationships matter

Fine-tune embeddings for domain-specific accuracy
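As an illustration of the Phase 2 re-ranking step, here is a minimal sketch using a sentence-transformers CrossEncoder; the model name is a common public checkpoint, and the candidate list would come from your vector search:

from sentence_transformers import CrossEncoder

# Cross-encoder re-ranker: scores each (query, document) pair jointly,
# slower than bi-encoder retrieval but considerably more precise.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list, top_k: int = 5) -> list:
    """Re-order vector-search candidates by cross-encoder relevance score."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]

# Typical flow: retrieve ~50 candidates with vector search, re-rank, keep the top 5.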


references

https://pub.towardsai.net/i-spent-3-months-building-ra-systems-before-learning-these-11-strategies-1a8f6b4278aa

A simple example of fine-tuning an embedding model

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

def prepare_training_data():
    """Domain-specific query-document pairs wrapped as InputExample objects"""
    pairs = [
        ("What is EBITDA?", "EBITDA (Earnings Before Interest, Taxes..."),
        ("Explain capital expenditure", "Capital expenditure (CapEx) refers to..."),
        # ... thousands more pairs
    ]
    return [InputExample(texts=[query, document]) for query, document in pairs]

def fine_tune_model():
    """Fine-tune on domain data"""
    # Load base model
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Prepare training data
    train_examples = prepare_training_data()
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

    # Define loss function (each (query, doc) pair is a positive;
    # other documents in the batch act as negatives)
    train_loss = losses.MultipleNegativesRankingLoss(model)

    # Train
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=3,
        warmup_steps=100
    )

    model.save('./fine_tuned_financial_model')
    return model

# Use fine-tuned model
embedding_model = SentenceTransformer('./fine_tuned_financial_model')


What is Late Chunking

What it does: Processes the entire document through the transformer first, and only then chunks the token embeddings (not the raw text).


The problem it solves: Traditional chunking loses long-distance context. Late chunking preserves full document context in each chunk’s embedding.


Example (sketched here with Hugging Face transformers; the long-context checkpoint is an assumption, and any model that exposes token-level embeddings over a long window would work):

import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "jinaai/jina-embeddings-v2-base-en"  # assumed long-context embedding model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_NAME, trust_remote_code=True)

def late_chunk(text: str, chunk_size: int = 512) -> list:
    """Embed the full document BEFORE chunking, then pool token embeddings per chunk"""
    # Step 1: Embed the entire document (up to the model's 8192-token window)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state[0]  # (seq_len, dim), token-level

    # Step 2: Define chunk boundaries over the token sequence
    input_ids = inputs["input_ids"][0]
    chunks_with_embeddings = []
    for start in range(0, len(input_ids), chunk_size):
        end = start + chunk_size
        chunk_text = tokenizer.decode(input_ids[start:end], skip_special_tokens=True)

        # Step 3: Mean-pool token embeddings (each token already saw the full document)
        chunk_embedding = token_embeddings[start:end].mean(dim=0)
        chunks_with_embeddings.append((chunk_text, chunk_embedding))

    return chunks_with_embeddings




Late chunking in the context of embeddings for GenAI (Generative AI) is a strategy used when processing large documents or datasets for vector embeddings, particularly in RAG (Retrieval-Augmented Generation) workflows.


Here’s a clear breakdown:


Definition


Late chunking means delaying the splitting of content into smaller pieces (chunks) until after the content has been run through the embedding model.


Instead of splitting a large document into chunks before generating embeddings (which is early chunking), the system first passes the full document (or a large section) through the model and then derives chunk-level embeddings later in the pipeline by pooling the token embeddings, so each chunk embedding still reflects the context of the whole document.


Why use Late Chunking?


Preserves context


Early chunking may break semantic context by splitting sentences or paragraphs arbitrarily.


Late chunking allows embeddings to capture larger context, improving similarity searches.


Efficient processing


You can generate embeddings for larger units first and only create smaller chunks if retrieval or indexing requires it, reducing unnecessary computations.


Dynamic retrieval granularity


Allows flexible adjustment of chunk size later depending on how the embeddings will be queried or used in the application.


Comparison to Early Chunking

 * When text is split: early chunking splits before embedding; late chunking splits after embedding (or at retrieval time).

 * Context retention: lower for early chunking (semantic links across chunks may be lost); higher for late chunking (larger context retained).

 * Processing efficiency: early chunking may generate more embeddings than necessary; late chunking can reduce the embedding count.

 * Use case: early chunking suits simple search or small documents; late chunking suits large documents and long-context GenAI applications.


💡 Example Scenario:


A book with 1000 pages is to be used in a RAG application.


Early chunking: Split into 2-page chunks first → 500 embeddings.


Late chunking: Run each chapter through a long-context embedding model first (20 chapter-level passes), then pool the token embeddings into smaller chunk vectors later, only where finer-grained retrieval is needed.


This approach balances context preservation and computational efficiency.

Tuesday, January 20, 2026

What is an example of Graphiti with Neo4J

from datetime import datetime, timezone

from graphiti_core import Graphiti
from graphiti_core.nodes import EpisodeType

# Initialize Graphiti (connects to Neo4j)
graphiti = Graphiti("neo4j://localhost:7687", "neo4j", "password")

async def ingest_document(text: str, source: str):
    """Ingest into knowledge graph"""
    # Graphiti automatically extracts entities and relationships
    await graphiti.add_episode(
        name=source,
        episode_body=text,
        source=EpisodeType.text,
        source_description=f"Document: {source}",
        reference_time=datetime.now(timezone.utc)  # anchors the episode in Graphiti's temporal model
    )

async def search_knowledge_graph(query: str) -> str:
    """Hybrid search: semantic + keyword + graph"""
    # Graphiti combines:
    # - Semantic similarity (embeddings)
    # - BM25 keyword search
    # - Graph structure traversal
    # - Temporal context
    results = await graphiti.search(query=query, num_results=5)

    # search() returns relationship edges; each carries a natural-language fact
    formatted = [f"Fact: {result.fact}" for result in results]
    return "\n---\n".join(formatted)
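A hypothetical driver for the two helpers above might look like this (the document text and query are made up, and setup steps such as building graph indices are omitted):

import asyncio

async def main():
    await ingest_document("Acme Corp acquired Widgets Inc in 2024.", "news-2024-01")
    answer = await search_knowledge_graph("Who acquired Widgets Inc?")
    print(answer)

asyncio.run(main())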