Thursday, May 22, 2025

What is KIND Tool?

Although Kubernetes production clusters typically run in a cloud environment, with the right tool, running a Kubernetes cluster locally is not only possible but can also provide key benefits such as accelerated productivity, easy and efficient testing, and reduced resource expenditure.


Kubernetes-in-Docker (Kind) is a command-line tool that enables developers to create a local Kubernetes cluster using Docker images. With this novel approach, users can take advantage of Docker’s straightforward, self-contained deployments and cleanup to create and test Kubernetes infrastructure without the operational overhead of a full-blown cluster.


The first step to understanding Kind and the value it brings to the table is to understand why developers would want a local Kubernetes development solution. There are a number of reasons to use a local Kubernetes cluster, for instance, the ability to test deployment methods, check how the application interacts with mounted volumes, and test manifest files.


It’s not enough for developers to simply spin up a service and test it. As services are deployed to Kubernetes clusters, developers must ensure they work together with other services and communicate properly with each other. Because of this, today it is more important than ever to have the option to run a Kubernetes cluster locally.

Here are some key use cases in which local Kubernetes clusters can be particularly beneficial: 

Proof of concepts and experimentation: Using local environments eliminates the need to provision cloud infrastructure, set up security configurations, and handle other administrative tasks. In essence, developers can experiment and carry out Proof of Concepts (POCs) in a low-risk environment.

Smaller teams: With the differences in local machines and their respective software and configuration setups, there is a greater chance of configuration drift in large teams. However, a smaller team of experienced Kubernetes developers will be better able to standardize and align their cluster configurations based on the hardware being used, making local clusters more suitable. 


Low computation requirements: Local clusters are best suited for development environments with low computation requirements, or in other words, “simple” applications. 


What is Kind?


Kind is an open-source, CNCF-certified Kubernetes installer used by developers to quickly and easily create Kubernetes clusters using Docker container “nodes.” Though primarily designed for testing Kubernetes itself, Kind has proven to be an adept tool for local development and continuous integration (CI) pipelines. 



How does Kind work? 


At a high level, a Kind cluster is a set of Docker containers, each acting as a Kubernetes “node” (control plane or worker). Essentially, Kind bundles all the required Kubernetes components into a single image (called a node image) that is used to create a single-node or multi-node cluster.


Kind provides pre-built node images; however, developers have the option to build their own image if needed. Once the Kubernetes cluster is created, Kind automatically configures the kubectl context, making deployment easy and robust.

Key features of Kind include:



Support for multi-node clusters (including HA).

Support for building Kubernetes release builds from source.

Support for make/bash, docker, or bazel, in addition to pre-published builds.

Can be configured to run various releases of Kubernetes (v1.16.3, v1.17.1, etc.)



Kind is far from the only solution for running local Kubernetes clusters, yet despite competing against tools such as Minikube, K3s, MicroK8s, and more, Kind remains a strong contender. Here is why:


Simplicity. With Kind, it’s simple to set up a Kubernetes environment for local testing without needing virtual machines or anything more complicated than a Docker install. Using the tool, developers can easily create, recreate, or delete a cluster with a single command. Additionally, Kind enables developers to load local container images directly into the Kubernetes cluster, saving the time and effort needed to set up a registry and push the images repeatedly.


Speed. One of the key advantages of Kind is its start-up time, which is significantly faster than similar tools such as Minikube. For instance, Kind can launch a fully compliant Kubernetes cluster using Docker containers as nodes in less than a minute, drastically improving the developer experience when testing against Kubernetes. 


Customization. Another benefit of Kind is the customization it offers. By default, Kind creates a cluster with only one node, which acts as the control plane; however, users have the option to configure Kind to run in a multi-node setup and add multiple control-plane nodes to simulate high availability. Additionally, because Kind works with Docker images, developers can specify a custom node image they want to run.
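As a quick illustration of that multi-node customization, here is a sketch of how a cluster could be scripted around the kind CLI from Python (to stay consistent with the other code on this blog); the cluster name, node roles, and node-image tag are illustrative choices, not required values.

import subprocess
import tempfile

# Minimal multi-node Kind config: one control plane and two workers.
config = """kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
"""

with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
    f.write(config)
    config_path = f.name

# Create the cluster; --image pins a specific Kubernetes release (the tag here is illustrative).
subprocess.run(
    ["kind", "create", "cluster", "--name", "demo",
     "--config", config_path, "--image", "kindest/node:v1.29.2"],
    check=True,
)

# Kind has already switched the kubectl context, so this should list three nodes.
subprocess.run(["kubectl", "get", "nodes"], check=True)

# Tear everything down again with a single command.
subprocess.run(["kind", "delete", "cluster", "--name", "demo"], check=True)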

 

references:

https://www.devoteam.com/expert-view/kind-simplifying-kubernetes-testing/#:~:text=Kind%20is%20an%20open%2Dsource,continuous%20integration%20(CI)%20pipelines.


Tuesday, May 20, 2025

High level overview of MCP components

At its core, MCP follows a client-server architecture where a host application can connect to multiple servers.

MCP Hosts - apps like Claude Desktop, Cursor, Windsurf, or AI tools that want to access data via MCP.

MCP Clients - protocol clients that maintain 1:1 connections with MCP servers, acting as the communication bridge.

MCP Servers - lightweight programs that each expose specific capabilities (like reading files, querying databases...) through the standardized Model Context Protocol.

Local Data Sources - files, databases, and services on your computer that MCP servers can securely access. For instance, a browser automation MCP server needs access to your browser to work.

Remote Services - External APIs and cloud-based systems that MCP servers can connect to.
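To make these pieces concrete, here is a minimal sketch of an MCP server in Python, assuming the official MCP Python SDK and its FastMCP helper; the server name, the notes directory, and the example tool are made up for illustration.

from mcp.server.fastmcp import FastMCP

# A lightweight MCP server that exposes one capability (a "tool") to MCP clients.
mcp = FastMCP("demo-notes")

@mcp.tool()
def read_note(name: str) -> str:
    """Return the contents of a local note file (a local data source)."""
    with open(f"./notes/{name}.txt") as f:
        return f.read()

if __name__ == "__main__":
    # stdio transport: the host (e.g. Claude Desktop) launches this process and
    # its MCP client talks to it over stdin/stdout.
    mcp.run(transport="stdio")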


Sunday, May 18, 2025

What is the main difference between Kubernetes and OpenShift?

People move from Kubernetes to OpenShift for several reasons, including ease of use, built-in tools, and enhanced security features. OpenShift simplifies Kubernetes management by providing a user-friendly interface and integrating CI/CD pipelines, making it easier for teams to deploy and manage applications. OpenShift also offers additional features like integrated development tools, image stream management, and centralized policy management, which can streamline development and operations. 

Here's a more detailed look at the reasons for migrating to OpenShift:

Ease of Use and Management:

OpenShift simplifies Kubernetes by providing a user-friendly interface and built-in tools for CI/CD, image management, and policy enforcement. This can significantly reduce the time and effort required to manage and operate Kubernetes clusters, particularly for teams without extensive Kubernetes expertise. 

Enhanced Security:

OpenShift offers a robust security framework, including role-based access control, network policies, and security contexts, which can help ensure the security of containerized applications. It also provides built-in security and encryption for container communications. 

Integrated Tools:

OpenShift includes a variety of integrated tools, such as source-to-image (S2I) for faster application development, image streams for container image management, and centralized policy management, which can streamline development workflows. 

Scalability and Customization:

OpenShift allows for customized scalability options to meet specific business needs, building upon the automated scaling capabilities of Kubernetes. 

On-Premise and Edge Computing:

OpenShift is well-suited for on-premise deployments and edge computing environments, offering robust security and management features in these environments. 

Red Hat Support:

OpenShift is backed by Red Hat's commercial support, which can be valuable for organizations that require a vendor-supported platform. 

Virtualization Integration:

OpenShift integrates with virtualization technologies like VMware and offers a unified platform for managing both virtualized and containerized workloads. 

In essence, OpenShift provides a more complete and user-friendly container management platform compared to bare-bones Kubernetes, offering a combination of enhanced security, integrated tools, and simplified management capabilities that can be particularly appealing to organizations seeking to streamline their cloud-native development and operations. 

What is auto-regressive decoding in LLMs?

Auto-regressive decoding is the fundamental process by which most large language models (LLMs) generate text, one token at a time, in a sequential manner. The core idea is that the model predicts the next token based on all the tokens that have been generated so far, including the initial input prompt.

Here's a breakdown of how it works:

The Process:

Input: The process starts with an input prompt, which is a sequence of tokens.

Encoding: The LLM first processes this input prompt, typically by converting the tokens into numerical representations called embeddings.

First Token Prediction: Based on the encoded prompt, the model predicts the probability distribution over its entire vocabulary for the next token.

Token Sampling/Selection: A decoding strategy is then used to select the next token from this probability distribution. Common strategies include:

Greedy Decoding: Simply selecting the token with the highest probability. This is fast but can lead to repetitive or suboptimal outputs.

Sampling: Randomly selecting a token based on its probability. This introduces more diversity but can also lead to less coherent outputs.

Beam Search: Keeping track of multiple promising candidate sequences (beams) and expanding them at each step. This often yields better quality text than greedy decoding but is more computationally expensive.

Appending the Token: The selected token is appended to the currently generated sequence.

Iteration: The model then takes the original prompt plus the newly generated token as the new input and repeats the prediction, selection, and appending steps for the next token.

Stopping Condition: This process continues until a predefined stopping condition is met, such as reaching a maximum sequence length or generating a special "end-of-sequence" token.

Output: The final sequence of generated tokens is then converted back into human-readable text.
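To make the loop concrete, here is a minimal sketch of greedy auto-regressive decoding using the Hugging Face transformers library; the model (gpt2), the prompt, and the 20-token limit are arbitrary choices for illustration, and no KV cache is used, so each step re-runs the full sequence.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Encode the prompt into token ids.
input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

for _ in range(20):  # stopping condition: at most 20 new tokens
    with torch.no_grad():
        logits = model(input_ids).logits              # forward pass over the whole sequence
    next_token = logits[:, -1, :].argmax(dim=-1)      # greedy: pick the most probable next token
    input_ids = torch.cat([input_ids, next_token.unsqueeze(-1)], dim=-1)  # append and repeat
    if next_token.item() == tokenizer.eos_token_id:   # or stop at the end-of-sequence token
        break

print(tokenizer.decode(input_ids[0]))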

Why is it called "Auto-regressive"?


The term "auto-regressive" comes from statistics and signal processing. In this context, it means that the model's output at each step is dependent on its own previous outputs. The model "regresses" on its own generated history to predict the future.


Key Characteristics:


Sequential Generation: Tokens are generated one after the other. This inherent sequential nature can be a bottleneck for inference speed, especially for long sequences.

Context Dependency: Each predicted token is conditioned on the entire preceding context. This allows the model to maintain coherence and relevance in its generated text.

Probability Distribution: At each step, the model outputs a probability distribution over the vocabulary, allowing for different decoding strategies to influence the final output.

Implications:


Inference Speed: The sequential nature of auto-regressive decoding is a primary factor contributing to the latency of LLM inference. Generating longer sequences requires more steps.

Computational Cost: Each decoding step involves a forward pass through the model, which can be computationally intensive for large models.

Decoding Strategy Impact: The choice of decoding strategy significantly affects the quality, diversity, and coherence of the generated text, as well as the inference speed.

In summary, auto-regressive decoding is the step-by-step process of generating text by predicting one token at a time, with each prediction being conditioned on the previously generated tokens and the initial input. It's a fundamental mechanism behind the impressive text generation capabilities of modern LLMs.


References 

Gemini 

What are the Cost, Throughput, and Latency of LLM Inference? What factors affect them, and how are they computed?

Throughput = queries / sec => maximise to speed up batch jobs or to serve more users.

Latency = sec / token => minimise for user experience (how responsive the application feels). People read on the order of a few words per second (roughly 200-300 words per minute), so as long as tokens stream faster than that, the application will feel responsive.

Cost: cheaper is better.

Let's break down the cost, throughput, and latency of LLM inference.

Cost of LLM Inference

What it is: The expense associated with running an LLM to generate responses or perform tasks based on input prompts.


Factors Affecting Cost:


Model Size: Larger models with more parameters generally require more computational resources, leading to higher costs.

Number of Tokens: Most LLM APIs charge based on the number of input and output tokens processed. Longer prompts and longer generated responses increase costs. Output tokens are often more expensive than input tokens.

Complexity of the Task: More complex tasks might require more processing and thus incur higher costs.

Hardware Used: The type and amount of hardware (GPUs, CPUs) used for inference significantly impact costs, especially for self-hosted models. Cloud-based services abstract this but factor it into their pricing.

Pricing Model: Different providers have varying pricing models (per token, per request, compute time, etc.).

Model Provider: Different providers offer the same or similar models at different price points.

Mixture of Experts (MoE) Models: These models might be priced based on the total number of parameters or the number of active parameters during inference.

How to Compute Cost:


The cost calculation depends on the pricing model of the LLM service or the infrastructure cost if self-hosting.


Per Token Pricing (Common for API services):

Cost = (Input Tokens / 1000 * Input Price per 1k tokens) + (Output Tokens / 1000 * Output Price per 1k tokens)

Self-Hosting: This involves calculating the cost of hardware (amortized over time), electricity, data center costs, and potentially software licenses. This is more complex and depends on your specific infrastructure.

Cloud Inference Services: These typically provide a per-token cost, and you can estimate based on your expected token usage. Some might have per-request fees as well.
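To make the per-token formula above concrete, here is a tiny Python helper; the token counts and per-1k prices in the example are made-up placeholders, not any provider's actual rates.

def llm_cost(input_tokens, output_tokens, input_price_per_1k, output_price_per_1k):
    """Cost = (input tokens / 1000 * input price) + (output tokens / 1000 * output price)."""
    return (input_tokens / 1000) * input_price_per_1k + (output_tokens / 1000) * output_price_per_1k

# Example with made-up prices: 2,000 input tokens and 500 output tokens,
# priced at $0.15 and $0.60 per 1k tokens respectively.
print(llm_cost(2000, 500, 0.15, 0.60))  # 0.30 + 0.30 = $0.60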

Key Considerations:


Input vs. Output Tokens: Be mindful of the different costs for input and output tokens.

Context Length: Longer context windows can lead to higher token usage and thus higher costs.

Tokenization: Different models tokenize text differently, affecting the number of tokens for the same input.

Throughput of LLM Inference

What it is: The rate at which an LLM can process inference requests. It's often measured in:


Tokens per second (TPS): The number of input and/or output tokens the model can process or generate in one second. This is a common metric.

Requests per second (RPS): The number of independent inference requests the model can handle in one second. This depends on the total generation time per request.

Factors Affecting Throughput:


Model Size and Architecture: Smaller, less complex models generally have higher throughput.

Hardware: More powerful GPUs or CPUs with higher memory bandwidth lead to higher throughput. Using multiple parallel processing units (GPUs) significantly increases throughput.

Batch Size: Processing multiple requests together (batching) can significantly improve throughput by better utilizing the hardware. However, very large batch sizes can increase latency due to memory limitations.

Input/Output Length: Shorter prompts and shorter generated responses lead to higher throughput.

Optimization Techniques: Techniques like quantization, pruning, key-value caching, and efficient attention mechanisms (e.g., FlashAttention, Group Query Attention) can significantly boost throughput.

Parallelism: Techniques like tensor parallelism and pipeline parallelism distribute the model and its computations across multiple devices, improving throughput for large models.

Memory Bandwidth: The speed at which data (model weights, activations) can be transferred to the processing units is a crucial bottleneck for throughput.

How to Compute Throughput:


Tokens per Second:


Measure the total number of tokens processed (input + output) or generated (output only) over a specific time period.

TPS = Total Tokens / Total Time (in seconds)

Requests per Second:


Measure the total number of inference requests completed over a specific time period.

RPS = Total Requests / Total Time (in seconds)

RPS is also related to the average total generation time per request:


RPS ≈ 1 / Average Total Generation Time per Request

Key Considerations:


Input vs. Output Tokens: Specify whether the throughput refers to input, output, or the sum of both.

Concurrency: Throughput often increases with the number of concurrent requests, up to a certain point.

Latency Trade-off: Increasing batch size to improve throughput can increase the latency for individual requests.
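Putting the two throughput formulas above into code, here is a minimal measurement sketch; generate_fn is a placeholder for whatever call actually produces completions and is assumed to return the number of tokens it generated.

import time

def measure_throughput(generate_fn, prompts):
    """Return (tokens/sec, requests/sec) over a batch of prompts.

    generate_fn is assumed to return the number of tokens it generated for a prompt.
    """
    start = time.perf_counter()
    total_tokens = sum(generate_fn(p) for p in prompts)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed, len(prompts) / elapsed

# Usage (my_generate stands in for a real model or API call):
# tps, rps = measure_throughput(my_generate, ["prompt 1", "prompt 2", "prompt 3"])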

Latency of LLM Inference

What it is: The delay between sending an inference request to the LLM and receiving the response. It's a critical factor for user experience, especially in real-time applications. Common metrics include:


Time to First Token (TTFT): The time it takes for the model to generate the very first token of the response after receiving the prompt. This is crucial for perceived responsiveness.

Time Per Output Token (TPOT) / Inter-Token Latency (ITL): The average time taken to generate each subsequent token after the first one.

Total Generation Time / End-to-End Latency: The total time from sending the request to receiving the complete response.

Total Generation Time = TTFT + (TPOT * Number of Output Tokens)

Factors Affecting Latency:


Model Size and Complexity: Larger models generally have higher latency due to the increased computations required.

Input/Output Length: Longer prompts require more processing time before the first token can be generated (affecting TTFT). Longer desired output lengths naturally increase the total generation time.

Hardware: Faster GPUs or CPUs with lower memory access times reduce latency.

Batch Size: While batching improves throughput, it can increase the latency for individual requests as they wait to be processed in a batch.

Optimization Techniques: Model compression (quantization, pruning) and optimized attention mechanisms can reduce the computational overhead and thus lower latency.

Network Conditions: For cloud-based APIs, network latency between the user and the inference server adds to the overall latency. Geographical distance to the server matters.

System Load: High load on the inference server can lead to queuing and increased latency.

Cold Starts: The first inference request after a period of inactivity might experience higher latency as the model and necessary data are loaded into memory.

Tokenization: The time taken to tokenize the input prompt also contributes to the initial latency (TTFT).

How to Compute Latency:


Time to First Token (TTFT): Measure the time difference between sending the request and receiving the first token.

Time Per Output Token (TPOT): Measure the time taken to generate the entire response (excluding TTFT) and divide it by the number of output tokens.

TPOT = (Total Generation Time - TTFT) / Number of Output Tokens

Total Generation Time: Measure the time difference between sending the request and receiving the last token of the response.
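For a streaming setup, all three latency metrics can be measured directly from the token stream. The sketch below is a minimal illustration; stream_fn is a placeholder for a streaming client that yields one output token at a time.

import time

def measure_latency(stream_fn, prompt):
    """Return (TTFT, TPOT, total generation time) in seconds for one streamed request."""
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _token in stream_fn(prompt):        # stream_fn yields output tokens one at a time
        if ttft is None:
            ttft = time.perf_counter() - start   # time to first token
        n_tokens += 1
    total = time.perf_counter() - start
    if ttft is None:                        # no tokens came back
        ttft = total
    # Average time per output token after the first one (cf. the TPOT formula above).
    tpot = (total - ttft) / max(n_tokens - 1, 1)
    return ttft, tpot, total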

Key Considerations:


TTFT Importance: For interactive applications, minimizing TTFT is crucial for a good user experience.

Trade-off with Throughput: Optimizations for higher throughput (like large batch sizes) can negatively impact latency.

Variability: Latency can vary depending on the specific prompt, the model's state, and the server load. It's often useful to measure average and percentile latencies.

Understanding these three aspects – cost, throughput, and latency – and the factors that influence them is crucial for effectively deploying and utilizing LLMs in various applications. There's often a trade-off between these metrics, and the optimal balance depends on the specific use case and requirements.


Creating a Knowledge Graph using SimpleKGPipeline

Below are the main node types of the graph we are creating:

Document: metadata for document sources

Chunk: text chunks from the documents with embeddings to power vector retrieval

__Entity__: Entities extracted from the text chunks

Creating a knowledge graph with the GraphRAG Python package is pretty simple

The SimpleKGPipeline class allows you to automatically build a knowledge graph with a few key inputs, including

a driver to connect to Neo4j,

an LLM for entity extraction, and

an embedding model to create vectors on text chunks for similarity search.

Neo4j Driver

The Neo4j driver allows you to connect and perform read and write transactions with the database. You can obtain the URI, username, and password variables from when you created the database. If you created your database on AuraDB, they are in the file you downloaded.

import neo4j

neo4j_driver = neo4j.GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))


LLM & Embedding Model

In this case, we will use OpenAI GPT-4o-mini for convenience. It is a fast and low-cost model. The GraphRAG Python package supports any LLM model, including models from OpenAI, Google VertexAI, Anthropic, Cohere, Azure OpenAI, local Ollama models, and any chat model that works with LangChain. You can also implement a custom interface for any other LLM.


Likewise, we will use OpenAI’s default text-embedding-ada-002 for the embedding model, but you can use other embedders from different providers.


import neo4j
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings

llm = OpenAILLM(
    model_name="gpt-4o-mini",
    model_params={
        "response_format": {"type": "json_object"},  # use json_object formatting for best results
        "temperature": 0,  # turn temperature down for more deterministic results
    },
)

# create text embedder
embedder = OpenAIEmbeddings()


Optional Inputs: Schema & Prompt Template

While not required, adding a graph schema is highly recommended for improving knowledge graph quality. It provides guidance for the node and relationship types to create during entity extraction.


Pro-tip: If you are still deciding what schema to use, try building a graph without a schema first and examine the most common node and relationship types created as a starting point.


For our graph schema, we will define entities (a.k.a. node labels) and relations that we want to extract. While we won’t use it in this simple example, there is also an optional potential_schema argument, which can guide which relationships should connect to which nodes.



basic_node_labels = ["Object", "Entity", "Group", "Person", "Organization", "Place"]

academic_node_labels = ["ArticleOrPaper", "PublicationOrJournal"]

medical_node_labels = ["Anatomy", "BiologicalProcess", "Cell", "CellularComponent",
                       "CellType", "Condition", "Disease", "Drug",
                       "EffectOrPhenotype", "Exposure", "GeneOrProtein", "Molecule",
                       "MolecularFunction", "Pathway"]

node_labels = basic_node_labels + academic_node_labels + medical_node_labels

# define relationship types
rel_types = ["ACTIVATES", "AFFECTS", "ASSESSES", "ASSOCIATED_WITH", "AUTHORED",
             "BIOMARKER_FOR", …]


We will also be adding a custom prompt for entity extraction. While the GraphRAG Python package has an internal default prompt, engineering a prompt closer to your use case often helps create a more applicable knowledge graph. The prompt below was created with a bit of experimentation. Be sure to follow the same general format as the default prompt.


prompt_template = '''
You are a medical researcher tasked with extracting information from papers
and structuring it in a property graph to inform further medical and research Q&A.

Extract the entities (nodes) and specify their type from the following Input text.
Also extract the relationships between these nodes. The relationship direction goes from the start node to the end node.

Return result as JSON using the following format:
{{"nodes": [ {{"id": "0", "label": "the type of entity", "properties": {{"name": "name of entity" }} }}],
  "relationships": [{{"type": "TYPE_OF_RELATIONSHIP", "start_node_id": "0", "end_node_id": "1", "properties": {{"details": "Description of the relationship"}} }}] }}

- Use only the information from the Input text. Do not add any additional information.
- If the input text is empty, return empty JSON.
- Make sure to create as many nodes and relationships as needed to offer rich medical context for further research.
- An AI knowledge assistant must be able to read this graph and immediately understand the context to inform detailed research questions.
- Multiple documents will be ingested from different sources and we are using this property graph to connect information, so make sure entity types are fairly general.

Use only the following nodes and relationships (if provided):
{schema}

Assign a unique ID (string) to each node, and reuse it to define relationships.
Do respect the source and target node types for relationship and
the relationship direction.

Do not return any additional information other than the JSON.

Examples:
{examples}

Input text:

{text}
'''


Creating the SimpleKGPipeline

Create the SimpleKGPipeline using the constructor below:



from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

kg_builder_pdf = SimpleKGPipeline(
    llm=llm,
    driver=neo4j_driver,
    text_splitter=FixedSizeSplitter(chunk_size=500, chunk_overlap=100),
    embedder=embedder,
    entities=node_labels,
    relations=rel_types,
    prompt_template=prompt_template,
    from_pdf=True,
)



Running the Knowledge Graph Builder

You can run the knowledge graph builder with the run_async method. We are going to iterate through 3 PDFs below.


pdf_file_paths = ['truncated-pdfs/biomolecules-11-00928-v2-trunc.pdf',
                  'truncated-pdfs/GAP-between-patients-and-clinicians_2023_Best-Practice-trunc.pdf',
                  'truncated-pdfs/pgpm-13-39-trunc.pdf']

for path in pdf_file_paths:
    print(f"Processing : {path}")
    pdf_result = await kg_builder_pdf.run_async(file_path=path)
    print(f"Result: {pdf_result}")


Once complete, you can explore the resulting knowledge graph. The Unified Console provides a great interface for this.


Go to the Query tab and enter the below query to see a sample of the graph.



MATCH p=()-->() RETURN p LIMIT 1000;


Friday, May 16, 2025

Basics of the GraphRAG Python package

This package contains the official Neo4j GraphRAG features for Python.

The purpose of this package is to provide a first party package to developers, where Neo4j can guarantee long term commitment and maintenance as well as being fast to ship new features and high performing patterns and methods.

 This package is a renamed continuation of neo4j-genai. The package neo4j-genai is deprecated and will no longer be maintained. We encourage all users to migrate to this new package to continue receiving updates and support.

pip install neo4j-graphrag

pip install "neo4j-graphrag[openai]"


LLM providers (at least one is required for RAG and KG Builder Pipeline):

ollama: LLMs from Ollama

openai: LLMs from OpenAI (including AzureOpenAI)

google: LLMs from Vertex AI

cohere: LLMs from Cohere

anthropic: LLMs from Anthropic

mistralai: LLMs from MistralAI


sentence-transformers: to use embeddings from the sentence-transformers Python package


Vector database (to use External Retrievers):

weaviate: store vectors in Weaviate


pinecone: store vectors in Pinecone


qdrant: store vectors in Qdrant


experimental: experimental features mainly from the Knowledge Graph creation pipelines.


nlp:

spaCy: load spaCy trained models for nlp pipelines, used by SpaCySemanticMatchResolver component from the Knowledge Graph creation pipelines.


fuzzy-matching:

rapidfuzz: apply fuzzy matching using string similarity, used by FuzzyMatchResolver component from the Knowledge Graph creation pipelines.



Sample usage is shown below.


Creating the Vector indexes 

===========================


from neo4j import GraphDatabase
from neo4j_graphrag.indexes import create_vector_index

URI = "neo4j://localhost:7687"
AUTH = ("neo4j", "password")

INDEX_NAME = "vector-index-name"

# Connect to Neo4j database
driver = GraphDatabase.driver(URI, auth=AUTH)

# Creating the index
create_vector_index(
    driver,
    INDEX_NAME,
    label="Document",
    embedding_property="vectorProperty",
    dimensions=1536,
    similarity_fn="euclidean",
)



Populating the vector indexes 

===========================


from neo4j import GraphDatabase
from neo4j_graphrag.indexes import upsert_vectors
from neo4j_graphrag.types import EntityType

URI = "neo4j://localhost:7687"
AUTH = ("neo4j", "password")

# Connect to Neo4j database
driver = GraphDatabase.driver(URI, auth=AUTH)

# Upsert the vector
vector = ...
upsert_vectors(
    driver,
    ids=["1234"],
    embedding_property="vectorProperty",
    embeddings=[vector],
    entity_type=EntityType.NODE,
)


Below is how to retrieve the documents 


from neo4j import GraphDatabase
from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings
from neo4j_graphrag.retrievers import VectorRetriever

URI = "neo4j://localhost:7687"
AUTH = ("neo4j", "password")

INDEX_NAME = "vector-index-name"

# Connect to Neo4j database
driver = GraphDatabase.driver(URI, auth=AUTH)

# Create Embedder object
# Note: An OPENAI_API_KEY environment variable is required here
embedder = OpenAIEmbeddings(model="text-embedding-3-large")

# Initialize the retriever
retriever = VectorRetriever(driver, INDEX_NAME, embedder)

# Run the similarity search
query_text = "How do I do similarity search in Neo4j?"
response = retriever.search(query_text=query_text, top_k=5)
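To inspect the results, you can print the response or iterate over the returned matches. The attribute names below (items, content) reflect the package's RetrieverResult type; double-check them against the version you install.

print(response)

# Or iterate over the individual matches:
for item in response.items:
    print(item.content)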



references:

https://neo4j.com/docs/neo4j-graphrag-python/current/