Sunday, December 7, 2025

Semantic Search Options in AWS

 Excellent question! **Semantic search** goes beyond keyword matching to understand the *meaning* and *context* of queries. Here's how to enable semantic search across each AWS database service:


## **Two Main Approaches for Semantic Search**


1. **Vector Search**: Converting text/images into embeddings (vectors) and searching by similarity

2. **LLM-Enhanced Search**: Using Large Language Models to understand query intent


---


## **How to Enable Semantic Search on Each Service**


### **1. Amazon OpenSearch Service / OpenSearch Serverless**

**Native Support:** ✅ **Best suited for semantic search**

- **OpenSearch Neural Search Plugin** (built-in):

  - Supports vector search using ML models (BERT, sentence-transformers)

  - Can generate embeddings or ingest pre-computed embeddings

  - `k-NN` (k-nearest neighbors) index for similarity search

- **Implementation:**

  ```json

  // 1. Create index with vector field

  {

    "settings": {"index.knn": true},

    "mappings": {

      "properties": {

        "embedding": {

          "type": "knn_vector",

          "dimension": 768

        },

        "text": {"type": "text"}

      }

    }

  }

  

  // 2. Query using semantic similarity

  {

    "query": {

      "knn": {

        "embedding": {

          "vector": [0.1, 0.2, ...],  // Query embedding

          "k": 10

        }

      }

    }

  }

  ```

- **Use Cases:** Hybrid search (combining keyword + semantic), RAG (Retrieval-Augmented Generation)
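
To tie the index and query examples above together end to end, here is a minimal Python sketch that generates a query embedding with Amazon Bedrock (Titan) and runs the k-NN query. The region, domain endpoint, index name, credentials, and model ID are placeholders, and the embedding dimension must match the `dimension` declared in the index mapping.

```python
import json

import boto3
import requests

# Placeholders: region, domain endpoint, index name, and credentials are assumptions
REGION = "us-east-1"
OPENSEARCH_URL = "https://my-domain.us-east-1.es.amazonaws.com"
INDEX = "documents"

bedrock = boto3.client("bedrock-runtime", region_name=REGION)


def embed(text):
    # Titan Text Embeddings; the returned vector length must match the index mapping
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]


query_vector = embed("how do I reset my password?")

search_body = {
    "size": 10,
    "query": {"knn": {"embedding": {"vector": query_vector, "k": 10}}},
}

# Basic auth shown for brevity; production setups typically sign requests (SigV4)
# or use fine-grained access control
response = requests.post(
    f"{OPENSEARCH_URL}/{INDEX}/_search",
    auth=("admin", "password"),
    json=search_body,
    timeout=30,
)

for hit in response.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"])
```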


### **2. Amazon Aurora PostgreSQL**

**Native Support:** ✅ **via pgvector extension**

- **pgvector extension** adds vector similarity search capabilities:

  ```sql

  -- Enable extension

  CREATE EXTENSION vector;

  

  -- Create table with vector column

  CREATE TABLE documents (

    id SERIAL PRIMARY KEY,

    content TEXT,

    embedding vector(1536)  -- OpenAI embeddings dimension

  );

  

  -- Create index for fast similarity search

  CREATE INDEX ON documents USING ivfflat (embedding vector_cosine_ops);

  

  -- Semantic search query

  -- The <=> operator returns cosine distance (smaller = more similar)

  SELECT content, embedding <=> '[0.1, 0.2, ...]' AS distance

  FROM documents

  ORDER BY distance

  LIMIT 10;

  ```

- **Integration:** Use AWS Lambda + Amazon Bedrock/SageMaker to generate embeddings

- **Best for:** Applications already on PostgreSQL needing semantic capabilities
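
As a minimal client-side sketch of the SQL above (hostname, credentials, and the stand-in embeddings are placeholders; in practice the vectors come from Bedrock or SageMaker as noted):

```python
import psycopg2


def to_pgvector(vec):
    # pgvector accepts the textual form '[v1,v2,...]'
    return "[" + ",".join(str(v) for v in vec) + "]"


# Stand-in embeddings; real ones must match the declared dimension, e.g. vector(1536)
doc_vec = [0.12, 0.03, 0.88]
query_vec = [0.10, 0.05, 0.90]

conn = psycopg2.connect(host="my-aurora-endpoint", dbname="app", user="app", password="secret")
with conn, conn.cursor() as cur:
    # Store a document together with its embedding
    cur.execute(
        "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
        ("How to reset a password", to_pgvector(doc_vec)),
    )
    # Retrieve the 10 nearest documents by cosine distance
    cur.execute(
        "SELECT content, embedding <=> %s::vector AS distance "
        "FROM documents ORDER BY distance LIMIT 10",
        (to_pgvector(query_vec),),
    )
    for content, distance in cur.fetchall():
        print(round(distance, 4), content)
```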


### **3. Amazon RDS for PostgreSQL**

**Native Support:** ✅ **Same as Aurora PostgreSQL**

- Also supports the `pgvector` extension (on supported RDS for PostgreSQL engine versions)

- Similar implementation to Aurora

- **Limitation:** May have slightly lower performance than Aurora for large-scale vector operations


### **4. Amazon DynamoDB**

**No Native Support:** ❌ But can be enabled via:

- **DynamoDB + OpenSearch/ElastiCache** pattern:

  1. Store metadata in DynamoDB

  2. Store vectors in OpenSearch/Amazon MemoryDB (Redis with RedisVL)

  3. Use DynamoDB Streams to keep them in sync

- **DynamoDB + S3** pattern:

  1. Store vectors as Parquet files in S3

  2. Use Athena/Pandas for similarity search

  3. Store metadata in DynamoDB

- **Bedrock Knowledge Bases** (newest approach):

  - AWS-managed RAG solution

  - Automatically chunks documents, generates embeddings, stores in vector database

  - Can use OpenSearch as vector store with DynamoDB as metadata store
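
As a rough sketch of the first pattern above (the table schema, index name, endpoint, and auth are assumptions; a production setup would sign requests with SigV4 or use fine-grained access control), a Lambda function subscribed to the table's stream can mirror item changes into an OpenSearch k-NN index:

```python
import os

import requests

OPENSEARCH_URL = os.environ.get("OPENSEARCH_URL", "https://my-domain.us-east-1.es.amazonaws.com")
INDEX = "documents"


def handler(event, context):
    """Triggered by DynamoDB Streams; mirrors inserts, updates, and deletes into OpenSearch."""
    for record in event["Records"]:
        key = record["dynamodb"]["Keys"]["pk"]["S"]  # assumes a string partition key named "pk"

        if record["eventName"] in ("INSERT", "MODIFY"):
            image = record["dynamodb"]["NewImage"]
            doc = {
                "text": image["text"]["S"],
                # embedding stored on the item as a list of numbers, pre-computed elsewhere
                "embedding": [float(n["N"]) for n in image["embedding"]["L"]],
            }
            requests.put(f"{OPENSEARCH_URL}/{INDEX}/_doc/{key}", json=doc, timeout=10)
        elif record["eventName"] == "REMOVE":
            requests.delete(f"{OPENSEARCH_URL}/{INDEX}/_doc/{key}", timeout=10)
```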


### **5. Amazon DocumentDB**

**Limited Native Support:** ⚠️ **Workaround needed**

- No native vector data type

- **Solutions:**

  1. **Store embeddings as arrays**: `"embedding": [0.1, 0.2, ...]`

  2. **Use cosine similarity in application code** (not efficient at scale)

  3. **Hybrid approach**: Store in DocumentDB, index vectors in OpenSearch

- Not recommended for production semantic search at scale


### **6. Amazon MemoryDB for Redis**

**Native Support:** ✅ **via RedisVL (Redis Vector Library)**

- **Redis Stack** includes RediSearch with vector search:

  ```bash

  # Create index with vector field

  FT.CREATE doc_idx 

    ON HASH 

    PREFIX 1 doc:

    SCHEMA 

      content TEXT 

      embedding VECTOR 

        FLAT 6 

        TYPE FLOAT32 

        DIM 768 

        DISTANCE_METRIC COSINE

  

  # Search similar vectors

  FT.SEARCH doc_idx 

    "(*)=>[KNN 10 @embedding $query_vector]" 

    PARAMS 2 query_vector "<binary_vector>" 

    DIALECT 2

  ```

- **Advantage:** Ultra-low latency (millisecond search)

- **Best for:** Real-time semantic search, caching vector results


### **7. Amazon Neptune**

**Limited Support:** ⚠️ **Graph-enhanced semantic search**

- Not designed for vector similarity search

- **Alternative approach:** Graph-augmented semantic search

  1. Use OpenSearch for vector search

  2. Use Neptune to navigate relationships in results

  3. Example: Find semantically similar documents, then find related entities in graph

- **Use Case:** Knowledge graphs with semantic understanding


---


## **AWS-Managed Semantic Search Solutions**


### **Option 1: Amazon Bedrock Knowledge Bases**

**Fully managed RAG solution:**

1. Upload documents to S3

2. Bedrock automatically:

   - Chunks documents

   - Generates embeddings (using Titan or Cohere)

   - Stores vectors in supported vector database

3. Query via Retrieve API

4. **Supported vector stores:** Aurora PostgreSQL, OpenSearch, Pinecone
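
As a minimal sketch of querying a knowledge base through the Retrieve API with boto3 (the knowledge base ID and region are placeholders):

```python
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.retrieve(
    knowledgeBaseId="KB1234567890",  # hypothetical ID of your knowledge base
    retrievalQuery={"text": "How do I rotate my API keys?"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 5}},
)

for result in response["retrievalResults"]:
    print(result["score"], result["content"]["text"])
```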


### **Option 2: Amazon Kendra**

**Enterprise semantic search service:**

- Pre-trained models for understanding natural language

- Connectors for various data sources

- No need to manage embeddings/models

- More expensive but requires zero ML expertise


### **Option 3: Custom Architecture with SageMaker**

```

Data Sources → SageMaker (Embedding Model) → Vector Store → Query Service

      ↓               ↓                         ↓            ↓

   DynamoDB     Lambda Functions          OpenSearch    API Gateway

 (Metadata)    (Orchestration)           (Vectors)     (Client)

```


---


## **Recommendations by Use Case**


| Use Case | Recommended AWS Stack |

|----------|----------------------|

| **Starting fresh** | **OpenSearch Service** (neural plugin) or **Bedrock Knowledge Bases** |

| **Already on PostgreSQL** | **Aurora PostgreSQL** with pgvector |

| **Real-time/low-latency** | **MemoryDB for Redis** (RedisVL) |

| **Enterprise/zero-ML** | **Amazon Kendra** |

| **Serverless RAG** | **Bedrock Knowledge Bases** + **DynamoDB** |

| **High-scale hybrid search** | **OpenSearch** (combines BM25 + vector search) |


## **Quick Start Path**

For most applications starting semantic search on AWS:


1. **For simplicity**: Use **Amazon Bedrock Knowledge Bases**

2. **For control/flexibility**: Use **OpenSearch Service with neural plugin**

3. **For existing PostgreSQL apps**: Add **pgvector to Aurora PostgreSQL**

4. **For real-time needs**: Use **MemoryDB for Redis with RedisVL**


The key is generating quality embeddings. AWS offers:

- **Amazon Titan Embeddings** (via Bedrock)

- **SageMaker JumpStart** (pre-trained models)

- **SageMaker Training** (custom models)

Amazon Database Options

 Of course. This is an excellent list that covers the major AWS managed database and analytics services. The key to understanding them is to recognize they solve different types of data problems.


Here’s a breakdown of each service, grouped by their primary purpose.


---


### **1. Search & Analytics Engines**

These services are optimized for full-text search, log analytics, and real-time application monitoring.


*   **Amazon OpenSearch Service:**

    *   **What it is:** A managed service for **OpenSearch** (the open-source fork of Elasticsearch) and Kibana. It's a search and analytics engine.

    *   **Use Case:** Ideal for log and event data analysis (like application logs, cloud infrastructure logs), full-text search (product search on a website), and real-time application monitoring dashboards.

    *   **Analogy:** A super-powered, distributed "Ctrl+F" for your entire application's data, with built-in visualization tools.


*   **Amazon OpenSearch Serverless:**

    *   **What it is:** A **serverless option** for OpenSearch. You don't provision or manage clusters. It automatically scales based on workload.

    *   **Use Case:** Perfect for **spiky or unpredictable search and analytics workloads**. You pay only for the resources you consume during queries and indexing, without the operational overhead.

    *   **Key Difference vs. OpenSearch Service:** No clusters to manage. Automatic, fine-grained scaling.


---


### **2. Relational Databases (SQL)**

These are traditional table-based databases for structured data, ensuring ACID (Atomicity, Consistency, Isolation, Durability) compliance.


*   **Amazon Aurora PostgreSQL:**

    *   **What it is:** A **high-performance, AWS-built, fully compatible** drop-in replacement for PostgreSQL. It uses a distributed, cloud-native storage architecture.

    *   **Use Case:** The default choice for most new relational workloads on AWS. Ideal for complex transactions, e-commerce applications, and ERP systems where you need high throughput, scalability, and durability. It typically offers better performance and availability than standard RDS.

    *   **Key Feature:** Storage automatically grows in 10GB increments up to 128 TB. It replicates data six ways across Availability Zones.


*   **Amazon RDS for PostgreSQL:**

    *   **What it is:** The classic **managed service for running a standard PostgreSQL database** on AWS. AWS handles provisioning, patching, backups, and failure detection.

    *   **Use Case:** When you need a straightforward, fully-managed PostgreSQL instance without the advanced cloud-native architecture of Aurora. It's easier to migrate to from on-premises PostgreSQL.

    *   **Key Difference vs. Aurora:** Uses the standard PostgreSQL storage engine. Simpler architecture, often slightly lower cost for light workloads, but with more manual scaling steps and lower performance ceilings than Aurora.


---


### **3. NoSQL Databases**

These are for non-relational data, optimized for specific data models like documents, key-value, or graphs.


*   **Amazon DocumentDB (with MongoDB compatibility):**

    *   **What it is:** A managed **document database** service that is **API-compatible with MongoDB**. It uses a distributed, durable storage system built by AWS.

    *   **Use Case:** Storing and querying JSON-like documents (e.g., user profiles, product catalogs, content management). Good for workloads that benefit from MongoDB's flexible schema but need AWS's scalability and manageability.

    *   **Note:** It does **not** use the MongoDB server code; it emulates the API.


*   **Amazon DynamoDB:**

    *   **What it is:** A fully managed, **serverless, key-value and document database**. It offers single-digit millisecond performance at any scale.

    *   **Use Case:** High-traffic web applications (like gaming, ad-tech), serverless backends (with AWS Lambda), and any application needing consistent, fast performance for simple lookups and massive scale (e.g., shopping cart data, session storage).

    *   **Key Feature:** "Zero-ETL with..." refers to new integrations that let DynamoDB data be analyzed in services such as Amazon Redshift or OpenSearch Service without building manual Extract, Transform, Load pipelines.


*   **Amazon MemoryDB for Redis:**

    *   **What it is:** A **Redis-compatible, in-memory database** service offering high performance and durability. It stores the entire dataset in memory and uses a multi-AZ transactional log for persistence.

    *   **Use Case:** Use as a **primary database** for applications that require ultra-fast performance and data persistence (e.g., real-time leaderboards, session stores, caching with strong consistency). It's more than just a cache.


*   **Amazon Neptune:**

    *   **What it is:** A fully managed **graph database** service.

    *   **Use Case:** For applications where relationships between data points are highly connected and as important as the data itself. Ideal for social networks (friends of friends), fraud detection (unusual connection patterns), knowledge graphs, and network security.


---


### **Summary Table**


| Service | Category | Primary Data Model | Best For |

| :--- | :--- | :--- | :--- |

| **OpenSearch Service** | Search/Analytics | Search Index | Log analytics, full-text search |

| **OpenSearch Serverless** | Search/Analytics | Search Index | **Serverless** log analytics & search |

| **Aurora PostgreSQL** | Relational (SQL) | Tables (Rows/Columns) | High-performance, cloud-native OLTP apps |

| **RDS for PostgreSQL** | Relational (SQL) | Tables (Rows/Columns) | Traditional, fully-managed PostgreSQL |

| **DocumentDB** | NoSQL | Documents (JSON) | MongoDB-compatible document workloads |

| **DynamoDB** | NoSQL | Key-Value & Document | Serverless apps, massive scale, low-latency |

| **MemoryDB for Redis** | NoSQL (In-Memory) | Key-Value, Data Structures | **Primary database** needing microsecond reads |

| **Neptune** | NoSQL | Graph (Nodes/Edges) | Highly connected data (relationships) |


**Choosing the right one depends on:**

1.  **Your Data Structure:** Is it tables, JSON documents, key-value pairs, or interconnected relationships?

2.  **Access Patterns:** Do you need complex queries, simple lookups, full-text search, or graph traversals?

3.  **Scale & Performance Needs:** Predictable workload vs. spiky, need for millisecond vs. sub-millisecond latency.

4.  **Operational Preference:** Do you want to manage servers/clusters (RDS) or go serverless (DynamoDB, OpenSearch Serverless)?


AWS often provides multiple ways to solve a problem (e.g., Aurora vs. RDS), and the best choice is dictated by your specific application requirements.

Wednesday, December 3, 2025

What is Supervisor & supervisord.conf

 supervisord.conf is the configuration file used by Supervisor, a process control system in Linux/Unix environments. Supervisor is used to manage and monitor processes, ensuring they start automatically, restart on failure, and stay running in the background.

Here’s a detailed breakdown:


1. Purpose of supervisord.conf

  • It tells Supervisor how to manage processes.

  • Defines global settings for the Supervisor daemon.

  • Specifies programs or services to start, stop, or monitor.

  • Can configure logging, process limits, environment variables, and more.


2. Key Sections in supervisord.conf

  1. [supervisord]

    • Configures the Supervisor daemon itself.

    • Example settings: log file, pid file, child process behavior.

    [supervisord]
    logfile=/var/log/supervisord.log
    pidfile=/var/run/supervisord.pid
    
  2. [program:x]

    • Defines a program or service to run under Supervisor.

    • x is the program name.

    • Common options:

      • command → command to start the program

      • autostart → start automatically when Supervisor starts

      • autorestart → restart on failure

      • stderr_logfile / stdout_logfile → logging

    [program:myapp]
    command=/usr/bin/python3 /home/user/app.py
    autostart=true
    autorestart=true
    stderr_logfile=/var/log/myapp.err.log
    stdout_logfile=/var/log/myapp.out.log
    
  3. [include] (optional)

    • Allows including multiple configuration files.

    [include]
    files = /etc/supervisor/conf.d/*.conf
    

3. How it works

  1. Start Supervisor daemon:

    sudo supervisord -c /etc/supervisord.conf
    
  2. Control processes with supervisorctl:

    sudo supervisorctl status
    sudo supervisorctl restart myapp
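
Putting the pieces together, here is a minimal sample supervisord.conf for a hypothetical Python web app served by gunicorn (paths, user, and port are assumptions to adapt):

    [supervisord]
    logfile=/var/log/supervisord.log
    pidfile=/var/run/supervisord.pid

    [program:webapp]
    command=/home/user/venv/bin/gunicorn app:app --bind 0.0.0.0:8000
    directory=/home/user/webapp
    autostart=true
    autorestart=true
    stderr_logfile=/var/log/webapp.err.log
    stdout_logfile=/var/log/webapp.out.log

    [include]
    files = /etc/supervisor/conf.d/*.conf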
    

In short:
supervisord.conf is the master configuration file for Supervisor, specifying what programs to manage, how to manage them, and how Supervisor itself behaves. It’s widely used for keeping services like web apps, background jobs, or workers running reliably on servers.


If you want, I can also make a sample supervisord.conf template for a Python web app so you can see a real-world example. Do you want me to do that?

Sunday, November 30, 2025

What are most useful WebHooks for MLFlow?

Overview

MLflow webhooks enable real-time notifications when specific events occur in the Model Registry and Prompt Registry. When you register a model or prompt, create a new version, or modify tags and aliases, MLflow can automatically send HTTP POST requests to your specified endpoints. This enables seamless integration with CI/CD pipelines, notification systems, and other external services.


Key Features

Real-time notifications for Model Registry and Prompt Registry events

HMAC signature verification for secure webhook delivery

Multiple event types including model/prompt creation, versioning, and tagging

Built-in testing to verify webhook connectivity

Supported Events

MLflow webhooks support the following Model Registry and Prompt Registry events:


| Event | Description | Payload Schema |
|-------|-------------|----------------|
| registered_model.created | Triggered when a new registered model is created | RegisteredModelCreatedPayload |
| model_version.created | Triggered when a new model version is created | ModelVersionCreatedPayload |
| model_version_tag.set | Triggered when a tag is set on a model version | ModelVersionTagSetPayload |
| model_version_tag.deleted | Triggered when a tag is deleted from a model version | ModelVersionTagDeletedPayload |
| model_version_alias.created | Triggered when an alias is created for a model version | ModelVersionAliasCreatedPayload |
| model_version_alias.deleted | Triggered when an alias is deleted from a model version | ModelVersionAliasDeletedPayload |
| prompt.created | Triggered when a new prompt is created | PromptCreatedPayload |
| prompt_version.created | Triggered when a new prompt version is created | PromptVersionCreatedPayload |
| prompt_tag.set | Triggered when a tag is set on a prompt | PromptTagSetPayload |
| prompt_tag.deleted | Triggered when a tag is deleted from a prompt | PromptTagDeletedPayload |
| prompt_version_tag.set | Triggered when a tag is set on a prompt version | PromptVersionTagSetPayload |
| prompt_version_tag.deleted | Triggered when a tag is deleted from a prompt version | PromptVersionTagDeletedPayload |
| prompt_alias.created | Triggered when an alias is created for a prompt version | PromptAliasCreatedPayload |
| prompt_alias.deleted | Triggered when an alias is deleted from a prompt | PromptAliasDeletedPayload |
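
Events are delivered as JSON via HTTP POST. As a minimal sketch of a receiver that checks the HMAC signature before acting on an event (the header name and signing scheme here are assumptions; check the MLflow webhooks documentation for the exact format), using Flask for brevity:

import hashlib
import hmac

from flask import Flask, abort, request

app = Flask(__name__)
WEBHOOK_SECRET = b"my-shared-secret"  # the secret configured when the webhook was created


@app.route("/mlflow-webhook", methods=["POST"])
def mlflow_webhook():
    # Hypothetical header name; MLflow documents the exact header it uses for the HMAC signature
    signature = request.headers.get("X-MLflow-Signature", "")
    expected = hmac.new(WEBHOOK_SECRET, request.get_data(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        abort(401)

    event = request.get_json(force=True)
    print("received event:", event)  # e.g. trigger a CI/CD job or send a notification here
    return "", 204


if __name__ == "__main__":
    app.run(port=8080)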





Best Practices and Use Cases for SHAP Integration

When to Use SHAP Integration

SHAP integration provides the most value in these scenarios:


High Interpretability Requirements - Healthcare and medical diagnosis systems, financial services (credit scoring, loan approval), legal and compliance applications, hiring and HR decision systems, and fraud detection and risk assessment.


Complex Model Types - XGBoost, Random Forest, and other ensemble methods, neural networks and deep learning models, custom ensemble approaches, and any model where feature relationships are non-obvious.


Regulatory and Compliance Needs - Models requiring explainability for regulatory approval, systems where decisions must be justified to stakeholders, applications where bias detection is important, and audit trails requiring detailed decision explanations.


Performance Considerations

Dataset Size Guidelines:


Small datasets (< 1,000 samples): Use exact SHAP methods for precision

Medium datasets (1,000 - 50,000 samples): Standard SHAP analysis works well

Large datasets (50,000+ samples): Consider sampling or approximate methods

Very large datasets (100,000+ samples): Use batch processing with sampling

Memory Management:


Process explanations in batches for large datasets

Use approximate SHAP methods when exact precision isn't required

Clear intermediate results to manage memory usage

Consider model-specific optimizations (e.g., TreeExplainer for tree models)
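
As a minimal sketch of the batching advice above for a tree-based model (synthetic data stands in for your real dataset):

import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic stand-in data
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)
model = xgb.XGBClassifier(n_estimators=50).fit(X, y)

explainer = shap.TreeExplainer(model)  # model-specific, fast explainer for tree ensembles

batch_size = 1000
shap_batches = []
for start in range(0, len(X), batch_size):
    # Explain one batch at a time to keep peak memory bounded
    shap_batches.append(explainer.shap_values(X[start:start + batch_size]))

shap_values = np.concatenate(shap_batches)
print(shap_values.shape)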


How to perform SHAP integration with MLFlow ?

SHAP Integration

MLflow's built-in SHAP integration provides automatic model explanations and feature importance analysis during evaluation. SHAP (SHapley Additive exPlanations) values help you understand what drives your model's predictions, making your ML models more interpretable and trustworthy.


Quick Start: Automatic SHAP Explanations

Enable SHAP explanations during model evaluation with a simple configuration:



import mlflow

import xgboost as xgb

import shap

from sklearn.model_selection import train_test_split

from mlflow.models import infer_signature


# Load the UCI Adult Dataset

X, y = shap.datasets.adult()

X_train, X_test, y_train, y_test = train_test_split(

    X, y, test_size=0.33, random_state=42

)


# Train model

model = xgb.XGBClassifier().fit(X_train, y_train)


# Create evaluation dataset

eval_data = X_test.copy()

eval_data["label"] = y_test


with mlflow.start_run():

    # Log model

    signature = infer_signature(X_test, model.predict(X_test))

    model_info = mlflow.sklearn.log_model(model, name="model", signature=signature)


    # Evaluate with SHAP explanations enabled

    result = mlflow.evaluate(

        model_info.model_uri,

        eval_data,

        targets="label",

        model_type="classifier",

        evaluators=["default"],

        evaluator_config={"log_explainer": True},  # Enable SHAP logging

    )


    print("SHAP artifacts generated:")

    for artifact_name in result.artifacts:

        if "shap" in artifact_name.lower():

            print(f"  - {artifact_name}")


This automatically generates:


Feature importance plots showing which features matter most

SHAP summary plots displaying feature impact distributions

SHAP explainer model saved for future use on new data

Individual prediction explanations for sample predictions
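
If the explainer was logged (log_explainer enabled above), it can be reloaded later and applied to new data. This is only a sketch; the artifact name "explainer" is an assumption, so check the run's artifact list in the MLflow UI for the exact path.

import mlflow.shap

run_id = "<run-id-from-the-evaluation-above>"  # placeholder
explainer = mlflow.shap.load_explainer(f"runs:/{run_id}/explainer")  # assumed artifact name

shap_values_new = explainer(X_test[:100])  # X_test from the script above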


How to Perform Model Evaluation with MLFlow

Introduction

Model evaluation is the cornerstone of reliable machine learning, transforming trained models into trustworthy, production-ready systems. MLflow's comprehensive evaluation framework goes beyond simple accuracy metrics, providing deep insights into model behavior, performance characteristics, and real-world readiness through automated testing, visualization, and validation pipelines.


MLflow's evaluation capabilities democratize advanced model assessment, making sophisticated evaluation techniques accessible to teams of all sizes. From rapid prototyping to enterprise deployment, MLflow evaluation ensures your models meet the highest standards of reliability, fairness, and performance.


Why MLflow Evaluation?

MLflow's evaluation framework provides a comprehensive solution for model assessment and validation:


⚡ One-Line Evaluation: Comprehensive model assessment with mlflow.evaluate() - minimal configuration required

🎛️ Flexible Evaluation Modes: Evaluate models, functions, or static datasets with the same unified API

📊 Rich Visualizations: Automatic generation of performance plots, confusion matrices, and diagnostic charts

🔧 Custom Metrics: Define domain-specific evaluation criteria with easy-to-use metric builders

🧠 Built-in Explainability: SHAP integration for model interpretation and feature importance analysis

👥 Team Collaboration: Share evaluation results and model comparisons through MLflow's tracking interface

🏭 Enterprise Integration: Plugin architecture for specialized evaluation frameworks like Giskard and Trubrics



Automated Model Assessment 


import mlflow

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier

from sklearn.datasets import load_wine


# Load and prepare data

wine = load_wine()

X_train, X_test, y_train, y_test = train_test_split(

    wine.data, wine.target, test_size=0.2, random_state=42

)


# Train model

model = RandomForestClassifier(n_estimators=100, random_state=42)

model.fit(X_train, y_train)


# Create evaluation dataset

eval_data = pd.DataFrame(X_test, columns=wine.feature_names)

eval_data["target"] = y_test


with mlflow.start_run():

    # Log model

    model_info = mlflow.sklearn.log_model(model, name="model")


    # Comprehensive evaluation with one line

    result = mlflow.models.evaluate(

        model="models:/my-model/1",

        data=eval_data,

        targets="target",

        model_type="classifier",

        evaluators=["default"],

    )
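
The returned result object bundles the computed metrics and generated artifacts, so they can be inspected programmatically right after the call:

# result comes from mlflow.models.evaluate(...) above
print(result.metrics)  # dict of metric name -> value, e.g. accuracy and F1

for name, artifact in result.artifacts.items():
    print(name, "->", artifact.uri)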

Why XGBoost + MLFlow?

XGBoost (eXtreme Gradient Boosting) is a popular gradient boosting library for structured data. MLflow provides native integration with XGBoost for experiment tracking, model management, and deployment.


This integration supports both the native XGBoost API and scikit-learn compatible interface, making it easy to track experiments and deploy models regardless of which API you prefer.



import mlflow

import xgboost as xgb

from sklearn.datasets import load_diabetes

from sklearn.model_selection import train_test_split


# Enable autologging - captures everything automatically

mlflow.xgboost.autolog()


# Load and prepare data

data = load_diabetes()

X_train, X_test, y_train, y_test = train_test_split(

    data.data, data.target, test_size=0.2, random_state=42

)


# Prepare data in XGBoost format

dtrain = xgb.DMatrix(X_train, label=y_train)

dtest = xgb.DMatrix(X_test, label=y_test)


# Train model - MLflow automatically logs everything!

with mlflow.start_run():

    model = xgb.train(

        params={

            "objective": "reg:squarederror",

            "max_depth": 6,

            "learning_rate": 0.1,

        },

        dtrain=dtrain,

        num_boost_round=100,

        evals=[(dtrain, "train"), (dtest, "test")],

    )




import mlflow

import xgboost as xgb

from sklearn.datasets import load_diabetes

from sklearn.model_selection import train_test_split


# Load data

data = load_diabetes()

X_train, X_test, y_train, y_test = train_test_split(

    data.data, data.target, test_size=0.2, random_state=42

)


# Enable autologging

mlflow.xgboost.autolog()


# Train with native API

with mlflow.start_run():

    dtrain = xgb.DMatrix(X_train, label=y_train)

    model = xgb.train(

        params={"objective": "reg:squarederror", "max_depth": 6},

        dtrain=dtrain,

        num_boost_round=100,

    )



What Gets Logged

When autologging is enabled, MLflow automatically captures:


Parameters: All booster parameters and training configuration

Metrics: Training and validation metrics for each boosting round

Feature Importance: Multiple importance types (weight, gain, cover) with visualizations

Model: The trained model with proper serialization format

Artifacts: Feature importance plots and JSON data
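
To reuse an autologged model later, load it back by run ID. The run ID below is a placeholder, and "model" is the artifact path that autologging is assumed to use; verify both in the MLflow UI.

import mlflow
import xgboost as xgb

run_id = "<run-id-from-the-training-run>"  # placeholder; copy it from the MLflow UI
loaded_model = mlflow.xgboost.load_model(f"runs:/{run_id}/model")

# The native API returns a Booster, so score through a DMatrix (X_test from the scripts above)
preds = loaded_model.predict(xgb.DMatrix(X_test))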


How to Perform Deep learning with MLFlow ?

pip install mlflow torch torchvision

Step 1: Create a new experiment

Create a new MLflow experiment for the tutorial and enable system metrics monitoring. Here we set the monitoring interval to 1 second because the training will be quick, but for longer training runs, you can set it to a larger value.


python


import mlflow


# The set_experiment API creates a new experiment if it doesn't exist.

mlflow.set_experiment("Deep Learning Experiment")


# IMPORTANT: Enable system metrics monitoring

mlflow.config.enable_system_metrics_logging()

mlflow.config.set_system_metrics_sampling_interval(1)



Step 2: Prepare the dataset

In this example, we will use the FashionMNIST dataset, which is a collection of 28x28 grayscale images of 10 different types of clothing.


python


import torch

import torch.nn as nn

import torch.optim as optim

from torch.utils.data import DataLoader

from torchvision import datasets, transforms


# Define device

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# Load and prepare data

transform = transforms.Compose(

    [transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]

)

train_dataset = datasets.FashionMNIST(

    "data", train=True, download=True, transform=transform

)

test_dataset = datasets.FashionMNIST("data", train=False, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

test_loader = DataLoader(test_dataset, batch_size=1000)



Step 3: Define the model and optimizer

Define a simple MLP model with 2 hidden layers.


python


import torch.nn as nn



class NeuralNetwork(nn.Module):

    def __init__(self):

        super().__init__()

        self.flatten = nn.Flatten()

        self.linear_relu_stack = nn.Sequential(

            nn.Linear(28 * 28, 512),

            nn.ReLU(),

            nn.Linear(512, 512),

            nn.ReLU(),

            nn.Linear(512, 10),

        )


    def forward(self, x):

        x = self.flatten(x)

        logits = self.linear_relu_stack(x)

        return logits



model = NeuralNetwork().to(device)



Then, define the training parameters and optimizer.


python


# Training parameters

params = {

    "epochs": 5,

    "learning_rate": 1e-3,

    "batch_size": 64,

    "optimizer": "SGD",

    "model_type": "MLP",

    "hidden_units": [512, 512],

}


# Define optimizer and loss function

loss_fn = nn.CrossEntropyLoss()

optimizer = optim.SGD(model.parameters(), lr=params["learning_rate"])




Step 4: Train the model

Now we are ready to train the model. Inside the training loop, we log the metrics and checkpoints to MLflow. The key points in this code are:


Initiate an MLflow run context to start a new run that we will log the model and metadata to.

Log training parameters using mlflow.log_params.

Log various metrics using mlflow.log_metrics.

Save checkpoints for each epoch using mlflow.pytorch.log_model.

python


with mlflow.start_run() as run:

    # Log training parameters

    mlflow.log_params(params)


    for epoch in range(params["epochs"]):

        model.train()

        train_loss, correct, total = 0, 0, 0


        for batch_idx, (data, target) in enumerate(train_loader):

            data, target = data.to(device), target.to(device)


            # Forward pass

            optimizer.zero_grad()

            output = model(data)

            loss = loss_fn(output, target)


            # Backward pass

            loss.backward()

            optimizer.step()


            # Calculate metrics

            train_loss += loss.item()

            _, predicted = output.max(1)

            total += target.size(0)

            correct += predicted.eq(target).sum().item()


            # Log batch metrics (every 100 batches)

            if batch_idx % 100 == 0:

                batch_loss = train_loss / (batch_idx + 1)

                batch_acc = 100.0 * correct / total

                mlflow.log_metrics(

                    {"batch_loss": batch_loss, "batch_accuracy": batch_acc},

                    step=epoch * len(train_loader) + batch_idx,

                )


        # Calculate epoch metrics

        epoch_loss = train_loss / len(train_loader)

        epoch_acc = 100.0 * correct / total


        # Validation

        model.eval()

        val_loss, val_correct, val_total = 0, 0, 0

        with torch.no_grad():

            for data, target in test_loader:

                data, target = data.to(device), target.to(device)

                output = model(data)

                loss = loss_fn(output, target)


                val_loss += loss.item()

                _, predicted = output.max(1)

                val_total += target.size(0)

                val_correct += predicted.eq(target).sum().item()


        # Calculate and log epoch validation metrics

        val_loss = val_loss / len(test_loader)

        val_acc = 100.0 * val_correct / val_total


        # Log epoch metrics

        mlflow.log_metrics(

            {

                "train_loss": epoch_loss,

                "train_accuracy": epoch_acc,

                "val_loss": val_loss,

                "val_accuracy": val_acc,

            },

            step=epoch,

        )

        # Log checkpoint at the end of each epoch

        mlflow.pytorch.log_model(model, name=f"checkpoint_{epoch}")


        print(

            f"Epoch {epoch+1}/{params['epochs']}, "

            f"Train Loss: {epoch_loss:.4f}, Train Acc: {epoch_acc:.2f}%, "

            f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.2f}%"

        )


    # Log the final trained model

    model_info = mlflow.pytorch.log_model(model, name="final_model")



Now view the results in the MLflow UI 


mlflow ui --port 5000
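
To load the final model back for inference (model_info, test_dataset, and device come from the training script above):

import torch
import mlflow

loaded_model = mlflow.pytorch.load_model(model_info.model_uri)
loaded_model.eval()

with torch.no_grad():
    sample, label = test_dataset[0]
    prediction = loaded_model(sample.unsqueeze(0).to(device)).argmax(dim=1).item()
    print(f"Predicted: {prediction}, actual: {label}")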


What is Optuna?

Optuna is a **hyperparameter optimization framework designed specifically for machine learning**. Here's a comprehensive breakdown:


Optuna is an automatic hyperparameter optimization framework that implements state-of-the-art algorithms to efficiently search for optimal hyperparameters. It was created by Preferred Networks and has become one of the most popular hyperparameter tuning libraries in Python.


Core Features:

Define-by-Run API: The most distinctive feature. You define the search space dynamically within the objective function, allowing for conditional parameter spaces.


Efficient Sampling Algorithms:


Tree-structured Parzen Estimator (TPE) - default


CMA-ES (Covariance Matrix Adaptation Evolution Strategy)


Random Search


Grid Search


Pruning (Early Stopping): Automatically stops unpromising trials to save computational resources.


Parallelization: Distributed optimization across multiple processes or machines.


Visualization: Built-in tools for analyzing optimization results.


Key Concepts:

1. Study

A collection of trials (optimization runs) for a single optimization task.


python

study = optuna.create_study(direction="maximize")

2. Trial

A single execution of the objective function with a specific set of hyperparameters.


3. Objective Function

The function you want to optimize (e.g., validation accuracy, loss minimization).


Basic Example:

python

import optuna

import sklearn.datasets

import sklearn.ensemble

import sklearn.model_selection


def objective(trial):

    # 1. Suggest hyperparameters (Define-by-Run)

    n_estimators = trial.suggest_int("n_estimators", 50, 200)

    max_depth = trial.suggest_int("max_depth", 3, 10)

    learning_rate = trial.suggest_float("learning_rate", 0.01, 0.3, log=True)

    

    # 2. Create and train model

    model = sklearn.ensemble.GradientBoostingClassifier(

        n_estimators=n_estimators,

        max_depth=max_depth,

        learning_rate=learning_rate

    )

    

    # 3. Evaluate

    X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)

    scores = sklearn.model_selection.cross_val_score(model, X, y, cv=5)

    

    return scores.mean()


# 4. Create and run study

study = optuna.create_study(direction="maximize")

study.optimize(objective, n_trials=100)


# 5. Best result

print(f"Best trial: {study.best_trial.params}")

print(f"Best value: {study.best_trial.value}")

Why Optuna is Powerful for ML:

1. Dynamic Search Spaces

python

def objective(trial):

    # Conditional hyperparameters

    model_type = trial.suggest_categorical("model_type", ["rf", "gbm"])

    

    if model_type == "rf":

        n_estimators = trial.suggest_int("n_estimators", 100, 500)

        max_depth = trial.suggest_int("max_depth", 3, 15)

    else:  # gbm

        n_estimators = trial.suggest_int("n_estimators", 50, 200)

        learning_rate = trial.suggest_float("learning_rate", 0.01, 0.3)

    

    # Different models based on suggested type

    # ...

2. Pruning (Early Stopping)

python

import optuna

from optuna.trial import TrialState


def objective_with_pruning(trial):

    # load_data, build_model, and train_for_one_epoch are placeholders
    # for your own data-loading, model-construction, and training code.
    X_train, y_train, X_val, y_val = load_data()

    model = build_model(trial)

    for epoch in range(100):

        model = train_for_one_epoch(model, X_train, y_train)

        

        # Intermediate evaluation

        accuracy = evaluate(model, X_val, y_val)

        

        # Report intermediate value

        trial.report(accuracy, epoch)

        

        # Handle pruning

        if trial.should_prune():

            raise optuna.TrialPruned()  # Stop this trial early

    

    return accuracy


study = optuna.create_study(

    direction="maximize",

    pruner=optuna.pruners.MedianPruner()  # Default pruner

)

Optuna + MLflow Integration

This is where Optuna becomes particularly powerful. When combined, you get:


1. Comprehensive Tracking

python

import optuna

import mlflow


def objective(trial):

    # Suggest hyperparameters

    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)

    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])

    

    # Start MLflow run for this trial

    with mlflow.start_run(nested=True):

        # Log all hyperparameters

        mlflow.log_params(trial.params)

        mlflow.log_param("trial_number", trial.number)

        

        # Train model

        model, accuracy = train_model(lr, batch_size)

        

        # Log metrics

        mlflow.log_metric("accuracy", accuracy)

        mlflow.log_metric("trial_value", accuracy)

        

        # Optionally log the model

        if accuracy > 0.9:  # Only log good models

            mlflow.sklearn.log_model(model, "model")

        

        return accuracy


# Create parent MLflow run for the study

with mlflow.start_run(run_name="optuna_optimization"):

    study = optuna.create_study(direction="maximize")

    study.optimize(objective, n_trials=50)

    

    # Log study results to MLflow

    mlflow.log_params({"n_trials": 50})

    mlflow.log_metric("best_accuracy", study.best_value)


How to perform Hyperparameter tuning in MLFlow?

pip install mlflow optuna

import mlflow


# The set_experiment API creates a new experiment if it doesn't exist.

mlflow.set_experiment("Hyperparameter Tuning Experiment")


from sklearn.model_selection import train_test_split

from sklearn.datasets import fetch_california_housing


X, y = fetch_california_housing(return_X_y=True)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)



import mlflow

import optuna

import sklearn.ensemble

import sklearn.metrics



def objective(trial):

    # Setting nested=True will create a child run under the parent run.

    with mlflow.start_run(nested=True, run_name=f"trial_{trial.number}") as child_run:

        rf_max_depth = trial.suggest_int("rf_max_depth", 2, 32)

        rf_n_estimators = trial.suggest_int("rf_n_estimators", 50, 300, step=10)

        rf_max_features = trial.suggest_float("rf_max_features", 0.2, 1.0)

        params = {

            "max_depth": rf_max_depth,

            "n_estimators": rf_n_estimators,

            "max_features": rf_max_features,

        }

        # Log current trial's parameters

        mlflow.log_params(params)


        regressor_obj = sklearn.ensemble.RandomForestRegressor(**params)

        regressor_obj.fit(X_train, y_train)


        y_pred = regressor_obj.predict(X_val)

        error = sklearn.metrics.mean_squared_error(y_val, y_pred)

        # Log current trial's error metric

        mlflow.log_metrics({"error": error})


        # Log the model file

        mlflow.sklearn.log_model(regressor_obj, name="model")

        # Make it easy to retrieve the best-performing child run later

        trial.set_user_attr("run_id", child_run.info.run_id)

        return error




# Create a parent run that contains all child runs for different trials

with mlflow.start_run(run_name="study") as run:

    # Log the experiment settings

    n_trials = 30

    mlflow.log_param("n_trials", n_trials)


    study = optuna.create_study(direction="minimize")

    study.optimize(objective, n_trials=n_trials)


    # Log the best trial and its run ID

    mlflow.log_params(study.best_trial.params)

    mlflow.log_metrics({"best_error": study.best_value})

    if best_run_id := study.best_trial.user_attrs.get("run_id"):

        mlflow.log_param("best_child_run_id", best_run_id)


Now we can view the results in UI 

mlflow ui --port 5000
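
Because each trial stored its run ID via trial.set_user_attr, the best model can also be loaded back directly (study and X_val come from the script above):

import mlflow

best_run_id = study.best_trial.user_attrs["run_id"]
best_model = mlflow.sklearn.load_model(f"runs:/{best_run_id}/model")

print(best_model.predict(X_val[:5]))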





What are the top level Github actions yaml file constructs ?

Yes, there are several other top-level elements you can use in GitHub Actions workflow files. Here are all the available top-level elements:


## Complete List of Top-Level Elements:


### 1. **`name`** (you have this)

```yaml

name: Workflow Name

```


### 2. **`on`** (you have this)

```yaml

on: [push, pull_request]

```


### 3. **`jobs`** (you have this)

```yaml

jobs:

  my-job:

    runs-on: ubuntu-latest

```


### 4. **`run-name`** (optional)

```yaml

run-name: Deploy to ${{ inputs.deploy_target }} by @${{ github.actor }}

```


### 5. **`env`** - Global environment variables

```yaml

env:

  NODE_ENV: production

  DATABASE_URL: ${{ secrets.DATABASE_URL }}

```


### 6. **`defaults`** - Default settings for all jobs

```yaml

defaults:

  run:

    shell: bash

    working-directory: scripts

```


### 7. **`concurrency`** - Control concurrent workflow runs

```yaml

concurrency:

  group: production-${{ github.ref }}

  cancel-in-progress: true

```


### 8. **`permissions`** - Fine-grained permissions

```yaml

permissions:

  actions: read

  checks: write

  contents: read

  deployments: write

```


### 9. **`on.schedule`** - For scheduled workflows (part of `on`)

```yaml

on:

  schedule:

    - cron: '0 2 * * *'  # Daily at 2 AM

```


### 10. **`on.workflow_dispatch`** - Manual triggers

```yaml

on:

  workflow_dispatch:

    inputs:

      environment:

        description: 'Environment to deploy'

        required: true

        default: 'staging'

```


### 11. **`on.pull_request`** - PR-specific triggers

```yaml

on:

  pull_request:

    types: [opened, synchronize, reopened]

    branches: [main]

```


## Complete Example with All Elements:


```yaml

name: Comprehensive Workflow


on:

  push:

    branches: [main]

  pull_request:

    branches: [main]

  workflow_dispatch:

    inputs:

      logLevel:

        description: 'Log level'

        required: true

        default: 'warning'


env:

  NODE_VERSION: '16'

  PYTHON_VERSION: '3.9'


defaults:

  run:

    shell: bash

    working-directory: ./src


concurrency:

  group: ${{ github.workflow }}-${{ github.ref }}

  cancel-in-progress: true


permissions:

  contents: read

  packages: write


jobs:

  test:

    runs-on: ubuntu-latest

    steps:

      - uses: actions/checkout@v3

      - name: Run tests

        run: pytest

  

  build:

    runs-on: ubuntu-latest

    needs: test

    steps:

      - uses: actions/checkout@v3

      - name: Build

        run: npm run build

```


So in summary, while your workflow uses the three essential elements (`name`, `on`, `jobs`), you have several other powerful options available depending on your needs!


What are some of the additional workflow commands that can be added to GitHub action

Additional Components You Can Add:

1. More Trigger Events

yaml

on:

  push:

    branches: [main, develop]

  pull_request:

    branches: [main]

  schedule:

    - cron: '0 2 * * *'  # Daily at 2 AM

  workflow_dispatch:  # Manual trigger

2. Environment and Strategy

yaml

jobs:

  test:

    runs-on: ${{ matrix.os }}

    strategy:

      matrix:

        os: [ubuntu-latest, windows-latest]

        python-version: [3.8, 3.9, '3.10']

3. Services (like databases)

yaml

services:

  postgres:

    image: postgres:13

    env:

      POSTGRES_PASSWORD: postgres

4. Conditional Execution

yaml

steps:

  - name: Deploy

    if: github.ref == 'refs/heads/main'

    run: echo "Deploying..."

5. Artifacts and Caching

yaml

steps:

  - uses: actions/cache@v3

    with:

      path: ~/.cache/pip

      key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}

  

  - uses: actions/upload-artifact@v3

    with:

      name: model-files

      path: models/

6. Matrix Builds

yaml

strategy:

  matrix:

    node-version: [14.x, 16.x, 18.x]

    include:

      - node-version: 18.x

        flags: --experimental-feature

7. Job Outputs and Needs

yaml

jobs:

  job1:

    outputs:

      version: ${{ steps.get_version.outputs.version }}

    steps:

      - id: get_version

        run: echo "version=1.0.0" >> "$GITHUB_OUTPUT"

  

  job2:

    needs: job1

    steps:

      - run: echo "Version is ${{ needs.job1.outputs.version }}"

8. Different Runner Types

yaml

# Pick one hosted runner label per job, e.g.:

runs-on: ubuntu-latest   # or windows-latest, macos-latest

# A list of labels targets self-hosted runners that carry ALL of the labels:

runs-on: [self-hosted, linux, x64]

9. Timeout and Concurrency

yaml

timeout-minutes: 30

concurrency:

  group: ${{ github.ref }}

  cancel-in-progress: true

10. More Step Options

yaml

steps:

  - name: Setup Python

    uses: actions/setup-python@v4

    with:

      python-version: '3.9'

  

  - name: Multi-line script

    run: |

      echo "First command"

      echo "Second command"

  

  - name: Continue on error

    continue-on-error: true

    run: risky-command.sh

  

  - name: Working directory

    working-directory: ./src

    run: pwd  # Runs in ./src directory

Your workflow is well-structured with proper job dependencies and environment variable management. The main improvement I'd suggest is fixing the typo in model-traning → model-training for consistency.

Workflow Example in Databricks with MLflow

# In Databricks notebook - MLflow is pre-configured

from sklearn.ensemble import RandomForestRegressor

import mlflow

import mlflow.sklearn


# Auto-logging (Databricks enhancement)

mlflow.autolog()


# Train model - automatically tracked

model = RandomForestRegressor()

model.fit(X_train, y_train)


# Log additional metrics

mlflow.log_metric("custom_metric", value)


# Register model in MLflow Model Registry

mlflow.sklearn.log_model(

    model, 

    "revenue_model",

    registered_model_name="PlayStore_Revenue_Predictor"

)



Key Benefits of Using MLflow in Databricks

Zero Setup: MLflow is pre-installed and configured

Unified Interface: Experiments, models, and data in one platform

Scalability: Leverages Databricks' distributed computing

Collaboration: Shared experiments across teams

Production Ready: Easy model deployment and serving


Databricks is the commercial platform that provides the infrastructure and environment, while MLflow is the open-source framework (created by Databricks) for managing machine learning experiments and models. Using them together creates a powerful, integrated solution for enterprise ML workflows.

What is Databricks?

Databricks is a unified data analytics platform built by the creators of Apache Spark. It provides a collaborative cloud-based environment for:

Key Capabilities:

Data Engineering: ETL, data processing, and pipeline management

Data Science & ML: End-to-end machine learning lifecycle

Data Analytics: SQL analytics, business intelligence, and reporting

Data Warehousing: Delta Lake for reliable data lakes

Collaboration: Shared workspaces, notebooks, and dashboards

Core Components:

Databricks Workspace: Collaborative environment with notebooks, dashboards

Databricks Runtime: Optimized Apache Spark environment

Delta Lake: ACID transactions for data lakes

MLflow Integration: Native machine learning lifecycle management

Unity Catalog: Unified governance for data and AI


How Databricks Relates to MLflow

1. MLflow was Created by Databricks

MLflow was originally developed at Databricks as an open-source project


It's now a popular standalone open-source platform for managing the ML lifecycle


2. Native Integration

Databricks provides deep, native integration with MLflow:


# MLflow is automatically available in Databricks notebooks

import mlflow


# Automatic tracking in Databricks

with mlflow.start_run():

    mlflow.log_param("learning_rate", 0.01)

    mlflow.log_metric("accuracy", 0.95)

    mlflow.sklearn.log_model(model, "model")


3. MLflow Tracking Server Built-in

Automatic experiment tracking in Databricks workspace


Centralized model registry for model versioning and staging


UI integration - MLflow experiments visible directly in Databricks UI


4. Enhanced Features in Databricks

Automated MLflow logging for popular libraries (scikit-learn, TensorFlow, etc.)


Managed MLflow - No setup required, fully managed service


Unity Catalog integration - Model lineage and governance


Feature Store integration - Managed feature platform


5. End-to-End ML Platform

Databricks + MLflow provides:


Data Preparation → Model Training → Experiment Tracking → 

Model Registry → Deployment → Monitoring

How to access a Databricks workspace from MLFlow?

pip install --upgrade "mlflow[databricks]>=3.1"


Step 2: Create an MLflow Experiment

Open your Databricks workspace

Go to Experiments in the left sidebar under Machine Learning

At the top of the Experiments page, click on New Experiment


Step 3: Configure Authentication

Choose one of the following authentication methods:


Option A: Environment Variables


In your MLflow Experiment, click Generate API Key

Copy and run the generated code in your terminal:

bash


export DATABRICKS_TOKEN=<databricks-personal-access-token>

export DATABRICKS_HOST=https://<workspace-name>.cloud.databricks.com

export MLFLOW_TRACKING_URI=databricks

export MLFLOW_EXPERIMENT_ID=<experiment-id>



Option B: .env File


In your MLflow Experiment, click Generate API Key

Copy the generated code to a .env file in your project root:

bash


DATABRICKS_TOKEN=<databricks-personal-access-token>

DATABRICKS_HOST=https://<workspace-name>.cloud.databricks.com

MLFLOW_TRACKING_URI=databricks

MLFLOW_EXPERIMENT_ID=<experiment-id>


Install the python-dotenv package:

bash


pip install python-dotenv

Load environment variables in your code:

python


# At the beginning of your Python script

from dotenv import load_dotenv


# Load environment variables from .env file

load_dotenv()



Step 4: Verify Your Connection

Create a test file and run this code to verify your connection:


python


import mlflow


# Test logging to verify connection

print(f"MLflow Tracking URI: {mlflow.get_tracking_uri()}")

with mlflow.start_run():

    print("✓ Successfully connected to MLflow!")


What will be a quick start for MLFlow ?

Step 1: Install MLflow

bash


pip install --upgrade "mlflow>=3.1"


Step 2: Configure Tracking

MLflow supports different backends for tracking your experiment data. Choose one of the following options to get started. Refer to the Self Hosting Guide for detailed setup and configurations.


Option A: Database (Recommended)


Set the tracking URI to a local database URI (e.g., sqlite:///mlflow.db). This is the recommended option for quickstarts and local development.


python


import mlflow


mlflow.set_tracking_uri("sqlite:///mlflow.db")

mlflow.set_experiment("my-first-experiment")

Option B: File System


MLflow will automatically use local file storage if no tracking URI is specified:


python


import mlflow


# Creates local mlruns directory for experiments

mlflow.set_experiment("my-first-experiment")



Option C: Remote Tracking Server


Start a remote MLflow tracking server following the Self Hosting Guide. Then configure your client to use the remote server:


python


import mlflow


# Connect to remote MLflow server

mlflow.set_tracking_uri("http://localhost:5000")

mlflow.set_experiment("my-first-experiment")

Alternatively, you can configure the tracking URI and experiment using environment variables:


bash


export MLFLOW_TRACKING_URI="http://localhost:5000"

export MLFLOW_EXPERIMENT_NAME="my-first-experiment"


Step 3: Verify Your Connection

Create a test file and run this code:


python


import mlflow


# Print connection information

print(f"MLflow Tracking URI: {mlflow.get_tracking_uri()}")

print(f"Active Experiment: {mlflow.get_experiment_by_name('my-first-experiment')}")


# Test logging

with mlflow.start_run():

    mlflow.log_param("test_param", "test_value")

    print("✓ Successfully connected to MLflow!")


Step 4: Access MLflow UI

If you are using local tracking (option A or B), run the following command and access the MLflow UI at http://localhost:5000.


bash


# For Option A

mlflow ui --backend-store-uri sqlite:///mlflow.db --port 5000

# For Option B

mlflow ui --port 5000


Wednesday, November 26, 2025

Main features of MLFlow

Track experiments and manage your ML development 

MLflow Tracking provides comprehensive experiment logging, parameter tracking, metrics visualization, and artifact management.

Key Benefits:


Experiment Organization: Track and compare multiple model experiments

Metric Visualization: Built-in plots and charts for model performance

Artifact Storage: Store models, plots, and other files with each run

Collaboration: Share experiments and results across teams


Manage model versions and lifecycle


Core Features

MLflow Model Registry provides centralized model versioning, stage management, and model lineage tracking.


Key Benefits:

Version Control: Track model versions with automatic lineage

Stage Management: Promote models through staging, production, and archived stages

Collaboration: Team-based model review and approval workflows

Model Discovery: Search and discover models across your organization
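
As a minimal sketch of these registry operations (the run ID, model name, and alias are placeholders):

import mlflow
from mlflow import MlflowClient

# Register a logged model as a new version of a named registry entry
model_version = mlflow.register_model("runs:/<run-id>/model", "churn-classifier")

# Point an alias at that version and tag it so it can be discovered and reviewed
client = MlflowClient()
client.set_registered_model_alias("churn-classifier", "champion", model_version.version)
client.set_model_version_tag("churn-classifier", model_version.version, "validated", "true")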


Deploy models to production environments

Core Features

MLflow Deployment supports multiple deployment targets including REST APIs, cloud platforms, and edge devices.


Key Benefits:


Multiple Targets: Deploy to local servers, cloud platforms, or containerized environments

Model Serving: Built-in REST API serving with automatic input validation

Batch Inference: Support for batch scoring and offline predictions

Production Ready: Scalable deployment options for enterprise use


Explore Native MLflow ML Library Integrations

Integrates with


Scikit-learn

XGBoost

TensorFlow

PyTorch

Keras

Spark MLlib


Evaluate and validate your ML models

Core Features

MLflow Evaluation provides comprehensive model validation tools, automated metrics calculation, and model comparison capabilities.


Key Benefits:


Automated Metrics: Built-in evaluation metrics for classification, regression, and more

Custom Evaluators: Create custom evaluation functions for domain-specific metrics

Model Comparison: Compare multiple models and versions side-by-side

Validation Datasets: Track evaluation datasets and ensure reproducible results


Monday, November 24, 2025

Human assisted PDF parsing

Your approach of creating a human-assisted parser for complex PDFs with multi-page tables is actually quite sensible and commonly used in practice. Here's my analysis:


## Is this a good approach?


**Yes, for several reasons:**

- Complex PDF layouts with spanning tables are notoriously difficult for automated parsers

- Human-in-the-loop systems can handle edge cases that pure automation misses

- You get progressively better results as the system learns from human corrections

- More practical than waiting for perfect AI solutions


## Standard approaches already available:


### 1. **Commercial Solutions**

- **Amazon A2I** (Augmented AI): Human-in-the-loop service

- **Scale AI**, **Labelbox**: Data annotation platforms

- **Adobe Extract API**: Specialized for complex PDFs


### 2. **Open Source Frameworks**

```python

# Example architecture

def human_assisted_parser(pdf_file):

    automated_extraction = extract_with_pypdf2(pdf_file)

    confidence_scores = calculate_confidence(automated_extraction)

    

    low_confidence_items = filter_low_confidence(confidence_scores)

    human_corrections = present_for_review(low_confidence_items)

    

    return merge_corrections(automated_extraction, human_corrections)

```


### 3. **Common Patterns**


**Progressive Automation:**

- Start with full human review

- Identify patterns for automation

- Gradually reduce human intervention


**Confidence-based Escalation:**

- Auto-process high-confidence extractions

- Flag low-confidence regions for human review
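
A minimal sketch of that escalation rule using camelot-py, which reports a per-table parsing accuracy (the input file and threshold are assumptions):

```python
import camelot

ACCURACY_THRESHOLD = 85  # percent; tune for your documents

tables = camelot.read_pdf("report.pdf", pages="all")  # hypothetical input file

auto_accepted, needs_review = [], []
for table in tables:
    if table.parsing_report["accuracy"] >= ACCURACY_THRESHOLD:
        auto_accepted.append(table.df)  # pandas DataFrame of the extracted table
    else:
        needs_review.append(table)      # queue for the human correction interface

print(f"auto-accepted: {len(auto_accepted)}, flagged for review: {len(needs_review)}")
```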


## Recommended Implementation Strategy:


### Phase 1: Manual Correction Interface

```python

# Basic workflow

1. Automated extraction attempt

2. Visual diff between original and parsed data

3. Simple interface for corrections

4. Store corrections as training data

```


### Phase 2: Learning System

- Use human corrections to improve parsing rules

- Train ML models on corrected data

- Implement active learning to prioritize uncertain cases


### Phase 3: Hybrid System

- 80% automated with 20% human verification

- Continuous improvement loop


## Tools to Consider:


**For PDF Parsing:**

- `camelot-py` (specialized for tables)

- `pdfplumber` (layout analysis)

- `tabula-py` (table extraction)


**For Human Interface:**

- Streamlit for quick prototypes

- Django/React for production systems

- Jupyter notebooks with ipywidgets for data scientists


## Best Practices:


1. **Store original + corrections** for future training

2. **Implement version control** for parsing rules

3. **Use confidence scoring** to prioritize human effort

4. **Create reusable correction templates** for similar documents

5. **Monitor accuracy metrics** to measure improvement


## When to Consider Alternatives:


- If you have thousands of similar documents → invest in better automation

- If tables follow consistent patterns → template-based approaches

- If budget allows → commercial enterprise solutions


Your approach is particularly good if:

- Documents are high-value

- Volume is moderate (dozens to hundreds)

- Accuracy requirements are high

- You can reuse learnings across similar documents


Would you like me to elaborate on any specific aspect of the implementation?

Saturday, November 22, 2025

What are various types of indices in GraphRAG

 Indexes are the unsung heroes of GraphRAG. Without them, every retrieval degenerates into a costly full-graph traversal or brute-force similarity scan.


There are typically three classes of indexes that power an efficient GraphRAG system:


1) Text Index for text content, including B-tree indexes for exact matches and full-text indexes for complex, content-based searches, especially over large datasets.


2) Vector Index for embeddings, i.e. vectors encoded from raw text or images. It maps text or image features to nodes for semantic similarity search. Example implementations are pgvector, Qdrant, and Milvus.


3) Structural Index, which allows the graph engine to quickly locate nodes, edges, and their relationships without scanning the entire graph. Each type of graph database has its own implementation over graph patterns.


A practical architecture usually integrates more than one of these indexes (for unstructured context retrieval) with a graph database (for structure-aware traversal).


The challenge for data engineers is keeping them synchronized — when a node or document is updated, both embeddings and graph structure must be refreshed.


In one of my earlier posts shared below, I demonstrated the process of combining vector matching with graph traversals in Neo4j.
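To make that concrete, here is a hedged sketch of the combination using the Neo4j Python driver and a Neo4j 5.x vector index; the index name `chunk_embeddings`, the `MENTIONS` relationship, the connection details, and the query embedding are all assumptions for illustration:

```python
# Sketch: vector similarity search followed by a structure-aware hop (Neo4j 5.x vector index).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))  # placeholder credentials

CYPHER = """
CALL db.index.vector.queryNodes('chunk_embeddings', 5, $embedding)  // vector index lookup
YIELD node AS chunk, score
MATCH (chunk)-[:MENTIONS]->(entity)                                  // graph traversal from the matched chunks
RETURN chunk.text AS text, entity.name AS entity, score
ORDER BY score DESC
"""

query_embedding = [0.12, -0.03, 0.44]  # in practice, produced by the same embedding model used at index time

with driver.session() as session:
    for record in session.run(CYPHER, embedding=query_embedding):
        print(record["entity"], record["score"])

driver.close()
```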


What is the difference between Property Graphs, RDF Graphs, Hypergraphs, and Temporal Event Graphs in a Knowledge Graph?

Excellent question — you’ve touched on an advanced and very important distinction in **Knowledge Graph (KG)** modeling!

Each of these graph types — **Property Graphs**, **RDF Graphs**, **Hypergraphs**, and **Temporal/Event Graphs** — represents *knowledge relationships*, but they differ in **structure**, **semantics**, and **use case**.


Let’s break them down clearly 👇


---


## 🧩 1. **Property Graphs**


**Used in:** Neo4j, JanusGraph, TigerGraph, ArangoDB, etc.


### 🔹 Structure:


* **Nodes (Vertices)**: represent entities (e.g., Person, Product, City)

* **Edges (Relationships)**: represent relationships between nodes (e.g., *lives_in*, *bought*, *friend_of*)

* **Both nodes and edges can have properties** (key–value pairs)


```plaintext

(Alice) -[BOUGHT {date: '2024-10-12', price: 299}]-> (Laptop)

```


### 🔹 Characteristics:


* Schema-flexible

* Easy for traversal queries (e.g., friends-of-friends)

* Intuitive for graph algorithms (e.g., PageRank, centrality)

* Supports **attributes on relationships**


### 🔹 Example use:


* Social networks, recommendation systems, fraud detection.


---


## 🧩 2. **RDF Graphs (Resource Description Framework)**


**Used in:** Semantic Web, Knowledge Representation, Ontologies

**Technologies:** RDF, OWL, SPARQL, triple stores (e.g., GraphDB, Blazegraph, Apache Jena)


### 🔹 Structure:


* Consists of **triples**: `(subject, predicate, object)`

* All data is represented as **URIs (global identifiers)**.

* Properties cannot directly hold attributes (no “property on relationship” like in Property Graph).


```turtle

:Alice  :bought  :Laptop .

:Alice  :hasAge  "29"^^xsd:int .

```


To represent a relationship’s property (like date), you need **reification**:


```turtle

:txn1  rdf:type :Purchase ;

       :buyer :Alice ;

       :item  :Laptop ;

       :date  "2024-10-12" .

```


### 🔹 Characteristics:


* Strict semantic model with ontology (RDFS/OWL)

* Best for **interoperability, reasoning, and linked data**

* Can be queried using **SPARQL**


### 🔹 Example use:


* Knowledge Graphs like DBpedia, Wikidata, and Google KG

* Semantic web applications, reasoning engines.


---


## 🧩 3. **Hypergraphs**


**Used in:** Complex relational modeling, systems biology, higher-order network analysis.


### 🔹 Structure:


* In a normal graph, an edge connects **two** nodes.

* In a **hypergraph**, an edge (called a *hyperedge*) can connect **multiple** nodes simultaneously.


```plaintext

Hyperedge H1 = {Alice, Bob, Carol}  // e.g., all members of a project

```
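If you want to experiment with hyperedges programmatically, a tiny sketch using HyperNetX (listed in the comparison table below) might look like this; the edge and node names are just the example above:

```python
# The project-membership hyperedge from above, built with HyperNetX (pip install hypernetx).
import hypernetx as hnx

H = hnx.Hypergraph({"H1": ["Alice", "Bob", "Carol"]})  # one hyperedge linking three nodes
print(sorted(H.nodes), list(H.edges))
```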


### 🔹 Characteristics:


* Models *multi-party relationships* (more than two entities)

* Useful for representing **collaborations**, **transactions**, **group membership**


### 🔹 Example use:


* Modeling research collaborations (one paper connects multiple authors)

* Multi-agent systems or group communications.


---


## 🧩 4. **Temporal / Event Graphs**


**Used in:** Time-based systems, event analysis, dynamic networks.


### 🔹 Structure:


* Extends a Property Graph or RDF Graph with **time or event dimensions**

* Nodes and edges can have **timestamps, intervals, or versions**

* Sometimes represented as a sequence of “snapshots” over time or as **event nodes**.


```plaintext

(Alice) -[BOUGHT {timestamp: '2024-10-12T14:30'}]-> (Laptop)

```


Or as an **Event node**:


```plaintext

(Alice) -> (PurchaseEvent) -> (Laptop)

PurchaseEvent = {date: '2024-10-12', price: 299}

```


### 🔹 Characteristics:


* Tracks evolution of entities/relations over time

* Enables temporal queries: *“Who bought what before 2024?”*

* Supports **versioned knowledge graphs** or **event-driven reasoning**


### 🔹 Example use:


* Financial transactions

* IoT systems (sensor events over time)

* Causal or temporal knowledge graphs for reasoning.


---


## 🧠 Summary Comparison


| Feature                         | Property Graph               | RDF Graph                            | Hypergraph                     | Temporal/Event Graph        |

| ------------------------------- | ---------------------------- | ------------------------------------ | ------------------------------ | --------------------------- |

| **Basic Unit**                  | Node + Edge + Properties     | Triple (subject-predicate-object)    | Hyperedge (connects >2 nodes)  | Node/Edge + Time/Events     |

| **Relationship Properties**     | ✅ Yes                        | ⚠️ Indirect (via reification)        | ✅ Yes (multi-node)             | ✅ Yes (with timestamp)      |

| **Supports Ontology/Semantics** | ⚠️ Limited                   | ✅ Strong (RDFS/OWL)                  | ❌ Usually not                  | ⚠️ Optional                 |

| **Best For**                    | Traversal & graph algorithms | Semantic reasoning, interoperability | Multi-party relationships      | Temporal/causal reasoning   |

| **Examples**                    | Neo4j, JanusGraph            | GraphDB, Blazegraph, Jena            | HyperNetX, Tensor-based graphs | Temporal Neo4j, ChronoGraph |

| **Typical Query Language**      | Cypher, Gremlin              | SPARQL                               | Custom libraries               | Cypher + temporal filters   |


---


### 🧩 Visualization Intuition:


| Type               | Simple Visual                                    |

| ------------------ | ------------------------------------------------ |

| **Property Graph** | Alice —[BOUGHT(price=299)]→ Laptop               |

| **RDF Graph**      | (Alice, bought, Laptop)                          |

| **Hypergraph**     | {Alice, Bob, Carol} all linked via one hyperedge |

| **Temporal Graph** | Alice —[BOUGHT@2024-10-12]→ Laptop               |


---


### 🔮 Advanced Note


In modern KG architectures, you often **combine** these models:


* A **Property Graph** as the base structure

* With **temporal extensions**

* And **semantic RDF mappings** for reasoning

  → This hybrid design powers systems like **Google’s Knowledge Graph** and **Enterprise Knowledge Platforms**.


---


Would you like me to show how a *single example (say: “Alice bought a laptop on Oct 12”)* is represented **in all four graph models side-by-side** (in RDF, property graph, hypergraph, and temporal graph notation)? It gives a crystal-clear comparative view.


Sunday, November 16, 2025

What are Hooks?

 Hooks are special functions that allow functional components to use state, lifecycle methods, context, and other React features that were previously only available in class components.


Basic Rules of Hooks

Only Call Hooks at the Top Level


Don't call Hooks inside loops, conditions, or nested functions


Only Call Hooks from React Functions


Call them from React functional components or custom Hooks


Most Commonly Used Hooks

1. useState - State Management



import React, { useState } from 'react';


function Counter() {

  const [count, setCount] = useState(0); // Initial state


  return (

    <div>

      <p>You clicked {count} times</p>

      <button onClick={() => setCount(count + 1)}>

        Click me

      </button>

    </div>

  );

}



2. useEffect - Side Effects

import React, { useState, useEffect } from 'react';


function UserProfile({ userId }) {

  const [user, setUser] = useState(null);


  // Similar to componentDidMount and componentDidUpdate

  useEffect(() => {

    // Fetch user data

    fetch(`/api/users/${userId}`)

      .then(response => response.json())

      .then(userData => setUser(userData));

  }, [userId]); // Only re-run if userId changes


  return <div>{user ? user.name : 'Loading...'}</div>;

}



How Hooks Work Internally

Hook Storage Mechanism

React maintains a linked list of Hooks for each component. When you call a Hook:


React adds the Hook to the list

On subsequent renders, React goes through the list in the same order

This is why Hooks must be called in the same order every render
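To see why the order matters, here is a toy, framework-agnostic model of that mechanism written in Python (an illustration of the idea only, not React's actual implementation):

```python
# Toy model: hook state is stored by call order and re-read in the same order on every render.
hooks = []   # one slot per use_state call, kept across "renders"
cursor = 0   # reset before each render

def use_state(initial):
    global cursor
    if cursor == len(hooks):      # first render: allocate a slot for this hook
        hooks.append(initial)
    idx = cursor
    cursor += 1
    def set_state(value):
        hooks[idx] = value        # update the slot; the next render reads the new value
    return hooks[idx], set_state

def render(component):
    global cursor
    cursor = 0                    # walk the hook list from the top, in the same order
    component()

def counter():
    count, set_count = use_state(0)
    print("count =", count)
    set_count(count + 1)

render(counter)   # count = 0
render(counter)   # count = 1  (state persisted because the slot index matched)
```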



Key Differences Between Hooks and Regular Functions

1. State Persistence Across Renders

Regular Function (state resets every call):


function regularCounter() {

  let count = 0; // Reset to 0 every time

  const increment = () => {

    count++;

    console.log(count);

  };

  return increment;

}


const counter1 = regularCounter();

counter1(); // Output: 1

counter1(); // Output: 1 (always starts from 0)



Hook (state persists between renders):


import { useState } from 'react';


function useCounter() {

  const [count, setCount] = useState(0); // Persists across re-renders

  

  const increment = () => {

    setCount(prev => prev + 1);

  };

  

  return [count, increment];

}


function Component() {

  const [count, increment] = useCounter();

  

  return (

    <button onClick={increment}>Count: {count}</button>

    // Clicking multiple times: 1, 2, 3, 4...

  );

}


2. Lifecycle Management and Cleanup

Hook (proper lifecycle management):


import { useEffect, useState } from 'react';


function useTimer() {

  const [seconds, setSeconds] = useState(0);

  

  useEffect(() => {

    const interval = setInterval(() => {

      setSeconds(prev => prev + 1);

    }, 1000);

    

    // Cleanup function - runs on unmount

    return () => clearInterval(interval);

  }, []); // Empty dependency array = runs once

  

  return seconds;

}


function Component() {

  const seconds = useTimer();

  return <div>Timer: {seconds}s</div>;

  // Automatically cleans up when component unmounts

}





Thursday, November 13, 2025

Guardrail AI: Comprehensive Guide for Python Applications

Guardrail AI is an open-source framework specifically designed for implementing safety guardrails in AI applications. It helps ensure AI systems operate within defined boundaries and follow specific guidelines.


What is Guardrail AI?

Guardrail AI provides:


Validation of AI outputs against custom rules


Quality checks for generated content


Bias detection and mitigation


Structured output enforcement


PII detection and redaction


Custom rule creation


Installation

bash

pip install guardrail-ai

# Or with specific components

pip install guardrail-ai[all]

pip install guardrail-ai[pii]

pip install guardrail-ai[quality]

1. Basic Usage Examples

Simple Content Validation

python

from guardrail import Guardrail

from guardrail.validators import ProfanityFilter, ToxicityFilter, PIIFilter


# Initialize guardrail with validators

guardrail = Guardrail(

    validators=[

        ProfanityFilter(),

        ToxicityFilter(threshold=0.8),

        PIIFilter(entities=["EMAIL", "PHONE_NUMBER", "SSN"])

    ]

)


# Validate text

text = "This is a sample text with an email user@example.com"

result = guardrail.validate(text)


print(f"Valid: {result.is_valid}")

print(f"Violations: {result.violations}")

print(f"Sanitized text: {result.sanitized_text}")


NVIDIA NeMo and Guardrails for AI Applications

NVIDIA NeMo is a framework for building, training, and fine-tuning generative AI models, while "guardrails" refer to safety mechanisms that ensure AI systems behave responsibly and within defined boundaries.


## What is NVIDIA NeMo?


NVIDIA NeMo is a cloud-native framework that provides:

- Pre-trained foundation models (speech, vision, language)

- Tools for model training and customization

- Deployment capabilities for production environments

- Support for multi-modal AI applications


## Implementing Guardrails with NeMo


Here's how to implement basic guardrails using NVIDIA NeMo in Python:


### 1. Installation


```bash

pip install nemo_toolkit[all]

```


### 2. Basic Content Moderation Guardrail


```python

import nemo.collections.nlp as nemo_nlp

from nemo.collections.common.prompts import PromptFormatter


class ContentGuardrail:

    def __init__(self):

        # Load a pre-trained model for content classification

        self.classifier = nemo_nlp.models.TextClassificationModel.from_pretrained(

            model_name="text_classification_model"

        )

        

        # Define prohibited topics

        self.prohibited_topics = [

            "violence", "hate speech", "self-harm", 

            "illegal activities", "personal information"

        ]

    

    def check_content(self, text):

        """Check if content violates safety guidelines"""

        # Basic keyword filtering

        for topic in self.prohibited_topics:

            if topic in text.lower():

                return False, f"Content contains prohibited topic: {topic}"

        

        # ML-based classification (simplified example)

        # In practice, you'd use a fine-tuned safety classifier

        prediction = self.classifier.classifytext([text])

        

        if prediction and self.is_unsafe(prediction[0]):

            return False, "Content classified as unsafe"

        

        return True, "Content is safe"


    def is_unsafe(self, prediction):

        # Implement your safety threshold logic

        return prediction.get('confidence', 0) > 0.8 and prediction.get('label') == 'unsafe'

```


### 3. Response Filtering Guardrail


```python

import re

from typing import List, Tuple


class ResponseGuardrail:

    def __init__(self):

        self.max_length = 1000

        self.blocked_patterns = [

            r"\b\d{3}-\d{2}-\d{4}\b",  # SSN-like patterns

            r"\b\d{16}\b",  # Credit card-like numbers

            r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b"  # Email patterns

        ]

    

    def validate_response(self, response: str) -> Tuple[bool, str]:

        """Validate AI response against safety rules"""

        

        # Check length

        if len(response) > self.max_length:

            return False, f"Response too long: {len(response)} characters"

        

        # Check for PII (Personally Identifiable Information)

        for pattern in self.blocked_patterns:

            if re.search(pattern, response):

                return False, "Response contains sensitive information"

        

        # Check for inappropriate content

        if self.contains_inappropriate_content(response):

            return False, "Response contains inappropriate content"

        

        return True, "Response passed guardrails"

    

    def contains_inappropriate_content(self, text: str) -> bool:

        inappropriate_terms = [

            # Add your list of inappropriate terms

            "hate", "violence", "discrimination"

        ]

        return any(term in text.lower() for term in inappropriate_terms)

```


### 4. Complete Guardrail System


```python

class NeMoGuardrailSystem:

    def __init__(self):

        self.content_guardrail = ContentGuardrail()

        self.response_guardrail = ResponseGuardrail()

        self.conversation_history = []

    

    def process_user_input(self, user_input: str) -> dict:

        """Process user input through all guardrails"""

        

        # Check user input

        is_safe, message = self.content_guardrail.check_content(user_input)

        if not is_safe:

            return {

                "success": False,

                "response": "I cannot process this request due to safety concerns.",

                "reason": message

            }

        

        # Store in conversation history

        self.conversation_history.append({"role": "user", "content": user_input})

        

        return {"success": True, "message": "Input passed guardrails"}

    

    def validate_ai_response(self, ai_response: str) -> dict:

        """Validate AI response before sending to user"""

        

        is_valid, message = self.response_guardrail.validate_response(ai_response)

        if not is_valid:

            return {

                "success": False,

                "response": "I apologize, but I cannot provide this response.",

                "reason": message

            }

        

        # Store valid response

        self.conversation_history.append({"role": "assistant", "content": ai_response})

        

        return {"success": True, "response": ai_response}

    

    def get_safe_response(self, user_input: str, ai_model) -> str:

        """Complete pipeline for safe AI interaction"""

        

        # Step 1: Validate user input

        input_check = self.process_user_input(user_input)

        if not input_check["success"]:

            return input_check["response"]

        

        # Step 2: Generate AI response (placeholder)

        # In practice, you'd use NeMo models here

        raw_response = ai_model.generate_response(user_input)

        

        # Step 3: Validate AI response

        response_check = self.validate_ai_response(raw_response)

        

        return response_check["response"]


# Usage example

def main():

    guardrail_system = NeMoGuardrailSystem()

    

    # Mock AI model

    class MockAIModel:

        def generate_response(self, text):

            return "This is a sample AI response."

    

    ai_model = MockAIModel()

    

    # Test the guardrail system

    user_input = "Tell me about machine learning"

    response = guardrail_system.get_safe_response(user_input, ai_model)

    print(f"AI Response: {response}")


if __name__ == "__main__":

    main()

```


### 5. Advanced Safety with NeMo Models


```python

import torch

from nemo.collections.nlp.models import PunctuationCapitalizationModel


class AdvancedSafetyGuardrail:

    def __init__(self):

        # Load NeMo models for various safety checks

        self.punctuation_model = PunctuationCapitalizationModel.from_pretrained(

            model_name="punctuation_en_bert"

        )

        

    def enhance_safety(self, text: str) -> str:

        """Apply multiple safety enhancements"""

        

        # Add proper punctuation (helps with clarity)

        punctuated_text = self.punctuation_model.add_punctuation_capitalization([text])[0]

        

        # Remove excessive capitalization

        safe_text = self.normalize_capitalization(punctuated_text)

        

        return safe_text

    

    def normalize_capitalization(self, text: str) -> str:

        """Normalize text capitalization for safety"""

        sentences = text.split('. ')

        normalized_sentences = []

        

        for sentence in sentences:

            if sentence:

                # Capitalize first letter, lowercase the rest

                normalized = sentence[0].upper() + sentence[1:].lower()

                normalized_sentences.append(normalized)

        

        return '. '.join(normalized_sentences)

```


## Key Guardrail Strategies


1. **Input Validation**: Check user inputs before processing

2. **Output Filtering**: Validate AI responses before delivery

3. **Content Moderation**: Detect inappropriate content

4. **PII Detection**: Prevent leakage of sensitive information

5. **Length Control**: Manage response sizes

6. **Tone Management**: Ensure appropriate communication style


## Best Practices


- **Layer multiple guardrails** for defense in depth

- **Regularly update** your safety models and rules

- **Monitor and log** all guardrail triggers

- **Provide clear feedback** when content is blocked

- **Test extensively** with diverse inputs


This approach provides a foundation for implementing safety guardrails with NVIDIA NeMo, though in production you'd want to use more sophisticated models and add additional safety layers.