Tuesday, May 6, 2025

Does LLMsherpa use an API to do the parsing? How does it work?

The llmsherpa LayoutPDFReader itself primarily focuses on structure extraction from PDFs, and it does not directly use an LLM for that core task.

Here's a more detailed explanation:

What LayoutPDFReader Does: It's designed to parse PDFs and understand their layout, identifying elements like sections, paragraphs, tables, and lists. This is crucial for preparing PDF content for use with LLMs. It aims to provide a more structured representation of the PDF content than a simple text extraction.

How it Works: LayoutPDFReader sends the PDF to a parsing API (hosted by llmsherpa, or one you host yourself) and receives back a structured, hierarchical representation of the document: sections, paragraphs, tables, and lists. The heavy lifting happens in that service rather than locally.

LLMs in the Broader Context: While LayoutPDFReader doesn't use an LLM for its primary parsing, the output from LayoutPDFReader is intended to be used with LLMs. The structured data it provides makes it much easier to feed PDF content into an LLM for tasks like:

Retrieval Augmented Generation (RAG): Where you retrieve relevant chunks of text from a PDF (processed by LayoutPDFReader) and provide them to an LLM to answer a question.

Summarization: Where you use an LLM to summarize sections of a PDF identified by LayoutPDFReader.


Regarding API Keys:


LayoutPDFReader performs its parsing through an API call, so you do need an endpoint to talk to: rather than an API key, the documentation mentions the need for an LLMSherpa API URL, which you pass in when constructing the reader.
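A minimal usage sketch based on the library's quick-start (the API URL is the publicly documented llmsherpa endpoint; the PDF URL is just an arbitrary example):

from llmsherpa.readers import LayoutPDFReader

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"
pdf_url = "https://arxiv.org/pdf/1706.03762.pdf"  # any PDF URL or local file path

pdf_reader = LayoutPDFReader(llmsherpa_api_url)  # only the API URL is needed, no key
doc = pdf_reader.read_pdf(pdf_url)               # the PDF is sent to the API, structure comes back

for section in doc.sections():                   # walk the recovered document structure
    print(section.title)

for chunk in doc.chunks():                       # layout-aware chunks, convenient for RAG
    print(chunk.to_context_text()[:80])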


In summary, LayoutPDFReader is a tool that helps in intelligently extracting information from PDFs, and this structured information is then very useful for LLMs.


References 

OpenAI


Monday, May 5, 2025

What are reasons for Slow Execution of GridSearchCV?

# RandomForestClassifier with GridSearchCV for hyperparameter tuning

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV

param_grid = {

    "n_estimators": [50, 100, 200],  # Number of trees

    "max_depth": [None, 10, 20, 30],  # Maximum depth of the trees

    "min_samples_split": [2, 5, 10],  # Minimum samples required to split an internal node

    "min_samples_leaf": [1, 2, 4],  # Minimum samples required to be at a leaf node

    "criterion": ["gini", "entropy"],

}

grid_search_rf = GridSearchCV(

    RandomForestClassifier(random_state=1),  # Keep random_state for reproducibility

    param_grid,

    cv=5,  # 5-fold cross-validation

    scoring="accuracy",  # Optimize for accuracy

)

grid_search_rf.fit(X_train, y_train)

best_rf_classifier = grid_search_rf.best_estimator_ # Get the best model

Exhaustive Search: GridSearchCV tries every combination of parameters in your param_grid.  The more combinations, the longer it takes.

Cross-Validation:  The cv parameter (in your case, 5) means that for each parameter combination, the model is trained and evaluated 5 times. This multiplies the training time.

Model Complexity: Random Forest itself can be computationally intensive, especially with more trees (n_estimators) and deeper trees (max_depth).

Data Size:  The larger your X_train and y_train, the longer each model training and evaluation takes.

Number of Parameters: Each parameter added to the grid multiplies the size of the search space, so the total number of fits grows very quickly (see the quick count below).
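For the grid above this is easy to make concrete: the total number of model fits is the number of parameter combinations multiplied by the number of cross-validation folds.

# cost estimate for the param_grid above (pure arithmetic, nothing is trained here)
n_combinations = 3 * 4 * 3 * 3 * 2   # n_estimators * max_depth * min_samples_split * min_samples_leaf * criterion
n_folds = 5
total_fits = n_combinations * n_folds
print(n_combinations, total_fits)     # 216 combinations -> 1080 model fits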

Strategies to Reduce Time During Development:

Reduce the Parameter Grid Size:

Fewer Values: Instead of [50, 100, 200], try [50, 100] for n_estimators.

Coarser Grid:  Instead of [None, 10, 20, 30], try [None, 20] for max_depth.

Fewer Parameters:  Start with a smaller subset of parameters. For example, initially, tune only n_estimators and max_depth, and fix the others to their default values.  Later, add more parameters to the grid.


Example:


param_grid = {

    "n_estimators": [50, 100],

    "max_depth": [10, 20],

    # "min_samples_split": [2, 5],  # Removed some values

    # "min_samples_leaf": [1, 2],

    "criterion": ["gini"],  # Fixed to 'gini'

}


Reduce Cross-Validation Folds:


Use cv=3 instead of cv=5 during development.  This reduces the number of training/evaluation rounds.  You can increase it to 5 (or 10) for the final run when you're confident in your parameter ranges.


grid_search_rf = GridSearchCV(..., cv=3, ...)


Use a Smaller Subset of Data:


During initial development and testing, train GridSearchCV on a smaller sample of your training data.  For example, use the first 1000 rows.  Once you've found a good range of parameters, train on the full dataset.


X_train_small, _, y_train_small, _ = train_test_split(X_train, y_train, train_size=1000, random_state=42)

grid_search_rf.fit(X_train_small, y_train_small)


Be cautious about using a subset that is too small: it might not be representative of the full dataset, and the hyperparameters that look optimal on it might differ from those for the full data.


Consider RandomizedSearchCV:


If GridSearchCV is still too slow, consider RandomizedSearchCV.  Instead of trying all combinations, it samples a fixed number of parameter combinations.  This is often much faster, especially with a large parameter space, and can still find good (though not necessarily optimal) hyperparameters.


from sklearn.model_selection import RandomizedSearchCV

param_distributions = {  # Use param_distributions, not param_grid

    "n_estimators": [50, 100, 200, 300],

    "max_depth": [None, 10, 20, 30, 40],

    "min_samples_split": [2, 5, 10, 15],

    "min_samples_leaf": [1, 2, 4, 6],

    "criterion": ["gini", "entropy"],

}

random_search_rf = RandomizedSearchCV(

    RandomForestClassifier(random_state=1),

    param_distributions,  # Use param_distributions

    n_iter=10,  # Number of random combinations to try

    cv=3,

    scoring="accuracy",

    random_state=42,  # Important for reproducibility

)

random_search_rf.fit(X_train, y_train)

best_rf_classifier = random_search_rf.best_estimator_


n_iter controls how many random combinations are tried.  A smaller n_iter will be faster.


Use Parallel Processing (if available):


If your machine has multiple CPU cores, use the n_jobs parameter in GridSearchCV and RandomizedSearchCV.  Setting n_jobs=-1 will use all available cores.  This can significantly speed up the process, especially with larger datasets and complex models.


grid_search_rf = GridSearchCV(..., n_jobs=-1, ...)


However, be aware that parallel processing can increase memory consumption.


Early Stopping (Not Directly Applicable to RandomForest in GridSearchCV):


Some models (like Gradient Boosting) have built-in early stopping mechanisms that can halt training when performance on a validation set stops improving.  Random Forest itself doesn't have this, and GridSearchCV doesn't directly provide early stopping.
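For contrast, here is a minimal sketch of built-in early stopping with scikit-learn's GradientBoostingClassifier (the parameter names are standard scikit-learn arguments; the values are only illustrative):

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on boosting rounds
    validation_fraction=0.1,   # hold out 10% of the training data internally
    n_iter_no_change=10,       # stop when the validation score stops improving for 10 rounds
    random_state=42,
)
gb.fit(X_train, y_train)
print(gb.n_estimators_)        # rounds actually used, often well below 500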


Summary of Recommendations:


For faster development:


Start with a smaller param_grid.


Use a lower cv (e.g., 3).


Consider training on a smaller data subset initially.


Explore RandomizedSearchCV for a faster, though potentially less exhaustive, search.


Use n_jobs=-1 to leverage multiple CPU cores.


Remember to increase cv and use the full dataset for your final model training and evaluation to ensure you get the best possible performance.


Saturday, May 3, 2025

Why Imbalanced data matters in Machine Learning

Biased Model: Machine learning models tend to be biased towards the majority class. In your case, a model might become very good at predicting "Rejected" but perform poorly on "Accepted" instances.

Poor Generalization: The model might not learn the characteristics of the minority class ("Accepted") well enough to generalize to unseen data.

Misleading Accuracy: A high overall accuracy can be misleading. For example, a model that always predicts "Rejected" would achieve 66% accuracy on your data, but it would be useless for your actual goal of predicting both "Accepted" and "Rejected" status.

Is Your Imbalance "Highly" Imbalanced?

While a 2:1 ratio is a moderate imbalance, it can still impact model performance. Whether it's "highly" imbalanced depends on the specific problem and the complexity of the data.  A 2:1 ratio might be manageable with the right techniques, but it's definitely something you need to address.

What Can You Do?

Here are several strategies to handle imbalanced data:

Resampling Techniques:

Oversampling the Minority Class: Increase the number of "Accepted" instances.

Random Oversampling: Duplicate existing "Accepted" samples.  Simple but can lead to overfitting.

SMOTE (Synthetic Minority Over-sampling Technique): Create new synthetic "Accepted" samples by interpolating between existing ones.  More sophisticated and generally preferred; a short code sketch follows this list.

ADASYN (Adaptive Synthetic Sampling Approach): Similar to SMOTE, but generates more synthetic samples for "Accepted" instances that are harder to learn.

Undersampling the Majority Class: Decrease the number of "Rejected" instances.

Random Undersampling: Randomly remove "Rejected" samples.  Can lead to information loss.

Cluster Centroids: Replace clusters of "Rejected" samples with the cluster centroids.

Tomek Links: Find pairs of nearest neighbors that belong to opposite classes and remove the majority-class ("Rejected") member of each pair, cleaning up the class boundary.
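A minimal sketch of SMOTE oversampling, assuming the imbalanced-learn package is installed and X_train / y_train already exist (resample only the training split, never the test set):

from collections import Counter
from imblearn.over_sampling import SMOTE

print("Before:", Counter(y_train))        # e.g. roughly 2 "Rejected" for every "Accepted"

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print("After:", Counter(y_train_res))     # the two classes are now the same size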


Cost-Sensitive Learning:

Assign different weights to the classes during model training.  Give higher weight to the minority class ("Accepted") to penalize misclassifications more heavily.  Most machine learning libraries (like scikit-learn) have built-in support for class weights.
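In scikit-learn this is usually a single argument; a minimal sketch (class_weight="balanced" weights each class inversely to its frequency, which fits the 2:1 ratio here):

from sklearn.ensemble import RandomForestClassifier

# "Accepted" automatically gets roughly twice the weight of "Rejected"
clf = RandomForestClassifier(class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)

# explicit weights also work, e.g. class_weight={"Rejected": 1, "Accepted": 2}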

Ensemble Methods:

Balanced Bagging/Random Forest: Create balanced subsets of the data through bootstrapping (see the sketch after this list).

EasyEnsemble/BalanceCascade: Train several models on balanced subsets and combine their predictions.
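A minimal sketch using the ensemble estimators from imbalanced-learn (assuming that package is installed; the class names below come from it):

from imblearn.ensemble import BalancedRandomForestClassifier, EasyEnsembleClassifier

# random forest where each tree sees a bootstrap sample balanced by undersampling
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)

# ensemble of boosted learners, each trained on a balanced undersampled subset
eec = EasyEnsembleClassifier(n_estimators=10, random_state=42)
eec.fit(X_train, y_train)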

Change the Evaluation Metric:

Don't rely on accuracy.  Use metrics that are more robust to class imbalance (a tiny example follows this list):

Precision and Recall: Focus on the performance for each class separately.

F1 Score: Harmonic mean of precision and recall, balances both.

AUC-ROC: Area Under the Receiver Operating Characteristic curve.

AUPRC: Area Under the Precision-Recall Curve.  Often preferred for highly imbalanced data.
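To see why accuracy alone misleads, here is a toy check (made-up labels for a model that always predicts the majority class):

from sklearn.metrics import accuracy_score, f1_score

# 8 "Rejected" (0) and 2 "Accepted" (1); the model predicts "Rejected" every time
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10

print(accuracy_score(y_true, y_pred))              # 0.8, looks respectable
print(f1_score(y_true, y_pred, zero_division=0))   # 0.0, the model never finds an "Accepted" case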

Data Augmentation:

If your data involves text, you can use techniques like synonym replacement, back-translation, or random insertion/deletion to create more "Accepted" samples.

Recommendation for Your Case

Given your 2:1 ratio, I'd recommend starting with these approaches:

Resampling: Try SMOTE or ADASYN to oversample "Accepted", or try Tomek links to undersample "Rejected".

Cost-Sensitive Learning: Use class weights in your chosen model (e.g., class_weight='balanced' in scikit-learn).

Evaluation Metrics: Use F1 score, AUC-ROC, and AUPRC to evaluate your model, not just accuracy.

By addressing the class imbalance, you'll create a model that is more accurate and reliable in predicting both "Accepted" and "Rejected" statuses.


What are the various metrics in a machine learning classification algorithm?

In classification tasks in machine learning, evaluating model performance involves various metrics. The F1 Score is one of the most commonly used, especially when dealing with imbalanced classes. Here's an overview of the F1 Score and other key metrics:


Accuracy:

Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)

Use case: When classes are balanced.

Limitation: Accuracy is misleading when the classes are imbalanced.


Precision: 

Formula: Precision = TP / (TP + FP)

Use case: When false positives are costly (e.g., spam detection).


Recall (Sensitivity or True Positive Rate)

Formula: Recall = TP / (TP + FN)

Use case: When false negatives are costly (e.g., medical diagnosis).


F1 Score: 

Formula: F1 = 2 * (Precision * Recall) / (Precision + Recall)

Use case: When you want a balance between precision and recall, especially for imbalanced datasets.


ROC-AUC (Receiver Operating Characteristic - Area Under Curve)

Use case: Measures the model's ability to distinguish between classes across all classification thresholds.

Higher AUC → Better model.


Confusion Matrix 

A confusion matrix is a performance measurement tool used in classification tasks in machine learning. It gives a detailed breakdown of how well a classification model is performing by showing actual vs. predicted class results.

For a binary classification, the confusion matrix looks like:

                     Predicted: Positive       Predicted: Negative

Actual: Positive     True Positive (TP)        False Negative (FN)

Actual: Negative     False Positive (FP)       True Negative (TN)


True Positive (TP): Correctly predicted positive class

True Negative (TN): Correctly predicted negative class

False Positive (FP): Incorrectly predicted positive (Type I error)

False Negative (FN): Incorrectly predicted negative (Type II error)


The code for computing a confusion matrix is given below:

from sklearn.metrics import confusion_matrix, classification_report

y_true = [1, 0, 1, 1, 0, 1, 0]

y_pred = [1, 0, 1, 0, 0, 1, 1]


# Confusion Matrix

cm = confusion_matrix(y_true, y_pred)

print("Confusion Matrix:\n", cm)


# Classification Report

print("\nClassification Report:\n", classification_report(y_true, y_pred))

When Is It Especially Beneficial?

When dealing with imbalanced datasets (e.g., fraud detection, medical diagnosis).

When overall accuracy is misleading — confusion matrix shows where the model fails.

In multi-class classification, the confusion matrix shows errors across all classes; a small multi-class example follows.
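A small multi-class sketch (three made-up labels; the matrix gets one row and one column per class):

from sklearn.metrics import confusion_matrix

y_true = ["cat", "dog", "bird", "cat", "dog", "bird", "cat"]
y_pred = ["cat", "dog", "cat", "cat", "bird", "bird", "dog"]

labels = ["bird", "cat", "dog"]
print(confusion_matrix(y_true, y_pred, labels=labels))
# rows are actual classes, columns are predicted classes, in the order given by `labels`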




Sample code for Accuracy, Precision, Recall, F1, and ROC-AUC is below:

from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
    classification_report,
    RocCurveDisplay,
)
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

# 1. Create a binary classification dataset
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=2,
    n_redundant=10, n_classes=2, weights=[0.7, 0.3], random_state=42
)

# 2. Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# 3. Train a classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# 4. Predict
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:, 1]  # Probabilities for ROC-AUC

# 5. Compute metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, y_proba))

# 6. Confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# 7. Full classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# 8. Optional: ROC Curve
RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.title("ROC Curve")
plt.show()


references:

OpenAI 


Thursday, May 1, 2025

Hierarchical Agent Pattern in LangGraph

As you add more agents to your system, it might become too hard for the supervisor to manage all of them. The supervisor might start making poor decisions about which agent to call next, or the context might become too complex for a single supervisor to keep track of. In other words, you end up with the same problems that motivated the multi-agent architecture in the first place.

To address this, you can design your system hierarchically. For example, you can create separate, specialized teams of agents managed by individual supervisors, and a top-level supervisor to manage the teams.

from typing import Literal

from langchain_openai import ChatOpenAI

from langgraph.graph import StateGraph, MessagesState, START, END

from langgraph.types import Command

model = ChatOpenAI()


# define team 1 (same as the single supervisor example)

class Team1State(MessagesState):

    next: Literal["team_1_agent_1", "team_1_agent_2", "__end__"]


def team_1_supervisor(state: MessagesState) -> Command[Literal["team_1_agent_1", "team_1_agent_2", END]]:

    response = model.invoke(...)

    return Command(goto=response["next_agent"])


def team_1_agent_1(state: MessagesState) -> Command[Literal["team_1_supervisor"]]:

    response = model.invoke(...)

    return Command(goto="team_1_supervisor", update={"messages": [response]})


def team_1_agent_2(state: MessagesState) -> Command[Literal["team_1_supervisor"]]:

    response = model.invoke(...)

    return Command(goto="team_1_supervisor", update={"messages": [response]})


team_1_builder = StateGraph(Team1State)

team_1_builder.add_node(team_1_supervisor)

team_1_builder.add_node(team_1_agent_1)

team_1_builder.add_node(team_1_agent_2)

team_1_builder.add_edge(START, "team_1_supervisor")

team_1_graph = team_1_builder.compile()


# define team 2 (same as the single supervisor example above)

class Team2State(MessagesState):

    next: Literal["team_2_agent_1", "team_2_agent_2", "__end__"]


def team_2_supervisor(state: Team2State):

    ...


def team_2_agent_1(state: Team2State):

    ...


def team_2_agent_2(state: Team2State):

    ...


team_2_builder = StateGraph(Team2State)

...

team_2_graph = team_2_builder.compile()



# define top-level supervisor



def top_level_supervisor(state: MessagesState) -> Command[Literal["team_1_graph", "team_2_graph", END]]:

    # you can pass relevant parts of the state to the LLM (e.g., state["messages"])

    # to determine which team to call next. a common pattern is to call the model

    # with a structured output (e.g. force it to return an output with a "next_team" field)

    response = model.invoke(...)

    # route to one of the teams or exit based on the supervisor's decision

    # if the supervisor returns "__end__", the graph will finish execution

    return Command(goto=response["next_team"])


builder = StateGraph(MessagesState)

builder.add_node(top_level_supervisor)

builder.add_node("team_1_graph", team_1_graph)

builder.add_node("team_2_graph", team_2_graph)

builder.add_edge(START, "top_level_supervisor")

builder.add_edge("team_1_graph", "top_level_supervisor")

builder.add_edge("team_2_graph", "top_level_supervisor")

graph = builder.compile()

What is the Supervisor agent architecture?

In this architecture, we define agents as nodes and add a supervisor node (LLM) that decides which agent nodes should be called next. We use Command to route execution to the appropriate agent node based on the supervisor's decision. This architecture also lends itself well to running multiple agents in parallel or using a map-reduce pattern.


from typing import Literal

from langchain_openai import ChatOpenAI

from langgraph.types import Command

from langgraph.graph import StateGraph, MessagesState, START, END


model = ChatOpenAI()


def supervisor(state: MessagesState) -> Command[Literal["agent_1", "agent_2", END]]:

    # you can pass relevant parts of the state to the LLM (e.g., state["messages"])

    # to determine which agent to call next. a common pattern is to call the model

    # with a structured output (e.g. force it to return an output with a "next_agent" field)

    response = model.invoke(...)

    # route to one of the agents or exit based on the supervisor's decision

    # if the supervisor returns "__end__", the graph will finish execution

    return Command(goto=response["next_agent"])


def agent_1(state: MessagesState) -> Command[Literal["supervisor"]]:

    # you can pass relevant parts of the state to the LLM (e.g., state["messages"])

    # and add any additional logic (different models, custom prompts, structured output, etc.)

    response = model.invoke(...)

    return Command(

        goto="supervisor",

        update={"messages": [response]},

    )


def agent_2(state: MessagesState) -> Command[Literal["supervisor"]]:

    response = model.invoke(...)

    return Command(

        goto="supervisor",

        update={"messages": [response]},

    )


builder = StateGraph(MessagesState)

builder.add_node(supervisor)

builder.add_node(agent_1)

builder.add_node(agent_2)


builder.add_edge(START, "supervisor")


supervisor = builder.compile()



What is Supervisor (tool-calling)?


In this variant of the supervisor architecture, we define individual agents as tools and use a tool-calling LLM in the supervisor node. This can be implemented as a ReAct-style agent with two nodes — an LLM node (supervisor) and a tool-calling node that executes tools (agents in this case).


from typing import Annotated

from langchain_openai import ChatOpenAI

from langgraph.prebuilt import InjectedState, create_react_agent


model = ChatOpenAI()


# this is the agent function that will be called as tool

# notice that you can pass the state to the tool via InjectedState annotation

def agent_1(state: Annotated[dict, InjectedState]):

    # you can pass relevant parts of the state to the LLM (e.g., state["messages"])

    # and add any additional logic (different models, custom prompts, structured output, etc.)

    response = model.invoke(...)

    # return the LLM response as a string (expected tool response format)

    # this will be automatically turned to ToolMessage

    # by the prebuilt create_react_agent (supervisor)

    return response.content


def agent_2(state: Annotated[dict, InjectedState]):

    response = model.invoke(...)

    return response.content


tools = [agent_1, agent_2]

# the simplest way to build a supervisor w/ tool-calling is to use prebuilt ReAct agent graph

# that consists of a tool-calling LLM node (i.e. supervisor) and a tool-executing node

supervisor = create_react_agent(model, tools)


What is Network Agent Architecture?

In this architecture, agents are defined as graph nodes. Each agent can communicate with every other agent (many-to-many connections) and can decide which agent to call next. This architecture is good for problems that do not have a clear hierarchy of agents or a specific sequence in which agents should be called.


from typing import Literal

from langchain_openai import ChatOpenAI

from langgraph.types import Command

from langgraph.graph import StateGraph, MessagesState, START, END


model = ChatOpenAI()


def agent_1(state: MessagesState) -> Command[Literal["agent_2", "agent_3", END]]:

    # you can pass relevant parts of the state to the LLM (e.g., state["messages"])

    # to determine which agent to call next. a common pattern is to call the model

    # with a structured output (e.g. force it to return an output with a "next_agent" field)

    response = model.invoke(...)

    # route to one of the agents or exit based on the LLM's decision

    # if the LLM returns "__end__", the graph will finish execution

    return Command(

        goto=response["next_agent"],

        update={"messages": [response["content"]]},

    )


def agent_2(state: MessagesState) -> Command[Literal["agent_1", "agent_3", END]]:

    response = model.invoke(...)

    return Command(

        goto=response["next_agent"],

        update={"messages": [response["content"]]},

    )


def agent_3(state: MessagesState) -> Command[Literal["agent_1", "agent_2", END]]:

    ...

    return Command(

        goto=response["next_agent"],

        update={"messages": [response["content"]]},

    )


builder = StateGraph(MessagesState)

builder.add_node(agent_1)

builder.add_node(agent_2)

builder.add_node(agent_3)

builder.add_edge(START, "agent_1")

network = builder.compile()