Monday, January 22, 2024

LangChain PydanticOutputParser

PydanticOutputParser is LangChain's output parser for turning a language model's raw text reply into a validated, structured object. You describe the fields you want as a Pydantic model; the parser generates format instructions to include in the prompt, then parses the model's JSON reply into an instance of that model. The result is typed, JSON-like data you can work with directly, rather than free-form strings.
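
To make the "raw text to structure" idea concrete, here is a minimal standard-library sketch of the concept. It is not LangChain's implementation: the simplified `Competitor` dataclass, the `parse_competitors` helper, and the sample "Acme" data are all made up for illustration, and plain dataclasses do not check field types the way Pydantic does.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Competitor:
    # Simplified, hypothetical schema: two fields only
    company: str
    offering: str


def parse_competitors(raw: str) -> List[Competitor]:
    """Turn a JSON string into typed objects.

    Raises if the JSON is invalid, the "company" key is absent,
    or an entry has missing or unexpected fields.
    """
    payload = json.loads(raw)
    return [Competitor(**item) for item in payload["company"]]


# Illustrative stand-in for a language model's reply
raw_output = '{"company": [{"company": "Acme", "offering": "Cloud storage"}]}'
competitors = parse_competitors(raw_output)
print(competitors[0].company)  # Acme
```

The parser plays the role of `parse_competitors` here, with the added benefits that the schema is also used to write the prompt's format instructions and that field types are validated.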


In the code that follows, Pydantic defines the data models that describe the structure of the competitive intelligence information. Pydantic is a data validation and parsing library for Python that lets you define simple or complex data structures using Python type annotations. Here, two Pydantic models, Competitor and Company, define that structure: a Competitor holds the details of one competitor, and a Company holds a sequence of Competitor entries.


from typing import Sequence

import pandas as pd
from langchain.llms import OpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from pydantic import BaseModel


# Load data from CSV (semicolon-separated; the free text to analyze is in the 'INTEL' column)
df = pd.read_csv("data.csv", sep=';')


# Pydantic models for competitive intelligence
class Competitor(BaseModel):
    """Details about a single competitor mentioned in a text."""
    company: str
    offering: str
    advantage: str
    products_and_services: str
    additional_details: str


class Company(BaseModel):
    """Identifying information about all competitive intelligence in a text."""
    company: Sequence[Competitor]


# Set up a Pydantic parser and prompt template
parser = PydanticOutputParser(pydantic_object=Company)
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)


# Create the LLM client once and reuse it for every row
model = OpenAI(temperature=0)


# Function to process each row and extract information
def process_row(row):
    # Build the prompt for this row's free-text intel
    _input = prompt.format_prompt(query=row['INTEL'])
    output = model(_input.to_string())

    # Parse the raw completion into a Company instance
    result = parser.parse(output)

    # Convert Pydantic result to a dictionary
    competitor_data = result.model_dump()

    # Flatten the nested structure: one output row per competitor
    columns = ['INTEL', 'company', 'offering', 'advantage', 'products_and_services', 'additional_details']
    rows = [{'INTEL': row['INTEL'], **entry} for entry in competitor_data['company']]

    # Create a DataFrame from the flattened data
    return pd.DataFrame(rows, columns=columns)


# Apply the function to each row and concatenate the per-row DataFrames
intel_df = pd.concat([process_row(row) for _, row in df.iterrows()], ignore_index=True)


# Display the resulting DataFrame

intel_df.head()
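
One practical caveat: parser.parse raises an exception when the model's reply is not valid JSON or does not match the schema, so a single bad row would abort the whole run. Below is a minimal sketch of one way to guard against that, assuming you would rather skip a bad row than stop. The `safe_parse` helper is a hypothetical name, and `json.loads` stands in for parser.parse so the example runs without an API key.

```python
import json
from typing import Any, Callable, Optional


def safe_parse(parse_fn: Callable[[str], Any], text: str) -> Optional[Any]:
    """Return parse_fn(text), or None if parsing raises (hypothetical helper)."""
    try:
        return parse_fn(text)
    except Exception as exc:
        # With LangChain you could narrow this to its OutputParserException
        print(f"Skipping unparseable output: {exc}")
        return None


# json.loads is a stand-in for parser.parse so the sketch runs offline
good = safe_parse(json.loads, '{"company": []}')
bad = safe_parse(json.loads, "Sorry, I cannot answer that.")
```

Inside process_row, the equivalent call would be safe_parse(parser.parse, output), returning an empty DataFrame with the expected columns when the result is None.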



References:

https://medium.com/@shubham.shardul2019/output-parsers-in-langchain-pydantic-json-parsing-31be48ce6cfe

https://medium.com/@ingridwickstevens/extract-structured-data-from-unstructured-text-using-llms-71502addf52b


