The PydanticOutputParser is one of the more useful tools in LangChain. It converts a language model's raw text output into structured, JSON-like objects that conform to a schema you define. Downstream code can then work with named, typed fields instead of free-form strings.
In the code that follows, Pydantic is used to define data models that represent the structure of the competitive intelligence information. Pydantic is a data validation and parsing library for Python that lets you define simple or complex data structures using Python type annotations. In this case, we use two Pydantic models: Competitor describes a single competitor, and Company holds a sequence of Competitor entries extracted from one piece of text.
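To make the validation step concrete before the full pipeline, here is a minimal sketch that uses Pydantic on its own, with no LangChain and no model call. The JSON string and the company name "Acme" are made-up stand-ins for what a language model might return, and loading the text with json.loads is only an approximation of what the output parser does internally.

```python
import json
from typing import List

from pydantic import BaseModel, ValidationError


class Competitor(BaseModel):
    company: str
    offering: str
    advantage: str
    products_and_services: str
    additional_details: str


class Company(BaseModel):
    company: List[Competitor]


# Hypothetical model output; in the real pipeline this text comes from the LLM.
raw_output = """
{"company": [{"company": "Acme", "offering": "Cloud storage",
              "advantage": "Lower price", "products_and_services": "Backup, sync",
              "additional_details": "Launched a free tier"}]}
"""

# Nested dictionaries are converted into Competitor instances during validation.
parsed = Company(**json.loads(raw_output))
first = parsed.company[0]
print(first.company, "-", first.advantage)

# A record with missing required fields is rejected instead of passing through silently.
try:
    Company(**json.loads('{"company": [{"company": "Acme"}]}'))
    validation_failed = False
except ValidationError:
    validation_failed = True
```

The second call fails because four of the five required fields are absent, which is the kind of malformed output you want to catch before it reaches a DataFrame.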
import pandas as pd
from typing import Sequence
from langchain.llms import OpenAI
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from pydantic import BaseModel

# Load data from CSV (semicolon-separated, with an 'INTEL' column of free text)
df = pd.read_csv("data.csv", sep=';')

# Pydantic models for competitive intelligence
class Competitor(BaseModel):
    company: str
    offering: str
    advantage: str
    products_and_services: str
    additional_details: str

class Company(BaseModel):
    """Identifying information about all competitive intelligence in a text."""
    company: Sequence[Competitor]

# Set up a Pydantic parser and prompt template
parser = PydanticOutputParser(pydantic_object=Company)
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# Create the model once and reuse it for every row
model = OpenAI(temperature=0)

# Function to process each row and extract information
def process_row(row):
    _input = prompt.format_prompt(query=row['INTEL'])
    output = model(_input.to_string())
    result = parser.parse(output)

    # Convert Pydantic result to a dictionary
    # (model_dump() requires Pydantic v2; on Pydantic v1 use result.dict())
    competitor_data = result.model_dump()

    # Flatten the nested structure for DataFrame creation
    flat_data = {'INTEL': [], 'company': [], 'offering': [], 'advantage': [], 'products_and_services': [], 'additional_details': []}
    for entry in competitor_data['company']:
        flat_data['INTEL'].append(row['INTEL'])
        flat_data['company'].append(entry['company'])
        flat_data['offering'].append(entry['offering'])
        flat_data['advantage'].append(entry['advantage'])
        flat_data['products_and_services'].append(entry['products_and_services'])
        flat_data['additional_details'].append(entry['additional_details'])

    # Create a DataFrame from the flattened data
    df_cake = pd.DataFrame(flat_data)
    return df_cake

# Apply the function to each row and concatenate the results
intel_df = pd.concat(df.apply(process_row, axis=1).tolist(), ignore_index=True)

# Display the resulting DataFrame
intel_df.head()
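The flattening step is easier to follow in isolation. The sketch below uses only the standard library and a hand-written dictionary shaped like the parsed result, so it runs without an API key. The helper name flatten_competitors and the sample companies are hypothetical and not part of the code above.

```python
from typing import Any, Dict, List


def flatten_competitors(intel_text: str, parsed: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return one flat row per competitor, each tagged with the source text."""
    rows = []
    for entry in parsed["company"]:
        row = {"INTEL": intel_text}
        row.update(entry)  # copy the competitor's fields next to the source text
        rows.append(row)
    return rows


# Hypothetical parsed result, shaped like the dictionary built from the Company model.
sample = {
    "company": [
        {"company": "Acme", "offering": "Cloud storage", "advantage": "Lower price",
         "products_and_services": "Backup, sync", "additional_details": "Free tier"},
        {"company": "Globex", "offering": "Analytics", "advantage": "Faster reports",
         "products_and_services": "Dashboards", "additional_details": "EU only"},
    ]
}

intel = "Acme cut prices on storage while Globex shipped faster reporting."
rows = flatten_competitors(intel, sample)
for r in rows:
    print(r["company"], "|", r["offering"], "|", r["advantage"])
```

Passing a list of row dictionaries like this to pd.DataFrame(rows) gives the same kind of table as the column-wise flat_data dictionary, with one row per competitor and the source text repeated in the INTEL column.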
References:
https://medium.com/@shubham.shardul2019/output-parsers-in-langchain-pydantic-json-parsing-31be48ce6cfe
https://medium.com/@ingridwickstevens/extract-structured-data-from-unstructured-text-using-llms-71502addf52b