Supported file types: PDF, .pptx, .docx, .rtf, .pages, .epub, and more
Transformed output types: Markdown, text
Extraction Capabilities: Text, tables, images, graphs, comics, mathematical equations
Customized Parsing Instructions: Because LlamaParse is LLM-enabled, you can pass it instructions just as if you were prompting an LLM. You can use such a prompt to describe the document, giving the LLM more context to use while parsing, to specify how you want the output to look, or to ask the LLM to do preprocessing during parsing, such as sentiment analysis, language translation, or summarization.
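As an illustration, a parsing instruction for a bilingual financial report might look like the following. The wording and document details are hypothetical, not a required format:

```text
This document is a quarterly financial report that mixes English and
French sections. Translate all French text to English, render every
table in markdown, and summarize each section in one sentence at the
end of that section.
```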
JSON Mode: Outputs the complete structure of the document, extracts images with size and location metadata, and extracts tables in JSON format for easy analysis. This is ideal for custom RAG applications where document structure and metadata are used to maximize the informational value of documents and to cite where in a document retrieved nodes originate.
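To make the citation use case concrete, the sketch below walks a simplified stand-in for a JSON-mode result and maps each table and image back to its page number. The `sample_result` shape is illustrative only; the actual LlamaParse JSON schema may differ in field names and nesting.

```python
# Hypothetical, simplified stand-in for a JSON-mode parse result:
# a list of pages, each with typed items and image metadata.
sample_result = {
    "pages": [
        {
            "page": 1,
            "items": [
                {"type": "heading", "md": "# Q3 Results"},
                {"type": "table", "md": "| Region | Revenue |\n|---|---|\n| EMEA | 4.2M |"},
            ],
            "images": [
                {"name": "chart_1.png", "x": 72, "y": 310, "width": 400, "height": 220}
            ],
        }
    ]
}

def index_for_citation(result):
    """Map each table and image to its page so retrieved nodes can be cited."""
    records = []
    for page in result["pages"]:
        for item in page["items"]:
            if item["type"] == "table":
                records.append({"page": page["page"], "kind": "table", "content": item["md"]})
        for img in page.get("images", []):
            records.append({"page": page["page"], "kind": "image", "content": img["name"]})
    return records

records = index_for_citation(sample_result)
```

Each record carries the originating page number, which is exactly the metadata a RAG pipeline needs to cite where in the document a retrieved node came from.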
The Markdown Advantage
There are some unique advantages to LlamaParse transforming a PDF into markdown format. Markdown captures the inherent structure of the document by identifying structural elements like titles, headers, subsections, tables, and images. This may seem trivial, but because markdown identifies these elements, we can easily split a document into smaller chunks based on structure using specialized parsers from LlamaIndex like the MarkdownElementNodeParser(). Representing a PDF file in markdown format thus enables us to extract each element of the PDF and ingest it into the RAG pipeline.
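The core idea of structure-based chunking can be sketched in a few lines of plain Python: split the markdown on its headers so each chunk corresponds to one section. This is a simplified illustration of the concept, not the actual MarkdownElementNodeParser implementation, which also separates tables and other elements into their own nodes.

```python
import re

def split_by_headers(markdown_text):
    """Split a markdown document into chunks, starting a new chunk at each header."""
    chunks = []
    current = []
    for line in markdown_text.splitlines():
        # A line beginning with 1-6 '#' characters starts a new section
        if re.match(r"^#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return chunks

doc = "# Title\nIntro text.\n## Methods\nDetails here.\n## Results\nA table would go here."
chunks = split_by_headers(doc)  # one chunk per section
```

Because each chunk aligns with a structural unit of the original PDF, the embeddings it produces stay topically coherent, which is what makes markdown a convenient intermediate format for RAG ingestion.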
Below is the retrieval and generation code for the same pipeline:
from openai import OpenAI

client = OpenAI()

def embed_query(query):
    # Embed the query with the same model used to embed the document chunks
    query_embedding = client.embeddings.create(
        input=query,
        model="text-embedding-3-small"
    )
    return query_embedding.data[0].embedding

def retrieve_data(query):
    # Vector search against the KDB.AI table (assumed to be created earlier),
    # returning the 5 nearest chunks and filtering out one document by its ID
    query_embedding = embed_query(query)
    results = table.search(
        vectors={"flat": [query_embedding]},
        n=5,
        filter=[("<>", "document_id", "4a9551df-5dec-4410-90bb-43d17d722918")],
    )
    retrieved_data_for_RAG = []
    for index, row in results[0].iterrows():
        retrieved_data_for_RAG.append(row["text"])
    return retrieved_data_for_RAG

def RAG(query):
    # The system message carries the question; the user message carries
    # the retrieved context the model should answer from
    question = "You will answer this question based on the provided reference material: " + query
    messages = "Here is the provided context: " + "\n"
    results = retrieve_data(query)
    if results:
        for data in results:
            messages += data + "\n"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": question},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": messages},
                ],
            },
        ],
        max_tokens=300,
    )
    content = response.choices[0].message.content
    return content