Tuesday, September 30, 2025

Overview of LlamaParse Features

Supported file types: PDF, .pptx, .docx, .rtf, .pages, .epub, and more

Transformed output types: Markdown, text

Extraction capabilities: text, tables, images, graphs, comic books, mathematical equations

Customized Parsing Instructions: Because LlamaParse is LLM-enabled, you can pass it natural-language instructions just as if you were prompting an LLM. You can use this prompt to describe the document (adding context for the LLM to use while parsing), to specify how you want the output to look, or to ask the LLM to do preprocessing during parsing, such as sentiment analysis, language translation, or summarization. A sketch of this is shown below.
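
A minimal sketch of passing a parsing instruction, assuming the llama_parse package is installed and a LlamaCloud API key is configured in the environment; the instruction text and file name here are placeholders:

from llama_parse import LlamaParse

# Hypothetical example: the instruction and file name are illustrative only.
parser = LlamaParse(
    result_type="markdown",  # "markdown" or "text"
    parsing_instruction=(
        "This document is a quarterly financial report. "
        "Preserve all tables and translate any non-English passages into English."
    ),
)

documents = parser.load_data("./quarterly_report.pdf")
print(documents[0].text[:500])  # preview the parsed markdown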

JSON Mode: Outputs the complete structure of the document, extracts images with size and location metadata, and extracts tables in JSON format for easy analysis. This is perfect for custom RAG applications where document structure and metadata are used to maximize the informational value of a document and to cite where in the document retrieved nodes originate.
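
A sketch of JSON mode with the same package, using its get_json_result and get_images entry points (the file name and download path are placeholders, and the exact page fields may vary by version):

from llama_parse import LlamaParse

parser = LlamaParse()

# Each result contains a "pages" list with text, layout items, and image metadata.
json_objs = parser.get_json_result("./quarterly_report.pdf")

for page in json_objs[0]["pages"]:
    print(page["page"], len(page.get("items", [])), "layout items")

# Download the images referenced in the JSON, with their size/location metadata.
images = parser.get_images(json_objs, download_path="./parsed_images")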


The Markdown Advantage

There are some unique advantages to having LlamaParse transform a PDF into markdown. Markdown makes the inherent structure of the document explicit by identifying structural elements such as titles, headers, subsections, tables, and images. This may seem trivial, but because markdown identifies these elements, we can easily split a document into smaller chunks based on its structure using specialized parsers from LlamaIndex like MarkdownElementNodeParser(). Representing a PDF in markdown therefore lets us extract each element of the PDF and ingest it into the RAG pipeline, as sketched below.
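
A minimal sketch of that chunking step, assuming documents holds the markdown output from LlamaParse and that an OpenAI key is available for the LLM the parser uses to summarize tables:

from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI

# The LLM is used to summarize tables and other embedded elements.
node_parser = MarkdownElementNodeParser(
    llm=OpenAI(model="gpt-4o"),
    num_workers=8,
)

nodes = node_parser.get_nodes_from_documents(documents)

# Separate plain text nodes from table/object nodes before indexing.
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)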



Below is a retrieval and generation (RAG) flow that queries the same indexed data:


from openai import OpenAI

client = OpenAI()


def embed_query(query):
    # Embed the query with the same model used to embed the documents.
    query_embedding = client.embeddings.create(
        input=query,
        model="text-embedding-3-small",
    )
    return query_embedding.data[0].embedding


def retrieve_data(query):
    # `table` is the vector index handle created during ingestion (a
    # KDB.AI-style table here); the filter excludes one document_id.
    query_embedding = embed_query(query)
    results = table.search(
        vectors={'flat': [query_embedding]},
        n=5,
        filter=[('<>', 'document_id', '4a9551df-5dec-4410-90bb-43d17d722918')],
    )
    # The search returns a list of DataFrames; collect the text of each hit.
    retrieved_data_for_RAG = []
    for index, row in results[0].iterrows():
        retrieved_data_for_RAG.append(row['text'])
    return retrieved_data_for_RAG


def RAG(query):
    # The system prompt carries the question; the user message carries
    # the retrieved context.
    question = "You will answer this question based on the provided reference material: " + query
    context = "Here is the provided context: " + "\n"
    results = retrieve_data(query)
    if results:
        for data in results:
            context += data + "\n"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": question},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": context},
                ],
            },
        ],
        max_tokens=300,
    )
    return response.choices[0].message.content
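
A quick usage sketch; the question here is a placeholder:

answer = RAG("What are the key findings in the parsed report?")
print(answer)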

