The graph we are creating contains the following node types:
Document: metadata for document sources
Chunk: text chunks from the documents with embeddings to power vector retrieval
__Entity__: Entities extracted from the text chunks
Creating a knowledge graph with the GraphRAG Python package is straightforward.
The SimpleKGPipeline class lets you build a knowledge graph automatically from a few key inputs, including:
a driver to connect to Neo4j,
an LLM for entity extraction, and
an embedding model to create vectors on text chunks for similarity search.
Neo4j Driver
The Neo4j driver lets you connect to the database and run read and write transactions. The URI, username, and password are the ones you received when you created the database. If you created your database on AuraDB, they are in the file you downloaded.
import neo4j
neo4j_driver = neo4j.GraphDatabase.driver(NEO4J_URI,
                                          auth=(NEO4J_USERNAME, NEO4J_PASSWORD))
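If you keep the connection details in a text file, as with the AuraDB download, one way to load them is to parse the file into a dictionary. The sketch below assumes the file holds one KEY=VALUE pair per line; the helper name parse_credentials and the sample values are hypothetical, so adjust it to whatever your file actually contains.

```python
def parse_credentials(text):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and # comments."""
    creds = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, value = line.split("=", 1)
        creds[key.strip()] = value.strip()
    return creds


# Hypothetical file contents; real values come from your own database.
sample = """
# connection details
NEO4J_URI=neo4j+s://example.databases.neo4j.io
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=not-a-real=password
"""

creds = parse_credentials(sample)
NEO4J_URI = creds["NEO4J_URI"]
NEO4J_USERNAME = creds["NEO4J_USERNAME"]
NEO4J_PASSWORD = creds["NEO4J_PASSWORD"]
print(NEO4J_URI, NEO4J_USERNAME)
```

For a real file, read its text first (for example with pathlib.Path(...).read_text()) and pass it to the function. This keeps the password out of the notebook itself.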
LLM & Embedding Model
In this case, we will use OpenAI GPT-4o-mini for convenience. It is a fast and low-cost model. The GraphRAG Python package supports any LLM model, including models from OpenAI, Google VertexAI, Anthropic, Cohere, Azure OpenAI, local Ollama models, and any chat model that works with LangChain. You can also implement a custom interface for any other LLM.
Likewise, we will use OpenAI’s default text-embedding-ada-002 for the embedding model, but you can use other embedders from different providers.
import neo4j
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings.openai import OpenAIEmbeddings
llm = OpenAILLM(
    model_name="gpt-4o-mini",
    model_params={
        "response_format": {"type": "json_object"},  # use json_object formatting for best results
        "temperature": 0  # turning temperature down for more deterministic results
    }
)
# create text embedder
embedder = OpenAIEmbeddings()
Optional Inputs: Schema & Prompt Template
While not required, adding a graph schema is highly recommended for improving knowledge graph quality. It provides guidance for the node and relationship types to create during entity extraction.
Pro-tip: If you are still deciding what schema to use, try building a graph without a schema first and examine the most common node and relationship types created as a starting point.
For our graph schema, we will define the entities (a.k.a. node labels) and relations that we want to extract. While we won’t use it in this simple example, there is also an optional potential_schema argument, which can guide which relationships should connect to which nodes.
# define node labels
basic_node_labels = ["Object", "Entity", "Group", "Person", "Organization", "Place"]
academic_node_labels = ["ArticleOrPaper", "PublicationOrJournal"]
medical_node_labels = ["Anatomy", "BiologicalProcess", "Cell", "CellularComponent",
                       "CellType", "Condition", "Disease", "Drug",
                       "EffectOrPhenotype", "Exposure", "GeneOrProtein", "Molecule",
                       "MolecularFunction", "Pathway"]
node_labels = basic_node_labels + academic_node_labels + medical_node_labels
# define relationship types
rel_types = ["ACTIVATES", "AFFECTS", "ASSESSES", "ASSOCIATED_WITH", "AUTHORED",
             "BIOMARKER_FOR", …]
We will also add a custom prompt for entity extraction. While the GraphRAG Python package has an internal default prompt, engineering a prompt closer to your use case often helps create a more applicable knowledge graph. The prompt below was created with a bit of experimentation. Be sure to follow the same general format as the default prompt.
prompt_template = '''
You are a medical researcher tasked with extracting information from papers
and structuring it in a property graph to inform further medical and research Q&A.
Extract the entities (nodes) and specify their type from the following Input text.
Also extract the relationships between these nodes. The relationship direction goes from the start node to the end node.
Return result as JSON using the following format:
{{"nodes": [ {{"id": "0", "label": "the type of entity", "properties": {{"name": "name of entity" }} }}],
"relationships": [{{"type": "TYPE_OF_RELATIONSHIP", "start_node_id": "0", "end_node_id": "1", "properties": {{"details": "Description of the relationship"}} }}] }}
- Use only the information from the Input text. Do not add any additional information.
- If the input text is empty, return empty JSON.
- Make sure to create as many nodes and relationships as needed to offer rich medical context for further research.
- An AI knowledge assistant must be able to read this graph and immediately understand the context to inform detailed research questions.
- Multiple documents will be ingested from different sources and we are using this property graph to connect information, so make sure entity types are fairly general.
Use only the following nodes and relationships (if provided):
{schema}
Assign a unique ID (string) to each node, and reuse it to define relationships.
Do respect the source and target node types for relationship and
the relationship direction.
Do not return any additional information other than the JSON in it.
Examples:
{examples}
Input text:
{text}
'''
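The doubled braces in the template are worth a note. Assuming the template is filled with Python's str.format-style substitution (which the {{ }} escaping suggests), doubled braces come out as literal JSON braces, while {schema}, {examples}, and {text} are replaced. The shortened template and the sample values below are hypothetical and only illustrate the escaping; how the package renders the schema string internally is not shown here.

```python
# A shortened, hypothetical stand-in for the full prompt above.
mini_template = '''Return result as JSON using the following format:
{{"nodes": [ {{"id": "0", "label": "the type of entity"}} ]}}
Use only the following nodes and relationships (if provided):
{schema}
Examples:
{examples}
Input text:
{text}
'''

# Doubled braces become single literal braces; named fields are substituted.
filled = mini_template.format(
    schema="Node labels: Drug, Disease",
    examples="",
    text="Aspirin is used to treat headaches.",
)
print(filled)
```

If you add your own JSON snippets to the prompt, remember to double their braces as well, otherwise the formatting step will treat them as placeholders and fail.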
Creating the SimpleKGPipeline
Create the SimpleKGPipeline using the constructor below:
from neo4j_graphrag.experimental.components.text_splitters.fixed_size_splitter import FixedSizeSplitter
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline
kg_builder_pdf = SimpleKGPipeline(
    llm=llm,  # the OpenAILLM instance created above
    driver=neo4j_driver,  # the Neo4j driver created above
    text_splitter=FixedSizeSplitter(chunk_size=500, chunk_overlap=100),
    embedder=embedder,
    entities=node_labels,
    relations=rel_types,
    prompt_template=prompt_template,
    from_pdf=True
)
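To get a feel for what chunk_size=500 and chunk_overlap=100 mean, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It assumes sizes are counted in characters and is only an illustration of the idea, not the package's FixedSizeSplitter implementation; the function name fixed_size_chunks is hypothetical.

```python
def fixed_size_chunks(text, chunk_size=500, chunk_overlap=100):
    """Split text into chunks of at most chunk_size characters, where each
    chunk starts chunk_overlap characters before the previous one ended."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# 1,200 characters of dummy text.
sample_text = "".join(str(i % 10) for i in range(1200))
chunks = fixed_size_chunks(sample_text)
print([len(c) for c in chunks])
```

Each chunk repeats the last 100 characters of the one before it, so a sentence that straddles a boundary still appears whole in at least one chunk. Larger overlaps reduce the chance of splitting an entity mention, at the cost of more LLM calls.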
Running the Knowledge Graph Builder
You can run the knowledge graph builder with the run_async method. We are going to iterate through 3 PDFs below.
pdf_file_paths = ['truncated-pdfs/biomolecules-11-00928-v2-trunc.pdf',
                  'truncated-pdfs/GAP-between-patients-and-clinicians_2023_Best-Practice-trunc.pdf',
                  'truncated-pdfs/pgpm-13-39-trunc.pdf']
for path in pdf_file_paths:
    print(f"Processing: {path}")
    pdf_result = await kg_builder_pdf.run_async(file_path=path)
    print(f"Result: {pdf_result}")
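The loop above uses a top-level await, which works in a notebook (Jupyter/IPython) but is a syntax error in a plain Python script. A minimal sketch of the script version wraps the loop in a coroutine and hands it to asyncio.run. FakeBuilder is a hypothetical stand-in so the example runs without Neo4j or an LLM; in practice you would pass kg_builder_pdf and pdf_file_paths instead.

```python
import asyncio


async def build_graph(kg_builder, paths):
    """Run the builder on each file in turn and collect the results."""
    results = []
    for path in paths:
        print(f"Processing: {path}")
        results.append(await kg_builder.run_async(file_path=path))
    return results


class FakeBuilder:
    """Hypothetical stand-in exposing the same run_async(file_path=...) call."""

    async def run_async(self, file_path):
        return f"done: {file_path}"


# In a real script: asyncio.run(build_graph(kg_builder_pdf, pdf_file_paths))
results = asyncio.run(build_graph(FakeBuilder(), ["a.pdf", "b.pdf"]))
print(results)
```

The files are still processed one after another here, which keeps the behavior the same as the notebook loop.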
Once complete, you can explore the resulting knowledge graph. The Unified Console provides a great interface for this.
Go to the Query tab and enter the query below to see a sample of the graph.
MATCH p=()-->() RETURN p LIMIT 1000;