Integration: Neo4j
Use the Neo4j database with Haystack
Overview
An integration of the Neo4j graph database with Haystack v2.0 by deepset. Neo4j's vector search index is used for indexing document embeddings and for dense retrieval.
The library allows using Neo4j as a DocumentStore and implements the required Protocol methods. You can start working with the implementation by importing it from the neo4j_haystack package:
from neo4j_haystack import Neo4jDocumentStore
In addition to Neo4jDocumentStore, the library includes the following Haystack components, which can be used in a pipeline:
- Neo4jEmbeddingRetriever - a typical retriever component which can be used to query the vector store index and find related Documents. The component uses Neo4jDocumentStore to query embeddings.
- Neo4jDynamicDocumentRetriever - also a retriever component, in the sense that it can be used to query Documents in Neo4j. However, it is decoupled from Neo4jDocumentStore and can run an arbitrary Cypher query to extract documents. In practice, it can query Neo4j the same way Neo4jDocumentStore does, including vector search.
The neo4j-haystack library uses the Python Driver and Cypher Queries to interact with the Neo4j database and hides that complexity under the hood.
Neo4jDocumentStore stores Documents as graph nodes in Neo4j. Embeddings are stored as part of the node, but indexing and querying of vector embeddings using ANN is managed by a dedicated Vector Index.
                                   +-----------------------------+
                                   |       Neo4j Database        |
                                   +-----------------------------+
                                   |                             |
                                   |      +----------------+     |
                                   |      |    Document    |     |
                write_documents    |      +----------------+     |
          +------------------------+----->|   properties   |     |
          |                        |      |                |     |
+---------+----------+             |      |   embedding    |     |
|                    |             |      +--------+-------+     |
| Neo4jDocumentStore |             |               |             |
|                    |             |               |index/query  |
+---------+----------+             |               |             |
          |                        |      +--------+--------+    |
          |                        |      |  Vector Index   |    |
          +----------------------->|      |                 |    |
                query_embeddings   |      | (for embedding) |    |
                                   |      +-----------------+    |
                                   |                             |
                                   +-----------------------------+
In the above diagram:
- Document is a Neo4j node (with the “Document” label).
- properties are Document attributes stored as part of the node.
- embedding is also a property of the Document node (shown separately in the diagram for clarity); it is a vector of type LIST[FLOAT].
- Vector Index is where Neo4j indexes embeddings as soon as they are updated in Document nodes.
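To make the role of the Vector Index more concrete, the sketch below ranks a few toy embeddings against a query vector in plain Python. It is a brute-force illustration only: Neo4j's index performs approximate nearest neighbour search inside the database. Cosine similarity is assumed here, and the helper names (cosine_similarity, top_k_by_similarity) and the three-dimensional vectors are made up for this example.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_similarity(
    query: List[float], embeddings: Dict[str, List[float]], top_k: int
) -> List[Tuple[str, float]]:
    """Exhaustively score every embedding and keep the top_k most similar ones."""
    scored = [(doc_id, cosine_similarity(query, emb)) for doc_id, emb in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical document ids and embeddings, for illustration only.
toy_embeddings = {
    "doc-paris": [0.9, 0.1, 0.0],
    "doc-berlin": [0.7, 0.3, 0.1],
    "doc-recipes": [0.0, 0.2, 0.9],
}
results = top_k_by_similarity([1.0, 0.0, 0.0], toy_embeddings, top_k=2)
print(results)
```

An exhaustive scan like this grows linearly with the number of documents, which is the cost a dedicated vector index is meant to avoid.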
Installation
neo4j-haystack can be installed like any other Python library, using pip:
pip install --upgrade pip # optional
pip install sentence-transformers # required in order to run pipeline examples given below
pip install neo4j-haystack
Usage
Once installed, you can start using Neo4jDocumentStore like any other document store that supports embeddings.
from neo4j_haystack import Neo4jDocumentStore
document_store = Neo4jDocumentStore(
    url="bolt://localhost:7687",
    username="neo4j",
    password="passw0rd",
    database="neo4j",
    embedding_dim=384,
    embedding_field="embedding",
    index="document-embeddings",  # The name of the Vector Index in Neo4j
    node_label="Document",  # Providing a label to Neo4j nodes which store Documents
)
Assuming a list of documents is available, you can write (index) them in Neo4j, e.g.:
documents: List[Document] = ...
document_store.write_documents(documents)
The full list of parameters accepted by Neo4jDocumentStore can be found in the API documentation.
Note that you need a running instance of the Neo4j database (an in-memory version of Neo4j is not supported). Several options are available:
- Docker, other options available in the same Operations Manual
- AuraDB - a fully managed Cloud Instance of Neo4j
- Neo4j Desktop client application
The simplest way to start the database locally is with a Docker container:
docker run \
    --restart always \
    --publish=7474:7474 --publish=7687:7687 \
    --env NEO4J_AUTH=neo4j/passw0rd \
    neo4j:5.15.0
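After starting the container it can be handy to check that the Bolt port published above (7687) is accepting connections before creating a Neo4jDocumentStore. The following is a minimal sketch using only the Python standard library; the helper name is_port_open is hypothetical. A successful TCP connection only shows that something is listening on the port, not that the database has finished starting or that the credentials are valid.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


# 7687 is the Bolt port published by the docker command above.
print("Bolt port reachable:", is_port_open("localhost", 7687))
```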
Retrieving documents
The Neo4jEmbeddingRetriever component retrieves documents from Neo4j by querying the vector index with an embedded query. Below is a pipeline which finds documents using a query embedding combined with metadata filtering:
from typing import List
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from neo4j_haystack import Neo4jEmbeddingRetriever, Neo4jDocumentStore
document_store = Neo4jDocumentStore(
    url="bolt://localhost:7687",
    username="neo4j",
    password="passw0rd",
    database="neo4j",
    embedding_dim=384,
    index="document-embeddings",
)
documents = [
    Document(content="My name is Morgan and I live in Paris.", meta={"release_date": "2018-12-09"}),
]
document_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)
document_store.write_documents(documents_with_embeddings.get("documents"))
print(document_store.count_documents())
pipeline = Pipeline()
pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
pipeline.add_component("retriever", Neo4jEmbeddingRetriever(document_store=document_store))
pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
result = pipeline.run(
    data={
        "text_embedder": {"text": "What cities do people live in?"},
        "retriever": {
            "top_k": 5,
            "filters": {"field": "release_date", "operator": "==", "value": "2018-12-09"},
        },
    }
)
documents: List[Document] = result["retriever"]["documents"]
>>> output:
`[Document(id=e765764ab700b231db1eeae208d6a59047b4b93712d1a9e379ae9599128ffdbd, content: 'My name is Morgan and I live in Paris.', meta: {'release_date': '2018-12-09'}, score: 0.8416308164596558)]`
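The filters value passed to the retriever above is a single comparison made of field, operator and value keys. As a rough mental model only, the sketch below applies such a comparison to plain metadata dictionaries in Python. The helper matches_filter and the set of operators it supports are assumptions made for this illustration; in the real integration the filtering is expected to happen on the Neo4j side as part of the query, not in Python.

```python
import operator
from typing import Any, Dict

# Assumed subset of comparison operators, for illustration only.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches_filter(meta: Dict[str, Any], condition: Dict[str, Any]) -> bool:
    """Check one {"field", "operator", "value"} condition against a metadata dict."""
    if condition["field"] not in meta:
        # A document without the field never matches in this sketch.
        return False
    compare = OPERATORS[condition["operator"]]
    return compare(meta[condition["field"]], condition["value"])


# Toy metadata dictionaries standing in for Document.meta values.
metas = [
    {"release_date": "2018-12-09"},
    {"release_date": "2020-01-01"},
    {},
]
condition = {"field": "release_date", "operator": "==", "value": "2018-12-09"}
matching = [meta for meta in metas if matches_filter(meta, condition)]
print(matching)
```

Dates are compared as strings here, which orders correctly only because they are written in ISO format (YYYY-MM-DD).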
Retrieving documents using Cypher
Neo4jDynamicDocumentRetriever is a flexible retriever component which runs a Cypher query to obtain documents. The Neo4jEmbeddingRetriever example above could be rewritten without using Neo4jDocumentStore:
from typing import List
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from neo4j_haystack import Neo4jClientConfig, Neo4jDynamicDocumentRetriever
client_config = Neo4jClientConfig(
    url="bolt://localhost:7687",
    username="neo4j",
    password="passw0rd",
    database="neo4j",
)
cypher_query = """
CALL db.index.vector.queryNodes($index, $top_k, $query_embedding)
YIELD node as doc, score
MATCH (doc) WHERE doc.release_date = $release_date
RETURN doc{.*, score}, score
ORDER BY score DESC LIMIT $top_k
"""
embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
retriever = Neo4jDynamicDocumentRetriever(
    client_config=client_config, runtime_parameters=["query_embedding"], doc_node_name="doc"
)
pipeline = Pipeline()
pipeline.add_component("text_embedder", embedder)
pipeline.add_component("retriever", retriever)
pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
result = pipeline.run(
    data={
        "text_embedder": {"text": "What cities do people live in?"},
        "retriever": {
            "query": cypher_query,
            "parameters": {"index": "document-embeddings", "top_k": 5, "release_date": "2018-12-09"},
        },
    }
)
documents: List[Document] = result["retriever"]["documents"]
Note how query parameters are used in cypher_query:
- runtime_parameters is a list of parameter names which become input slots when connecting components in a pipeline. In this case the query_embedding input is connected to the text_embedder.embedding output.
- pipeline.run specifies additional parameters for the retriever component which can be referenced in cypher_query, e.g. top_k.
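Conceptually, the Cypher query ends up with one combined set of parameters: the values passed under parameters in pipeline.run plus the values arriving through the runtime input slots. The sketch below illustrates that idea in plain Python. It is not the library's internal code; build_query_parameters is a hypothetical helper, and letting runtime inputs override static values with the same name is an assumption of this sketch.

```python
from typing import Any, Dict, List


def build_query_parameters(
    static_parameters: Dict[str, Any],
    runtime_parameters: List[str],
    runtime_inputs: Dict[str, Any],
) -> Dict[str, Any]:
    """Combine static parameters with values received through runtime input slots."""
    missing = [name for name in runtime_parameters if name not in runtime_inputs]
    if missing:
        raise ValueError(f"Missing runtime parameters: {missing}")
    merged = dict(static_parameters)
    # Assumption: a runtime input wins over a static parameter of the same name.
    merged.update({name: runtime_inputs[name] for name in runtime_parameters})
    return merged


params = build_query_parameters(
    static_parameters={"index": "document-embeddings", "top_k": 5, "release_date": "2018-12-09"},
    runtime_parameters=["query_embedding"],
    runtime_inputs={"query_embedding": [0.1, 0.2, 0.3]},  # toy embedding
)
print(sorted(params))
```

Each key of the resulting dictionary corresponds to a $name placeholder in the query, such as $index, $top_k, $release_date and $query_embedding.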
More examples
You can find more examples in the implementation repository:
- indexing_pipeline.py - Indexing text files (documents) from a remote http location.
- rag_pipeline.py - Generative question answering RAG pipeline using Neo4jEmbeddingRetriever to fetch documents from the Neo4j document store and answer questions using HuggingFaceTGIGenerator.
- rag_pipeline_cypher.py - Same as rag_pipeline.py but using Neo4jDynamicDocumentRetriever.
You can find more technical details in the Code Reference documentation. For example, real-world deployments may need to tune connection settings for the Neo4j database (e.g. request timeout). Neo4jDocumentStore accepts an extended client configuration via the Neo4jClientConfig class.
License
neo4j-haystack is distributed under the terms of the MIT license.