RAG Without Vectors: How PageIndex Retrieves by Reasoning
Retrieval is where most RAG methods quietly break. Traditional pipelines depend on vector similarity: embedding queries and document chunks into the same space and fetching the "closest" matches. But similarity is a weak proxy for what we really want, which is relevance grounded in reasoning. In long, professional documents like financial reports, research papers, or legal texts, the right answer often isn't in the most semantically similar paragraph. Finding it requires navigating structure, understanding context, and performing multi-step reasoning across sections. This is exactly where vector-based RAG starts to fall apart.
PageIndex is designed to close this gap by rethinking retrieval from first principles. Instead of chunking documents and searching over embeddings, it builds a hierarchical, table-of-contents-style tree index and uses LLMs to reason over that structure, much like a human expert scanning sections, drilling down, and connecting ideas. The result is a vectorless, reasoning-driven retrieval process that is more interpretable, traceable, and aligned with how knowledge is actually extracted from complex documents. By replacing similarity search with structured exploration and tree-based reasoning, PageIndex delivers significantly higher retrieval accuracy, demonstrated by its strong performance on benchmarks like FinanceBench, making it particularly effective in domains that demand precision and deep understanding.

In this article, we'll use PageIndex to index the seminal Transformer paper, "Attention Is All You Need," and run two cross-cutting queries against it without a single vector or embedding. Instead of chunking the PDF and retrieving by similarity, PageIndex builds a hierarchical tree of the document's sections, then uses GPT-5.4 to reason over node summaries and identify exactly which sections contain the answer, before reading a single word of full text.
Setting up the dependencies
For this tutorial, you'll need PageIndex and OpenAI API keys. You can get them from https://dash.pageindex.ai/api-keys and https://platform.openai.com/api-keys respectively.
pip install pageindex openai requests
from pageindex import PageIndexClient
import pageindex.utils as utils
import os
from getpass import getpass
PAGEINDEX_API_KEY = getpass('Enter PageIndex API Key: ')
pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)
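If you'd rather not paste keys on every run, a minimal alternative is to read them from environment variables first and fall back to the interactive prompt. This is a sketch; the variable names PAGEINDEX_API_KEY and OPENAI_API_KEY are our own convention, not a requirement of either SDK:
import os
# Prefer environment variables; prompt interactively only when they are unset.
PAGEINDEX_API_KEY = os.environ.get("PAGEINDEX_API_KEY") or getpass('Enter PageIndex API Key: ')
pi_client = PageIndexClient(api_key=PAGEINDEX_API_KEY)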
We import the OpenAI client and configure it with an API key to enable access to LLMs. Then we define an asynchronous helper function that sends a prompt to the model and returns the generated response.
import openai
OPENAI_API_KEY = getpass('Enter OpenAI API Key: ')
async def call_llm(prompt, model="gpt-5.4", temperature=0):
    client = openai.AsyncOpenAI(api_key=OPENAI_API_KEY)
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature
    )
    return response.choices[0].message.content.strip()
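As a quick sanity check of the helper, you can send a trivial prompt (our own test string, nothing special about it). In a notebook you can simply await call_llm(...) in a cell; in a plain script, wrap it with asyncio.run:
import asyncio

async def smoke_test():
    # Confirms the API key and model name are accepted before we do real work.
    print(await call_llm("Reply with the single word: ready"))

asyncio.run(smoke_test())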
Building the PageIndex Tree
In this chunk, we download the Transformer paper directly from arXiv and submit it to PageIndex, which processes the PDF and builds a hierarchical tree of its sections, with each node storing a title, a summary, and the full section text. Once the tree is ready, we print it out to inspect the structure PageIndex has inferred: every chapter, subsection, and nested heading becomes a node in the tree, preserving the document's natural organization exactly as the authors intended.
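Based on the fields this walkthrough actually reads later (node_id, title, page_index, summary, text), each node is a dict along these lines. This is an illustrative sketch with made-up values, not the official PageIndex schema, and the "nodes" key for children is an assumption:
example_node = {
    "node_id": "0006",             # identifier the LLM will return during retrieval
    "title": "3.2 Attention",      # section heading taken from the paper
    "page_index": 3,               # page where the section starts
    "summary": "Describes scaled dot-product and multi-head attention ...",
    "text": "An attention function can be described as mapping a query ...",
    "nodes": [],                   # child sections, if any (assumed key name)
}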
# ─────────────────────────────────────────────
# Step 1: Build the PageIndex Tree
# ─────────────────────────────────────────────
# 1.1 Download the Transformer paper and submit it
import os, requests
pdf_url = "https://arxiv.org/pdf/1706.03762.pdf"
pdf_path = os.path.join("data", pdf_url.split("/")[-1])
os.makedirs("data", exist_ok=True)
print("Downloading 'Attention Is All You Need'...")
response = requests.get(pdf_url)
with open(pdf_path, "wb") as f:
    f.write(response.content)
print(f"\nSaved to {pdf_path}")
doc_id = pi_client.submit_document(pdf_path)["doc_id"]
print(f"\nDocument submitted. doc_id: {doc_id}")
# 1.2 Retrieve the tree (poll until ready)
import time
print("nWaiting for PageIndex tree to be prepared", finish="")
whereas not pi_client.is_retrieval_ready(doc_id):
print(".", finish="", flush=True)
time.sleep(5)
tree = pi_client.get_tree(doc_id, node_summary=True)["result"]
print("nn
Document Tree Structure:")
utils.print_tree(tree)
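To get a feel for the index size, you can also walk the tree yourself. A small sketch under the node layout assumed earlier (children under a "nodes" key); we also hedge on whether the API returns a single root dict or a list of top-level nodes:
def walk(node, depth=0):
    """Yield (depth, node) pairs for every node in the tree."""
    yield depth, node
    for child in node.get("nodes", []):
        yield from walk(child, depth + 1)

roots = tree if isinstance(tree, list) else [tree]
all_nodes = [n for root in roots for _, n in walk(root)]
print(f"Indexed {len(all_nodes)} sections")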

Reasoning-Based Retrieval
With the tree built, we now run a query that is deliberately cross-cutting, one that can't be answered by a single section of the paper. We strip the full text from every node, leaving only titles and summaries, and pass the entire tree structure to GPT-5.4. The model then reasons over these summaries to identify every node likely to contain a relevant answer, returning both its step-by-step thinking and a list of matched node IDs. This is the core of what makes PageIndex different: the LLM decides where to look before any full text is loaded.
# ─────────────────────────────────────────────
# Step 2: Reasoning-Based Retrieval
# ─────────────────────────────────────────────
# 2.1 Define a query that requires navigating across sections
import json
# This query is deliberately cross-cutting -- it can't be answered
# by a single section, which is where tree search shines over top-k.
query = "Why did the authors choose self-attention over recurrence, and what are the complexity trade-offs they compared?"
tree_without_text = utils.remove_fields(tree.copy(), fields=["text"])
search_prompt = f"""
You are given a query and a hierarchical tree construction of a analysis paper.
Each node has a node_id, title, and a abstract of its content material.
Your job: establish ALL nodes which might be more likely to include info related to answering the query.
Think rigorously -- the reply could also be unfold throughout a number of sections.
Question: {question}
Document tree:
{json.dumps(tree_without_text, indent=2)}
Reply ONLY on this JSON format, no preamble:
{{
"considering": "<step-by-step reasoning about which nodes are related and why>",
"node_list": ["node_id_1", "node_id_2", ...]
}}
"""
print(f'
Query: "{question}"n')
print("Running tree search with GPT-5.4...")
tree_search_result = await call_llm(search_prompt)
# 2.2 Inspect the retrieval reasoning and matched nodes
node_map = utils.create_node_mapping(tree)
result_json = json.loads(tree_search_result)
print("\nLLM Reasoning:")
utils.print_wrapped(result_json["thinking"])
print("\nRetrieved Nodes:")
for node_id in result_json["node_list"]:
    node = node_map[node_id]
    print(f" • [{node['node_id']}] Page {node['page_index']:>2} -- {node['title']}")

Answer Generation
Once the relevant nodes are identified, we pull their full text and stitch it together into a single context block, with each section clearly labeled so the model knows where every piece of information comes from. That combined context is then passed to GPT-5.4 with a structured prompt that asks for the core motivation, the exact complexity numbers, and any caveats the authors acknowledged. The model answers using only what was retrieved, grounding every claim directly in the paper's text.
# ─────────────────────────────────────────────
# Step 3: Answer Generation
# ─────────────────────────────────────────────
# 3.1 Stitch together context from all retrieved nodes
node_list = result_json["node_list"]
relevant_content = "nn---nn".be part of(
f"[Section: {node_map[nid]['title']}]n{node_map[nid]['text']}"
for nid in node_list
)
print(f"n
Retrieved Context Preview (first 1200 chars):n")
utils.print_wrapped(relevant_content[:1200] + "...n")
# 3.2 Generate a structured answer grounded in the retrieved sections
answer_prompt = f"""
You are a technical assistant. Answer the query under utilizing ONLY the offered context.
Be particular -- reference precise design decisions, numbers, and trade-offs talked about within the textual content.
Question: {question}
Context:
{relevant_content}
Structure your reply as:
1. The core motivation for selecting self-attention
2. The particular complexity comparisons made (embody any tables or numbers)
3. Any caveats or limitations the authors acknowledged
"""
print("
Generating reply...n")
reply = await call_llm(answer_prompt)
print("─" * 60)
print("
Final Answer:n")
utils.print_wrapped(reply)
print("─" * 60)


Testing with a Second Query
To show that the tree is built once and reused at no extra cost, we run a second query, this time targeting a localized mechanism rather than a cross-cutting design decision. The same tree structure is passed to GPT-5.4, which narrows its search to just the attention subsections, retrieves their full text, and generates a clean explanation of how multi-head attention works and why the scaling factor matters. No re-indexing, no re-embedding, just a new question against the same tree.
query2 = "How does the multi-head consideration mechanism work, and what's the function of scaling in dot-product consideration?"
search_prompt2 = f"""
You are given a query and a hierarchical tree construction of a analysis paper.
Identify all nodes more likely to include the reply.
Question: {query2}
Document tree:
{json.dumps(tree_without_text, indent=2)}
Reply ONLY on this JSON format:
{{
"considering": "<reasoning>",
"node_list": ["node_id_1", ...]
}}
"""
print(f'nn
Second Query: "{query2}"n')
result2_raw = await call_llm(search_prompt2)
result2 = json.loads(result2_raw)
print("\nReasoning:")
utils.print_wrapped(result2["thinking"])
relevant_content2 = "\n\n---\n\n".join(
    f"[Section: {node_map[nid]['title']}]\n{node_map[nid]['text']}"
    for nid in result2["node_list"]
)
answer_prompt2 = f"""
Answer the next query utilizing ONLY the offered context.
Explain the mechanism clearly, as if for a technical weblog put up.
Question: {query2}
Context: {relevant_content2}
"""
answer2 = await call_llm(answer_prompt2)
print("n
Answer:n")
utils.print_wrapped(answer2)
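Since the tree is built once, the whole search-then-answer loop is easy to fold into a single helper you can call with any new question. A sketch reusing the helpers defined above; the shortened prompts and the example question at the end are our own, not from the PageIndex docs:
async def ask(question: str) -> str:
    """Tree search over node summaries, then answer from the retrieved full text."""
    search_raw = await call_llm(
        "You are given a question and a document tree with node summaries.\n"
        "Identify all nodes likely to contain the answer.\n"
        f"Question: {question}\n"
        f"Document tree:\n{json.dumps(tree_without_text, indent=2)}\n"
        'Reply ONLY as JSON: {"thinking": "...", "node_list": ["..."]}'
    )
    node_ids = json.loads(search_raw)["node_list"]
    context = "\n\n---\n\n".join(
        f"[Section: {node_map[nid]['title']}]\n{node_map[nid]['text']}"
        for nid in node_ids
    )
    return await call_llm(
        f"Answer using ONLY the provided context.\nQuestion: {question}\nContext:\n{context}"
    )

# Usage (in a notebook): print(await ask("What regularization techniques did the authors use?"))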

