Ingestion
InitRunner's ingestion pipeline extracts text from source files, splits it into chunks, generates embeddings, and stores vectors in a local LanceDB database. Once ingested, an agent can search documents at runtime via the auto-registered search_documents tool.
Quick Start
```yaml
apiVersion: initrunner/v1
kind: Agent
metadata:
  name: kb-agent
  description: Knowledge base agent
spec:
  role: |
    You are a knowledge assistant. Use search_documents to find relevant
    content before answering. Always cite your sources.
  model:
    provider: openai
    name: gpt-4o-mini
  ingest:
    sources:
      - "./docs/**/*.md"
      - "./knowledge-base/**/*.txt"
    chunking:
      strategy: fixed
      chunk_size: 512
      chunk_overlap: 50
```

```bash
# Ingest documents
initrunner ingest role.yaml

# Run the agent (search_documents is auto-registered)
initrunner run role.yaml -p "What does the onboarding guide say?"
```

Walkthrough: Build a Knowledge Base Agent
This walkthrough builds a complete RAG agent from scratch — set up docs, configure the agent, ingest, and query.
1. Set up your docs directory
```bash
mkdir -p docs
# Add your markdown files to ./docs/
```

2. Create the agent
```yaml
apiVersion: initrunner/v1
kind: Agent
metadata:
  name: rag-agent
  description: Knowledge base Q&A agent with document ingestion
spec:
  role: |
    You are a helpful documentation assistant. You answer user questions
    using the ingested knowledge base.

    Rules:
    - ALWAYS call search_documents before answering a question
    - Base your answers only on information found in the documents
    - Cite the source document for each claim (e.g., "Per the Getting Started
      guide, ...")
    - If search_documents returns no relevant results, say so honestly rather
      than guessing
    - When a user asks about a topic covered across multiple documents,
      synthesize the information and cite all relevant sources
    - Use read_file to view a full document when the search snippet is not
      enough context
  model:
    provider: openai
    name: gpt-4o-mini
    temperature: 0.1
  ingest:
    sources:
      - ./docs/**/*.md
    chunking:
      strategy: paragraph
      chunk_size: 512
      chunk_overlap: 50
    embeddings:
      provider: openai
      model: text-embedding-3-small
  tools:
    - type: filesystem
      root_path: ./docs
      read_only: true
      allowed_extensions:
        - .md
  guardrails:
    max_tokens_per_run: 30000
    max_tool_calls: 15
    timeout_seconds: 120
```

Why `paragraph` chunking? It splits on double newlines first, then merges small paragraphs until `chunk_size` is reached. This preserves natural document structure — a paragraph about "installation" stays together instead of being split mid-sentence. Use `fixed` for code files and logs where structure doesn't matter.
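That split-then-merge behavior can be sketched in a few lines of Python. This is a simplified illustration, not InitRunner's actual implementation; in particular, a single paragraph longer than `chunk_size` is kept intact here rather than sub-split:

```python
def paragraph_chunks(text: str, chunk_size: int = 512) -> list[str]:
    """Split on double newlines, then merge small paragraphs up to chunk_size."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= chunk_size:
            current = candidate        # still fits: keep merging
        else:
            if current:
                chunks.append(current) # flush the merged run
            current = para             # start a new chunk
    if current:
        chunks.append(current)
    return chunks
```

Because merging stops at paragraph boundaries, a short "installation" paragraph never gets glued mid-sentence to the chunk before it.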
3. Ingest the documents
```bash
initrunner ingest rag-agent.yaml
```

```
Resolving sources...
  ./docs/**/*.md → 4 files
Extracting text...
  docs/getting-started.md (2,847 chars)
  docs/faq.md (3,214 chars)
  docs/api-reference.md (5,102 chars)
  docs/changelog.md (1,456 chars)
Chunking (paragraph, size=512, overlap=50)...
  → 28 chunks
Embedding with openai:text-embedding-3-small...
  → 28 embeddings
Stored in ~/.initrunner/stores/rag-agent.lance
```

4. Query the agent

```bash
initrunner run rag-agent.yaml -p "How do I create a database?"
```

The agent calls search_documents("create database"), gets matching chunks with source file names and similarity scores, then answers with citations.
5. Re-index when docs change
Since v2026.4.10, re-indexing happens automatically on the next initrunner run when any source file has been added, modified, or removed. The check uses an mtime fast-path, so it's cheap enough to run every time.
```bash
# Auto-reindex kicks in on the next run
initrunner run rag-agent.yaml -p "What changed?"

# Manual rebuild — still useful for refreshing URL sources or forcing
# a rebuild when timestamps were preserved (e.g. after `cp -p`)
initrunner ingest rag-agent.yaml
initrunner ingest rag-agent.yaml --force
```

To opt out of automatic re-indexing, set ingest.auto: false in the role YAML.
See the Examples page for the complete RAG agent with sample docs.
Pipeline
- Resolve sources — Glob patterns are expanded into file paths relative to the role file's directory.
- Extract text — Each file is passed through a format-specific extractor.
- Chunk text — Extracted text is split into overlapping chunks.
- Embed — Chunks are converted to vector embeddings.
- Store — Embeddings and text are stored in LanceDB.
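As a rough end-to-end sketch (all names here are illustrative stand-ins, not InitRunner's API), the five stages compose like this:

```python
from pathlib import Path

def ingest(role_dir: str, patterns: list[str], chunk, embed, store) -> int:
    """Minimal sketch of the pipeline: resolve, extract, chunk, embed, store."""
    root = Path(role_dir)
    files = [p for pat in patterns for p in sorted(root.glob(pat))]  # 1. resolve
    count = 0
    for path in files:
        text = path.read_text(encoding="utf-8")            # 2. extract (plain text only)
        for chunk_text in chunk(text):                     # 3. chunk
            vector = embed(chunk_text)                     # 4. embed
            store.append((str(path), chunk_text, vector))  # 5. store
            count += 1
    return count
```

The real pipeline dispatches step 2 to format-specific extractors and writes step 5 to LanceDB rather than a Python list.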
Configuration
| Field | Type | Default | Description |
|---|---|---|---|
| sources | list[str] | (required) | Glob patterns for source files |
| auto | bool | true | Auto-reindex on every initrunner run when sources have changed (since v2026.4.10). Set to false for manual-only. |
| watch | bool | false | Reserved for future use |
| chunking.strategy | str | "fixed" | "fixed" or "paragraph" |
| chunking.chunk_size | int | 512 | Maximum chunk size in characters |
| chunking.chunk_overlap | int | 50 | Overlapping characters between chunks |
| embeddings.provider | str | "" | Embedding provider (empty = derives from model) |
| embeddings.model | str | "" | Embedding model (empty = provider default) |
| embeddings.api_key_env | str | "" | Env var name holding the embedding API key. When empty, the default for the resolved provider is used (OPENAI_API_KEY for OpenAI/Anthropic, GOOGLE_API_KEY for Google). |
| store_backend | str | "lancedb" | Vector store backend |
| store_path | str \| null | null | Custom path (default: ~/.initrunner/stores/<agent-name>.lance) |
Chunking Strategies
Fixed (strategy: fixed)
Splits text into fixed-size character windows with overlap. Best for uniform document types, code files, and logs.
Paragraph (strategy: paragraph)
Splits on double newlines first, then merges small paragraphs until chunk_size is reached. Preserves natural document structure. Best for prose, markdown, and documentation.
Choosing a Strategy and Parameters
- Use `paragraph` for prose, markdown, and documentation — it preserves natural boundaries so a paragraph about "installation" stays together.
- Use `fixed` for code files, logs, and machine-generated text where structure doesn't carry semantic meaning.
chunk_size rules of thumb:
| Use Case | Recommended chunk_size |
|---|---|
| Short-answer Q&A | 256–512 |
| Dense technical content, long-form docs | 512–1024 |
chunk_overlap should be roughly 10% of chunk_size (e.g. 50 for a 512 chunk). Overlap ensures that information spanning a boundary is present in at least one chunk.
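To make the overlap rule concrete, here is a minimal sketch of fixed-window chunking (illustrative only; the real extractor may handle edge cases differently). Each window restarts `chunk_overlap` characters before the previous one ended, so text spanning a boundary appears in both neighbors:

```python
def fixed_chunks(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Fixed-size character windows with overlap between consecutive chunks."""
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must exceed chunk_overlap")
    step = chunk_size - chunk_overlap  # e.g. 462 for the 512/50 defaults
    # Stop once the remaining tail is fully covered by the previous window.
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

With the 512/50 defaults, a 1,000-character document yields three chunks, and the last 50 characters of each chunk reappear at the start of the next.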
Recommendations by Document Type
| Document type | Strategy | chunk_size | chunk_overlap | Notes |
|---|---|---|---|---|
| Markdown / articles | paragraph | 512 | 50 | Preserves natural paragraph boundaries |
| Code files | fixed | 1024 | 100 | Larger windows keep function context together |
| API references | paragraph | 256 | 50 | Short, dense entries benefit from smaller chunks |
| CSV / tabular data | fixed | 1024 | 0 | No overlap — rows must not be split across chunks |
| PDFs | fixed | 512–1024 | 50–100 | PDF layout varies; fixed chunking is more predictable |
Supported File Formats
Core Formats (always available)
| Extension | Extractor |
|---|---|
| .txt | Plain text (UTF-8) |
| .md | Plain text (UTF-8) |
| .rst | Plain text (UTF-8) |
| .csv | CSV rows joined with commas and newlines |
| .json | Pretty-printed JSON |
| .html, .htm | HTML to Markdown (scripts/styles removed) |
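As a rough illustration of how the core extractors differ (a simplified sketch, not InitRunner's code; the real extractors also handle encodings and malformed input), dispatch by extension might look like:

```python
import csv
import io
import json

def extract_text(path: str, raw: str) -> str:
    """Core-format extraction sketch: CSV rows rejoined with commas and
    newlines, JSON pretty-printed, everything else passed through as-is."""
    if path.endswith(".csv"):
        rows = csv.reader(io.StringIO(raw))
        return "\n".join(",".join(row) for row in rows)
    if path.endswith(".json"):
        return json.dumps(json.loads(raw), indent=2)
    return raw  # .txt / .md / .rst: plain text (UTF-8)
```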
Optional Formats (pip install initrunner[ingest])
| Extension | Extractor | Library |
|---|---|---|
| .pdf | PDF to Markdown | pymupdf4llm |
| .docx | Paragraphs joined with double newlines | python-docx |
| .xlsx | Sheets as CSV with title headers | openpyxl |
The search_documents Tool
When spec.ingest is configured, a search tool is auto-registered:
```
search_documents(query: str, top_k: int = 5, source: str | None = None) -> str
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | (required) | Natural-language search string (embedded and compared against stored chunks) |
| top_k | int | 5 | Number of results to return |
| source | str \| None | None | Glob pattern to filter results by source file path |
The tool creates an embedding from the query, searches the vector store for the most similar chunks, and returns results with source attribution and similarity scores.
Result format:
```
[1] (score: 0.87) ./docs/getting-started.md
    To create a new project, run `initrunner init`...

[2] (score: 0.82) ./docs/faq.md
    InitRunner supports multiple model providers...
```

Source filtering example:

```python
# Search only billing docs
search_documents("refund policy", source="*billing*")

# Search a specific file
search_documents("authentication", source="*/api-reference.md")
```

If no documents have been ingested, the tool returns a message directing you to run initrunner ingest.
Re-indexing
Since v2026.4.10, initrunner run checks source files for changes on every invocation and re-indexes automatically when anything has been added, modified, or removed. The check uses an mtime fast-path, so it's cheap. URLs already in the store are not re-fetched on auto runs, but new URLs added to the YAML are picked up.
To opt out, set ingest.auto: false in the role YAML. To force a full rebuild (for example, after a timestamp-preserving copy like cp -p, or when you want to refresh URL contents), run the manual command:
```bash
initrunner ingest role.yaml          # manual re-ingest, refreshes URL contents
initrunner ingest role.yaml --force  # authoritative rebuild
```

Running initrunner ingest is safe and idempotent:
- Resolves glob patterns to find current files.
- Deletes all existing chunks from each source file.
- Inserts new chunks from fresh extraction.
Files that no longer match the patterns have their chunks purged.
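The mtime fast-path can be sketched like this (an assumption-laden illustration: `indexed_mtimes` stands in for whatever per-file metadata the store actually records):

```python
from pathlib import Path

def needs_reindex(patterns: list[str], indexed_mtimes: dict[str, float],
                  root: str = ".") -> bool:
    """Reindex when any source file was added, modified, or removed
    since the last ingest. Cheap: one stat() per matched file."""
    current = {str(p): p.stat().st_mtime
               for pat in patterns for p in Path(root).glob(pat)}
    if current.keys() != indexed_mtimes.keys():
        return True  # a file was added or removed
    return any(current[p] > indexed_mtimes[p] for p in current)  # modified
```

This also shows why a timestamp-preserving copy like `cp -p` can slip past the check: the mtimes match even though the content changed, hence the `--force` escape hatch.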
Embedding Models
Provider resolution priority:
1. `ingest.embeddings.model` — if set, used directly
2. `ingest.embeddings.provider` — used to look up the default
3. `spec.model.provider` — falls back to agent's model provider
| Provider | Default Embedding Model |
|---|---|
| openai | openai:text-embedding-3-small |
| anthropic | openai:text-embedding-3-small |
| google | google:text-embedding-004 |
| ollama | ollama:nomic-embed-text |
Anthropic has no embeddings API. Agents using `provider: anthropic` fall back to `openai:text-embedding-3-small` by default (requires `OPENAI_API_KEY`). To avoid the OpenAI dependency, set `embeddings.provider: google` or `embeddings.provider: ollama`.
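The three-step priority above can be expressed as a small lookup (a sketch using the defaults table; function and argument names are illustrative, not InitRunner's API):

```python
DEFAULT_EMBEDDINGS = {
    "openai": "openai:text-embedding-3-small",
    "anthropic": "openai:text-embedding-3-small",  # no native embeddings API
    "google": "google:text-embedding-004",
    "ollama": "ollama:nomic-embed-text",
}

def resolve_embedding_model(embeddings_model: str, embeddings_provider: str,
                            agent_provider: str) -> str:
    """Apply the resolution priority: explicit model, then explicit
    embeddings provider, then the agent's model provider."""
    if embeddings_model:
        return embeddings_model                       # 1. explicit model wins
    provider = embeddings_provider or agent_provider  # 2. falls through to 3.
    return DEFAULT_EMBEDDINGS[provider]
```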
Scaffold
```bash
initrunner init --name kb-agent --template rag
```

Troubleshooting
No results from search_documents
- Documents not ingested — Run `initrunner ingest role.yaml` before querying. The tool returns a message if the store is empty.
- Query too specific — Try broader or rephrased queries. Embedding search is semantic, not keyword-exact.
- Wrong embedding model — If you changed the embedding model after ingesting, re-ingest so all vectors use the same model.
EmbeddingModelChangedError
Raised when the configured embedding model differs from the one used to create the existing store. Vectors from different models are incompatible. Fix by re-ingesting:
```bash
initrunner ingest role.yaml --force
```

Since v2026.4.10, this error also surfaces on automatic runs — swapping embeddings.model with otherwise-unchanged sources now triggers the same error and --force hint on the next initrunner run, not just on manual ingest.
DimensionMismatchError
The vector dimensions in the store don't match the current model's output dimensions. This usually happens when switching between embedding providers. Re-ingest with --force to rebuild the store.
Optional format extraction errors
If .pdf, .docx, or .xlsx files fail to extract, install the optional dependencies:
```bash
pip install "initrunner[ingest]"
```

This installs pymupdf4llm, python-docx, and openpyxl.
API key not set
Embedding keys are validated at startup. If the required key is missing you will see a clear error message identifying which variable to set.
| Provider | Required env var | Notes |
|---|---|---|
| openai | OPENAI_API_KEY | |
| anthropic | OPENAI_API_KEY | Anthropic has no native embeddings — falls back to OpenAI by default; set embeddings.provider to switch |
| google | GOOGLE_API_KEY | |
| ollama | (none) | Runs locally |
Override the variable name — if your key is stored under a non-default name, set embeddings.api_key_env in your ingest or memory config:
```yaml
spec:
  ingest:
    embeddings:
      provider: openai
      api_key_env: MY_EMBED_KEY  # read from MY_EMBED_KEY instead of OPENAI_API_KEY
```

Diagnose key issues with:

```bash
initrunner doctor
```

The Embedding Providers table shows which keys are set and which are missing.