Ingestion
InitRunner's ingestion pipeline extracts text from source files, splits it into chunks, generates embeddings, and stores vectors in a local LanceDB database. Once ingested, an agent can search documents at runtime via the auto-registered search_documents tool.
Quick Start
```yaml
apiVersion: initrunner/v1
kind: Agent
metadata:
  name: kb-agent
  description: Knowledge base agent
spec:
  role: |
    You are a knowledge assistant. Use search_documents to find relevant
    content before answering. Always cite your sources.
  model:
    provider: openai
    name: gpt-4o-mini
  ingest:
    sources:
      - "./docs/**/*.md"
      - "./knowledge-base/**/*.txt"
    chunking:
      strategy: fixed
      chunk_size: 512
      chunk_overlap: 50
```

```bash
# Ingest documents
initrunner ingest role.yaml

# Run the agent (search_documents is auto-registered)
initrunner run role.yaml -p "What does the onboarding guide say?"
```

Walkthrough: Build a Knowledge Base Agent
This walkthrough builds a complete RAG agent from scratch — set up docs, configure the agent, ingest, and query.
1. Set up your docs directory
```bash
mkdir -p docs
# Add your markdown files to ./docs/
```

2. Create the agent
```yaml
apiVersion: initrunner/v1
kind: Agent
metadata:
  name: rag-agent
  description: Knowledge base Q&A agent with document ingestion
spec:
  role: |
    You are a helpful documentation assistant. You answer user questions
    using the ingested knowledge base.

    Rules:
    - ALWAYS call search_documents before answering a question
    - Base your answers only on information found in the documents
    - Cite the source document for each claim (e.g., "Per the Getting Started
      guide, ...")
    - If search_documents returns no relevant results, say so honestly rather
      than guessing
    - When a user asks about a topic covered across multiple documents,
      synthesize the information and cite all relevant sources
    - Use read_file to view a full document when the search snippet is not
      enough context
  model:
    provider: openai
    name: gpt-4o-mini
    temperature: 0.1
  ingest:
    sources:
      - ./docs/**/*.md
    chunking:
      strategy: paragraph
      chunk_size: 512
      chunk_overlap: 50
    embeddings:
      provider: openai
      model: text-embedding-3-small
  tools:
    - type: filesystem
      root_path: ./docs
      read_only: true
      allowed_extensions:
        - .md
  guardrails:
    max_tokens_per_run: 30000
    max_tool_calls: 15
    timeout_seconds: 120
```

Why `paragraph` chunking? It splits on double newlines first, then merges small paragraphs until `chunk_size` is reached. This preserves natural document structure — a paragraph about "installation" stays together instead of being split mid-sentence. Use `fixed` for code files and logs where structure doesn't matter.
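That split-then-merge behavior can be sketched in a few lines of Python. This is a simplified illustration, not InitRunner's actual implementation; in particular, a single paragraph longer than `chunk_size` is kept intact here rather than sub-split:

```python
def paragraph_chunks(text: str, chunk_size: int = 512) -> list[str]:
    """Split on double newlines, then merge small paragraphs up to chunk_size."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= chunk_size:
            current = candidate        # still fits: keep merging
        else:
            if current:
                chunks.append(current) # flush the merged run
            current = para             # start a new chunk
    if current:
        chunks.append(current)
    return chunks
```

Because merging stops at paragraph boundaries, a short "installation" paragraph never gets glued mid-sentence to the chunk before it.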
3. Ingest the documents
```bash
initrunner ingest rag-agent.yaml
```

```
Resolving sources...
  ./docs/**/*.md → 4 files
Extracting text...
  docs/getting-started.md (2,847 chars)
  docs/faq.md (3,214 chars)
  docs/api-reference.md (5,102 chars)
  docs/changelog.md (1,456 chars)
Chunking (paragraph, size=512, overlap=50)...
  → 28 chunks
Embedding with openai:text-embedding-3-small...
  → 28 embeddings
Stored in ~/.initrunner/stores/rag-agent.lance
```

4. Query the agent

```bash
initrunner run rag-agent.yaml -p "How do I create a database?"
```

The agent calls search_documents("create database"), gets matching chunks with source file names and similarity scores, then answers with citations.
5. Re-index when docs change
Since v2026.4.10, re-indexing happens automatically on the next initrunner run when any source file has been added, modified, or removed. The check uses an mtime fast-path, so it's cheap enough to run every time.
```bash
# Auto-reindex kicks in on the next run
initrunner run rag-agent.yaml -p "What changed?"

# Manual rebuild — still useful for refreshing URL sources or forcing
# a rebuild when timestamps were preserved (e.g. after `cp -p`)
initrunner ingest rag-agent.yaml
initrunner ingest rag-agent.yaml --force
```

To opt out of automatic re-indexing, set ingest.auto: false in the role YAML.
See the Examples page for the complete RAG agent with sample docs.
Pipeline
- Resolve sources — Glob patterns are expanded into file paths relative to the role file's directory.
- Extract text — Each file is passed through a format-specific extractor.
- Chunk text — Extracted text is split into overlapping chunks.
- Embed — Chunks are converted to vector embeddings.
- Store — Embeddings and text are stored in LanceDB.
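As a rough end-to-end sketch (all names here are illustrative stand-ins, not InitRunner's API), the five stages compose like this:

```python
from pathlib import Path

def ingest(role_dir: str, patterns: list[str], chunk, embed, store) -> int:
    """Minimal sketch of the pipeline: resolve, extract, chunk, embed, store."""
    root = Path(role_dir)
    files = [p for pat in patterns for p in sorted(root.glob(pat))]  # 1. resolve
    count = 0
    for path in files:
        text = path.read_text(encoding="utf-8")            # 2. extract (plain text only)
        for chunk_text in chunk(text):                     # 3. chunk
            vector = embed(chunk_text)                     # 4. embed
            store.append((str(path), chunk_text, vector))  # 5. store
            count += 1
    return count
```

The real pipeline dispatches step 2 to format-specific extractors and writes step 5 to LanceDB rather than a Python list.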
Configuration
| Field | Type | Default | Description |
|---|---|---|---|
| sources | list[str] | (required) | Glob patterns for source files |
| auto | bool | true | Auto-reindex on every initrunner run when sources have changed (since v2026.4.10). Set to false for manual-only. |
| watch | bool | false | Reserved for future use |
| chunking.strategy | str | "fixed" | "fixed" or "paragraph" |
| chunking.chunk_size | int | 512 | Maximum chunk size in characters |
| chunking.chunk_overlap | int | 50 | Overlapping characters between chunks |
| embeddings.provider | str | "" | Embedding provider (empty = derives from model) |
| embeddings.model | str | "" | Embedding model (empty = provider default) |
| embeddings.api_key_env | str | "" | Env var name holding the embedding API key. When empty, the default for the resolved provider is used (OPENAI_API_KEY for OpenAI/Anthropic, GOOGLE_API_KEY for Google). |
| store_backend | str | "lancedb" | Vector store backend |
| store_path | str \| null | null | Custom path (default: ~/.initrunner/stores/<agent-name>.lance) |
Chunking Strategies
Fixed (strategy: fixed)
Splits text into fixed-size character windows with overlap. Best for uniform document types, code files, and logs.
Paragraph (strategy: paragraph)
Splits on double newlines first, then merges small paragraphs until chunk_size is reached. Preserves natural document structure. Best for prose, markdown, and documentation.
Choosing a Strategy and Parameters
- Use `paragraph` for prose, markdown, and documentation — it preserves natural boundaries so a paragraph about "installation" stays together.
- Use `fixed` for code files, logs, and machine-generated text where structure doesn't carry semantic meaning.
chunk_size rules of thumb:
| Use Case | Recommended chunk_size |
|---|---|
| Short-answer Q&A | 256–512 |
| Dense technical content, long-form docs | 512–1024 |
chunk_overlap should be roughly 10% of chunk_size (e.g. 50 for a 512 chunk). Overlap ensures that information spanning a boundary is present in at least one chunk.
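To make the overlap rule concrete, here is a minimal sketch of fixed-window chunking (illustrative only; the real extractor may handle edge cases differently). Each window restarts `chunk_overlap` characters before the previous one ended, so text spanning a boundary appears in both neighbors:

```python
def fixed_chunks(text: str, chunk_size: int = 512, chunk_overlap: int = 50) -> list[str]:
    """Fixed-size character windows with overlap between consecutive chunks."""
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must exceed chunk_overlap")
    step = chunk_size - chunk_overlap  # e.g. 462 for the 512/50 defaults
    # Stop once the remaining tail is fully covered by the previous window.
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - chunk_overlap, 1), step)]
```

With the 512/50 defaults, a 1,000-character document yields three chunks, and the last 50 characters of each chunk reappear at the start of the next.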
Recommendations by Document Type
| Document type | Strategy | chunk_size | chunk_overlap | Notes |
|---|---|---|---|---|
| Markdown / articles | paragraph | 512 | 50 | Preserves natural paragraph boundaries |
| Code files | fixed | 1024 | 100 | Larger windows keep function context together |
| API references | paragraph | 256 | 50 | Short, dense entries benefit from smaller chunks |
| CSV / tabular data | fixed | 1024 | 0 | No overlap — rows must not be split across chunks |
| PDFs | fixed | 512–1024 | 50–100 | PDF layout varies; fixed chunking is more predictable |
Supported File Formats
Core Formats (always available)
| Extension | Extractor |
|---|---|
| .txt | Plain text (UTF-8) |
| .md | Plain text (UTF-8) |
| .rst | Plain text (UTF-8) |
| .csv | CSV rows joined with commas and newlines |
| .json | Pretty-printed JSON |
| .html, .htm | HTML to Markdown (scripts/styles removed) |
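As a rough illustration of how the core extractors differ (a simplified sketch, not InitRunner's code; the real extractors also handle encodings and malformed input), dispatch by extension might look like:

```python
import csv
import io
import json

def extract_text(path: str, raw: str) -> str:
    """Core-format extraction sketch: CSV rows rejoined with commas and
    newlines, JSON pretty-printed, everything else passed through as-is."""
    if path.endswith(".csv"):
        rows = csv.reader(io.StringIO(raw))
        return "\n".join(",".join(row) for row in rows)
    if path.endswith(".json"):
        return json.dumps(json.loads(raw), indent=2)
    return raw  # .txt / .md / .rst: plain text (UTF-8)
```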
Optional Formats (pip install initrunner[ingest])
| Extension | Extractor | Library |
|---|---|---|
| .pdf | PDF to Markdown | pymupdf4llm |
| .docx | Paragraphs joined with double newlines | python-docx |
| .xlsx | Sheets as CSV with title headers | openpyxl |
The search_documents Tool
When spec.ingest is configured, a search tool is auto-registered:
```
search_documents(query: str, top_k: int = 5, source: str | None = None) -> str
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| query | str | (required) | Natural-language search string (embedded and compared against stored chunks) |
| top_k | int | 5 | Number of results to return |
| source | str \| None | None | Glob pattern to filter results by source file path |
The tool creates an embedding from the query, searches the vector store for the most similar chunks, and returns results with source attribution and similarity scores.
Result format:
```
[1] (score: 0.87) ./docs/getting-started.md
    To create a new project, run `initrunner init`...

[2] (score: 0.82) ./docs/faq.md
    InitRunner supports multiple model providers...
```

Source filtering example:

```python
# Search only billing docs
search_documents("refund policy", source="*billing*")

# Search a specific file
search_documents("authentication", source="*/api-reference.md")
```

If no documents have been ingested, the tool returns a message directing you to run initrunner ingest.
Re-indexing
Since v2026.4.10, initrunner run checks source files for changes on every invocation and re-indexes automatically when anything has been added, modified, or removed. The check uses an mtime fast-path, so it's cheap. URLs already in the store are not re-fetched on auto runs, but new URLs added to the YAML are picked up.
To opt out, set ingest.auto: false in the role YAML. To force a full rebuild (for example, after a timestamp-preserving copy like cp -p, or when you want to refresh URL contents), run the manual command:
```bash
initrunner ingest role.yaml          # manual re-ingest, refreshes URL contents
initrunner ingest role.yaml --force  # authoritative rebuild
```

Running initrunner ingest is safe and idempotent:
- Resolves glob patterns to find current files.
- Deletes all existing chunks from each source file.
- Inserts new chunks from fresh extraction.
Files that no longer match the patterns have their chunks purged.
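The mtime fast-path can be sketched like this (an assumption-laden illustration: `indexed_mtimes` stands in for whatever per-file metadata the store actually records):

```python
from pathlib import Path

def needs_reindex(patterns: list[str], indexed_mtimes: dict[str, float],
                  root: str = ".") -> bool:
    """Reindex when any source file was added, modified, or removed
    since the last ingest. Cheap: one stat() per matched file."""
    current = {str(p): p.stat().st_mtime
               for pat in patterns for p in Path(root).glob(pat)}
    if current.keys() != indexed_mtimes.keys():
        return True  # a file was added or removed
    return any(current[p] > indexed_mtimes[p] for p in current)  # modified
```

This also shows why a timestamp-preserving copy like `cp -p` can slip past the check: the mtimes match even though the content changed, hence the `--force` escape hatch.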
Embedding Models
Provider resolution priority:
1. `ingest.embeddings.model` — if set, used directly
2. `ingest.embeddings.provider` — used to look up the default
3. `spec.model.provider` — falls back to agent's model provider
| Provider | Default Embedding Model |
|---|---|
| openai | openai:text-embedding-3-small |
| anthropic | openai:text-embedding-3-small |
| google | google:text-embedding-004 |
| ollama | ollama:nomic-embed-text |
Anthropic has no embeddings API. Agents using `provider: anthropic` fall back to `openai:text-embedding-3-small` by default (requires `OPENAI_API_KEY`). To avoid the OpenAI dependency, set `embeddings.provider: google` or `embeddings.provider: ollama`.
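The three-step priority above can be expressed as a small lookup (a sketch using the defaults table; function and argument names are illustrative, not InitRunner's API):

```python
DEFAULT_EMBEDDINGS = {
    "openai": "openai:text-embedding-3-small",
    "anthropic": "openai:text-embedding-3-small",  # no native embeddings API
    "google": "google:text-embedding-004",
    "ollama": "ollama:nomic-embed-text",
}

def resolve_embedding_model(embeddings_model: str, embeddings_provider: str,
                            agent_provider: str) -> str:
    """Apply the resolution priority: explicit model, then explicit
    embeddings provider, then the agent's model provider."""
    if embeddings_model:
        return embeddings_model                       # 1. explicit model wins
    provider = embeddings_provider or agent_provider  # 2. falls through to 3.
    return DEFAULT_EMBEDDINGS[provider]
```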
Scaffold
```bash
initrunner init --name kb-agent --template rag
```

Troubleshooting
No results from search_documents
- Documents not ingested — Run `initrunner ingest role.yaml` before querying. The tool returns a message if the store is empty.
- Query too specific — Try broader or rephrased queries. Embedding search is semantic, not keyword-exact.
- Wrong embedding model — If you changed the embedding model after ingesting, re-ingest so all vectors use the same model.
EmbeddingModelChangedError
Raised when the configured embedding model differs from the one used to create the existing store. Vectors from different models are incompatible. Fix by re-ingesting:
```bash
initrunner ingest role.yaml --force
```

Since v2026.4.10, this error also surfaces on automatic runs — swapping embeddings.model with otherwise-unchanged sources now triggers the same error and --force hint on the next initrunner run, not just on manual ingest.
DimensionMismatchError
The vector dimensions in the store don't match the current model's output dimensions. This usually happens when switching between embedding providers. Re-ingest with --force to rebuild the store.
Optional format extraction errors
If .pdf, .docx, or .xlsx files fail to extract, install the optional dependencies:
```bash
pip install "initrunner[ingest]"
```

This installs pymupdf4llm, python-docx, and openpyxl.
API key not set
Embedding keys are validated at startup. If the required key is missing you will see a clear error message identifying which variable to set.
| Provider | Required env var | Notes |
|---|---|---|
| openai | OPENAI_API_KEY | |
| anthropic | OPENAI_API_KEY | Anthropic has no native embeddings — falls back to OpenAI by default; set embeddings.provider to switch |
| google | GOOGLE_API_KEY | |
| ollama | (none) | Runs locally |
Override the variable name — if your key is stored under a non-default name, set embeddings.api_key_env in your ingest or memory config:
```yaml
spec:
  ingest:
    embeddings:
      provider: openai
      api_key_env: MY_EMBED_KEY  # read from MY_EMBED_KEY instead of OPENAI_API_KEY
```

Diagnose key issues with:

```bash
initrunner doctor
```

The Embedding Providers table shows which keys are set and which are missing.