Context Extraction

OpenViking uses a three-layer async architecture for document parsing and context extraction.

Overview

Input File → Parser → TreeBuilder → SemanticQueue → Vector Index
              ↓           ↓              ↓
          Parse &     Move Files     L0/L1 Generation
          Convert     Queue Semantic  (LLM Async)
          (No LLM)

Design Principle: Parsing and semantic generation are decoupled. The Parser never calls an LLM; semantic generation runs asynchronously.

Parser

Parser handles document format conversion and structuring, and writes the resulting file structure to a temp directory.

Supported Formats

Format	Parser	Extensions	Status
Markdown	MarkdownParser	.md, .markdown	Supported
Plain text	TextParser	.txt	Supported
PDF	PDFParser	.pdf	Supported
HTML	HTMLParser	.html, .htm	Supported
Code	CodeRepositoryParser	.py, .js, .go, etc.
Image	ImageParser	.png, .jpg, etc.
Video	VideoParser	.mp4, .avi, etc.
Audio	AudioParser	.mp3, .wav, etc.

Core Flow (Document Example)

# 1. Parse file
parse_result = registry.parse("/path/to/doc.md")

# 2. Returns temp directory URI
parse_result.temp_dir_path  # viking://temp/abc123

Smart Splitting

If document_tokens <= 1024:
    → Save as single file
Else:
    → Split by headers
    → Section < 512 tokens → Merge
    → Section > 1024 tokens → Create subdirectory
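
The splitting rules above can be sketched as a small planning function. This is a minimal sketch, not OpenViking's implementation: `plan_split`, the `(title, token_count)` input shape, and the action names are hypothetical. Two behaviors are assumptions because the rules do not spell them out: small sections are merged with adjacent small sections, and sections between 512 and 1024 tokens are kept as standalone files.

```python
from typing import List, Tuple

SINGLE_FILE_MAX = 1024  # documents at or below this stay in one file
MERGE_BELOW = 512       # sections smaller than this are merged
SUBDIR_ABOVE = 1024     # sections larger than this get a subdirectory


def plan_split(sections: List[Tuple[str, int]]) -> List[Tuple[str, List[str]]]:
    """Map (title, token_count) sections to (action, [titles]) pairs."""
    total = sum(tokens for _, tokens in sections)
    if total <= SINGLE_FILE_MAX:
        return [("single_file", [title for title, _ in sections])]

    plan: List[Tuple[str, List[str]]] = []
    pending: List[str] = []  # consecutive small sections awaiting a merge
    pending_tokens = 0

    def flush() -> None:
        nonlocal pending, pending_tokens
        if pending:
            plan.append(("file", pending))
            pending, pending_tokens = [], 0

    for title, tokens in sections:
        if tokens > SUBDIR_ABOVE:
            flush()
            plan.append(("subdirectory", [title]))
        elif tokens < MERGE_BELOW:
            # Assumption: merge with neighboring small sections.
            pending.append(title)
            pending_tokens += tokens
            if pending_tokens >= MERGE_BELOW:
                flush()
        else:
            # Assumption: mid-sized sections become their own file.
            flush()
            plan.append(("file", [title]))
    flush()
    return plan
```

For example, a document with a 2000-token "API" section between smaller sections would yield one merged file, one subdirectory, and the remaining sections as files.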

Return Result

ParseResult(
    temp_dir_path: str,    # Temp directory URI
    source_format: str,    # pdf/markdown/html
    parser_name: str,      # Parser name
    parse_time: float,     # Duration (seconds)
    meta: Dict,            # Metadata
)

TreeBuilder

TreeBuilder moves the temp directory into AGFS and queues it for semantic processing.

Core Flow

building_tree = tree_builder.finalize_from_temp(
    temp_dir_path="viking://temp/abc123",
    scope="resources",  # resources/user/agent
)

5-Phase Processing

1. Find document root: Ensure exactly 1 subdirectory in temp
2. Determine target URI: Map base URI by scope
3. Recursively move directory tree: Copy all files to AGFS
4. Clean up temp directory: Delete temp files
5. Queue semantic generation: Submit SemanticMsg to queue

URI Mapping

scope	Base URI
resources	`viking://resources`
user	`viking://user`
agent	`viking://agent`
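
The mapping in the table amounts to a lookup keyed by scope. The sketch below is illustrative only: `target_uri` and the idea of appending the document name as a path segment are assumptions, not the TreeBuilder API.

```python
SCOPE_BASE_URIS = {
    "resources": "viking://resources",
    "user": "viking://user",
    "agent": "viking://agent",
}


def target_uri(scope: str, doc_name: str) -> str:
    """Resolve the base URI for a scope and append the document name."""
    try:
        base = SCOPE_BASE_URIS[scope]
    except KeyError:
        raise ValueError(f"unknown scope: {scope!r}") from None
    # Assumption: the document root becomes a child of the base URI.
    return f"{base}/{doc_name.strip('/')}"
```

Rejecting unknown scopes early keeps a typo from silently creating a new top-level namespace.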

SemanticQueue

SemanticQueue handles async L0/L1 generation and vectorization.

Message Structure

SemanticMsg(
    id: str,           # UUID
    uri: str,          # Directory URI
    context_type: str, # resource/memory/skill
    status: str,       # pending/processing/completed
)

Processing Flow (Bottom-up)

Leaf directories → Parent directories → Root
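
Bottom-up processing corresponds to a post-order traversal: a directory is handled only after all of its children, so their abstracts already exist when the parent is summarized. A minimal sketch, assuming the tree is available as a mapping from directory URI to child URIs (`bottom_up_order` and that mapping are hypothetical):

```python
from typing import Dict, List


def bottom_up_order(children: Dict[str, List[str]], root: str) -> List[str]:
    """Return directory URIs so that every child precedes its parent."""
    order: List[str] = []

    def visit(uri: str) -> None:
        for child in children.get(uri, []):
            visit(child)
        order.append(uri)  # appended only after all descendants

    visit(root)
    return order
```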

Single Directory Processing Steps

1. Concurrent file summary generation: Limited to 10 concurrent LLM calls by default
2. Collect child directory abstracts: Read the generated .abstract.md files
3. Generate .overview.md: LLM generates L1 overview
4. Extract .abstract.md: Extract L0 from overview
5. Write files: Save to AGFS
6. Vectorize: Create Context and queue to EmbeddingQueue
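
The concurrency cap on file summaries can be illustrated with `asyncio.Semaphore`. This is a sketch under assumptions: `summarize_files` and the `summarize` callable stand in for whatever coroutine performs the LLM call, and the source does not state which primitive OpenViking uses.

```python
import asyncio
from typing import Awaitable, Callable, List

MAX_CONCURRENT_LLM = 10  # mirrors the documented default


async def summarize_files(
    files: List[str],
    summarize: Callable[[str], Awaitable[str]],
    limit: int = MAX_CONCURRENT_LLM,
) -> List[str]:
    """Summarize all files with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(path: str) -> str:
        async with semaphore:  # waits while `limit` calls are running
            return await summarize(path)

    # gather preserves input order in its result list
    return list(await asyncio.gather(*(bounded(p) for p in files)))
```

All tasks are created up front, but only `limit` of them hold the semaphore at any moment, so the LLM backend never sees more than that many simultaneous requests.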

Configuration Parameters

Parameter	Default	Description
`max_concurrent_llm`	10	Concurrent LLM calls
`max_images_per_call`	10	Max images per VLM call
`max_sections_per_call`	20	Max sections per VLM call

Code Skeleton Extraction (AST Mode)

For code files, OpenViking supports AST-based skeleton extraction via tree-sitter as a lightweight alternative to LLM summarization, significantly reducing processing cost.

Modes

Controlled by code_summary_mode in ov.conf (see Configuration):

Mode	Description
`"ast"`	Extract structural skeleton for files ≥100 lines, skip LLM calls (default)
`"llm"`	Always use LLM for summarization (original behavior)
`"ast_llm"`	Extract AST skeleton first, then pass it as context to LLM for summarization

What AST Extracts

The skeleton includes:

Module-level docstring (first line)
Import statement list
Class names, base classes, and method signatures (ast mode: first-line docstrings only; ast_llm mode: full docstrings)
Top-level function signatures
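
To make the skeleton contents concrete, here is a sketch that extracts the same kinds of items from Python source. It deliberately uses Python's standard-library `ast` module rather than tree-sitter, so it is an analogy and not OpenViking's extractor; `extract_skeleton` and the dictionary layout are hypothetical, and docstrings of classes and methods are omitted for brevity. It requires Python 3.9+ for `ast.unparse`.

```python
import ast
from typing import Any, Dict, Optional, Union

FuncNode = Union[ast.FunctionDef, ast.AsyncFunctionDef]


def first_line(doc: Optional[str]) -> str:
    """Return the first line of a docstring, or an empty string."""
    return doc.strip().splitlines()[0] if doc and doc.strip() else ""


def signature(fn: FuncNode) -> str:
    """Render a function header without its body."""
    keyword = "async def" if isinstance(fn, ast.AsyncFunctionDef) else "def"
    return f"{keyword} {fn.name}({ast.unparse(fn.args)})"


def extract_skeleton(source: str) -> Dict[str, Any]:
    tree = ast.parse(source)
    skeleton: Dict[str, Any] = {
        "docstring": first_line(ast.get_docstring(tree)),
        "imports": [],
        "classes": [],
        "functions": [],
    }
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    for node in tree.body:  # top-level statements only
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            skeleton["imports"].append(ast.unparse(node))
        elif isinstance(node, ast.ClassDef):
            skeleton["classes"].append({
                "name": node.name,
                "bases": [ast.unparse(base) for base in node.bases],
                "methods": [signature(n) for n in node.body
                            if isinstance(n, func_types)],
            })
        elif isinstance(node, func_types):
            skeleton["functions"].append(signature(node))
    return skeleton
```

Because only headers and names are kept, the result is typically far shorter than the file itself, which is the property that makes a skeleton a cheap substitute for an LLM summary.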

Supported Languages

The following languages have dedicated extractors built on tree-sitter:

Language	Status
Python	Supported
JavaScript / TypeScript	Supported
Rust	Supported
Go	Supported
Java	Supported
C / C++	Supported

Other languages automatically fall back to LLM.

Fallback Behavior

The following conditions trigger automatic fallback to LLM, with the reason logged. The overall pipeline is unaffected:

Language not in the supported list
File has fewer than 100 lines
AST parse error
Extraction produces an empty skeleton
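
The mode setting and the fallback conditions combine into one decision per file. The sketch below is an assumption-laden illustration: `choose_strategy`, the language labels, and the `extract` callable are hypothetical, and it assumes the same fallback rules apply to both `"ast"` and `"ast_llm"`.

```python
import logging
from typing import Callable, Optional, Tuple

logger = logging.getLogger("code_summary")  # hypothetical logger name

# Hypothetical labels for the languages listed as supported.
AST_LANGUAGES = {"python", "javascript", "typescript", "rust",
                 "go", "java", "c", "cpp"}
MIN_AST_LINES = 100


def _fallback(reason: str) -> Tuple[str, Optional[str]]:
    logger.info("falling back to LLM: %s", reason)
    return "llm", reason


def choose_strategy(
    mode: str,
    language: str,
    source: str,
    extract: Callable[[str], str],
) -> Tuple[str, Optional[str]]:
    """Return (strategy, fallback_reason) for one code file."""
    if mode == "llm":
        return "llm", None
    if language not in AST_LANGUAGES:
        return _fallback("unsupported language")
    if len(source.splitlines()) < MIN_AST_LINES:
        return _fallback("fewer than 100 lines")
    try:
        skeleton = extract(source)
    except Exception as exc:  # any extractor failure triggers fallback
        return _fallback(f"AST parse error: {exc}")
    if not skeleton:
        return _fallback("empty skeleton")
    return mode, None
```

Returning the reason alongside the strategy lets the caller log or count fallbacks without aborting the rest of the pipeline.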

File Structure

openviking/parse/parsers/code/ast/
├── extractor.py      # Language detection and dispatch
├── skeleton.py       # CodeSkeleton / FunctionSig / ClassSkeleton data structures
└── languages/        # Per-language extractors

Three Context Types Extraction

Flow Comparison

Phase	Resource	Memory	Skill
Parser	Common flow	Common flow	Common flow
Base URI	`viking://resources`	`viking://user/memories`	`viking://agent/skills`
TreeBuilder scope	resources	user/agent	agent
SemanticMsg type	resource	memory	skill

Resource Extraction

# Add resource
await client.add_resource(
    "/path/to/doc.pdf",
    reason="API documentation"
)

# Flow: Parser → TreeBuilder(scope=resources) → SemanticQueue

Skill Extraction

# Add skill
await client.add_skill({
    "name": "search-web",
    "content": "# search-web\n..."
})

# Flow: Direct write to viking://agent/skills/{name}/ → SemanticQueue

Memory Extraction

# Memory auto-extracted from session
await session.commit()

# Flow: MemoryExtractor → TreeBuilder(scope=user) → SemanticQueue
