OpenViking uses a three-layer async architecture for document parsing and context extraction.
Input File → Parser → TreeBuilder → SemanticQueue → Vector Index

- Parser: parse and convert (no LLM)
- TreeBuilder: move files, queue semantic processing
- SemanticQueue: L0/L1 generation (LLM, async)
Design principle: parsing and semantic generation are separated. The Parser never calls an LLM; semantic generation runs asynchronously.
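The separation can be illustrated with a small producer/consumer sketch. This is not OpenViking code: `parse`, `semantic_worker`, `ingest`, and the message shape are hypothetical stand-ins, and the LLM call is replaced by a placeholder `await`. It only shows that the parsing step returns immediately while semantic work is drained from a queue in the background.

```python
import asyncio


def parse(path):
    """Stand-in for the Parser stage: synchronous, no LLM call (hypothetical)."""
    name = path.rsplit("/", 1)[-1]
    return {"uri": f"viking://temp/{name}"}


async def semantic_worker(queue, results):
    """Stand-in for the semantic stage: consumes messages asynchronously."""
    while True:
        msg = await queue.get()
        try:
            await asyncio.sleep(0)  # placeholder for an async LLM call
            results.append(msg["uri"])
        finally:
            queue.task_done()


async def ingest(paths):
    queue = asyncio.Queue()
    results = []
    worker = asyncio.create_task(semantic_worker(queue, results))
    for path in paths:
        queue.put_nowait(parse(path))  # parsing never waits on the LLM stage
    await queue.join()  # wait until every queued message was processed
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return results
```

Calling `asyncio.run(ingest(["/docs/a.md", "/docs/b.md"]))` processes both messages in submission order.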
The Parser handles document format conversion and structuring, and writes the resulting file structure to a temporary directory.
| Format | Parser | Extensions | Status |
|---|---|---|---|
| Markdown | MarkdownParser | .md, .markdown | Supported |
| Plain text | TextParser | .txt | Supported |
| PDF | PDFParser | .pdf | Supported |
| HTML | HTMLParser | .html, .htm | Supported |
| Code | CodeRepositoryParser | .py, .js, .go, etc. | |
| Image | ImageParser | .png, .jpg, etc. | |
| Video | VideoParser | .mp4, .avi, etc. | |
| Audio | AudioParser | .mp3, .wav, etc. | |

A file is parsed through the registry, which returns the URI of the temp directory:

    # 1. Parse file
    parse_result = registry.parse("/path/to/doc.md")

    # 2. Returns temp directory URI
    parse_result.temp_dir_path  # viking://temp/abc123

Documents are split according to their token count:

- If document_tokens <= 1024: save as a single file.
- Otherwise, split by headers:
  - Section < 512 tokens → merge
  - Section > 1024 tokens → create a subdirectory
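The splitting rule above can be sketched as a small planning function. This is an illustration under stated assumptions, not the parser's actual code: `Section`, `plan_layout`, and the action names are hypothetical, the document token count is taken as the sum of its section counts, and "merge" is interpreted as combining adjacent small sections into one file.

```python
from dataclasses import dataclass

SINGLE_FILE_MAX_TOKENS = 1024  # documents at or below this stay in one file
MERGE_BELOW_TOKENS = 512       # sections below this are merged
SUBDIR_ABOVE_TOKENS = 1024     # sections above this get a subdirectory


@dataclass
class Section:
    title: str
    tokens: int


def plan_layout(sections):
    """Return a list of (action, [section titles]) describing the layout."""
    total = sum(s.tokens for s in sections)
    if total <= SINGLE_FILE_MAX_TOKENS:
        return [("single_file", [s.title for s in sections])]

    plan = []
    pending = []  # adjacent small sections waiting to be merged (assumption)

    def flush():
        if pending:
            plan.append(("file", list(pending)))
            pending.clear()

    for s in sections:
        if s.tokens > SUBDIR_ABOVE_TOKENS:
            flush()
            plan.append(("subdirectory", [s.title]))
        elif s.tokens < MERGE_BELOW_TOKENS:
            pending.append(s.title)
        else:
            flush()
            plan.append(("file", [s.title]))
    flush()
    return plan
```

For example, sections of 300, 400, 2000, and 600 tokens yield one merged file for the first two, a subdirectory for the third, and a standalone file for the last.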
The parser returns a ParseResult:

    ParseResult(
        temp_dir_path: str,   # Temp directory URI
        source_format: str,   # pdf/markdown/html
        parser_name: str,     # Parser name
        parse_time: float,    # Duration (seconds)
        meta: Dict,           # Metadata
    )

The TreeBuilder moves the temp directory to AGFS and queues semantic processing.

    building_tree = tree_builder.finalize_from_temp(
        temp_dir_path="viking://temp/abc123",
        scope="resources",  # resources/user/agent
    )

finalize_from_temp performs the following steps:

- Find document root: Ensure exactly 1 subdirectory in temp
- Determine target URI: Map base URI by scope
- Recursively move directory tree: Copy all files to AGFS
- Clean up temp directory: Delete temp files
- Queue semantic generation: Submit SemanticMsg to queue
| scope | Base URI |
|---|---|
| resources | viking://resources |
| user | viking://user |
| agent | viking://agent |
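The finalization steps and the scope mapping can be sketched against a local filesystem. This is a simplified illustration, not the TreeBuilder implementation: `finalize_from_temp_sketch` is a hypothetical name, a plain directory stands in for AGFS, a Python list stands in for the semantic queue, and the target URI is assumed to be the base URI plus the document root's name.

```python
import shutil
import tempfile
from pathlib import Path

# Scope to base URI mapping, taken from the table above.
BASE_URIS = {
    "resources": "viking://resources",
    "user": "viking://user",
    "agent": "viking://agent",
}


def finalize_from_temp_sketch(temp_dir, storage_root, scope, queue):
    """Move a parsed tree out of temp_dir and queue it for semantic processing."""
    # 1. Find document root: exactly one subdirectory is expected.
    subdirs = [p for p in temp_dir.iterdir() if p.is_dir()]
    if len(subdirs) != 1:
        raise ValueError(f"expected exactly 1 subdirectory, found {len(subdirs)}")
    doc_root = subdirs[0]

    # 2. Determine target URI from the scope (naming scheme is an assumption).
    target_uri = f"{BASE_URIS[scope]}/{doc_root.name}"

    # 3. Recursively copy the directory tree into the storage stand-in.
    shutil.copytree(doc_root, storage_root / scope / doc_root.name)

    # 4. Clean up the temp directory.
    shutil.rmtree(temp_dir)

    # 5. Queue semantic generation (a dict stands in for SemanticMsg).
    queue.append({"uri": target_uri, "status": "pending"})
    return target_uri


if __name__ == "__main__":
    base = Path(tempfile.mkdtemp())
    (base / "temp" / "doc").mkdir(parents=True)
    (base / "temp" / "doc" / "intro.md").write_text("hello")
    pending = []
    print(finalize_from_temp_sketch(base / "temp", base / "agfs", "resources", pending))
```

An unknown scope raises `KeyError`, and a temp directory with zero or several subdirectories raises `ValueError`, which mirrors the "exactly 1 subdirectory" check.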
The SemanticQueue handles async L0/L1 generation and vectorization.

    SemanticMsg(
        id: str,            # UUID
        uri: str,           # Directory URI
        context_type: str,  # resource/memory/skill
        status: str,        # pending/processing/completed
    )

Directories are processed bottom-up: leaf directories → parent directories → root. For each directory:

- Concurrent file summary generation: Limited to 10 concurrent LLM calls
- Collect child directory abstracts: Read generated .abstract.md
- Generate .overview.md: LLM generates L1 overview
- Extract .abstract.md: Extract L0 from overview
- Write files: Save to AGFS
- Vectorize: Create Context and queue to EmbeddingQueue
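Two mechanics in the steps above lend themselves to a short sketch: ordering directories so that children are processed before parents, and capping concurrent summary calls. The helpers below are hypothetical (`bottom_up`, `summarize_files`, and the `summarize` callback are not OpenViking names); the concurrency cap uses a standard `asyncio.Semaphore`, which is one common way to enforce such a limit, not necessarily the one used here.

```python
import asyncio

MAX_CONCURRENT_LLM = 10  # default from the documentation


def bottom_up(dir_uris):
    """Order directory URIs deepest first, so children precede their parents."""
    return sorted(dir_uris, key=lambda u: u.rstrip("/").count("/"), reverse=True)


async def summarize_files(files, summarize, limit=MAX_CONCURRENT_LLM):
    """Run summarize(file) for every file, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(file):
        async with semaphore:  # blocks while `limit` calls are in flight
            return await summarize(file)

    # gather preserves input order in its result list
    return await asyncio.gather(*(run_one(f) for f in files))
```

Sorting by depth is sufficient here because a parent URI always has fewer path segments than any of its descendants.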
| Parameter | Default | Description |
|---|---|---|
| max_concurrent_llm | 10 | Concurrent LLM calls |
| max_images_per_call | 10 | Max images per VLM call |
| max_sections_per_call | 20 | Max sections per VLM call |
For code files, OpenViking supports AST-based skeleton extraction via tree-sitter as a lightweight alternative to LLM summarization, significantly reducing processing cost.
Controlled by code_summary_mode in ov.conf (see Configuration):
| Mode | Description |
|---|---|
| "ast" | Extract structural skeleton for files ≥100 lines, skip LLM calls (default) |
| "llm" | Always use LLM for summarization (original behavior) |
| "ast_llm" | Extract AST skeleton first, then pass it as context to LLM for summarization |
The skeleton includes:
- Module-level docstring (first line)
- Import statement list
- Class names, base classes, and method signatures (ast mode: first-line docstrings only; ast_llm mode: full docstrings)
- Top-level function signatures
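To make the skeleton contents concrete, here is a minimal sketch for Python sources. Note the substitution: it uses Python's standard-library `ast` module instead of tree-sitter, so it is not the project's extractor, and the names `extract_skeleton`, `signature`, and `first_line` are hypothetical. It collects the module docstring's first line, import statements, class names with bases and method signatures, and top-level function signatures; method docstrings are omitted for brevity.

```python
import ast

FUNCTION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def first_line(doc):
    """Return the first line of a docstring, or an empty string."""
    return doc.strip().splitlines()[0] if doc and doc.strip() else ""


def signature(fn):
    """Render a function signature without its body or return annotation."""
    keyword = "async def" if isinstance(fn, ast.AsyncFunctionDef) else "def"
    return f"{keyword} {fn.name}({ast.unparse(fn.args)})"


def extract_skeleton(source):
    """Build a structural skeleton of a Python module as a list of lines."""
    tree = ast.parse(source)
    lines = []
    doc = first_line(ast.get_docstring(tree))
    if doc:
        lines.append(f'"""{doc}"""')
    for node in tree.body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            lines.append(ast.unparse(node))
        elif isinstance(node, ast.ClassDef):
            bases = ", ".join(ast.unparse(b) for b in node.bases)
            lines.append(f"class {node.name}({bases})" if bases else f"class {node.name}")
            lines.extend("    " + signature(m) for m in node.body
                         if isinstance(m, FUNCTION_NODES))
        elif isinstance(node, FUNCTION_NODES):
            lines.append(signature(node))
    return lines
```

`ast.unparse` requires Python 3.9 or later. A tree-sitter based extractor would walk a concrete syntax tree instead, but the collected fields would be the same.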
The following languages have dedicated extractors built on tree-sitter:
| Language | Status |
|---|---|
| Python | Supported |
| JavaScript / TypeScript | Supported |
| Rust | Supported |
| Go | Supported |
| Java | Supported |
| C / C++ | Supported |
Other languages automatically fall back to LLM.
The following conditions trigger automatic fallback to LLM, with the reason logged. The overall pipeline is unaffected:
- Language not in the supported list
- File has fewer than 100 lines
- AST parse error
- Extraction produces an empty skeleton
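The mode selection and fallback conditions can be summarized as a decision function. This is a sketch under assumptions, not the actual dispatch code: `choose_strategy`, the language identifiers, and the logger name are hypothetical, and the same fallback checks are assumed to apply to both "ast" and "ast_llm" modes.

```python
import logging

logger = logging.getLogger("code_summary")  # hypothetical logger name

# Language keys are illustrative; the supported set follows the table above.
SUPPORTED_LANGUAGES = {"python", "javascript", "typescript", "rust",
                       "go", "java", "c", "cpp"}
MIN_LINES_FOR_AST = 100


def choose_strategy(mode, language, line_count, extract_skeleton):
    """Return (strategy, skeleton); strategy is "ast", "ast_llm", or "llm".

    extract_skeleton is a zero-argument callable that returns the skeleton
    text and may raise on a parse error.
    """
    if mode == "llm":
        return "llm", None

    reason = None
    skeleton = None
    if language not in SUPPORTED_LANGUAGES:
        reason = f"language {language!r} not supported"
    elif line_count < MIN_LINES_FOR_AST:
        reason = f"file has only {line_count} lines"
    else:
        try:
            skeleton = extract_skeleton()
        except Exception as exc:
            reason = f"AST parse error: {exc}"
        else:
            if not skeleton.strip():
                reason = "empty skeleton"

    if reason is not None:
        # Fallback is logged and the pipeline continues with the LLM path.
        logger.info("Falling back to LLM summarization: %s", reason)
        return "llm", None
    return mode, skeleton
```

Returning a strategy instead of raising keeps a single failing file from interrupting the rest of the pipeline.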

Source layout:

    openviking/parse/parsers/code/ast/
    ├── extractor.py   # Language detection and dispatch
    ├── skeleton.py    # CodeSkeleton / FunctionSig / ClassSkeleton data structures
    └── languages/     # Per-language extractors

| Phase | Resource | Memory | Skill |
|---|---|---|---|
| Parser | Common flow | Common flow | Common flow |
| Base URI | viking://resources | viking://user/memories | viking://agent/skills |
| TreeBuilder scope | resources | user/agent | agent |
| SemanticMsg type | resource | memory | skill |

Resource flow:

    # Add resource
    await client.add_resource(
        "/path/to/doc.pdf",
        reason="API documentation"
    )
    # Flow: Parser → TreeBuilder(scope=resources) → SemanticQueue

Skill flow:

    # Add skill
    await client.add_skill({
        "name": "search-web",
        "content": "# search-web\n..."
    })
    # Flow: Direct write to viking://agent/skills/{name}/ → SemanticQueue

Memory flow:

    # Memory auto-extracted from session
    await session.commit()
    # Flow: MemoryExtractor → TreeBuilder(scope=user) → SemanticQueue

Related documents:

- Architecture Overview - System architecture
- Context Layers - L0/L1/L2 model
- Storage Architecture - AGFS and vector index
- Session Management - Memory extraction details