feat(pdf): extract bookmarks as markdown headings for hierarchical parsing by r266-tech · Pull Request #403 · volcengine/OpenViking

r266-tech · 2026-03-03T20:41:42Z

Summary

Implements Phase 1 of #393 — PDF bookmark/outline extraction and injection as markdown headings.

Problem

When parsing large structured PDFs (textbooks, pharmacopoeias, standards documents), the local pdfplumber strategy emits only HTML comment markers and no markdown headings. Because MarkdownParser._find_headings() ignores HTML comments, it finds zero headings and falls back to paragraph-based splitting. The result is hundreds of flat, numbered files (name_1.md through name_538.md) with no semantic organization.

Solution

Added _extract_bookmarks() method that extracts PDF outlines via pdfminer's doc.get_outlines() (accessed through pdfplumber)
Bookmark destinations are mapped to page numbers by resolving page object IDs
Bookmarks are injected as markdown headings (#, ##, etc.) at the correct page positions before each page's text content
MarkdownParser's existing heading-based splitting then naturally builds a hierarchical directory structure
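
The injection step described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name `inject_headings`, the dict keys (`level`, `title`, `page_num`), and the 1-based page numbering are all assumptions made for the example.

```python
from collections import defaultdict


def inject_headings(page_texts, bookmarks):
    """Prepend markdown headings to the text of the page each bookmark points at.

    page_texts: list of str, where index 0 holds the text of page 1 (assumed layout).
    bookmarks:  list of dicts with hypothetical keys "level", "title", "page_num".
    """
    # Group bookmarks by target page; unresolved ones (page_num=None) are left out
    # so that no heading lands at a wrong position.
    by_page = defaultdict(list)
    for bm in bookmarks:
        if bm["page_num"] is not None:
            by_page[bm["page_num"]].append(bm)

    parts = []
    for page_num, text in enumerate(page_texts, start=1):
        # Headings go before the page's own text, in outline order.
        for bm in by_page.get(page_num, []):
            parts.append("#" * bm["level"] + " " + bm["title"])
        parts.append(text)
    return "\n\n".join(parts)


markdown = inject_headings(
    ["intro text", "body text"],
    [
        {"level": 1, "title": "Part One", "page_num": 1},
        {"level": 2, "title": "Chapter 1", "page_num": 2},
        {"level": 2, "title": "Lost", "page_num": None},
    ],
)
```

With headings present in the output, a heading-based splitter has something to work with; the bookmark whose page could not be resolved simply does not appear.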

Before

data/viking/resources/pharmacopoeia/
├── pharmacopoeia_1.md      # Page 2-3 raw text
├── pharmacopoeia_2.md      # Page 4 raw text  
├── ...
└── pharmacopoeia_538.md    # Last chunk

After (for PDFs with bookmarks)

data/viking/resources/pharmacopoeia/
├── 第一部_药材/
│   ├── 川木通.md
│   └── ...
├── 第二部_化学药/
│   └── ...
└── ...

Details

Heading levels capped at 1–6 for markdown compatibility
Empty/whitespace titles are skipped
Unresolvable page destinations yield page_num=None; such bookmarks are logged but not injected, so no heading lands at a wrong position
New bookmarks_extracted metadata field records the number of bookmarks extracted
Fully backward-compatible: PDFs without bookmarks behave identically to before
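
The rules in this list can be sketched as below, under stated assumptions. The function names, the `page_ids` mapping (page object ID to 1-based page number), and the choice to return an empty list on any exception are illustrative and not taken from the PR. The five-element tuple shape follows what pdfminer's `get_outlines()` yields: level, title, destination, action, and structure element.

```python
def resolve_page(dest, page_ids):
    """Return the 1-based page number for a destination, or None if unresolvable."""
    # Assumption: an explicit destination is a list whose first item is an
    # object reference carrying an `objid` attribute.
    if isinstance(dest, (list, tuple)) and dest:
        return page_ids.get(getattr(dest[0], "objid", None))
    return None


def extract_bookmarks(doc, page_ids):
    """Collect outline entries as dicts with level, title, and page_num."""
    get_outlines = getattr(doc, "get_outlines", None)
    if get_outlines is None:
        return []  # no outline API on this object
    bookmarks = []
    try:
        for level, title, dest, _action, _struct_elem in get_outlines():
            title = (title or "").strip()
            if not title:
                continue  # skip empty or whitespace-only titles
            bookmarks.append(
                {
                    "level": max(1, min(int(level), 6)),  # clamp to markdown's 1-6
                    "title": title,
                    "page_num": resolve_page(dest, page_ids),
                }
            )
    except Exception:
        # Assumption: a missing outline or a corrupt PDF yields no bookmarks at all.
        return []
    return bookmarks
```

With pdfplumber, `doc` would presumably be `pdf.doc`, and `page_ids` could be built from `page.page_obj.pageid` and `page.page_number` for each page in `pdf.pages`; treat that wiring as an assumption to verify against the actual implementation.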

Tests

7 unit tests added in tests/parse/test_pdf_bookmark_extraction.py:

Normal extraction with page mapping
No outlines → empty list
Missing get_outlines method → empty list
Empty/whitespace titles skipped
Level capping at 6
Unresolved page destinations
Exception handling (corrupt PDF)

Future Work (from #393)

Phase 2: Font-size heading detection (fallback for PDFs without bookmarks)
Phase 3: Directory auto-grouping (MAX_CHILDREN_PER_DIR threshold)

Ref: #393

…rsing

Addresses Phase 1 of #393: PDF bookmark/outline extraction.

**Problem:** When parsing large structured PDFs (textbooks, pharmacopoeias, standards), the local pdfplumber strategy emits only HTML comment markers. Because MarkdownParser._find_headings() ignores HTML comments, it finds zero headings and falls back to paragraph-based splitting, producing hundreds of flat numbered files with no semantic organization.

**Solution:**
- Add `_extract_bookmarks()` method that extracts PDF outlines via pdfminer's `doc.get_outlines()` (accessed through pdfplumber)
- Map bookmark destinations to page numbers by resolving page object IDs
- Inject bookmarks as markdown headings (`#`, `##`, etc.) at the correct page positions, before the page's text content
- This lets MarkdownParser's existing heading-based splitting build a hierarchical directory structure naturally

**Details:**
- Heading levels capped at 1-6 for markdown compatibility
- Empty/whitespace-only titles are skipped
- Unresolvable page destinations result in page_num=None (the bookmark is logged but not injected at a wrong position)
- New `bookmarks_extracted` metadata field tracks extraction count
- Fully backward-compatible: PDFs without bookmarks behave identically

**Tests:**
- 7 unit tests covering: normal extraction, no outlines, missing method, empty titles, level capping, unresolved pages, and exception handling

Ref: #393

github-project-automation bot added this to OpenViking project Mar 3, 2026

github-project-automation bot moved this to Backlog in OpenViking project Mar 3, 2026

r266-tech added 2 commits March 4, 2026 05:02

style: format pdf.py and remove unused variable

3c6d578

style: fix import sorting and remove unused import in tests

2109834

r266-tech force-pushed the feat/pdf-bookmark-headings branch from b91cab2 to 2109834 Compare March 3, 2026 21:38

r266-tech mentioned this pull request Mar 4, 2026

[Feature]: Transaction mechanism for atomic multi-subsystem operations #390

Open

1 task

MaojiaSheng approved these changes Mar 4, 2026

MaojiaSheng merged commit 59d8bae into volcengine:main Mar 4, 2026
5 checks passed

github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 4, 2026

MaojiaSheng merged 3 commits into volcengine:main from r266-tech:feat/pdf-bookmark-headings
