feat(pdf): extract bookmarks as markdown headings for hierarchical parsing #403

Merged
MaojiaSheng merged 3 commits into volcengine:main from r266-tech:feat/pdf-bookmark-headings
Mar 4, 2026

Conversation

@r266-tech
Contributor

Summary

Implements Phase 1 of #393 — PDF bookmark/outline extraction and injection as markdown headings.

Problem

When parsing large structured PDFs (textbooks, pharmacopoeias, standards documents), the local pdfplumber strategy only outputs <!-- Page N --> HTML comment markers. Since MarkdownParser._find_headings() ignores HTML comments, it finds zero headings and falls back to paragraph-based splitting. This produces hundreds of flat numbered files (name_1.md through name_538.md) with no semantic organization.
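
The failure mode can be reproduced with a small sketch. It is not the project's `MarkdownParser._find_headings()`; `HEADING_RE` and `find_headings` are hypothetical stand-ins that assume headings are detected as ATX-style lines (`#` through `######`). Under that assumption, text containing only page-marker comments yields no headings, while the same text with heading lines added does.

```python
import re

# Hypothetical stand-in for heading detection: matches ATX headings only.
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def find_headings(markdown: str):
    """Return (level, title) pairs for every ATX heading line."""
    return [(len(m.group(1)), m.group(2)) for m in HEADING_RE.finditer(markdown)]


# Output shaped like the old pdfplumber strategy: page markers only.
markers_only = "<!-- Page 1 -->\n\nSome text\n\n<!-- Page 2 -->\n\nMore text"

# The same content with heading lines present.
with_headings = (
    "# Part One\n\n<!-- Page 1 -->\n\nSome text\n\n"
    "## Chapter 1\n\n<!-- Page 2 -->\n\nMore text"
)

print(find_headings(markers_only))   # no headings found
print(find_headings(with_headings))  # two headings found
```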

Solution

  • Added _extract_bookmarks() method that extracts PDF outlines via pdfminer's doc.get_outlines() (accessed through pdfplumber)
  • Bookmark destinations are mapped to page numbers by resolving page object IDs
  • Bookmarks are injected as markdown headings (#, ##, etc.) at the correct page positions before each page's text content
  • MarkdownParser's existing heading-based splitting then naturally builds a hierarchical directory structure
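
The injection step can be sketched as a pure function over already-extracted data. This is an illustration, not the merged code: `inject_bookmark_headings`, the dict keys `title`, `level`, and `page_num`, and the choice to place headings before the page marker are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List


def inject_bookmark_headings(pages: List[str], bookmarks: List[dict]) -> str:
    """Build markdown where each bookmark becomes a heading ahead of its page.

    pages: page texts, index 0 holding page 1.
    bookmarks: dicts with 'title', 'level' (1-6) and 'page_num' (1-based or None).
    """
    by_page: Dict[int, List[dict]] = defaultdict(list)
    for bm in bookmarks:
        # Bookmarks without a resolved page are left out rather than misplaced.
        if bm["page_num"] is not None:
            by_page[bm["page_num"]].append(bm)

    parts: List[str] = []
    for page_num, text in enumerate(pages, start=1):
        for bm in by_page.get(page_num, []):
            parts.append(f"{'#' * bm['level']} {bm['title']}")
        parts.append(f"<!-- Page {page_num} -->")
        parts.append(text)
    return "\n\n".join(parts)
```

With headings in place, a heading-based splitter sees a nested outline instead of a flat run of page markers.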

Before

data/viking/resources/pharmacopoeia/
├── pharmacopoeia_1.md      # Page 2-3 raw text
├── pharmacopoeia_2.md      # Page 4 raw text  
├── ...
└── pharmacopoeia_538.md    # Last chunk

After (for PDFs with bookmarks)

data/viking/resources/pharmacopoeia/
├── 第一部_药材/
│   ├── 川木通.md
│   └── ...
├── 第二部_化学药/
│   └── ...
└── ...

Details

  • Heading levels capped at 1–6 for markdown compatibility
  • Empty/whitespace titles are skipped
  • Unresolvable page destinations yield page_num=None; such bookmarks are logged but not injected, so no heading lands at a wrong position
  • New bookmarks_extracted metadata field records the number of bookmarks extracted
  • Fully backward-compatible: PDFs without bookmarks behave identically to before
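
The rules above can be sketched as follows. This is a minimal approximation rather than the merged `_extract_bookmarks()`: it assumes `doc` behaves like pdfminer's `PDFDocument`, whose `get_outlines()` yields `(level, title, dest, action, se)` tuples, and it takes a hypothetical `objid_to_page` mapping from page object ID to 1-based page number (with pdfplumber this could plausibly be built from each page's `page_obj.pageid`). Only explicit destinations whose first element carries an `objid` are resolved; named destinations and action dictionaries are deliberately left out.

```python
from typing import Any, Dict, List, Optional


def _resolve_page(dest: Any, objid_to_page: Dict[int, int]) -> Optional[int]:
    """Map an explicit destination like [page_ref, ...] to a page number."""
    if isinstance(dest, (list, tuple)) and dest:
        objid = getattr(dest[0], "objid", None)
        return objid_to_page.get(objid)
    return None  # unresolved: the caller must not inject this bookmark


def extract_bookmarks(doc: Any, objid_to_page: Dict[int, int]) -> List[dict]:
    """Return bookmarks as dicts; an empty list means 'behave as before'."""
    get_outlines = getattr(doc, "get_outlines", None)
    if get_outlines is None:
        return []  # object has no outline API

    bookmarks: List[dict] = []
    try:
        for level, title, dest, _action, _se in get_outlines():
            title = str(title or "").strip()
            if not title:
                continue  # skip empty or whitespace-only titles
            bookmarks.append(
                {
                    "level": max(1, min(int(level), 6)),  # clamp to 1-6
                    "title": title,
                    "page_num": _resolve_page(dest, objid_to_page),
                }
            )
    except Exception:
        # Assumption: any failure (no outlines, corrupt PDF) disables the feature.
        return []
    return bookmarks
```

Returning an empty list on every failure path is what keeps the sketch backward-compatible: the caller can treat "no bookmarks" and "extraction failed" identically and fall through to the previous behavior.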

Tests

7 unit tests added in tests/parse/test_pdf_bookmark_extraction.py:

  1. Normal extraction with page mapping
  2. No outlines → empty list
  3. Missing get_outlines method → empty list
  4. Empty/whitespace titles skipped
  5. Level capping at 6
  6. Unresolved page destinations
  7. Exception handling (corrupt PDF)

Future Work (from #393)

  • Phase 2: Font-size heading detection (fallback for PDFs without bookmarks)
  • Phase 3: Directory auto-grouping (MAX_CHILDREN_PER_DIR threshold)

Ref: #393

…rsing

Addresses Phase 1 of #393 — PDF bookmark/outline extraction.

**Problem:**
When parsing large structured PDFs (textbooks, pharmacopoeias, standards),
the local pdfplumber strategy only outputs `<!-- Page N -->` HTML comment
markers. Since MarkdownParser._find_headings() ignores HTML comments, it
finds zero headings and falls back to paragraph-based splitting, producing
hundreds of flat numbered files with no semantic organization.

**Solution:**
- Add `_extract_bookmarks()` method that extracts PDF outlines via
  pdfminer's `doc.get_outlines()` (accessed through pdfplumber)
- Map bookmark destinations to page numbers by resolving page object IDs
- Inject bookmarks as markdown headings (`#`, `##`, etc.) at the correct
  page positions before the page's text content
- This allows MarkdownParser's existing heading-based splitting to build a
  hierarchical directory structure naturally

**Details:**
- Heading levels capped at 1-6 for markdown compatibility
- Empty/whitespace-only titles are skipped
- Unresolvable page destinations gracefully result in page_num=None
  (bookmark is logged but not injected at a wrong position)
- New `bookmarks_extracted` metadata field tracks extraction count
- Fully backward-compatible: PDFs without bookmarks behave identically

**Tests:**
- 7 unit tests covering: normal extraction, no outlines, missing method,
  empty titles, level capping, unresolved pages, and exception handling

Ref: #393
@MaojiaSheng MaojiaSheng merged commit 59d8bae into volcengine:main Mar 4, 2026
5 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 4, 2026