fix: improve markitdown parser consistency and add real-file tests
Critical fixes:
- WordParser: preserve table position in document order (tables were
previously appended at the end, losing context). Walk the document
body XML in order instead of iterating paragraphs, then tables.
- PowerPointParser: replace magic number (type == 1) with proper
PP_PLACEHOLDER enum constants, also handle CENTER_TITLE.
- AudioParser: add Vorbis/FLAC/OGG tag extraction (previously only
ID3 and MP4 tags were handled). Tries every format mapping and
deduplicates the results.
- ZipParser: replace emoji in tree view with plain text markers
for robustness in text processing pipelines.
- TextParser: set parser_name='TextParser' on parse_content results
for consistency with all other parsers.
- __init__.py: export all new parser classes for public API.
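Illustration of the WordParser fix, as a minimal stdlib-only sketch rather than the code in this commit: it walks the children of the document body in order, so a table is emitted where it appears instead of after all paragraphs. The function name body_to_markdown and the pipe-table output format are assumptions made for the example; the real parser may be built on a DOCX library and render tables differently.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"


def _text(node):
    """Concatenate every w:t text run found under node."""
    return "".join(t.text or "" for t in node.iter(f"{W}t"))


def body_to_markdown(document_xml: str) -> str:
    """Hypothetical helper: render paragraphs and tables in body order."""
    root = ET.fromstring(document_xml)
    body = root.find(f"{W}body")
    if body is None:
        return ""
    blocks = []
    # Iterating the body's direct children keeps paragraphs and tables
    # interleaved exactly as they occur in the document.
    for child in body:
        if child.tag == f"{W}p":
            text = _text(child).strip()
            if text:
                blocks.append(text)
        elif child.tag == f"{W}tbl":
            rows = [
                [_text(tc).strip() for tc in tr.findall(f"{W}tc")]
                for tr in child.findall(f"{W}tr")
            ]
            if not rows:
                continue
            header = rows[0]
            lines = [
                "| " + " | ".join(header) + " |",
                "| " + " | ".join("---" for _ in header) + " |",
            ]
            lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
            blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```

The input is the contents of word/document.xml, which can be read from a .docx file with zipfile.ZipFile(path).read("word/document.xml").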
Tests (16 new, 39 total):
- Real .docx/.xlsx/.pptx file creation and parsing
- EPub HTML-to-markdown conversion edge cases
- ZIP bad-file error handling and no-emoji tree view
- AudioParser Vorbis tag extraction and edge cases
- WordParser can_parse() extension matching
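Illustration of the ZIP behaviour covered by the tests above, as a hedged sketch and not the ZipParser implementation: a plain-text tree view with ASCII markers, plus graceful handling of input that is not a ZIP archive. The function name zip_tree, the "[dir]"/"[file]" markers, and the error string are all assumptions; the commit does not say which markers replaced the emoji.

```python
import io
import zipfile


def zip_tree(data: bytes) -> str:
    """Hypothetical helper: list a ZIP archive as an ASCII-only tree."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = sorted(zf.namelist())
    except zipfile.BadZipFile:
        # Invalid archives produce an error line instead of a traceback.
        return "[error] not a valid ZIP archive"
    lines = []
    for name in names:
        trimmed = name.rstrip("/")
        depth = trimmed.count("/")  # nesting level drives the indentation
        marker = "[dir]" if name.endswith("/") else "[file]"
        lines.append("  " * depth + f"{marker} {trimmed.split('/')[-1]}")
    return "\n".join(lines)
```

A test in the spirit of the ones listed could build an archive in memory with zipfile.ZipFile(buf, "w").writestr(...), then check that the output passes str.isascii() and that arbitrary bytes yield the error line.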