refactor: make office docs optional and reduce core dependency footprint #255

Merged
mlikasam-askui merged 4 commits into main from refactor/remove-unnecessary-dependencies
Apr 3, 2026

@mlikasam-askui
Contributor

Summary

This PR reduces the default installation footprint and clarifies optional dependency usage.

Dependency changes

  • Removed markitdown from core dependencies and introduced optional extra: office_document
  • Updated all extra to include office-document (and no longer depend on the removed android extra)
  • Moved pure-python-adb into default dependencies
  • Removed bson from core dependencies

Runtime/code changes

  • generate_time_ordered_id() no longer uses bson.ObjectId; now builds IDs from time.time_ns() + UUID suffix
  • convert_to_markdown() now imports markitdown lazily and raises a clear install hint:
    • pip install "askui[office-document]"
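
The lazy-import behaviour described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the helper `require_optional` and the `source` parameter are hypothetical names introduced for illustration, and the actual signature of `convert_to_markdown()` in askui may differ. It assumes MarkItDown's `MarkItDown().convert(...)` call, which returns a result exposing `text_content`.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or fail with an install hint.

    Hypothetical helper: the name and shape are assumptions, not askui API.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        msg = (
            f"'{module_name}' is required for this feature. "
            f'Install it with: pip install "askui[{extra}]"'
        )
        raise ImportError(msg) from exc


def convert_to_markdown(source: str) -> str:
    """Convert an Office document to Markdown (sketch only)."""
    # The import happens on first use, so a plain `pip install askui`
    # never needs markitdown unless this function is actually called.
    markitdown = require_optional("markitdown", "office-document")
    result = markitdown.MarkItDown().convert(source)
    return result.text_content
```

Deferring the import to call time means users on the minimal install only see the error, and the hint, when they actually try to process an Office document.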

Documentation/config updates

  • README now promotes minimal install (pip install askui) and explains optional extras
  • docs/10_extracting_data.md explicitly notes Excel/Word (OfficeDocumentSource) requires office-document
  • docs/01_setup.md updated Python requirement text to match the relaxed >=3.10 constraint
  • pyproject.toml/pdm.lock synchronized with new extras + deps
  • Removed stale mypy ignore section for bson

Why

This keeps the base package lighter, avoids forcing Office-conversion dependencies on all users, and makes Office document support explicit and discoverable.

- Move MarkItDown to new `office_document` extra and lazy-load in markdown conversion
- Remove bson usage; generate time-ordered IDs via `time_ns` + UUID fragment
- Promote `pure-python-adb` to default deps; replace `android` extra in `all`
- Relax Python constraint to `>=3.10` and align setup/readme docs
- Remove obsolete mypy ignore for `bson`
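
For readers wondering what replaces `bson.ObjectId`, a time-ordered ID built from `time.time_ns()` plus a UUID fragment might be sketched as below. This is an illustration under assumptions, not the merged implementation: the zero-padding width, the 8-character suffix length, and the exact layout of the string are guesses made for the example.

```python
import time
import uuid


def generate_time_ordered_id(prefix: str) -> str:
    """Build a time-ordered ID without bson (sketch; layout is assumed)."""
    # Zero-pad the nanosecond timestamp to a fixed width so that plain
    # string comparison orders IDs the same way as numeric comparison.
    timestamp = f"{time.time_ns():020d}"
    # A random UUID fragment keeps IDs distinct when two calls land on
    # the same timestamp; 8 hex characters is an arbitrary choice here.
    suffix = uuid.uuid4().hex[:8]
    return f"{prefix}_{timestamp}{suffix}"
```

Like `ObjectId`, this puts the timestamp first so IDs sort by creation time, but it needs only the standard library. IDs created within the same clock reading are ordered by their random suffix, so ordering is only meaningful across distinct timestamps.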
str: Time-ordered ID string
"""

return f"{prefix}_{str(bson.ObjectId())}"
Contributor

I do not know what bson was doing and what effects removing it has. Out of curiosity: can you maybe explain?

Contributor Author

bson was the reason the SDK was not compatible with Python 3.14 and later. With it removed, the imagehash library is now the remaining cause of incompatibility.

Collaborator

FYI: bson.objectid is used for MongoDB

Contributor Author

which we don't use anymore

@mlikasam-askui mlikasam-askui merged commit 8204c91 into main Apr 3, 2026
1 check passed
@mlikasam-askui mlikasam-askui deleted the refactor/remove-unnecessary-dependencies branch April 3, 2026 09:58