PDF-Manager

Production-ready PDF manager: upload a PDF, extract data with three OCR engines plus AI/RAG, edit fields interactively, and export.


Features

  • 📤 PDF Upload – drag-and-drop or browse; up to 50 MB
  • 🔍 Triple OCR Engine – Tesseract + EasyOCR + PaddleOCR with ensemble confidence scoring
  • 🤖 AI Field Extraction – NER (spaCy) + rule-based + RAG (LangChain + HuggingFace embeddings)
  • 🔥 Confidence Heatmaps – pixel-wise Green/Yellow/Red visualisation per word
  • 📊 Performance Dashboard – document quality score, regional scores, word confidence breakdown
  • 🖊️ Inline Editing – split layout: PDF viewer on left, editable fields on right
  • ⬇️ Export – JSON or CSV with full metadata and confidence scores
  • 📜 Edit History – all field edits are versioned and audited
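
To make "ensemble confidence scoring" concrete, here is a minimal sketch of one way three engines' readings of the same word could be merged. It is not taken from the project's ocr_engine.py. The `WordResult` fields, the `ensemble_vote` name, and the summed-confidence voting rule are all assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WordResult:
    """One recognised word from one engine (hypothetical shape)."""
    text: str
    confidence: float  # 0.0 to 1.0
    engine: str


def ensemble_vote(candidates: list[WordResult]) -> WordResult:
    """Pick the text with the highest summed confidence across engines.

    The merged confidence is that sum divided by the number of candidates,
    so a reading that only one engine supports is penalised.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    totals: dict[str, float] = defaultdict(float)
    for cand in candidates:
        totals[cand.text] += cand.confidence
    best_text = max(totals, key=totals.get)
    merged = totals[best_text] / len(candidates)
    return WordResult(best_text, round(merged, 3), "ensemble")


if __name__ == "__main__":
    votes = [
        WordResult("Invoice", 0.92, "tesseract"),
        WordResult("Invoice", 0.88, "easyocr"),
        WordResult("lnvoice", 0.61, "paddleocr"),
    ]
    print(ensemble_vote(votes))
```

Dividing by the number of candidates rather than by the number of agreeing engines is a deliberate choice in this sketch: disagreement between engines lowers the merged score, which is the signal a confidence heatmap would want to surface.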

Architecture

┌─────────────────────────────────────────────────────────┐
│                     Browser (React)                     │
│   ┌──────────────┐   ┌──────────────┐   ┌───────────┐   │
│   │  PDFViewer   │   │ FieldsEditor │   │ Heatmap / │   │
│   │  (react-pdf) │   │  (inline     │   │ Dashboard │   │
│   │  zoom/scroll │   │   edit)      │   │           │   │
│   └──────────────┘   └──────────────┘   └───────────┘   │
└────────────────────────────┬────────────────────────────┘
                             │ REST /api/v1
┌────────────────────────────▼────────────────────────────┐
│                 Flask Backend (Python)                  │
│  ┌───────────────────────────────────────────────────┐  │
│  │                 API v1 Blueprint                  │  │
│  │ POST /upload  POST /extract/ocr  POST /extract/ai │  │
│  │ GET  /fields  PUT  /fields/:id   GET  /heatmap    │  │
│  └──────────────────┬────────────────────────────────┘  │
│                     │                                   │
│  ┌──────────────────▼────────────────────────────────┐  │
│  │                     OCR Layer                     │  │
│  │  Tesseract ─┐                                     │  │
│  │  EasyOCR   ─┼─ Ensemble Merge → WordResult[]      │  │
│  │  PaddleOCR ─┘                                     │  │
│  └──────────────────┬────────────────────────────────┘  │
│                     │                                   │
│  ┌──────────────────▼────────────────────────────────┐  │
│  │                 Extraction Layer                  │  │
│  │  FieldDetector (NER + rules)                      │  │
│  │  RAGSystem (LangChain + sentence-transformers)    │  │
│  │  ConfidenceCalculator → DocumentQuality           │  │
│  │  HeatmapGenerator → JSON + PNG                    │  │
│  └──────────────────┬────────────────────────────────┘  │
│                     │                                   │
│  ┌──────────────────▼────────────────────────────────┐  │
│  │         SQLAlchemy (SQLite / PostgreSQL)          │  │
│  │ documents · extracted_fields · field_edit_history │  │
│  │ ocr_character_data · rag_embeddings · audit_logs  │  │
│  └───────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────┘
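
As a rough illustration of the ConfidenceCalculator and HeatmapGenerator boxes in the diagram, the sketch below maps per-word confidences to Green/Yellow/Red buckets and derives a document quality score. The threshold values, function names, and the "mean confidence as a percentage" definition of quality are assumptions; the project's actual cut-offs and formula are not stated in this README.

```python
from statistics import mean

# Hypothetical thresholds; the project's real cut-offs may differ.
GREEN_MIN = 0.85
YELLOW_MIN = 0.60


def confidence_colour(confidence: float) -> str:
    """Map a 0.0-1.0 word confidence to a heatmap colour bucket."""
    if confidence >= GREEN_MIN:
        return "green"
    if confidence >= YELLOW_MIN:
        return "yellow"
    return "red"


def document_quality(confidences: list[float]) -> dict:
    """Summarise word confidences as a 0-100 score plus bucket counts."""
    counts = {"green": 0, "yellow": 0, "red": 0}
    if not confidences:
        return {"score": 0.0, "counts": counts}
    for value in confidences:
        counts[confidence_colour(value)] += 1
    return {"score": round(mean(confidences) * 100, 1), "counts": counts}


if __name__ == "__main__":
    print(document_quality([0.9, 0.7, 0.5]))
```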

Project Structure

PDF-Manager/
├── backend/
│   ├── __init__.py
│   ├── config.py
│   ├── database.py
│   ├── models.py              (Pydantic API models)
│   ├── requirements.txt
│   ├── app.py                 (entry point)
│   ├── ocr/
│   │   ├── __init__.py
│   │   ├── ocr_engine.py      (Tesseract + EasyOCR + PaddleOCR)
│   │   ├── confidence_calculator.py
│   │   └── heatmap_generator.py
│   ├── extraction/
│   │   ├── __init__.py
│   │   ├── extractor.py       (orchestrator)
│   │   ├── rag_system.py      (LangChain + HuggingFace)
│   │   └── field_detector.py  (NER + rules)
│   ├── api/
│   │   ├── __init__.py
│   │   └── routes.py          (REST API v1 blueprint)
│   └── services/
│       ├── pdf_service.py
│       ├── ai_extraction_service.py
│       └── ml_service.py
├── frontend/
│   ├── public/index.html
│   ├── src/
│   │   ├── App.js
│   │   ├── components/
│   │   │   ├── PDFViewer.js           (react-pdf)
│   │   │   ├── FieldsEditor.js        (editable fields table)
│   │   │   ├── OCRConfidenceHeatmap.js
│   │   │   ├── PerformanceDashboard.js
│   │   │   └── ExtractionPage.js      (split layout orchestrator)
│   │   ├── services/api.js
│   │   └── styles/extraction.css
│   └── package.json
├── models.py                  (SQLAlchemy models)
├── blueprints/                (Flask web-UI blueprints)
├── templates/                 (Jinja2 HTML templates)
├── static/                    (CSS, JS for server-rendered UI)
├── database/                  (SQL init scripts)
├── docs/
│   ├── API.md
│   ├── ARCHITECTURE.md
│   └── SETUP.md
├── docker-compose.yml
└── README.md

Quick Start

Using Docker Compose (recommended)

git clone https://github.com/rahulmisra2010-ctrl/PDF-Manager.git
cd PDF-Manager
docker compose up --build

| Service | URL |
| --- | --- |
| Frontend (React) | http://localhost:3000 |
| Backend (Flask + UI/API) | http://localhost:5000 |
| Dashboard login | http://localhost:5000/auth/login |

Manual setup

If you prefer not to use Docker:

git clone https://github.com/rahulmisra2010-ctrl/PDF-Manager.git
cd PDF-Manager

python -m venv .venv
source .venv/bin/activate          # Windows: .venv\Scripts\activate
pip install -r backend/requirements.txt

cp .env.example .env               # optional; adjust values as needed
python app.py                      # opens http://localhost:5000

See docs/SETUP.md for more details.


.env Configuration

Where to place the file

The .env file must be created in the project root (the same directory that contains app.py):

PDF-Manager/          ← repository root
├── .env              ← place it here
├── .env.example      ← template to copy from
├── app.py
└── ...

app.py also checks for a backend/.env file for backwards compatibility, but the project root is the canonical location.
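
The lookup order described above (project root first, then backend/.env as a fallback) could be expressed roughly as follows. This is a sketch of the described behaviour, not the code in app.py; the `find_env_file` helper is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def find_env_file(project_root: Path) -> Optional[Path]:
    """Return the first existing .env file, preferring the project root."""
    candidates = (
        project_root / ".env",              # canonical location
        project_root / "backend" / ".env",  # backwards-compatible fallback
    )
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    found = find_env_file(Path.cwd())
    print(found if found else "no .env file found")
```

Running it from the repository root shows which file would be picked up if both exist.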


Creating the file

Copy the bundled template and edit it with your values:

cp .env.example .env

Then open .env in your editor and set at least the two critical keys described below.


Critical values

SECRET_KEY

Used by Flask to sign session cookies and CSRF tokens. Every restart with a new key invalidates all active sessions.

Generate a strong value:

python -c "import secrets; print(secrets.token_hex(32))"

Then set it in .env:

SECRET_KEY=<paste-the-generated-value-here>

⚠️ Duplicate-key warning: your .env file must contain exactly one SECRET_KEY line. If the key appears more than once, python-dotenv keeps only one of the values and silently ignores the others, which can cause confusing behaviour. Search the file before saving:

grep -n "SECRET_KEY" .env   # should print exactly one line
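
On systems without grep (for example a plain Windows shell), the same check can be done in Python. The sketch below is a simplified line-based scan, not python-dotenv's parser: it skips comments and blank lines, tolerates an `export ` prefix, and does not handle multi-line quoted values. The `duplicate_keys` name is hypothetical.

```python
from collections import Counter
from pathlib import Path


def duplicate_keys(env_text: str) -> list[str]:
    """Return keys assigned more than once, in order of first appearance."""
    keys = []
    for raw in env_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and lines without an assignment
        if line.startswith("export "):
            line = line[len("export "):]
        keys.append(line.split("=", 1)[0].strip())
    counts = Counter(keys)
    return [key for key in dict.fromkeys(keys) if counts[key] > 1]


if __name__ == "__main__":
    env_path = Path(".env")
    if env_path.is_file():
        print(duplicate_keys(env_path.read_text(encoding="utf-8")) or "no duplicate keys")
```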

ADMIN_PASSWORD

Password for the auto-created admin account on first run. Leave it blank to have the app generate and print a random password at startup, but always set an explicit password in production:

ADMIN_PASSWORD=<your-strong-password>
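
The fallback described above (use the configured password, otherwise generate one and report it) might look something like this. It is a sketch under assumptions: the `resolve_admin_password` name, the 16-byte token length, and returning a flag instead of printing are illustrative choices, not the application's actual startup code.

```python
import secrets


def resolve_admin_password(env: dict[str, str]) -> tuple[str, bool]:
    """Return (password, was_generated) for the initial admin account."""
    configured = env.get("ADMIN_PASSWORD", "").strip()
    if configured:
        return configured, False
    # Blank or missing: fall back to a random, URL-safe password.
    return secrets.token_urlsafe(16), True


if __name__ == "__main__":
    import os

    password, generated = resolve_admin_password(dict(os.environ))
    if generated:
        print(f"Generated admin password: {password}")
```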

Production vs development settings

| Setting | Development | Production |
| --- | --- | --- |
| DEBUG | true | false |
| SECRET_KEY | Any non-empty string | Cryptographically random value (≥ 32 hex chars) |
| ADMIN_PASSWORD | Convenient test value | Strong, unique password |
| DATABASE_URL | sqlite:///instance/pdf_manager.db | PostgreSQL connection string |
| ALLOWED_ORIGINS | http://localhost:3000 | Your real frontend domain(s) |

A minimal development .env:

DEBUG=true
SECRET_KEY=dev-change-me
ADMIN_PASSWORD=dev-admin-please-change
DATABASE_URL=sqlite:///instance/pdf_manager.db

A minimal production .env:

DEBUG=false
SECRET_KEY=<output-of-secrets.token_hex(32)>
ADMIN_PASSWORD=<strong-unique-password>
DATABASE_URL=postgresql://pdfmanager:<password>@db-host:5432/pdfmanager
ALLOWED_ORIGINS=["https://your-frontend-domain.com"]
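
Before deploying, the production rules from the table can be checked mechanically. The following is a minimal sketch, not part of the project: `production_problems` is a hypothetical helper, and it assumes ALLOWED_ORIGINS is written as a JSON list, as in the production example above.

```python
import json
import re

# At least 32 hexadecimal characters, as the table recommends.
HEX_KEY = re.compile(r"[0-9a-fA-F]{32,}")


def production_problems(env: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the checks pass."""
    problems = []
    if env.get("DEBUG", "").lower() != "false":
        problems.append("DEBUG must be false")
    if not HEX_KEY.fullmatch(env.get("SECRET_KEY", "")):
        problems.append("SECRET_KEY should be at least 32 hex characters")
    if not env.get("ADMIN_PASSWORD"):
        problems.append("ADMIN_PASSWORD must be set explicitly")
    if not env.get("DATABASE_URL", "").startswith("postgresql://"):
        problems.append("DATABASE_URL should point at PostgreSQL")
    try:
        origins = json.loads(env.get("ALLOWED_ORIGINS", "[]"))
    except json.JSONDecodeError:
        origins = None
    if not isinstance(origins, list) or not origins:
        problems.append("ALLOWED_ORIGINS should be a non-empty JSON list")
    return problems


if __name__ == "__main__":
    import os

    for problem in production_problems(dict(os.environ)):
        print("WARNING:", problem)
```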

Support

If you run into any problems locating, creating, or editing your .env, for either production or development, open an issue in this repository and include the error message you are seeing (but never paste the actual secret values).



Project Structure

PDF-Manager/
├── app.py                     # Flask application factory (root)
├── pdf_manager_app.py         # Convenience entry point / demo runner
├── backend/
│   ├── app.py                 # Wrapper that loads the root app.py
│   ├── models.py              # SQLAlchemy models
│   ├── config.py              # Environment-based configuration
│   ├── routes/
│   │   └── pdf_routes.py      # REST endpoints (legacy)
│   ├── services/
│   │   ├── pdf_service.py     # PDF extraction & export (PyMuPDF + OpenCV)
│   │   └── ml_service.py      # ML field classification (PyTorch)
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── App.js
│   │   ├── components/
│   │   │   ├── UploadPDF.js
│   │   │   ├── DataDisplay.js
│   │   │   └── EditData.js
│   │   └── services/
│   │       └── api.js
│   └── package.json
├── database/
│   ├── schema.sql              # PostgreSQL table definitions
│   └── init.sql                # Role creation & seed data
├── docs/
│   ├── PHASE_1.md              # Week 1-2 development guide
│   ├── ARCHITECTURE.md         # System design
│   ├── API_DOCS.md             # REST API reference
│   └── SETUP.md                # Local development setup
├── docker-compose.yml
├── requirements.txt            # Root-level (delegates to backend/)
└── .gitignore

A note on entry points: the root-level app.py is the primary Flask application. backend/app.py is a thin compatibility wrapper so the app can also be started from inside the backend/ directory, but python app.py from the repository root is the recommended command.


Documentation

| Document | Description |
| --- | --- |
| docs/PHASE_1.md | Week 1-2 task checklist and acceptance criteria |
| docs/ARCHITECTURE.md | System design and component diagram |
| docs/API_DOCS.md | Full REST API reference |
| docs/SETUP.md | Local development setup guide |

Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | React 18 |
| Backend | Flask (Python 3.11) |
| PDF parsing | PyMuPDF |
| Image processing | OpenCV |
| ML | PyTorch |
| Database | SQLite (default) / PostgreSQL (optional) |
| Containerisation | Docker Compose |

License

MIT
