Layers 10 and 11 extend SSCA to handle scanned documents (PDFs, images with text). These layers achieve dramatic compression (5-10x better than standard PDF compression) through the following steps:
Layer 10 (OCR Extraction): Extract text from images using OCR
Layer 11 (PDF Reconstruction): Store minimal image data + text separately
SSCA Layers 1-6: Compress extracted text using semantic compression (lossless)
Reconstruction: Rebuild readable PDF from compressed text + minimal image scaffold
Key Innovation:
Traditional PDF compression: Compress the IMAGE (output is 90-95% of original size)
SSCA + OCR: Extract TEXT → compress semantically + tiny image scaffold (output is 5-10% of original size)
Result: 90-95% size reduction on text-heavy scanned documents
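To make the size comparison concrete, here is a minimal sketch that applies the quoted ranges to a single document, reading each percentage as output size relative to the original. The 2 MB page size and the helper name `estimate_sizes` are illustrative assumptions, not measurements.

```python
def estimate_sizes(original_bytes: int) -> dict:
    """Apply the size ranges quoted in the text to one document.

    The percentages are the document's stated ranges, not measured results.
    Integer arithmetic avoids floating-point rounding surprises.
    """
    return {
        # Traditional image compression: output is 90-95% of the original.
        "traditional": (original_bytes * 90 // 100, original_bytes * 95 // 100),
        # SSCA + OCR: output is 5-10% of the original.
        "ssca_ocr": (original_bytes * 5 // 100, original_bytes * 10 // 100),
    }


sizes = estimate_sizes(2_000_000)  # hypothetical 2 MB scanned page
for method, (low, high) in sizes.items():
    print(f"{method}: {low:,} to {high:,} bytes")
```

Under these assumed ranges, the same page would occupy roughly 1.8-1.9 MB with traditional compression and 100-200 KB with the OCR-based pipeline.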
Critical Feature: SSCA’s lossless compression adds no errors beyond those made by OCR itself. The extracted text decompresses exactly, avoiding the “OCR drift” problem in which errors accumulate across processing cycles.
Problem Statement – Current PDF Compression Limitations
Scenario: A law firm has 10,000 scanned contracts (PDF format)
Compression Flow:
INPUT: Scanned PDF or Image
│
├─► Layer 10: OCR Text Extraction
│   │
│   └─ Detect text regions → OCR to text → Preserve layout
│       │
│       └─ Create low-res image scaffold
│
├─► TEXT DATA → Layers 1-6: SSCA Semantic Compression (lossless)
│
└─► IMAGE SCAFFOLD → JPEG/PNG Compression (traditional)
    │
    └─ Layer 11: Combine & Package
        │
        └─ Compressed text + compressed scaffold + layout metadata
            │
            └─ OUTPUT: .ssca file (5-10% of original size)
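The packaging step (Layer 11) can be sketched as follows. The actual .ssca container layout is not specified in this document, so the format here (a 4-byte signature followed by three length-prefixed sections) is a hypothetical one, and `zlib` stands in for SSCA Layers 1-6 purely because it is a readily available lossless codec. The function name `package` and the layout fields are likewise assumptions.

```python
import json
import struct
import zlib

MAGIC = b"SSCA"  # hypothetical 4-byte container signature


def package(text: str, scaffold: bytes, layout: list) -> bytes:
    """Combine compressed text, image scaffold, and layout metadata."""
    sections = [
        zlib.compress(text.encode("utf-8")),  # stand-in for SSCA Layers 1-6
        scaffold,  # assumed to be JPEG/PNG-compressed already
        zlib.compress(json.dumps(layout).encode("utf-8")),
    ]
    blob = MAGIC
    for section in sections:
        # Prefix each section with its length as a big-endian uint32.
        blob += struct.pack(">I", len(section)) + section
    return blob


# Hypothetical OCR output: one text region with a bounding box in pixels.
text = "This Agreement is made on 1 May."
layout = [{"page": 1, "bbox": [72, 90, 540, 110], "start": 0, "end": len(text)}]
blob = package(text, b"placeholder-scaffold-bytes", layout)
print(len(blob), "bytes packaged")
```

Length-prefixed sections let a reader skip straight to the part it needs, for example extracting the text for search without decoding the scaffold.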
Decompression Flow:
INPUT: .ssca file
│
├─► Layer 11: Unpackage
│   │
│   └─ Extract compressed text + scaffold + layout
│
├─► Layers 1-6: SSCA Decompress Text (lossless)
│
└─► Decompress Image Scaffold
    │
    └─ Layer 10: PDF Reconstruction
        │
        ├─ Render scaffold as background
        ├─ Overlay text in correct positions
        └─ Apply formatting
            │
            └─ OUTPUT: Reconstructed PDF (searchable, editable)
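The reconstruction step can be sketched as building an ordered list of drawing operations: the scaffold first, as background, then each text span at its stored position. The `Region` fields, the operation names, and the top-left coordinate origin are assumptions for illustration. No PDF library is assumed; a real implementation would pass these operations to one, and the formatting step is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical layout entry: where a slice of the text belongs."""
    page: int
    x: float
    y: float  # assumed to grow downward from the top of the page
    start: int  # offset into the decompressed text
    end: int


def plan_page(text: str, regions: list, page: int) -> list:
    """Return drawing operations for one page, background first."""
    ops = [("draw_scaffold", page)]  # scaffold is rendered underneath
    on_page = [r for r in regions if r.page == page]
    # Emit text in reading order: top to bottom, then left to right.
    for r in sorted(on_page, key=lambda r: (r.y, r.x)):
        ops.append(("draw_text", r.x, r.y, text[r.start:r.end]))
    return ops


title = "MASTER SERVICES AGREEMENT"
body = "This Agreement is made on 1 May."
text = title + body
regions = [
    Region(page=1, x=72, y=120, start=len(title), end=len(text)),
    Region(page=1, x=72, y=90, start=0, end=len(title)),
]
for op in plan_page(text, regions, page=1):
    print(op)
```

Storing text as offsets into one decompressed string keeps the layout metadata small, since positions are recorded once and the text itself is never duplicated.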
Critical: SSCA guarantees perfect fidelity to the extracted text. Any OCR errors are preserved exactly as extracted; they do not propagate further or accumulate.
Critical Feature: Lossless Text Preservation
The OCR Error Propagation Problem:
Standard OCR compression: Original PDF → OCR → Text (with errors) → Compress → Decompress → Display (errors remain)
SSCA Solution: Original PDF → OCR → Text (with possible errors) → SSCA Compress → SSCA Decompress → EXACT same text (lossless)
Key insight: OCR may introduce errors during extraction, but once the text is extracted, SSCA preserves exactly that text. There is no additional degradation: repeated compress/decompress cycles introduce no new errors and reconstruct the text bit-for-bit every time.
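The round-trip property can be checked with a short sketch. SSCA itself is not available here, so `zlib` is used as a stand-in lossless codec; the sample sentence and its OCR error are hypothetical.

```python
import hashlib
import zlib


def roundtrip(text: str, cycles: int) -> str:
    """Compress and decompress `text` repeatedly with a lossless codec."""
    data = text.encode("utf-8")
    for _ in range(cycles):
        data = zlib.decompress(zlib.compress(data))  # stand-in for SSCA
    return data.decode("utf-8")


def sha256_hex(text: str) -> str:
    """Digest used to compare texts byte for byte."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical OCR output with an extraction error: the letter "O"
# was read in place of the digit "0" in "30".
ocr_text = "The borrower shall repay the loan within 3O days."
restored = roundtrip(ocr_text, cycles=100)

print(restored == ocr_text)
print(sha256_hex(restored) == sha256_hex(ocr_text))
```

The OCR error survives every cycle unchanged: lossless compression preserves the extracted text but does not correct it, so extraction accuracy still has to be verified separately.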
Why This Matters:
Legal Documents: Contract terms must be exact; a single-character error can change the meaning
Medical Records: Patient names and dosages must be exact; a single wrong character can be life-threatening