Inventor: R. Claude Armstrong · Everett, WA
Structured Semantic Compression Algorithm
Engineering Overview
& Scope Reference
Version 0.9 · Patent Pending · February 2026
①
Zero Meaning Loss
Lossless decompression to original meaning is non-negotiable for all mission-critical data.
②
Maximum Efficiency
Every layer, every routing decision optimizes the four walls: energy · hardware · infrastructure · cooling.
③
No Accident Permitted
Like Tesla's zero-accident standard, SSCA treats any data corruption as an absolute system failure.
Section 01
What Structured Semantic Compression Algorithm Is
SSCA is a multi-layer, lossless text compression platform built on semantic lookup tables. Unlike character-level or entropy-based compression (gzip, zstd, LZ4), SSCA operates at the meaning level — replacing words, phrases, and semantic concepts with compact symbols drawn from a 247-primitive universal lookup table, then cascading seven specialized compression layers on top.
Core Insight: "Rapid," "fast," "quick," and "swift" represent the same mental image. Enterprise server logs repeat the same structural template millions of times per day. Legal documents restate the same clause dozens of times per contract. SSCA exploits each type of redundancy with a dedicated layer designed for exactly that pattern.
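To make the meaning-level substitution concrete, here is a minimal sketch of a dictionary-driven word-to-symbol pass in the style of Layer 1. The table entries and symbols are illustrative stand-ins; the real SSCA dictionary has ~550 entries and its symbol inventory is not reproduced here.

```python
# Minimal L1-style symbol substitution sketch (illustrative entries only;
# the production dictionary has ~550 word/phrase -> §symbol mappings).
SYMBOL_TABLE = {
    "approximately": "§~",
    "therefore": "§∴",
    "configuration": "§cfg",
}
REVERSE_TABLE = {v: k for k, v in SYMBOL_TABLE.items()}

def compress_l1(text: str) -> str:
    """Replace known words with their shorter symbols (word-level tokens)."""
    return " ".join(SYMBOL_TABLE.get(tok, tok) for tok in text.split())

def decompress_l1(text: str) -> str:
    """Invert the substitution — lossless by construction."""
    return " ".join(REVERSE_TABLE.get(tok, tok) for tok in text.split())

s = "therefore the configuration is approximately stable"
assert decompress_l1(compress_l1(s)) == s   # zero meaning loss round-trip
```

Because the reverse table is built directly from the forward table, the round-trip is lossless by construction, which is the property the TriplePlay mandate demands of every layer.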
The architecture was conceived by R. Claude Armstrong, an 81-year-old independent engineer with 75+ years of pattern recognition experience across industrial systems maintenance, wastewater treatment, welding processes, and medical equipment maintenance.
Section 02
The 8-Layer Compression Stack
Three tiers: open-source foundation (L1–L4), proprietary domain layers (L5–L7), and an optional specialty module (L8). Compression gains are multiplicative, not additive.
| Layer | Name | Tier | Gain | Status | Mechanism |
|---|---|---|---|---|---|
| L1 | Symbol Substitution | OPEN | 20–30% | ✓ Prod | 550-entry word/phrase → §symbol dictionary. Word-level token replacement. |
| L2 | Contextual Compression | OPEN | +7–12% | ✓ Prod | Bigram/trigram collocations (~200 patterns). Phrase-level token replacement. |
| L3 | Hierarchical Abstraction | OPEN | +10–15% | ⚠ Bug | Hypernym substitution (sedan→car). BYPASSED for precision data types. |
| L4 | Predictive Inference | OPEN | +8–15% | ✓ Prod | Omission map: highly predictable words dropped, position map stored. |
| L5 | Data-Driven Dictionary | PROP | +10–25% | ✓ Prod | Learns corpus-optimal symbols. Extends L1 dictionary with training-derived high-frequency symbols. |
| L6 | Template Repetition | PROP | +20–40% | ✓ Prod | Detects recurring structural templates (logs, forms, contracts). Stores template once + delta variables. Strongest on telemetry. |
| L7 | Cross-Reference | PROP | +10–25% | ✓ Prod | Replaces repeated long strings with §RN reference IDs. Works across document boundaries via a shared reference registry. |
| L8 | Metaphor Compression | OPT | ~0–5% | Optional | Lakoff/Johnson metaphor families (25+ categories). Near-zero gain on enterprise data. Not in default cascade. |
Known L3 bug (the ⚠ status above): the symbol format §H:building (11 characters) is longer than the 8-character word it replaces, producing negative compression:

IN:  he drove his sedan to the office building
OUT: he drove his §H:car to the §H:building building (-14.6%)

The fix replaces the embedded-hypernym format with a short counter-based symbol:

# current (buggy): symbol embeds the full hypernym string
symbol = f"§H:{general}"

# fixed: short numbered symbol assigned from a counter
self._h_counter += 1
symbol = f"§H{self._h_counter}"
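Putting the counter fix and the lookahead together, a sketch of the corrected L3 pass might look like the following. Class and attribute names are hypothetical; the real implementation lives in hierarchy_dictionary.py and compressor.py, and the hypernym table here is a tiny illustrative subset.

```python
# Sketch of the fixed L3 behavior (hypothetical names; illustrative hypernyms).
HYPERNYMS = {"sedan": "car", "hatchback": "car"}

class HierarchyCompressor:
    """Counter-based §H symbols instead of §H:<word>, plus collision lookahead."""

    def __init__(self):
        self._h_counter = 0
        self._by_hypernym = {}   # hypernym -> short symbol (reused per concept)
        self.symbol_map = {}     # symbol -> hypernym, shipped for decompression

    def _symbol_for(self, general: str) -> str:
        if general not in self._by_hypernym:
            self._h_counter += 1
            sym = f"§H{self._h_counter}"          # e.g. §H1: 3–4 chars, not 11
            self._by_hypernym[general] = sym
            self.symbol_map[sym] = general
        return self._by_hypernym[general]

    def compress(self, text: str) -> str:
        tokens, out = text.split(), []
        for i, tok in enumerate(tokens):
            general = HYPERNYMS.get(tok)
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            # lookahead: never substitute when the hypernym itself follows,
            # which is what produced "§H:building building"
            if general and nxt != general:
                out.append(self._symbol_for(general))
            else:
                out.append(tok)
        return " ".join(out)
```

Note that decompression recovers the hypernym, not the original word; that is the documented semantic-lossy behavior of L3, which is why the precision flag bypasses it entirely.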
Section 03
DNA/P³ Router — Intelligent Controller
The front-end classifier that sits ahead of the entire compression stack. Inspects incoming data, determines the optimal layer cascade, and produces a RoutingPacket.
RoutingPacket Format — Fixed Overhead: 8–14 Bytes
| Field | Size | Description |
|---|---|---|
| pipeline | 1 byte | Enum: one of 8 pipeline identifiers (NL_GEN, MD_PREC, etc.) |
| primitive_ids | 2 bytes each (max 10) | Routing index: [action_id, domain_id, complexity_id, precision_flag]. Used to configure parameters — not reconstruction data. |
| domain_confidence | 4 bits | Confidence 0.0–1.0, quantized to 4 bits. Below 0.70 triggers structural fallback. |
| complexity_tier | 2 bits | 1=simple (<10 words), 2=moderate, 3=complex, 4=expert (>200 words) |
| precision_required | 1 bit | True = semantic-lossy layers (L3, parts of L2) are bypassed |
| is_pre_compressed | 1 bit | True = bypass all layers, direct output |
| original_payload | N bytes | The actual data. Always present. Never discarded. |
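The table above can be restated as a data structure. The sketch below uses the field names from the table; the pipeline enum values and the way fields are packed on the wire are assumptions for illustration, not the actual serialized format.

```python
from dataclasses import dataclass
from typing import List

# Illustrative pipeline order; the real enum encoding is not specified here.
PIPELINES = ["NL_GEN", "NL_DOM", "ST_REP", "LG_TECH",
             "MD_PREC", "CD_SYN", "BIN_PASS", "ST_CHK"]

@dataclass
class RoutingPacket:
    pipeline: int                # 1 byte: index into the 8 pipeline identifiers
    primitive_ids: List[int]     # 2 bytes each, max 10: routing index only
    domain_confidence: float     # quantized to 4 bits on the wire
    complexity_tier: int         # 2 bits: 1 (simple) .. 4 (expert)
    precision_required: bool     # 1 bit: bypass semantic-lossy layers
    is_pre_compressed: bool      # 1 bit: bypass all layers entirely
    original_payload: bytes      # always present, never discarded

    def needs_fallback(self) -> bool:
        # below 0.70 the router falls back to structural pipelines
        return self.domain_confidence < 0.70
```

The key design point the table makes survives in code: the header configures the cascade, while original_payload always travels with it, so a routing mistake can never destroy data.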
8 Pipelines the Router Can Select
| Pipeline | Data Type | Layer Cascade | Notes |
|---|---|---|---|
| NL_GEN | General Text | L1→L2→L3→L4 | Default path. Semantic abstraction active. |
| NL_DOM | Domain Text | L1→L2→L3→L4→L5 | L5 corpus dict extends L1. |
| ST_REP | Structured/Logs | L1→L6→L7 | Best compression ratios. Template detection. |
| LG_TECH | Legal/Technical | L1→L2→L4→L7 | L3 skipped — no semantic abstraction on legal. |
| MD_PREC | Medical/Precision | L1→L4→L7 | L2 & L3 skipped. Precision flag active. |
| CD_SYN | Code/Syntax | L1→L6 | L3 skipped. Structural templates only. |
| BIN_PASS | Pre-Compressed | BYPASS | Magic-byte or entropy >7.5 hit. Pass-through. |
| ST_CHK | Streaming >100MB | L1→L2 (chunked) | Sliding window. Chunked with overlap. |
Section 04
Semantic Lookup Cascade — Three Tiers
At the heart of SSCA's compression philosophy is a three-tier semantic lookup cascade where the 247 universal primitives are applied. The tiers run in series; each handles the concepts the previous tier did not match.
A
Base Primitive Lookup
65 NSM Universal Primes
Wierzbicka/Goddard Natural Semantic Metalanguage primes — irreducible meaning atoms that are universal across languages. If a token matches an NSM prime, it is encoded as a prime symbol and exits the cascade. Examples: I · YOU · DO · KNOW · WANT · GOOD · BAD · BECAUSE · IF · WHEN
B
Valence Modifier Lookup
4 × 22 Compound Meanings
4 base reaction primitives (WHAT · WHERE · HOW · WHY) crossed with 22 Hebrew valence modifiers to produce compound meaning representations. Example: "intensely" → HOW × intensity-modifier.
C
Image-Equivalence Tables
247 SSCA Primitives + Domain
Word clusters that map to the same mental image regardless of phrasing. "Rapid" = "fast" = "quick" = "swift" → single image symbol. This is where L3's hypernym dictionary connects to the semantic cascade.
⚕ Medical precision example: "etiology: bacterial pneumonia" must decompress to exactly "etiology: bacterial pneumonia" — not "cause: lung infection." The precision flag enforces this at the routing level, not the individual layer level.
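The serial A→B→C fallback described above can be sketched as three lookups tried in order. The tables below are tiny illustrative stand-ins for the 65 NSM primes, the 4×22 valence grid, and the 247-primitive image-equivalence tables.

```python
# Serial three-tier cascade sketch: each tier handles what the previous
# tier did not match. Table contents are illustrative stand-ins only.
NSM_PRIMES = {"i", "you", "do", "know", "want", "good", "bad"}
VALENCE = {"intensely": ("HOW", "intensity")}
IMAGE_EQUIV = {w: "IMG_FAST" for w in ("rapid", "fast", "quick", "swift")}

def classify(token: str):
    t = token.lower()
    if t in NSM_PRIMES:                 # Tier A: prime match exits the cascade
        return ("PRIME", t.upper())
    if t in VALENCE:                    # Tier B: base primitive × valence modifier
        return ("VALENCE", VALENCE[t])
    if t in IMAGE_EQUIV:                # Tier C: shared mental-image symbol
        return ("IMAGE", IMAGE_EQUIV[t])
    return ("LITERAL", token)           # unmatched: pass through untouched

assert classify("quick") == classify("swift")   # same mental image, one symbol
```

The LITERAL fallthrough is what the precision flag relies on: when semantic tiers are bypassed at the routing level, every token takes that path and decompresses to exactly itself.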
Section 05
Compression Claims — Corrected
The original per-layer READMEs stated cumulative compression claims as if each layer's gain were additive against the original input. This is mathematically incorrect: each layer compresses what remains after the previous layer, so gains are multiplicative. Claims that can be disproved by running the code on real data are a liability.
| Layer Stack | Data Type | Claimed | Honest Estimate | Notes |
|---|---|---|---|---|
| L1–L4 | General text | 50–60% | ~34–48% | Realistic for mixed prose corpora |
| L1–L5 | Domain-specific text | 65–75% | ~41–61% | Medical, legal, scientific corpora |
| L1–L6 | Server logs / telemetry | 80–90% | ~70–85% | Defensible: genuine template repetition |
| L1–L7 | Legal / technical docs | 80–95% | ~65–80% | Defensible: clause cross-reference heavy |
The log/telemetry and legal document claims are defensible because L6 and L7 were purpose-built for highly repetitive structured data. The general text claims require recalibration against real corpora before they appear in any external communications.
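The multiplicative arithmetic is easy to make precise. Each layer's gain applies to the data remaining after the previous layer, so the stacked gain is one minus the product of the remainders. The per-layer figures below are illustrative mid-range values, not measured results.

```python
# Multiplicative stacking: each layer compresses what the previous one left.
def stacked_gain(layer_gains):
    remaining = 1.0
    for g in layer_gains:
        remaining *= (1.0 - g)      # each gain applies to the remainder only
    return 1.0 - remaining

# Illustrative mid-range per-layer gains for L1–L4: 25%, 10%, 12%, 11%.
total = stacked_gain([0.25, 0.10, 0.12, 0.11])
print(f"{total:.1%}")   # prints 47.1%
```

An additive reading of those same four numbers would claim 58%; the compounded figure of ~47% is why the honest L1–L4 estimate sits in the ~34–48% band rather than 50–60%.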
Section 06
OCR / PDF Pre-Processing Pipeline
Fully architected but not yet implemented. The pathway for document-heavy enterprise deployments — legal archives, medical records, insurance documents.
1
Scanned document / image PDF arrives at OCR/PDF Input node.
2
Decision: Is the PDF text-extractable? YES → PDF Text Extractor. NO → Image→PDF Converter (avoid OCR where possible).
3
Doc ID Tagger: assigns a unique DocID that links the text stream to any accompanying image stream (charts, signatures, embedded photos).
4
Tagged text stream enters the main SSCA stack via the domain classifier.
5
Image stream travels separately, compressed by existing image compressors (JPEG2000, AVIF, JBIG2).
6
At destination: DocID reunites the text stream and image stream into the original document structure.
Without DocID tagging, the image and text streams are separate files at the destination with no structural relationship. DocID is the linking mechanism that makes SSCA output a drop-in replacement for the original document — not just compressed text fragments.
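A minimal sketch of the DocID linking mechanism follows. The class and method names are hypothetical (this pipeline is not yet implemented); the sketch only shows the invariant that matters: both streams must carry the same DocID before the document can be reassembled.

```python
# Hypothetical DocID registry sketch for reuniting text and image streams.
from collections import defaultdict

class DocRegistry:
    def __init__(self):
        # doc_id -> {"text": <compressed text>, "images": <image blobs>}
        self._streams = defaultdict(dict)

    def add_text(self, doc_id: str, compressed_text: bytes) -> None:
        self._streams[doc_id]["text"] = compressed_text

    def add_images(self, doc_id: str, image_blobs: list) -> None:
        self._streams[doc_id]["images"] = image_blobs

    def reunite(self, doc_id: str) -> dict:
        parts = self._streams[doc_id]
        # both streams must be present to rebuild the original document
        if "text" not in parts or "images" not in parts:
            raise KeyError(f"incomplete streams for {doc_id}")
        return parts
```

Without the shared key, the destination holds two unrelated files; with it, reassembly is a single lookup.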
Section 07
Current Technical State — Assessment
Engineers considering contributing can treat this as the accurate starting point. State of the codebase as reviewed in February 2026.
Working & Production-Ready
- Layer 1 — Symbol Substitution ✓
- Layer 2 — Contextual Compression ✓
- Layer 4 — Predictive Inference ✓
- Layer 5 — Data-Driven Dictionary ✓
- Layer 6 — Template Repetition ✓
- Layer 7 — Cross-Reference ✓
- ssca_router.py — runs, correct RoutingPackets
- Architecture documentation — extensive
Needs Work
- Layer 3 symbol format bug — CRITICAL
- DNA/P³ Router — needs standalone module
- PipelineExecutor stubs — not wired
- OCR/PDF pipeline — not implemented
- Benchmark validation — no real corpus yet
- Error handling — no recovery paths
- Scalability proof — latency budget needed
Section 08
Engineering Work Items — Ordered by Foundation Logic
Ordered by dependency, not complexity. Items 1 and 2 are blockers for everything else.
1
Fix L3 Symbol Format
Replace §H:building (11 chars) with §H14-style symbols (4 chars). Add a lookahead to prevent the "§H:building building" collision. Fixes the negative compression ratio.
CRITICAL · 1–2 days
2
Build DNA/P³ Router
Standalone module. Domain classifier, magic-byte check, entropy scan, parser tier routing, bypass controller. ssca_router.py is the spec/draft.
CRITICAL · 2–3 weeks
3
Wire Layers 1–7 into Router
Replace _apply_layer() stubs in PipelineExecutor with actual imports from each layer module. Define shared CompressedData type.
HIGH · 1–2 weeks
4
OCR/PDF Pre-Processing
PDF text extraction, image/text stream separation, DocID tagging. Connect left pipeline in flowchart to main stack.
HIGH · 3–4 weeks
5
Benchmark & Validate Claims
Run all layers against representative corpora (logs, legal, medical, prose). Replace additive % claims with measured multiplicative figures. Patent & investor evidence base.
HIGH · 2–3 weeks
6
Integration & Production API
Single callable interface: ingest → route → cascade → validate → output. Error handling, logging, config. Defines external API surface for pilots.
MEDIUM · 2–3 weeks
7
Address Critique Items
Error paths for parser failures. Scalability proof at 110-domain tier. Latency budget on edge devices (<1ms). Error recovery spec.
MEDIUM · Ongoing
Section 09
Why This Matters — The Four Walls
Data infrastructure cost — storage, transmission, compute, cooling — is the limiting constraint for AI companies, cloud providers, and any organization operating at scale. SSCA addresses all four simultaneously because they are all driven by the same root variable: data volume.
Energy
Less data stored and transmitted means less I/O, fewer compute cycles, and a measurably lower power draw across large infrastructure. The primary return on compression investment at hyperscale.
Hardware
Storage and memory that do not need to be purchased, racked, configured, or maintained. At enterprise scale, 30% compression across a petabyte-class cluster eliminates roughly 300 TB of required hardware.
Infrastructure
Bandwidth, data center floor space, cooling — all scale linearly with data volume. A single SSCA deployment touches all infrastructure cost lines simultaneously.
Cooling
One of the fastest-growing AI infrastructure costs. Less compute → less heat → less cooling required → less energy for cooling. The energy-cooling relationship is non-linear.
Section 10
Patent Position — Novel Claims
One provisional patent filed covering the first three data efficiency parameters of the 9-layer stack. The following claims are the defensible novel contributions as identified through development and AI-assisted review.
Novel Claims — ssca_router.py
O(1) Semantic Classification
Method for O(1) semantic classification of arbitrary text input using a pre-computed 247-primitive hash table — the classifier makes a routing decision in sub-millisecond time regardless of input size.
RoutingPacket Format
Fixed-overhead header (8–14 bytes) that configures a multi-layer compression pipeline without requiring content analysis at each layer — the header travels with the data, not instead of it.
Precision-Flag Mechanism
Single-bit signal that bypasses semantic-lossy layers for medical, legal, mathematical, and code content — enforced at the routing level before any layer processing begins.
Entropy-Based Pre-Compression Bypass
Magic-byte + entropy threshold check that prevents double-compression overhead on already-compressed content — protects against the most common compression pipeline error.
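The two-stage bypass check is straightforward to sketch: a magic-byte prefix test for known compressed containers, then a Shannon entropy measurement against the >7.5 bits/byte threshold stated above. The magic numbers shown (gzip, ZIP, Zstandard) are standard container signatures; the function names are illustrative.

```python
import math
from collections import Counter

# Standard container signatures: gzip, ZIP local-file header, Zstandard frame.
MAGIC = {b"\x1f\x8b": "gzip", b"PK\x03\x04": "zip", b"\x28\xb5\x2f\xfd": "zstd"}

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0–8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def should_bypass(data: bytes) -> bool:
    # Stage 1: cheap magic-byte check for known compressed containers.
    if any(data.startswith(m) for m in MAGIC):
        return True
    # Stage 2: near-random bytes mean compressing again only adds overhead.
    return shannon_entropy(data) > 7.5
```

Ordering matters: the magic-byte test is O(1) and catches the common case before the O(n) entropy scan runs.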
Architecture Claims
Three-Tier Semantic Cascade
NSM primes (Tier A) → valence modifiers (Tier B) → image-equivalence tables (Tier C) — serial fallback architecture for meaning classification.
DocID Tagging
Text and image streams compressed separately, reunited at destination via DocID — enables SSCA to process scanned document archives without discarding embedded images.
Layer 9 Dynamic Ontology Learning
Routing analytics feed back into classifier improvement over deployment lifetime — the system improves its classification accuracy from production traffic.
⚠ Patent Notice for Engineering Team
Flag immediately any implementation decision that modifies the behavior described in the patent claims above. Changes to the primitive fingerprint format, the routing overhead size, or the precision-flag bypass logic should trigger a provisional update review.
Section 11
File & Module Reference
Treat ssca_router.py as the primary integration reference — it shows the intended API for every module.
| File / Module | Status | Role |
|---|---|---|
| ssca_router.py | ✓ Works | DNA/P³ router spec + draft. DNARouter, DomainClassifier, RoutingPacket, PipelineExecutor (stubs). Run this first. |
| 01_Layer_1_Symbol_Substitution/ | ✓ Works | L1 module. Import: from layer1 import compress |
| 02_Layer_2_Contextual_Compression/ | ✓ Works | L2 module. Import: from layer2 import compress |
| 03_Layer_3_Hierarchical_Abstraction/ | ⚠ Bug | Fix symbol format in hierarchy_dictionary.py BEFORE wiring. Bug in _load_safe_hypernyms(). |
| 04_Layer_4_Predictive_Inference/ | ✓ Works | L4 module. Import: from layer4 import compress |
| 05_Layer_5_DataDriven/ | ✓ Works | L5 proprietary. Import: from layer5 import compress |
| 06_Layer_6_Template_Repetition/ | ✓ Works | L6 proprietary. Strong on telemetry/logs. |
| 07_Layer_7_Cross_Reference/ | ✓ Works | L7 proprietary. §RN reference IDs working. |
| ssca_dev_brief.html | Reference | Recruitment-facing overview page. Dark-themed. |
| ssca_flowchart.html | Reference | Interactive SVG flowchart. Pan/zoom enabled. Architecture reference. |
Section 12
Quick Start for a New Engineer
Recommended sequence for a senior engineer or CS team picking this up for the first time.
1
Run ssca_router.py — cd to its directory, python3 ssca_router.py. Read the output. Understand the RoutingPacket structure and the 8 pipelines.
2
Fix the L3 bug — open 03_Layer_3_.../hierarchy_dictionary.py. Change _load_safe_hypernyms() to generate §H14-style symbols. Add a lookahead in compressor.py. Verify positive compression ratios on all test cases.
3
Wire L1 into PipelineExecutor — replace the Layer 1 stub in _apply_layer() with from layer1 import compress; data = compress(data, **config). Test end-to-end: route a sentence, execute the pipeline, confirm compressed output is shorter than input.
4
Repeat for L2, L4, L5, L6, L7 — each should be a 3-line change per layer once you have the L1 pattern working.
5
Build the benchmark harness — collect representative test files for each pipeline type (prose, logs, legal, medical, code). Run each through the full stack. Record actual compression ratios. This becomes your evidence base.
6
OCR/PDF pipeline — implement PDF text extraction using pdfminer.six or PyPDF2. Implement DocID tagging. Wire into the receiving edge.
7
Production API surface — define the public interface: ssca.compress(data) → CompressedOutput, ssca.decompress(output) → original. This is what enterprise pilots will call.
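A sketch of what that step-7 surface could look like is below. The signatures match the ones named in step 7; the CompressedOutput type and the stub bodies are assumptions, since the real cascade wiring (steps 3–4) is a prerequisite.

```python
# Hypothetical public API surface sketch. The cascade is stubbed: this shows
# the contract (shape and round-trip guarantee), not real compression.
from dataclasses import dataclass

@dataclass
class CompressedOutput:
    routing_packet: bytes   # 8–14 byte header produced by the router
    payload: bytes          # output of the layer cascade

def compress(data: str) -> CompressedOutput:
    # real flow: ingest -> route -> cascade -> validate; stubbed here
    header = b"\x00" * 8                        # placeholder RoutingPacket
    return CompressedOutput(header, data.encode("utf-8"))

def decompress(output: CompressedOutput) -> str:
    # zero-meaning-loss contract: round-trip must reproduce the input exactly
    return output.payload.decode("utf-8")

assert decompress(compress("etiology: bacterial pneumonia")) == \
       "etiology: bacterial pneumonia"
```

Freezing this two-function contract early lets pilot integrations proceed while the cascade behind it is still being wired.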
→ The entire SSCA TriplePlay mandate — zero meaning loss, maximum efficiency, no accident permitted — should be visible as a framed principle in whatever project management system you use. Every PR, every design decision, every benchmark result should be evaluated against all three.
R. Claude Armstrong
Inventor · Structured Semantic Compression Algorithm
Patent Pending · Everett, Washington · © 2025–2026
SSCA = Safety · Security · Correct · Accurate