Inventor: R. Claude Armstrong · Everett, WA
Structured Semantic Compression Algorithm
Engineering Overview
& Scope Reference
Version 0.9 · Patent Pending · February 2026
①
Zero Meaning Loss
Lossless decompression to original meaning is non-negotiable for all mission-critical data.
②
Maximum Efficiency
Every layer, every routing decision optimizes the four walls: energy · hardware · infrastructure · cooling.
③
No Accident Permitted
Like Tesla's zero-accident standard, SSCA treats any data corruption as an absolute system failure.
Section 01
What Structured Semantic Compression Algorithm Is
SSCA is a multi-layer, lossless text compression platform built on semantic lookup tables. Unlike character-level or entropy-based compression (gzip, zstd, LZ4), SSCA operates at the meaning level — replacing words, phrases, and semantic concepts with compact symbols drawn from a 247-primitive universal lookup table, then cascading seven specialized compression layers on top.
Core Insight: "Rapid," "fast," "quick," and "swift" represent the same mental image. Enterprise server logs repeat the same structural template millions of times per day. Legal documents restate the same clause dozens of times per contract. SSCA exploits each type of redundancy with a dedicated layer designed for exactly that pattern.
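To make the meaning-level substitution concrete, here is a minimal sketch of a dictionary-driven word-to-symbol pass in the style of Layer 1. The table entries and symbols are illustrative stand-ins; the real SSCA dictionary has ~550 entries and its symbol inventory is not reproduced here.

```python
# Minimal L1-style symbol substitution sketch (illustrative entries only;
# the production dictionary has ~550 word/phrase -> §symbol mappings).
SYMBOL_TABLE = {
    "approximately": "§~",
    "therefore": "§∴",
    "configuration": "§cfg",
}
REVERSE_TABLE = {v: k for k, v in SYMBOL_TABLE.items()}

def compress_l1(text: str) -> str:
    """Replace known words with their shorter symbols (word-level tokens)."""
    return " ".join(SYMBOL_TABLE.get(tok, tok) for tok in text.split())

def decompress_l1(text: str) -> str:
    """Invert the substitution — lossless by construction."""
    return " ".join(REVERSE_TABLE.get(tok, tok) for tok in text.split())

s = "therefore the configuration is approximately stable"
assert decompress_l1(compress_l1(s)) == s   # zero meaning loss round-trip
```

Because the reverse table is built directly from the forward table, the round-trip is lossless by construction, which is the property the TriplePlay mandate demands of every layer.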
The architecture was conceived by R. Claude Armstrong, an 81-year-old independent engineer with 75+ years of pattern recognition experience across industrial systems maintenance, wastewater treatment, welding processes, and medical equipment maintenance.
Section 02
The 8-Layer Compression Stack
Three tiers: open-source foundation (L1–L4), proprietary domain layers (L5–L7), and an optional specialty module (L8). Compression gains are multiplicative, not additive.
| Layer | Name | Tier | Gain | Status | Mechanism |
|---|---|---|---|---|---|
| L1 | Symbol Substitution | OPEN | 20–30% | ✓ Prod | 550-entry word/phrase → §symbol dictionary. Word-level token replacement. |
| L2 | Contextual Compression | OPEN | +7–12% | ✓ Prod | Bigram/trigram collocations (~200 patterns). Phrase-level token replacement. |
| L3 | Hierarchical Abstraction | OPEN | +10–15% | ⚠ Bug | Hypernym substitution (sedan→car). BYPASSED for precision data types. |
| L4 | Predictive Inference | OPEN | +8–15% | ✓ Prod | Omission map: highly predictable words dropped, position map stored. |
| L5 | Data-Driven Dictionary | PROP | +10–25% | ✓ Prod | Learns corpus-optimal symbols. Extends L1 dictionary with training-derived high-frequency symbols. |
| L6 | Template Repetition | PROP | +20–40% | ✓ Prod | Detects recurring structural templates (logs, forms, contracts). Stores template once + delta variables. Strongest on telemetry. |
| L7 | Cross-Reference | PROP | +10–25% | ✓ Prod | Replaces repeated long strings with §RN reference IDs. Works across document boundaries via a shared reference registry. |
| L8 | Metaphor Compression | OPT | ~0–5% | Optional | Lakoff/Johnson metaphor families (25+ categories). Near-zero gain on enterprise data. Not in default cascade. |
Known L3 bug (the ⚠ status above): the symbol format §H:building (11 characters) is longer than the 8-character word it replaces, producing negative compression:

IN:  he drove his sedan to the office building
OUT: he drove his §H:car to the §H:building building (-14.6%)

The fix replaces the embedded-hypernym format with a short counter-based symbol:

# current (buggy): symbol embeds the full hypernym string
symbol = f"§H:{general}"

# fixed: short numbered symbol assigned from a counter
self._h_counter += 1
symbol = f"§H{self._h_counter}"
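Putting the counter fix and the lookahead together, a sketch of the corrected L3 pass might look like the following. Class and attribute names are hypothetical; the real implementation lives in hierarchy_dictionary.py and compressor.py, and the hypernym table here is a tiny illustrative subset.

```python
# Sketch of the fixed L3 behavior (hypothetical names; illustrative hypernyms).
HYPERNYMS = {"sedan": "car", "hatchback": "car"}

class HierarchyCompressor:
    """Counter-based §H symbols instead of §H:<word>, plus collision lookahead."""

    def __init__(self):
        self._h_counter = 0
        self._by_hypernym = {}   # hypernym -> short symbol (reused per concept)
        self.symbol_map = {}     # symbol -> hypernym, shipped for decompression

    def _symbol_for(self, general: str) -> str:
        if general not in self._by_hypernym:
            self._h_counter += 1
            sym = f"§H{self._h_counter}"          # e.g. §H1: 3–4 chars, not 11
            self._by_hypernym[general] = sym
            self.symbol_map[sym] = general
        return self._by_hypernym[general]

    def compress(self, text: str) -> str:
        tokens, out = text.split(), []
        for i, tok in enumerate(tokens):
            general = HYPERNYMS.get(tok)
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            # lookahead: never substitute when the hypernym itself follows,
            # which is what produced "§H:building building"
            if general and nxt != general:
                out.append(self._symbol_for(general))
            else:
                out.append(tok)
        return " ".join(out)
```

Note that decompression recovers the hypernym, not the original word; that is the documented semantic-lossy behavior of L3, which is why the precision flag bypasses it entirely.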
Section 03
DNA/P³ Router — Intelligent Controller
The front-end classifier that sits ahead of the entire compression stack. Inspects incoming data, determines the optimal layer cascade, and produces a RoutingPacket.
RoutingPacket Format — Fixed Overhead: 8–14 Bytes
| Field | Size | Description |
|---|---|---|
| pipeline | 1 byte | Enum: one of 8 pipeline identifiers (NL_GEN, MD_PREC, etc.) |
| primitive_ids | 2 bytes each (max 10) | Routing index: [action_id, domain_id, complexity_id, precision_flag]. Used to configure parameters — not reconstruction data. |
| domain_confidence | 4 bits | Confidence 0.0–1.0, quantized to 4 bits. Below 0.70 triggers structural fallback. |
| complexity_tier | 2 bits | 1=simple (<10 words), 2=moderate, 3=complex, 4=expert (>200 words) |
| precision_required | 1 bit | True = semantic-lossy layers (L3, parts of L2) are bypassed |
| is_pre_compressed | 1 bit | True = bypass all layers, direct output |
| original_payload | N bytes | The actual data. Always present. Never discarded. |
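The table above can be restated as a data structure. The sketch below uses the field names from the table; the pipeline enum values and the way fields are packed on the wire are assumptions for illustration, not the actual serialized format.

```python
from dataclasses import dataclass
from typing import List

# Illustrative pipeline order; the real enum encoding is not specified here.
PIPELINES = ["NL_GEN", "NL_DOM", "ST_REP", "LG_TECH",
             "MD_PREC", "CD_SYN", "BIN_PASS", "ST_CHK"]

@dataclass
class RoutingPacket:
    pipeline: int                # 1 byte: index into the 8 pipeline identifiers
    primitive_ids: List[int]     # 2 bytes each, max 10: routing index only
    domain_confidence: float     # quantized to 4 bits on the wire
    complexity_tier: int         # 2 bits: 1 (simple) .. 4 (expert)
    precision_required: bool     # 1 bit: bypass semantic-lossy layers
    is_pre_compressed: bool      # 1 bit: bypass all layers entirely
    original_payload: bytes      # always present, never discarded

    def needs_fallback(self) -> bool:
        # below 0.70 the router falls back to structural pipelines
        return self.domain_confidence < 0.70
```

The key design point the table makes survives in code: the header configures the cascade, while original_payload always travels with it, so a routing mistake can never destroy data.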
8 Pipelines the Router Can Select
| Pipeline | Data Type | Layer Cascade | Notes |
|---|---|---|---|
| NL_GEN | General Text | L1→L2→L3→L4 | Default path. Semantic abstraction active. |
| NL_DOM | Domain Text | L1→L2→L3→L4→L5 | L5 corpus dict extends L1. |
| ST_REP | Structured/Logs | L1→L6→L7 | Best compression ratios. Template detection. |
| LG_TECH | Legal/Technical | L1→L2→L4→L7 | L3 skipped — no semantic abstraction on legal. |
| MD_PREC | Medical/Precision | L1→L4→L7 | L2 & L3 skipped. Precision flag active. |
| CD_SYN | Code/Syntax | L1→L6 | L3 skipped. Structural templates only. |
| BIN_PASS | Pre-Compressed | BYPASS | Magic-byte or entropy >7.5 hit. Pass-through. |
| ST_CHK | Streaming >100MB | L1→L2 (chunked) | Sliding window. Chunked with overlap. |
Section 04
Semantic Lookup Cascade — Three Tiers
At the heart of SSCA's compression philosophy is a three-tier semantic lookup cascade where the 247 universal primitives are applied. The tiers run in series; each handles the concepts the previous tier did not match.
A
Base Primitive Lookup
65 NSM Universal Primes
Wierzbicka/Goddard Natural Semantic Metalanguage primes — irreducible meaning atoms that are universal across languages. If a token matches an NSM prime, it is encoded as a prime symbol and exits the cascade. Examples: I · YOU · DO · KNOW · WANT · GOOD · BAD · BECAUSE · IF · WHEN
B
Valence Modifier Lookup
4 × 22 Compound Meanings
4 base reaction primitives (WHAT · WHERE · HOW · WHY) crossed with 22 Hebrew valence modifiers to produce compound meaning representations. Example: "intensely" → HOW × intensity-modifier.
C
Image-Equivalence Tables
247 SSCA Primitives + Domain
Word clusters that map to the same mental image regardless of phrasing. "Rapid" = "fast" = "quick" = "swift" → single image symbol. This is where L3's hypernym dictionary connects to the semantic cascade.
⚕ Medical precision example: "etiology: bacterial pneumonia" must decompress to exactly "etiology: bacterial pneumonia" — not "cause: lung infection." The precision flag enforces this at the routing level, not the individual layer level.
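The serial A→B→C fallback described above can be sketched as three lookups tried in order. The tables below are tiny illustrative stand-ins for the 65 NSM primes, the 4×22 valence grid, and the 247-primitive image-equivalence tables.

```python
# Serial three-tier cascade sketch: each tier handles what the previous
# tier did not match. Table contents are illustrative stand-ins only.
NSM_PRIMES = {"i", "you", "do", "know", "want", "good", "bad"}
VALENCE = {"intensely": ("HOW", "intensity")}
IMAGE_EQUIV = {w: "IMG_FAST" for w in ("rapid", "fast", "quick", "swift")}

def classify(token: str):
    t = token.lower()
    if t in NSM_PRIMES:                 # Tier A: prime match exits the cascade
        return ("PRIME", t.upper())
    if t in VALENCE:                    # Tier B: base primitive × valence modifier
        return ("VALENCE", VALENCE[t])
    if t in IMAGE_EQUIV:                # Tier C: shared mental-image symbol
        return ("IMAGE", IMAGE_EQUIV[t])
    return ("LITERAL", token)           # unmatched: pass through untouched

assert classify("quick") == classify("swift")   # same mental image, one symbol
```

The LITERAL fallthrough is what the precision flag relies on: when semantic tiers are bypassed at the routing level, every token takes that path and decompresses to exactly itself.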
Section 05
Compression Claims — Corrected
The original per-layer READMEs stated cumulative compression claims as if each layer's gain were additive against the original input. This is mathematically incorrect: each layer compresses what remains after the previous layer, so gains are multiplicative. Claims that can be disproved by running the code on real data are a liability.
| Layer Stack | Data Type | Claimed | Honest Estimate | Notes |
|---|---|---|---|---|
| L1–L4 | General text | 50–60% | ~34–48% | Realistic for mixed prose corpora |
| L1–L5 | Domain-specific text | 65–75% | ~41–61% | Medical, legal, scientific corpora |
| L1–L6 | Server logs / telemetry | 80–90% | ~70–85% | Defensible: genuine template repetition |
| L1–L7 | Legal / technical docs | 80–95% | ~65–80% | Defensible: clause cross-reference heavy |
The log/telemetry and legal document claims are defensible because L6 and L7 were purpose-built for highly repetitive structured data. The general text claims require recalibration against real corpora before they appear in any external communications.
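The multiplicative arithmetic is easy to make precise. Each layer's gain applies to the data remaining after the previous layer, so the stacked gain is one minus the product of the remainders. The per-layer figures below are illustrative mid-range values, not measured results.

```python
# Multiplicative stacking: each layer compresses what the previous one left.
def stacked_gain(layer_gains):
    remaining = 1.0
    for g in layer_gains:
        remaining *= (1.0 - g)      # each gain applies to the remainder only
    return 1.0 - remaining

# Illustrative mid-range per-layer gains for L1–L4: 25%, 10%, 12%, 11%.
total = stacked_gain([0.25, 0.10, 0.12, 0.11])
print(f"{total:.1%}")   # prints 47.1%
```

An additive reading of those same four numbers would claim 58%; the compounded figure of ~47% is why the honest L1–L4 estimate sits in the ~34–48% band rather than 50–60%.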
Section 06
OCR / PDF Pre-Processing Pipeline
Fully architected but not yet implemented. The pathway for document-heavy enterprise deployments — legal archives, medical records, insurance documents.
1
Scanned document / image PDF arrives at OCR/PDF Input node.
2
Decision: Is the PDF text-extractable? YES → PDF Text Extractor. NO → Image→PDF Converter (avoid OCR where possible).
3
Doc ID Tagger: assigns a unique DocID that links the text stream to any accompanying image stream (charts, signatures, embedded photos).
4
Tagged text stream enters the main SSCA stack via the domain classifier.
5
Image stream travels separately, compressed by existing image compressors (JPEG2000, AVIF, JBIG2).
6
At destination: DocID reunites the text stream and image stream into the original document structure.
Without DocID tagging, the image and text streams are separate files at the destination with no structural relationship. DocID is the linking mechanism that makes SSCA output a drop-in replacement for the original document — not just compressed text fragments.
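A minimal sketch of the DocID linking mechanism follows. The class and method names are hypothetical (this pipeline is not yet implemented); the sketch only shows the invariant that matters: both streams must carry the same DocID before the document can be reassembled.

```python
# Hypothetical DocID registry sketch for reuniting text and image streams.
from collections import defaultdict

class DocRegistry:
    def __init__(self):
        # doc_id -> {"text": <compressed text>, "images": <image blobs>}
        self._streams = defaultdict(dict)

    def add_text(self, doc_id: str, compressed_text: bytes) -> None:
        self._streams[doc_id]["text"] = compressed_text

    def add_images(self, doc_id: str, image_blobs: list) -> None:
        self._streams[doc_id]["images"] = image_blobs

    def reunite(self, doc_id: str) -> dict:
        parts = self._streams[doc_id]
        # both streams must be present to rebuild the original document
        if "text" not in parts or "images" not in parts:
            raise KeyError(f"incomplete streams for {doc_id}")
        return parts
```

Without the shared key, the destination holds two unrelated files; with it, reassembly is a single lookup.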
Section 07
Current Technical State — Assessment
Engineers considering contributing can treat this as the accurate starting point. State of the codebase as reviewed in February 2026.
Working & Production-Ready
- Layer 1 — Symbol Substitution ✓
- Layer 2 — Contextual Compression ✓
- Layer 4 — Predictive Inference ✓
- Layer 5 — Data-Driven Dictionary ✓
- Layer 6 — Template Repetition ✓
- Layer 7 — Cross-Reference ✓
- ssca_router.py — runs, correct RoutingPackets
- Architecture documentation — extensive
Needs Work
- Layer 3 symbol format bug — CRITICAL
- DNA/P³ Router — needs standalone module
- PipelineExecutor stubs — not wired
- OCR/PDF pipeline — not implemented
- Benchmark validation — no real corpus yet
- Error handling — no recovery paths
- Scalability proof — latency budget needed
Section 08
Engineering Work Items — Ordered by Foundation Logic
Ordered by dependency, not complexity. Items 1 and 2 are blockers for everything else.
1
Fix L3 Symbol Format
Replace §H:building (11 chars) with §H14-style symbols (4 chars). Add a lookahead to prevent the "§H:building building" collision. Fixes the negative compression ratio.
CRITICAL · 1–2 days
2
Build DNA/P³ Router
Standalone module. Domain classifier, magic-byte check, entropy scan, parser tier routing, bypass controller. ssca_router.py is the spec/draft.
CRITICAL · 2–3 weeks
3
Wire Layers 1–7 into Router
Replace _apply_layer() stubs in PipelineExecutor with actual imports from each layer module. Define shared CompressedData type.
HIGH · 1–2 weeks
4
OCR/PDF Pre-Processing
PDF text extraction, image/text stream separation, DocID tagging. Connect left pipeline in flowchart to main stack.
HIGH · 3–4 weeks
5
Benchmark & Validate Claims
Run all layers against representative corpora (logs, legal, medical, prose). Replace additive % claims with measured multiplicative figures. Patent & investor evidence base.
HIGH · 2–3 weeks
6
Integration & Production API
Single callable interface: ingest → route → cascade → validate → output. Error handling, logging, config. Defines external API surface for pilots.
MEDIUM · 2–3 weeks
7
Address Critique Items
Error paths for parser failures. Scalability proof at 110-domain tier. Latency budget on edge devices (<1ms). Error recovery spec.
MEDIUM · Ongoing
Section 09
Why This Matters — The Four Walls
Data infrastructure cost — storage, transmission, compute, cooling — is the limiting constraint for AI companies, cloud providers, and any organization operating at scale. SSCA addresses all four simultaneously because they are all driven by the same root variable: data volume.
Energy
Less data stored and transmitted means less I/O, fewer compute cycles, and a measurably lower power draw across large infrastructure. The primary return on compression investment at hyperscale.
Hardware
Storage and memory that do not need to be purchased, racked, configured, or maintained. At enterprise scale, 30% compression across a petabyte-class cluster eliminates roughly 300 TB of required hardware.
Infrastructure
Bandwidth, data center floor space, cooling — all scale linearly with data volume. A single SSCA deployment touches all infrastructure cost lines simultaneously.
Cooling
One of the fastest-growing AI infrastructure costs. Less compute → less heat → less cooling required → less energy for cooling. The energy-cooling relationship is non-linear.
Section 10
Patent Position — Novel Claims
One provisional patent filed covering the first three data efficiency parameters of the 9-layer stack. The following claims are the defensible novel contributions as identified through development and AI-assisted review.
Novel Claims — ssca_router.py
O(1) Semantic Classification
Method for O(1) semantic classification of arbitrary text input using a pre-computed 247-primitive hash table — the classifier makes a routing decision in sub-millisecond time regardless of input size.
RoutingPacket Format
Fixed-overhead header (8–14 bytes) that configures a multi-layer compression pipeline without requiring content analysis at each layer — the header travels with the data, not instead of it.
Precision-Flag Mechanism
Single-bit signal that bypasses semantic-lossy layers for medical, legal, mathematical, and code content — enforced at the routing level before any layer processing begins.
Entropy-Based Pre-Compression Bypass
Magic-byte + entropy threshold check that prevents double-compression overhead on already-compressed content — protects against the most common compression pipeline error.
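The two-stage bypass check is straightforward to sketch: a magic-byte prefix test for known compressed containers, then a Shannon entropy measurement against the >7.5 bits/byte threshold stated above. The magic numbers shown (gzip, ZIP, Zstandard) are standard container signatures; the function names are illustrative.

```python
import math
from collections import Counter

# Standard container signatures: gzip, ZIP local-file header, Zstandard frame.
MAGIC = {b"\x1f\x8b": "gzip", b"PK\x03\x04": "zip", b"\x28\xb5\x2f\xfd": "zstd"}

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0–8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def should_bypass(data: bytes) -> bool:
    # Stage 1: cheap magic-byte check for known compressed containers.
    if any(data.startswith(m) for m in MAGIC):
        return True
    # Stage 2: near-random bytes mean compressing again only adds overhead.
    return shannon_entropy(data) > 7.5
```

Ordering matters: the magic-byte test is O(1) and catches the common case before the O(n) entropy scan runs.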
Architecture Claims
Three-Tier Semantic Cascade
NSM primes (Tier A) → valence modifiers (Tier B) → image-equivalence tables (Tier C) — serial fallback architecture for meaning classification.
DocID Tagging
Text and image streams compressed separately, reunited at destination via DocID — enables SSCA to process scanned document archives without discarding embedded images.
Layer 9 Dynamic Ontology Learning
Routing analytics feed back into classifier improvement over deployment lifetime — the system improves its classification accuracy from production traffic.
⚠ Patent Notice for Engineering Team
Flag immediately any implementation decision that modifies the behavior described in the patent claims above. Changes to the primitive fingerprint format, the routing overhead size, or the precision-flag bypass logic should trigger a provisional update review.
Section 11
File & Module Reference
Treat ssca_router.py as the primary integration reference — it shows the intended API for every module.
| File / Module | Status | Role |
|---|---|---|
| ssca_router.py | ✓ Works | DNA/P³ router spec + draft. DNARouter, DomainClassifier, RoutingPacket, PipelineExecutor (stubs). Run this first. |
| 01_Layer_1_Symbol_Substitution/ | ✓ Works | L1 module. Import: from layer1 import compress |
| 02_Layer_2_Contextual_Compression/ | ✓ Works | L2 module. Import: from layer2 import compress |
| 03_Layer_3_Hierarchical_Abstraction/ | ⚠ Bug | Fix symbol format in hierarchy_dictionary.py BEFORE wiring. Bug in _load_safe_hypernyms(). |
| 04_Layer_4_Predictive_Inference/ | ✓ Works | L4 module. Import: from layer4 import compress |
| 05_Layer_5_DataDriven/ | ✓ Works | L5 proprietary. Import: from layer5 import compress |
| 06_Layer_6_Template_Repetition/ | ✓ Works | L6 proprietary. Strong on telemetry/logs. |
| 07_Layer_7_Cross_Reference/ | ✓ Works | L7 proprietary. §RN reference IDs working. |
| ssca_dev_brief.html | Reference | Recruitment-facing overview page. Dark-themed. |
| ssca_flowchart.html | Reference | Interactive SVG flowchart. Pan/zoom enabled. Architecture reference. |
Section 12
Quick Start for a New Engineer
Recommended sequence for a senior engineer or CS team picking this up for the first time.
1
Run ssca_router.py — cd to its directory, python3 ssca_router.py. Read the output. Understand the RoutingPacket structure and the 8 pipelines.
2
Fix the L3 bug — open 03_Layer_3_.../hierarchy_dictionary.py. Change _load_safe_hypernyms() to generate §H14-style symbols. Add a lookahead in compressor.py. Verify positive compression ratios on all test cases.
3
Wire L1 into PipelineExecutor — replace the Layer 1 stub in _apply_layer() with from layer1 import compress; data = compress(data, **config). Test end-to-end: route a sentence, execute the pipeline, confirm compressed output is shorter than input.
4
Repeat for L2, L4, L5, L6, L7 — each should be a 3-line change per layer once you have the L1 pattern working.
5
Build the benchmark harness — collect representative test files for each pipeline type (prose, logs, legal, medical, code). Run each through the full stack. Record actual compression ratios. This becomes your evidence base.
6
OCR/PDF pipeline — implement PDF text extraction using pdfminer.six or PyPDF2. Implement DocID tagging. Wire into the receiving edge.
7
Production API surface — define the public interface: ssca.compress(data) → CompressedOutput, ssca.decompress(output) → original. This is what enterprise pilots will call.
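A sketch of what that step-7 surface could look like is below. The signatures match the ones named in step 7; the CompressedOutput type and the stub bodies are assumptions, since the real cascade wiring (steps 3–4) is a prerequisite.

```python
# Hypothetical public API surface sketch. The cascade is stubbed: this shows
# the contract (shape and round-trip guarantee), not real compression.
from dataclasses import dataclass

@dataclass
class CompressedOutput:
    routing_packet: bytes   # 8–14 byte header produced by the router
    payload: bytes          # output of the layer cascade

def compress(data: str) -> CompressedOutput:
    # real flow: ingest -> route -> cascade -> validate; stubbed here
    header = b"\x00" * 8                        # placeholder RoutingPacket
    return CompressedOutput(header, data.encode("utf-8"))

def decompress(output: CompressedOutput) -> str:
    # zero-meaning-loss contract: round-trip must reproduce the input exactly
    return output.payload.decode("utf-8")

assert decompress(compress("etiology: bacterial pneumonia")) == \
       "etiology: bacterial pneumonia"
```

Freezing this two-function contract early lets pilot integrations proceed while the cascade behind it is still being wired.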
→ The entire SSCA TriplePlay mandate — zero meaning loss, maximum efficiency, no accident permitted — should be visible as a framed principle in whatever project management system you use. Every PR, every design decision, every benchmark result should be evaluated against all three.
R. Claude Armstrong
Inventor · Structured Semantic Compression Algorithm
Patent Pending · Everett, Washington · © 2025–2026
SSCA = Safety · Security · Correct · Accurate