SSCA
Structured Semantic Compression Algorithm
Engineering Overview & Scope Reference
Version 0.9 · Patent Pending · February 2026
Inventor: R. Claude Armstrong · Everett, WA
Zero Meaning Loss: Lossless decompression to original meaning is non-negotiable for all mission-critical data.
Maximum Efficiency: Every layer, every routing decision optimizes the four walls: energy · hardware · infrastructure · cooling.
No Accident Permitted: Like Tesla's zero-accident standard, SSCA treats any data corruption as an absolute system failure.
Section 01

What Structured Semantic Compression Algorithm Is

SSCA is a multi-layer, lossless text compression platform built on semantic lookup tables. Unlike character-level or entropy-based compression (gzip, zstd, LZ4), SSCA operates at the meaning level — replacing words, phrases, and semantic concepts with compact symbols drawn from a 247-primitive universal lookup table, then cascading seven specialized compression layers on top.

Core Insight: "Rapid," "fast," "quick," and "swift" represent the same mental image. Enterprise server logs repeat the same structural template millions of times per day. Legal documents restate the same clause dozens of times per contract. SSCA exploits each type of redundancy with a dedicated layer designed for exactly that pattern.
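The synonym-to-symbol idea can be sketched in a few lines. The tiny dictionary below is purely illustrative; the production L1 table has ~550 entries and its own symbol assignments.

```python
# Hypothetical mini-dictionary: several surface words, one mental image,
# one compact symbol. Symbols §12 and §7 are invented for illustration.
SYMBOL_TABLE = {
    "rapid": "§12", "fast": "§12", "quick": "§12", "swift": "§12",
    "because": "§7",
}

def l1_compress(text: str) -> str:
    """Replace each known word with its compact symbol; pass others through."""
    return " ".join(SYMBOL_TABLE.get(tok.lower(), tok) for tok in text.split())

print(l1_compress("the fast response happened because of caching"))
# prints: the §12 response happened §7 of caching
```

Note that four synonyms share one symbol, so exact-wording round-trips need a canonical-form map on the decompression side; this sketch shows only the forward direction.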

The architecture was conceived by R. Claude Armstrong, an 81-year-old independent engineer with 75+ years of pattern recognition experience across industrial systems maintenance, wastewater treatment, welding processes, and medical equipment maintenance.

Section 02

The 8-Layer Compression Stack

Three tiers: open-source foundation (L1–L4), proprietary domain layers (L5–L7), and an optional specialty module (L8). Compression gains are multiplicative, not additive.

Layer | Name | Tier | Gain | Status | Mechanism
L1 | Symbol Substitution | OPEN | 20–30% | ✓ Prod | 550-entry word/phrase → §symbol dictionary. Word-level token replacement.
L2 | Contextual Compression | OPEN | +7–12% | ✓ Prod | Bigram/trigram collocations (~200 patterns). Phrase-level token replacement.
L3 | Hierarchical Abstraction | OPEN | +10–15% | ⚠ Bug | Hypernym substitution (sedan→car). BYPASSED for precision data types.
L4 | Predictive Inference | OPEN | +8–15% | ✓ Prod | Omission-map: highly predictable words dropped, position map stored.
L5 | Data-Driven Dictionary | PROP | +10–25% | ✓ Prod | Learns corpus-optimal symbols. Extends L1 dictionary with training-derived high-frequency symbols.
L6 | Template Repetition | PROP | +20–40% | ✓ Prod | Detects recurring structural templates (logs, forms, contracts). Stores template once + delta variables. Strongest on telemetry.
L7 | Cross-Reference | PROP | +10–25% | ✓ Prod | Replaces repeated long strings with §RN reference IDs. Works across document boundaries with shared reference registry.
L8 | Metaphor Compression | OPT | ~0–5% | Optional | Lakoff/Johnson metaphor families (25+ categories). Near-zero gain on enterprise data. Not in default cascade.
⚠ L3 Bug — Critical Fix Required (1–2 Days)

The symbol format §H:building (11 characters) is longer than the word it replaces (8 characters), producing negative compression.

hierarchy_dictionary.py · L3 Bug → Fix

# OBSERVED OUTPUT (broken):
#   IN:  he drove his sedan to the office building
#   OUT: he drove his §H:car to the §H:building building   (-14.6%)
# BUG 1: §H:building (11 chars) > building (8 chars) → expands
# BUG 2: office → §H:building collides with next word → "the building building"

# BEFORE (broken):
symbol = f"§H:{general}"           # e.g. §H:building = 11 chars

# AFTER (fixed):
self._h_counter += 1
symbol = f"§H{self._h_counter}"    # e.g. §H14 = 4 chars, always shorter

# Also: add one-token lookahead to prevent the collision case
Section 03

DNA/P³ Router — Intelligent Controller

The front-end classifier that sits ahead of the entire compression stack. Inspects incoming data, determines the optimal layer cascade, and produces a RoutingPacket.

RoutingPacket Format — Fixed Overhead: 8–14 Bytes
Field | Size | Description
pipeline | 1 byte | Enum: one of 8 pipeline identifiers (NL_GEN, MD_PREC, etc.)
primitive_ids | 2 bytes each (max 10) | Routing index: [action_id, domain_id, complexity_id, precision_flag]. Used to configure parameters — not reconstruction data.
domain_confidence | 4 bits | 0.0–1.0 confidence, quantized to 4 bits. Below 0.70 triggers structural fallback.
complexity_tier | 2 bits | 1 = simple (<10 words), 2 = moderate, 3 = complex, 4 = expert (>200 words)
precision_required | 1 bit | True = semantic-lossy layers (L3, parts of L2) are bypassed
is_pre_compressed | 1 bit | True = bypass all layers, direct output
original_payload | N bytes | The actual data. Always present. Never discarded.
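One way the header could be packed is sketched below. The field order, the quantization step, and the choice to fold the sub-byte fields (4-bit confidence, 2-bit tier, two flag bits) into a single flags byte are assumptions for illustration, not the spec.

```python
import struct
from dataclasses import dataclass

@dataclass
class RoutingPacket:
    pipeline: int             # 0-7 pipeline enum
    primitive_ids: list[int]  # up to 10 two-byte routing indices
    domain_confidence: float  # 0.0-1.0, quantized to 4 bits on pack
    complexity_tier: int      # 1-4
    precision_required: bool
    is_pre_compressed: bool
    payload: bytes

    def pack_header(self) -> bytes:
        # Assumed layout: [pipeline:8][conf:4 tier:2 prec:1 pre:1][ids...]
        conf4 = min(15, int(self.domain_confidence * 15))
        flags = (conf4 << 4) | ((self.complexity_tier - 1) << 2) \
                | (int(self.precision_required) << 1) | int(self.is_pre_compressed)
        head = struct.pack("BB", self.pipeline, flags)
        head += struct.pack(f">{len(self.primitive_ids)}H", *self.primitive_ids)
        return head
```

With the 4-index routing list shown in the table, this comes to 2 + 8 = 10 bytes, inside the stated 8–14 byte envelope.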
8 Pipelines the Router Can Select
Pipeline | Data Type | Layer Cascade | Notes
NL_GEN | General Text | L1→L2→L3→L4 | Default path. Semantic abstraction active.
NL_DOM | Domain Text | L1→L2→L3→L4→L5 | L5 corpus dict extends L1.
ST_REP | Structured/Logs | L1→L6→L7 | Best compression ratios. Template detection.
LG_TECH | Legal/Technical | L1→L2→L4→L7 | L3 skipped — no semantic abstraction on legal.
MD_PREC | Medical/Precision | L1→L4→L7 | L2 & L3 skipped. Precision flag active.
CD_SYN | Code/Syntax | L1→L6 | L3 skipped. Structural templates only.
BIN_PASS | Pre-Compressed | BYPASS | Magic-byte or entropy >7.5 hit. Pass-through.
ST_CHK | Streaming >100MB | L1→L2 (chunked) | Sliding window. Chunked with overlap.
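A dispatch sketch for these cascades, assuming each layer module exposes a `compress(text)` callable as Section 11 suggests. The layer functions here are stand-in placeholders, not the real modules.

```python
# Placeholder layer callables; in the wired system these would be
# `from layer1 import compress as l1`, etc. (per Section 11).
l1 = l2 = l3 = l4 = l5 = l6 = l7 = lambda text: text

CASCADES = {
    "NL_GEN":   [l1, l2, l3, l4],
    "NL_DOM":   [l1, l2, l3, l4, l5],
    "ST_REP":   [l1, l6, l7],
    "LG_TECH":  [l1, l2, l4, l7],
    "MD_PREC":  [l1, l4, l7],
    "CD_SYN":   [l1, l6],
    "BIN_PASS": [],   # bypass: payload passes through untouched
    # ST_CHK (L1→L2 over chunked sliding windows) omitted from this sketch
}

def execute(pipeline: str, data: str) -> str:
    for layer in CASCADES[pipeline]:
        data = layer(data)   # each layer compresses the previous layer's output
    return data
```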
Section 04

Semantic Lookup Cascade — Three Tiers

At the heart of SSCA's compression philosophy is a three-tier semantic lookup cascade where the 247 universal primitives are applied. The tiers run in series; each handles the concepts the previous tier did not match.

Tier A · Base Primitive Lookup · 65 NSM Universal Primes
Wierzbicka/Goddard Natural Semantic Metalanguage primes — irreducible, cross-language-universal meaning atoms. If a token matches an NSM prime, it is encoded as a prime symbol and exits the cascade. Examples: I · YOU · DO · KNOW · WANT · GOOD · BAD · BECAUSE · IF · WHEN

Tier B · Valence Modifier Lookup · 4 × 22 Compound Meanings
4 base reaction primitives (WHAT · WHERE · HOW · WHY) crossed with 22 Hebrew valence modifiers to produce 88 compound meaning representations. Example: "intensely" → HOW × intensity-modifier.

Tier C · Image-Equivalence Tables · 247 SSCA Primitives + Domain
Word clusters that map to the same mental image regardless of phrasing: "rapid" = "fast" = "quick" = "swift" → single image symbol. This is where L3's hypernym dictionary connects to the semantic cascade.
⚕ Medical precision example: "etiology: bacterial pneumonia" must decompress to exactly "etiology: bacterial pneumonia" — not "cause: lung infection." The precision flag enforces this at the routing level, not the individual layer level.
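The serial fallback can be sketched as three chained lookups. Every table entry below is a tiny illustrative stand-in, not content from the real 65-prime, 88-compound, or 247-primitive tables.

```python
# Tier A: NSM primes (illustrative symbols §P1..§P4)
NSM_PRIMES = {"i": "§P1", "you": "§P2", "do": "§P3", "know": "§P4"}
# Tier B: base reaction primitive crossed with a valence modifier
VALENCE_COMPOUNDS = {"intensely": ("HOW", "intensity")}
# Tier C: image-equivalence cluster, one symbol per mental image
IMAGE_CLUSTERS = {"rapid": "§I9", "fast": "§I9", "quick": "§I9", "swift": "§I9"}

def semantic_lookup(token: str):
    """Serial cascade: each tier handles what the previous tier missed."""
    t = token.lower()
    if t in NSM_PRIMES:             # Tier A match exits the cascade
        return NSM_PRIMES[t]
    if t in VALENCE_COMPOUNDS:      # Tier B: compound representation
        return VALENCE_COMPOUNDS[t]
    return IMAGE_CLUSTERS.get(t)    # Tier C, else None (no semantic match)
```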
Section 05

Compression Claims — Corrected

The original per-layer READMEs stated cumulative compression claims as if each layer's gain were additive against the original input. This is mathematically incorrect — each layer compresses what remains after the previous layer, so gains compose multiplicatively. Claims that can be disproved by running the code on real data are a liability.

Layer Stack | Data Type | Claimed | Honest Estimate | Notes
L1–L4 | General text | 50–60% | ~34–48% | Realistic for mixed prose corpora
L1–L5 | Domain-specific text | 65–75% | ~41–61% | Medical, legal, scientific corpora
L1–L6 | Server logs / telemetry | 80–90% | ~70–85% | Defensible: genuine template repetition
L1–L7 | Legal / technical docs | 80–95% | ~65–80% | Defensible: clause cross-reference heavy
The log/telemetry and legal document claims are defensible because L6 and L7 were purpose-built for highly repetitive structured data. The general text claims require recalibration against real corpora before they appear in any external communications.
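The multiplicative arithmetic is easy to verify. The per-layer gains below are illustrative midpoints of the Section 02 ranges, not measured figures.

```python
def cumulative(gains):
    """Total compression when each layer removes fraction g of what remains."""
    remaining = 1.0
    for g in gains:
        remaining *= (1.0 - g)   # layer compresses the previous layer's residue
    return 1.0 - remaining       # total fraction removed

# L1-L4 midpoints (25%, 10%, 12%, 12%): additive would claim 59%,
# but the multiplicative composition yields about 47.7%.
total = cumulative([0.25, 0.10, 0.12, 0.12])
print(f"{total:.1%}")   # prints: 47.7%
```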
Section 06

OCR / PDF Pre-Processing Pipeline

Fully architected but not yet implemented. The pathway for document-heavy enterprise deployments — legal archives, medical records, insurance documents.

1. Scanned document / image PDF arrives at OCR/PDF Input node.
2. Decision: Is the PDF text-extractable? YES → PDF Text Extractor. NO → Image→PDF Converter (avoid OCR where possible).
3. Doc ID Tagger: assigns a unique DocID that links the text stream to any accompanying image stream (charts, signatures, embedded photos).
4. Tagged text stream enters the main SSCA stack via the domain classifier.
5. Image stream travels separately, compressed by existing image compressors (JPEG2000, AVIF, JBIG2).
6. At destination: DocID reunites the text stream and image stream into the original document structure.
Without DocID tagging, the image and text streams are separate files at the destination with no structural relationship. DocID is the linking mechanism that makes SSCA output a drop-in replacement for the original document — not just compressed text fragments.
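The DocID mechanism can be sketched as follows; the record shapes and registry format are assumptions, since the pipeline is architected but not yet implemented.

```python
import uuid

def tag_document(text_stream: bytes, image_streams: list[bytes]):
    """Step 3: one DocID links the text stream to its image streams."""
    doc_id = uuid.uuid4().hex
    text_record = {"doc_id": doc_id, "kind": "text", "data": text_stream}
    image_records = [{"doc_id": doc_id, "kind": "image", "seq": i, "data": img}
                     for i, img in enumerate(image_streams)]
    return doc_id, text_record, image_records

def reunite(doc_id: str, records: list[dict]) -> dict:
    """Step 6, destination side: regroup all records sharing the DocID."""
    mine = [r for r in records if r["doc_id"] == doc_id]
    return {"text": next(r for r in mine if r["kind"] == "text"),
            "images": sorted((r for r in mine if r["kind"] == "image"),
                             key=lambda r: r["seq"])}
```

The two streams can travel through entirely different compressors; the shared ID is the only structural link they need.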
Section 07

Current Technical State — Assessment

Engineers considering contribution can treat this as the accurate starting point. State of codebase as reviewed February 2026.

Working & Production-Ready
  • Layer 1 — Symbol Substitution ✓
  • Layer 2 — Contextual Compression ✓
  • Layer 4 — Predictive Inference ✓
  • Layer 5 — Data-Driven Dictionary ✓
  • Layer 6 — Template Repetition ✓
  • Layer 7 — Cross-Reference ✓
  • ssca_router.py — runs, correct RoutingPackets
  • Architecture documentation — extensive
Needs Work
  • Layer 3 symbol format bug — CRITICAL
  • DNA/P³ Router — needs standalone module
  • PipelineExecutor stubs — not wired
  • OCR/PDF pipeline — not implemented
  • Benchmark validation — no real corpus yet
  • Error handling — no recovery paths
  • Scalability proof — latency budget needed
Section 08

Engineering Work Items — Ordered by Foundation Logic

Ordered by dependency, not complexity. Items 1 and 2 are blockers for everything else.

1. Fix L3 Symbol Format · CRITICAL · 1–2 days
   Replace §H:building (11 chars) with §H14 (4 chars). Add lookahead to prevent "§H:building building" collision. Fixes compression turning negative.
2. Build DNA/P³ Router · CRITICAL · 2–3 weeks
   Standalone module. Domain classifier, magic-byte check, entropy scan, parser tier routing, bypass controller. ssca_router.py is the spec/draft.
3. Wire Layers 1–7 into Router · HIGH · 1–2 weeks
   Replace _apply_layer() stubs in PipelineExecutor with actual imports from each layer module. Define shared CompressedData type.
4. OCR/PDF Pre-Processing · HIGH · 3–4 weeks
   PDF text extraction, image/text stream separation, DocID tagging. Connect left pipeline in flowchart to main stack.
5. Benchmark & Validate Claims · HIGH · 2–3 weeks
   Run all layers against representative corpora (logs, legal, medical, prose). Replace additive % claims with measured multiplicative figures. Patent & investor evidence base.
6. Integration & Production API · MEDIUM · 2–3 weeks
   Single callable interface: ingest → route → cascade → validate → output. Error handling, logging, config. Defines external API surface for pilots.
7. Address Critique Items · MEDIUM · Ongoing
   Error paths for parser failures. Scalability proof at 110-domain tier. Latency budget on edge devices (<1ms). Error recovery spec.
Section 09

Why This Matters — The Four Walls

Data infrastructure cost — storage, transmission, compute, cooling — is the limiting constraint for AI companies, cloud providers, and any organization operating at scale. SSCA addresses all four simultaneously because they are all driven by the same root variable: data volume.

Energy
Less data stored and transmitted means less I/O, fewer compute cycles, and measurably lower power draw across large infrastructure. The primary return on compression investment at hyperscale.
Hardware
Storage and memory that do not need to be purchased, racked, configured, or maintained. At enterprise scale, 30% compression across a petabyte-class cluster eliminates 300TB of required hardware.
Infrastructure
Bandwidth, data center floor space, cooling — all scale linearly with data volume. A single SSCA deployment touches all infrastructure cost lines simultaneously.
Cooling
One of the fastest-growing AI infrastructure costs. Less compute → less heat → less cooling required → less energy for cooling. The energy-cooling relationship is non-linear.
Section 10

Patent Position — Novel Claims

One provisional patent filed covering the first three data efficiency parameters of the 9-layer stack. The following claims are the defensible novel contributions as identified through development and AI-assisted review.

Novel Claims — ssca_router.py
O(1) Semantic Classification
Method for O(1) semantic classification of arbitrary text input using a pre-computed 247-primitive hash table — the classifier makes a routing decision in sub-millisecond time regardless of input size.
RoutingPacket Format
Fixed-overhead header (8–14 bytes) that configures a multi-layer compression pipeline without requiring content analysis at each layer — the header travels with the data, not instead of it.
Precision-Flag Mechanism
Single-bit signal that bypasses semantic-lossy layers for medical, legal, mathematical, and code content — enforced at the routing level before any layer processing begins.
Entropy-Based Pre-Compression Bypass
Magic-byte + entropy threshold check that prevents double-compression overhead on already-compressed content — protects against the most common compression pipeline error.
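A sketch of such a bypass check, using Shannon entropy over byte frequencies plus a small sample of real compressed-container magic bytes; the 7.5 bits/byte threshold comes from the BIN_PASS row in Section 03, and the exact magic-byte list is an assumption.

```python
import math
from collections import Counter

# Sample magic prefixes: gzip, zip, zstd (real signatures; list is partial)
MAGIC = [b"\x1f\x8b", b"PK\x03\x04", b"\x28\xb5\x2f\xfd"]

def shannon_entropy(data: bytes) -> float:
    """Bits per byte; 8.0 means uniformly random, typical of compressed data."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def should_bypass(data: bytes) -> bool:
    if any(data.startswith(m) for m in MAGIC):
        return True                       # known compressed container format
    return shannon_entropy(data) > 7.5    # near-random bytes: already compressed
```

Without this check, a pipeline re-compresses already-compressed payloads, paying full CPU cost for zero or negative gain.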
Architecture Claims
Three-Tier Semantic Cascade
NSM primes (Tier A) → valence modifiers (Tier B) → image-equivalence tables (Tier C) — serial fallback architecture for meaning classification.
DocID Tagging
Text and image streams compressed separately, reunited at destination via DocID — enables SSCA to process scanned document archives without discarding embedded images.
Layer 9 Dynamic Ontology Learning
Routing analytics feed back into classifier improvement over deployment lifetime — the system improves its classification accuracy from production traffic.
⚠ Patent Notice for Engineering Team

Flag immediately any implementation decision that modifies the behavior described in the patent claims above. Changes to the primitive fingerprint format, the routing overhead size, or the precision-flag bypass logic should trigger a provisional update review.

Section 11

File & Module Reference

Treat ssca_router.py as the primary integration reference — it shows the intended API for every module.

File / Module | Status | Role
ssca_router.py | ✓ Works | DNA/P³ router spec + draft. DNARouter, DomainClassifier, RoutingPacket, PipelineExecutor (stubs). Run this first.
01_Layer_1_Symbol_Substitution/ | ✓ Works | L1 module. Import: from layer1 import compress
02_Layer_2_Contextual_Compression/ | ✓ Works | L2 module. Import: from layer2 import compress
03_Layer_3_Hierarchical_Abstraction/ | ⚠ Bug | Fix symbol format in hierarchy_dictionary.py BEFORE wiring. Bug in _load_safe_hypernyms().
04_Layer_4_Predictive_Inference/ | ✓ Works | L4 module. Import: from layer4 import compress
05_Layer_5_DataDriven/ | ✓ Works | L5 proprietary. Import: from layer5 import compress
06_Layer_6_Template_Repetition/ | ✓ Works | L6 proprietary. Strong on telemetry/logs.
07_Layer_7_Cross_Reference/ | ✓ Works | L7 proprietary. §RN reference IDs working.
ssca_dev_brief.html | Reference | Recruitment-facing overview page. Dark-themed.
ssca_flowchart.html | Reference | Interactive SVG flowchart. Pan/zoom enabled. Architecture reference.
Section 12

Quick Start for a New Engineer

Recommended sequence for a senior engineer or CS team picking this up for the first time.

1. Run ssca_router.py — cd to its directory, python3 ssca_router.py. Read the output. Understand the RoutingPacket structure and the 8 pipelines.
2. Fix the L3 bug — open 03_Layer_3_.../hierarchy_dictionary.py. Change _load_safe_hypernyms() to generate §H14-style symbols. Add a lookahead in compressor.py. Verify positive compression ratios on all test cases.
3. Wire L1 into PipelineExecutor — replace the Layer 1 stub in _apply_layer() with from layer1 import compress; data = compress(data, **config). Test end-to-end: route a sentence, execute the pipeline, confirm the compressed output is shorter than the input.
4. Repeat for L2, L4, L5, L6, L7 — each should be a 3-line change per layer once the L1 pattern is working.
5. Build the benchmark harness — collect representative test files for each pipeline type (prose, logs, legal, medical, code). Run each through the full stack. Record actual compression ratios. This becomes your evidence base.
6. OCR/PDF pipeline — implement PDF text extraction using pdfminer.six or pypdf. Implement DocID tagging. Wire into the receiving edge.
7. Production API surface — define the public interface: ssca.compress(data) → CompressedOutput, ssca.decompress(output) → original. This is what enterprise pilots will call.
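The target surface from step 7 can be stubbed immediately so pilots and tests code against a stable interface. The internals below are identity placeholders (an assumption) until the router and layers are wired in steps 1–4.

```python
from dataclasses import dataclass

@dataclass
class CompressedOutput:
    routing_header: bytes   # packed RoutingPacket header (Section 03)
    body: bytes             # cascaded layer output

def compress(data: bytes) -> CompressedOutput:
    # route → cascade → validate; identity placeholder until wiring lands
    return CompressedOutput(routing_header=b"\x00\x00", body=data)

def decompress(output: CompressedOutput) -> bytes:
    return output.body

# The zero-meaning-loss invariant every build must satisfy:
assert decompress(compress(b"etiology: bacterial pneumonia")) \
       == b"etiology: bacterial pneumonia"
```

Keeping round-trip identity as an executable assertion from day one means the TriplePlay mandate is enforced by the test suite, not just the documentation.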
→ The entire SSCA TriplePlay mandate — zero meaning loss, maximum efficiency, no accident permitted — should be visible as a framed principle in whatever project management system you use. Every PR, every design decision, every benchmark result should be evaluated against all three.
R. Claude Armstrong
Inventor · Structured Semantic Compression Algorithm
Patent Pending · Everett, Washington · © 2025–2026
SSCA = Safety · Security · Correct · Accurate