// Open Development Brief · Patent Pending

Structured Semantic
Compression Algorithm

A multi-layer, lossless text compression platform built on semantic lookup tables — designed to cut data infrastructure costs at scale. We're looking for senior engineers and CS teams to complete validation and the production build.

8 Compression Layers
247 Universal Primitives
75 yrs Pattern Recognition XP
3 Provisional Patents Filed

What SSCA Actually Is

Most compression works at the character or bit level. SSCA works at the meaning level — replacing words, phrases, and semantic concepts with compact symbols drawn from a 247-primitive universal lookup table, then cascading seven additional specialized layers on top. The result is lossless compression that scales with data repetition and domain specificity, not just entropy.

The inventor, R. Claude Armstrong, is an 80-year-old independent engineer from Everett, Washington, with 75+ years of pattern recognition experience across industrial crane systems, wastewater treatment, welding processes, and medical equipment maintenance. SSCA grew from observing how mission-critical systems — including the Apollo Guidance Computer — achieved remarkable efficiency through symbolic abstraction rather than raw compute power.

The core insight: language and structured data are massively redundant at the meaning level. The words "rapid," "fast," "quick," and "swift" all represent the same mental image. Enterprise logs repeat the same structural template millions of times per day. Legal documents restate the same clause dozens of times per contract. SSCA exploits each of these redundancy types with a dedicated layer.
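The word-level redundancy described above is what Layer 1 exploits. A minimal sketch of the idea follows; the three-entry table and symbol names are invented for illustration, not the production 550-entry dictionary:

```python
# Toy Layer-1 substitution: each frequent word maps to a unique short
# symbol, so the mapping is exactly invertible (lossless).
TABLE = {
    "building": "§B1",        # § is a single character in a Python str
    "approximately": "§A1",
    "information": "§I1",
}
INVERSE = {v: k for k, v in TABLE.items()}

def l1_compress(text: str) -> str:
    # Token-wise substitution; the real layer also handles phrases and punctuation.
    return " ".join(TABLE.get(tok, tok) for tok in text.split())

def l1_decompress(text: str) -> str:
    return " ".join(INVERSE.get(tok, tok) for tok in text.split())

s = "the building holds approximately all the information"
c = l1_compress(s)
assert l1_decompress(c) == s            # lossless round trip
print(f"{1 - len(c) / len(s):.0%} smaller")   # → 44% smaller
```

The key property is that the table is one-to-one: synonym collapsing ("rapid"/"fast" sharing a symbol) is deferred to Layer 3, which is only *semantically* lossless.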

The 8-Layer Compression Stack

Layers 1–4 form the open-source foundation. Layers 5–7 are proprietary. Layer 8 is an optional specialty module for literary/marketing content.

Layer · Name · Tier · Mechanism · Status

L1 · Symbol Substitution · OPEN · 550-entry word/phrase → §symbol dictionary. 20–30% on general text. · ✓ Production
L2 · Contextual Compression · OPEN · Bigrams, trigrams, collocations (~200 patterns). +7–12% additional. · ✓ Production
L3 · Hierarchical Abstraction · OPEN · Hypernym substitution, redundant modifier removal. +10–15%. Semantic lossless. · ⚠ Bug — see §04
L4 · Predictive Inference · OPEN · Omission-map: drops highly predictable words, stores position map. +8–15%. · ✓ Production
L5 · Data-Driven Dictionary · PROPRIETARY · Learns optimal symbols from corpus. Domain detection built in. +10–25%. · ✓ Production
L6 · Template Repetition · PROPRIETARY · Detects structural templates (logs, forms). Stores template once + variables. · ✓ Production
L7 · Cross-Reference · PROPRIETARY · Replaces repeated long strings with short §RN reference IDs. +10–25%. · ✓ Production
L8 · Metaphor Compression · SPECIALTY · 25+ metaphor families (Lakoff/Johnson). Useful for literary/marketing content only. · Optional module

Each layer operates on the output of the previous one. The DNA/P³ router at the front of the pipeline (specified, but not yet implemented — see below) detects domain, data type, and whether content is already compressed, routing accordingly before any layer processing begins.
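The cascade plus the bypass decision reduces to function composition. A sketch, in which the layer functions are toy stand-ins and the router decision is collapsed to a single boolean:

```python
# Sketch of the cascade: each layer consumes the previous layer's output.
# The DNA/P³ router is not yet built, so its decision is a boolean stand-in.
from typing import Callable, List

Layer = Callable[[str], str]

def run_pipeline(text: str, layers: List[Layer], already_compressed: bool) -> str:
    if already_compressed:       # router bypass: never re-compress
        return text
    for layer in layers:
        text = layer(text)       # cascaded, not independent, application
    return text

# Two toy stand-ins for L1/L2:
l1 = lambda t: t.replace("building", "§B1")
l2 = lambda t: t.replace("the §B1", "§T1")

assert run_pipeline("enter the building", [l1, l2], False) == "enter §T1"
assert run_pipeline("enter the building", [l1, l2], True) == "enter the building"
```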

The Numbers, Honestly Stated

The original per-layer READMEs stated cumulative compression claims as if each layer's gain were additive against the original input. That is mathematically incorrect: each layer compresses only what remains after the previous layers, so gains compound multiplicatively. Below is the corrected picture, verified by stacking analysis:

// Cumulative Compression — Claimed vs Mathematically Honest

L1–L4 · General text · 50–60% claimed · ~34–48% realistic
L1–L5 · Domain-specific text · 65–75% claimed · ~41–61% realistic
L1–L6 · Server logs / structured data · 80–90% claimed · ~70–85% defensible
L1–L7 · Legal / technical documents · 80–95% claimed · ~65–80% defensible

The log and legal claims are defensible because those data types are genuinely and measurably highly repetitive — L6 and L7 were purpose-built for them. The general text claims need recalibration. This correction matters for patent filings and investor discussions: claims that can be disproved by running the code on real data are a liability, not a strength.
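The multiplicative arithmetic can be checked in a few lines. The per-layer figures below are illustrative midpoints of the ranges in the layer table, not measured values:

```python
# Why per-layer gains multiply rather than add: each layer compresses
# only the residual left by the layers before it.

def stacked_gain(per_layer_gains):
    """Cumulative compression when each layer acts on the residual."""
    residual = 1.0
    for g in per_layer_gains:
        residual *= (1.0 - g)
    return 1.0 - residual

# Midpoints of the L1–L4 ranges (25%, 9.5%, 12.5%, 11.5%), illustrative only:
gains = [0.25, 0.095, 0.125, 0.115]
print(f"additive (wrong):        {sum(gains):.1%}")           # → 58.5%
print(f"multiplicative (honest): {stacked_gain(gains):.1%}")  # → 47.4%
```

This is exactly the gap the table shows: the naive additive sum lands in the claimed 50–60% band, while the honest multiplicative figure lands in the ~34–48% band.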

What Works, What Needs Work

This is an honest assessment based on reading all eight layer implementations, running the code, and checking the math. The goal is to show engineers exactly what they're walking into.

Solid

Layers 1, 2, 4, 5, 6, 7

All run correctly. APIs are clean. Cascade integration is straightforward. Losslessness verified. L6 on server log data is genuinely impressive.
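To illustrate why L6 performs so well on logs: a toy version of the template idea, with an invented log format and a hand-written pattern (the production template detector is proprietary and discovers templates itself):

```python
# Toy L6-style template compression: store the shared structure once,
# keep only the per-line variables, and reconstruct losslessly.
import re

logs = [
    "2025-01-01 ERROR disk full on node-3",
    "2025-01-02 ERROR disk full on node-7",
    "2025-01-03 ERROR disk full on node-9",
]

# Hand-written here; the real layer detects this structure automatically.
pattern = re.compile(r"^(\S+) ERROR disk full on (\S+)$")
template = "{date} ERROR disk full on {node}"
variables = [pattern.match(line).groups() for line in logs]

def expand(template, variables):
    return [template.format(date=d, node=n) for d, n in variables]

assert expand(template, variables) == logs   # lossless reconstruction
```

With millions of lines sharing one template, the stored cost converges on the variables alone, which is why the 70–85% figures for structured data are defensible.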

Needs Fix

Layer 3 — Symbol Format Bug

The symbol format is verbose: §H:building (11 chars) is longer than the words it replaces, producing negative compression. Two-line fix needed.

Needs Repositioning

Layer 8 — Scope Clarification

Solid implementation but mislabeled as a core layer. Contributes near-zero on enterprise data types. Should be an optional specialty module.

Not Yet Built

DNA/P³ Router

Specified in architecture and flowchart, but not yet implemented as a standalone module. Currently layers run serially. The router is the critical next build.

Not Yet Built

OCR / PDF Pre-Processing

Scanned document ingestion, PDF text extraction, image/text stream separation, and DocID tagging are architected but not implemented.

Complete

Patent Provisional Filings

Three provisionals filed, covering the first three data-efficiency parameters of the layer stack. Architecture documentation is extensive.

The Layer 3 bug, for those who want to see it:

# Actual output running layer3.py against real sentences

IN:  he drove his sedan to the office building
OUT: he drove his §H:car to the §H:building building   (-14.6%)
# Bug 1: §H:building (11 chars) > building (8 chars) → file gets LARGER
# Bug 2: office → §H:building collides with next word 'building'
#        decompresses to "the building building" — grammatically broken

IN:  the laptop is on the desk
OUT: the §H:computer is on the §H:furniture   (-52.0%)
# Negative compression. Symbol is longer than the word it replaced.

# The fix: use numbered compact symbols, as L1/L2 already do correctly
§H:building  →  §H14    (4 chars vs 11 — always shorter than source)
§CAT:big     →  §C3     (3 chars)
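A sketch of that fix in code: numbered symbols plus a one-token lookahead guard for the collision case. The hypernym table and symbol IDs here are invented for illustration; layer3.py's real table is larger:

```python
# Numbered hypernym symbols (Bug 1 fix) plus a lookahead so a substitution
# never duplicates the following word (Bug 2 fix). Table entries invented.
HYPERNYMS = {
    "sedan":  ("car",      "§H7"),    # word -> (hypernym, compact symbol)
    "office": ("building", "§H14"),
    "laptop": ("computer", "§H9"),
}

def l3_compress(text: str) -> str:
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        entry = HYPERNYMS.get(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        # Substitute only when the symbol is strictly shorter than the word
        # AND the hypernym does not duplicate the next word.
        if entry and len(entry[1]) < len(tok) and entry[0] != nxt:
            out.append(entry[1])
        else:
            out.append(tok)
    return " ".join(out)

print(l3_compress("he drove his sedan to the office building"))
# → he drove his §H7 to the office building   ("office" kept: collision guard)
```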

The Work That Needs Doing

These are ordered by foundation-first logic, not complexity. A small team of senior engineers could move through this stack systematically.

  01 · Fix Layer 3 Symbol Format

    Replace verbose §H:term with numbered compact symbols (§H14 etc). Add single-token lookahead to prevent collision with adjacent identical words. Estimated: 1–2 days for an experienced Python dev.

  02 · Build the DNA/P³ Router

    The domain classifier that sits in front of the entire stack. Detects data type via magic-number / header / entropy scan. Routes to appropriate tier parsers. Implements bypass for pre-compressed content. This is the critical path item — without it, the stack can't self-configure.

  03 · Build the OCR/PDF Pre-Processing Pipeline

    Scanned document ingestion → PDF text extraction → image/text stream separation → DocID tagging for stream reunion at destination. Connects the "left pipeline" shown in the architecture flowchart to the main compression stack.

  04 · Calibrate and Validate Compression Claims

    Run all layers against representative corpora for each data type (logs, legal docs, medical records, general text). Produce reproducible benchmark results. Replace current additive percentage claims with verified multiplicative figures. This output becomes the patent and investor evidence base.

  05 · Integration Layer: Full Pipeline Orchestration

    Wire all 7 core layers + router + OCR pipeline into a single callable interface with proper error handling, logging, and configuration. Define the production API surface. Prepare for external pilot deployment.
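One of the router checks named in step 02, the scan for already-compressed content, can be sketched directly with byte entropy. The 7.5 bits/byte threshold and 4 KiB sample size below are assumptions, not specified SSCA parameters:

```python
# Entropy-based bypass check: compressed (or encrypted) data looks nearly
# uniform at the byte level, approaching 8 bits/byte.
import math
import os
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def looks_compressed(data: bytes, threshold: float = 7.5) -> bool:
    # Sample only the head of the stream; assumed 4 KiB window.
    return shannon_entropy(data[:4096]) > threshold

plain = b"the same log line repeated over and over " * 100
assert not looks_compressed(plain)           # low entropy: run the stack
assert looks_compressed(os.urandom(4096))    # random bytes stand in for a compressed payload
```

A production router would combine this with the magic-number and header checks (e.g. gzip's `\x1f\x8b` prefix) mentioned in step 02, since short compressed blobs can fool a pure entropy test.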

Why Compression at This Level Matters Now

Data infrastructure cost — storage, transmission, compute, cooling — is the limiting constraint for AI companies, cloud providers, and any organization operating at scale. The "four walls" that define every large data operation are:

Wall 1

Energy Consumption

Less data stored and transmitted means less I/O, less compute, and a meaningfully lower power draw across massive infrastructure.

Wall 2

Hardware Requirements

Storage and memory that don't need to be purchased, racked, or maintained because the data is smaller to begin with.

Wall 3

Infrastructure Costs

Bandwidth, data center space, cooling — all scale with data volume. Compression is a multiplier on all of them simultaneously.

Wall 4

Cooling Systems

One of the fastest-growing infrastructure costs in AI. Fewer compute cycles on smaller data reduce thermal output directly.

SSCA's approach — domain-aware, meaning-level compression that improves with data repetition — is particularly well-suited to the workloads that drive the highest infrastructure costs: AI training logs, legal document archives, medical record systems, and structured telemetry at scale.

Interested in Contributing?

This is an independent inventor project seeking experienced engineers or a CS team for development partnership, pilot validation, or research collaboration. Provisional patents filed. Architecture documentation extensive. The code is real.

SSCA — Structured Semantic Compression Algorithm  ·  Patent Pending  ·  Inventor: R. Claude Armstrong  ·  Everett, WA  ·  © 2025
Web Design & Production: Sir Si'licon (Claude AI · Anthropic) · February 2026