The Origin Story
At 5 AM, while brainstorming ways to minimize the wire footprint of a custom ICMP-based C2 agent, an idea hit me: “What if I just store the first byte, calculate the difference to the next one, and count how many times that same difference repeats?”
The logic was dead simple. No dictionaries, no Huffman trees, no frequency tables — just subtraction and counting. Two primitive operations that could fit in a handful of x64 instructions.
I felt like a genius… until I realized I had just accidentally reinvented Differential Pulse-Code Modulation (DPCM) combined with Run-Length Encoding (RLE) — techniques formalized by telecom engineers in the 1950s-1970s.
But here’s the thing: writing it completely from scratch in pure x86-64 Assembly, with zero libc, zero dependencies, and under 200 instructions — that part was entirely new. And it works beautifully for its intended purpose.
The Problem: Why Compress C2 Traffic?
In a covert channel scenario, every byte matters. Consider a typical ls -la /root output: ~450 bytes. Over ICMP with 56-byte chunk fragmentation, that’s 9 packets on the wire. Each packet is a detection opportunity — a chance for an IDS to correlate, flag, and alert.
Now compress that 450 bytes down to ~250 bytes. You just dropped to 5 packets. Fewer packets means less traffic, smaller timing windows, and a reduced statistical footprint. The math is simple: less wire time = less detection surface.
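The packet arithmetic above can be sanity-checked in a few lines (a sketch; the 56-byte chunk size is the one from the scenario):

```python
import math

CHUNK = 56  # ICMP payload bytes per packet, per the scenario above

def packets_needed(payload_len: int, chunk: int = CHUNK) -> int:
    """How many ICMP packets a payload of this size costs on the wire."""
    return math.ceil(payload_len / chunk)

print(packets_needed(450), packets_needed(250))  # 9 5
```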
But the compressor itself must be tiny. In a PIC (Position Independent Code) shellcode context, you can’t afford to drag in zlib or LZ4. You need something that compiles to a few dozen instructions and requires zero heap allocation.
The Algorithm: DPCM + RLE Hybrid
The VESQER algorithm operates in two conceptual layers that execute in a single pass:
Layer 1 — Delta Encoding (DPCM): Instead of storing raw byte values, we store the difference (delta) between consecutive bytes. If the input is A B C D (ASCII 65, 66, 67, 68), the deltas are all +1. This transforms sequential patterns into repetitive values.
Layer 2 — Run-Length Encoding (RLE): Once the deltas are calculated, identical consecutive deltas are collapsed into a single (count, delta) pair. Those four bytes A B C D become: Anchor: A + (3, +1) — just 3 bytes instead of 4.
The real power appears with longer sequences. ABCDEFGHIJ (10 bytes) becomes A + (9, +1) — only 3 bytes. That’s a 70% reduction from two simple operations.
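As a reference model, the two layers collapse into a single pass. This is a Python sketch of the scheme as described, not the actual assembly; `vesqer_compress` is an illustrative name, and the behavior for inputs shorter than two bytes is an assumption:

```python
def vesqer_compress(data: bytes) -> bytes:
    """Reference model of the DPCM+RLE scheme (Python sketch, not the asm)."""
    if len(data) < 2:
        return bytes(data)  # assumption: tiny inputs pass through unmodified
    out = bytearray([data[0]])           # anchor byte, written as-is
    active = (data[1] - data[0]) & 0xFF  # initial Active Delta (mod 256, like SUB al, bl)
    count = 0
    prev = data[0]
    for b in data[1:]:
        d = (b - prev) & 0xFF
        if d == active:
            count += 1
            if count == 255:               # overflow guard: emit (255, delta), keep delta
                out += bytes([255, active])
                count = 0
        else:
            out += bytes([count, active])  # flush the current run
            active, count = d, 1
        prev = b
    out += bytes([count, active])          # final flush at EOF
    return bytes(out)

print(vesqer_compress(b"ABCDEFGHIJ").hex())  # 410901 -> 3 bytes
```

On `ABCDEFGHIJ` this reproduces the `A + (9, +1)` result above: one anchor byte plus a single pair.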
The Compression Format
The output binary format is minimal and signature-free:
[Anchor Byte] [Count₁] [Delta₁] [Count₂] [Delta₂] ... [CountN] [DeltaN]
- Anchor (1 byte): The literal first byte of the input. This is the reference point for the entire stream.
- Count (1 byte, unsigned): How many times to apply the associated delta. Range: 0–255. (A zero count can be emitted right after a 255-overflow flush; the decompressor simply skips such pairs.)
- Delta (1 byte, signed via two’s complement): The difference to add for each step. Range: -128 to +127.
No magic bytes. No headers. No metadata. Just raw compressed data — exactly what you want when avoiding static signatures.
Walk-Through: How It Actually Works
Let’s trace the algorithm with a concrete example. Given the input string:
"AABCDFFF333A" (12 bytes)
Step 1 — Anchor Setup:
Read the first byte A (0x41) and write it directly to the output. This is our anchor — the starting reference point. The compressor now peeks at the second byte A and calculates the initial delta: 0x41 - 0x41 = 0. Active Delta is set to 0.
Step 2 — Main Compression Loop:
The compressor walks through the remaining bytes. For each byte, it calculates current - previous and compares it to the Active Delta:
| Position | Char | Value | Previous | Delta | Active Delta | Match? | Action |
|---|---|---|---|---|---|---|---|
| 1 | A | 0x41 | A (0x41) | 0 | 0 | ✅ | Count++ → 1 |
| 2 | B | 0x42 | A (0x41) | +1 | 0 | ❌ | FLUSH (1, 0) → New delta: +1 |
| 3 | C | 0x43 | B (0x42) | +1 | +1 | ✅ | Count++ → 2 |
| 4 | D | 0x44 | C (0x43) | +1 | +1 | ✅ | Count++ → 3 |
| 5 | F | 0x46 | D (0x44) | +2 | +1 | ❌ | FLUSH (3, +1) → New delta: +2 |
| 6 | F | 0x46 | F (0x46) | 0 | +2 | ❌ | FLUSH (1, +2) → New delta: 0 |
| 7 | F | 0x46 | F (0x46) | 0 | 0 | ✅ | Count++ → 2 |
| 8 | 3 | 0x33 | F (0x46) | -19 | 0 | ❌ | FLUSH (2, 0) → New delta: -19 |
| 9 | 3 | 0x33 | 3 (0x33) | 0 | -19 | ❌ | FLUSH (1, -19) → New delta: 0 |
| 10 | 3 | 0x33 | 3 (0x33) | 0 | 0 | ✅ | Count++ → 2 |
| 11 | A | 0x41 | 3 (0x33) | +14 | 0 | ❌ | FLUSH (2, 0) → New delta: +14 |
| — | EOF | — | — | — | — | — | FINAL FLUSH (1, +14) |
Compressed Output:
[0x41] [01,00] [03,01] [01,02] [02,00] [01,ED] [02,00] [01,0E]
A 1×(0) 3×(+1) 1×(+2) 2×(0) 1×(-19) 2×(0) 1×(+14)
Result: 15 bytes from 12. In this particular short example, the overhead of frequent delta changes actually expands the data. This is expected — the algorithm’s strength is with longer, more predictable sequences.
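The trace can be verified by decoding the stream above back to the original input. This is a Python sketch that assumes a well-formed stream; `vesqer_decompress` is an illustrative name:

```python
def vesqer_decompress(blob: bytes) -> bytes:
    """Decode an anchor + (count, delta) stream (sketch; assumes well-formed input)."""
    out = bytearray([blob[0]])          # anchor byte is emitted as-is
    cur = blob[0]
    for i in range(1, len(blob), 2):
        count, delta = blob[i], blob[i + 1]
        for _ in range(count):          # a count of 0 is skipped naturally
            cur = (cur + delta) & 0xFF  # cumulative delta, 8-bit wraparound
            out.append(cur)
    return bytes(out)

stream = bytes.fromhex("41 0100 0301 0102 0200 01ed 0200 010e")
print(vesqer_decompress(stream))  # b'AABCDFFF333A'
```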
Where It Dominates
Consider a typical log line repeated across thousands of entries:
"192.168.1.96 - - [10/Apr/2026:05:38:27 +0000] "GET /index.html..."
The spaces, repeated digits, sequential timestamps, and identical structural characters create long delta runs. A 1KB log block can easily compress to 400-500 bytes — a 50-60% reduction with zero dictionary overhead.
The x64 Assembly Implementation
Register Architecture
The compressor is designed for maximum register economy — critical for shellcode deployments where every byte of machine code counts:
RSI → Source pointer (raw input data)
RDI → Destination pointer (compressed output)
RCX → Remaining bytes counter (loop control)
R8B → RLE Count (0–255, single-byte register)
R9B → Active Delta (current difference being tracked)
AL → Working register (current byte / delta calculation)
BL → Previous byte value (the "anchor" for delta math)
DL → Temporary save register (preserves current byte)
No heap. No stack frames. No memory allocation. Everything lives in registers during the hot loop.
The Compressor Core
The heart of the algorithm — the main loop — is remarkably compact:
_compress_loop:
test rcx, rcx
jz _end
mov al, byte [rsi] ; Read current byte
mov dl, al ; Save it (we'll need the original)
sub al, bl ; Calculate delta: current - previous
cmp al, r9b ; Does it match our Active Delta?
jne _flush ; No → flush the current run
_inc_r8b:
inc r8b ; Yes → extend the run
inc rsi ; Advance source pointer
dec rcx ; Decrement remaining counter
jz _end ; If zero → final flush and exit
mov bl, dl ; Update "previous" byte
cmp r8b, 255 ; Overflow guard (max 1-byte count)
jne _compress_loop ; Not at cap → keep scanning (at 255, fall through to _flush_255)
The hot path, taken when deltas match, executes in just 14 instructions per byte: no memory writes, pure register operations, and the only taken branch is the predictable backward jump. Branch prediction loves this because jne _flush is “not taken” for the majority of iterations in sequential data.
The Flush Mechanism
When a delta mismatch occurs, the current run is written to the output and a new sequence begins:
_flush:
mov byte [rdi], r8b ; Write RLE count to output
inc rdi
mov byte [rdi], r9b ; Write Active Delta to output
inc rdi
; Start tracking the new sequence
mov r9b, al ; New Active Delta = the mismatched delta
mov r8b, 1 ; Reset count to 1 (this byte starts a new run)
mov bl, dl ; Update previous byte
inc rsi
dec rcx
jz _end
jmp _compress_loop
The 255 Overflow Guard
A single byte can only represent counts up to 255. When a run exceeds this limit, a “mini-flush” writes the current state and resets the counter — without changing the Active Delta:
_flush_255:
mov byte [rdi], r8b ; Write 255 to output
inc rdi
mov byte [rdi], r9b ; Write current delta
inc rdi
mov r8b, 0 ; Reset counter, KEEP the same delta
jmp _compress_loop ; Continue — same series, new count
This ensures seamless handling of runs longer than 255 bytes — they simply produce multiple (255, delta) pairs followed by the remainder.
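The resulting pair-splitting arithmetic can be sketched in isolation (illustrative Python; `encode_run` is a hypothetical helper, while the real agent tracks the count incrementally in R8B):

```python
def encode_run(length: int, delta: int) -> bytes:
    """Emit the (count, delta) pairs for one run, splitting at the 255 cap."""
    full, rem = divmod(length, 255)
    out = bytes([255, delta & 0xFF]) * full  # one full pair per 255 steps
    if rem:
        out += bytes([rem, delta & 0xFF])    # remainder pair
    return out

# A 300-byte run of one repeated byte = anchor + 299 zero-deltas:
print(encode_run(299, 0).hex())  # ff002c00 -> (255, 0) then (44, 0)
```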
The Decompressor
Decompression is the inverse operation and is intentionally simpler. In a typical C2 flow the decompressor runs operator-side, where code size hardly matters; but when it also has to run inside the agent’s PIC context, that simplicity pays off directly: fewer instructions, smaller shellcode.
_decompress_loop:
test rcx, rcx
jz _print_and_exit
mov dl, byte [rsi] ; Read Count
inc rsi
dec rcx
test rcx, rcx
jz _print_and_exit
mov al, byte [rsi] ; Read Delta
inc rsi
dec rcx
test dl, dl ; Count == 0? Skip (edge case)
jz _decompress_loop
_write_loop:
add bl, al ; Apply delta: previous + delta = current
mov byte [rdi], bl ; Write reconstructed byte
inc rdi
dec dl
jnz _write_loop ; Repeat for the full count
jmp _decompress_loop
The key detail is in _write_loop: the delta is applied cumulatively. Each iteration adds the delta to the running byte value. This is what reconstructs sequences like A → B → C → D from a single (3, +1) pair — not by writing B three times, but by computing A+1=B, B+1=C, C+1=D.
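That cumulative reconstruction, for a single pair, looks like this (a Python sketch mirroring _write_loop; `apply_pair` is an illustrative name):

```python
def apply_pair(prev: int, count: int, delta: int) -> bytes:
    """Expand one (count, delta) pair starting from prev, like _write_loop."""
    out = bytearray()
    cur = prev
    for _ in range(count):
        cur = (cur + delta) & 0xFF  # ADD bl, al wraps at 8 bits
        out.append(cur)
    return bytes(out)

print(apply_pair(ord("A"), 3, 1))  # b'BCD'
```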
Performance: Compression Scenarios
Here’s what to expect with real-world data on a 1GB file scale:
Scenario 1: Server Log Files (access.log)
Repeated IP addresses, timestamps with sequential seconds, identical HTTP status codes, recurring URL paths.
Expected Compression: ~40-55% reduction → ~450-600 MB output
Delta encoding crushes timestamp sequences (incrementing seconds produce delta=1 runs). Repeated structural elements (spaces, brackets, quotes) create long zero-delta runs.
Scenario 2: Random / Encrypted Binary Data
Uniform byte distribution, no patterns, no sequential relationships.
Expected Compression: none — the data EXPANDS to roughly 2x → ~2 GB output
Every delta change triggers a flush, producing 2 bytes (count + delta) for almost every input byte, plus the anchor overhead. This is mathematically unavoidable: Shannon’s source coding theorem (or a simple pigeonhole argument) guarantees no lossless algorithm can compress truly random data.
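The expansion on random input can be estimated without running the full compressor, just by counting how often consecutive deltas change (a Python sketch; each maximal run of equal deltas costs one 2-byte pair):

```python
import os

raw = os.urandom(100_000)  # stand-in for encrypted/random payload
deltas = [(raw[i] - raw[i - 1]) & 0xFF for i in range(1, len(raw))]
changes = sum(d != p for d, p in zip(deltas[1:], deltas))
runs = changes + 1                  # each maximal run -> one (count, delta) pair
est_out = 1 + 2 * runs              # anchor + 2 bytes per pair (255-splits negligible here)
print(f"{est_out / len(raw):.2f}x")  # ~1.99x
```

Two consecutive deltas of uniform random bytes match with probability 1/256, so runs of length 1 dominate and the output approaches two bytes per input byte.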
Scenario 3: Database Exports (CSV / SQL Dumps)
Sequential IDs (1, 2, 3…), repeated column names, consistent delimiters, similar numeric values.
Expected Compression: ~30-45% reduction → ~550-700 MB output
Auto-incrementing IDs produce perfect delta=1 runs. Repeated commas, quotes, and fixed-width fields compress well. Variable-length string fields (names, addresses) reduce overall effectiveness.
Versus Traditional Algorithms
| Algorithm | Log File | Random | DB Dump | Dictionary? | Decode Complexity |
|---|---|---|---|---|---|
| VESQER (DPCM+RLE) | ~50% | Expands | ~40% | ❌ No | Trivial (~15 instructions) |
| gzip (LZ77+Huffman) | ~80% | ~0% | ~70% | ✅ Yes | Heavy (sliding window + tree) |
| LZ4 | ~65% | ~0% | ~55% | ✅ Yes | Medium (hash table lookups) |
| zstd | ~85% | ~0% | ~75% | ✅ Yes | Heavy (FSE + match finder) |
VESQER doesn’t compete with dictionary-based algorithms on raw compression ratio. It was never designed to. Its advantages are elsewhere:
- Zero memory overhead: No hash tables, no sliding windows, no frequency tables.
- Tiny code footprint: The entire compressor fits in ~100 instructions. Critical for PIC shellcode.
- Trivial decompression: ~15 instructions to decompress. The client-side decoder is negligible.
- No signatures: No magic bytes, no headers, no identifiable structure. There is nothing for a static signature to match.
C2 Integration: The Real Purpose
In the Ghost-C2 project, VESQER sits in the data pipeline between command execution and network transmission:
Command Output (RAM) → VESQER Compress → Rolling XOR → ICMP Chunking → Wire
The compression step achieves two things simultaneously:
1. Reduced Wire Footprint: An ls -la output that would normally require 9 ICMP packets now fits in 5. Fewer packets = fewer detection opportunities = lower IDS correlation probability.
2. Double-Layer Obfuscation: Compressed data already has high entropy — it looks like random noise. When Rolling XOR is applied on top, the result is indistinguishable from encrypted random data. A DPI engine analyzing the payload sees no ASCII patterns, no structural markers, nothing to fingerprint.
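For illustration, here is one common rolling-XOR construction layered over compressed bytes. This is an assumption for the sketch, not Ghost-C2’s actual scheme: `rolling_xor`, `rolling_unxor`, and the 0x5A seed are all hypothetical.

```python
def rolling_xor(data: bytes, seed: int = 0x5A) -> bytes:
    """XOR each byte with a key that rolls forward using the ciphertext."""
    out, k = bytearray(), seed
    for b in data:
        c = b ^ k
        out.append(c)
        k = c  # next key state = previous ciphertext byte
    return bytes(out)

def rolling_unxor(data: bytes, seed: int = 0x5A) -> bytes:
    """Inverse: the receiver reproduces the key stream from the ciphertext."""
    out, k = bytearray(), seed
    for c in data:
        out.append(c ^ k)
        k = c
    return bytes(out)

blob = bytes([0x41, 0x09, 0x01])  # example compressed payload
wire = rolling_xor(blob)
print(rolling_unxor(wire) == blob)  # True
```

Because each key state depends on the previous ciphertext byte, repeated plaintext bytes no longer map to repeated wire bytes, which is what breaks the residual structure left by the compressor.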
The PIC-integrated version of the compressor is called via a simple function convention:
; RSI = Source buffer (raw command output)
; RDI = Destination buffer (compressed output)
; RCX = Input length
call _vesqer_compress
; RAX = Compressed length (new size for chunking)
All registers are preserved via push/pop, making it a clean drop-in function within the agent’s execution pipeline.
The “Accidental Reinvention” Reflection
The irony isn’t lost on me. DPCM was formalized in the 1950s for voice digitization. RLE dates back to the 1960s. Combining predictive coding with run-length encoding has been done in countless forms — JPEG uses a variant, video codecs use it, even fax machines used it.
But none of those implementations were written in pure x86-64 Assembly, with zero dependencies, optimized for shellcode deployment inside a fileless C2 agent running inside a hijacked cron process.
Sometimes reinventing the wheel is the whole point — when you need that wheel to be invisible.
📖 Source Code & Implementation
The complete standalone compressor and decompressor are available as open-source tools:
VESQER Compressor Repository: dpcm-rle-hybrid-x64-compressor on GitHub
Ghost-C2 (Integrated Version): Ghost-C2 on GitHub
⚠️ Legal Disclaimer
This project is created for educational purposes and security research only. The compression algorithm itself is a general-purpose tool, but its integration with C2 frameworks is discussed purely in the context of authorized penetration testing and red team operations. Unauthorized access to computer systems is illegal. The author is not responsible for any misuse of this tool.