OmegaMatch — High-Throughput Multi‑Pattern Text Matching Engine

OmegaMatch is a fast, embeddable, multi‑pattern exact matcher for products and platforms that need to scan large text streams against thousands of patterns in near real time. It couples a memory‑mapped compiled pattern store with cache‑efficient data structures, multi‑threaded execution (OpenMP), and optional normalization (case‑insensitive, punctuation‑ignoring, whitespace‑eliding) — all in a lean Apache 2.0 licensed package.

Build search, compliance, moderation, enrichment, deduplication, or security pipelines that must never become the bottleneck.


At a Glance

CapabilityWhat it means for you
Memory‑mapped compiled pattern storeCompile once; deploy everywhere; near-zero warmup.
Multi-threaded (OpenMP) coreScales with cores for batch or streaming workloads.
Profile-Guided Optimization (PGO) buildsOptional variants (Linux: GCC & Clang, Windows: MSVC) yielding typical +5–20% throughput on representative workloads.
Two-tier pipeline (Bloom + hash table)Filters >90% of non-matches early; fewer cache misses; higher throughput.
Specialized short‑pattern acceleratorUltra-fast handling of 1–4 byte patterns (bitmap + binary search).
Rich post-filtersWord/line anchors, longest-only, no-overlap, prefix/suffix constraints.
Transform layerCase-insensitive, ignore punctuation, elide whitespace — without per-call reallocation.
Deterministic & exactNo false positives (Bloom pre-filters only; exact hash validation).
Cross-platformWindows, Linux, macOS (C library + Python bindings + CLI tools).
Low operational footprintShare the same compiled pattern file across processes; minimal per-instance memory.
Apache 2.0 FOSSUse in commercial, cloud, on-prem, embedded — no licensing friction.

Why Teams Adopt OmegaMatch

  1. Throughput Headroom — Keeps pace with high-volume ingestion (logs, chat, email, documents) without sharding complexity.
  2. Operational Simplicity — Pattern lists become artifacts you can version, sign, ship, and hot‑reload.
  3. Predictable Latency — Pure CPU, no GC pauses, no dynamic regex backtracking explosions.
  4. Feature Pragmatism — Focused on exact multi-pattern matching with powerful structural filters, not an overgeneral regex VM.
  5. Transparent Performance — Benchmarked openly (includes its own performance harness). PGO builds available for maximum optimization.
  6. Extensible Surface — Clean C API plus Python module and CLI tools for rapid integration and prototyping.

Core Differentiators

  • Compile → Map → Match workflow decouples pattern management from runtime workloads.
  • Cache-conscious layout — Patterns clustered; short patterns fast‑pathed; streaming k‑way merge enables linear post‑filter passes.
  • Selective transforms — Applied only when requested; no permanent preprocessing cost.
  • Pluggable concurrency — Tune thread count and chunk size to balance throughput vs. tail latency.

Supporting Higher-Level DSLs (e.g., OmegaOMG)

Higher‑level rule / grammar DSLs (such as OmegaOMG) can layer on top of OmegaMatch. OmegaMatch contributes:

  • High‑throughput literal multi‑pattern baseline feeding grammar evaluation.
  • Pre‑anchored longest / non‑overlapping match streams that reduce downstream search.
  • Memory‑mapped compiled store for fast cold starts and concurrent evaluators with minimal overhead.
  • Structural filters (word / line boundaries, prefix / suffix) to prune impossible spans early.
  • Optional normalization (case‑insensitive, ignore punctuation, elide whitespace) applied consistently before higher‑level assembly.
  • Deterministic, stable match offsets for reliable resolver / AST mapping.

This lets the DSL runtime focus on rule semantics while delegating raw scanning to OmegaMatch.


Performance Snapshot (Representative Excerpts)

Measured via included perf_test.py on a sample dataset (≈2K surname patterns over ≈4MB text). These are illustrative, not guarantees; always benchmark with your own corpus.

ScenarioRelease (MB/s)Grep (MB/s)Ratio (Ω / Grep)Notes
Baseline (multi-pattern)7992.79N/AHigh raw throughput.
line-end + ignore-case6336.8712.49507.36×Anchor + CI optimization shines.
line-start + line-end7988.92102.8577.68×Dual anchors reduce candidate set.
ignore-case + word-boundary8924.43159.5255.95×Structural boundary filtering.
longest + no-overlap7578.938696.320.87×Transparent case where grep is comparable/faster.

Interpretation: OmegaMatch excels on structurally constrained multi‑pattern workloads (anchors / boundaries) and remains competitive even when post‑filters reduce raw scan efficiency. Showing a slower case builds trust and clarifies fit boundaries.

Run the suite yourself: python perf_test.py --show-status (auto-falls back if system grep unavailable).


Ideal Use Cases

DomainExamples
Security & Threat IntelIOC / keyword sweeps across logs, email gateways, DLP scanners.
Trust & Safety / ModerationDisallowed phrases, brand abuse, impersonation handles.
Compliance & GovernancePolicy term detection (PII tokens, restricted lexicons).
Data Enrichment / ETLTagging entities (names, TLDs, usernames) before downstream NLP.
Deduplication / FilteringFast elimination of already-known markers or signatures.
Content AnalyticsPre-filter candidate spans before heavier semantic / ML stages.

If your workload involves (a) large text streams, (b) thousands of literal patterns, and (c) strict latency / throughput targets — OmegaMatch is purpose‑built.


How It Works (Internal Pipelines)

1. Compilation (one-time or on pattern change)

patterns.txt
  │  (read & parse; normalize options flags)
  ▼
Pattern Store Builder ──> store literals contiguously
  │
  ├── Build Short Matcher (len1..len4 bitmaps/arrays)
  ├── Build Bloom Filter (n hashes, sized by pattern set)
  ├── Build Hash Table (index array + bucket data for ≥5)
  ▼
Serialize Header + [Pattern Store][Bloom][Hash][Short Matcher]
  ▼
Binary File (memory‑mappable)

2. Runtime (long patterns ≥5)

Memory‑map compiled file (O(1) startup)
  │
  │ (optional transform table active?)
  ├── Yes: chunked normalize window (case / punct / ws) + offset map
  │       ▼
  │   Normalized chunk
  │
  │ (OpenMP / threaded scan over haystack positions)
  ▼
Boundary Fast Check (word/line anchors pre-skip)
  │ (if remaining ≥4 bytes)
  ▼
Pack 4‑gram → Bloom Filter
  │  (likely reject → skip)
  │  (pass)
  ▼
Hash Probe → Bucket Scan (exact compare, append matches)
  ▼
Per‑thread Match Vectors
  ▼ (merge)
K‑way Merge (offset asc, length desc)
  ▼
Post Filters (longest-only → no-overlap; residual boundary / prefix / suffix checks)
  ▼
Result Set (stable offsets remapped if normalized)

3. Runtime (short patterns 1–4)

Runs in the same positional scan; short path is independent of Bloom/Hash.

Position Scan
  │
  ├── Optional boundary pre-check (skip if not at potential boundary when required)
  │
  ├── len4 array binary search (if any len4 patterns)
  ├── len3 array binary search (if any len3 patterns)
  ├── len2 array bitmap / array lookup
  └── len1 bitmap lookup
      (each successful probe emits candidate; boundary / line / prefix / suffix validated)
  ▼
Short Matches Appended
  (coalesced with long pattern matches before k‑way merge)

4. Optional Transform Path

When normalization flags are set, each chunk is transformed (case fold, punctuation stripped, whitespace elided) with a position map so match offsets are remapped back to the original haystack after matching.

5. Output Guarantees

  • Exact literal matches (no false positives beyond requested filters).
  • Offsets stable and deterministic (post normalization remap if used).
  • Ordering invariant: primary sort offset ascending, secondary length descending.
    • Enables O(n) longest-only (first at offset is longest) and O(n) no-overlap (single forward pass) post-filters.

Quick Start (C)

#include <omega/list_matcher.h>
omega_list_matcher_t *m = omega_list_matcher_create("patterns.txt", /*case_insensitive=*/0, /*ignore_punctuation=*/0, /*elide_ws=*/0, NULL);
// ... load data into buffer ...
omega_match_results_t *r = omega_list_matcher_match(m, data, len, /*no_overlap=*/1, /*longest_only=*/1, /*word_boundary=*/1, /*word_prefix=*/0, /*word_suffix=*/0, /*line_start=*/0, /*line_end=*/0);
// iterate r->matches
omega_match_results_destroy(r);
omega_list_matcher_destroy(m);

Quick Start (Python)

from omega_match.omega_match import Compiler, Matcher

# Compile once (creates matcher.olm). You can also stream patterns with Compiler().
stats = Compiler.compile_from_filename("matcher.olm", "patterns.txt", case_insensitive=True)
print("Patterns compiled:", stats.stored_pattern_count + stats.short_pattern_count)

# Auto-selects best native variant (prefers PGO when bundled)
with Matcher("matcher.olm") as matcher:
  hay = open("haystack.txt", "rb").read()
  results = matcher.match(hay, word_boundary=True, longest_only=True, no_overlap=True)
  for r in results:
    print(r.offset, r.length)

Quick Start (Python CLI)

# Compile patterns
python bindings/python/olm.py compile patterns.olm patterns.txt --ignore-case

# Match with output to file
python bindings/python/olm.py match patterns.olm haystack.txt --output results.txt --word-boundary --longest

Adoption Path

  1. Identify your static (or periodically updated) pattern list(s).
  2. Compile once (implicitly on first create) and store alongside app artifacts.
  3. Memory-map in each worker/process (startup remains lightweight).
  4. Tune thread count (OMP_NUM_THREADS or API) and chunk size for workload.
  5. Layer domain logic (e.g., entity tagging, suppression) on returned offsets.

Architectural Advantages

  • Zero‑copy pattern sharing — Multiple workers reuse the same mapped file.
  • Branch locality — Short path for most non‑matching positions (early rejection via boundary + Bloom).
  • Deterministic post‑filters — Order‑independent final semantics (idempotent longest / no‑overlap passes).
  • Incremental scalability — Add CPU cores; no specialized hardware or external indexing cluster.

Comparison (Conceptual)

ApproachTrade-offs
Naïve N×M substring scansO(N·M) work; cache-thrashing; no sharing.
Regex engine per pattern setHigher compile cost + backtracking risk; overkill for literals.
Inverted index / search engineHeavy ingestion / memory; tuned for term search, not anchored literal spans.
GPU / SIMD bespoke solutionHigher complexity & maintenance; narrower developer familiarity.
OmegaMatchPurpose-built, balanced complexity vs. performance; portable; exact semantics.

Transparency & Limits

OmegaMatch is optimized for exact literal multi‑pattern matching with structural filters. It is not a general regex engine, fuzzy matcher, or semantic classifier. For fuzzy or NLP tasks, use it as a front‑line accelerator to reduce candidate volume before heavier processing.


Licensing & Governance

  • License: Apache License 2.0 (permissive, patent grant included)
  • Commercial use: Allowed — bundle, modify, redistribute with attribution.
  • Contribution model: Standard fork + PR; CLA not required (see repository guidelines).

Roadmap Ideas (Community-Driven)

  • Incremental / delta pattern compilation.
  • Streaming window API (bounded latency on unbounded sources).
  • Pluggable custom normalization tables.
  • Optional SIMD path exploration for additional micro-gains.
  • Language bindings (Rust, Go) if demand surfaces.

Want one of these sooner? Open an issue or contribute a design proposal.


Integrate Today

GoalAction
Evaluate performanceRun perf_test.py on your corpus.
Prototype quicklyUse Python binding in a notebook or script.
Production embedLink C library, ship compiled pattern store.
Share feedbackFile issues (features, perf anomalies, datasets).
ContributeImprove docs, port bindings, add benchmarks.

Call to Action: Clone. Benchmark. Replace brittle multi‑regex loops. Ship faster.

git clone https://github.com/scholarsmate/omega-match
cd omega-match
python perf_test.py --show-status

Contact & Community

  • GitHub Issues: feature requests & bugs
  • Discussions: architecture, optimization, design questions
  • PRs: Welcome — focus on measurable performance or clarity improvements

If OmegaMatch saves you infrastructure or developer time, consider starring the repo and sharing a benchmark comparison — it helps the ecosystem grow.


Build the fastest multi‑pattern literal filter layer in your stack — with OmegaMatch.