OmegaOMG — High‑precision Entity Extraction at Scale
OmegaOMG is a compact, production‑ready grammar and engine for turning unstructured text into clean, canonical entities. It combines a simple DSL with an optimized evaluator and a robust resolver pipeline to deliver accurate, deduplicated results you can trust.
Define rules once. Extract with confidence. Ship faster.
The list-matching engine is built on OmegaMatch.
Why OmegaOMG
- Precision by design: longest, left‑most, non‑overlapping anchors keep noise out.
- Built for scale: fast, byte‑level evaluation with smart indexing and caching.
- Clean output: canonicalization, parent/child linking, and sentence/paragraph metadata.
- Easy to author: expressive DSL (v1.0) with imports, quantifiers, dot/char classes, named captures.
- Works anywhere: Python runtime, simple CLI, and a VS Code extension for authoring.
What makes it different
- Anchored matching: Uses curated token lists (via OmegaMatch) to pre‑anchor rules and minimize false positives.
- Deterministic resolution: A staged, auditable pipeline to deduplicate parents, attach children, and enrich with boundaries.
- Practical performance: Offset‑indexed lookups, greedy quantifier chaining, and adaptive sampling reduce scanning cost.
- Clear boundaries: No unbounded regex; bounded quantifiers only. Production behavior is predictable and fast.
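The "longest, left-most, non-overlapping" policy can be illustrated with a short Python sketch. This is not OmegaOMG's or OmegaMatch's actual code: the `Span` tuple layout and the function name `select_leftmost_longest` are assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical span layout: (start_offset, end_offset_exclusive, list_alias)
Span = Tuple[int, int, str]


def select_leftmost_longest(candidates: List[Span]) -> List[Span]:
    """Keep non-overlapping spans, preferring earlier starts, then longer spans."""
    # Earlier start wins; among equal starts the longer span wins;
    # the alias is a final tie-breaker so the result is deterministic.
    ordered = sorted(candidates, key=lambda s: (s[0], -(s[1] - s[0]), s[2]))
    kept: List[Span] = []
    cursor = 0  # first offset not yet covered by a kept span
    for start, end, alias in ordered:
        if start >= cursor:
            kept.append((start, end, alias))
            cursor = end
    return kept


# Candidate anchors over the text "Mary Ann Smith"
candidates = [
    (0, 4, "given_name"),   # "Mary"
    (0, 8, "given_name"),   # "Mary Ann"
    (5, 8, "given_name"),   # "Ann"
    (9, 14, "surname"),     # "Smith"
]
selected = select_leftmost_longest(candidates)
# selected == [(0, 8, "given_name"), (9, 14, "surname")]
```

Sorting once and sweeping with a cursor keeps the selection deterministic: the same candidates always produce the same anchors, regardless of input order.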
Ideal for
- PII and sensitive data detection (names, phones, emails, IDs)
- Compliance & eDiscovery workflows
- Threat intel and OSINT enrichment
- Resume/CV and profile parsing
- Log and support ticket mining
Core capabilities
- DSL v1.0: literals, escapes (\d \s \w, etc.), dot (.), character classes ([...]), grouping, alternation, named captures, bounded quantifiers, and list matches ([[alias]]).
- Imports with flags: ignore-case, ignore-punctuation, elide-whitespace, word-boundary, word-prefix, word-suffix, line-start, line-end.
- Resolver methods: exact and fuzzy(threshold=...), optional tokens, and case/punctuation normalization.
- Parent/child modeling: dotted rules (e.g., person.surname) with automatic parent boundary enrichment.
- JSON output: clean, line‑delimited JSON or pretty JSON for easy pipelines.
- HTML visualization: interactive highlighted HTML via highlighter.py for fast review and demos.
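To give a feel for what exact and fuzzy resolution with case/punctuation normalization can mean in practice, here is a minimal Python sketch. It is an illustration under stated assumptions: this page does not specify the similarity metric behind fuzzy(threshold=...), so the standard library's `difflib.SequenceMatcher` is used as a stand-in, and the helper names `normalize`, `matches_exact`, and `matches_fuzzy` are hypothetical.

```python
import string
from difflib import SequenceMatcher

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str, ignore_case: bool = True, ignore_punctuation: bool = True) -> str:
    """Apply case and punctuation normalization, then collapse whitespace runs."""
    if ignore_case:
        text = text.casefold()
    if ignore_punctuation:
        text = text.translate(_PUNCT_TABLE)
    return " ".join(text.split())


def matches_exact(a: str, b: str) -> bool:
    """Two mentions are the same entity if their normalized forms are identical."""
    return normalize(a) == normalize(b)


def matches_fuzzy(a: str, b: str, threshold: float = 0.85) -> bool:
    """Stand-in similarity test; the real metric may differ."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold


same_exact = matches_exact("J. Smith", "j smith")                   # True
same_fuzzy = matches_fuzzy("Eisenhower", "Eisenhauer", 0.75)        # True
different = matches_fuzzy("Eisenhower", "Kennedy", 0.75)            # False
```

A higher threshold trades recall for precision: fewer spelling variants are merged, but unrelated names are less likely to collapse into one entity.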
How it works (high level)
- Author rules in OMG DSL referencing curated token lists.
- Evaluate over byte inputs using an optimized engine with pre‑anchored matches.
- Apply entity resolution:
  - Overlap removal with deterministic tie‑breaking
  - Horizontal canonicalization (dedupe parents)
  - Vertical child→parent linking
  - Boundary metadata enrichment (sentence, paragraph)
Detailed algorithm: see RESOLUTION.md.
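The horizontal and vertical resolution steps can be sketched roughly as follows. This is a simplified illustration, not the pipeline documented in RESOLUTION.md: the match dictionary fields (`rule`, `text`, `start`, `end`) and the function name `resolve` are assumptions, and case-folded text stands in for whatever canonical key the real resolver uses.

```python
from typing import Dict, List, Tuple


def resolve(matches: List[dict]) -> List[dict]:
    """Dedupe parent mentions, then attach dotted child matches to their parent."""
    parents = [m for m in matches if "." not in m["rule"]]
    children = [m for m in matches if "." in m["rule"]]

    # Horizontal step: mentions with the same rule and case-folded text share one entity.
    entities: Dict[Tuple[str, str], dict] = {}
    for m in sorted(parents, key=lambda m: m["start"]):
        key = (m["rule"], m["text"].casefold())
        entity = entities.setdefault(
            key,
            {"rule": m["rule"], "canonical": m["text"], "mentions": [], "children": []},
        )
        entity["mentions"].append((m["start"], m["end"]))

    # Vertical step: "person.surname" attaches to the "person" entity
    # that has a mention fully containing the child span.
    for c in sorted(children, key=lambda m: m["start"]):
        parent_rule = c["rule"].rsplit(".", 1)[0]
        for entity in entities.values():
            if entity["rule"] != parent_rule:
                continue
            inside = any(s <= c["start"] and c["end"] <= e for s, e in entity["mentions"])
            seen = any(
                k["rule"] == c["rule"] and k["text"].casefold() == c["text"].casefold()
                for k in entity["children"]
            )
            if inside and not seen:
                entity["children"].append({"rule": c["rule"], "text": c["text"]})
            if inside:
                break
    return list(entities.values())


# Matches over "John Kennedy spoke. JOHN KENNEDY left."
matches = [
    {"rule": "person", "text": "John Kennedy", "start": 0, "end": 12},
    {"rule": "person.surname", "text": "Kennedy", "start": 5, "end": 12},
    {"rule": "person", "text": "JOHN KENNEDY", "start": 20, "end": 32},
    {"rule": "person.surname", "text": "KENNEDY", "start": 25, "end": 32},
]
resolved = resolve(matches)
```

Here the two mentions collapse into a single person entity with two mention spans and one surname child. Boundary enrichment (sentence and paragraph metadata) is left out of the sketch.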
Quick start
- Try the demo rules on a sample document:
# Windows PowerShell
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
python omg.py --output matches.json .\demo\demo.omg .\demo\CIA_Briefings_of_Presidential_Candidates_1952-1992.txt
python highlighter.py .\demo\CIA_Briefings_of_Presidential_Candidates_1952-1992.txt matches.json CIA_demo.html
Open CIA_demo.html in your browser to inspect the results.
Example rule (DSL)
version 1.0
import "names.txt" as given_name with word-boundary, ignore-case
import "surnames.txt" as surname with word-boundary, ignore-case
resolver default uses exact with ignore-case
person = [[given_name]] ( \s{1,4} [[given_name]] ){0,2} ( \s{1,4} \w | \s{1,4} \w "." )? \s{1,4} [[surname]]
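To see roughly what the rule above accepts, it can be approximated with Python's `re` module. This is only an analogy: the tiny in-line name lists stand in for names.txt and surnames.txt, `\b` approximates the word-boundary flag, `re.IGNORECASE` approximates ignore-case, and a backtracking regex does not reproduce the engine's pre-anchored, byte-level evaluation. The helper names are hypothetical.

```python
import re
from typing import List


def alternation(tokens: List[str]) -> str:
    """Build a non-capturing alternation, longest tokens first."""
    ordered = sorted(tokens, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(t) for t in ordered) + ")"


def build_person_pattern(given_names: List[str], surnames: List[str]) -> re.Pattern:
    given = r"\b" + alternation(given_names) + r"\b"
    surname = r"\b" + alternation(surnames) + r"\b"
    pattern = (
        given                                  # [[given_name]]
        + r"(?:\s{1,4}" + given + r"){0,2}"    # up to two more given names
        + r"(?:\s{1,4}\w|\s{1,4}\w\.)?"        # optional initial, with or without "."
        + r"\s{1,4}" + surname                 # [[surname]]
    )
    return re.compile(pattern, re.IGNORECASE)


PERSON = build_person_pattern(["John", "Mary", "Ann"], ["Kennedy", "Smith"])
text = "Senator John F. Kennedy met Mary Ann Smith."
found = [m.group(0) for m in PERSON.finditer(text)]
# found == ["John F. Kennedy", "Mary Ann Smith"]
```

Every quantifier in the rule is bounded ({0,2}, {1,4}, ?), so the amount of text a single match can span is capped by the rule itself.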
Integrations
- CLI: omg.py for batch extraction and JSON output
- Python API: programmatic parsing and evaluation (dsl.omg_parser, dsl.omg_evaluator)
- VS Code extension: OMG Language Support for syntax highlighting & IntelliSense
- List matching engine: powered by OmegaMatch
- Highlighter: render interactive HTML from matches (highlighter.py)
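For batch extraction, the CLI can be driven from a small Python script. The sketch below uses only the invocation shown in Quick start (omg.py --output <json> <rules> <text>) and assumes it runs from the repository root with the virtual environment active. The helper names `build_command` and `run_batch` are hypothetical, not part of OmegaOMG.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_command(rules: Path, document: Path, output: Path) -> List[str]:
    """Mirror the Quick start call: omg.py --output <json> <rules> <text>."""
    return [sys.executable, "omg.py", "--output", str(output), str(rules), str(document)]


def run_batch(rules: Path, documents: List[Path], out_dir: Path) -> List[Path]:
    """Run the CLI once per document and return the JSON paths that were requested."""
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs: List[Path] = []
    for doc in documents:
        out = out_dir / (doc.stem + ".json")
        # check=True raises CalledProcessError if omg.py exits with a non-zero status
        subprocess.run(build_command(rules, doc, out), check=True)
        outputs.append(out)
    return outputs


example = build_command(Path("demo/demo.omg"), Path("report.txt"), Path("out/report.json"))
```

Calling `run_batch(Path("demo/demo.omg"), sorted(Path("docs").glob("*.txt")), Path("out"))` would then write one JSON file per input document; the `docs` and `out` directory names are placeholders.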
Performance mindset
- Byte‑accurate matching with pre‑compiled list anchors
- Binary‑searched offset maps and caching across nodes
- Greedy, adjacency‑enforced ListMatch quantifiers
- Adaptive start‑offset sampling for complex patterns
Performance depends on rule complexity and input size; the engine is designed to stay efficient on large files and corpora.
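The idea behind binary-searched offset maps and adjacency checks can be sketched in a few lines of Python with the standard `bisect` module. This is an illustration only: the class name `OffsetIndex`, its methods, and the assumption that each alias's anchors are already non-overlapping are not taken from the OmegaOMG source.

```python
from bisect import bisect_left
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # (start_offset, end_offset_exclusive)


class OffsetIndex:
    """Per-alias anchor spans sorted by start offset, searched with bisect."""

    def __init__(self, anchors: Dict[str, List[Span]]) -> None:
        self._spans = {alias: sorted(spans) for alias, spans in anchors.items()}
        self._starts = {alias: [s for s, _ in spans] for alias, spans in self._spans.items()}

    def at(self, alias: str, offset: int) -> Optional[Span]:
        """Return the anchor starting exactly at offset, or None (adjacency check)."""
        starts = self._starts.get(alias, [])
        i = bisect_left(starts, offset)
        if i < len(starts) and starts[i] == offset:
            return self._spans[alias][i]
        return None

    def next_from(self, alias: str, offset: int) -> Optional[Span]:
        """Return the first anchor starting at or after offset, or None."""
        starts = self._starts.get(alias, [])
        i = bisect_left(starts, offset)
        return self._spans[alias][i] if i < len(starts) else None


# Anchors over "Mary Ann Smith"
index = OffsetIndex({"given_name": [(5, 8), (0, 4)], "surname": [(9, 14)]})
hit = index.at("given_name", 5)         # (5, 8)
miss = index.at("given_name", 6)        # None
upcoming = index.next_from("surname", 0)  # (9, 14)
```

Each lookup costs O(log n) in the number of anchors for that alias, so a rule can ask "is there a given_name exactly here?" without rescanning the text.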
Governance & licensing
- Open source under Apache 2.0 (LICENSE)
- Tested with a comprehensive pytest suite
- Minimal, pinned dependencies for stability
Learn more
- README: features, usage, and CLI options
- RESOLUTION.md: deep dive into the resolver pipeline
- Demo: demo/demo.omg plus sample lists and texts
Get started today—write a rule, run the demo, and ship accurate entity extraction with OmegaOMG.