Knowledge Engine — Concepts

Hash-first identity

An object is a blob of bytes addressed by its SHA-256. That hash is the identity: lookups, downloads, de-duplication, and processing all key off it. A bucket/key is a mutable location pointing at an object — one object can have many locations, and renaming or moving a location never changes identity. Names, tags, aliases, relations, and notes are metadata attached to the hash.
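As a rough illustration of the identity model above, the following in-memory sketch keeps objects keyed by SHA-256 and treats bucket/key pairs as mutable pointers. The `Store` class and its method names are hypothetical and are not part of KE; only the idea (hash as identity, location as pointer) comes from the text.

```python
import hashlib


def object_id(data: bytes) -> str:
    """Return the SHA-256 hex digest that serves as the object's identity."""
    return hashlib.sha256(data).hexdigest()


class Store:
    """Hypothetical in-memory model: objects by hash, locations as pointers."""

    def __init__(self):
        self.objects = {}    # sha256 -> bytes
        self.locations = {}  # (bucket, key) -> sha256

    def put(self, bucket: str, key: str, data: bytes) -> str:
        sha = object_id(data)
        # Identical bytes are stored once, however many locations point at them.
        self.objects.setdefault(sha, data)
        self.locations[(bucket, key)] = sha
        return sha

    def move(self, src: tuple, dst: tuple) -> None:
        # Moving a location rewrites the pointer only; the hash is untouched.
        self.locations[dst] = self.locations.pop(src)
```

Storing the same bytes under two locations yields one entry in `objects` and two in `locations`, and `move` changes only the pointer.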

The pipeline

Every ingest path funnels into one storage service, which records identity and emits an event. A reactor turns events into jobs; workers run extractors and import their output; operations answer queries. There is no hidden orchestration — each stage is observable.

CLI ingest / REST PUT / Git push
        │
        ▼
storage service ──► ke_object (sha256), asset, asset_location (bucket/key), storage_event
        │
        ▼
reactor ──► selects matching plugins ──► enqueues extract jobs (deduped by plugin + sha256)
        │
        ▼
worker ──► materializes bytes ──► runs extractor (python | ida-python | node) ──► NDJSON out
        │
        ▼
imports (plugin SQL) ──► plugin tables
operations (plugin SQL) ──► search / match

Because extraction is deduped by plugin + sha256, the same bytes under many names or repositories are analyzed once.
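The dedup rule can be sketched as a queue that refuses a second job for the same (plugin, sha256) pair. This is a minimal in-memory model; `JobQueue` is a hypothetical name, and the actual platform may well enforce the rule differently (for example in the database).

```python
class JobQueue:
    """Hypothetical extract-job queue deduplicated by (plugin, sha256)."""

    def __init__(self):
        self._seen = set()   # every (plugin, sha256) pair ever enqueued
        self.pending = []    # jobs waiting for a worker, in arrival order

    def enqueue(self, plugin: str, sha256: str) -> bool:
        """Queue a job; return False if this pair was already queued."""
        job = (plugin, sha256)
        if job in self._seen:
            return False
        self._seen.add(job)
        self.pending.append(job)
        return True
```

Note that the location (bucket/key) is absent from the dedup key, which is why the same bytes pushed under another name or from another repository do not trigger a second run of the same plugin.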

Plugins, extractors, operations

A plugin is a static package: a manifest plus SQL plus one or more extractors. An extractor is an isolated process with a manifest-in / files-out contract — it runs as headless IDA (ida-python), or plain python/node — and emits NDJSON rows the platform imports into plugin-owned tables. An operation is plugin-declared SQL exposed as a query (e.g. match_resource). The platform owns storage, jobs, transactions, and isolation; the plugin owns its analysis and its data contract.
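To make the NDJSON side of the contract concrete: NDJSON is one JSON object per line. The sketch below shows how a python extractor might write its rows and how an importer might read them back. The function names and the row fields (`name`, `addr`) are illustrative assumptions; the real manifest and row schemas are defined by each plugin and are not shown here.

```python
import io
import json


def emit_rows(rows, out) -> int:
    """Write each row as one JSON object per line; return the row count."""
    count = 0
    for row in rows:
        # json.dumps escapes embedded newlines, so each row stays on one line.
        out.write(json.dumps(row, sort_keys=True) + "\n")
        count += 1
    return count


def parse_rows(text: str) -> list:
    """Parse NDJSON text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Usage: round-trip two hypothetical function-name rows through a buffer.
buf = io.StringIO()
emit_rows([{"name": "main", "addr": 4096}], buf)
rows = parse_rows(buf.getvalue())
```

Because each line is independently parseable, an importer can stream rows into plugin tables without loading the whole output at once.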

KE Actions: Git as a source, not a store

A KE-hosted repository can include a .ke/actions.yml that declares how pushed files become corpus objects. The repository layout stays free-form; the config declares the projection.

version: 1
rules:
  - id: idbs
    match: "**/*.i64"          # glob; a file may match several rules
    bucket: idb-analysis       # target bucket (decoupled from the repo name)
    key: "{path}"              # template: {repo} {path} {basename} {sha256} {branch} {commit}
    tags: [ida, "{branch}"]    # templated searchable labels
    process: [kep-bbsh, kep-funcnames]   # plugins to run after storing

One repo can feed many buckets, and many repos can feed one bucket — the rules decide. An invalid config is rejected at push time.
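The projection described by the rules can be approximated in a few lines of Python. This is a sketch under stated assumptions: the glob semantics (`**/` matching zero or more directories, `*` not crossing `/`) and the use of `str.format` for templates are guesses at behavior the text does not specify, and `glob_to_regex` and `project` are hypothetical helpers, not KE functions.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob to a regex (assumed semantics, see lead-in)."""
    i, out = 0, []
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append(r"(?:.*/)?")      # zero or more leading directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(r".*")            # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            out.append(r"[^/]*")         # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            out.append(r"[^/]")          # one character within a segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def project(rules: list, ctx: dict) -> list:
    """Return one projection per rule whose glob matches ctx['path']."""
    results = []
    for rule in rules:
        if glob_to_regex(rule["match"]).fullmatch(ctx["path"]):
            results.append({
                "rule": rule["id"],
                "bucket": rule["bucket"],
                "key": rule["key"].format(**ctx),
                "tags": [t.format(**ctx) for t in rule.get("tags", [])],
                "process": list(rule.get("process", [])),
            })
    return results
```

Here `ctx` is expected to carry every template variable (`repo`, `path`, `basename`, `sha256`, `branch`, `commit`). Since a file may match several rules, `project` returns a list rather than stopping at the first match.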

Gitea transport (optional)

For hosted repositories, KE drives a Gitea instance behind the scenes: Gitea owns repos, HTTP transport, and storage; KE owns identity and projection. KE provisions a backing Gitea account and a per-user token, so a user authenticates once and pushes over HTTPS as themselves — not a shared admin token. A push webhook (or gitea:sync) runs the same Actions projection. Transport is HTTPS only.

One Postgres

Identity, metadata, jobs, and all plugin data live in a single Postgres instance (ParadeDB, which adds BM25 full-text search alongside pgvector for vector search). Blob bytes are stored on local disk, keyed by hash. Gitea, when used, runs alongside with its own embedded SQLite database. There is no separate search server, vector server, or task broker to run.
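Storing blob bytes on disk by hash might look like the sketch below. The directory layout (a two-character fan-out directory, then the full digest as the file name) and the helper names are assumptions chosen for illustration; the text says only that blobs are stored on local disk by hash.

```python
import hashlib
import tempfile
from pathlib import Path


def blob_path(root: Path, sha256: str) -> Path:
    """Map a digest to a file path (assumed layout: root/ab/abcdef...)."""
    return root / sha256[:2] / sha256


def write_blob(root: Path, data: bytes) -> str:
    """Store bytes under their SHA-256; writing the same bytes twice is a no-op."""
    sha = hashlib.sha256(data).hexdigest()
    path = blob_path(root, sha)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(data)
        tmp.replace(path)  # rename into place so readers never see partial data
    return sha


def read_blob(root: Path, sha256: str) -> bytes:
    """Read a blob and verify that its content still matches its name."""
    data = blob_path(root, sha256).read_bytes()
    if hashlib.sha256(data).hexdigest() != sha256:
        raise ValueError(f"blob {sha256} is corrupt")
    return data


# Usage: write and read back inside a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    digest = write_blob(Path(d), b"example bytes")
    restored = read_blob(Path(d), digest)
```

One property of this layout is that integrity checking needs no extra metadata: the file name is the expected digest.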

Glossary

Term	Meaning
Object	Immutable bytes addressed by SHA-256.
Asset / version	The stored entity backed by an object, with version history.
Location	A mutable `bucket/key` reference to an asset.
Bucket	A queryable collection of locations. Not a Git repo.
Tag / alias / relation	Searchable label / alternate name / inter-object edge (e.g. `idb_for`).
Plugin / extractor / operation	Analysis package / its process / its query SQL.
KE Action	A rule mapping pushed paths → bucket/key/tags/processing.