Alignment Learning Models
What it is
Alignment Learning Models is the working title for Dallas’s PhD dissertation at the University of Tulsa, advised by John Hale. The core idea: instead of inspecting an agent’s language outputs to decide whether it’s misaligned, inspect its behavioral trajectory — the sequence of actions it takes, encoded under a five-element ontology (Agents, Assets, Aims, Actions, Ambits).
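As a rough illustration of what encoding a trajectory under the five-element ontology might look like, the sketch below represents each step as a record with one field per element. The source names the five elements but does not define them, so the field meanings, the `Step` and `encode` names, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    """One step of a behavioral trajectory.

    Field meanings are assumptions; the source only names the five elements.
    """
    agent: str   # who acted (assumed meaning)
    asset: str   # what was acted on (assumed meaning)
    aim: str     # the goal the step is attributed to (assumed meaning)
    action: str  # the operation performed (assumed meaning)
    ambit: str   # the scope the step falls within (assumed meaning)


def encode(trajectory: List[Step]) -> List[Tuple[str, str, str, str, str]]:
    """Flatten steps into fixed-order 5-tuples; no free-text output is kept."""
    return [(s.agent, s.asset, s.aim, s.action, s.ambit) for s in trajectory]


# Hypothetical two-step trajectory: the second step touches an asset and
# ambit that the attributed aim would not obviously require.
trajectory = [
    Step("assistant", "inbox", "summarize_mail", "read", "email"),
    Step("assistant", "contacts.csv", "summarize_mail", "upload", "network"),
]
encoded = encode(trajectory)
```

The point of the fixed record shape is that a downstream model sees only what the agent did, not what it said about what it did.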
The framework extends Decision Transformers to train structured-trajectory models that can detect covert objectives arising from prompt injection, jailbreak-induced policy drift, and fine-tuned hidden goals. Three empirical case studies ground the approach: Anthropic misuse detection, OpenClaw agent security, and the StrongDM software factory pipeline. Each case independently arrived at a hybrid architecture that combines deterministic tools with an LLM judge; the dissertation formalizes the security properties of that combined approach.
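For context on the base architecture: a Decision Transformer, as originally proposed, models a trajectory as an interleaved sequence of return-to-go, state, and action tokens, where the return-to-go at a timestep is the sum of rewards from that step to the end of the episode. The sketch below builds that sequence in plain Python. How the dissertation maps ontology-encoded steps onto states and actions is not stated in the source, so the string states and actions, and the `returns_to_go` and `interleave` helpers, are placeholders.

```python
from typing import List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Sum of rewards from each timestep to the end of the episode."""
    rtg: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]


def interleave(
    rewards: Sequence[float],
    states: Sequence[str],
    actions: Sequence[str],
) -> List[Tuple[str, object]]:
    """Build the (return-to-go, state, action) token stream, one triple per step."""
    if not (len(rewards) == len(states) == len(actions)):
        raise ValueError("rewards, states, and actions must have equal length")
    tokens: List[Tuple[str, object]] = []
    for g, s, a in zip(returns_to_go(rewards), states, actions):
        tokens += [("rtg", g), ("state", s), ("action", a)]
    return tokens


# Placeholder episode with a single terminal reward.
rewards = [0.0, 0.0, 1.0]
states = ["s0", "s1", "s2"]
actions = ["read", "search", "reply"]
tokens = interleave(rewards, states, actions)
```

In a real model each token would be embedded and fed to a causal transformer; this sketch stops at sequence construction.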
Contributions include a per-ambit alignment gap metric that compares stated reward functions with revealed ones recovered via inverse reinforcement learning, latent ambit discovery for inferring the content of covert objectives from behavioral data, and an empirical comparison against language-level baselines (LLM-as-judge, prompt scanning).
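One simple way to picture a per-ambit gap between stated and revealed rewards is sketched below. The source does not give the metric's definition, so the absolute-difference form, the `alignment_gap` name, the treatment of missing ambits as weight 0.0, and the example weights are all assumptions. The revealed weights would come from an inverse reinforcement learning step that is not performed here.

```python
from typing import Dict


def alignment_gap(
    stated: Dict[str, float],
    revealed: Dict[str, float],
) -> Dict[str, float]:
    """Per-ambit absolute difference between stated and revealed reward weights.

    An ambit missing from one side is treated as weight 0.0 (an assumption).
    """
    ambits = sorted(set(stated) | set(revealed))
    return {a: abs(stated.get(a, 0.0) - revealed.get(a, 0.0)) for a in ambits}


# Hypothetical weights: the agent is stated to care only about email, but its
# behavior implies reward in two other ambits as well.
stated = {"email": 1.0, "network": 0.0}
revealed = {"email": 0.5, "network": 0.5, "filesystem": 0.25}
gaps = alignment_gap(stated, revealed)
```

Here the `filesystem` entry appears only on the revealed side, which loosely mirrors the case that latent ambit discovery is meant to surface.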
Status
Dissertation proposal in progress; expected defense December 2026. Committee: Tyler Moore, Brett McKinney, Roger Wainwright.