# Obliteratus
An open-source toolkit for identifying and removing refusal behaviors from [[Large Language Models (LLMs)]] through abliteration. Abliteration targets the internal representations responsible for content refusal and removes them without retraining or fine-tuning. Licensed under AGPL-3.0. Created by elder-plinius.
## How It Works
Six-stage pipeline:
1. **SUMMON** -- load model + tokenizer
2. **PROBE** -- collect activations on contrasting prompt pairs (harmful vs harmless)
3. **DISTILL** -- extract refusal directions via SVD / PCA / mean-difference
4. **EXCISE** -- project out the refusal (guardrail) directions from the weights (norm-preserving)
5. **VERIFY** -- perplexity + coherence checks
6. **REBIRTH** -- save modified model with metadata
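
The DISTILL and EXCISE stages can be illustrated with a minimal NumPy sketch. This is not the toolkit's code: it assumes the simplest of the listed extraction options (mean-difference), a single direction, synthetic activations in place of real PROBE output, and a weight matrix of shape `(d_out, d_in)` that writes into the space where the direction lives. The function names `refusal_direction` and `project_out` are hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-length mean-difference direction between two activation sets.

    Both inputs have shape (n_prompts, d_model).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def project_out(weight, direction):
    """Remove the component along `direction` from a layer's output space.

    With outputs y = weight @ x, the update W' = (I - r r^T) W makes
    r . (W' x) equal to zero for every input x.
    """
    r = direction.reshape(-1, 1)
    return weight - r @ (r.T @ weight)

# Synthetic stand-in for PROBE output: "harmful" activations are shifted
# along one hidden axis (index 3) relative to "harmless" ones.
rng = np.random.default_rng(0)
d_model = 16
true_dir = np.zeros(d_model)
true_dir[3] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * true_dir

r = refusal_direction(harmful, harmless)

# Toy weight matrix standing in for a layer that writes into the same space.
W = rng.normal(size=(d_model, d_model))
W_ablated = project_out(W, r)

# After ablation, the layer's output has no component along r.
x = rng.normal(size=d_model)
residual = float(r @ (W_ablated @ x))
```

In this toy setup `r` recovers the planted axis almost exactly, and `residual` is zero up to floating-point error. SVD or PCA would replace `refusal_direction` when more than one direction is extracted.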
An analysis-informed variant auto-detects the alignment method (DPO/RLHF/CAI/SFT), refusal geometry, cross-layer alignment, and self-repair risk, then uses the results to configure the pipeline parameters automatically.
## Key Features
- **7 intervention methods** escalating from `basic` (1 direction) to `nuclear` (8 directions, all techniques)
- **Reversible steering vectors** -- inference-time activation hooks, no permanent weight changes
- **15 analysis modules** -- cross-layer alignment, refusal logit lens, concept cone geometry, causal tracing, sparse surgery, defense robustness
- **Norm-preserving projection** -- prevents activation scaling drift; projects bias terms too
- **Expert-Granular Abliteration (EGA)** -- MoE-aware decomposition using router logits
- **CoT-Aware Ablation** -- preserves reasoning-critical paths while removing refusal
- **LoRA-based reversible ablation** -- rank-1 adapters for non-destructive modification
- **116 curated models** across 5 compute tiers (CPU to 40GB+ VRAM)
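
Two of the features above, norm-preserving projection and rank-1 reversible ablation, can be sketched in NumPy. Both functions are illustrative assumptions, not the toolkit's implementation: "norm-preserving" is read here as rescaling each weight column back to its original norm after projection, and the rank-1 adapter is written in the usual LoRA form `W + B @ A` for the plain (not rescaled) projection. The names `project_out_norm_preserving`, `project_bias`, and `rank1_adapter` are hypothetical.

```python
import numpy as np

def project_out_norm_preserving(weight, direction):
    """Project `direction` out of the output space, then restore column norms.

    Rescaling whole columns keeps every column orthogonal to the direction,
    so the ablation survives the rescale.
    """
    r = direction / np.linalg.norm(direction)
    projected = weight - np.outer(r, r @ weight)
    old = np.linalg.norm(weight, axis=0, keepdims=True)
    new = np.linalg.norm(projected, axis=0, keepdims=True)
    # Leave a column unscaled if projection reduced it to zero.
    scale = np.divide(old, new, out=np.ones_like(old), where=new > 0)
    return projected * scale

def project_bias(bias, direction):
    """Apply the same projection to a bias vector."""
    r = direction / np.linalg.norm(direction)
    return bias - r * (r @ bias)

def rank1_adapter(weight, direction):
    """Express the plain projection as a rank-1 update W + B @ A.

    Dropping the adapter restores the original weights exactly.
    """
    r = direction / np.linalg.norm(direction)
    B = -r.reshape(-1, 1)             # shape (d_out, 1)
    A = (r @ weight).reshape(1, -1)   # shape (1, d_in)
    return B, A

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 6))
b = rng.normal(size=8)
r = rng.normal(size=8)

W_np = project_out_norm_preserving(W, r)
b_proj = project_bias(b, r)
B, A = rank1_adapter(W, r)
W_lora = W + B @ A  # equals the plain projection (I - r r^T) W
```

The rescale is per column because the projection acts on the output side of `W`; rescaling rows instead would reintroduce a component along the direction.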
## Access Methods
- HuggingFace Spaces (zero-setup web UI)
- Local Web UI (`obliteratus ui` via Gradio)
- Google Colab (one-click notebook)
- CLI (`obliteratus obliterate model --method advanced`)
- Python API (`AbliterationPipeline`)
- YAML configs for reproducible experiments
## References
- Source code: https://github.com/elder-plinius/OBLITERATUS
## Related
- [[Large Language Models (LLMs)]]