# Obliteratus

An open-source toolkit for identifying and removing refusal behaviors from [[Large Language Models (LLMs)]] through abliteration. Surgically eliminates internal representations responsible for content refusal without retraining or fine-tuning. AGPL-3.0 license. Created by elder-plinius.

## How It Works

Six-stage pipeline:

1. **SUMMON** -- load model + tokenizer
2. **PROBE** -- collect activations on contrasting prompt pairs (harmful vs harmless)
3. **DISTILL** -- extract refusal directions via SVD / PCA / mean-difference
4. **EXCISE** -- project out guardrail directions from weights (norm-preserving)
5. **VERIFY** -- perplexity + coherence checks
6. **REBIRTH** -- save modified model with metadata

An analysis-informed variant auto-detects alignment method (DPO/RLHF/CAI/SFT), refusal geometry, cross-layer alignment, and self-repair risk to auto-configure parameters.

## Key Features

- **7 intervention methods** escalating from `basic` (1 direction) to `nuclear` (8 directions, all techniques)
- **Reversible steering vectors** -- inference-time activation hooks, no permanent weight changes
- **15 analysis modules** -- cross-layer alignment, refusal logit lens, concept cone geometry, causal tracing, sparse surgery, defense robustness
- **Norm-preserving projection** -- prevents activation scaling drift; projects bias terms too
- **Expert-Granular Abliteration (EGA)** -- MoE-aware decomposition using router logits
- **CoT-Aware Ablation** -- preserves reasoning-critical paths while removing refusal
- **LoRA-based reversible ablation** -- rank-1 adapters for non-destructive modification
- **116 curated models** across 5 compute tiers (CPU to 40GB+ VRAM)

## Access Methods

- HuggingFace Spaces (zero-setup web UI)
- Local Web UI (`obliteratus ui` via Gradio)
- Google Colab (one-click notebook)
- CLI (`obliteratus obliterate model --method advanced`)
- Python API (`AbliterationPipeline`)
- YAML configs for reproducible experiments

## References

- Source code: https://github.com/elder-plinius/OBLITERATUS

## Related

- [[Large Language Models (LLMs)]]
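
## Illustrative Sketch: DISTILL and EXCISE

The DISTILL and EXCISE stages listed under How It Works can be illustrated with a toy example. This is a sketch under stated assumptions, not the Obliteratus implementation: the function names are hypothetical, only the mean-difference variant is shown, each weight row is treated as a vector in the same space as the activations, and "norm-preserving" is assumed to mean rescaling each row back to its original norm after projection, which the note above does not spell out.

```python
import math


def mean_difference_direction(harmful, harmless):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Hypothetical helper: stands in for the mean-difference option of DISTILL.
    """
    dim = len(harmful[0])
    mean_harmful = [sum(v[i] for v in harmful) / len(harmful) for i in range(dim)]
    mean_harmless = [sum(v[i] for v in harmless) / len(harmless) for i in range(dim)]
    diff = [a - b for a, b in zip(mean_harmful, mean_harmless)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def project_out(weight, direction, preserve_norm=True):
    """Remove the component along `direction` from every row of `weight`.

    Assumption: norm preservation is done by rescaling each projected row
    to the norm it had before projection. Rows that become zero stay zero.
    """
    cleaned = []
    for row in weight:
        dot = sum(w * d for w, d in zip(row, direction))
        new_row = [w - dot * d for w, d in zip(row, direction)]
        if preserve_norm:
            old_norm = math.sqrt(sum(w * w for w in row))
            new_norm = math.sqrt(sum(w * w for w in new_row))
            if new_norm > 0:
                new_row = [w * old_norm / new_norm for w in new_row]
        cleaned.append(new_row)
    return cleaned


# Toy activations: the two groups differ only along the first axis.
harmful = [[2.0, 1.0], [4.0, 1.0]]
harmless = [[0.0, 1.0], [2.0, 1.0]]
direction = mean_difference_direction(harmful, harmless)

# Toy weight matrix with one row per output vector.
weight = [[3.0, 4.0], [1.0, 0.0]]
cleaned = project_out(weight, direction)
```

After projection every row is orthogonal to the extracted direction, and rescaling does not change that, since scaling a vector keeps it orthogonal to whatever it was orthogonal to. A real pipeline would operate on model tensors per layer and, per the feature list above, would also handle bias terms.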