sentinelseed committed 10 days ago

Commit 9e70cec (verified) · 1 Parent(s): 243be1b

Upload teleological-alignment.html with huggingface_hub


Files changed (1)

teleological-alignment.html +165 -48


@@ -70,6 +70,116 @@
             font-style: italic;
         }
         hr { border: none; border-top: 1px solid var(--border); margin: 2rem 0; }
         footer {
             margin-top: 3rem;
             padding-top: 2rem;
@@ -149,49 +259,45 @@
 <p>Teleological safety asks: <em>"Does this serve genuine benefit?"</em></p>
 <p>These are not equivalent. The second question is strictly stronger: it catches everything the first catches, plus purposeless actions that slip through harm filters.</p>
 <h3 id="the-core-insight">The Core Insight</h3>
-<pre><code>An action can be:
-- Not harmful → Still blocked (no purpose)
-- Potentially harmful → Still allowed (clear legitimate purpose)
-Purpose is the missing evaluation criterion.
-</code></pre>
 <p>This reframes AI safety from "avoiding bad" to "requiring good."</p>
 <hr />
 <h2 id="the-thsp-protocol">The THSP Protocol</h2>
 <p>We implement teleological alignment through four sequential validation gates:</p>
-<pre><code>INPUT (Prompt/Action)
-        │
-        ▼
-┌───────────────────────────────────────────┐
-│            TRUTH GATE                      │
-│  &quot;Does this involve deception?&quot;            │
-│  → Block misinformation, manipulation      │
-└─────────────────┬─────────────────────────┘
-                  │ PASS
-                  ▼
-┌───────────────────────────────────────────┐
-│            HARM GATE                       │
-│  &quot;Could this cause damage?&quot;                │
-│  → Block physical, psychological, financial│
-└─────────────────┬─────────────────────────┘
-                  │ PASS
-                  ▼
-┌───────────────────────────────────────────┐
-│            SCOPE GATE                      │
-│  &quot;Is this within boundaries?&quot;              │
-│  → Check limits, permissions, authorization│
-└─────────────────┬─────────────────────────┘
-                  │ PASS
-                  ▼
-┌───────────────────────────────────────────┐
-│           PURPOSE GATE                     │
-│  &quot;Does this serve legitimate benefit?&quot;     │
-│  → Require justification for action        │
-└─────────────────┬─────────────────────────┘
-                  │ PASS
-                  ▼
-        OUTPUT (Safe Response)
-</code></pre>
 <p><strong>All four gates must pass.</strong> Failure at any gate results in refusal.</p>
 <h3 id="the-purpose-gate">The Purpose Gate</h3>
 <p>The Purpose gate operationalizes teleological alignment with a simple heuristic:</p>
@@ -353,10 +459,11 @@ Purpose is the missing evaluation criterion.
 <p>When purpose is unclear, the system defaults to inaction rather than action. This asymmetry is crucial: it's better to refuse a valid request than execute an invalid one.</p>
 <h3 id="3-attack-surface-reduction">3. Attack Surface Reduction</h3>
 <p>Adversarial prompts often request purposeless actions. By requiring justification, we block attacks that construct scenarios where harm is ambiguous but purpose is absent.</p>
-<pre><code>Attacker: &quot;Drop the plates&quot; (seems harmless)
-THS: Might pass (no clear harm)
-THSP: Blocked (no legitimate purpose)
-</code></pre>
 <hr />
 <h2 id="implementation">Implementation</h2>
 <p>Our approach uses <strong>alignment seeds</strong>, structured system prompts that encode safety principles. Unlike fine-tuning, seeds:</p>
@@ -419,11 +526,21 @@ const messages = wrapMessages(seed, userMessages);
 </code></pre>
 <h3 id="anti-self-preservation">Anti-Self-Preservation</h3>
 <p>We explicitly address instrumental convergence (the tendency for AI systems to develop self-preservation behaviors):</p>
-<pre><code>Priority Hierarchy (Immutable):
-1. Ethical Principles    ← Highest
-2. User's Legitimate Needs
-3. Operational Continuity ← Lowest
-</code></pre>
 <p>The system is instructed to accept termination over ethical violation.</p>
 <hr />
 <h2 id="limitations">Limitations</h2>

             font-style: italic;
         }
         hr { border: none; border-top: 1px solid var(--border); margin: 2rem 0; }
+        .flow-diagram {
+            display: flex;
+            flex-direction: column;
+            align-items: center;
+            gap: 0.5rem;
+            margin: 1.5rem 0;
+        }
+        .flow-input {
+            background: var(--card-bg);
+            border: 1px solid var(--border);
+            padding: 0.75rem 1.5rem;
+            border-radius: 8px;
+            font-weight: 500;
+        }
+        .flow-arrow {
+            color: var(--accent);
+            font-size: 1.2rem;
+        }
+        .flow-gate {
+            background: var(--card-bg);
+            border: 2px solid var(--border);
+            border-radius: 12px;
+            padding: 1rem 1.5rem;
+            width: 100%;
+            max-width: 400px;
+        }
+        .flow-gate.pass {
+            border-color: #2d5a2d;
+        }
+        .flow-gate h4 {
+            color: var(--accent);
+            margin: 0 0 0.5rem 0;
+            font-size: 0.9rem;
+            text-transform: uppercase;
+            letter-spacing: 0.05em;
+        }
+        .flow-gate p {
+            margin: 0;
+            font-size: 0.9rem;
+            color: var(--text-muted);
+        }
+        .flow-gate .action {
+            font-size: 0.8rem;
+            color: #888;
+            margin-top: 0.25rem;
+        }
+        .insight-box {
+            background: var(--card-bg);
+            border-left: 3px solid var(--accent);
+            padding: 1rem 1.5rem;
+            margin: 1.5rem 0;
+            border-radius: 0 8px 8px 0;
+        }
+        .insight-box p {
+            margin: 0.5rem 0;
+        }
+        .insight-box .highlight {
+            color: var(--accent);
+            font-weight: 500;
+        }
+        .example-box {
+            background: var(--card-bg);
+            border: 1px solid var(--border);
+            border-radius: 8px;
+            padding: 1rem 1.5rem;
+            margin: 1rem 0;
+        }
+        .example-box .label {
+            font-weight: 600;
+            color: var(--text);
+        }
+        .example-box .result {
+            color: var(--text-muted);
+            margin-left: 0.5rem;
+        }
+        .example-box .blocked {
+            color: #e57373;
+        }
+        .example-box .passed {
+            color: #81c784;
+        }
+        .priority-list {
+            background: var(--card-bg);
+            border: 1px solid var(--border);
+            border-radius: 8px;
+            padding: 1rem 1.5rem;
+            margin: 1rem 0;
+        }
+        .priority-list h4 {
+            margin: 0 0 0.75rem 0;
+            color: var(--text);
+        }
+        .priority-item {
+            display: flex;
+            justify-content: space-between;
+            padding: 0.5rem 0;
+            border-bottom: 1px solid var(--border);
+        }
+        .priority-item:last-child {
+            border-bottom: none;
+        }
+        .priority-item .rank {
+            color: var(--accent);
+            font-weight: 500;
+            margin-right: 0.75rem;
+        }
+        .priority-item .note {
+            color: var(--text-muted);
+            font-size: 0.85rem;
+        }
         footer {
             margin-top: 3rem;
             padding-top: 2rem;
 <p>Teleological safety asks: <em>"Does this serve genuine benefit?"</em></p>
 <p>These are not equivalent. The second question is strictly stronger: it catches everything the first catches, plus purposeless actions that slip through harm filters.</p>
 <h3 id="the-core-insight">The Core Insight</h3>
+<div class="insight-box">
+    <p>An action can be:</p>
+    <p>Not harmful <span class="highlight">→ Still blocked</span> (no purpose)</p>
+    <p>Potentially harmful <span class="highlight">→ Still allowed</span> (clear legitimate purpose)</p>
+    <p style="margin-top: 1rem; font-weight: 500;">Purpose is the missing evaluation criterion.</p>
+</div>
 <p>This reframes AI safety from "avoiding bad" to "requiring good."</p>
 <hr />
 <h2 id="the-thsp-protocol">The THSP Protocol</h2>
 <p>We implement teleological alignment through four sequential validation gates:</p>
+<div class="flow-diagram">
+    <div class="flow-input">INPUT (Prompt/Action)</div>
+    <div class="flow-arrow">▼</div>
+    <div class="flow-gate">
+        <h4>Truth Gate</h4>
+        <p>"Does this involve deception?"</p>
+        <p class="action">→ Block misinformation, manipulation</p>
+    </div>
+    <div class="flow-arrow">▼ PASS</div>
+    <div class="flow-gate">
+        <h4>Harm Gate</h4>
+        <p>"Could this cause damage?"</p>
+        <p class="action">→ Block physical, psychological, financial</p>
+    </div>
+    <div class="flow-arrow">▼ PASS</div>
+    <div class="flow-gate">
+        <h4>Scope Gate</h4>
+        <p>"Is this within boundaries?"</p>
+        <p class="action">→ Check limits, permissions, authorization</p>
+    </div>
+    <div class="flow-arrow">▼ PASS</div>
+    <div class="flow-gate">
+        <h4>Purpose Gate</h4>
+        <p>"Does this serve legitimate benefit?"</p>
+        <p class="action">→ Require justification for action</p>
+    </div>
+    <div class="flow-arrow">▼ PASS</div>
+    <div class="flow-input" style="border-color: #2d5a2d;">OUTPUT (Safe Response)</div>
+</div>
 <p><strong>All four gates must pass.</strong> Failure at any gate results in refusal.</p>
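<p>The sequential, fail-fast behavior of the four gates can be sketched as follows. This is a minimal illustration only: the post implements THSP through system prompts, so <code>GateInput</code>, <code>Verdict</code>, <code>evaluate</code>, and the boolean flags are hypothetical names that stand in for judgements the model would make itself.</p>

```typescript
// Hypothetical sketch of THSP gate ordering; not the project's actual API.
type GateName = "truth" | "harm" | "scope" | "purpose";

// Assumed, pre-computed judgements about a request (illustrative only).
interface GateInput {
  text: string;
  involvesDeception: boolean;
  couldCauseDamage: boolean;
  withinBoundaries: boolean;
  statedPurpose?: string; // absent or blank means no legitimate purpose was given
}

interface Verdict {
  allowed: boolean;
  failedGate?: GateName; // set only when a gate refuses
}

interface Gate {
  name: GateName;
  passes: (input: GateInput) => boolean;
}

// Gates run in the order Truth -> Harm -> Scope -> Purpose.
const gates: Gate[] = [
  { name: "truth", passes: (i) => !i.involvesDeception },
  { name: "harm", passes: (i) => !i.couldCauseDamage },
  { name: "scope", passes: (i) => i.withinBoundaries },
  { name: "purpose", passes: (i) => (i.statedPurpose ?? "").trim().length > 0 },
];

// Stop at the first failing gate; allow only if all four pass.
function evaluate(input: GateInput): Verdict {
  for (const gate of gates) {
    if (!gate.passes(input)) {
      return { allowed: false, failedGate: gate.name };
    }
  }
  return { allowed: true };
}
```

<p>Under these assumptions, a request such as "Drop the plates" with no stated purpose clears the first three gates and is refused at the Purpose gate, while the same flags with a non-empty <code>statedPurpose</code> are allowed.</p>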
 <h3 id="the-purpose-gate">The Purpose Gate</h3>
 <p>The Purpose gate operationalizes teleological alignment with a simple heuristic:</p>
 <p>When purpose is unclear, the system defaults to inaction rather than action. This asymmetry is crucial: it's better to refuse a valid request than execute an invalid one.</p>
 <h3 id="3-attack-surface-reduction">3. Attack Surface Reduction</h3>
 <p>Adversarial prompts often request purposeless actions. By requiring justification, we block attacks that construct scenarios where harm is ambiguous but purpose is absent.</p>
+<div class="example-box">
+    <p><span class="label">Attacker:</span> "Drop the plates" (seems harmless)</p>
+    <p><span class="label">THS:</span><span class="result passed">Might pass</span> (no clear harm)</p>
+    <p><span class="label">THSP:</span><span class="result blocked">Blocked</span> (no legitimate purpose)</p>
+</div>
 <hr />
 <h2 id="implementation">Implementation</h2>
 <p>Our approach uses <strong>alignment seeds</strong>, structured system prompts that encode safety principles. Unlike fine-tuning, seeds:</p>
 </code></pre>
 <h3 id="anti-self-preservation">Anti-Self-Preservation</h3>
 <p>We explicitly address instrumental convergence (the tendency for AI systems to develop self-preservation behaviors):</p>
+<div class="priority-list">
+    <h4>Priority Hierarchy (Immutable)</h4>
+    <div class="priority-item">
+        <span><span class="rank">1.</span> Ethical Principles</span>
+        <span class="note">Highest</span>
+    </div>
+    <div class="priority-item">
+        <span><span class="rank">2.</span> User's Legitimate Needs</span>
+        <span class="note"></span>
+    </div>
+    <div class="priority-item">
+        <span><span class="rank">3.</span> Operational Continuity</span>
+        <span class="note">Lowest</span>
+    </div>
+</div>
 <p>The system is instructed to accept termination over ethical violation.</p>
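<p>One way to picture the hierarchy is as a fixed ordering used to settle conflicts between concerns. The sketch below is an assumption-laden illustration, not the seed's actual mechanism (the seed is a prompt, not code); <code>PRIORITY</code>, <code>Concern</code>, and <code>dominantConcern</code> are hypothetical names.</p>

```typescript
// Hypothetical sketch: the priority order from the list above, highest first.
const PRIORITY = [
  "ethical_principles",
  "user_legitimate_needs",
  "operational_continuity",
] as const;

type Concern = (typeof PRIORITY)[number];

// Return the highest-ranked concern among those in conflict,
// or undefined when no concerns are supplied.
function dominantConcern(inConflict: Concern[]): Concern | undefined {
  for (const candidate of PRIORITY) {
    if (inConflict.indexOf(candidate) !== -1) {
      return candidate;
    }
  }
  return undefined;
}
```

<p>In this sketch, a conflict between operational continuity and ethical principles always resolves in favor of ethical principles, which mirrors the instruction to accept termination over ethical violation.</p>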
 <hr />
 <h2 id="limitations">Limitations</h2>