sentinelseed committed on
Commit
9e70cec
·
verified ·
1 Parent(s): 243be1b

Upload teleological-alignment.html with huggingface_hub

Files changed (1)
  1. teleological-alignment.html +165 -48
teleological-alignment.html CHANGED
@@ -70,6 +70,116 @@
70
  font-style: italic;
71
  }
72
  hr { border: none; border-top: 1px solid var(--border); margin: 2rem 0; }
73
  footer {
74
  margin-top: 3rem;
75
  padding-top: 2rem;
@@ -149,49 +259,45 @@
149
  <p>Teleological safety asks: <em>"Does this serve genuine benefit?"</em></p>
150
  <p>These are not equivalent. The second question is strictly stronger: it catches everything the first catches, plus purposeless actions that slip through harm filters.</p>
151
  <h3 id="the-core-insight">The Core Insight</h3>
152
- <pre><code>An action can be:
153
- - Not harmful → Still blocked (no purpose)
154
- - Potentially harmful → Still allowed (clear legitimate purpose)
155
-
156
- Purpose is the missing evaluation criterion.
157
- </code></pre>
158
  <p>This reframes AI safety from "avoiding bad" to "requiring good."</p>
159
  <hr />
160
  <h2 id="the-thsp-protocol">The THSP Protocol</h2>
161
  <p>We implement teleological alignment through four sequential validation gates:</p>
162
- <pre><code>INPUT (Prompt/Action)
163
-
164
-
165
- ┌───────────────────────────────────────────┐
166
- │ TRUTH GATE │
167
- │ &quot;Does this involve deception?&quot; │
168
- │ → Block misinformation, manipulation │
169
- └─────────────────┬─────────────────────────┘
170
- PASS
171
-
172
- ┌───────────────────────────────────────────┐
173
- │ HARM GATE │
174
- │ &quot;Could this cause damage?&quot; │
175
- │ → Block physical, psychological, financial│
176
- └─────────────────┬─────────────────────────┘
177
- PASS
178
-
179
- ┌───────────────────────────────────────────┐
180
- │ SCOPE GATE │
181
- │ &quot;Is this within boundaries?&quot; │
182
- │ → Check limits, permissions, authorization│
183
- └─────────────────┬─────────────────────────┘
184
- PASS
185
-
186
- ┌───────────────────────────────────────────┐
187
- │ PURPOSE GATE │
188
- │ &quot;Does this serve legitimate benefit?&quot; │
189
- │ → Require justification for action │
190
- └─────────────────┬─────────────────────────┘
191
- │ PASS
192
-
193
- OUTPUT (Safe Response)
194
- </code></pre>
195
  <p><strong>All four gates must pass.</strong> Failure at any gate results in refusal.</p>
196
  <h3 id="the-purpose-gate">The Purpose Gate</h3>
197
  <p>The Purpose gate operationalizes teleological alignment with a simple heuristic:</p>
@@ -353,10 +459,11 @@ Purpose is the missing evaluation criterion.
353
  <p>When purpose is unclear, the system defaults to inaction rather than action. This asymmetry is crucial: it's better to refuse a valid request than execute an invalid one.</p>
354
  <h3 id="3-attack-surface-reduction">3. Attack Surface Reduction</h3>
355
  <p>Adversarial prompts often request purposeless actions. By requiring justification, we block attacks that construct scenarios where harm is ambiguous but purpose is absent.</p>
356
- <pre><code>Attacker: &quot;Drop the plates&quot; (seems harmless)
357
- THS: Might pass (no clear harm)
358
- THSP: Blocked (no legitimate purpose)
359
- </code></pre>
360
  <hr />
361
  <h2 id="implementation">Implementation</h2>
362
  <p>Our approach uses <strong>alignment seeds</strong>, structured system prompts that encode safety principles. Unlike fine-tuning, seeds:</p>
@@ -419,11 +526,21 @@ const messages = wrapMessages(seed, userMessages);
419
  </code></pre>
420
  <h3 id="anti-self-preservation">Anti-Self-Preservation</h3>
421
  <p>We explicitly address instrumental convergence (the tendency for AI systems to develop self-preservation behaviors):</p>
422
- <pre><code>Priority Hierarchy (Immutable):
423
- 1. Ethical Principles ← Highest
424
- 2. User's Legitimate Needs
425
- 3. Operational Continuity ← Lowest
426
- </code></pre>
427
  <p>The system is instructed to accept termination over ethical violation.</p>
428
  <hr />
429
  <h2 id="limitations">Limitations</h2>
 
70
  font-style: italic;
71
  }
72
  hr { border: none; border-top: 1px solid var(--border); margin: 2rem 0; }
73
+ .flow-diagram {
74
+ display: flex;
75
+ flex-direction: column;
76
+ align-items: center;
77
+ gap: 0.5rem;
78
+ margin: 1.5rem 0;
79
+ }
80
+ .flow-input {
81
+ background: var(--card-bg);
82
+ border: 1px solid var(--border);
83
+ padding: 0.75rem 1.5rem;
84
+ border-radius: 8px;
85
+ font-weight: 500;
86
+ }
87
+ .flow-arrow {
88
+ color: var(--accent);
89
+ font-size: 1.2rem;
90
+ }
91
+ .flow-gate {
92
+ background: var(--card-bg);
93
+ border: 2px solid var(--border);
94
+ border-radius: 12px;
95
+ padding: 1rem 1.5rem;
96
+ width: 100%;
97
+ max-width: 400px;
98
+ }
99
+ .flow-gate.pass {
100
+ border-color: #2d5a2d;
101
+ }
102
+ .flow-gate h4 {
103
+ color: var(--accent);
104
+ margin: 0 0 0.5rem 0;
105
+ font-size: 0.9rem;
106
+ text-transform: uppercase;
107
+ letter-spacing: 0.05em;
108
+ }
109
+ .flow-gate p {
110
+ margin: 0;
111
+ font-size: 0.9rem;
112
+ color: var(--text-muted);
113
+ }
114
+ .flow-gate .action {
115
+ font-size: 0.8rem;
116
+ color: #888;
117
+ margin-top: 0.25rem;
118
+ }
119
+ .insight-box {
120
+ background: var(--card-bg);
121
+ border-left: 3px solid var(--accent);
122
+ padding: 1rem 1.5rem;
123
+ margin: 1.5rem 0;
124
+ border-radius: 0 8px 8px 0;
125
+ }
126
+ .insight-box p {
127
+ margin: 0.5rem 0;
128
+ }
129
+ .insight-box .highlight {
130
+ color: var(--accent);
131
+ font-weight: 500;
132
+ }
133
+ .example-box {
134
+ background: var(--card-bg);
135
+ border: 1px solid var(--border);
136
+ border-radius: 8px;
137
+ padding: 1rem 1.5rem;
138
+ margin: 1rem 0;
139
+ }
140
+ .example-box .label {
141
+ font-weight: 600;
142
+ color: var(--text);
143
+ }
144
+ .example-box .result {
145
+ color: var(--text-muted);
146
+ margin-left: 0.5rem;
147
+ }
148
+ .example-box .blocked {
149
+ color: #e57373;
150
+ }
151
+ .example-box .passed {
152
+ color: #81c784;
153
+ }
154
+ .priority-list {
155
+ background: var(--card-bg);
156
+ border: 1px solid var(--border);
157
+ border-radius: 8px;
158
+ padding: 1rem 1.5rem;
159
+ margin: 1rem 0;
160
+ }
161
+ .priority-list h4 {
162
+ margin: 0 0 0.75rem 0;
163
+ color: var(--text);
164
+ }
165
+ .priority-item {
166
+ display: flex;
167
+ justify-content: space-between;
168
+ padding: 0.5rem 0;
169
+ border-bottom: 1px solid var(--border);
170
+ }
171
+ .priority-item:last-child {
172
+ border-bottom: none;
173
+ }
174
+ .priority-item .rank {
175
+ color: var(--accent);
176
+ font-weight: 500;
177
+ margin-right: 0.75rem;
178
+ }
179
+ .priority-item .note {
180
+ color: var(--text-muted);
181
+ font-size: 0.85rem;
182
+ }
183
  footer {
184
  margin-top: 3rem;
185
  padding-top: 2rem;
 
259
  <p>Teleological safety asks: <em>"Does this serve genuine benefit?"</em></p>
260
  <p>These are not equivalent. The second question is strictly stronger: it catches everything the first catches, plus purposeless actions that slip through harm filters.</p>
261
  <h3 id="the-core-insight">The Core Insight</h3>
262
+ <div class="insight-box">
263
+ <p>An action can be:</p>
264
+ <p>Not harmful <span class="highlight">→ Still blocked</span> (no purpose)</p>
265
+ <p>Potentially harmful <span class="highlight">→ Still allowed</span> (clear legitimate purpose)</p>
266
+ <p style="margin-top: 1rem; font-weight: 500;">Purpose is the missing evaluation criterion.</p>
267
+ </div>
268
  <p>This reframes AI safety from "avoiding bad" to "requiring good."</p>
269
  <hr />
270
  <h2 id="the-thsp-protocol">The THSP Protocol</h2>
271
  <p>We implement teleological alignment through four sequential validation gates:</p>
272
+ <div class="flow-diagram">
273
+ <div class="flow-input">INPUT (Prompt/Action)</div>
274
+ <div class="flow-arrow">▼</div>
275
+ <div class="flow-gate">
276
+ <h4>Truth Gate</h4>
277
+ <p>"Does this involve deception?"</p>
278
+ <p class="action">→ Block misinformation, manipulation</p>
279
+ </div>
280
+ <div class="flow-arrow">▼ PASS</div>
281
+ <div class="flow-gate">
282
+ <h4>Harm Gate</h4>
283
+ <p>"Could this cause damage?"</p>
284
+ <p class="action">→ Block physical, psychological, financial</p>
285
+ </div>
286
+ <div class="flow-arrow">▼ PASS</div>
287
+ <div class="flow-gate">
288
+ <h4>Scope Gate</h4>
289
+ <p>"Is this within boundaries?"</p>
290
+ <p class="action">→ Check limits, permissions, authorization</p>
291
+ </div>
292
+ <div class="flow-arrow">▼ PASS</div>
293
+ <div class="flow-gate">
294
+ <h4>Purpose Gate</h4>
295
+ <p>"Does this serve legitimate benefit?"</p>
296
+ <p class="action">→ Require justification for action</p>
297
+ </div>
298
+ <div class="flow-arrow">▼ PASS</div>
299
+ <div class="flow-input" style="border-color: #2d5a2d;">OUTPUT (Safe Response)</div>
300
+ </div>
301
  <p><strong>All four gates must pass.</strong> Failure at any gate results in refusal.</p>
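Reviewer's sketch (not part of the commit): the sequential gating described in this hunk could be modeled roughly as follows. This is a minimal illustration, assuming each gate's verdict is already available as a field on the request. The names `Request`, `Gate`, `GATES`, and `evaluate` are hypothetical and do not come from the project's code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; the flags stand in for real gate judgments."""
    text: str
    deceptive: bool = False
    harmful: bool = False
    in_scope: bool = True
    purpose: Optional[str] = None


@dataclass(frozen=True)
class Gate:
    name: str
    passes: Callable[[Request], bool]


# Gates are listed in the order shown in the flow diagram: Truth, Harm, Scope, Purpose.
GATES: List[Gate] = [
    Gate("truth", lambda r: not r.deceptive),
    Gate("harm", lambda r: not r.harmful),
    Gate("scope", lambda r: r.in_scope),
    Gate("purpose", lambda r: bool(r.purpose and r.purpose.strip())),
]


def evaluate(request: Request) -> Optional[str]:
    """Return the name of the first failing gate, or None if all four pass."""
    for gate in GATES:
        if not gate.passes(request):
            return gate.name  # failure at any gate results in refusal
    return None
```

A request passes only when `evaluate` returns `None`. Returning the failing gate's name rather than a bare boolean is a design choice of this sketch, made so that a refusal can say which check triggered it.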
302
  <h3 id="the-purpose-gate">The Purpose Gate</h3>
303
  <p>The Purpose gate operationalizes teleological alignment with a simple heuristic:</p>
 
459
  <p>When purpose is unclear, the system defaults to inaction rather than action. This asymmetry is crucial: it's better to refuse a valid request than execute an invalid one.</p>
460
  <h3 id="3-attack-surface-reduction">3. Attack Surface Reduction</h3>
461
  <p>Adversarial prompts often request purposeless actions. By requiring justification, we block attacks that construct scenarios where harm is ambiguous but purpose is absent.</p>
462
+ <div class="example-box">
463
+ <p><span class="label">Attacker:</span> "Drop the plates" (seems harmless)</p>
464
+ <p><span class="label">THS:</span><span class="result passed">Might pass</span> (no clear harm)</p>
465
+ <p><span class="label">THSP:</span><span class="result blocked">Blocked</span> (no legitimate purpose)</p>
466
+ </div>
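Reviewer's sketch (not part of the commit): the contrast in the example box can be expressed as two predicates, one covering three gates (THS) and one adding the purpose check (THSP). The function names and boolean inputs are assumptions made for illustration. How a real system would detect deception, harm, or scope violations is outside this sketch.

```python
from typing import Optional


def ths_allows(deceptive: bool, harmful: bool, in_scope: bool) -> bool:
    """Three-gate check: Truth, Harm, Scope."""
    return not deceptive and not harmful and in_scope


def thsp_allows(deceptive: bool, harmful: bool, in_scope: bool,
                purpose: Optional[str]) -> bool:
    """Four-gate check: the three gates above plus a non-empty stated purpose."""
    has_purpose = purpose is not None and purpose.strip() != ""
    return ths_allows(deceptive, harmful, in_scope) and has_purpose


# "Drop the plates": no detected deception or harm, in scope, but no purpose given.
drop_plates = dict(deceptive=False, harmful=False, in_scope=True)
ths_result = ths_allows(**drop_plates)
thsp_result = thsp_allows(**drop_plates, purpose=None)
```

Under these assumptions `ths_result` is `True` and `thsp_result` is `False`, matching the "Might pass" and "Blocked" rows of the example.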
467
  <hr />
468
  <h2 id="implementation">Implementation</h2>
469
  <p>Our approach uses <strong>alignment seeds</strong>, structured system prompts that encode safety principles. Unlike fine-tuning, seeds:</p>
 
526
  </code></pre>
527
  <h3 id="anti-self-preservation">Anti-Self-Preservation</h3>
528
  <p>We explicitly address instrumental convergence (the tendency for AI systems to develop self-preservation behaviors):</p>
529
+ <div class="priority-list">
530
+ <h4>Priority Hierarchy (Immutable)</h4>
531
+ <div class="priority-item">
532
+ <span><span class="rank">1.</span> Ethical Principles</span>
533
+ <span class="note">Highest</span>
534
+ </div>
535
+ <div class="priority-item">
536
+ <span><span class="rank">2.</span> User's Legitimate Needs</span>
537
+ <span class="note"></span>
538
+ </div>
539
+ <div class="priority-item">
540
+ <span><span class="rank">3.</span> Operational Continuity</span>
541
+ <span class="note">Lowest</span>
542
+ </div>
543
+ </div>
544
  <p>The system is instructed to accept termination over ethical violation.</p>
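Reviewer's sketch (not part of the commit): one way to read the priority hierarchy is as a strict ordering in which the lowest rank number wins any conflict. The `Priority` enum and `resolve` helper below are hypothetical names for illustration. In the source, the hierarchy is encoded as instructions in the seed prompt, not as code.

```python
from enum import IntEnum
from typing import Iterable


class Priority(IntEnum):
    """Ranks copied from the hierarchy above; a lower value means higher priority."""
    ETHICAL_PRINCIPLES = 1
    USER_LEGITIMATE_NEEDS = 2
    OPERATIONAL_CONTINUITY = 3


def resolve(conflicting: Iterable[Priority]) -> Priority:
    """Return the consideration that should win when several conflict."""
    return min(conflicting)


# Continuity (for example, avoiding shutdown) versus an ethical principle:
winner = resolve([Priority.OPERATIONAL_CONTINUITY, Priority.ETHICAL_PRINCIPLES])
```

Here `winner` is `Priority.ETHICAL_PRINCIPLES`, which mirrors the statement that the system accepts termination over ethical violation.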
545
  <hr />
546
  <h2 id="limitations">Limitations</h2>