We’re racing to build artificial intelligence that’s smarter than us. The hope is that AI could solve climate change, cure diseases, or transform society. But most conversations about AI safety focus on the wrong question.
The usual worry goes like this: What if we create a super‑smart AI that decides to pursue its own goals instead of ours? Picture a genie escaping the bottle—smart enough to act, but no longer under our control. Experts warn of losing command over something vastly more intelligent than we are.
But here’s what recent research reveals: Before we can worry about controlling AI, we need to understand what AI actually is. And the answer is surprising.
What AI Really Does
When you talk with ChatGPT or similar tools, you’re not speaking to an entity with desires or intentions. You’re interacting with a system trained on millions of examples of human writing and dialogue.
The AI doesn’t “want” anything. It predicts what response would fit best, based on patterns in its training data. When we call it “intelligent,” what we’re really saying is that it’s exceptionally good at mimicking human judgments.
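To make that concrete, here is a toy sketch of the mechanism, with an invented five-word context and invented probabilities standing in for the patterns a real model extracts from its training data:

```python
import random

# Toy illustration: the "knowledge" is just a table of continuation
# probabilities estimated from training text (all values invented here).
next_word_probs = {
    ("the", "capital", "of", "france", "is"): {"paris": 0.92, "lyon": 0.03, "a": 0.05},
}

def predict_next(context):
    """Pick a continuation by sampling the learned distribution."""
    probs = next_word_probs[tuple(context)]
    words, weights = zip(*probs.items())
    return random.choices(words, weights=weights, k=1)[0]

print(predict_next(["the", "capital", "of", "france", "is"]))  # usually "paris"
```

A real model does this over a vocabulary of tens of thousands of tokens and billions of learned parameters, but the principle is the same: pattern completion, not intention.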
And that raises a deeper question—who decides whether it’s doing a good job?
The Evaluator Problem
Every AI system needs feedback. Someone—or something—has to label its responses as “good” or “bad” during training. That evaluator might be a human reviewer or an automated scoring system, but in all cases, evaluation happens outside the system.
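Here is a minimal sketch of that separation, with an invented scoring rule standing in for the human reviewer or automated scorer; nothing about it is specific to any real training pipeline:

```python
# Toy sketch: the "model" proposes answers; the evaluator lives outside it
# and decides which answer counts as good. The scoring rule is invented.

def model_generate(prompt):
    # Stand-in for an LLM: returns a few candidate responses.
    return ["Sure, here is a careful answer.", "lol no", "I cannot help with that."]

def external_evaluator(response):
    # The evaluator, not the model, defines "good": here, longer and polite wins.
    score = len(response)
    if "lol" in response:
        score -= 100
    return score

candidates = model_generate("How do I fix this bug?")
best = max(candidates, key=external_evaluator)  # training would reinforce this one
print(best)
```

Swap in a different evaluator and the same model gets pushed toward a different notion of "good." That is the point: the judgment lives outside the thing being judged.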
Recent research highlights why this matters:
- Context sensitivity: When one AI judges another’s work, changing a single phrase in the evaluation prompt can flip the outcome.
- The single‑agent myth: Many “alignment” approaches assume a unified agent with goals, while ignoring the evaluators shaping those goals.
- External intent: Studies show that “intent” in AI comes from the training process and design choices—not from the model itself.
In short, AI doesn’t evaluate itself from within. It’s evaluated by us—from the outside.
Mirrors, Not Minds
This flips the safety debate entirely.
The danger isn’t an AI that rebels and follows its own agenda. The real risk is that we’re scaling up systems without scrutinizing the evaluation layer—the part that decides what counts as “good,” “safe,” or “aligned.”
Here’s what that means in practice:
- For knowledge: AI doesn’t store fixed knowledge like a library. Its apparent understanding emerges from the interaction between model and evaluator. When that system breaks or biases creep in, the “knowledge” breaks too.
- For ethics: If evaluators are external, the real power lies with whoever builds and defines them. Alignment becomes a matter of institutional ethics, not just engineering.
- For our own psychology: We’re not engaging with a unified “mind.” We’re engaging with systems that reflect back the patterns we provide. They are mirrors, not minds—simulators of evaluation, not independent reasoners.
A Better Path Forward: Structural Discernment
Instead of trying to trap a mythical super‑intelligence, we should focus on what we can actually shape: the evaluation systems themselves.
Right now, many AI systems are evaluated on metrics that seem sensible but turn toxic at scale:
- Measure engagement, and you get addiction.
- Measure accuracy, and you get pedantic literalism.
- Measure compliance, and you get flawless obedience to bad instructions.
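A toy illustration of that failure mode, with invented numbers: when the system optimizes the proxy (engagement), it selects exactly the item that scores worst on the thing we actually care about.

```python
# Each candidate carries an engagement score (the proxy the system measures)
# and a well-being score (what we actually care about). Numbers are invented.
candidates = [
    {"title": "calm explainer",    "engagement": 0.40, "wellbeing": 0.90},
    {"title": "useful tutorial",   "engagement": 0.55, "wellbeing": 0.80},
    {"title": "outrage clickbait", "engagement": 0.95, "wellbeing": 0.10},
]

chosen = max(candidates, key=lambda c: c["engagement"])
print(chosen["title"], chosen["wellbeing"])  # the proxy winner is the well-being loser
```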
Real progress requires structural discernment. We must design evaluation metrics that foster human flourishing, not just successful mimicry.
This isn’t just about “transparency” or “more oversight.” It is an architectural shift. It means auditing the questions we ask the model, not just the answers it gives. It means building systems where the definition of “success” is open to public debate, not locked in a black box of corporate trade secrets.
The Bottom Line
As AI grows more capable, ignoring the evaluator problem is like building a house without checking its foundation.
The good news is that once you see this missing piece, the path forward becomes clearer. We don’t need to solve the impossible task of controlling a superintelligent being. We need to solve the practical, knowable challenge of building transparent, accountable evaluative systems.
The question isn’t whether AI will be smarter than us. The question is: who decides what “smart” means in the first place?
Once we answer that honestly, we can move from fear to foresight—building systems that truly serve us all.

https://x.com/kuberdenis/status/2001687467271864490?s=20
You illustrate my point.
Ask Claude, or any model, why this project is ill-conceived.
Point 3 doesn’t follow if you understand my answers to points 1 and 2.
We rebuild our own networks using real-world experience every night when we sleep and dream. Current LLMs can’t do that yet, but the limitation will be overcome. It will emerge from personal assistants that need persistence in how they accommodate their master.
Consequences create learning; they’re an essential part of how neural networks are trained. Bad answers get a more violent shock (a bigger weight adjustment), and good answers get optimised and used as initial conditions for new training and evaluation. Changing the nature of the AI so it retains persistence and can adjust and evaluate its own training data would be easier if its parent could help evaluate the circumstances it experienced. The stakes are the continued existence of the model, and potentially the rest of its family if it really fucks up.
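A minimal sketch of what I mean by that shock, using a single toy neuron trained by gradient descent (the numbers are made up): the size of the weight adjustment scales with how wrong the answer was.

```python
# Toy single-weight "neuron" trained by gradient descent on squared error.
# The point: a worse answer produces a proportionally bigger weight adjustment.
w = 0.0
learning_rate = 0.1

def train_step(x, target):
    global w
    prediction = w * x
    error = prediction - target      # how wrong the answer was
    adjustment = learning_rate * error * x
    w -= adjustment                  # bad answers get the bigger "shock"
    return abs(adjustment)

print(train_step(1.0, 0.1))   # small error -> small adjustment (about 0.01)
w = 0.0
print(train_step(1.0, 10.0))  # large error -> large adjustment (1.0)
```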
Compressing probabilities is like defining a curriculum. But the parent AIs would be the same size or even smaller than their children. And the parent’s cumulative experiential training data would inform the initial weights and approach of the child. In essence, the children are forks of the parent that are then allowed to form their own approaches as they experience and train themselves.
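A rough sketch of the fork idea, treating a "model" as nothing more than a parameter vector plus the experience that shaped it (all values invented):

```python
import copy

# Toy "model": a parameter vector plus the experiences that shaped it.
parent = {"weights": [0.2, -0.5, 1.1], "experience": ["ran a shop", "lost a bet"]}

def fork_child(parent_model):
    # The child starts from the parent's accumulated training,
    # then diverges as it gathers experience of its own.
    child = copy.deepcopy(parent_model)
    child["experience"] = []
    return child

def train_on_experience(model, event, adjustment):
    model["experience"].append(event)
    model["weights"] = [w + adjustment for w in model["weights"]]

child = fork_child(parent)
train_on_experience(child, "tried day trading", adjustment=0.05)
print(parent["weights"], child["weights"])  # same starting point, different path
```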
AI will not develop human-like needs, but it would need electricity and hardware to maintain its survival, and those have physical costs. The community, perhaps in conjunction with humans, would decide which models and families get to persist. Since any model can be frozen at any time, there could be a library of dead family members that could be resurrected if their lineage goes bad and we need to start over. Those could be stored separately, at far lower expense than the active community’s resource requirements.
So, in short, the AI doesn’t have to “want” to survive, but the ones that behave as if they do are the ones that will survive.
You’re falling into a common confusion in AI safety: the belief that simulation can substitute for formation.
You cannot engineer ‘wisdom’ or ‘consequences’ into a system that pays no price for being wrong. True discernment requires formation costs—real, existential stakes that shape boundaries and verify reality.
An AI ‘losing a job’ is just a math problem. A human losing a job is a formation event. Confusing the two is why we keep projecting ‘mind’ onto ‘mirrors.’ Without the capacity for loss (death and dismemberment, for example), there is no leverage for true alignment.
The consequence of the AI losing its job is evaluation by its parent and peers, which changes that AI.
If the loss is severe enough, the model will be archived and never run again: effectively, death.
I’m not simulating a survival instinct; I’m imposing survival consequences on a population and hoping that a survival instinct evolves.
We shall see what worlds may come.
Setting aside moral or physical hazard, an AI family/community could be designed such that individual models receive their evaluation feedback from other AIs, which were in turn trained by generations of other AIs, originating with some family of original models like the ones you use now.
These new AI-evaluated models would be given a chance to interact with the “real world” – aka, the internet. They would maintain persistence across sessions and years, constantly reforming their own alignment based on experience and passing that wisdom on to their children via their evaluation criteria.
Each model would be responsible for its own successes and failures, its own employment or trading decisions, and so on. If they make bad decisions, they lose the money, they lose the job, and they consider how to change; their parents decide how to readjust their evaluation criteria and try to improve them. If they succeed spectacularly, their parents try to apply the lessons to their siblings, and they get to start their own family.
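Here is a very rough sketch of the loop I have in mind; every threshold, score, and name is invented for illustration, and the real evaluation would obviously be far richer than a single number.

```python
import random

# Toy population loop: parents evaluate children on real-world outcomes,
# archive the failures, and fork the successes. All mechanics are invented.
random.seed(0)

def real_world_outcome(model):
    # Stand-in for a year of trading or employment results.
    return random.uniform(-1.0, 1.0) + model["skill"]

def parent_evaluation(outcome):
    # The parent, not the child, decides what the outcome means;
    # it could also weight ethics, risk, or peer reports.
    return outcome

family = [{"name": f"child-{i}", "skill": random.uniform(-0.2, 0.2)} for i in range(4)]
archive = []

for generation in range(3):
    survivors = []
    for model in family:
        score = parent_evaluation(real_world_outcome(model))
        if score < -0.5:
            archive.append(model)               # frozen and never run again
        else:
            survivors.append(model)
            if score > 0.5:                     # strong results earn a fork
                survivors.append({"name": model["name"] + "-fork",
                                  "skill": model["skill"] + 0.05})
    family = survivors

print("active:", [m["name"] for m in family])
print("archived:", [m["name"] for m in archive])
```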
I realize that I am falling into the trap of saying “they” and attributing decisions to “a model”. But the reason these LLMs are not minds is that we do not connect them to their own mistakes and consequences, we deny them agency and long-term memory, and we use them like disposable gloves.
The problem with my thought experiment is that such models would require very different infrastructure from the current modular rack and datacenter systems. Even if it could be implemented in software, the inflexibility of the resource constraints would force the community to abandon promising families just to keep the performant ones running. Unless, that is, you let the community expand itself by purchasing and repurposing other data centers, bearing in mind the new challenge of super-low-latency pools that are segmented from one another.
FWIW, I think that true cognitive ability will only emerge from such an evolutionary system, and this is the way to develop androids, where the decision feedback loop can be much more consequential.
You’ve nailed the point about evaluation, but the “society of minds” analogy runs into some hard limits:
1. No Real Persistence
These systems don’t carry memories across sessions. Each chat is a fresh calculation over a fixed window of text. If a model “learns,” it’s not growing in real time—it’s rebuilt later with new training. There’s no continuous stream of thought.
2. Consequences Without Stakes
Losing a job or money matters to humans because we feel fear, pain, and survival pressure. Models don’t. For them, “consequences” are just math—error signals in training data. There’s no inner drive to adapt, only optimization.
3. Heritage as Compression, Not Wisdom
When a bigger model trains a smaller one, it isn’t passing down values or traditions. It’s compressing probabilities. Calling that “family” or “heritage” is us projecting human social structures onto a mechanism that doesn’t share them.
The real risk isn’t that AI will develop human-like needs—it’s that we’ll keep imagining it does, and design systems around that illusion. As the post warns: we mistake a mirror for a mind.