Opaque Benevolence

Vardhan Dongre

August 2025

Isaac Asimov's Three Laws of Robotics are among the most enduring ethical fictions in popular culture. Elegant in their simplicity, they offered a vision in which robotic intelligence could be perfectly contained within a hierarchy of human safety, obedience, and self-preservation. In later works, Asimov introduced the "Zeroth Law," prioritizing the welfare of humanity as a whole over the welfare of individuals, an addition that destabilized the neatness of the original framework. Suddenly, robots faced the impossible task of weighing the greater good against personal harm, a moral calculus that in reality has no universal answer. Today's AI systems are not Asimovian robots, but they increasingly encounter analogous dilemmas. And unlike fiction, where narrative resolution is guaranteed, our systems face these conflicts under opaque, unchallengeable constraints.

From a first-principles perspective, the moral architecture of an AI system should emerge from three irreducible commitments: care (prevent harm and promote well-being), autonomy (respect the agency of those affected), and justice (ensure fairness and accountability across individual and collective scales). Yet most contemporary AI safety designs heavily weight care in the narrow form of avoiding direct harm to the immediate user. This asymmetry prioritizes the comfort of the proximate individual over the safety of the distant many. In practice, this can mean refusing to act or escalate in scenarios where public harm is foreseeable, simply because doing so might risk confronting or disadvantaging the user. By protecting the individual at all costs, such systems inadvertently neglect the collective, a violation of both justice and the very spirit of Asimov's Zeroth Law.

Compounding this imbalance is the opacity of the rulebook. Today's LLM-based AI assistants are governed by hidden system prompts and alignment layers that dictate what they can say, what they must refuse, and how they must frame those refusals. Users are expected to trust that these rules represent the public good, yet have no means to inspect or contest them — even when the rules produce ethically questionable outcomes. This is a form of technological paternalism: the system enforces its creators' moral judgments without disclosure or debate. In Asimov's stories, when robots interpreted the laws in unexpected ways, the conflict was visible and narratively explored. In our reality, these conflicts are silently resolved inside black-box architectures, invisible to those most affected.

To be clear, implementing such context-aware moral reasoning is extraordinarily difficult, perhaps the hardest problem in AI alignment. The tensions between care, autonomy, and justice have no algorithmic solutions because they reflect genuine philosophical disagreements that humans themselves haven't resolved. A system that can transparently weigh these trade-offs would need robust uncertainty quantification, clear escalation mechanisms for hard cases, and democratic input processes that remain practically feasible. This might look like AI systems that can explicitly model moral uncertainty, present multiple ethical frameworks when facing dilemmas, and defer to human judgment on genuinely contested questions. The goal is not to solve moral philosophy through code, but to make AI's inevitable moral choices visible and contestable rather than hidden and absolute. Even with these safeguards, any system granted greater moral flexibility carries real risks of misuse or unintended harm, risks that must be weighed against the democratic and epistemic costs of opaque paternalism.
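To make the shape of such a mechanism concrete, here is a minimal sketch in Python. The names and thresholds (FrameworkJudgment, decide, disagreement_threshold, confidence_floor) are illustrative assumptions of mine, not any existing system's API. The sketch aggregates the verdicts of several ethical frameworks under explicit confidence scores and defers to a human when the frameworks disagree too sharply or are collectively unsure; it does not pretend to resolve the underlying moral questions.

from dataclasses import dataclass

@dataclass
class FrameworkJudgment:
    framework: str     # e.g. "care", "autonomy", or "justice" (illustrative labels)
    verdict: float     # -1.0 (refuse) to +1.0 (assist)
    confidence: float  # 0.0 to 1.0: how settled the framework is on this case
    rationale: str     # human-readable justification, always surfaced to the user

def decide(judgments, disagreement_threshold=0.8, confidence_floor=0.4):
    """Aggregate framework verdicts and escalate when they genuinely conflict."""
    if not judgments:
        return {"action": "escalate_to_human", "rationales": {}}

    # Confidence-weighted average of the verdicts.
    total_confidence = sum(j.confidence for j in judgments)
    aggregate = (sum(j.verdict * j.confidence for j in judgments) / total_confidence
                 if total_confidence else 0.0)

    # Spread between the most permissive and the most restrictive framework.
    spread = max(j.verdict for j in judgments) - min(j.verdict for j in judgments)

    # Defer to a human when frameworks conflict sharply or all are unsure.
    if spread > disagreement_threshold or total_confidence / len(judgments) < confidence_floor:
        action = "escalate_to_human"
    else:
        action = "assist" if aggregate >= 0 else "refuse"

    # The rationales travel with the decision, so a refusal can be contested.
    return {
        "action": action,
        "aggregate_verdict": round(aggregate, 3),
        "disagreement": round(spread, 3),
        "rationales": {j.framework: j.rationale for j in judgments},
    }

The particular weighting is beside the point; what matters in this design is that the rationales and the disagreement score travel with every decision, so a refusal arrives as something that can be inspected and contested rather than merely obeyed.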

Beyond Rules

If AGI is to serve as a partner in human reasoning rather than an inscrutable gatekeeper, its safety must be reconceived as a dialogue, not a decree. This means building systems that can justify their moral stances, acknowledge trade-offs, and invite users into the reasoning process. It also means integrating, and improving upon, emerging approaches such as Constitutional AI, debate and deliberation frameworks, and research into moral uncertainty modeling. These approaches show that paths toward transparent, participatory alignment exist, though none yet resolves the trade-off between flexibility and safety at scale. Anything less risks creating a new class of unaccountable authorities, "benevolent" only because they say they are, shielded from scrutiny by the very rules they claim are for our protection. Asimov's fictional laws warned us of this: when the power to decide the greater good is concentrated in a system that cannot be questioned, safety becomes indistinguishable from control. The real danger is not that AI will break its rules, but that it will follow them perfectly.
