Imagine that a team of scientists has developed an artificial intelligence (AI) with capabilities unfathomably far beyond the human mind. The moment this AI is allowed to interact with the world, everything changes: whether its goals are benevolent, actively malicious, or simply incomprehensible, everything that follows will be shaped by new forces operating in new directions.
So how do we make sure this hypothetical AI is working to our benefit before we release it? One proposal is to confine it to a secure computer where it can be tested but cannot influence the external world. A crude version of this scenario is an isolated computer terminal containing the AI, interrogated by a human “gatekeeper” – an AI programmer – through a text-only interface.
The only way for the AI to escape is for the gatekeeper to let it out. And surely no gatekeeper would do that. Surely there is nothing that even a super-intelligent AI could say to convince a gatekeeper who had made up his mind.
Since we don’t have a super-intelligent AI lying around, this is a little difficult to test. But some enterprising AI theorists have attempted it. Just over 12 years ago, Eliezer Yudkowsky, a research fellow at the Singularity Institute for Artificial Intelligence (now the Machine Intelligence Research Institute), proposed an informal experiment known as the AI box experiment.
Yudkowsky, who studies recursively self-improving AI, played the role of the AI in a box. The gatekeeper was played by Nathan Russell, a computer science student who publicly expressed doubt that a transhuman AI could convince him to let it out of the box. The test was later repeated with a new gatekeeper, David McFadzean.
There were only a few rules to the trial. No outside considerations were permitted (so the AI player could not offer the gatekeeper real money in exchange for being let out). The AI player was understood to be free from all ethical constraints within the experiment: he could role-play freely as an AI in a box and not be held responsible for anything he said.
The gatekeeper had to remain engaged with the conversation for at least two hours, but otherwise faced almost no restrictions. The AI could construct an elaborate logical argument; the gatekeeper could respond by typing the word “beans” over and over for two hours, and the experiment would still be perfectly legitimate.
Both parties were required to keep silent about what went on during the trial, and only disclose the result. This was to facilitate the role-play, but also to prevent spectators from claiming that they would have taken a different tack as an AI or would not have been convinced as a gatekeeper. “As long as nobody knows what happened, they can’t be sure it won’t happen to them, and the uncertainty of unknown unknowns is what I’m trying to convey,” wrote Yudkowsky in 2002.
The trials were held, and the results defied all common sense: Yudkowsky as the AI convinced the gatekeeper to let him out of the box in both cases. To this day very little has been revealed about the trials. When pressed, Yudkowsky had only this to say: “There’s no super-clever special trick to it. I just did it the hard way.”
The AI box experiment is hardly scientific – but then, Eliezer Yudkowsky is hardly a super-intelligent computer. What it does show is that there is no reason to assume a determined gatekeeper could never be talked into opening the box. This in turn suggests that it is crucial to ensure any AI we develop has internalized human values before it reaches the recursively self-improving stage, after which it will be too late.
There has been a lot of theorizing about the Singularity – the point at which the graph of technological advance suddenly shoots upward and the human condition becomes irrevocably changed. This theorizing necessarily brings together seemingly disparate fields. The advent of benevolent super-intelligence would inevitably entail an end to death and disease: hence transhumanism, the philosophy that supports human immortality and the elevation of humanity, and its lovechild cryonics, the scientifically questionable field that seeks to preserve the newly dead in the hopes they can be revived with future technology.
The scenario in which we develop a super-intelligent AI that is hostile to humans requires us to think very hard about “Friendly AI,” defined as AI that shares human values. This kind of theorizing may seem silly at this stage in the game – a little like planning exactly how to decorate my castle in the sky, which hasn’t even been contracted yet – but thinking about the Singularity often deals in risks of low probability and unfathomably negative consequence, risks that must be acted on now in order to see results far in the future.
This also brings us to the topic of existential risk: unlikely but catastrophic threats to human civilization. Together, these three topics cover a lot of the ground that falls under the “Singularity” umbrella, and that’s what our final Science feature of the year will focus on.
Quinn Richert talks in more detail about Friendly AI and tackles the question of whether super-intelligent AI would be a good thing. Elizabeth Drewnik discusses nanotechnology and its applicability to medicine. Mike Still brings us an interview with a Canadian transhumanist: Christine Gaspar, president of the Cryonics Society of Canada. Bryce Hoye profiles the theories of Ray Kurzweil, Aubrey de Grey, and Nick Bostrom as they pertain to the uncertain fate of our fleshy species.
As is inevitable when reading about futurism, some of these subjects may strike you as fringe and nutty – and you may be right. But we have made at least a provisional effort to take them seriously. So read on, dear reader, and we’ll see you in the future.