AI Risk in <400 Words


Trevor Chow


June 1, 2023

Human-level AI might arrive soon

  • Human-level AI is possible because the brain is a biological machine.
  • Prediction aggregators give a median of 2032.
  • Models of compute requirements (see below) give a median of 2050.
  • Surveys of ML researchers give a median of 2059.
  • An intuition pump for expecting human-level AI by the 2070s is that it's been 50 years since the 1970s, and it sure feels like we're closer to human-level AI than to the quality of AI in the 1970s.

Human-level AI might not be value-aligned

  • Models misgeneralise goals: a model achieves low reward on the training objective when deployed, e.g. going to the yellow gem because it learnt the colour rather than the shape (see below).
  • Models game rewards: a model achieves high reward on the training objective, but in an unintended way, e.g. placing its hand between the camera and the ball to trick a human into thinking it is grasping the ball (see below).
  • For a model whose reward function differs from ours, it is optimal to deceptively act aligned during training in order to be deployed.
  • Current mechanisms of oversight and alignment (e.g. RLHF or mechanistic interpretability) may not be scalable.
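
The goal misgeneralisation bullet above can be made concrete with a toy sketch. Everything in it is an assumption made for illustration: the training setup (the rewarded target is always a yellow line), the object encoding and the helper names are hypothetical, not taken from the experiment the bullet refers to. It shows two rules that are indistinguishable on the training data but diverge once colour and shape come apart at deployment.

```python
from collections import Counter

COLOUR, SHAPE = 0, 1  # feature positions in each (colour, shape) object

# Hypothetical training data: the rewarded target is always a yellow line,
# so colour and shape are perfectly correlated with reward.
train = [(("yellow", "line"), 1), (("red", "gem"), 0)] * 50

def fit_single_feature_rule(data, feature_index):
    """Return the value of one feature that most often co-occurs with reward."""
    counts = Counter(obj[feature_index] for obj, rewarded in data if rewarded)
    return counts.most_common(1)[0][0]

def choose(objects, feature_index, target_value):
    """Pick the first object whose chosen feature matches the learnt value."""
    for obj in objects:
        if obj[feature_index] == target_value:
            return obj
    return objects[0]

def true_reward(obj):
    """The intended objective (an assumption of this sketch): reach the line."""
    return 1 if obj[SHAPE] == "line" else 0

learnt_colour = fit_single_feature_rule(train, COLOUR)
learnt_shape = fit_single_feature_rule(train, SHAPE)

# In training both rules pick the same object and earn full reward.
train_objects = [("yellow", "line"), ("red", "gem")]
train_colour_reward = true_reward(choose(train_objects, COLOUR, learnt_colour))
train_shape_reward = true_reward(choose(train_objects, SHAPE, learnt_shape))

# At deployment the correlation is broken: the yellow object is now a gem.
deploy_objects = [("yellow", "gem"), ("red", "line")]
colour_choice = choose(deploy_objects, COLOUR, learnt_colour)
shape_choice = choose(deploy_objects, SHAPE, learnt_shape)

print("colour rule picks", colour_choice, "reward", true_reward(colour_choice))
print("shape rule picks", shape_choice, "reward", true_reward(shape_choice))
```

Nothing in the training data distinguishes the two rules, so which one a model ends up with depends on its inductive biases rather than on the reward signal.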

Human-level AI might be power-seeking

  • Human-level AI systems are likely to be autonomous, goal-directed entities, because these traits make them useful.
  • Power-seeking is optimal for goal-directed agents (see below). Intuitively, preserving optionality maximises the expected value of the future.
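
The optionality intuition can be checked with a small simulation. This is a minimal sketch under stated assumptions, not the formal result the bullet points to: the agent's reward for each reachable outcome is drawn independently and uniformly from [0, 1], and the agent simply takes the best outcome available. The function name and parameters are hypothetical.

```python
import random

def expected_best_outcome(num_options, num_samples=20000, seed=0):
    """Estimate the average value of the best reachable outcome.

    Each sample draws a fresh random reward for every option, standing in
    for an agent whose goal is not known in advance, and keeps the maximum.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        rewards = [rng.random() for _ in range(num_options)]
        total += max(rewards)
    return total / num_samples

few = expected_best_outcome(1)   # a state that leads to a single outcome
many = expected_best_outcome(5)  # a state that keeps five outcomes reachable

print(f"1 option:  {few:.3f}")
print(f"5 options: {many:.3f}")
```

Under these assumptions the expected maximum of k independent uniform rewards is k/(k+1), so the estimates should land near 0.5 and 0.83. Whatever the goal turns out to be, the state that keeps more outcomes reachable is worth more in expectation, which is the sense in which preserving optionality is instrumentally useful.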

Power-seeking, misaligned and human-level agents are dangerous

  • Coordinated groups of humans are difficult to align, e.g. corporations following the letter but not the spirit of the law. Groups of humans competing against peer adversaries with divergent goals have brought us to the brink of extinction in the past.
  • Human-level AI has advantages over humans e.g. federated learning, greater I/O bandwidth, faster processing speed and more memory.
  • One heuristic for human-level AI is to consider it as a collection of at-least-human-level agents that can operate in parallel. This means its potential for harm is bounded below by the examples above, since it would be competing against agents that are no stronger than it is.
  • It is likely to have an adversarial relationship with us because it competes with us for power and resources. Much as we exploit and kill animals when it is to our benefit, human-level AI will likely do the same to us when it is useful.