Polysemanticity in Neural Networks


Trevor Chow


April 4, 2023

Neural networks are often treated as black boxes. However, it is possible to decipher what they are doing on the inside. This is called interpretability, and Chris Olah has done a significant amount of work on it: while he was at OpenAI, his team worked on interpreting vision models, and since co-founding Anthropic, he has been working on interpreting transformer models.

An interesting phenomenon they have found across both of these endeavours is polysemanticity, where an individual neuron responds to more than one feature. This makes it harder to reverse engineer what is happening inside the neural network, because a neuron's activation no longer points to a single concept. One explanation for this phenomenon, given in Toy Models of Superposition, is called superposition. This is where a neural network needs to represent more features than it has neurons, and so it spreads each feature across several neurons, leaving individual neurons to carry parts of multiple features.
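
To make the idea concrete, here is a minimal sketch of superposition, assuming a hand-picked (not trained) weight matrix. It squeezes three features into two neurons by giving each feature a direction 120 degrees apart, then reconstructs the input with ReLU(W^T W x), loosely in the spirit of the toy setup in Toy Models of Superposition. All names here (W, encode, decode, features_per_neuron) are illustrative assumptions rather than anything from that paper's code.

```python
import math

# Hypothetical toy setup: 3 features squeezed into 2 neurons.
N_FEATURES = 3
N_NEURONS = 2

# Column j of W is the direction assigned to feature j.
# The directions are hand-picked, evenly spaced 120 degrees apart.
angles = [2 * math.pi * j / N_FEATURES for j in range(N_FEATURES)]
W = [
    [math.cos(a) for a in angles],  # weights from each feature into neuron 0
    [math.sin(a) for a in angles],  # weights from each feature into neuron 1
]


def encode(x):
    """Hidden activations h = W x (one value per neuron)."""
    return [sum(W[i][j] * x[j] for j in range(N_FEATURES))
            for i in range(N_NEURONS)]


def decode(h):
    """Reconstruction ReLU(W^T h) (one value per feature)."""
    return [max(0.0, sum(W[i][j] * h[i] for i in range(N_NEURONS)))
            for j in range(N_FEATURES)]


def features_per_neuron(tol=1e-9):
    """For each neuron, list the features it has a non-negligible weight for."""
    return [[j for j in range(N_FEATURES) if abs(W[i][j]) > tol]
            for i in range(N_NEURONS)]


if __name__ == "__main__":
    # Each neuron reads from more than one feature, i.e. it is polysemantic.
    print("features per neuron:", features_per_neuron())
    # A single active feature is still recovered after the ReLU.
    for j in range(N_FEATURES):
        x = [1.0 if k == j else 0.0 for k in range(N_FEATURES)]
        print(x, "->", [round(v, 3) for v in decode(encode(x))])
```

In this sketch both neurons carry weight from several features, yet any single active feature is recovered, because the ReLU clips the negative interference from the other two directions. The trick relies on sparsity: if two features are active at once, their interference no longer cancels cleanly and the reconstruction degrades.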

Researchers at Anthropic have followed up on this, both in the transformer circuits thread from above and in one-off work like Engineering Monosemanticity in Toy Models. So have other researchers. For example, Redwood Research published Polysemanticity and Capacity in Neural Networks, which explores the idea further by formalising a notion of capacity, while Conjecture has published Interpreting Neural Networks through the Polytope Lens and Taking Features Out of Superposition with Sparse Autoencoders. There's more on the Alignment Forum too.

However, a lot of this work frames polysemanticity as being downstream of superposition, i.e. as something that arises only when the model needs it to reach the minimum-loss solution. It seems possible that polysemanticity is a more general phenomenon. More details to come soon!