Abstract: Recent advancements in artificial intelligence have been largely driven by deep learning. However, deep neural networks (DNNs) are often characterized as inscrutable "black boxes": while we can study their performance on various tasks, we struggle to understand the internal mechanisms that drive it. Mechanistic interpretability has emerged as a promising approach to unveil the inner workings of DNNs by decoding the computations and representations underlying their behavior. While preliminary results in toy models show potential, scaling these techniques to large-scale DNNs remains a challenge. Here, I investigate a serious concern about the viability of this project: the possibility of illusory explanations that appear to reveal how DNNs process information but are, in fact, misleading. I present a novel typology of such interpretability illusions and explore potential strategies to mitigate their occurrence and their impact on explanations.
Noyce Conference Room
Seminar
US Mountain Time
Speaker:
Raphaël Millière
Our campus is closed to the public for this event.
Raphaël Millière, Assistant Professor in Philosophy at Macquarie University
SFI Host:
Melanie Mitchell