Speaker Details: Rosanne Liu



An intriguing failing of convolutional neural networks and the CoordConv solution


Few ideas have enjoyed as large an impact on deep learning as convolution. For any problem involving pixels or spatial representations, common intuition holds that convolutional networks may be appropriate. However, in a recent work we show a striking counterexample to this intuition via the seemingly trivial coordinate transform problem, which simply requires learning a mapping between coordinates in (x,y) Cartesian space and coordinates in pixel space. Although convolutional networks would seem appropriate for this task, we show that they fail spectacularly. We demonstrate and carefully analyze the failure first on a toy problem, at which point a simple fix becomes obvious. We call this solution CoordConv, which works by giving convolution access to its own input coordinates through the use of extra coordinate channels. Without sacrificing the computational and parametric efficiency of ordinary convolution, CoordConv allows networks to learn either perfect translation invariance or varying degrees of translation dependence, as required by the end task. CoordConv solves the coordinate transform problem 150 times faster, with 10-100 times fewer parameters, and with perfect generalization. This stark contrast leads to a final question: to what extent has this inability of convolution persisted insidiously inside other tasks, subtly hampering performance from within? A complete answer to this question will likely require much follow-up work, but we show preliminary evidence that swapping convolution for CoordConv can improve models on a diverse set of tasks. We show that using CoordConv in GANs results in less mode collapse as the transform between high-level spatial latents and pixels becomes easier to learn.
We show small but statistically significant improvements from simply adding a CoordConv layer to ResNet-50, and we show significant improvements in the RL domain by giving agents playing Atari games access to CoordConv layers, as well as in the object detection domain by allowing the box proposer and box regressor to see coordinates.
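The core mechanism described above, appending coordinate channels so that an ordinary convolution can see where it is in the image, is simple enough to sketch in a few lines. The helper below is a hypothetical illustration in NumPy (not the authors' code): it concatenates an i-channel and a j-channel, each scaled to [-1, 1], onto a batch of NCHW feature maps; any standard convolution applied afterwards then behaves as a CoordConv layer.

```python
import numpy as np

def add_coord_channels(x):
    """Append two coordinate channels to a batch of NCHW feature maps
    (hypothetical helper illustrating the CoordConv idea)."""
    n, _, h, w = x.shape
    # Row coordinates: h values in [-1, 1], broadcast across the width.
    i = np.linspace(-1.0, 1.0, h).reshape(1, 1, h, 1)
    # Column coordinates: w values in [-1, 1], broadcast across the height.
    j = np.linspace(-1.0, 1.0, w).reshape(1, 1, 1, w)
    i = np.broadcast_to(i, (n, 1, h, w))
    j = np.broadcast_to(j, (n, 1, h, w))
    # A convolution applied to this tensor can now learn either to ignore
    # the coordinate channels (translation invariance) or to use them
    # (translation dependence), as the task requires.
    return np.concatenate([x, i, j], axis=1)

x = np.zeros((2, 3, 4, 5), dtype=np.float32)
y = add_coord_channels(x)
print(y.shape)  # (2, 5, 4, 5): three input channels plus i and j
```

Because the two extra channels are fixed functions of position rather than learned parameters, this preserves the parametric efficiency of ordinary convolution, as noted in the abstract.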


Rosanne is a Research Scientist and one of the founding members of Uber AI Labs. She has been contributing to machine learning research for over 10 years, and obtained her PhD in Computer Science at Northwestern University in 2016. Her research background and interests span neural network theory, optimization, deep learning, object detection, language modeling, and generative models.