TL;DW Dan gives an overview of the course and begins a discussion of training dynamics by covering linear models and kernel methods.
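As a minimal sketch of the kernel-methods viewpoint (an illustration, not the lecture's own example): a model that is linear in its parameters is characterized entirely by the Gram matrix of its frozen features, and its fully-trained prediction has a closed form. The feature map and toy data below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=16)  # frozen random frequencies: training only the readout keeps the model linear

def phi(x):
    # Fixed feature map; the model f(x) = theta . phi(x) is linear in its parameters.
    return np.cos(np.outer(x, W))  # shape (len(x), 16)

def kernel(x1, x2):
    # The kernel is just the Gram matrix of the frozen features.
    return phi(x1) @ phi(x2).T

x_train = np.linspace(-1.0, 1.0, 8)
y_train = np.sin(np.pi * x_train)
x_test = np.linspace(-1.0, 1.0, 50)

# Prediction of the fully-trained kernel method: f(x*) = K(x*, X) K(X, X)^{-1} y.
K = kernel(x_train, x_train) + 1e-8 * np.eye(len(x_train))  # small jitter for numerical stability
y_pred = kernel(x_test, x_train) @ np.linalg.solve(K, y_train)
```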
TL;DW Dan introduces the quadratic model as a minimal model of representation learning, and uses gradient descent to solve the training dynamics. This extends kernel methods to “nearly-kernel methods.”
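A rough sketch of the idea, with hypothetical per-sample derivatives J and H standing in for a real network: truncate the Taylor expansion of the model in its parameters at second order, f = f0 + J·dθ + ½ dθᵀH dθ, and train dθ by plain gradient descent. Unlike a kernel method, the effective features J + H·dθ now move during training.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 4, 6                           # samples, parameters (toy sizes)
f0 = rng.normal(size=n)               # model outputs at init
J = rng.normal(size=(n, p))           # df/dtheta at init, per sample
H = 0.1 * rng.normal(size=(n, p, p))  # d^2f/dtheta^2, small so the quadratic term is a correction
H = 0.5 * (H + H.transpose(0, 2, 1))  # symmetrize the per-sample Hessians
y = rng.normal(size=n)

dtheta = np.zeros(p)
lr = 0.02
for _ in range(2000):
    f = f0 + J @ dtheta + 0.5 * np.einsum("i,aij,j->a", dtheta, H, dtheta)
    err = f - y
    # df_a/dtheta_i = J_ai + (H dtheta)_ai: the effective kernel updates with training,
    # the minimal form of representation learning.
    grad_f = J + np.einsum("aij,j->ai", H, dtheta)
    dtheta -= lr * err @ grad_f

f = f0 + J @ dtheta + 0.5 * np.einsum("i,aij,j->a", dtheta, H, dtheta)
print(np.abs(f - y).max())  # ~0 if training converged
```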
TL;DW Sho explains how to recursively compute the statistics of a deep, finite-width MLP at initialization. Thanks to the principle of sparsity, the distribution of the network output is tractable.
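One such layer-to-layer recursion, at leading (infinite-width) order for a single input, reads K^(l+1) = C_b + C_W · E_{z∼N(0,K^(l))}[σ(z)²]. A sketch of iterating it numerically for a tanh activation; the constants and Monte Carlo estimator here are illustrative, not the course's.

```python
import numpy as np

rng = np.random.default_rng(2)

def next_kernel(K, C_W=1.0, C_b=0.0, n_mc=200_000):
    # K^(l+1) = C_b + C_W * E_{z ~ N(0, K^(l))}[sigma(z)^2], estimated
    # by Monte Carlo for sigma = tanh.
    z = rng.normal(scale=np.sqrt(K), size=n_mc)
    return C_b + C_W * np.mean(np.tanh(z) ** 2)

K = 1.0  # input-layer kernel (illustrative)
for layer in range(1, 11):
    K = next_kernel(K)
    print(f"layer {layer:2d}: K = {K:.4f}")  # K decays toward the K* = 0 fixed point
```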
TL;DW Sho solves the layer-to-layer recursions derived in the previous lecture using the principle of criticality. We learn that the leading finite-width effects scale like the depth-to-width ratio of the network.
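To give the flavor of criticality (a sketch, not the course's derivation): a small perturbation of the kernel around a fixed point K* gets multiplied each layer by a susceptibility χ = C_W · E[σ'(z)²], and criticality means tuning the initialization so χ = 1, so signals neither explode nor vanish with depth. For tanh, with fixed point K* = 0 and σ'(0) = 1, this picks out C_W = 1 (with C_b = 0).

```python
import numpy as np

rng = np.random.default_rng(3)

def chi(K, C_W, n_mc=200_000):
    # Susceptibility chi = C_W * E_{z ~ N(0, K)}[sigma'(z)^2] for sigma = tanh:
    # kernel perturbations are multiplied by chi at each layer.
    z = rng.normal(scale=np.sqrt(max(K, 1e-12)), size=n_mc)
    return C_W * np.mean((1.0 - np.tanh(z) ** 2) ** 2)

# Near the tanh fixed point K* = 0, chi -> C_W, so C_W = 1 is critical:
# C_W < 1 makes signals vanish with depth, C_W > 1 makes them explode.
for C_W in (0.8, 1.0, 1.2):
    print(f"C_W = {C_W}: chi = {chi(K=1e-6, C_W=C_W):.3f}")
```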
TL;DW By combining the initialization statistics with the training dynamics, we obtain a description of fully-trained networks at finite width. Then Dan explains how MLPs *-polate, how to estimate a network’s optimal aspect ratio, and how to think about complexity.
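A rough numerical illustration of why the aspect ratio L/n is the relevant control parameter (the estimator, widths, and depths below are my choices, not the course's): the relative fluctuation across initializations of the empirical kernel K̂ = (1/n) Σᵢ hᵢ² of a critical tanh MLP should grow roughly like L/n.

```python
import numpy as np

def empirical_kernel(L, n, x, rng):
    # One random init of a critical tanh MLP (C_W = 1, C_b = 0);
    # return the last-layer empirical kernel hat-K = (1/n) sum_i h_i^2.
    h = x
    for _ in range(L):
        W = rng.normal(scale=1.0 / np.sqrt(len(h)), size=(n, len(h)))
        h = np.tanh(W @ h)
    return np.mean(h ** 2)

rng = np.random.default_rng(4)
x = np.ones(32)  # normalized so the input-layer kernel is 1
for L, n in [(2, 128), (8, 128), (8, 512)]:
    samples = [empirical_kernel(L, n, x, rng) for _ in range(500)]
    rel_var = np.var(samples) / np.mean(samples) ** 2
    print(f"L={L:2d} n={n:3d} (L/n={L/n:.4f}): relative var of hat-K = {rel_var:.2e}")
```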