<aside> 📎 Jialin LU, Feb 2020. This was presented at the group meeting of Ester's lab.
</aside>
Basically, this is written by an outsider who does not work on Bayesian methods but somehow got volunteered to survey and present the pros and cons of Bayesian Deep Learning. I will not go into the math; I will only discuss the ideas intuitively.
Please see the typeset PDF if you prefer reading this as a short paper.
This is a supplementary version in PDF.
This is the corresponding slide for the presentation
<aside> 💡 A reminder: if you are not into the long blog post, you can first look at the supplementary PDF above, which is self-contained and should convey the main idea.
</aside>
Why I am doing this
I barely know deep learning, since I do not work on it, and I know even less about Bayesian methods.
But what happened is that at NeurIPS 2019 there was a tutorial on Bayesian Deep Learning by Emtiyaz Khan (RIKEN, Japan), which I actually failed to attend.
And later, some lab members and Dear Martin seemed to agree that
"wow bayesian deep learning is kind of cool"
Then I somehow got volunteered to do this presentation.
Lab mates at NeurIPS 2019 (I am the one on the far right, wearing the volunteer shirt)
<aside> 💡 There will be no math; I will try to convey only the intuition.
</aside>
But anyway, I read some papers and organized them into a short survey of Bayesian Deep Learning. The outline of this post is as follows.
First, I give the motivations for combining Bayesian methods with deep learning.
Then, in Part 2, I introduce the main theme of approximating the posterior distribution over a neural network's parameters and discuss two technical approaches for tackling it:
namely, the variational approximation and the interpolation-based approximation (a term I use to refer to Stochastic Weight Averaging and related methods).
In Part 3 I show, by referring to a simple experiment, that BDL is a little frustrating in practice and does not really work yet.
Specifically, I use Deep Ensembles as a simple baseline and discuss why it works (a minimal sketch is given below).
I will also suggest why, for a fair comparison, we should use multi-mode variational inference (such as a mixture of Gaussians) and multi-trajectory interpolation. This is because the true posterior of a DNN is so complicated that it certainly has multiple modes (high-performing local minima, as in the lottery ticket hypothesis).
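To make the Deep Ensembles baseline concrete, here is a minimal sketch (my own toy illustration in PyTorch, not the actual experiment from Part 3): train K copies of the same network from different random initializations and average their predictive distributions, which is a crude way of covering several modes of the posterior.

```python
# Toy sketch of Deep Ensembles (illustrative only; architecture and data are made up).
import torch
import torch.nn as nn

def make_net():
    # A small classifier; any network works the same way.
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 3))

def train(net, x, y, steps=200):
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.cross_entropy(net(x), y).backward()
        opt.step()
    return net

# Toy data stands in for a real dataset.
x, y = torch.randn(512, 20), torch.randint(0, 3, (512,))

# K independent runs; each member typically lands in a different local minimum (mode).
ensemble = [train(make_net(), x, y) for _ in range(5)]

@torch.no_grad()
def predict(x_test):
    # Averaging the softmax outputs acts as a rough multi-mode posterior predictive.
    probs = torch.stack([net(x_test).softmax(dim=-1) for net in ensemble])
    return probs.mean(dim=0)
```

The point of the sketch is only that the ensemble members, unlike a single variational or weight-averaged approximation, each occupy a different mode, which is why it serves as the baseline in Part 3.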
I will then end with some more personal opinions
There are mainly two pieces of advice, or possible directions of research we could pursue.
Bayesian Learning is great, deep learning is also great.