Show HN: Beating Hinton et al.'s capsule net with fewer params and less training
p1esk 1587 days ago [-]
Hey, congrats on publishing!

1. Could you briefly summarize your algorithm (novelty, how it’s better, why it’s better, etc)?

2. Since the original paper there have been dozens of published attempts to improve upon it. Do you compare your results to the latest in capsules research?

3. I personally would like to see ImageNet results. NORB is a toy dataset. If you beat EfficientNet in terms of both accuracy and number of params/FLOPs, many people will be impressed (including Hinton). Or match the performance of a good convnet using 1/10 of the training data.

Don’t take this the wrong way, but two years after the original paper, NORB results, no matter how good, are underwhelming.

fheinsen 1587 days ago [-]
Great questions. Happy to answer them here.

First of all, this work builds on Hinton et al.’s second paper, the one about EM routing of matrix capsules, from last year: https://ai.google/research/pubs/pub46653. It is only minimally related to the earlier paper (Sabour et al.'s) from two years ago!

RESPONSES TO #1:

* The same algorithm also achieves SOTA in another domain, natural language. Same code. I think it’s significant that the same code, without change, produces SOTA in two domains. See the README and tables 3 and 4 in the draft paper. Don't you think this is significant?

* It requires fewer parameters: 272K, versus 310K for Hinton et al. (2018)’s model and 2.7M for the best-performing CNN on record (Cireşan et al.); see table 2. That’s 10x fewer parameters than the best-performing CNN on record.

* It requires an order of magnitude less training: 50 epochs instead of 300 for Hinton et al. (2018)'s model.

* It’s trained with minimal data augmentation, unlike Hinton et al.’s and Cireşan et al.’s models (the latter, in particular, uses a ton of data augmentation). Also, unlike Hinton’s model, it accepts full-size images instead of 32x32 crops that are 9 times smaller. Finally, we do not measure accuracy as a mean of multiple crops. So, the model has fewer parameters, requires less training, and has greater capacity.

* It seems to be learning a form of "reverse graphics" on its own, from only pixels and labels, without having to optimize explicitly for it. See the README, figure 4, and the 24 plots and captions in supplemental figures 6 and 7. This is rather significant, don't you think?

RESPONSES TO #2:

* As far as I know, the best attempt at recreating Hinton et al.’s work on EM routing is by Ashley Gritzman at IBM, in July of this year -- only a bit over two months ago. As far as I can tell, his model does not come close to matching Hinton’s performance:

https://arxiv.org/abs/1907.00652

https://github.com/IBM/matrix-capsules-with-em-routing

https://medium.com/@ashleygritzman/available-now-open-source...

* There have been a few other efforts, all of which seem to fall short of Hinton's performance. Gritzman does a good job of covering those other efforts in his Medium article. None of these efforts propose any new ideas, as far as I can tell.

RESPONSES TO #3:

* Me too. So does Hinton: https://openreview.net/forum?id=HJWLfGWRb ... and so does everyone else.

* Alas, as Paul Barham and Michael Isard at Google Brain showed earlier this year, it is currently challenging to scale capsule networks to large datasets and output spaces. Today's software (e.g., PyTorch, TensorFlow) and hardware (e.g., GPUs, TPUs) are highly optimized for a fairly small set of computational kernels, in a way that is tightly coupled with memory hardware, and this leads to poor performance on non-standard workloads, including basic operations on capsules. Source: Barham and Isard (2019) - https://dl.acm.org/citation.cfm?id=3321441 (the PDF is available for free download at that link).

* My draft paper mentions Barham and Isard’s work.

p1esk 1587 days ago [-]
1. Both of Hinton’s capsules papers were released at the same time (Oct 2017). You can see the first comment on the OpenReview page for the EM paper is dated Nov 2017. From what I remember, the two papers appear very similar, with the main difference being in how the routing is implemented.

2. You cite a convnet result from 2011 (!). Don’t you think a modern convnet would do vastly better on this task?

3. Could input size play a role? Did you try feeding 96x96 inputs to the models you’re comparing against, to see if they also benefit from it?

4. I’m a bit confused as to why other implementations failed to reproduce Hinton’s results, given that the code was open-sourced (link in the first OpenReview comment).

5. Ok, ImageNet is too slow; how about CIFAR-10? What would it take to reach, say, 95%? That would be equivalent to a well-trained ResNet-18. If you can show such a result, I personally would become more interested, because I’ve worked quite a bit with CIFAR-10, but not with NORB.

I think you might be onto something, but it’s still not clear that the capsules approach is scalable and ultimately superior to plain convnets.

fheinsen 1587 days ago [-]
I’m surprised you did not comment on the fact that my version of EM routing also achieves SOTA on another domain, natural language. Same code.

Here are the answers to your questions:

1. The final, published version is stamped “ICLR 2018,” so I used that year.

2. I don’t know if a conventional CNN can do this with 10x fewer parameters, while also learning to do a form of “reverse graphics” without explicitly optimizing for it. (I wouldn’t know how to get a CNN to do that without explicitly making it a training objective.)

3. IIRC, the convnet model from 2011 accepts 96x96 images. As to why Hinton et al. downsample images to 9x smaller, I suspect (but don’t know for sure) that they had no choice, in order to conserve memory and computation with their version of EM routing. I was able to reduce memory and computation with my variant of EM routing (by between one and two orders of magnitude) by setting the first routing layer to accept a variable number of inputs, without regard to location in the image.

4. Me too. But you asked me about work other than Hinton’s, and that’s all I could find!

5. CIFAR10 is on the to-do list (work permitting!) :-)
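To make point 3 concrete: treating all spatial positions as one flat set of input capsules means a routing layer's cost scales with the total number of input capsules, not with a per-location kernel. Below is a toy routing-by-agreement loop in numpy; it is a simplified sketch of generic routing, not Heinsen's or Hinton et al.'s actual EM algorithm, and the shapes and `route` helper are illustrative assumptions:

```python
import numpy as np

def route(votes, n_iters=3):
    """Toy routing over a flat set of input capsules.

    votes: (n_in, n_out, d) array of each input capsule's vote for each
    output capsule's pose. All spatial locations are flattened into the
    single n_in axis, so there is no per-location bookkeeping.
    """
    n_in, n_out, d = votes.shape
    logits = np.zeros((n_in, n_out))
    for _ in range(n_iters):
        # each input softly assigns itself across the output capsules
        r = np.exp(logits - logits.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # output poses = weighted mean of the votes routed to them
        poses = (r[..., None] * votes).sum(axis=0) / r.sum(axis=0)[:, None]
        # agreement: votes close to an output's pose get routed there more
        logits = -((votes - poses[None]) ** 2).sum(axis=-1)
    return poses

# e.g. 32 input capsules flattened from all spatial positions, 5 outputs, 4-dim poses
poses = route(np.random.default_rng(0).standard_normal((32, 5, 4)))
print(poses.shape)  # (5, 4)
```

The point of the flat n_in axis is that memory grows with n_in * n_out, rather than with image area times kernel size at every output location.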

p1esk 1587 days ago [-]
How does a regular convnet do on another domain?

Learning to do “reverse graphics” is only useful if you can show it is the reason behind the performance improvement compared to a plain convnet. Until we have CIFAR-10 results, it’s not clear.

What I’m saying is: no one has yet demonstrated a clear superiority of any capsules-based model over the best available plain convnet. Even on CIFAR-10. Looking forward to your results!

fheinsen 1587 days ago [-]
> How does a regular convnet do on another domain?

As far as I know, regular convnets have failed to outperform query-key-value self-attention models (i.e., transformers based on Vaswani et al.'s work) on pretty much every sequence task, including natural language tasks.

> Learning to do “reverse graphics” is only useful if you can show it is the reason behind performance improvement.

I would strongly disagree. Building systems that can learn "reverse graphics" on their own has long been a goal of computer vision. It seems a prerequisite for building machines that can build internal representations of the state of the physical world around them. Hinton et al.'s 2018 paper has a summary of recent efforts on this front in the "Related Work" section.

> What I’m saying is - no one has yet demonstrated a clear superiority of any capsules based model to the best available plain convnet.

No one is saying otherwise. :-) Convnets are still the right tool for most production systems in visual recognition today.

That said, I don't think a convnet can achieve 99.1% accuracy on smallNORB with only 272K parameters, after training from scratch without using any additional data or metadata of any kind -- like the model using my routing algorithm. If you think you can do that with a convnet, do it and put it up online (I'd love to see it :-)
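As a sanity check on how tight that 272K budget is, here is some back-of-the-envelope arithmetic for a hypothetical small convnet on 2-channel 96x96 smallNORB inputs. The layer sizes are invented purely for illustration and are not any model discussed in this thread:

```python
# weights + biases for a conv layer with square k x k kernels
def conv_params(c_in, c_out, k):
    return c_in * c_out * k * k + c_out

# made-up layer stack: (in_channels, out_channels, kernel_size)
layers = [(2, 32, 5), (32, 64, 3), (64, 64, 3), (64, 128, 3)]
total = sum(conv_params(ci, co, k) for ci, co, k in layers)
total += 128 * 5 + 5  # global-average-pool head into 5 smallNORB classes
print(total)  # 131557
```

Staying under 272K parameters is easy; the hard part of the challenge is hitting 99.1% on smallNORB within that budget, from scratch and with minimal augmentation.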

p1esk 1587 days ago [-]
You’re comparing sentence classification done using transformer embeddings to older results which use inferior embeddings. How do regular convnets do when you feed them transformer embeddings?

Re learning reverse graphics - ok, maybe it is indeed the main feature of your work. I’d need to look into that, because from skimming your paper it’s not immediately clear what’s going on there.

Re convnet accuracy on Norb - I’m willing to make that effort for cifar-10 as soon as you have the results.

fheinsen 1587 days ago [-]
> You’re comparing sentence classification done using transformer embeddings to older results which use inferior embeddings. How do regular convnets do when you feed them transformer embeddings?

Actually, I'm comparing it to recent models, including XLNet, MT-DNN, Snorkel, and (of course) BERT. AFAIK, convnets have not been able to outperform multihead self-attention, even on pretrained embeddings.

> Re learning reverse graphics - ok, maybe it is indeed the main feature of your work. I’d need to look into that, because from skimming your paper it’s not immediately clear what’s going on there.

I agree, it's not immediately clear. Nonetheless, I find it kind of unbelievable that a model with so few parameters can seem to do it. (I was shocked when I first saw the plots.)

> Re convnet accuracy on Norb - I’m willing to make that effort for cifar-10 as soon as you have the results.

That's a little disappointing... but OK.

Thank you so much for all your questions :-)

p1esk 1587 days ago [-]
Ah, I missed table 4 with the recent models. I looked closer, and it does look impressive; however, you should ask someone who has worked on that task to review your experiments (I haven’t).

Actually, it looks like you’ve got a solid paper. I recommend submitting to either CVPR or ICML, especially if you can get good results on CIFAR.

fheinsen 1586 days ago [-]
Thank you!

Yes, I think this has legs.

Maximizing "bang per bit" (a) seems to be a genuinely new idea, as opposed to some minor tweak on the same old thing, and (b) so far the evidence shows it works better than previous methods.

(FWIW, we've been using this algorithm internally at work with similar outperformance over other methods, in yet another domain that is neither vision nor language... but I cannot share those results publicly.)

Before submitting this anywhere, I'd like to get more informal feedback from other AI researchers. I've reached out to people at Google Brain, Facebook AI, DeepMind, OpenAI, and a handful of top academic institutions and research groups. So far, the response has been positive, but I expect it will take everyone at least a couple of weeks, and probably longer, to read and understand the draft paper in sufficient detail to give me more than superficial comments.

New things often look like toys at first. :-)

p1esk 1586 days ago [-]
Keep in mind that someone might steal your ideas. Right now there are probably a dozen people preparing capsules-related papers for CVPR (due in 2 weeks), so if one of them comes across your paper there’s a temptation.
fheinsen 1585 days ago [-]
Thank you for saying that. Sometimes I forget how petty and small people can be, especially when they are under pressure, academic and otherwise.

I'll take a look at submitting it to CVPR.

In the meantime, please circulate my work. It's on record, online. The more people who are aware that others have seen it, the less likely someone will try to plagiarize it.

I'm not under any kind of academic pressure, so I don't need citations, conference slots, etc. But I do deserve credit for this, don't you think?

PS. And now that you mention it, a couple of people to whom I reached out mentioned they were under deadline over the next two weeks.

PPS. Send me an email!

fheinsen 1584 days ago [-]
FYI, I reached out to two of those individuals (one is a CVPR reviewer, it turns out) and both suggested I first upload this to arXiv, so I did that yesterday. The paper is now stamped with a date and in the queue for site-wide notification. Thank you again for your feedback!
p1esk 1584 days ago [-]
Yes, that's a good move.

Let me know when you have CIFAR-10 results, I will try to match your accuracy using the same number of parameters in a regular convnet. I actually implemented the original, vector based capsnet a while ago: https://github.com/michaelklachko/CapsNet but I haven't really explored it. Your success on CIFAR-10 would definitely provide motivation for me to do so.

fheinsen 1583 days ago [-]
Thanks. Will do (work permitting!).

FWIW, a while back I reimplemented and tinkered a bit with the Sabour et al. version too... and did not see much promise in it.

Note that the routing algorithm I've proposed generalizes to vectors (by setting the dimension of the covector space d_cov to 1).
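A quick shape-level illustration of that special case (the `make_poses` helper and the numbers are my own, for illustration only): each capsule's pose is a [d_cov, d_out] matrix, and setting d_cov = 1 collapses it to a plain pose vector.

```python
import numpy as np

def make_poses(n_capsules, d_cov, d_out, seed=0):
    # random poses, one [d_cov, d_out] matrix per capsule
    return np.random.default_rng(seed).standard_normal((n_capsules, d_cov, d_out))

matrix_poses = make_poses(8, d_cov=4, d_out=4)  # matrix capsules, a la Hinton et al. (2018)
vector_poses = make_poses(8, d_cov=1, d_out=4)  # d_cov=1: vector capsules, a la Sabour et al.

print(matrix_poses.shape)             # (8, 4, 4)
print(vector_poses.squeeze(1).shape)  # (8, 4)
```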
