A ton of high-quality engineering is done based on intuition, mental models, and patterns learned over years of experience. My hunch is that deep learning will be the same.
EDIT: Just reread, and I want to clarify. I'm not saying that analog design is at the same stage of development as deep learning, or that it is anywhere near as ad hoc. Deep learning probably has a long way to go, but it could potentially end up in a similar state where years of experience is critical and intuition rules.
As an example, a bias design goes something like this: "Let's see, I'll pin the base at five volts with a resistor divider. The emitter will be 0.6V below that. Then the emitter current will be (5.0 - 0.6) divided by the emitter resistor. The collector current will be essentially the same, so I can pick the collector load resistor to give me an appropriate quiescent point and make sure the output impedance is less than a tenth of the input impedance of the following stage (so I can ignore the latter)."
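To make that arithmetic concrete, here is a quick sketch of the same back-of-the-envelope calculation; the supply voltage, resistor values, and next-stage impedance are made up for illustration:

```python
# Back-of-the-envelope bias calculation from the quote above, with
# hypothetical component values (divider pinning the base at 5 V,
# a 12 V supply, example resistors).
V_BASE = 5.0          # volts, set by the resistor divider
V_BE = 0.6            # volts, base-emitter drop
V_SUPPLY = 12.0       # volts, hypothetical supply rail
R_EMITTER = 1_000.0   # ohms, hypothetical emitter resistor

v_emitter = V_BASE - V_BE                 # 4.4 V
i_emitter = v_emitter / R_EMITTER         # ~4.4 mA
i_collector = i_emitter                   # collector current ~ emitter current

# Pick the collector resistor so the quiescent collector voltage sits
# roughly midway between the emitter voltage and the supply rail.
v_collector_target = (V_SUPPLY + v_emitter) / 2
r_collector = (V_SUPPLY - v_collector_target) / i_collector

# Check the loading rule of thumb: output impedance (~ R_collector) should be
# under a tenth of the next stage's input impedance (hypothetical value here).
R_IN_NEXT_STAGE = 10_000.0
assert r_collector < R_IN_NEXT_STAGE / 10, "next-stage loading can't be ignored"

print(f"Quiescent collector current: {i_collector * 1e3:.1f} mA")
print(f"Collector load resistor:     {r_collector:.0f} ohms")
```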
Knowing a parameter in detail isn't nearly as important as knowing if that parameter matters in the scope of the final design.
The book is also pretty honest about that and goes to pretty good lengths to guide readers to deeper material if it's an area they want to understand in greater depth.
A true testament that even in these modern days the principles taught are still solid.
I have yet to get his classical mechanics book. Gonna have to pull the trigger on that soon.
Edit: Thanks for that - it's more detailed than the talks I attended (LibrePlanet).
It seems deep learning is pretty much at that state now (with the possible exception of quality). The problem is that this puts inherent limits on what can be done with it. The power of digital computing is the power of modular expansion of objects. Analog circuits and computers don't have that. And current trained deep learning models don't have combinability and modularity either.
A lot of specific engineering subfields involve this "intuition, mental models, and patterns learned over years of experience", but keeping that model of deep learning indefinitely would mean a vast proliferation of such subfields, each with its own limits, as the differences in applying deep learning techniques to different domains become evident.
And while we're on it, is there something similar to analog or digital designs?
Here was my takeaway: an engineer has to understand the domain and algorithms involved at a deep level, or they will not be productive. Or, you will need to have both an engineer and somebody with the domain knowledge and experience.
It doesn't really matter what your problem domain is. If you're an engineer, and it's your job to make changes to a system, whether code or config, you need to understand it at a deep level. And your manager needs to understand this requirement.
Otherwise, you will be guessing at changes, so your productivity will be horrible or non-existent.
Science also doesn't need to have a model that explains everything. Because if it did, we wouldn't have science.
Often, people either reuse other people's architectures, or simply try 2 or 3 and stick with the best one, only changing the learning rate and such.
I also wonder if there's a computation issue (training is long, we can only try so many things), or if it really is that we are working in the wrong hyperparameter space. Maybe there is another space we could be working in, where the HPs that we currently use (learning rate, L2 regularization, number of layers, etc.) are a projection from that other HP space where "things make more sense".
[edit:] In this analogy, deep learning currently misses any sort of a general theory (in the sense of theories explaining experiments).
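For what it's worth, the "try 2 or 3 and stick with the best one" loop described above often amounts to plain random search over a handful of hyperparameters. A toy sketch of that loop (the objective is a hypothetical stand-in for a full training run):

```python
import random

# Toy random search over the hyperparameters mentioned above.
# `train_and_evaluate` is a hypothetical stand-in for a full training run
# that returns a validation score (higher is better).
def train_and_evaluate(lr, l2, n_layers):
    # placeholder: in reality this is hours of training
    return -((lr - 1e-3) ** 2) - ((l2 - 1e-4) ** 2) - abs(n_layers - 4)

search_space = {
    "lr": lambda: 10 ** random.uniform(-5, -1),
    "l2": lambda: 10 ** random.uniform(-6, -2),
    "n_layers": lambda: random.randint(2, 8),
}

best_score, best_config = float("-inf"), None
for _ in range(20):                      # budget: 20 full training runs
    config = {name: sample() for name, sample in search_space.items()}
    score = train_and_evaluate(**config)
    if score > best_score:
        best_score, best_config = score, config

print(best_config, best_score)
```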
I'd agree it's done in a sort-of scientific way. But I don't think you can say it's done the way natural science is done. A complex field, like oceanography or climate science, may be limited in the kind of experiments it can do and may require luck and intuition to produce a good experiment. But such science is always aiming to reproduce an underlying reality, and the experiments aim to verify (or not) a given theory.
The process of hyperparameter optimization doesn't involve any broader theory of reality. It is essentially throwing enough heuristics at a problem, and tuning them enough, that they more or less "accidentally" work.
You use experiments to show this heuristic approximation "works", but this sort of approach can't be based on a larger theory of the domain.
And it's logical that there can't be a single theory of how any approximation to any domain works. You can have a bunch of ad-hoc descriptions of approximations, each of which works for a number of common domains, but it seems logical that these will remain forever not-a-theory.
I mean, maybe some day, but right now, we're poking at like 0.00000000001% of the space, and that is state-of-the-art progress.
But how does it work? It's enough to outpace other implementations, alright. But the model even works on a consumer machine, if I remember correctly.
I have only read a few abstract descriptions and I have no idea about deep learning specifically. So the following is more musing than summary:
They use the Monte Carlo method to generate a sparse search space. The data structure is likely highly optimized to begin with. And it's not just a single network (if you will, any abstract syntax tree is a network, but that's not the point), but a whole architecture of networks -- modules from different lines of research pieced together, each probably with different settings. I would be surprised if that works completely unsupervised; after all, it took months to go from beating Go to chess. They can run it without training the weights, but likely because the parameters and layouts are optimized already, and, to the point of the OP, because some optimization is automatic. I guess what I'm trying to say is, if they extracted features from their own thought process (i.e. domain knowledge) and mirrored that in code, then we are back at expert systems.
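To illustrate "the Monte Carlo method to generate a sparse search space": plain UCT-style Monte Carlo tree search samples only a tiny, statistics-guided slice of the game tree. This is not AlphaZero's actual algorithm (no network guides the search here, and it ignores two-player sign flipping), and the `game` interface is hypothetical:

```python
import math
import random

# Plain UCT Monte Carlo tree search sketch. The `game` object
# (is_terminal, legal_moves, apply, reward) is a hypothetical interface.

class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def uct_score(parent, child, c=1.4):
    if child.visits == 0:
        return float("inf")          # always try unvisited children first
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore

def search(game, root_state, n_simulations=1000):
    root = Node(root_state)
    for _ in range(n_simulations):
        # 1. Selection: walk down the tree using the UCT score.
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: uct_score(node, ch))
        # 2. Expansion: add one child per legal move of a non-terminal leaf.
        if not game.is_terminal(node.state):
            for move in game.legal_moves(node.state):
                node.children.append(Node(game.apply(node.state, move), parent=node))
            node = random.choice(node.children)
        # 3. Simulation: random rollout to a terminal state.
        state = node.state
        while not game.is_terminal(state):
            state = game.apply(state, random.choice(game.legal_moves(state)))
        reward = game.reward(state)
        # 4. Backpropagation: update statistics along the visited path.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # The recommended move is the most-visited child of the root.
    return max(root.children, key=lambda ch: ch.visits)
```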
PS: Instead of letting processors run small networks, take advantage of the huge neural network experts have in their head and guide the artificial neural network into the right direction. Mostly, information processing follows insight from other fields, and doesn't deliver explanations. The explanations have to be there already. It would be particularly interesting to hear how the chess play of the developers involved has evolved since and how much they actually do understand the model.
Note that I'm not saying that Google is doing something stupid or leaving potential gains on the table. What I'm saying is that their methods make sense when you are able to perform enough experiments to actually make data-driven decisions. There is just no way to emulate that when you don't even have the budget to try more than one value for some hyperparameters.
And since you mentioned chess: The paper https://arxiv.org/pdf/1712.01815.pdf doesn't go into detail about hyperparameter tuning, but does say that they used Bayesian optimization. Although that's better than brute force, AFAIK its sample complexity is still exponential in the number of parameters.
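For a sense of what that looks like in practice, here is a minimal sketch assuming scikit-optimize; the objective is a cheap stand-in for a full training run, and all the ranges are made up:

```python
from skopt import gp_minimize
from skopt.space import Real, Integer

# Hypothetical objective: one full training run, returning a validation loss.
# In practice each call is expensive, which is exactly why sample complexity matters.
def objective(params):
    lr, l2, n_layers = params
    return (lr - 1e-3) ** 2 + (l2 - 1e-4) ** 2 + abs(n_layers - 4)  # stand-in

space = [
    Real(1e-5, 1e-1, prior="log-uniform", name="lr"),
    Real(1e-6, 1e-2, prior="log-uniform", name="l2"),
    Integer(2, 8, name="n_layers"),
]

result = gp_minimize(objective, space, n_calls=20, random_state=0)
print("best hyperparameters:", result.x, "best loss:", result.fun)
```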
> they used Bayesian optimization. Although that's better than brute force, AFAIK its sample complexity is still exponential in the number of parameters.
I guess the trick is to cull the search tree by making the right moves, forcing the opponent's hand?
Hyperparameters are things like the number of layers in a model, which activation functions to use, the learning rate, the strength of momentum and so on. They control the structure of the model and the training process.
This is in contrast to "ordinary" parameters which describe e.g. how strongly neuron #23 in layer #2 is activated in response to the activation of neuron #57 in layer #1. The important difference between those parameters and hyperparameters is that the influence of the latter on the final model quality is hard to determine, since you need to run the complete training process before you know it.
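A small illustration of the distinction, assuming PyTorch; the specific numbers are arbitrary:

```python
import torch
import torch.nn as nn

# Hyperparameters: chosen by hand (or by search) before training starts.
N_LAYERS = 3          # structure of the model: number of hidden layers
HIDDEN = 64
LEARNING_RATE = 1e-3  # controls the training process
ACTIVATION = nn.ReLU

# Build the model from the hyperparameters.
layers = [nn.Linear(10, HIDDEN), ACTIVATION()]
for _ in range(N_LAYERS - 1):
    layers += [nn.Linear(HIDDEN, HIDDEN), ACTIVATION()]
layers.append(nn.Linear(HIDDEN, 1))
model = nn.Sequential(*layers)

# "Ordinary" parameters: the weights inside the layers, i.e. how strongly one
# unit responds to another. These are what gradient descent adjusts.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params} trainable parameters across {N_LAYERS} hidden layers")

optimizer = torch.optim.SGD(model.parameters(), lr=LEARNING_RATE)
```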
To specifically address your chess example, there are actually three different optimization problems involved. The first is the choice of move to make in a given chess game to win in the end. That's what the neural network is supposed to solve.
But then you have a second problem, which is to choose the right parameters for the neural network to be good at its task. To find these parameters, most neural network models are trained with some variation of gradient descent.
And then you have the third problem of choosing the correct hyperparameters for gradient descent to work well. Some choices will just make the training process take a little longer, and others will cause it to fail completely, e.g. by getting "stuck" with bad parameters. The best ways we know to choose hyperparameters are still a combination of rules of thumb and systematic exploration of possibilities.
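A toy illustration of that third problem: the same gradient descent loop converges or blows up depending on a single hyperparameter, the learning rate.

```python
# Minimize f(x) = x^2 with plain gradient descent; only the learning rate changes.
def gradient_descent(lr, steps=50):
    x = 5.0                  # initial parameter
    for _ in range(steps):
        grad = 2 * x         # gradient of f(x) = x^2
        x = x - lr * grad
    return x                 # the true minimum is at x = 0

for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: final x = {gradient_descent(lr):.3g}")
# lr=0.01: slow progress; lr=0.1: converges; lr=1.1: diverges completely.
```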
And the loss surfaces vary. Even just changing the dataset or even the input size alters the loss surface and can easily break a model.
It's not called Gradient Descent by Grad Student for nothing.
Also, which other topologies were tried and failed to produce good results. It's amazing that this information is missing from most modern ML papers.
I don't believe that's necessarily true, but it will surely hamper the authors' hopes.
Then you apply statistics. Which are the foundations of machine learning.
At the moment we are in a phase where, to stick to the optics metaphor, we stack up lenses until we see the object on the screen. This means we end up with models that are sprawling, instead of models that were engineered.
Another trend that seems to be starting in deep learning is that layers are becoming more constrained. I expect that in 20 years we will see much more constrained models and much more generative models.
2. hopefully with time we'll have better approaches to engineer all things that are engineered
No, at the moment we go for the biggest and shiniest lens that we can get our hands on and hope that it's capable enough to tackle our problem. If it is, we can waste time designing a smaller, more constrained lens to ship to consumers.
There's a reason why Lecun wanted to rebrand deep learning as differentiable programming. https://www.facebook.com/yann.lecun/posts/10155003011462143
I'm not sure what wirrbel meant.
For example, a classifier that tells you cat or not can't be used with one that says running or not to get running cat.
The benefit being that you could put together more "off the shelf" models into products. Instead, you have to train up pretty much everything from the ground up. And we compare against others doing the same.
@wirrbel, that accurate?
Are there fields where this is an apt description?
It's scary, but to my thinking, inevitable. It was all well and good for the early atomic scientists to say that "Math is unreasonably effective at explaining Nature,"  but our level of understanding of both mathematics and natural law is still superficial in several important areas. The universe doesn't owe us a formal theory of anything, much less everything.
It seems likely that we will soon start building -- and relying upon -- software that no human actually understands. The math community is already having to confront this dilemma to some extent, when an outlying figure like Mochizuki releases a massive work that takes months for anyone else to understand, much less prove or refute.
At some point we will have to give up, and let the machines maintain our models for us.
We don't need a theory that is perfect. Each theory was partially wrong but still lets you make useful predictions about the world. We need useful models that let you reason about the world. All models are wrong, some are useful.
> What if intelligence, both animal and machine, is purely random trial and error and "this thing seems to work"?
Evolution could be considered just random trial and error. However, until we reach the singularity, we need people to speed up the evolution process by adapting and remixing pieces that worked before. We need models for what each level does so we have ideas of what to try for a new application.
Maybe the useful models exist, but we can't comprehend them, because they're true outside of the set of rules we happened to get built into our minds?
generally though, i'm on board with you. all models are wrong, some models are useful.
This is stretching things a bit. Specifically, it defines truth as 'does not lead to a contradiction in (some formal system that extends) Peano arithmetic'.
Then, as there are statements that are 'true' in this sense in such a system A but not provable by that system, there are 'unprovable truths'.
But is that satisfactory as a definition of truth? It used to be, because we had hope for a complete and consistent formal system, which felt very truthy. When Gödel proved that such a system cannot exist, perhaps the conclusion is that formal systems aren't the 'base' for truth.
Since our minds are more like perception-action variational inference systems, that's a hell of a lot of pretending ;-).
There’s a mass influx of newcomers to our field and we’re equipping them with little more than folklore and pre-trained deep nets, then asking them to innovate.
The message I've gotten is "try things out". Innovation isn't necessarily improving specific techniques, but applying them to new fields. To apply techniques to things that are more mundane like data processing in non-AI-focused companies, you're gonna need bodies who know how to apply these newer programming techniques to solve problems.
Not every electrician has to understand electrical engineering.
i think this is especially important if you purely want to do applications. we have a bag of tricks (dropout, batchnorm, different optimizers and learning rate schedules). we have no real theory for why any of this should work; often a proposed explanation will later turn out not to make sense.
so the choice of how to train things comes down to "folklore", the community's collective experience. and there's no guarantee that folklore will generalize to your new architecture or dataset, and no way to know whether it even should.
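to make the "bag of tricks" concrete, this is roughly what it looks like in code (a pytorch-style sketch, purely illustrative; none of these choices come from theory):

```python
import torch
import torch.nn as nn

# a fairly typical "tricks included" model: batchnorm and dropout are added
# because they empirically tend to help, not because theory says they must
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # batchnorm: usually stabilizes and speeds up training
    nn.ReLU(),
    nn.Dropout(p=0.5),     # dropout: usually reduces overfitting
    nn.Linear(256, 10),
)

# optimizer and learning rate schedule: also chosen by folklore and experience
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```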
the presentation seems to have struck a nerve and there's papers and talks floating around now examining the performance of common architectures in very simple settings. it's probably worth paying attention to these at least in the background, as it will hopefully crystallize into a body of knowledge that will be useful for someone trying to decide on architectures and optimization techniques.
"The power of digital computing is the power of modular expansion of objects. Analog circuits and computers don't have that. And current trained deep learning models don't have combinability and modularity either."
A point I'd like to make: the brain exhibits properties of both digital and analog computers. It also exhibits repeating units in the neocortex which do vary but are uniform enough that neuroscientists are comfortable classifying them as discrete units within the brain.
I believe we must look to how the brain implements effective modularity in the context of analog computation in order to replicate the success of digital computers with deep nets.
When you're building digital circuits, they're expected not to care about what the bits mean or which patterns are more likely. They work for all possible inputs, with equal quality.
There are things in common with how you would process faces and how you would recognize other visual objects, and that's why there are design patterns such as "convolutional layers come before fully-connected layers".
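For instance, that pattern typically looks something like this (a generic sketch, assuming PyTorch and 32x32 RGB inputs, not tied to faces):

```python
import torch.nn as nn

# Convolutional feature extraction first, fully-connected classification last.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 128), nn.ReLU(),   # 32x32 input -> 8x8 feature maps
    nn.Linear(128, 10),
)
```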
In a way, the "no free lunch" theorem says that you are always paying a price when you specialize to a certain kind of patterns. It comes at the detriment to other patterns. So, any kind of stack of theories on ML/DL is going to be incomplete unless you say something about the nature of your data/patterns.
(That doesn't mean that we can't anything useful about DL, but it just puts a certain damper on those efforts.)
What I'm trying to say is that PhDs come from an academic research background, while engineers come from a product-focused background. The deep learning field is still dealing with a lot of unknowns, counter-intuitive responses to modifications, and pure experimentation. The engineers might just not realize the need for continued experimentation, and, for them, it may just feel like an undesirable waste of time to fiddle with parameters (as in, taking away time from developing the actual product).
It's an alternate point of view, but something that I experienced.
The only thing I've found really useful until now is to put 2 fully-connected layers if the classifier does not handle the classification well... just because you needed a hidden perceptron layer for the XOR case.
I hope to find more examples like that. If you know them, please share!!
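On the XOR point above, here is a tiny self-contained illustration: no single linear boundary separates XOR, but a two-layer network with hand-picked weights does (the weights are chosen by hand just to make the point, not learned):

```python
import numpy as np

# XOR: the classic example of why a hidden layer matters.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# A single linear threshold on the inputs cannot produce XOR, but two hidden
# units (an OR gate and an AND gate) plus one output unit can:
def two_layer(x):
    h1 = (x[0] + x[1] - 0.5) > 0          # hidden unit 1: OR
    h2 = (x[0] + x[1] - 1.5) > 0          # hidden unit 2: AND
    return int((h1 - 2 * h2 - 0.5) > 0)   # output: OR AND (NOT AND) -> XOR

print([two_layer(x) for x in X], "expected", list(y))
```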
PS: The parallel between DL and optics is (if viewed historically) a bit misleading, because for building lenses we first had a theory.
... which references an even better one:
We've been having a solid laughfest in the office for the past 10 minutes or so.
This reminds me of my bioinformatics class. The final project was to reproduce the results of a famous paper in the field.
All of us spent _weeks_ trying to do it. Nobody succeeded. The more we dug into the paper, the more holes appeared. There were variables missing in the paper, assumptions not covered, datasets not properly specified, etc. It made reproduction nearly impossible; like winning the lottery. Imagine trying to recreate the results of a deep learning paper without the paper specifying _any_ information about the layers used, their sizes, or any hyperparameters.
The professor was equally mystified.
Years later I learned this kind of pseudo-science is rife in the field of bioinformatics. I felt both a sense of relief in knowing we weren't crazy, and disappointment. I actually really liked that class; the field of bioinformatics fascinated me. But realizing what a cesspool it was left me disappointed.
I'm glad machine learning as a field has taken proactive steps to avoid these exact kinds of issues. It's now common practice in ML to publish code and models alongside your papers, and most ML libraries allow deterministic training. This makes reproduction of results easy. It's a breath of fresh air. That doesn't obviate all problems. Methodologies and conclusions are still up for debate in any given paper. But at least the experiments themselves are reproducible. And if you question the methodology or some aspect of the experiment, you can go in and augment the experiment yourself.
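For example, the deterministic-training part usually boils down to pinning every source of randomness up front; a minimal sketch, assuming PyTorch (exact flags vary by library and hardware):

```python
import random
import numpy as np
import torch

SEED = 42
random.seed(SEED)                           # Python's own RNG
np.random.seed(SEED)                        # NumPy RNG (data shuffling, etc.)
torch.manual_seed(SEED)                     # CPU RNG for weight init, dropout, ...
torch.cuda.manual_seed_all(SEED)            # GPU RNGs
torch.backends.cudnn.deterministic = True   # pick deterministic cuDNN kernels
torch.backends.cudnn.benchmark = False      # disable non-deterministic auto-tuning
```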
A friend of mine spent a good chunk of his PhD trying to reproduce an experiment involving growing primary cells in serum-free medium (the idea was to use that experiment as a starting point, and explore more aspects of it). The protocol was:
1. Grow some regular immortal cells in serum-based media in a dish, so they coat the dish with extracellular matrix
2. Use trypsin to detach the cells from the dish and remove them
3. Wash the dish carefully to remove all traces of serum, but leaving the extracellular matrix
4. Plate the primary cells onto the dish and grow them in serum-free medium
He tried for months and couldn't get the cells to grow. Then he got sloppy, didn't wash the dishes as carefully as he should, and bingo, the primary cells grew fine, as described in the original paper.
After some subtle digging, the inescapable conclusion was that the original authors had not washed their plates all that carefully either, and the serum-free medium was not exactly that. The whole premise of the experiment was flawed.
Yeah about that, I've got some bad news...
Dieselgate started with a team of students attempting to reproduce VW's claimed emission numbers.
This was a decade ago. Looking at the paper again I believe we only tried to reproduce a small portion of it; the phylogeny tree from the paper and its supplemental material.
Sounds like a site dedicated to "My Ass" results would be extremely popular with grad students and real-world researchers. Being able to know "it's not just me" and maybe even avoid some of the stumbling blocks others have run into, or to not just blindly use some approach that happened to work for one experiment but seems to fail for many others.
I agree in general, but I'd also love to see those published as actual beautiful papers, not just ugly formatted websites. (Okay, the website in this case isn't that bad. At least it's clearly structured and readable.)
These would be mostly short papers, for sure. But there could be a separate section in the journals for them - just like the "outtakes" section at the end of a movie.
But if this encourages other people to write a follow-up paper that fixes the issue, it would still serve an important purpose.
Medicine has a good tradition of adverse clinical writeups. "Patient presented with X symptoms, I administered Y treatment as recommended by [Z], and the patient got worse." One such writeup isn't conclusive evidence against Y, but suggests an issue to look into.
> Following the popularity of MapReduce, a whole ecosystem of Apache Incubator Projects has emerged that all solve the same problem. Famous examples include Apache Hadoop, Apache Spark, Apache Pikachu, Apache Pig, German Spark and Apache Hive 
Looking at his resume, he did wise up and did his master's thesis in computer science. I trust he's happier now than as an undergrad student.
The standard technique is to set up a "Kelvin probe", with four contacts on the Ge sample. Pass a current from a constant current source (an IC or FET these days) between the outer two contacts and measure the voltage across the inner ones.
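The arithmetic behind the measurement is simple; a small sketch with made-up numbers (the geometry factor assumes a uniform bar of known cross-section):

```python
# Four-terminal (Kelvin) measurement arithmetic, with hypothetical values.
# Current is forced through the outer contacts; voltage is read across the
# inner pair, so contact resistance drops out of the measurement.
I_FORCED = 1e-3       # amps, from the constant-current source
V_MEASURED = 47e-3    # volts, across the inner contacts

R = V_MEASURED / I_FORCED          # ohms between the inner contacts

# For a uniform bar of known geometry, resistivity follows directly:
LENGTH = 5e-3         # meters, spacing of the inner contacts
AREA = 1e-6           # square meters, cross-section of the sample
rho = R * AREA / LENGTH            # ohm-meters

print(f"R = {R:.1f} ohm, resistivity = {rho:.3g} ohm*m")
```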
It doesn't sound like his lab assistant set up something at which he could succeed, and that's a shame. He couldn't even repeat the room temperature reading.