Deep Learning and Its Implications for Computer Architecture and Chip Design (arxiv.org)
londons_explore 1608 days ago [-]
Badly formatted paper with a hand-wavy abstract, no real focus, and buzzwords aplenty... I'll pass...

Oh - it's written by Jeff Dean, inventor of MapReduce, Bigtable, and TensorFlow, and practically a god... Yeah, I'll read it!

londons_explore 1608 days ago [-]
Read it. Worth a read, especially for those not closely following the machine learning world.

The last section, focusing on a single large, sparsely activated model that can accomplish thousands of different tasks by routing to a selection of internal 'experts', interests me the most.
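
Roughly, the sparsely activated 'experts' idea looks like the sketch below: a small gating network picks a few experts per input and only those get evaluated. (Toy NumPy sketch; the layer sizes, expert count, and top-k value are made up for illustration, and it ignores the load-balancing tricks a real system needs.)

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy dimensions, chosen only for illustration.
    d_model, d_hidden, n_experts, top_k = 16, 32, 8, 2

    # Each "expert" here is just a tiny two-layer MLP.
    experts = [
        (rng.standard_normal((d_model, d_hidden)) * 0.1,
         rng.standard_normal((d_hidden, d_model)) * 0.1)
        for _ in range(n_experts)
    ]
    gate_w = rng.standard_normal((d_model, n_experts)) * 0.1

    def moe_layer(x):
        """Route the input to its top-k experts and mix their outputs."""
        logits = x @ gate_w                 # one gating score per expert
        top = np.argsort(logits)[-top_k:]   # indices of the k highest-scoring experts
        weights = np.exp(logits[top])
        weights /= weights.sum()            # softmax over the selected experts only
        out = np.zeros_like(x)
        for w, i in zip(weights, top):
            w1, w2 = experts[i]
            out += w * (np.maximum(x @ w1, 0.0) @ w2)  # only k of the 8 experts run
        return out

    print(moe_layer(rng.standard_normal(d_model)).shape)  # (16,)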

I suspect this type of model isn't used much today simply because each company using ML typically has only a few problems to solve. If someone like Google, with a far larger and more varied set of problems, can get this type of model to work and demonstrate its effectiveness, I think it would be a big step towards artificial general intelligence.

Jeff Dean has a lot of respect and influence inside Google, and his ideas tend to get implemented. I'm looking forward to it!

sanxiyn 1608 days ago [-]
Sparsely activated multitask models are something of a hobby horse of Jeff Dean's. The idea was published in 2017: https://arxiv.org/abs/1701.06538. My assessment is that it is an intriguing but ultimately failed experiment, like Geoffrey Hinton's capsule networks.
acollins1331 1608 days ago [-]
Capsule networks are not failed experiments! Where is this coming from? They merely haven't been applied as widely as CNNs or FCNs, but there are plenty of papers where capsule networks outperform those architectures.

Source: my thesis using capsule networks for semantic segmentation of aerial imagery

XuMiao 1608 days ago [-]
I like the capsule idea too. In some ways, a capsule network is very similar to a sparse attention network; the difference is in how the normalization is done. Attention is normalized over the inputs, while capsules are normalized over the outputs. Capsules can potentially yield much cleaner patterns, whereas the patterns generated by attention networks can overlap. They are just much harder to optimize.
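
One way to see the normalization difference being described, with plain softmaxes over a toy score matrix (this glosses over iterative routing and everything else capsules do; shapes are illustrative only):

    import numpy as np

    def softmax(z, axis):
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(0)
    scores = rng.standard_normal((5, 3))  # 5 inputs x 3 outputs

    # Attention: weights are normalized over the inputs, so each output's
    # column of weights sums to 1 (inputs compete for each output).
    attn = softmax(scores, axis=0)

    # Capsule-style routing: coupling coefficients are normalized over the
    # outputs, so each input's row sums to 1 (each input votes on where to
    # send its output, which tends to give cleaner, less overlapping assignments).
    routing = softmax(scores, axis=1)

    print(attn.sum(axis=0))     # ~[1. 1. 1.]
    print(routing.sum(axis=1))  # ~[1. 1. 1. 1. 1.]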
jeffshek 1608 days ago [-]
Risky to say, considering that Hinton's deep learning was also once called a failed experiment...
fizixer 1608 days ago [-]
Short of TeX/LaTeX, I can put up with a Google Docs paper.

I absolutely cannot stand an MS-Word paper. It's crippling.

zwieback 1608 days ago [-]
Are you saying it's impossible to format a paper in Word to please you? Seems like there are a lot of options to get the look you want.
fizixer 1608 days ago [-]
Well, folks who prepare documents in Word, especially formal documents, more often than not pick the default rendering of Times New Roman (and I know the default in MS-Word is not TNR).

And a standard MS-Word TNR document looks like crap. I'm sorry.

But I have to admit I over-react to seeing an MS-Word document after spending more than a decade working exclusively with TeX/LaTeX. So part of the blame goes to me, and it actually hurts my ability to keep up with the literature when I start avoiding MS-Word papers.

zwieback 1608 days ago [-]
Yeah, the default Word style definitely isn't anything like a typeset document. For fun I downloaded the "Latex.dot" template and some Computer Modern fonts; with those you can make a first approximation of the LaTeX look in Word, but it's still not the same.

I've never been a fan of the Computer Modern font, but the page layout and formatting in LaTeX are certainly nice, and just having a standard for scientific papers is a plus.

Veedrac 1608 days ago [-]
> Figure 2 shows this dramatic slowdown, where we have gone from doubling general-purpose CPU performance every 1.5 years (1985 through 2003) or 2 years (2003 to 2010) to now being in an era where general purpose CPU performance is expected to double only every 20 years [Hennessy and Patterson 2017].

This isn't true. CPU performance has stagnated very recently due to Intel's struggles with 10nm, but we look to be leaving that behind us. Even if we weren't, it's still not relevant for ML, since within the last decade GPUs have improved by a factor of ~10, which means the terrifying Figure 2 doesn't hold up.
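
For scale: a ~10x improvement over a decade works out to a doubling time of about three years, nowhere near the 20-year figure quoted above. A quick back-of-the-envelope check (the 10x-per-decade GPU number is the estimate from this comment, not a measured figure):

    import math

    def doubling_time(speedup, years):
        """Doubling period implied by an overall speedup over a span of years."""
        return years * math.log(2) / math.log(speedup)

    def total_speedup(doubling_years, span_years):
        """Cumulative speedup implied by a given doubling period."""
        return 2 ** (span_years / doubling_years)

    print(doubling_time(10, 10))   # ~3.0 years, for ~10x GPU improvement per decade
    print(total_speedup(1.5, 18))  # ~4096x over 1985-2003 at the paper's 1.5-year doubling
    print(total_speedup(20, 10))   # ~1.4x per decade under the paper's 20-year doubling claim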

amelius 1608 days ago [-]
Does anyone happen to have a link to a paper or book describing the state of the art in placement and routing algorithms? I'd like to read up on that topic.
kernyan 1608 days ago [-]
These two books might be useful:

1) https://www.oreilly.com/library/view/electronic-design-autom...
2) https://www.crcpress.com/Electronic-Design-Automation-for-IC...

For the first book, see chapters 10-12 (on floorplanning, placement, and routing). The end of chapter 11 points you to some literature surveys as well. But the book itself is somewhat dated (published in 2009).

I haven't read the second book, but it's much more recent (published in 2018), and it also has chapters on placement and routing.

solidasparagus 1608 days ago [-]
This is a really interesting paper on using RL for device placement - https://ai.google/research/pubs/pub46646.
sorenn111 1608 days ago [-]
Potentially a noob question: with Moore's law slowing down, are there enough specializations/hardware modifications available, like those mentioned in the paper, for progress in ML to continue at a rapid pace? Or will these advancements simply forestall an inevitable asymptote?
retrac 1608 days ago [-]
It's little more than an educated guess on my part, but I figure there are about two orders of magnitude of improvement in processing speed exploitable with current processes, if a big-budget chip were designed specifically for ML training. GPUs are architecturally far from optimal for the task.

You want something like a chip with a huge mesh of small independent cores, each with its own local storage, quite possibly with non-digital circuits that quickly approximate the required functions in analog electronics rather than doing all of the calculations digitally. Some variation on that is the approach both Intel and IBM have taken with their "neural chips" in the last few years.
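
One concrete version of the analog idea is a resistive crossbar: encode the weight matrix as conductances, apply the input vector as voltages, and Ohm's and Kirchhoff's laws give you the matrix-vector product as summed currents in a single step. A toy digital simulation of that (idealized; real devices add noise, quantization, and nonlinearity, crudely emulated here):

    import numpy as np

    rng = np.random.default_rng(1)

    W = rng.uniform(0.0, 1.0, size=(4, 6))  # target weights, mapped to conductances G
    x = rng.uniform(0.0, 1.0, size=6)       # input activations, applied as column voltages V

    # Each output row wire sums the currents I = G * V flowing into it
    # (Kirchhoff's current law), which is exactly a matrix-vector product.
    exact = W @ x

    # Emulate a non-ideal device: conductances quantized to a few levels, plus noise.
    G_device = np.round(W * 16) / 16 + rng.normal(0.0, 0.01, size=W.shape)
    analog_estimate = G_device @ x

    print(np.max(np.abs(analog_estimate - exact)))  # small error from quantization + noise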

It seems that analog computers are finally getting their revenge.

solidasparagus 1608 days ago [-]
This doesn't seem to match my experience with ML and GPUs/ASICs.

TPU is the main ML ASIC in use. A major goal of the original TPU design seems to have been reducing the number of memory accesses. The other top-end ML device is NVIDIA's GPUs with Tensor Cores. Both of those chips are designed around fast matrix multiplication, which right now seems to be the most important operation in deep learning; see how RNNs have started to fall out of favor compared to CNN-based networks with attention heads.

The TPU is not faster than NVIDIA's GPUs, but it is cheaper. Right now the future seems to be cheaper ML devices designed to be horizontally scalable.

From the CPU perspective, it appears that the major ML effort is related to vectorizing instructions via advanced instruction sets.

Everyone who creates silicon is focused very heavily on using smaller and smaller numeric types: float16 is standard, and there is work being done on even smaller integer-based formats.
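
A rough look at what those smaller types cost in accuracy for a matmul, using float16 and a crude simulated int8 (this is only about numerics; the real motivation is the throughput and memory-bandwidth gains on hardware):

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.standard_normal((256, 256)).astype(np.float32)
    b = rng.standard_normal((256, 256)).astype(np.float32)

    ref = a @ b  # float32 reference
    half = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

    def quantize_int8(x):
        """Crude symmetric per-tensor int8 quantization, for illustration only."""
        scale = np.abs(x).max() / 127.0
        return np.round(x / scale).astype(np.int8), scale

    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    int8_est = (qa.astype(np.int32) @ qb.astype(np.int32)).astype(np.float32) * sa * sb

    print(np.abs(half - ref).mean())      # float16: small error
    print(np.abs(int8_est - ref).mean())  # int8: larger error, often recoverable with training tricks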

I haven't seen any analog-based ML devices in use. Can you share an example? Is there even a way to approximate the results of a matmul using analog devices?

It's impossible to guess how much more speed we can get with current approaches, but everything from the silicon to the networking stack to the libraries to the network architectures is in its infancy, so I would expect dramatic improvements in performance on a regular basis (though not as regular as in other areas of software, because silicon development is slow).

sanxiyn 1607 days ago [-]
A TPU v3 is rated at 420 teraflops, while a V100 GPU is rated at 125 teraflops.
solidasparagus 1607 days ago [-]
What does TPU v3 mean there? A TPU v3-8? In that case you are comparing 8 cores / 4 chips to a single GPU, which hardly seems fair. It's hard to compare across ASICs. In practice, the largest readily available units of compute seem to be 8 V100s vs. one 'Cloud TPU' (TPU v3-32). Those two have relatively similar performance in practice (FLOPS seems to be a very poor way to compare across ASICs), although the TPU is typically several times less expensive.
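
Rough arithmetic behind that, assuming the 420 TFLOPS figure is for a whole TPU v3-8 (4 chips / 8 cores) and the 125 TFLOPS figure is one V100's peak fp16 tensor-core rate, as the public spec sheets suggest:

    # Back-of-the-envelope peak-FLOPS comparison; real training throughput
    # depends far more on memory bandwidth, interconnect, and software.
    tpu_v3_8_tflops = 420        # whole v3-8 board (4 chips, 8 cores)
    tpu_chips_per_board = 4
    v100_tflops = 125            # single V100, peak fp16 tensor cores

    print(tpu_v3_8_tflops / tpu_chips_per_board)  # ~105 TFLOPS per TPU v3 chip
    print(8 * v100_tflops)                        # ~1000 TFLOPS peak for 8 V100s
    print(tpu_v3_8_tflops * (32 / 8))             # ~1680 TFLOPS for a v3-32, if peak scales linearly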
buboard 1608 days ago [-]
Brain floating point ... cool name.

I guess the brain's synaptic precision could go way lower, as low as 26 distinct synapse weights: https://elifesciences.org/articles/10778
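
That figure works out to only about 4.7 bits of information per synapse, versus the 16 bits of bfloat16:

    import math
    print(math.log2(26))  # ~4.70 bits per synapse for 26 distinguishable weights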

> A particularly interesting research direction puts these three trends together, with a system running on large-scale ML accelerator hardware, with a goal of being able to train a model that can perform thousands or millions of tasks in a single model. Such a model might be made up of many different components of different structures,

yup, he is building a brain

gnode 1608 days ago [-]
> many different components of different structures

> yup, he is building a brain

The limbic system of the brain is made up of many different structures. It handles much of the brain's ancient, fixed-function, instinctive operation; it is heavily involved in sleep, reflexes, appetite, and motivation, for example.

However, our general intelligence and learning capability are mostly due to the neocortex, which has a highly regular structure. The neocortex also subsumes the roles of much of the limbic system as development progresses, overriding those specific structures.

This suggests to me that intelligence/learning doesn't benefit from specific structures so much as from general ones, capable of encoding behaviour as data (via long-term potentiation of synaptic weights, in the brain's case).

quotemstr 1608 days ago [-]
> i guess brain's synaptic precision could go way lower, as low as 26 distinct synapse weights: https://elifesciences.org/articles/10778

Thanks for the link. Artificial neural networks have been pushed all the way down to binary weights [1] (sketched below), although that approach doesn't seem like the most efficient one. It's interesting how much variability we're still seeing in ML architectures: it suggests we haven't stumbled on the right approach yet. It reminds me of how early aviation had a huge diversity of aircraft designs, but now, after a lot of optimization, we've settled on the one standard airliner shape everyone uses everywhere.

[1] https://arxiv.org/abs/1602.02830
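
For context, a minimal sketch of what 'binary weights' means in [1]: the forward pass uses only +1/-1 weights, while a real-valued shadow copy is kept for the gradient updates (the straight-through trick). Sizes and data below are made up, and this is a bare-bones caricature of the paper's method:

    import numpy as np

    rng = np.random.default_rng(0)
    w_real = rng.standard_normal((8, 4)) * 0.1  # real-valued shadow weights, used only for updates
    x = rng.standard_normal((16, 8))
    y_true = rng.standard_normal((16, 4))
    lr = 0.01

    for _ in range(200):
        w_bin = np.sign(w_real)               # forward pass sees only +1/-1 weights
        w_bin[w_bin == 0] = 1.0
        y = x @ w_bin
        grad_y = 2.0 * (y - y_true) / len(x)  # gradient of mean squared error
        grad_w = x.T @ grad_y                 # straight-through: gradient w.r.t. w_bin applied to w_real
        w_real -= lr * grad_w
        w_real = np.clip(w_real, -1.0, 1.0)   # keep the shadow weights bounded

    print(np.unique(np.sign(w_real)))  # the deployed weights are just -1 and +1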

buboard 1608 days ago [-]
A formal theory of deep learning is proving to be much more elusive than the theory behind aviation ever was. Interesting times, though.
the8472 1608 days ago [-]
> yup, he is building a brain

Plus applying ML to improving the underlying hardware and software. On the other hand, it's not recursive yet, and the end of Moore's law throws a wrench into exponential self-improvement, but it's still a little concerning.
