I couldn't find the actual YouTube video where a Google employee shows the 40% energy cost; if you do, please link it. Meanwhile, here is another graph that is somewhat similar: https://perspectives.mvdirona.com/2008/11/cost-of-power-in-l...
Intel microarchitectures tend to be heavily optimized around a workload that assumes a CPU core is doing heavy scalar computation all the time. If your workload and code actually look like that -- and many kinds of server workloads do -- you can't beat them for performance per watt. But this comes at the cost of things like power consumption when idle. Most personal computing, on the other hand, spends most of its time idle and may benefit significantly from architectures that optimize more for power consumption at low utilization levels. There are many ways to attack these problems at the architectural level.
This becomes complicated to optimize in intermittent compute-intensive environments, since over-optimizing for the idle time effectively increases the amount of time you spend in a less efficient compute-intensive state (see also: Apple's big/little core hybrid ARM architecture). There was a lot of empirical work done on this in the supercomputing world, which is sensitive to power costs and had built ARM clusters. The consensus in that environment, for example, seemed to be that throughput per watt under max load was the driver of power costs, which favored Intel.
Modeling this is difficult because it is a complex interaction between specific pieces of code and specific architecture implementations. Even just comparing Intel and AMD, they have different tradeoffs for real code because the architectures are optimized differently. For example, while some codes run very efficiently on AMD's Epyc, I have a couple old codes that have terrible performance on Epyc that can be traced to design decisions made by AMD. You can't design a CPU that is the best at everything or even most things.
Do you think the industry will move to specialized processors dependent on workload type? For example, for serverless, cloud providers could move workloads to the most efficient microarchitecture.
There's a paper from a few years ago that looked into this in some detail – ftp://doc.nit.ac.ir/cee/computer/alinejhad.saeedeh/old/95-96-1/MS/Advanced%20Architecture/project/paper/6-ISA%20Wars%20Understanding%20the%20Relevance%20of%20ISA%20being%20RISC%20or%20CISC.pdf – there's no immediate reason to think the conclusions should have changed.
Both Intel and AMD have demonstrated they can scale these effectively. The A12X isn’t even playing in the same league when it comes to uncore functionality. I’m sure if you created a version of the A12X with the same cache, memory, PCIe, core count, and interconnections, the power picture would look a lot different.
This is part of the reason why “datacenter arm” CPUs never get much of a foothold: as soon as you start ramping up your uncore transistor count, your power advantage erodes significantly, to the point where Intel and AMD begin to yield better performance-per-watt in typical server workloads due to a tightly optimized microarch and an “uncore” to feed it.
For example, could one not argue that an x86 binary / code size is on average smaller and more dense than the ARM equivalent, and that therefore on x86 cache misses will be less common, less RAM needs to be used, etc.? Perhaps a denser binary is better for power consumption?
Although this effect would be very marginal even if it is present at all.
But CPUs are just part of the equation - you can make significant gains in other parts of the computer, and memory is a good target. Once we get to the point where PSUs live as independent units plugged into the same rack as your servers, it means we are really concerned about not feeding in one amp more than we need to.
A lot of the low-power ARM processors do that - just avoiding speculative execution like you pointed out is a huge win if power is your main concern.
Another interesting take is barrel processors, which can avoid scheduling stalled contexts, but they are only effective when we can feed them enough threads that the time spent waiting is less than it would be on a "normal" processor. Early Xeon Phis (and the Cell's PPUs) behaved like barrel processors - each core could dispatch 4 (2 on the PPU) threads, but it did so round-robin. Scheduling only one thread per core didn't make them run any faster.
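To make the "one thread per core isn't any faster" point concrete, here is a minimal sketch of an idealized strict round-robin barrel core. It is a toy model, not a model of Xeon Phi or Cell: the function name `barrel_cycles`, the one-instruction-per-cycle issue rule, and the instruction counts are all assumptions made for illustration.

```python
def barrel_cycles(instr_per_thread: int, active_threads: int, slots: int = 4) -> int:
    """Cycles a strict round-robin barrel core needs to drain its active threads.

    Toy model (assumption): the core visits one hardware thread slot per cycle
    in fixed order and issues at most one instruction from that slot.
    Empty slots still consume their turn.
    """
    if not 0 <= active_threads <= slots:
        raise ValueError("active_threads must be between 0 and slots")
    remaining = [instr_per_thread if i < active_threads else 0 for i in range(slots)]
    cycle = 0
    while any(remaining):
        slot = cycle % slots          # fixed rotation, no skipping of empty slots
        if remaining[slot]:
            remaining[slot] -= 1      # issue one instruction from this slot
        cycle += 1
    return cycle


one = barrel_cycles(100, active_threads=1)   # a single thread on the core
four = barrel_cycles(100, active_threads=4)  # all four slots filled
print(one, four)
```

In this model the lone thread finishes in roughly the same number of cycles as four threads do together, so filling the slots gives about four times the throughput for the same wall time.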
Linux supports ARM, and there are massive companies with economies of scale whose stack is Linux/OSS plus their own code, all of which could be cross-compiled. I mean, if you're the average biggish non-tech company then yeah, you might be stuck on x86 because of Windows, but that isn't everyone.
That depends on the software. Many Linux distributions can be compiled for x86, ARM, PPC, RISC-V, and even some others. Cross-compiling is not difficult if your software is written with even a little of that in mind. Windows itself used to be available for 3 or 4 instruction sets - and an OS requires low-level hardware support. Applications should be largely target independent.
Hell, a Cortex-M3 lets you have unaligned loads and stores too.
That being said, server cores with their long memory pipelines are the most likely to not have an issue with unaligned access, so it's probably a moot point.
A decade ago, PUE numbers were much more commonly closer to 2 (Google's were in the 1.2-1.3 range even in 2008, but things had already improved quite a bit since 2005, and MSFT and YHOO were working on it), but modern datacenter design has improved that substantially.
In my understanding, 100% of the power consumed is converted to heat. Measuring efficiency should have a unit like Gigaflops per Megawatt or something, not a fictional percentage. That way you can also account for hardware getting more energy efficient over time, which is now just barely represented in the numbers. It's hard to get a right unit though, because for instance archive.org has a big data center but focuses on storage instead of processing power.
Gflops per watt is also really important, but that's more in Intel/AMD/Nvidia's hands.
Application ops/W is also important, and that combines the system efficiency and the code efficiency.
A key thing is that each of these evolves on different time scales. When you build a datacenter it lasts for multiple generations of processors. Hence PUE.
I would expect that right now, Google is 1. seeing to the re-engineering of certain math libraries they use internally for Google Search et al, and 2. like AWS, intending to create separate cloud-VM instance classes for Intel vs. AMD processors, where each workload type can “pay its own bills.” If AMD continues to dominate in server TDP, then I would expect this to lead to the AMD instance classes getting cheaper while the Intel classes remain the same; and so most customers switching to AMD instances, save for the customers who themselves have workloads optimized for Xeon-specific SIMD instructions.
But you would need to compare total watts for enough systems to get the throughput you need, which I suspect is not the same and is likely workload dependent.
For home use, where you are likely to only have a single system, you can directly compare idle usage, but for Google, how many systems are idle depends on your throughput needs --- and maybe Google can orchestrate shutdown / reuse of idle servers off-peak, so it might be moot.
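A rough sketch of the "total watts for enough systems" comparison, in case it helps. Every number below is hypothetical (the two platforms, their per-server throughput, and their loaded wattage are made up purely to show the arithmetic), and `fleet_power` is just an illustrative helper name.

```python
import math


def fleet_power(target_throughput: int, per_server_throughput: int, watts_loaded: int):
    """Return (servers needed, total watts) to sustain target_throughput.

    Assumes every server runs fully loaded and throughput scales linearly
    with server count, which real workloads may not do.
    """
    servers = math.ceil(target_throughput / per_server_throughput)
    return servers, servers * watts_loaded


# Hypothetical platforms: A draws more per box but does more work per box.
target = 6000  # work units per second, made-up figure
print("A:", fleet_power(target, per_server_throughput=100, watts_loaded=400))
print("B:", fleet_power(target, per_server_throughput=60, watts_loaded=300))
```

With these made-up figures the lower-wattage box loses once you buy enough of them to hit the target, which is the workload-dependent effect described above. Different numbers would flip the result.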
At one point they were taking chips that had failed manufacturers' QA, sticking them on DIMMs themselves in a way so that the ECC would cover those errors, and just running jobs on many machines to make up for the reduced ability of ECC to make guarantees.
And they make their own motherboards too, so just soldering down LPDDR isn't out of the question.
Do you have a link where I can read more about this?
For a while I thought I was going crazy, but a comment from someone on this article implies he saw the same slide deck. https://arstechnica.com/information-technology/2009/10/dram-... I haven't been able to get anyone to reconfirm it from Google's end; my sense is that it's information that wasn't supposed to be released.
Is this something you are looking to build for your own system? I assume the ASIC was proprietary but I could reach out to the vendor and see where they had the modules made (or if they were built in house).
Years ago Google complained that they couldn’t fill up their data centers because they had hit the limits of electrical code. Servers plus A/C took them into no man’s land as far as electrical code was concerned. That also means wasted real estate.
Probably less than you think. To be precise, for every x watts of power Google needs about y = 0.12*x watts of cooling (and other overhead) on average. That's based on the PUE number at <https://www.google.com/about/datacenters/efficiency/internal....
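For anyone who wants the arithmetic spelled out: PUE is total facility power divided by IT power, so the overhead is (PUE - 1) times the IT load. A small sketch, where the 1.12 is simply what the 0.12 factor above implies and the 2.0 is the older ballpark mentioned elsewhere in the thread; the function names are mine.

```python
def overhead_watts(it_watts: float, pue: float) -> float:
    """Cooling and other non-IT overhead implied by a PUE figure.

    PUE = total facility power / IT equipment power, so
    overhead = total - IT = (PUE - 1) * IT.
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return (pue - 1.0) * it_watts


def total_watts(it_watts: float, pue: float) -> float:
    """Total facility draw for a given IT load."""
    return pue * it_watts


it_load = 1000.0  # watts of server load, arbitrary example figure
for pue in (1.12, 2.0):
    print(pue, round(overhead_watts(it_load, pue), 1), round(total_watts(it_load, pue), 1))
```

So at a PUE near 2, every watt of server load cost roughly another watt of overhead, versus about an eighth of a watt at 1.12.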
> Years ago Google complained that they couldn’t fill up their data centers because they had hit the limits of electrical code. Servers plus A/C took them into no man’s land as far as electrical code was concerned. That also means wasted real estate.
I don't remember that, but I'm guessing it was from the days Google's servers were in datacenters built by other companies. They probably outfitted so many circuits at x volts / y amps each, and there's only so much you can do with that.
(To be precise, Google still runs some servers in non-Google datacenters, as part of running a CDN. But this isn't the bulk of Google's processing power.)
"Power shift: Data centers to replace aluminum industry as largest energy consumers in Washington state"
Large custom data centers draw on the order of 300MW, a non-scientific survey of aluminum smelters on the web has their draw around 400-700MW, so I'd be careful about 'ever will'.
Smelting uses a lot of power in a very small space, resulting in a lot of heat, whereas computing is not getting anywhere close to that in terms of density, just in aggregate over giant data centers.
But I’m not sure I understand your point. I looked at Google Maps images of the Wenatchee Alcoa works and the Google data center in The Dalles, and they seem order-of-magnitude the same size to my eyeballs.
You seem to think I don’t appreciate how much energy aluminum smelting requires - I do. But imagine the heating element of your aluminum crucible broken into a million pieces, each piece with a cooling system to keep its temperature down. That’s a data center.
Large data centers are huge buildings; the density is very much limited by heat issues, and while the building might get bigger, the power demand per square foot does not. We can move lots of power through very small areas: consider that individual steam turbines are up at 600 MW.
Even if we start talking about GW for a single datacenter, the single Kashiwazaki-Kariwa plant is already 8GW.
A data center is a much more domesticated space and is often near commercial and residential areas, not industrial ones.
Wikipedia has a list of aluminum smelters and I notice none of the biggest are in the US. Is that just because, or due to code limitations?
Not sure if it was changes in these contracts or other market forces that caused the decline.
The last active NW aluminum smelters closed in 2016. Most of the rest were victims of Enron and the 2000 energy crisis.
I would imagine the odd size could be due to yields of the new memory.
EDIT: thanks Child post for correcting my typo.
This reminds me of triple-core processors that were "inconceivable" right until they shipped.
Why? Why not 8? Whatever, I doubt I've ever seen a phone with more than 2 GB RAM. I wish my phone was as powerful as my PC (which is very old, has 4 GB RAM and can't handle more because of the chipset limitations) so I could just attach peripherals I want, boot a desktop Linux (directly or in a VM) and use it instead of the PC.
So they can sell you 8 next year. And 10-12 the year after that. And 16 after that. And so on and so forth...
> I doubt I've ever seen a phone with more than 2 GB RAM.
- The first phone with more than 2GB of RAM was the Samsung Note 3 back in 2013 - with 3GB.
- Almost every "flagship" phone since 2015/16 has had 4GB of RAM or more. So if you've seen people using phones on the street, subway, office, etc in the last 3 years, you've seen phones with more than 2GB of RAM.
> I wish my phone was as powerful as my PC...so I could just attach peripherals I want, boot a desktop Linux (directly or in a VM) and use it instead of the PC.
That would be awesome and is my ideal view of the mobile computing future.
Samsung Note 3 is exactly the model of my phone. I've always believed it has just 2 GiBs. Let me check...
Yes. free -g in Termux says 2. However...
... free -m says 2834 which means ~2.77 GiB. Well, this indeed is "more than 2GB of RAM" but not much and what a weird number...
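The mismatch looks like plain integer truncation. A quick sketch, assuming `free -g` rounds down to whole GiB (which would match the output reported above; I haven't checked which `free` implementation Termux ships):

```python
def mib_to_whole_gib(mib: int) -> int:
    """Whole GiB after truncation, the way an integer-only display would show it."""
    return mib // 1024


def mib_to_gib(mib: int) -> float:
    """Exact GiB value for the same amount of memory."""
    return mib / 1024


reported_mib = 2834  # the `free -m` figure quoted above
print(mib_to_whole_gib(reported_mib))      # 2
print(round(mib_to_gib(reported_mib), 2))  # 2.77
```

The gap between 2.77 GiB and a nominal 3 GB is presumably memory reserved before the kernel reports its total, though that part is a guess.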
> That would be awesome and is my ideal view of the mobile computing future.
Seems like it's the past already. There were numerous attempts (successful in one way or another) years ago, and despite the fact that smartphones become more and more powerful and add more RAM, this approach still doesn't show signs of becoming popular, and the most hyped projects get either discontinued or stagnant.
A number of workstations that I have designed have used 2 sticks of 4GB RAM and 2 sticks of 2GB RAM. This has saved money at little to no cost to performance.
Perhaps Micron is doing something similar here?
"For non-binary memory densities, only a half of the row address space is valid. When the MSB address bit is HIGH, then the MSB-1 address bit must be LOW."
They basically cut off the upper 1/4 of the address decoder.
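Here is a small sketch of the rule as quoted, just to check the "upper 1/4" reading. The row-address width used below is an arbitrary example, not taken from any datasheet, and the function names are mine.

```python
def row_address_valid(addr: int, bits: int) -> bool:
    """Apply the quoted rule: if the MSB is HIGH, the MSB-1 bit must be LOW."""
    if bits < 2:
        raise ValueError("need at least two address bits")
    msb = (addr >> (bits - 1)) & 1
    msb_minus_1 = (addr >> (bits - 2)) & 1
    return not (msb and msb_minus_1)


def valid_fraction(bits: int) -> float:
    """Fraction of the 2**bits row addresses that the rule allows."""
    total = 1 << bits
    valid = sum(row_address_valid(a, bits) for a in range(total))
    return valid / total


print(valid_fraction(4))  # 0.75 for a 4-bit example address
```

Only addresses whose top two bits are both 1 get rejected, so three quarters of the power-of-two space remains, which is how you would get a density like 3/4 of the next binary size.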
Once you hit the ceiling a whole bunch of disciplines have to change overnight. Better to give them a 25-50% bump and start the story about horizontal scalability before you’re entirely out of other options.
I've been running 16-32GB for a very long time now. Servers can go into the multiple TB ranges.
Fortunately a search at Micron's website reveals that this is the "Z2BM" with two part numbers, MT29VZZZ7D8DQFSL-046 (containing 2 Z2BM) and MT29VZZZBD9DQKPR-046 (containing 4 Z2BM). No public datasheets (yet), unfortunately, I was curious to read a little more.
What Intel is promising is LPDDR4 packaged in its own chip(s) on the motherboard.
2. I don't know that it would make much difference, IIRC energy-efficient RAM has a huge impact on idle/sleeping power, but when everything's running full tilt other components way outdraw RAM's needs