Cloudflare switching to ARM Server, Intel "free" by Q4.

ksec

Senior member
Mar 5, 2010
346
4
91
#1
http://www.datacenterknowledge.com/...rm-servers-it-expands-its-data-center-network

https://twitter.com/eastdakota/status/976560820611031040

I have to admit I was surprised, because I remember Anandtech doing an article on Centriq and it wasn't at all impressive, so the numbers from Cloudflare and the Qualcomm TCO papers [1] seemed strange. Then I realised I had it wrong: the chip Anandtech reviewed was ARMv8 but from Cavium. This one from Qualcomm seems impressive enough, and it is much cheaper than Intel.

The most interesting are the last two papers. Redis smells like Baidu to me, which has the largest number of Redis instances in production. HHVM could be Facebook, Wikipedia, WordPress or all three.

All these companies are running thousands of servers.

I never thought the day of ARM's attack on servers would come so quickly, judging from Ampere and Cavium. But it seems Qualcomm has got something right. And they have more improvements coming next year, compared to Intel, which is more of the same. No wonder they have a uArch and Node First strategy for their DC segment.



[1]
https://www.qualcomm.com/documents/tirias-spec-cpu2017-tco-paper-qualcomm-centriq
https://www.qualcomm.com/documents/tirias-redis-tco-paper-qualcomm-centriq
https://www.qualcomm.com/documents/tirias-hhvm-tco-paper-qualcomm-centriq
 

beginner99

Diamond Member
Jun 2, 2009
4,004
124
126
#2
Probably ideal use case:

Many light-weight threads -> ARM

threads that actually need computational power -> Intel
 

moinmoin

Senior member
Jun 1, 2017
696
196
96
#3
Probably ideal use case:

Many light-weight threads -> ARM

threads that actually need computational power -> Intel
And after Meltdown, threads that rely on heavy i/o -> AMD
 

ksec

Senior member
Mar 5, 2010
346
4
91
#5
Probably ideal use case:

Many light-weight threads -> ARM

threads that actually need computational power -> Intel
That was what I thought as well, until we factor in TDP. The current Centriq chip is not running at its maximum TDP, whereas the Intel chip has Turbo Boost, which means it is running at its maximum all the time under load.

https://twitter.com/thecomp1ler/status/976617883164921857

Golang, Linux patch, all landing soon.

Do note though, Qualcomm is one of the Cloudflare investors. But I don't think this makes any difference to those benchmark numbers.
 

beginner99

Diamond Member
Jun 2, 2009
4,004
124
126
#6
That was what I thought as well, until we factor in TDP. The current Centriq chip is not running at its maximum TDP, whereas the Intel chip has Turbo Boost, which means it is running at its maximum all the time under load.

https://twitter.com/thecomp1ler/status/976617883164921857

Golang, Linux patch, all landing soon.

Do note though, Qualcomm is one of the Cloudflare investors. But I don't think this makes any difference to those benchmark numbers.
My main point was that for certain stuff you want it to be fast for a single request, not just in the total picture. Cutting response time from 2 seconds to 0.5 seconds can be a huge thing in terms of user experience. And if the chip you use just isn't fast enough for this, well, then you are out of luck.

The above comparison obviously ignores EPYC completely, which is already way better than Intel in terms of performance/$.
 

coercitiv

Diamond Member
Jan 24, 2014
3,111
387
136
#7
The above comparison obviously ignores EPYC completely, which is already way better than Intel in terms of performance/$.
Cloudflare argue they do not care for perf/$, only perf/watt. Their workload is so specific they claimed they would not choose Intel's products over Qualcomm's even if Xeons were given for free.



Personally I think the fact that Qualcomm is a direct investor in Cloudflare cannot be overlooked, and I'd rather see more tests from other parties.
 

Nothingness

Golden Member
Jul 3, 2013
1,881
32
106
#8
Personally I think the fact that Qualcomm is a direct investor in Cloudflare cannot be overlooked, and I'd rather see more tests from other parties.
I agree, but do you think they'd put themselves in danger just to please one of their investors? No sane company would do that unless it's fully subsidized.
 

coercitiv

Diamond Member
Jan 24, 2014
3,111
387
136
#9
I agree, but do you think they'd put themselves in danger just to please one of their investors? No sane company would do that unless it's fully subsidized.
Ask yourself this: what sane company CEO publicly burns bridges with his lifelong CPU supplier by touting their competitor's supremacy? It's one thing to (entirely) switch suppliers, another thing to post kill-a-watt pics and suggest Intel's products are more expensive even if given for free. These are 100% PR moves and I doubt Cloudflare needs the publicity.
 
Apr 8, 2002
40,924
65
126
#10
This is specific use case stuff. For what they are doing it makes sense.
 

NTMBK

Diamond Member
Nov 14, 2011
8,250
227
126
#11
Ask yourself this: what sane company CEO publicly burns bridges with his lifelong CPU supplier by touting their competitor's supremacy? It's one thing to (entirely) switch suppliers, another thing to post kill-a-watt pics and suggest Intel's products are more expensive even if given for free. These are 100% PR moves and I doubt Cloudflare needs the publicity.
It's a good way to get a good price out of Intel next time they need to buy servers...
 

Nothingness

Golden Member
Jul 3, 2013
1,881
32
106
#12
It's a good way to get a good price out of Intel next time they need to buy servers...
For this to work, the product has to be competitive or else Intel would just laugh at them when asked for a price reduction.

Anyway I want to see independent benchmarks. Or find time to do it myself.
 

NTMBK

Diamond Member
Nov 14, 2011
8,250
227
126
#13
For this to work, the product has to be competitive or else Intel would just laugh at them when asked for a price reduction.
In order for Intel to fear the competition, you need to prove that you are willing and able to switch. That's exactly what Cloudflare is doing.
 

Dolan

Junior Member
Dec 25, 2017
13
0
41
#14
This is specific use case stuff. For what they are doing it makes sense.
It is not only this specific use case. I can't go into details, but there are applications where Centriq's advantage is much bigger than here.

Basically, Centriq has a chance to beat Xeons in every field where Xeons managed to beat big iron architectures (zX, SPARC). And of course, Centriq lags where Xeons failed too.

Anyway, Cloudflare's announcement seems a little rushed to me too, although they control their software stack, so they must know what they are doing. For any wider deployment there is still a lot of work to do on both software and hardware. The first thing Qualcomm should do is push the 7nm version out ASAP. A two-node advantage would be something that can't be ignored even by companies not willing to switch architectures.
 
Feb 6, 2011
1,771
88
136
#15
It is not only this specific use case. I can't go into details, but there are applications where Centriq's advantage is much bigger than here.
How is inter-core latency, both in the cross-ring-bus and in the shared-L2 situations? Those two things always looked the worst to me: large number of ring stops, write combining etc into your "fast" L2(hello bulldozer).

The start of 7nm is probably tricky for QC: if they stick to Samsung it's going to be a while, and I imagine TSMC and GF 7nm capacity is going to be in high demand.
 

NostaSeronx

Platinum Member
Sep 18, 2011
2,312
104
126
#16
write combining etc into your "fast" L2(hello bulldozer).
Bulldozer uses write coalescing, in which two addresses in L1D0 and L1D1 get coalesced into a single address in L2. Write combining is used in pretty much everything from P6 onwards for Intel. It reduces the multiple transactions between L1 <-> L2 into fewer transactions and speeds up the processor. All WCB/WCC enhancements can be defeated, just as OoOE, speculative execution or branch prediction can be defeated, etc.

fyi; both cores in Bulldozer have WCBs which buffer into the WCC.
2 x 4 64-byte WCBs go into 1 x 64 64-byte of the WCC's WCB of the WB type.

P6/Pentium 4 => 4? 32 byte or 64 byte WCBs
Core onwards => 6? 64 byte WCBs
Nehalem onwards => 10? 64 byte WCBs

Write combining isn't an issue. So, in the Centriq design it isn't an issue.
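To illustrate the "fewer transactions" point, here is a toy sketch in Python. It is not a model of Bulldozer, any Intel core or Centriq: the 64-byte line size comes from the numbers above, but the buffer count, the flush-oldest policy and the function names are all assumptions made up for the example. It only counts how many L1 -> L2 transactions a stream of stores would generate with and without merging stores to the same line.

```python
from collections import OrderedDict

LINE_BYTES = 64  # line size from the post; everything else here is a toy assumption


def transactions_without_combining(store_addresses):
    """Every store becomes its own L1 -> L2 transaction."""
    return len(store_addresses)


def transactions_with_combining(store_addresses, num_buffers=4):
    """Count L1 -> L2 transactions when stores to the same 64-byte line are merged.

    Toy model: `num_buffers` line-sized buffers; when a new line needs a slot and
    none is free, the oldest-allocated buffer is flushed (one transaction); all
    remaining buffers are flushed at the end.
    """
    buffers = OrderedDict()  # line index -> True, kept in allocation order
    transactions = 0
    for addr in store_addresses:
        line = addr // LINE_BYTES
        if line in buffers:
            continue  # merged into an already open buffer
        if len(buffers) == num_buffers:
            buffers.popitem(last=False)  # flush the oldest buffer
            transactions += 1
        buffers[line] = True
    return transactions + len(buffers)  # final flush of whatever is still open


# 64 sequential 8-byte stores cover 512 bytes, i.e. 8 distinct lines
sequential = [i * 8 for i in range(64)]
print(transactions_without_combining(sequential))  # 64
print(transactions_with_combining(sequential))     # 8

# Stores that hit a new line every time get no benefit from combining
strided = [i * LINE_BYTES for i in range(64)]
print(transactions_with_combining(strided))        # 64
```

The strided case is the "can be defeated" scenario: when no two stores share a line, combining saves nothing.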
 
Feb 6, 2011
1,771
88
136
#17
Bulldozer uses write coalescing, in which two addresses in L1D0 and L1D1 get coalesced into a single address in L2. Write combining is used in pretty much everything from P6 onwards for Intel. It reduces the multiple transactions between L1 <-> L2 into fewer transactions and speeds up the processor. All WCB/WCC enhancements can be defeated, just as OoOE, speculative execution or branch prediction can be defeated, etc.

fyi; both cores in Bulldozer have WCBs which buffer into the WCC.
2 x 4 64-byte WCBs go into 1 x 64 64-byte of the WCC's WCB of the WB type.

P6/Pentium 4 => 4? 32 byte or 64 byte WCBs
Core onwards => 6? 64 byte WCBs
Nehalem onwards => 10? 64 byte WCBs

Write combining isn't an issue. So, in the Centriq design it isn't an issue.
You know all this could be completely avoided if you applied a simple Occam's razor. I think it's obvious what I am talking about: the L1Ds are write-through to an inclusive shared L2, and this has an impact on maximum write throughput as well as latency. I'll believe what you have to say on the matter as soon as tunnel borer appears........
 

NostaSeronx

Platinum Member
Sep 18, 2011
2,312
104
126
#18
I think it's obvious what I am talking about: the L1Ds are write-through to an inclusive shared L2, and this has an impact on maximum write throughput as well as latency.
AMD's Bulldozer is more efficient in regards to memory operations. So, the decrease in bandwidth and increase in latency is mostly alleviated by the various SCB, CB, WCB, WCC, etc. SCB is the store coalescing buffer; it buffers stores going to the L1Ds. So, a store only needs to get to the buffer once, and is then broadcast to both cores.

Write-through;
L1d -> WCB->
Write-back;
WCC -> L2 -> Memory or L3.

WCC is fully inclusive with L2. What doesn't exist in L2, doesn't exist in WCC.
What doesn't exist in L2, does exist in L1d. L1d is mostly inclusive with L2.

Centriq is heavily compressed in those 128B interleaved cache lines. Which means less memory bandwidth is actually needed...
I'll believe what you have to say on the matter as soon as tunnel borer appears
It should be Higon/Hygon(THATIC-AMD JV)'s first custom core. Also, it may or may not be x86 by the way. Origin core is Broadcom's Vulcan. Is a pure server solution, which is perfect for something that is only on paper related to Bulldozer.

American trees => Zen cores // Aspen core = Zen1
Chinese/Canadian trees => Tunnelborer/Harvester cores (Harvester was meant to cut down Aspen, it was internally projected to being superior to Zen.)
 

ksec

Senior member
Mar 5, 2010
346
4
91
#19
It is not only this specific use case. I can't go into details, but there are applications where Centriq's advantage is much bigger than here.

Basically, Centriq has a chance to beat Xeons in every field where Xeons managed to beat big iron architectures (zX, SPARC). And of course, Centriq lags where Xeons failed too.

Anyway, Cloudflare's announcement seems a little rushed to me too, although they control their software stack, so they must know what they are doing. For any wider deployment there is still a lot of work to do on both software and hardware. The first thing Qualcomm should do is push the 7nm version out ASAP. A two-node advantage would be something that can't be ignored even by companies not willing to switch architectures.
As mentioned, they only care about perf/watt, or more precisely their workload/watt. Since the numbers are so much in favour of Qualcomm, Cloudflare is just doing its investor a favour. Given that Cloudflare is now tuning Linux, LuaJIT, Go, and the rest of the open-source software stack it uses, it is likely that the software investment in Centriq will cancel out all the savings they get. But in the long term it benefits Qualcomm and everyone else.

Qualcomm already has an improved core with better IPC for 7nm Centriq in the pipeline, likely coming in 2019. Intel will have 10nm (a node that should be slightly better than Samsung's 7nm) at around the same time. So it really isn't an advantage.
 

XavierMace

Diamond Member
Apr 20, 2013
4,307
6
126
#20
Cloudflare argue they do not care for perf/$, only perf/watt. Their workload is so specific they claimed they would not choose Intel's products over Qualcomm's even if Xeons were given for free.



Personally I think the fact that Qualcomm is a direct investor in Cloudflare cannot be overlooked, and I'd rather see more tests from other parties.
While I agree that the fact Qualcomm is a direct investor can't be overlooked, I think people are severely underestimating how much of a factor electricity costs are in large scale server farms. If we use that above graph for power usage and look at some scenarios...

With a density of 42 CPUs per rack, that's about $4,000/yr of electricity savings per rack based on average electricity prices. $4,000, not a big deal. Scale that out though and figure they have 1,000 racks across the world. That's $4,000,000/yr in electricity for a 7% performance gain. Go high density and that number is just going to keep getting higher. More importantly, that's just the pure power savings from the CPU. More power means more heat. More heat means more cooling. More cooling means more electricity. Here's a little blurb on that subject from Cisco:

Relationship between Heat and Power

All power that is consumed by IT equipment is converted to heat. Though power is typically reported in watts (W) and heat is typically reported in British Thermal Units (BTUs) per hour (BTU/hr), these units are in fact interchangeable. Although power is almost always reported in watts, heat load is commonly reported in watts or BTU/hr. The conversion from watts to BTU/hr is 1W = 3.412 BTU/hr. So, for example, a server that consumes 100W produces approximately 341.2 BTU/hr of heat energy.
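As a quick sanity check of the conversion quoted above, here is a minimal Python sketch. The function names are made up for the example; the only number taken from the text is the 3.412 BTU/hr per watt factor.

```python
BTU_PER_HR_PER_WATT = 3.412  # conversion factor quoted in the Cisco text


def watts_to_btu_per_hr(watts):
    """Heat load in BTU/hr for a given electrical power draw in watts."""
    return watts * BTU_PER_HR_PER_WATT


def btu_per_hr_to_watts(btu_per_hr):
    """Inverse conversion, from BTU/hr back to watts."""
    return btu_per_hr / BTU_PER_HR_PER_WATT


# The 100 W server from the example above
print(round(watts_to_btu_per_hr(100), 1))  # 341.2
```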

Energy Savings in Cisco’s Facilities

To carefully study the effects of best practices to promote energy efficiency, Cisco underwent a data center efficiency study in the Cisco research and development laboratories. As part of this study, the following best practices were applied:

● Redundant power was disabled where possible

● Power savings programs were used

● Computational fluid dynamics (CFD) modeling was used

● Virtualization was applied

● Blanking panels were used

● The floor grilles were rearranged

● The chilled water temperature was raised from 44°F to 48°F (7°C to 9°C)

This study demonstrated major improvements in data center power and cooling efficiency. Even though an increase in hardware installations caused the IT load to increase slightly (from 1719 to 1761 kilowatts [kW]), the overhead for cooling the data center dropped (from 801 to 697 kW). The overall power usage effectiveness (PUE) dropped from 1.48 to 1.36. The payback from the proof of concept was 6 to 12 months. The ideas from this pilot project are being applied to all Cisco facilities and are projected to save US$2 million per year.
$2M per year from efficiency changes without changing equipment. That's not chump change. If Qualcomm can provide 90% of Intel's performance with only 60% of the heat production in Cloudflare's specific case then yes, performance/$ is basically moot.
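For anyone who wants to play with the rack numbers, here is a rough Python sketch of the arithmetic. The 42 CPUs per rack, $4,000 per rack and 1,000 racks come from the post above; the 100 W saved per CPU and the $0.10/kWh price are placeholder assumptions picked only to show the shape of the calculation, not values read off the power graph. Cooling overhead is not included.

```python
HOURS_PER_YEAR = 24 * 365


def annual_savings_per_rack(watts_saved_per_cpu, cpus_per_rack, usd_per_kwh):
    """Yearly electricity savings for one rack, CPU power only (no cooling)."""
    kwh_saved = watts_saved_per_cpu * cpus_per_rack * HOURS_PER_YEAR / 1000
    return kwh_saved * usd_per_kwh


def fleet_savings(per_rack_usd, racks):
    """Scale a per-rack figure out to a whole fleet."""
    return per_rack_usd * racks


# Figures from the post: about $4,000 per rack, 1,000 racks
print(fleet_savings(4000, 1000))  # 4000000

# Hypothetical inputs (assumptions, not from the graph): 100 W saved per CPU
# at $0.10/kWh lands in the same ballpark as the $4,000 per rack quoted above
print(round(annual_savings_per_rack(100, 42, 0.10)))  # 3679
```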
 

PliotronX

Diamond Member
Oct 17, 1999
8,886
2
106
#21
While I agree that the fact Qualcomm is a direct investor can't be overlooked, I think people are severely underestimating how much of a factor electricity costs are in large scale server farms. If we use that above graph for power usage and look at some scenarios...

With a density of 42 CPUs per rack, that's about $4,000/yr of electricity savings per rack based on average electricity prices. $4,000, not a big deal. Scale that out though and figure they have 1,000 racks across the world. That's $4,000,000/yr in electricity for a 7% performance gain. Go high density and that number is just going to keep getting higher. More importantly, that's just the pure power savings from the CPU. More power means more heat. More heat means more cooling. More cooling means more electricity. Here's a little blurb on that subject from Cisco:



$2M per year from efficiency changes without changing equipment. That's not chump change. If Qualcomm can provide 90% of Intel's performance with only 60% of the heat production in Cloudflare's specific case then yes, performance/$ is basically moot.
This makes sense, also the reason FB dumped all those 2670s in favor of the PHIs. Huge upgrade expense but pays for itself in scale PDQ!
 

ksec

Senior member
Mar 5, 2010
346
4
91
#22
I forgot to ask,

Does anyone know if these Centriq cores, which are ARM64 only, are coming to the desktop? Microsoft has an emulator that can run Windows software on ARM64; I wonder how this would perform.
 
Apr 30, 2015
112
0
41
#23
ARM based SoCs should be available in production delivery CRAY HPCs this month.
See http://gw4.ac.uk/isambard/ for some background.
See also https://www.cray.com/products/computing/xc-series?tab=technology.
The top-ten UK Met Office / University HPC applications have been run on the CRAY-CAVIUM-ARM machine, and ported very quickly; this includes the standard weather-forecasting application for the UK Meteorological Office. Eight out of ten applications were in FORTRAN. The software-stack is there. The CRAY machine uses CAVIUM Thunder X2, and the CRAY ARIES fabric. It is scaleable to very large scale. CRAY have developed an ARM-specific FORTRAN compiler. The CAVIUM Thunder X2 is very effective, as it has superior I/O performance, which is a prime requirement for almost all the top-end HPC applications.
Also, it is worth searching YouTube using "charbax hpc arm" for a more general picture of ARM-based HPC designs. U.S. Government labs are researching the use of ARM-based HPC solutions.
ARM have arrived in HPC.
 
Feb 5, 2006
33,018
212
126
#24
Intel is not having a good year:
AMD Ryzen is competitive
ARM servers at the edge
Power9 is getting traction in datacenters for AI
Apple is ditching them
RISC-V is getting interest
Start of transition of AI inference off Intel CPUs to accelerators.
 

ksec

Senior member
Mar 5, 2010
346
4
91
#25
Intel is not having a good year:
AMD Ryzen is competitive
ARM servers at the edge
Power9 is getting traction in datacenters for AI
Apple is ditching them
RISC-V is getting interest
Start of transition of AI inference off Intel CPUs to accelerators.
You forgot to add TSMC 7nm beats Intel 10nm in HVM.
 

