Info 64MB V-Cache on 5XXX Zen3 Average +15% in Games


Kedas

Senior member
Dec 6, 2018
355
339
136
Well, we now know how they will bridge the long wait until Zen 4 on AM5 in Q4 2022.
Production start for V-cache is at the end of this year, which is too early for Zen 4, so this is certainly coming to AM4.
The +15%, Lisa said, is "like an entire architectural generation".
 
  • Like
Reactions: Tlh97 and Gideon

Mopetar

Diamond Member
Jan 31, 2011
8,436
7,631
136
They did compare it to the 5900X, so it's on average 15% faster than the 5900X. When they introduced their V-cache they said 15% more performance with the same CPU at iso clocks, but the gains seem to be much more, as the reduced-clock 5800X3D is still 15% faster than the 5900X.

That would change the numbers, but only by a small amount. Using the same benchmarks as before, the Tom's numbers show that the 5900X is 1.8% faster (0.6% faster when both chips are using PBO), and the TPU numbers put the 5900X only 0.2% ahead of the 5800X. The difference is likely even smaller at 1440p.

The 12600 does not have E-cores, and the 12400 is on par with a 5600X in gaming and a small amount ahead in productivity. The 12600 will be at best a few percent faster due to higher clocks.

Various tests have shown that the efficiency cores contribute very little to gaming performance.

Based on UK prices the 12600 is £230, plus £200 for a B660M Mortar. OTOH you could go 5600X + Gigabyte Aorus Elite for around £350. Sure, the 12600 will be a bit faster, but it also costs about 20% more for CPU + motherboard.

I'm not sure why anyone who's going to do a new build wouldn't wait until Zen 4 at this point. If you're just going to make something for a specific cost then just go with the best deal you can get. Intel did announce some new chipsets so there should be less expensive boards on the market in the next few months.

The advantage of going with the cheap Alder Lake CPU now is the intent to replace it later with a more expensive high-end CPU. If you buy into AM4 at this point you don't have that option, since AM4 is at the end of its life.

If someone bought a good AM4 board years ago and paired it with a 2600X (coincidentally the same $220 back in the day), they're now in the situation where, instead of having to move to a new Intel platform or AM5 and do a new build, they can get a 5800X3D, which will offer a substantial jump in performance. It's the same thinking in both cases, just at different points in the timeline.
 

DrMrLordX

Lifer
Apr 27, 2000
22,700
12,652
136
Uh, that was a pun. Also, I haven't noticed @DrMrLordX ever worrying about CPU cost. He's one of the last of us that upgrades frequently (well, amongst current posters).

I was considering a 5950X3D but now I guess I'm back to waiting for Zen 4. As for the price of the 5800X3D: AMD hasn't even announced a price yet, so ya know. Let's all relax a little. Honestly I think it'll be like $50 more than the 5800X at launch. $499.
 

Hitman928

Diamond Member
Apr 15, 2012
6,642
12,245
136
Just to follow up on the cache since I was curious, I found a paper (https://ieeexplore.ieee.org/document/7275655) that uses 32 nm FinFET models to check SRAM power consumption during the hold state with and without power gating. Without power gating, each cell consumes 2.23 µW of power, compared to 27.67 pW with power gating during the hold period. This is obviously a huge difference and suggests that power gating is necessary during hold periods under normal operation, let alone during extended periods of no writes, when talking about very large cache arrays. Unless I'm misunderstanding their findings. I believe AMD is also using 8T cells versus the 6T in the paper, but the core cell design is the same in 8T; it just adds essentially a buffer for read-out. Leakage should be more of an issue at 7 nm than at the 32 nm used in the paper. AMD is probably using this and extending it to gating all the clocks and such in the full SRAM as well, to completely turn off the V-cache when not needed.
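To put those per-cell numbers in perspective, here's a rough back-of-the-envelope in Python scaled up to a 64 MB array. The per-cell figures are the ones from the 32 nm paper, so this is purely illustrative; a real 7 nm V-cache cell will leak differently, and tag/peripheral circuitry isn't counted.

Code:
# Rough scaling of the paper's per-cell hold-state numbers to a 64 MB array.
# Illustrative only: 32 nm figures applied to a 7 nm-sized cache.
CELLS = 64 * 2**20 * 8            # 64 MB of SRAM data = ~537 million bit cells
P_HOLD_NO_GATING = 2.23e-6        # W per cell in hold state, no power gating
P_HOLD_GATED = 27.67e-12          # W per cell in hold state, power gated

print(f"no gating: {CELLS * P_HOLD_NO_GATING:.0f} W")      # ~1200 W, clearly untenable
print(f"gated:     {CELLS * P_HOLD_GATED * 1e3:.1f} mW")    # ~15 mW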
 

nicalandia

Diamond Member
Jan 10, 2019
3,331
5,282
136
I was considering a 5950X3D but now I guess I'm back to waiting for Zen 4. As for the price of the 5800X3D: AMD hasn't even announced a price yet, so ya know. Let's all relax a little. Honestly I think it'll be like $50 more than the 5800X at launch. $499.
The 5900X3D, with 192 MB in total, showed the same gaming performance as the 5800X3D with 96 MB of L3. A 5950X3D would be too expensive (two very costly 3D V-cache dies), as it would provide the same gaming performance as the 5800X3D.
 
Nov 26, 2005
15,189
401
126
The best upgrade path would be to wait to see how AM5 Raphael does, in regards to 3D V-cache. Whether or not I wait to do that remains to be seen...
 
  • Haha
Reactions: Zucker2k

nicalandia

Diamond Member
Jan 10, 2019
3,331
5,282
136
The best upgrade path would be to wait to see how AM5 Raphael does, in regards to 3D V-cache. Whether or not I wait to do that remains to be seen...
That 5 GHz all-core demo while gaming is very encouraging, to say the least, because it's an early production unit (or even an ES) and it already hits such a high clock, with a big IPC bump expected on top of that.
 

nicalandia

Diamond Member
Jan 10, 2019
3,331
5,282
136
So what? 16 cores with that much L3 would be amazing for more than just games.
I agree, but AMD has no incentive to do that. They can sell those two chiplets for more than $1,000 each as Milan-X, and with Raphael coming soon after, the 7950X will be the next gaming and productivity king for a while.
 

Schmide

Diamond Member
Mar 7, 2002
5,712
978
126
That's not the way it works...

What a useless post.

Yes, there is nuance because the hierarchy is partially shared, constantly shifting data up and down, and is out of order.

But you can easily translate clocks to frequency by division. Unless there is some new math the youngins are using?
 

naukkis

Golden Member
Jun 5, 2002
1,004
849
136
What a useless post.

Yes, there is nuance because the hierarchy is partially shared, constantly shifting data up and down, and is out of order.

But you can easily translate clocks to frequency by division. Unless there is some new math the youngins are using?

By dividing the latency in clocks by the frequency, the result is the latency in time. But that's just the latency of the cache subsystem; the cache/memory subsystem is still working at its full frequency and can provide data on every cycle of that working frequency. So the comment was right: something was calculated that doesn't exist.
 

Saylick

Diamond Member
Sep 10, 2012
3,923
9,142
136
By dividing the latency in clocks by the frequency, the result is the latency in time. But that's just the latency of the cache subsystem; the cache/memory subsystem is still working at its full frequency and can provide data on every cycle of that working frequency. So the comment was right: something was calculated that doesn't exist.
This is the right answer. Latency does not equal clock rate. The previous calculation using the latency in clock cycles to determine the cache clock rate would be like using the latency to RAM to determine how fast the RAM was operating. For example, it would be like saying that if the latency to RAM were 80 ns, the RAM was making transfers at 12.5 MHz... and we all know that's not true.
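To make the distinction concrete, here's a quick sketch. The 46-cycle latency and 4.5 GHz clock are placeholder numbers for illustration, not measured Zen 3 figures.

Code:
# Latency in cycles converted to time, and why its inverse is not a clock rate.
core_clock_hz = 4.5e9     # placeholder core/L3 clock
l3_latency_cyc = 46       # placeholder load-to-use L3 latency in core cycles

latency_s = l3_latency_cyc / core_clock_hz
print(f"L3 latency: {latency_s * 1e9:.1f} ns")                       # ~10.2 ns

# Inverting the latency gives ~98 MHz, which is NOT the cache's clock rate;
# the pipelined cache still accepts a new request every core cycle.
print(f"1 / latency = {1 / latency_s / 1e6:.0f} MHz (not the cache clock)")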
 
  • Like
Reactions: Vattila

nicalandia

Diamond Member
Jan 10, 2019
3,331
5,282
136
This is the right answer. Latency does not equal clock rate. The previous calculation using the latency in clock cycles to determine the cache clock rate would be like using the latency to RAM to determine how fast the RAM was operating. For example, it would be like saying that if the latency to RAM were 80 ns, the RAM was making transfers at 12.5 MHz... and we all know that's not true.
Do you know the speed (as in GHz) of the L3 cache on Zen 3?
 

naukkis

Golden Member
Jun 5, 2002
1,004
849
136
Do you know the speed (as in GHz) of the L3 cache on Zen 3?

AMD chips run the L3 synchronously with the core clock. Or, to put it better: the L3 slice is part of the core.

And a correction: AMD Zen chips. BD derivatives and Phenoms have the L3 as part of the uncore, so it ran at the northbridge clock.
 

Hitman928

Diamond Member
Apr 15, 2012
6,642
12,245
136
AMD chips run the L3 synchronously with the core clock. Or, to put it better: the L3 slice is part of the core.

And a correction: AMD Zen chips. BD derivatives and Phenoms have the L3 as part of the uncore, so it ran at the northbridge clock.

Are they run synchronously at full rate? If so, how does that work when cores are running at different frequencies? Does the L3 run at different speeds per slice or does it match the fastest core?
 
  • Like
Reactions: Mopetar

Saylick

Diamond Member
Sep 10, 2012
3,923
9,142
136
Are they run synchronously at full rate? If so, how does that work when cores are running at different frequencies? Does the L3 run at different speeds per slice or does it match the fastest core?
Just a guess, but I would imagine the L3 slice's clock rate is tied to the core it is associated with.

Edit: You know, now that I remember that Zen 3's L3 cache operates on a ring bus, the entire L3 likely runs at the same clock rate. Therefore, if one core in the complex were down-clocked but another ran at full speed, I would guess that the entire L3 runs at the speed dictated by the highest-clocked core.

Courtesy of WikiChip, for the original Zen core:
[Image: WikiChip diagram of the Zen SoC clock domains]

Zen is divided into a number of clock domains, each operating at a certain frequency:

  • UClk - UMC Clock - The frequency at which the Unified Memory Controller (UMC) operates. This frequency is identical to MemClk.
  • LClk - Link Clock - The clock at which the I/O Hub Controller communicates with the chip.
  • FClk - Fabric Clock - The clock at which the data fabric operates. This frequency is identical to MemClk.
  • MemClk - Memory Clock - Internal and external memory clock.
  • CClk - Core Clock - The frequency at which the CPU core and the caches operate (i.e. the advertised frequency).
For example, a stock Ryzen 7 1700 with 2400 MT/s DRAM will have a CClk = 3000 MHz, MemClk = FClk = UClk = 1200 MHz.
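As a quick sketch of that same arithmetic (numbers taken from the example above, and assuming the original-Zen relationship FClk = UClk = MemClk):

Code:
# Deriving the original-Zen clock domains from the DRAM transfer rate.
# DDR transfers twice per clock, so MemClk is half the MT/s rating.
dram_mts = 2400              # DDR4-2400, mega-transfers per second
cclk_mhz = 3000              # stock Ryzen 7 1700 base clock

memclk_mhz = dram_mts / 2    # 1200 MHz
fclk_mhz = uclk_mhz = memclk_mhz

print(f"CClk   = {cclk_mhz} MHz (cores and caches)")
print(f"MemClk = {memclk_mhz:.0f} MHz, FClk = {fclk_mhz:.0f} MHz, UClk = {uclk_mhz:.0f} MHz")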
 

naukkis

Golden Member
Jun 5, 2002
1,004
849
136
Are they run synchronously at full rate? If so, how does that work when cores are running at different frequencies? Does the L3 run at different speeds per slice or does it match the fastest core?

I don't know. Intel also ran the ring and cache slices synchronously starting with Sandy Bridge, and after that they changed to different clock domains. I haven't found any info on the technical details from Intel either.
 
  • Like
Reactions: Hitman928

jamescox

Senior member
Nov 11, 2009
644
1,105
136
They go through the substrate; they don't have to go through the full die. Some do, some don't, and depending on the process, the via won't really be a TSV until the wafer is completed and then thinned.

[Image: diagram of through-silicon via flavours]





I had forgotten about the direct bonding method. This way apparently allows for pitch matching the TSVs which is pretty great.




Yeah, the direct bonding makes the dual flip-chip design work. I still have a hard time believing the die is 20 microns thick. The BEOL metal stack for a 10+ layer process is going to be ~15 µm alone, and that doesn't even include the top layer they add for flip chip or the device implants/wells. Maybe for a process with relatively few, short metal layers, but a CPU wouldn't use such a stack-up. I am willing to be wrong on this, but it's hard for me to see the math working out at 20 µm. The more I think about it, though, I agree that they will make the base die as thin as possible; they want to keep the TSVs as short as possible to reduce any signal delays and parasitics associated with them.

1. What do you think “substrate” means in this context?

2. How would a “Through Silicon Via” be of any use if it didn’t go all of the way through the silicon (the “die”) to connect to the die above?


Your post fails basic logic. The TSVs are likely as short as possible because yield is affected by how deep they etch the holes and fill them with metal. Some may not work, so they could have some redundancy. There is supposedly some circuitry associated with each TSV. Even the 0.9 micron (900 nm) pitch is quite large compared to other structures on the chip, but aligning the die to that level of precision is actually incredible.


Edit: Also, if you look at the stacking images in the WikiChip link, the silicon actually appears thinner than the device and metal layers. Those are not SEM images, but I have seen similar SEM images somewhere, so I believe they are representative.
 

jamescox

Senior member
Nov 11, 2009
644
1,105
136
Using that video and assuming it gives detailed knowledge is like getting a degree by reading Scientific American.

Those SEM shots from the AnandTech article?

Thanks for that. Forgot those exist.

Look at the following (Sub-micron CoW interconnect demonstrated). The boundaries clearly show a very slight (nm ?) penetration into the next level. Notice the gentle saucer shaped deformation? That's material displacement from the center due to pressure by the slimmer shaft. You can even just see the slim pillars penetrating slightly into the conical ones at the ends.

Definitely the Cu pillars are standing a few nm above the silicon plane before bonding.

I'm more convinced now that some sort of pressure welding (plus vacuum and heat?) is used to fuse the Cu pillars. That clears up a big part of the question I had: how is this actually being manufactured?

The other related topic is the silicon blanks.

Seeing that the cache die are bonding the Cu interconnects by allowing a given pillar to slightly penetrate its mating pillar, we can reasonably assume that the silicon is NOT vacuum bonded just by the common assumption of being perfectly flat ( :rolleyes: ) and being placed next to each other.

Those silicon blanks over the cores must be using a thermal glue/paste for attachment. This might have a negative effect on heat transfer to a larger degree than might have been thought.


[Attached image: sub-micron CoW interconnect cross-section]

I don’t know what you are referring to. The shape of the TSV is not going to be a perfect cylinder since it is created by etching the hole in the silicon and filling with metal. Right across the middle of the image horizontally you can see the boundary between the two die. These die must also be incredibly thin if the pitch is 0.9 micron. This is likely a SEM image of a test chip without any devices or metal layers on top. The copper is slightly different size so it looks almost like a tiny misalignment, but it is just slightly different sized pillars. There is no appearance of “pancaking” at that boundary. I think copper is a lot harder than silicon (would need to look it up; don’t have time now), so putting it together with pressure is likely not doable. If you actually deformed the copper from a perfect cylinder to the rounded shape seen there, you likely would destroy the whole thing. There is no solder balls to melt, so heat doesn’t work either.

I don’t see why you are so dubious of cold welding. It has been known for a long time. They had to solve issues with it when going to the moon. A lot of lubricants will sublimate at the temperatures and vacuum conditions on the moon. Without any lubricant or oxygen to form an oxide layer, moving parts in the lunar rover would weld themselves together. We mostly never experience this on earth since most metals will form an oxide layer of some form, aren’t that flat, and aren’t in a vacuum. Also, they just say that it is a cold weld in the video from AMD. If you don’t believe what all of the articles have said about it, then i guess that is where you are at. I don’t have time to search for more data.
 

Hitman928

Diamond Member
Apr 15, 2012
6,642
12,245
136
1. What do you think “substrate” means in this context?

2. How would a “Through Silicon Via” be of any use if it didn’t go all of the way through the silicon (the “die”) to connect to the die above?


Your post fails basic logic. The TSVs are likely as short as possible because yield is affected by how deep they etch the holes and fill them with metal. Some may not work, so they could have some redundancy. There is supposedly some circuitry associated with each TSV. Even the 0.9 micron (900 nm) pitch is quite large compared to other structures on the chip, but aligning the die to that level of precision is actually incredible.

I thought I had made it pretty clear in my post, but I guess not. Some TSVs go through the substrate and attach to (stop at) the FEOL, some go through the substrate and attach to (stop at) the BEOL, and some go fully through the die, including substrate, FEOL, and BEOL. There isn't just one way of doing TSVs, though my understanding is that attaching to the BEOL has more or less been adopted as the industry standard: via-first (stopping at the FEOL) forces the TSV material to go through high heat when the FEOL is processed, which limits the types of material you can use for the TSV, while via-last makes it difficult to align with the rest of the circuit but can be easier when you want to go through a full die. In every case the TSV goes through the substrate and is exposed at the bottom of the die, but how far up the die it goes can vary.

I really don't think going to a decent depth with a TSV is a yield issue, but I don't have any data on that specifically. I do know there are papers showing you can do 200 µm+ TSVs without issue, but papers don't usually have high-volume yield in mind, and this is most likely more a limitation tied to TSV diameter/pitch. I'm also sure that processing wafers thinned to the absolute thinnest possible introduces its own issues. I'll have to read up on how they handle these issues over the weekend and see what the trade-offs are.
 

maddie

Diamond Member
Jul 18, 2010
5,147
5,523
136
I don’t know what you are referring to. The shape of the TSV is not going to be a perfect cylinder since it is created by etching the hole in the silicon and filling with metal. Right across the middle of the image horizontally you can see the boundary between the two die. These die must also be incredibly thin if the pitch is 0.9 micron. This is likely a SEM image of a test chip without any devices or metal layers on top. The copper is slightly different size so it looks almost like a tiny misalignment, but it is just slightly different sized pillars. There is no appearance of “pancaking” at that boundary. I think copper is a lot harder than silicon (would need to look it up; don’t have time now), so putting it together with pressure is likely not doable. If you actually deformed the copper from a perfect cylinder to the rounded shape seen there, you likely would destroy the whole thing. There is no solder balls to melt, so heat doesn’t work either.

I don’t see why you are so dubious of cold welding. It has been known for a long time. They had to solve issues with it when going to the moon. A lot of lubricants will sublimate at the temperatures and vacuum conditions on the moon. Without any lubricant or oxygen to form an oxide layer, moving parts in the lunar rover would weld themselves together. We mostly never experience this on earth since most metals will form an oxide layer of some form, aren’t that flat, and aren’t in a vacuum. Also, they just say that it is a cold weld in the video from AMD. If you don’t believe what all of the articles have said about it, then i guess that is where you are at. I don’t have time to search for more data.
1) Pure Cu is a very "soft" metal, much more ductile than Si. Easily possible.
2) If you magnify the image, you'll see the uniformly cylindrical pillars slightly penetrating the cone-shaped top and bottom connectors.
3) The boundary of the bond is definitely not flat. All of them have a saucer-shaped depression.
4) Cold welding doesn't have to mean cold temperatures, just temperatures below the melting/recrystallization point. Pressure is often used.
5) The sides of the pillars are supported by the surrounding material, preventing compressive buckling failure and distortion.
6) None of the articles give details; they only give the illusion of understanding.

I am interested in the actual details of how this is done.
However, one is free to believe that you just rest them together and the weld happens. No problem.
 
  • Like
Reactions: lobz

Mopetar

Diamond Member
Jan 31, 2011
8,436
7,631
136
Just to follow up on the cache since I was curious, I found a paper (https://ieeexplore.ieee.org/document/7275655) that uses 32 nm FinFET models to check SRAM power consumption during the hold state with and without power gating. Without power gating, each cell consumes 2.23 µW of power, compared to 27.67 pW with power gating during the hold period. This is obviously a huge difference and suggests that power gating is necessary during hold periods under normal operation, let alone during extended periods of no writes, when talking about very large cache arrays. Unless I'm misunderstanding their findings. I believe AMD is also using 8T cells versus the 6T in the paper, but the core cell design is the same in 8T; it just adds essentially a buffer for read-out. Leakage should be more of an issue at 7 nm than at the 32 nm used in the paper. AMD is probably using this and extending it to gating all the clocks and such in the full SRAM as well, to completely turn off the V-cache when not needed.

Although I'm not an electrical engineer, I looked over it. I'm not quite sure what your conclusion is, but this sounds like something that can be applied whenever you aren't accessing the cache to cut down on leakage, as opposed to some way to facilitate turning part of the cache on or off. The design appears to be applied at the level of individual SRAM cells, which means it's operating on a bit-by-bit basis.

Some of the results presented are so substantial that it wouldn't make sense for anyone not to already be using some kind of similar design. In one table the results without gating are in microwatts and drop to picowatts when gating is used. That's a jump of roughly five orders of magnitude. That should set off some alarm bells in anyone's brain.

I looked a bit further, and I don't think this paper is important in any way or provides any kind of substantial information that wasn't previously being considered, or techniques that weren't already being used. The paper only has 3 citations, which suggests it wasn't any kind of monumental discovery. The first two authors have only 1 or 2 publications total; the third author has a few dozen. This was a student paper.

I'm also not sure if anything has been posted about how the additional V-cache is integrated with the existing L3, but assuming it can just be turned off at a whim, that would suggest it's simply providing extra ways (higher associativity) for the existing L3 cache. You'd also need logic to make sure it's okay to turn it off, or that anything that needs to be written back to memory is written back before turning it off. If you can just turn it off, it can't be functioning as additional sets, because you'd need to re-index the entire cache to turn it off, never mind dealing with collisions. That's too complex. It could act to lengthen existing cache lines, but turning that off would also be weird and would require the existing logic to be aware of it. Completely shutting off 66% of your L3 cache isn't something that can just be done on a whim.
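To illustrate the ways-versus-sets point, here is a generic set-associative sketch; the 32 MB / 16-way / 64-byte-line numbers are just for illustration, not a claim about how AMD actually wired the V-cache in.

Code:
# In a set-associative cache the set index depends only on the address and the
# number of sets, never on the number of ways. So capacity added as extra ways
# can be disabled without re-indexing; capacity added as extra sets would change
# the index bits and force every existing line to move.
LINE_BYTES = 64
SETS = 32 * 1024 * 1024 // (LINE_BYTES * 16)   # e.g. a 32 MB, 16-way base L3

def set_index(addr: int, sets: int = SETS) -> int:
    return (addr // LINE_BYTES) % sets

addr = 40000 * LINE_BYTES                 # an arbitrary example address
print(set_index(addr))                    # 7232: same index whether 16-way or 48-way
print(set_index(addr, SETS * 3))          # 40000: tripling the sets moves the line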
 

tomatosummit

Member
Mar 21, 2019
184
177
116
So far this is the list of games with 15%+ performance improvements (3D V-Cache) over a stock 5900X and the 12900K Alder Lake. Pay attention to the Tie/Even games, where it performs the same as the 12900K, so there must be a frame cap or something. Also, in CS:GO three of the CPUs tie (the 5900X, 12900K, and 5800X3D), so the issue is with the game.

DOTA 2: 18% Over 5900X - No info on 12900K
Monster Hunter World: 25% over 5900X - No info on 12900K
League of Legends: 4% Over 5900X - No info on 12900K
Fortnite: 17% Over 5900X - No info on 12900K
Final Fantasy XIV: 20% Over 5900X - 20% Over 12900K
Shadow of the Tomb Raider: 10% Over 5900X - 10% Over 12900K
Far Cry V: 20% Over 5900X - 11% Over 12900K
Gears V: 12% Over 5900X - Tie/Even with 12900K
Watch Dogs Legion: 40% Over 5900X - Ties/Even with 12900K
CS:GO: Tie/Even with 5900X - Tie/Even with 12900K
Some of these results are not surprising, really.
CS:GO seems to flip-flop between being Intel's and AMD's favourite. In the Zen 2/Coffee Lake era it was staunchly in Intel's camp, then Zen 3 came along and AMD had a considerable lead. If Alder Lake and Zen 3D are at the same performance, maybe the engine is finally tapped out, a little short of 1000 fps.

But Far Cry and Watch Dogs getting such gains doesn't surprise me (and I think other Ubisoft open-world games such as Ghost Recon and the newer Assassin's Creed titles will fit the same bill).
I've had the chance to mess with some of those games' editors, and seeing the level-editor asset lists for that kind of open world is eye-watering. Several of Ubisoft's open-world games just seem to throw so much trash into the game world, which shows up in reviews as relatively poor 1% lows in these games; I'm assuming the larger L3 is being put to work and could even scale considerably further for the more urban sections in some games.
I think I'll be looking at what kind of game/engine gains from more cache; some of the previous Zen 3 gains seemed peculiar to me in the past.

On the other hand, DOTA 2 and League are relatively simple, with a very small amount of dynamic assets.
 
  • Like
Reactions: lightmanek

Hitman928

Diamond Member
Apr 15, 2012
6,642
12,245
136
Although I'm not an electrical engineer, I looked over it. I'm not quite sure what your conclusion is, but this sounds like something that can be applied whenever you aren't accessing the cache to cut down on leakage, as opposed to some way to facilitate turning part of the cache on or off. The design appears to be applied at the level of individual SRAM cells, which means it's operating on a bit-by-bit basis.

Some of the results presented are so substantial that it wouldn't make sense for anyone not to already be using some kind of similar design. In one table the results without gating are in microwatts and drop to picowatts when gating is used. That's a jump of roughly five orders of magnitude. That should set off some alarm bells in anyone's brain.

I looked a bit further, and I don't think this paper is important in any way or provides any kind of substantial information that wasn't previously being considered, or techniques that weren't already being used. The paper only has 3 citations, which suggests it wasn't any kind of monumental discovery. The first two authors have only 1 or 2 publications total; the third author has a few dozen. This was a student paper.

I'm also not sure if anything has been posted about how the additional V-cache is integrated with the existing L3, but assuming it can just be turned off at a whim, that would suggest it's simply providing extra ways (higher associativity) for the existing L3 cache. You'd also need logic to make sure it's okay to turn it off, or that anything that needs to be written back to memory is written back before turning it off. If you can just turn it off, it can't be functioning as additional sets, because you'd need to re-index the entire cache to turn it off, never mind dealing with collisions. That's too complex. It could act to lengthen existing cache lines, but turning that off would also be weird and would require the existing logic to be aware of it. Completely shutting off 66% of your L3 cache isn't something that can just be done on a whim.

The paper does present two methods: the first is per-cell gating, which is the most effective but adds significant area; the other is done per cell cluster, which doesn't offer as good power gating but has very little area penalty. Either way, I wasn't trying to imply it was a novel technique, and I did say that if the numbers are even remotely true, then AMD has to be using this, or a similar, technique, as the power with 32 MB of cache would be too much to handle without it. It was more just to get an idea of how much power SRAM cells burn in hold mode on a FinFET process. I'm guessing that when it's off, AMD is just doing whatever gating mechanisms they already implement for the cells, plus maybe some additional clock/power gating for all the supporting circuitry in the SRAM arrays.

As for turning it on and off, AMD said that the extra V-cache is striped with the existing L3 cache, so as long as the addressing logic knows whether it's on or off, it shouldn't be a problem. But again, I'm not a digital guy, so I'm just guessing here.
 

Hitman928

Diamond Member
Apr 15, 2012
6,642
12,245
136
1) Pure Cu is a very "soft" metal, much more ductile than Si. Easily possible.
2) If you magnify the image, you'll see the uniformly cylindrical pillars slightly penetrating the cone-shaped top and bottom connectors.
3) The boundary of the bond is definitely not flat. All of them have a saucer-shaped depression.
4) Cold welding doesn't have to mean cold temperatures, just temperatures below the melting/recrystallization point. Pressure is often used.
5) The sides of the pillars are supported by the surrounding material, preventing compressive buckling failure and distortion.
6) None of the articles give details; they only give the illusion of understanding.

I am interested in the actual details of how this is done.
However, one is free to believe that you just rest them together and the weld happens. No problem.

From my understanding, the TSV pillars are ever so slightly exposed above the silicon in the end, but I don't know how they are doing it with this new direct bonding technique.