A look at the future of chips through Intel's first chiplet design


Source: Compiled by Semiconductor Industry Watch (ID: icbank) from AnandTech, with thanks.

One of the areas where Intel trails its competition on server platforms is core count: other companies are reaching higher core counts in one of two ways, either with smaller cores or with individual chiplets connected together.
On Architecture Day 2021, Intel disclosed the features of its next-generation Xeon Scalable platform, one of which is the shift to a tiled architecture. Intel will combine four tiles/chiplets through its fast embedded bridges to achieve better CPU scalability at higher core counts.
As part of the disclosure, Intel also expanded on its new Advanced Matrix Extensions (AMX) technology, CXL 1.1 support, DDR5, PCIe 5.0, and an accelerator interface architecture that may make customized Xeon CPUs possible in the future.

Sapphire Rapids:

Sapphire Rapids (SPR) is built on the Intel 7 process and will become Intel's next-generation Xeon Scalable server processor for its Eagle Stream platform. Using the latest Golden Cove processor core that we detailed last week, Sapphire Rapids brings many key technologies to Intel: acceleration engines, native half-precision FP16 support, DDR5, 300-series Optane DC Persistent Memory, PCIe 5.0, CXL 1.1, wider and faster UPI, its latest bridging technology (EMIB), new QoS and telemetry, HBM, and workload-specific acceleration.
Sapphire Rapids will launch in 2022 and will be Intel's first modern CPU product built on a multi-die architecture, one designed to minimize latency and maximize bandwidth through its Embedded Multi-die Interconnect Bridge technology. This design allows more high-performance cores to be integrated (Intel has not disclosed the exact number), with a focus on what Intel calls very important metrics for its customer base, such as node performance and data center performance. Intel calls SPR "the biggest leap in DC capability in a decade".
PCIe 5.0 is an upgrade over the previous-generation Ice Lake's PCIe 4.0, and memory moves from six 64-bit DDR4 controllers to eight 64-bit DDR5 controllers. But the bigger improvements are in the cores, the accelerators, and the packaging.

Golden Cove: A High-Performance Core with AMX and AIA

Sapphire Rapids uses the same Golden Cove core as the consumer Alder Lake platform; Intel saw some of the same synergies in the early 2000s, the last time it did this. Here is a quick recap of the core as detailed for Alder Lake:
According to Intel, compared with Cypress Cove, the new core delivers more than a 19% IPC gain in single-threaded workloads; Cypress Cove was Intel's backport of Ice Lake to 14nm. This comes down to some major core changes, including:
  • 16B → 32B length decode
  • 4-wide → 6-wide decode
  • 5K → 12K branch targets
  • 2.25K → 4K μop cache
  • 5 → 6 wide allocation
  • 10 → 12 execution ports
  • 352 → 512-entry reorder buffer
The goal of any core is to handle more things faster, and the latest generation of cores tries to do just that better than before. Many of Intel's changes make sense.
There are differences between the consumer core in Alder Lake and the server core in Sapphire Rapids. The most obvious is that the consumer version does not have AVX-512, while SPR will have it enabled. SPR also has a 2 MB private L2 cache per core, whereas the consumer version has 1.25 MB. Beyond that, there are also the Advanced Matrix Extensions (AMX) and the new Accelerator Interface Architecture (AIA) to discuss.
So far, Intel's CPU cores have offered scalar operations (normal) and vector operations (AVX, AVX2, AVX-512). The next stage is a dedicated matrix solver, something akin to the tensor cores in a GPU. This is what AMX adds: a new extensible register file with dedicated AMX instructions in the form of TMUL instructions.
AMX uses eight 1024-bit registers for basic data operands, and through memory references the TMUL instructions operate on tiles of data held in these registers. TMUL is supported by a dedicated coprocessor engine built into the core (one per core), and the foundation behind AMX is that TMUL is just the first such coprocessor; Intel designed AMX to be broader than that. If Intel pushes its multi-die strategy further, at some point we could see custom accelerators exposed through AMX.
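To make the tile/TMUL programming model more concrete, below is a minimal sketch in C using the publicly documented AMX intrinsics (built with something like gcc -O2 -mamx-tile -mamx-int8 on Linux). It asks the kernel for AMX permission, configures three tile registers, loads two INT8 tiles, performs one TDPBSSD multiply-accumulate, and stores the INT32 result. This is only an illustration of the public programming model under those assumptions, not a description of Intel's internal implementation.

```c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_REQ_XCOMP_PERM 0x1023   /* Linux: request AMX tile state for this process */
#define XFEATURE_XTILEDATA  18

/* 64-byte configuration blob consumed by LDTILECFG (_tile_loadconfig) */
typedef struct {
    uint8_t  palette_id;
    uint8_t  start_row;
    uint8_t  reserved[14];
    uint16_t colsb[16];              /* bytes per row for each tile register */
    uint8_t  rows[16];               /* rows for each tile register */
} __attribute__((packed, aligned(64))) tilecfg;

int main(void) {
    if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA)) {
        puts("AMX not available on this system");
        return 1;
    }

    /* C (16x16 int32) += A (16x64 int8) * B (16x64 int8, laid out for TDPBSSD) */
    static int8_t  A[16][64], B[16][64];
    static int32_t C[16][16];
    memset(A, 1, sizeof A);
    memset(B, 2, sizeof B);

    tilecfg cfg = { .palette_id = 1 };
    cfg.rows[0] = 16; cfg.colsb[0] = 64;   /* tmm0: accumulator C */
    cfg.rows[1] = 16; cfg.colsb[1] = 64;   /* tmm1: A */
    cfg.rows[2] = 16; cfg.colsb[2] = 64;   /* tmm2: B */
    _tile_loadconfig(&cfg);

    _tile_zero(0);
    _tile_loadd(1, A, 64);                 /* stride in bytes */
    _tile_loadd(2, B, 64);
    _tile_dpbssd(0, 1, 2);                 /* TMUL: signed int8 dot products into int32 */
    _tile_stored(0, C, 64);
    _tile_release();

    printf("C[0][0] = %d\n", C[0][0]);     /* 64 multiply-adds of 1*2 each = 128 */
    return 0;
}
```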
Intel confirms that we should not see frequency drops any worse than with AVX: when vector and matrix instructions are invoked, each core has new fine-grained power controllers.
This leads nicely into the new Accelerator Interface Architecture (AIA). Normally, when using add-in accelerator cards, commands must navigate between kernel space and user space, set up memory, and direct any virtualization across multiple hosts. Intel describes its new acceleration engine interface as being like talking to a PCIe device as if it were simply an accelerator on board the CPU, even when it is attached over PCIe.
Initially, Intel will have two capable pieces of AIA hardware.
Intel QuickAssist Technology (QAT) is one we have seen before, as it appeared in a special variant of the Skylake Xeon chipset (requiring a PCIe 3.0 x16 link) and as an add-in PCIe card. This version will support up to 400 Gb/s of symmetric cryptography, or up to 160 Gb/s of compression plus 160 Gb/s of decompression simultaneously, double the previous version.
The other is Intel's Data Streaming Accelerator (DSA). Intel has had documentation about DSA on the web since 2019, describing it as a high-performance data copy and transformation accelerator for streaming data from storage and memory, or to other parts of the system, through DMA remapping hardware units/IOMMU. DSA was requested by specific hyperscaler customers who want to deploy it in their own internal cloud infrastructure. Intel is keen to point out that some customers will use DSA, some will use Intel's new Infrastructure Processing Units, and some will use both, depending on the level of integration or abstraction they are interested in. Intel told us that DSA is an upgrade over the Crystal Beach DMA engine found on the Purley (SKL+CLX) platforms.
On top of this, Sapphire Rapids also supports half-precision AVX512_FP16 instructions, aimed mainly at artificial intelligence workloads as part of Intel's DLBoost strategy. These FP16 instructions can also be used as part of AMX, alongside the existing INT8 and BF16 support. Intel now also supports CLDEMOTE for cache-line management.
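For a sense of what AVX512_FP16 looks like from the software side, here is a small sketch using the public AVX512-FP16 compiler intrinsics (the __m512h type holds 32 half-precision lanes per 512-bit register). It assumes a compiler with -mavx512fp16 and FP16-capable hardware; the helper function name is made up for illustration.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdio.h>

/* Dot product of two FP16 arrays, processing 32 half-precision lanes per iteration.
   n is assumed to be a multiple of 32 to keep the sketch short. */
static _Float16 dot_fp16(const _Float16 *a, const _Float16 *b, size_t n) {
    __m512h acc = _mm512_setzero_ph();
    for (size_t i = 0; i < n; i += 32) {
        __m512h va = _mm512_loadu_ph(a + i);
        __m512h vb = _mm512_loadu_ph(b + i);
        acc = _mm512_fmadd_ph(va, vb, acc);   /* fused multiply-add in half precision */
    }
    return _mm512_reduce_add_ph(acc);         /* horizontal sum of the 32 lanes */
}

int main(void) {
    _Float16 a[64], b[64];
    for (int i = 0; i < 64; i++) { a[i] = (_Float16)1.0f; b[i] = (_Float16)0.5f; }
    printf("dot = %f\n", (double)dot_fp16(a, b, 64));   /* expect 32.0 */
    return 0;
}
```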

A Quick Word on CXL

In its Sapphire Rapids presentations, Intel has been keen to stress that it will support CXL 1.1 at launch. CXL is a connectivity standard designed to handle far more than PCIe does: beyond simply moving data from host to device, CXL supports three protocols, known as CXL.io, CXL.cache, and CXL.memory. As defined in the CXL 1.0 and 1.1 standards, these three form the basis of a new way to connect hosts and devices.
Naturally, the expectation was that every CXL 1.1 device would support all three protocols. It was not until Hot Chips, a few days later, that we learned Sapphire Rapids only supports part of the CXL standard, specifically CXL.io and CXL.cache; CXL.memory will not be part of SPR. We are not sure to what extent this means SPR falls short of CXL 1.1, or what it means for CXL 1.1 devices; without CXL.mem, as the diagram above shows, it is only Type-2 support that Intel loses. Perhaps this is more an indication that the market around CXL is better served by CXL 2.0, which will no doubt appear in future products.
Next, let's look at Intel's new tiled architecture for Sapphire Rapids.

Going to More Silicon: Connectivity Matters

To date, all of Intel's leading Xeon Scalable processors have been monolithic, that is, a single piece of silicon. Monolithic silicon has its advantages, namely fast in-silicon interconnects between cores and a single power interface to manage.
However, as we move to smaller and smaller process nodes, a large piece of silicon also has disadvantages: it is hard to manufacture in volume without defects, high-core-count versions become more expensive, and ultimately the size is limited.
The alternative to a large monolithic design is to cut it into smaller dies and connect them together. The main advantage is better silicon yield, and different silicon can be configured for different functions as needed.
With a multi-die design, you can also end up with more silicon than a single die can provide: the reticle (manufacturing) limit for a single die is roughly 700-800 mm², whereas a multi-die processor built from several smaller dies can easily push past 1000 mm². Intel says each Sapphire Rapids die is around 400 mm², for a total of roughly 1600 mm². The main challenges of a multi-die design, however, are connectivity and power.
The simplest way to package two dies on one substrate is to connect them within the substrate itself, essentially the equivalent of PCB traces. This is a high-yield approach, but it suffers on the two points listed above: connectivity and power. Compared with an in-silicon connection, sending a bit through a PCB-style trace costs more power, and the bandwidth is much lower because signals cannot be packed as densely. As a result, without careful planning, multi-die products have to be aware of how far away data is at any given time, a problem monolithic products rarely have.
One solution is a faster interconnect: rather than routing the connectivity through the substrate/package, why not route it through silicon? By placing the connected dies on a piece of silicon, such as an interposer, the connecting traces get better signal integrity and better power. Using an interposer is usually called 2.5D packaging. It costs a bit more than standard packaging (and there is also scope for an active interposer containing logic), but it brings another limitation: the interposer has to be larger than the sum of all the dies on it. Overall, though, it is a better option, especially if you want your multi-die product to behave like a single whole.
Intel believes the best way to overcome the downsides of an interposer, while still getting the benefits of an effectively monolithic design, is to create an ultra-small interposer embedded inside the substrate. By pre-embedding these bridges in the right places and using the right packaging tools, two dies can be placed on top of this small Embedded Multi-die Interconnect Bridge (EMIB). The result is a system that physically comes as close as possible to a monolithic design.
Intel has been working on EMIB technology for more than ten years. From our perspective, its development has three major milestones: (1) the ability to embed the bridge into the package with high yield; (2) the ability to place large dies on the bridge with high yield; (3) the ability to place two high-power dies side by side on the bridge. I suspect the hardest part for Intel to solve has been the third: placing two high-power dies side by side, especially when the dies have different coefficients of thermal expansion and different thermal profiles, can weaken the substrate around the bridge or the bridge connections themselves.
Until recently, almost all Intel products using EMIB focused on connecting a CPU/GPU to high-bandwidth memory, which draws an order of magnitude less power than the silicon it is connected to. Because of this, I did not believe it was possible to put two high-performance tiles together until Intel connected two high-performance FPGA tiles into a multi-die FPGA with EMIB in late 2019. Only now has Intel enabled this technology in its CPU product stack, and we finally see it in Sapphire Rapids.

Sapphire Rapids

Sapphire Rapids will use four tiles connected by ten EMIBs at a 55-micron connection pitch. At first glance you might expect that, in a 2x2 tile array, each tile-to-tile connection would use the same number of EMIBs; at two EMIBs per connection across four connections, that would be eight. So why does Intel quote ten? It comes down to how Sapphire Rapids is designed.
Because Intel wants SPR to appear as a single chip to the operating system, it has essentially cut its inter-core mesh both horizontally and vertically, so that each hop across an EMIB is treated as just another step on the mesh. But Intel's monolithic designs are not symmetrical in both dimensions: features like PCIe or QPI usually sit on the edges, not in the same place in every corner. Intel told us the same is true in Sapphire Rapids, where one dimension uses three EMIBs per connection and the other uses two (2x3 + 2x2 = 10).
By avoiding strict rotational symmetry in the design and not using a central IO hub, Intel is clearly keen to present this product as a monolithic chip. As long as the EMIB links between tiles behave consistently, software should not have to care, although it is hard to say more until we get further details on how Intel's mesh and the additional parts are stitched together. In that sense, SPR sounds like a monolithic design first, rather than a ground-up multi-die design.
Intel announced earlier this year that it will also build HBM versions of Sapphire Rapids using four HBM tiles. These will likewise be connected via EMIB, with one EMIB per HBM tile.

All about Tiles

Intel did delve into what is inside each individual tile. Each tile contains:
  • Cores, caches, and the mesh
  • A memory controller with 2x 64-bit DDR5 channels
  • UPI links
  • Accelerator links
  • PCIe links
From the presentation it appears that all four tiles are identical, with the rotational symmetry I mentioned above. Fabricating silicon this way is not as simple as mirroring the design and printing it onto a wafer: the crystal plane of the wafer constrains how the design is built, so any mirrored version has to be completely redesigned. Intel therefore confirmed that it needs two different mask sets to build Sapphire Rapids, one for each of the two dies it has to manufacture. Each of those two dies can then be rotated to build the 2x2 tile grid, as shown in the figure.
We think it is worth comparing Intel's design with AMD's first-generation EPYC, which also used a 2x2 chiplet design, although connected through the package. AMD avoided the need for multiple die designs through rotational symmetry: AMD built four die-to-die interfaces into the silicon but used only three in each rotation. That was a cheaper solution at the expense of die area (and suited AMD's finances at the time), and it kept things relatively simple. AMD's central IO die approach in newer EPYC generations does away with the problem entirely, and from my perspective, if Intel wants to scale beyond SPR it will have to move in that direction too, albeit for different reasons.
For now, each tile has a 128-bit DDR5 memory interface, giving 512 bits across all four tiles. Physically, this means eight 64-bit memory controllers, with systems having 8 or 16 memory modules per socket (technically, DDR5 puts two 32-bit channels on a single module, but the industry currently has no term to distinguish a module carrying one 64-bit channel from a module carrying two 32-bit channels; so far "channel" has often been used interchangeably with "memory slot", and that will have to change). For Sapphire Rapids versions with all four compute tiles, this is no problem at all.

Add some HBM and Optane

A key point in understanding Sapphire Rapids is that there will be a version with HBM. Intel announced this in June but gave few details. As part of Architecture Day, Intel stated that the HBM version of Sapphire Rapids will also be offered publicly and is compatible with standard Sapphire Rapids. The first customer for the SPR HBM version is Argonne National Laboratory, as part of its Aurora exascale supercomputer.
The diagram shows four HBM connections, one per compute tile. However, from a packaging perspective I do not think there is actually enough space, unless Intel has commissioned new HBM stacks that are long and thin, as drawn.
Although Intel said the HBM variant would use the same socket, even its own slides from Hot Chips suggest otherwise.
There, the HBM package measures 100 x 57 mm, while standard SPR is 78 x 57 mm. So unless Intel plans a cut-down version for the 78 x 57 mm socket, the HBM variant will sit in a different socket.
It is important to note that the HBM will act in a capacity similar to Optane: either in HBM flat mode, where HBM and DRAM are treated as equivalent, or in HBM caching mode, where it acts like an L4 cache in front of main memory. Optane on top of this can likewise run in flat mode, caching mode, or as a separate storage volume.
HBM will add to the package power, which means that if the HBM counts against the socket power limit, we are unlikely to see the best CPU frequencies paired with it. Intel has not announced how many HBM stacks or how much capacity SPR will use, but said they will sit under the heat spreader. If Intel is adopting a non-standard HBM size, the capacity is anyone's guess. We do know, however, that it will connect to the tiles via EMIB.
A side note on Optane DC Persistent Memory: Sapphire Rapids will support the new 300-series Optane design. We asked Intel whether this is simply the 200 series with a DDR5 controller and were told no, it is genuinely a new design. More details to come.

UPI Links

Each Sapphire Rapids processor will have up to four x24 UPI 2.0 links to connect to other processors in multi-socket designs. With SPR, Intel is targeting up to eight-socket platforms, and to increase bandwidth it has moved from three links in ICL to four (CLX technically had 2x3) and to a UPI 2.0 design. Intel would not elaborate on what UPI 2.0 entails, but it comes with a new eight-socket UPI topology.
Intel's current eight-socket design uses a twisted hypercube topology: two groups of four sockets each form a square, one pair of sockets connects straight across to the matching vertices of the other square, and the second pair connects with a twist.
Essentially, each CPU is directly connected to three other CPUs, while the remaining four CPUs are two hops away. With the new topology, each CPU can be directly connected to one more CPU, which pushes the design closer to a fully connected topology, although Intel has not yet explained exactly which CPUs get connected to which.
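To see why the existing twisted-cube arrangement gives the "three direct neighbors, four at two hops" property, here is a small C model of one possible twisted 3-cube wiring. The exact link assignment Intel uses is not public, so the edge list below is purely an assumption chosen to reproduce that property; the program counts, for each socket, how many others are one or two hops away.

```c
#include <stdio.h>

#define N 8   /* eight sockets */

int main(void) {
    /* Two 4-socket rings plus four cross links, two of them "twisted".
       This wiring is illustrative, not Intel's actual link assignment. */
    const int edges[][2] = {
        {0,1},{1,2},{2,3},{3,0},      /* ring A */
        {4,5},{5,6},{6,7},{7,4},      /* ring B */
        {0,4},{1,5},{2,7},{3,6}       /* cross links, last two twisted */
    };
    int adj[N][N] = {0};
    for (unsigned i = 0; i < sizeof edges / sizeof edges[0]; i++)
        adj[edges[i][0]][edges[i][1]] = adj[edges[i][1]][edges[i][0]] = 1;

    for (int s = 0; s < N; s++) {
        int one_hop = 0, two_hop = 0;
        for (int d = 0; d < N; d++) {
            if (d == s) continue;
            if (adj[s][d]) { one_hop++; continue; }
            for (int m = 0; m < N; m++)
                if (adj[s][m] && adj[m][d]) { two_hop++; break; }
        }
        printf("socket %d: %d direct links, %d sockets two hops away\n",
               s, one_hop, two_hop);
    }
    return 0;
}
```

Each socket reports three direct links and four sockets two hops away, matching the description above; a fourth UPI link per socket would presumably let some of those two-hop neighbors become direct connections.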

Security

Intel said it will give a full security update for SPR at a later date; features such as MKTME and SGX are key priorities.

Conclusion

For me, the improved core, the upgraded PCIe/DDR, and the "appears as monolithic" approach are the highlights so far. However, some very obvious questions remain to be answered: core counts, power consumption, how the lower-core-count versions work (it has even been suggested that the LCC version is actually monolithic), and what the HBM-enabled version will look like. The HBM version adds more EMIB and will be expensive, which is not a good position when AMD's pricing structure is highly competitive.
We expect that when Sapphire Rapids launches, AMD will still be in the market with Milan (or, as some speculate, a 3D V-Cache version of Milan, though that has not been confirmed), and AMD's Zen 4 will not arrive until the end of 2022. If Intel can execute and bring SPR to market on time, it will have a small window of advantage in attracting potential customers. Ice Lake's selling point has been its specific accelerator advantages rather than its raw core performance; we will have to wait and see whether Sapphire Rapids can offer more.
For years, people have expected Intel to move to a tile/chiplet strategy in the enterprise, at least on this side of the fence. Ever since AMD made it work and pushed beyond the standard silicon limits, whatever kind of "glue" it uses between its dies, Intel has had to go this way. It has taken a while, largely because manufacturing and optimizing technologies like EMIB also takes time. EMIB is genuinely impressive technology, but the more dies and bridges you put together, even at a 99% success rate per connection, the lower the overall yield. That is exactly what Intel has been working on. For the enterprise market, Sapphire Rapids is the first step.
Created on: August 31, 2021, 15:11