Although Cypress-based products are about to be shelved, they remain the reference point for judging the new architecture. The AMD Barts die inherited the main advantages of the previous architecture, but its performance is still not high enough to compete with AMD Cypress. It was never meant to: Barts-based graphics solutions are positioned slightly below Cypress. Cayman is a different story. The new GPUs have to be faster than Cypress in order to push it out of AMD's product line completely.
Since Cayman dies are produced on a 40 nm process, AMD engineers faced a hard task: deliver maximum performance within a limited die size, which also means a limited transistor budget.
After hard work, AMD has finally found the most efficient way to organize the compute pipeline of the graphics processor:
The new architecture was completely rebuilt:
- The GPU has dual graphics engines
- The central element of the processor, the stream processors (SP), use a VLIW4 architecture instead of VLIW5; the shader block features 24 SIMDs and 96 Texture Mapping Units (TMU)
- In the back end, the Raster Operation Processors (ROP) have been upgraded
Cayman inherited the width of its memory interface from its predecessor. AMD decided not to increase it, so there are still the same four 64-bit memory controllers, which give a 256-bit memory bus in total. To compensate for the limited throughput of the memory interface, AMD Cayman graphics cards will have faster GDDR5 memory. The controller itself has been slightly upgraded too; the details are described below.
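The relationship between bus width, memory speed, and throughput can be sketched in a few lines. This is only an illustration: the four 64-bit controllers come from the text, while the per-pin data rates in `HYPOTHETICAL_RATES_GBPS` are placeholder values chosen for the example, not figures for any actual card.

```python
def bus_width_bits(controllers: int, bits_per_controller: int) -> int:
    """Total memory bus width is the sum of the controller widths."""
    return controllers * bits_per_controller


def bandwidth_gb_per_s(bus_bits: int, data_rate_gbps_per_pin: float) -> float:
    """Peak bandwidth: bus width in bytes times the per-pin data rate."""
    return bus_bits / 8 * data_rate_gbps_per_pin


# Four 64-bit controllers, as described in the text.
WIDTH = bus_width_bits(4, 64)

# Assumed effective data rates (Gbit/s per pin), for illustration only.
HYPOTHETICAL_RATES_GBPS = (4.0, 5.0, 5.5)

if __name__ == "__main__":
    print(f"bus width: {WIDTH} bit")
    for rate in HYPOTHETICAL_RATES_GBPS:
        print(f"{rate} Gbit/s per pin -> {bandwidth_gb_per_s(WIDTH, rate):.1f} GB/s peak")
```

With the bus width fixed, peak bandwidth scales linearly with the memory data rate, which is why faster GDDR5 is the only lever left on the interface side.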
One of the main drawbacks of the Cypress architecture is its weak tessellation module. Tessellation performance was significantly improved in AMD Barts, but that was still not enough to fight the even faster graphics cards of AMD's main rival. This led to drastic measures during the development of the Cayman graphics processor: engineers went with dual graphics engines instead of increasing the size of a single one:
This approach is fully justified: the efficiency of the front end has increased together with data throughput. In addition, the number of tessellation modules, now upgraded to the 8th generation, has been doubled as well.
The body of the graphics processor has changed noticeably too. The new VLIW4 architecture of the stream processors reduced the area of each SIMD by 10% while delivering the same performance as the previous VLIW5 architecture:
At first sight, it looks as if the SPs of the new architecture have simply been cut down:
However, it is not that simple. The functionality of each module has been increased, and all modules are now equivalent. For comparison, here is a picture of an SP of the VLIW5 architecture:
One of the main reasons for switching to the new SP architecture was the limit on die size and transistor count, together with the fact that applications very often did not use the asymmetric VLIW5 architecture at full capacity because of difficulties in code optimization. Besides the change in size, VLIW4 also brings a theoretical performance boost in double-precision operations, plus several improvements for developers, for example a less complex scheduler, simpler register management for the compiler, and some simplification thanks to the symmetry of the architecture.
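The utilization argument can be illustrated with a toy model. Assume the compiler sees a sequence of steps, each offering some number of mutually independent operations, and must pack them into fixed-width VLIW bundles. The `GROUPS` list below is invented for the example and does not come from any real shader; the model also ignores the special-function slot of VLIW5 and all scheduling details.

```python
from math import ceil


def pack(groups, width):
    """Pack each group of independent ops into bundles of `width` slots.

    Returns (bundle_count, slot_utilization). Ops from different groups
    are assumed to depend on each other and cannot share a bundle.
    """
    bundles = sum(ceil(g / width) for g in groups)
    ops = sum(groups)
    return bundles, ops / (bundles * width)


# Hypothetical number of independent ops available at each step.
GROUPS = [4, 3, 5, 2, 4, 1]

if __name__ == "__main__":
    for width in (5, 4):
        bundles, util = pack(GROUPS, width)
        print(f"VLIW{width}: {bundles} bundles, {util:.1%} of slots used")
```

In this toy case the narrower unit needs one more bundle but leaves fewer slots idle, which is the kind of trade-off that makes a smaller, symmetric unit attractive when die area is the constraint.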
The back end, featuring upgraded ROPs, has doubled its compute power for 16-bit integer operations, while performance for 32-bit floating-point operations has increased 2 to 4 times. Write operations are now grouped:
As we see, the bottleneck of the graphics processor has been widened. In theory, the time window for each block has been increased, allowing the graphics processor to be utilized more efficiently. However, changes in the crossbar and buffer organization were not presented in the slides, and without them it seems that all these enhancements may remain just theory. Whether there were any such changes or not, only AMD engineers can answer; we will just discuss the information that is available.
As we already mentioned, Cayman cards are equipped with even faster memory to compensate for the limited throughput of the memory interface. Plans for a memory controller upgrade have existed for a long time, and they were finally implemented in Cayman, but for a different reason…
Following recent developments, AMD decided to put its purchase of ATI to work for its processor division, mainly targeting general-purpose computing on GPUs (GPGPU). According to AMD, a crisis has hit the world of CPUs: further CPU performance growth is limited by the technological aspects of production. And at a time when everyone was wondering what to do next, AMD found the solution “under its nose.” According to AMD's plans, graphics cards can play a key role in boosting CPU performance. It is true that the GPU architecture has an advantage in arithmetic operations. NVIDIA has been promoting its graphics accelerators with its own general-purpose computing platform, CUDA, which is quite successful too. Losing market share at that time, AMD concentrated its efforts on industry-standard platforms. On the one hand, the company strongly supports and promotes the open standard OpenCL; on the other hand, it does its best to use all the general-purpose computing capabilities implemented in the DirectX 11 API, namely DirectCompute. The GPGPU wars should be discussed separately. Now, back to graphics cards.
So, Cayman graphics processors bring quite serious improvements in non-graphics computing:
- The dispatcher is now asynchronous. This enables multi-threaded compute execution and allows each kernel to have its own command queue and its own protected virtual address domain, which is important too.
- The memory controller features dual bidirectional DMA engines, mainly prompted by the use of a 256-bit interface, which is considered a bottleneck.
- Coalescing of shader read ops
- Fetch direct to LDS
- Stream dispatcher improvements
- Faster double-precision ops. But what for? In the world of ordinary graphics, double-precision ops are used very rarely. On the other hand, in the world of professional graphics these ops are useless without ECC. Let us take this as a delicate hint at error correction support in future generations of AMD graphics.
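To give an idea of why coalescing of read operations matters, here is a minimal sketch that counts how many memory segments a group of threads touches. The 64-byte segment size and the 16-thread group are assumptions made for the example; they are not taken from AMD documentation, and real hardware rules are more involved.

```python
def transactions(addresses, segment_bytes=64):
    """Count distinct memory segments touched by a set of byte addresses.

    Reads that fall into the same segment can, in this simplified model,
    be served by a single coalesced transaction.
    """
    return len({addr // segment_bytes for addr in addresses})


THREADS = 16  # assumed group size, for illustration only

# Each thread reads a 4-byte word; neighbours read adjacent words.
contiguous = [t * 4 for t in range(THREADS)]

# Each thread reads a word 64 bytes away from its neighbour's.
strided = [t * 64 for t in range(THREADS)]

if __name__ == "__main__":
    print("contiguous reads:", transactions(contiguous), "transaction(s)")
    print("strided reads:   ", transactions(strided), "transaction(s)")
```

Under these assumptions the contiguous pattern collapses into a single transaction while the strided one needs one per thread, so hardware that merges neighbouring reads relieves pressure on the memory interface.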
Actually, we are just happy for AMD's success in general-purpose computing; the only thing left to do is to make it part and parcel of software. Well, fellow programmers, are you ready for general-purpose computing? Then we are coming after you!
Leaving irony aside, we move on to the improvements in rendering quality.