Original Link: https://www.anandtech.com/show/8959/ddr4-haswell-e-scaling-review-2133-to-3200-with-gskill-corsair-adata-and-crucial



For any user interested in performance, memory speed is an important part of the equation when building a new system. This applies to any user, from integrated graphics throughput to gaming and prosumer environments such as finance or oil and gas. Opinions on memory speed fall into two broad camps, from saying faster memory has no effect to insisting you ‘make sure you get at least XYZ’. Following on from our previous Haswell DDR3 scaling coverage, we have now secured enough memory kits to perform a thorough test of the effect of memory speed on DDR4 and Haswell-E.

DDR4 vs. DDR3

On the face of it, direct comparisons between DDR4 and DDR3 are difficult to make. With the switch from DDR2 to DDR3, some platforms could use both types of memory, allowing us to test both in the same environment. The current situation with DDR4 limits users to the extreme platform only, where DDR3 is not welcome (except for a few high minimum-order-quantity SKUs which are rarer than hen’s teeth). The platform dictates the memory compatibility, and the main characteristics of DDR4 are straightforward.

DDR4 brings to the table a lower operating voltage, down from 1.5 volts to 1.2 volts. This is the main characteristic touted by the memory manufacturers and those that use DDR4. It does not sound like a lot, especially when we can be dealing with systems from 300W to 1200W quite easily under Haswell-E. The quoted numbers are a 1-2W saving per module per system, which for a fully laden home-user desktop might approach 15W at the high end of savings over DDR3, but for a server farm with 1000 CPUs, this means a 15kW saving which adds up. The low voltage specification for DDR4L comes down from DDR3L as well, from 1.35 volts to 1.05 volts.

DRAM Comparison

       Low Voltage   Standard Voltage   Performance Voltage
DDR    1.80 V        2.50 V             -
DDR2   -             1.80 V             1.90 V
DDR3   1.35 V        1.50 V             1.65 V
DDR4   1.05 V        1.20 V             1.35 V

The lower voltage is also enhanced by voltage reference ICs before each memory chip in order to ensure that a consistent voltage is applied across each of them individually rather than the whole module at once. With DDR3, a single voltage source was applied across the whole module which can cause a more significant voltage drop, affecting stability. With this new design any voltage drop is IC dependent and can be corrected.

The other main adjustment from DDR3 to DDR4 is the rated speed. DDR3 JEDEC specifications started at 800 MT/s and moved through to 1600 MT/s, while some of the latest Intel DDR3 processors moved up to 1866 and AMD up to 2133. DDR4’s initial JEDEC specification for most consumer and server platforms is set at 2133 MT/s, coupled with an increase in latency, and is designed so that sustained transfers are quicker while overall latency remains comparable to that of DDR2 and DDR3. Technically there is a DDR4-1600 specification for scenarios that want bargain basement memory and are unfazed by actual performance.

As a result of this increase in speed, overall bandwidth is increased as well.

Bandwidth Comparison

       Bus Clock       Internal Rate   Prefetch   Transfer Rate    Channel Bandwidth
DDR    100-200 MHz     100-200 MHz     2n         0.20-0.40 GT/s   1.60-3.20 GB/s
DDR2   200-533 MHz     100-266 MHz     4n         0.40-1.06 GT/s   3.20-8.50 GB/s
DDR3   400-1066 MHz    100-266 MHz     8n         0.80-2.13 GT/s   6.40-17.0 GB/s
DDR4   1066-2133 MHz   100-266 MHz     8n         2.13-4.26 GT/s   17.06-34.13 GB/s
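The channel bandwidth column follows directly from the transfer rate: each transfer moves 64 bits (8 bytes) across a standard UDIMM channel. A minimal sketch of that arithmetic in Python (the function name is my own, for illustration):

```python
def channel_bandwidth_gbs(transfer_rate_gts, bus_width_bits=64):
    """Peak bandwidth in GB/s: transfers per second times bytes per transfer."""
    return transfer_rate_gts * (bus_width_bits / 8)

print(channel_bandwidth_gbs(2.133))      # DDR4-2133: ~17.1 GB/s per channel
print(4 * channel_bandwidth_gbs(2.133))  # quad-channel Haswell-E: ~68.3 GB/s
```

This is peak theoretical bandwidth; sustained figures in synthetic tests come in lower.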

Latency moves from DDR3-1600 at CL11 to DDR4-2133 at CL15, an expected jump as JEDEC tends to increase CL by two for each step up in frequency. While a latency of 15 clocks might come across as worse, the fact that those clocks tick by at 2133 MT/s means that overall performance is still comparable. At DDR3-1600 and CL11, the time to initiate a read is 13.75 nanoseconds, compared to 14.06 nanoseconds for DDR4-2133 at CL15, a 2% increase.

One thing that offsets the increase in latency is that CL15 seems to be a common standard regardless of frequency. Currently on the market we see modules ranging from DDR4-2133 CL15 up to DDR4-3200 CL15 or DDR4-3400 CL16, taking read latency down to 9.375 nanoseconds. With DDR3, we saw kits of DDR3-2400 CL10 at 8.33 nanoseconds, showing how aggressive binning over the lifetime of a product can improve efficiency.
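The latency figures quoted above come from a simple formula: CL cycles at the memory clock, where the memory clock runs at half the transfer rate (the ‘double’ in DDR). A quick sketch checking the numbers in the text (the function name is my own):

```python
def read_latency_ns(transfer_rate_mts, cas_latency):
    """Time in ns to initiate a read: CL cycles at the memory clock,
    which ticks at half the transfer rate (double data rate)."""
    clock_mhz = transfer_rate_mts / 2
    return cas_latency / clock_mhz * 1000

print(read_latency_ns(1600, 11))  # DDR3-1600 C11 -> 13.75 ns
print(read_latency_ns(2133, 15))  # DDR4-2133 C15 -> ~14.07 ns
print(read_latency_ns(3200, 15))  # DDR4-3200 C15 -> 9.375 ns
print(read_latency_ns(2400, 10))  # DDR3-2400 C10 -> ~8.33 ns
```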

Another noticeable difference from DDR3 to DDR4 is the design of the module itself.

DDR3 (top) vs DDR4 (bottom)

As with most technology updates, notches are shifted to ensure that the right product fits in the right slot, but DDR4 changes a bit more than that. DDR4 moves to a 288-pin package, up from 240 pins in DDR3. As the modules are the same length, this means a reduction in pin-to-pin distance from 1.00 mm to 0.85 mm (with a ±0.13 mm tolerance), reducing the contact area per pin.

The other big design change is the sticky-out bits in the middle. Moving from pin 35 to pin 47, and back from pin 105 to pin 117, the pin contacts get longer and the PCB beneath them extends by 0.5 mm.

This is a gradual transition rather than an abrupt step:

Initially when dealing with these modules, I had the issue of not seating them in the slot correctly when using a motherboard with single-sided latches. Over the past couple of weeks it has started to make more sense to place both ends in at the same time due to this protruding design, despite the fact that this can be harder to do when on your hands and knees in a case.

Along with the pin size and arrangement, the modules are ever so slightly taller than DDR3 (31.25 mm rather than 30.35mm) to make routing easier, and the PCB is thicker (1.2 mm from 1.0 mm) to allow for more signal layers. This has implications for future designs, which we will mention later in the review.

There are other non-obvious benefits and considerations baked into the DDR4 design to mention.

DDR4 supports a low-power auto self-refresh (listed in the documentation as LPASR) which refreshes the contents of memory as usual but uses an adaptive, temperature-based algorithm to avoid signal drift. Refresh can also be adjusted for each array independently, as the controller must support a fine-grained optimization routine that tracks which parts of the memory are in use. This has power as well as stability implications for the long-term future of DDR4 design.

Module training when the system boots is also a key feature of DDR4. During the start-up routine, the system must sweep through reference voltages to find the widest passing window for the speeds selected, rather than simply applying the voltage set in the options. Training steps through the reference voltage in increments of 0.5% to 0.8% of VDDQ (typically 1.2 V), and the set tolerance of the module must be within 1.625%. Calibration errors of one step size (9.6 mV at 1.2 V) are plausible, and the slew margin lost to calibration error must also be considered. This accounts for the greater impact of losses due to margins and tolerances, and ensures stable operation during use. The downside for the user is that the number of modules in the system affects the boot time of the device. A fully laden quad-channel Haswell-E system adds another 5-8 seconds to perform this procedure, and it cannot be circumvented through a different routine without disregarding part of the specification.

Source: Altera
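To put the training figures above into perspective, the step sizes the controller sweeps through are tiny fractions of the supply. A rough illustration of the quoted percentages, assuming the nominal 1.2 V rail (variable names are my own):

```python
VDDQ = 1.2  # nominal DDR4 I/O voltage in volts

# Training sweeps the reference voltage in steps of 0.5% to 0.8% of VDDQ
step_min_mv = 0.005 * VDDQ * 1000    # 6.0 mV per step at the fine end
step_max_mv = 0.008 * VDDQ * 1000    # 9.6 mV per step at the coarse end

# The module's set tolerance must be within 1.625% of VDDQ
tolerance_mv = 0.01625 * VDDQ * 1000  # 19.5 mV window

print(step_min_mv, step_max_mv, tolerance_mv)
```

A single coarse step (9.6 mV) is thus a meaningful slice of the 19.5 mV tolerance window, which is why calibration error matters.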

DDR4 is also designed with the future in mind. Current memory on the market, aside from what we saw with Intelligent Memory, is a monolithic die solution. The base JEDEC specification allows for 3D stacking of dies with through-silicon vias (TSVs), should any memory manufacturer wish to go down this route to increase module density. To support this there are three chip select signals, bringing the total of bank select bits to seven for a total of 128 possible banks. Current UDIMM specifications provide for up to eight stacked dies, however DDR4 is listed to support only x4/x8/x16 ICs with capacities of 2, 4, 8 and 16 Gibit (gibibit). This suggests that the stacked die configuration is more suited to devices where x-y dimensions are at a premium, or to the server market. When it comes to higher capacity modules, we have already reported that 16GB UDIMMs should be coming to market, representing an 8*16Gb dual rank arrangement. We are working to make sure we can report on these as soon as they land; however, for higher density UDIMM parts (i.e. not RDIMM or LRDIMM) we might have to start looking at newer technologies.

There are a significant number of other differences between DDR4 and DDR3, but most of these lie in the electronic engineering and design domain for memory and motherboard manufacturers, such as signal termination, extra programmable latencies and internal register adjustment. For a more in-depth read, a good Google search can yield results, although Rajinder Gill’s AnandTech piece ‘Everything You Always Wanted To Know About SDRAM But Were Afraid To Ask’ is a great place to start on general memory operation. I still go back and refer to that piece more frequently than I care to admit, and end up scratching my head until I reach bone.



The Kits and The Markets

In our Haswell DDR3 scaling article, we introduced the concept of a Performance Index in order to compare memory kits of different frequencies and latencies against overall performance. The Performance Index is calculated by the rated speed divided by the CAS latency such that:

Performance Index = Frequency / CL

At the time it came across as a good indicator of performance when buying off the shelf, although most companies do not particularly advertise the latencies on the package. Our conclusion for DDR3 on Haswell was that the higher the Performance Index the better, although with two potential options close together, the one with the higher frequency is the better choice. So for example, when given DDR3-2133 C10 (PI of 213) against DDR3-1866 C10 (PI of 187), the first one should be chosen. However, with DDR3-2133 C10 (PI of 213) and DDR3-2400 C12 (PI of 200) at the same price, the results suggest the latter is the better option.
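The decision rule above, prefer the higher Performance Index but break near-ties in favor of frequency, can be written out explicitly. A minimal sketch; the function names and the tie threshold are my own illustration, not from the article:

```python
def performance_index(frequency, cas_latency):
    """Performance Index = rated speed / CAS latency."""
    return frequency / cas_latency

def pick_kit(kit_a, kit_b, tie_margin=15):
    """Each kit is (frequency, CL). Prefer the higher PI; when the PIs
    are within tie_margin of each other, prefer the higher frequency."""
    pi_a = performance_index(*kit_a)
    pi_b = performance_index(*kit_b)
    if abs(pi_a - pi_b) <= tie_margin:
        return kit_a if kit_a[0] >= kit_b[0] else kit_b
    return kit_a if pi_a > pi_b else kit_b

print(pick_kit((2133, 10), (1866, 10)))  # -> (2133, 10): clearly higher PI
print(pick_kit((2133, 10), (2400, 12)))  # -> (2400, 12): PIs close, higher frequency wins
```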

One of the big issues the Performance Index does not take into account is price, which fluctuates weekly depending on DRAM supply but also on the rarity of a kit’s capabilities. When purchasing memory ICs from the big three (Samsung, SK Hynix, Micron), the chips themselves are only rated at a basic speed, and it is up to the module manufacturer to do further binning to find the best silicon for high speed memory kits. As a result, many of these companies will bid on certain batches with a history of high performance, and the extra time required to separate the good chips from a batch adds to the high cost of the top frequency kits. Usually it is frequency, rather than the sub-timings and latency, that defines the binning difficulty.

When it comes to DDR4, we will be taking a similar broad approach to kit designation, taking the Performance Index of each memory kit in each benchmark and attempting to find a correlation.

The Market

At launch, DDR4 kits carried the obvious premium of being a new technology, as well as being in short supply due to Intel moving up the release date. Prices at the time, as we reported, were the equivalent of $213 for a 4x4GB 2133 C15 kit, through $300 for a 4x4GB 2666 C15, up to $413 for a 4x4GB 3000 C15. Based on these numbers it would seem that the high end modules have come down in price quickly, while the lower range products have stayed similar. We took a range of pricing from Newegg to see what just over six months at market has done.

The cheapest standard kit of 2133 C15 4x4 GB comes in at $200. The best kit in this layout would be towards the bottom left, as indicated by the performance index in each square:

Here I have marked four areas, representing the low end memory in orange, the more standard in white, the performance modules in green and the super-high performance in dark green. There are currently no modules in that last group, going through all the pricing from 2x8GB kits to 8x8GB kits:

The lowest price per GB is $387 for 4x8GB of DDR4-2133 C15, at $12.09 per GB, compared to $1800 for 8x8GB of 2800 C16, which works out to $28.13 per GB.
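Price per gigabyte is the simplest way to normalize across kits of different densities. Checking the two figures above (kit prices are the Newegg snapshot quoted in the text; the function name is my own):

```python
def price_per_gb(price_usd, modules, gb_per_module):
    """Normalize a kit price to dollars per gigabyte."""
    return price_usd / (modules * gb_per_module)

print(price_per_gb(387, 4, 8))    # 4x8GB DDR4-2133 C15 -> 12.09375, the $12.09/GB quoted
print(price_per_gb(1800, 8, 8))   # 8x8GB DDR4-2800 C16 -> 28.125, the $28.13/GB quoted
```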

The interesting segments based on price alone that catch my eye in that case are:

4x4GB DDR4-2133 C13 for $213
4x8GB DDR4-2133 C13 for $400
At this point, a CAS latency below 15 is something of a novelty. It seems a little more esoteric than usual, as none of the manufacturers we spoke to even considered sampling us something of this nature. I would be interested to actually see how it performs.
4x4GB DDR4-2666 C13 for $290
4x4GB DDR4-3000 C15 for $300
I picked both of these based on the closeness of price and on Performance Index. The latter kit is something we have in for review, but similar to the previous kits listed, a CAS latency of 13 is an interesting element to the equation.
4x8GB DDR4-2800 C15 for $510 Measuring up at nearly $16 per GB, this kit mixes the elements of on-paper specifications, density and price for a nice X99 system.

The Kits

For this roundup and subsequent reviews, we received kits from almost every major memory manufacturer. G.Skill, Corsair, ADATA and Crucial were all willing to send various speeds and densities of memory, ranging from a basic 4x4 GB kit of DDR4-2133 C15 to 8x8 GB of DDR4-2133 C15, and from 8x8 GB of DDR4-2400 C16 to 4x4 GB of DDR4-3200 C16.

The main conclusions from this testing will be from the 4x4 GB modules in order to keep consistency, however the 4x8 GB and 8x8 GB results will be included for comparison. Larger modules (and more modules in a kit) tend to lead to relaxed sub-timings in order to ensure full compatibility with all CPUs in all motherboards. This means that in synthetic testing we may end up with some slightly different results, although this may differ in real-world tests.

Another point to note is module compatibility. When DDR4 first launched with Haswell-E in September 2014, compatibility issues were a problem. Intel had moved the launch from mid-September to early September very late in the day, leaving memory vendors to scramble kits to market. This gave them a shorter time to work with ASUS, GIGABYTE, MSI and ASRock to ensure no issues with motherboards, especially with high end memory. Due to this shortened timeframe there were some issues to begin with, but these should have been ironed out since. On the high speed memory front, it would also seem that early motherboard BIOSes were unable to cope with the higher speed, higher density memory kits. It is therefore important to make sure that all BIOSes are up to date when buying expensive memory sets.

DDR4 Module Comparison

          SKU                  Kit Size     Kit Speed   Sub-Timings   Voltage  PI
Corsair   CMD16GX4M4B3200C16   16 GB (4x4)  DDR4-3200   16-18-18 2T   1.35 V   200
G.Skill   F4-3000C15Q-16GRR    16 GB (4x4)  DDR4-3000   15-15-15 2T   1.35 V   200
G.Skill   F4-2800C16Q-16GRK    16 GB (4x4)  DDR4-2800   16-16-16 2T   1.20 V   175
G.Skill   F4-2666C15Q-16GRR    16 GB (4x4)  DDR4-2666   15-15-15 2T   1.20 V   177
Crucial   BLS8G4D240FSA        16 GB (4x4)  DDR4-2400   16-16-16 2T   1.20 V   150
G.Skill   F4-2133C15Q-16GRR    16 GB (4x4)  DDR4-2133   15-15-15 2T   1.20 V   142
ADATA     AX4U2400W8G16-QRZ    64 GB (8x8)  DDR4-2400   16-16-16 2T   1.20 V   150
Corsair   CMV8GX4M1A2133C15    64 GB (8x8)  DDR4-2133   15-15-15 2T   1.20 V   142

The memory in this test is as follows, starting with the 4x4GB kits and fastest/most expensive:

DDR4-3200 16-18-18 2T 4x4 GB 1.35 V Corsair CMD16GX4M4B3200C16, PI of 200
$746 on Amazon
$685 on Newegg

DDR4-3000 15-15-15 2T 4x4 GB 1.35 V G.Skill F4-3000C15Q-16GRR, PI of 200
$436 on Amazon
$300 on Newegg

DDR4-2800 16-16-16 2T 4x4 GB 1.2 V G.Skill F4-2800C16Q-16GRK, PI of 175
$305 on Amazon
$270 on Newegg

DDR4-2666 15-15-15 2T 4x4 GB 1.2 V G.Skill F4-2666C15Q-16GRR, PI of 177
$290 on Amazon
$250 on Newegg

DDR4-2400 16-16-16 2T 4x4 GB 1.2 V Crucial BLS8G4D240FSA.16FAD, PI of 150
$180 on Amazon

DDR4-2133 15-15-15 2T 4x4 GB 1.2 V G.Skill F4-2133C15Q-16GRR, PI of 142
$315 Amazon
$250 Newegg

For good measure, we also have the following kits for testing:

DDR4-2400 16-16-16 2T 4x8 GB and 8x8 GB 1.2 V ADATA AX4U2400W8G16-QRZ, PI of 150
$400 for 4x8GB

DDR4-2133 15-15-15 2T 4x8 GB and 8x8GB 1.2 V Corsair CMV8GX4M1A2133C15-ESM, PI of 142
$120 per module

We also have more kits from Crucial, Corsair, G.Skill and KLEVV incoming for when we tackle individual reviews. The purpose of this scaling piece is merely to demonstrate the general effect of speed across the modules currently on the market. As mentioned, some of those kits with a CL of less than 15 look interesting, so I will have to give Mushkin a call or get a contact at Kingston.

Test Setup

Processor: Intel Core i7-5960X ES, 8C/16T, overclocked to 4.0 GHz
Motherboard: ASUS X99 Deluxe
Cooling: Cooler Master Nepton 140XL
Power Supply: OCZ 1250W Gold ZX Series
Memory: Corsair DDR4-3200 16-18-18 4x4GB, CMD16GX4M4B3200C16
  G.Skill DDR4-3000 15-15-15 4x4GB, F4-3000C15Q-16GRR
  G.Skill DDR4-2800 16-16-16 4x4GB, F4-2800C16Q-16GRK
  G.Skill DDR4-2666 15-15-15 4x4GB, F4-2666C15Q-16GRR
  Crucial DDR4-2400 16-16-16 4x4GB, BLS8G4D240FSA
  G.Skill DDR4-2133 15-15-15 4x4GB, F4-2133C15Q-16GRR
  ADATA DDR4-2400 16-16-16 8x8GB, AX4U2400W8G16-QRZ
  Corsair DDR4-2133 15-15-15 8x8GB, CMV8GX4M1A2133C15-ESM
Memory Settings: XMP
Video Cards: MSI GTX 770 Lightning 2GB (1150/1202 Boost)
Hard Drive: OCZ Vertex 3 256GB
Case: Open Test Bed
Operating System: Windows 7 64-bit SP1


Enabling XMP

By default, memory adheres to specifications set by JEDEC (formerly the Joint Electron Device Engineering Council). These specifications state what information should be stored in the memory EEPROM, such as manufacturer information, serial number, and other useful details. Part of this is the set of standard memory speeds which a system falls back on when no other information is available. For DDR4, this means DDR4-2133 15-15-15 at 1.20 volts.

An XMP, or (Intel-developed) Extreme Memory Profile, is an additional set of values stored in the EEPROM which the BIOS can read alongside the SPD data. Most DRAM has space for two additional profiles, sometimes referred to as an ‘enthusiast’ and an ‘extreme’ profile, although most consumer oriented modules may only have one XMP profile. The XMP profile is typically the one advertised on the memory kit: if the capability of the memory deviates in any way from the specified JEDEC timings, a manufacturer must use an XMP profile.

Thus it is important that the user enables such a profile!  It is not plug and play!

As I have stated since reviewing memory, at big computing events and gaming LANs there are plenty of enthusiasts who boast about buying the best hardware for their system. If you ask what memory they are running, then actually probe the system (with CPU-Z), more often than not the user, after buying this expensive memory, has not enabled XMP. It sounds like a joke, but this happened several times at my last iSeries LAN in the UK: people boasting about high performance memory who, because they did not enable it in the BIOS, were still running at DDR3-1333 C9.

So enable XMP with your memory!

Here is how for most motherboards except the ASUS X99-Deluxe, which uses an onboard XMP switch:

Step 1: Enter the BIOS

This is typically done by pressing DEL or F2 during POST/startup. Users who have enabled fast booting under Windows 8 will have to use motherboard vendor software to enable ‘Go2BIOS’ or a similar feature.

Step 2: Enable XMP

Depending on your motherboard manufacturer, this will be different. I have taken images from the major four motherboard manufacturers to show where the setting is on some of the latest X99 motherboard models.

On any ASUS X99 board, the setting is on the EZ-Mode screen. Where it says ‘XMP’ on the left, click on this button and navigate to ‘Profile 1’:

If you do not get an EZ mode (some ROG boards go straight to advanced mode), then the option is under the AI Tweaker tab, in the AI Overclock Tuner option, or you can navigate back to EZ mode.

For ASRock motherboards, depending on which model you have, navigate to OC Tweaker and scroll down to the DRAM Timing Configuration. Adjust the ‘Load XMP Setting’ option to Profile 1.

For GIGABYTE motherboards, press F2 to switch to classic mode and navigate to the MIT tab. From here, select Advanced Frequency Settings.

In this menu will be an option to enable XMP where this arrow is pointing:

Finally on MSI motherboards, we get a button right next to the OC Genie in the BIOS to enable XMP:

I understand that setting XMP may seem trivial to most of AnandTech’s regular readers, however for completeness (and the lack of XMP being enabled at events it seems) I wanted to include this mini-guide. Of course different BIOS versions on different motherboards may have moved the options around a little – either head to enthusiast forums, or if it is a motherboard I have reviewed, I tend to post up all the screenshots of the BIOS I tested with as a guide.



CPU Real World Performance

A small note on real world testing versus synthetic testing: due to the way DRAM affects a system, there can be a large disconnect between what we observe in synthetic tests and real world results. Synthetic tests are designed to exploit one particular feature, usually in an unrealistic scenario, such as pure memory read speeds or peak bandwidth numbers. While these are good for exploring the peak potential of a system, they often do not translate into gains the way CPU speed does in common prosumer tasks. So while spending 10x on memory might show a large improvement in peak bandwidth numbers, users will have to weigh up the real world benefits to find the day-to-day difference when going for expensive hardware. Typically the limiting factor is something else in the system, such as the size of a cache, so with all the will in the world a faster read speed will not make much difference. As a result, we tend to stick to real world tests for almost all of our testing (with a couple of minor exceptions). Our benchmarks are either derived from real tasks such as transcoding a film, or come from regular software packages such as molecular dynamics running a consistent simulation.

Handbrake v0.9.9

For HandBrake, we take two videos (a 2h20 640x266 DVD rip and a 10min double UHD 3840x4320 animation short) and convert them to x264 format in an MP4 container.  Results are given in terms of the frames per second processed, and HandBrake uses as many threads as possible.

HandBrake v0.9.9 LQ Film

HandBrake v0.9.9 HQ Film

The low quality conversion is more reliant on CPU cycles available, while the high resolution conversion seems to have a very slight ~3% benefit moving up to DDR4-3000 memory.

WinRAR 5.01

Our WinRAR test from 2013 is updated to the latest version of WinRAR at the start of 2014. We compress a set of 2867 files across 320 folders totaling 1.52 GB in size – 95% of these files are small typical website files, and the rest (90% of the size) are small 30 second 720p videos.

WinRAR 5.01

The biggest difference showed a 5% gain over DDR4-2133 C15, although this appeared to be random variation.

FastStone Image Viewer 4.9

FastStone Image Viewer is a free piece of software I have been using for quite a few years now. It allows quick viewing of flat images, as well as resizing, changing color depth, adding simple text or simple filters. It also has a bulk image conversion tool, which we use here. The software currently operates only in single-thread mode, which should change in later versions of the software. For this test, we convert a series of 170 files, of various resolutions, dimensions and types (of a total size of 163MB), all to the .gif format of 640x480 dimensions. Results shown are in seconds, lower is better.

FastStone Image Viewer 4.9

No difference between the memory speeds in FastStone.

x264 HD 3.0 Benchmark

The x264 HD Benchmark uses a common HD encoding tool to process an HD MPEG2 source at 1280x720 at 3963 Kbps. This test represents a standardized result which can be compared across other reviews, and is dependent on both CPU power and memory speed. The benchmark performs a 2-pass encode, and the results shown are the average frame rate of each pass performed four times. Higher is better this time around.

x264 HD 3.0, 1st Pass

x264 HD 3.0, 2nd Pass

The faster memory showed a 2.5% gain on the first pass, but less than a 1% gain in the second pass.

7-Zip 9.2

As an open source compression tool, 7-Zip is a popular tool for making sets of files easier to handle and transfer. The software offers up its own benchmark, to which we report the result.

7-Zip 9.2

At most a 2% gain was shown by 3000+ memory.

Mozilla Kraken 1.1

One of the more popular web benchmarks that stresses various codes, we run this benchmark in Chrome 35.

Mozilla Kraken 1.1

Kraken seemed to prefer the fast 1.2 V memory, giving a 4.8% gain at DDR4-2800 C16, although this did not carry over to the faster kits.

WebXPRT

A more in-depth web test featuring stock price rendering, image manipulation and face recognition algorithms, also run in Chrome 35.

WebXPRT

The DDR4-3200 gave an 11% gain over the base JEDEC memory, although this seemed to be more of a step than a slow rise.



Professional Performance: Windows

Agisoft Photoscan – 2D to 3D Image Manipulation: link

Agisoft Photoscan creates 3D models from 2D images, a process which is very computationally expensive. The algorithm is split into four distinct phases, and different phases of the model reconstruction require either fast memory, fast IPC, more cores, or even OpenCL compute devices to hand. Agisoft supplied us with a special version of the software to script the process, where we take 50 images of a stately home and convert it into a medium quality model. This benchmark typically takes around 15-20 minutes on a high end PC on the CPU alone, with GPUs reducing the time.

Agisoft Photoscan 1.0.0

Photoscan, on paper, would offer more possibilities for faster memory to make a difference. However it would seem that the most memory dependent stage (stage 3) is actually a small part of the overall calculation and was absorbed by the natural variation in the larger stages, giving at most a 1.1% difference between times.

Cinebench R15

Cinebench R15 - Single Thread

Cinebench R15 - MultiThread

Cinebench is historically CPU dependent, giving a 2% difference from JEDEC to peak results.

3D Particle Movement

3DPM is a self-penned benchmark, taking basic 3D movement algorithms used in Brownian Motion simulations and testing them for speed. High floating point performance, MHz and IPC wins in the single thread version, whereas the multithread version has to handle the threads and loves more cores.

3D Particle Movement: Single Threaded

3D Particle Movement: MultiThreaded

3DPM is also relatively memory agnostic for DDR4 on Haswell-E, showing that DDR4-2133 is good enough.

Professional Performance: Linux

Built around several freely available benchmarks for Linux, Linux-Bench is a project spearheaded by Patrick at ServeTheHome to streamline about a dozen of these tests in a single neat package run via a set of three commands using an Ubuntu 14.04 LiveCD. These tests include fluid dynamics used by NASA, ray-tracing, molecular modeling, and a scalable data structure server for web deployments. We run Linux-Bench and have chosen to report a select few of the tests that rely on CPU and DRAM speed.

C-Ray: link

C-Ray is a simple ray-tracing program that focuses almost exclusively on processor performance rather than DRAM access. The test in Linux-Bench renders a heavy complex scene offering a large scalable scenario.

Linux-Bench c-ray 1.1 (Hard)

Natural variation gives a 4% difference, although the faster and more dense memory gave slower times.

NAMD, Scalable Molecular Dynamics: link

Developed by the Theoretical and Computational Biophysics Group at the University of Illinois at Urbana-Champaign, NAMD is a set of parallel molecular dynamics codes for extreme parallelization up to and beyond 200,000 cores. The reference paper detailing NAMD has over 4000 citations, and our testing runs a small simulation where the calculation steps per unit time is the output vector.

Linux-Bench NAMD Molecular Dynamics

NAMD showed little difference between our memory kits, peaking at 0.7% above JEDEC.

NPB, Fluid Dynamics: link

Aside from LINPACK, there are many other ways to benchmark supercomputers in terms of how effective they are for various types of mathematical processes. The NAS Parallel Benchmarks (NPB) are a set of small programs originally designed for NASA to test their supercomputers in terms of fluid dynamics simulations, useful for airflow reactions and design.

Linux-Bench NPB Fluid Dynamics

Despite the 4x8 GB results going south of the border, the faster memory does give a slight difference in NPB, peaking at 4.3% increased performance for the 3000+ memory kits.

Redis: link

Many online applications rely on key-value caches and data structure servers to operate. Redis is an open-source, scalable web technology with a broad developer base that relies heavily on memory bandwidth as well as CPU performance.

Linux-Bench Redis Memory-Key Store, 100x

When tackling a high number of users, Redis performs up to 17% better with 2800+ memory, making it our best benchmark result.



Single GTX 770 Gaming

The normal avenue for faster memory lies in integrated graphics solutions, but as Haswell-E does not have integrated graphics we are testing typical gaming scenarios using relatively high end graphics cards. First up is a single MSI GTX 770 Lightning in our Haswell-E system, running our benchmarks at 1080p and maximum settings. We take the average frame rates and minimum frame rates for each of our tests.

Dirt 3: Average FPS

Dirt 3 on GTX 770: Average FPS

Dirt 3: Minimum FPS

Dirt 3 on GTX 770: Minimum FPS

Bioshock Infinite: Average FPS

Bioshock Infinite on GTX 770: Average FPS

Bioshock Infinite: Minimum FPS

Bioshock Infinite on GTX 770: Minimum FPS

Tomb Raider: Average FPS

Tomb Raider on GTX 770: Average FPS

Tomb Raider: Minimum FPS

Tomb Raider on GTX 770: Minimum FPS

Sleeping Dogs: Average FPS

Sleeping Dogs on GTX 770: Average FPS

Sleeping Dogs: Minimum FPS

Sleeping Dogs on GTX 770: Minimum FPS

Conclusions at 1080p/Max with a GTX 770

The only real deficit observed throughout our testing is the DDR4-2133 C15 4x4GB kit dropping to 121 FPS in F1 2013 from the 126 FPS average of the other kits, a sub-5% drop from choosing the default JEDEC kit in the 4x4 configuration. Moving up to the 4x8 and 8x8 kits produces 125 FPS, and anything above 2133 C15 sits around the top result at 125-127 FPS.



2x GTX 770 SLI Gaming

Next up is a pair of MSI GTX 770 Lightning graphics cards in SLI, which may be more akin to the typical Haswell-E system. Our goal here is to provide enough frames for a full on 120 Hz or 144 Hz refresh rate, ideally at the minimum frame rate level in modern games while still attempting maximum quality settings at 1080p. Even for this system it will be a hard task, and it will be interesting to see how the different memory configurations help with this.

Dirt 3 on 2xGTX 770: Average FPS

Dirt 3 on 2xGTX 770: Minimum FPS

Bioshock Infinite on 2xGTX 770: Average FPS

Bioshock Infinite on 2xGTX 770: Minimum FPS

Tomb Raider on 2xGTX 770: Average FPS

Tomb Raider on 2xGTX 770: Minimum FPS

Sleeping Dogs on 2xGTX 770: Average FPS

Sleeping Dogs on 2xGTX 770: Minimum FPS

Conclusions at 1080p/Max with two GTX 770s

Similar to the single GPU arrangement, the only deficit worth mentioning is in the minimum frame rates in F1 2013. Here we see 114-115 FPS on all the DDR4-2133 C15 kits, compared to 124-126 FPS on everything else except the DDR4-2400 4x8GB kit, which gave 120 FPS. That amounts to a gain of nearly 10% from choosing something other than the JEDEC standard.



Comparing DDR3 to DDR4

Moving from a standard DDR3-2133 C11 kit to DDR4-2133 C15 feels like a downgrade just from looking at the numbers, regardless of the rest of the system. Ideally we want the first number, the frequency, to be high and the second number, the latency, to be low. After several years of dealing with DDR3, moving to DDR4 feels like a step backwards when judged solely on the numbers on paper.

As part of this review we have covered many different areas where DDR4 is an upgrade over DDR3, not only in terms of voltage but in some of the underlying concepts as well. This positions DDR4 well for the future, especially when it comes to density and upcoming technologies (see the next page for more information). But one question still remains: at the same frequency and latency, do they perform the same?

The only way to perform an identical comparison would be a platform that could run both DDR3 and DDR4 with the same CPU. If one comes along, we will test it, but in the meantime we can make some broad comparisons with near-identical systems.

For this test we took two Haswell-based systems and compared them against each other. For the first, we took the Haswell-E i7-5960X, cut it down to four cores with no HyperThreading, fixed the CPU speed at 4 GHz, and set the memory to DDR4-2133 14-14-14 350 2T timings. We did the same with the second system, a Haswell-based i7-4770K, moving it to 4 GHz and making sure it ran in 4C/4T mode. The OS was placed into a unique high performance profile and we ran our test suite. The only difference remaining between the two setups was the L2 and L3 cache arrangement, which unfortunately we cannot change.

In our non-gaming tests, there is one situation where DDR3 is more than 3% better and two where DDR4 is more than 3% better. It is worth noting that most of the differences, especially in the web tests and Cinebench, are actually slightly in DDR3's favor.

In the gaming tests there are similarly more +3% results on the side of DDR4. Doing a direct win count regardless of percentage, DDR4 wins 11 times to DDR3's 8, and almost all of DDR3's wins are minor except in two-way SLI. It would seem that for two-way SLI, DDR4 at least brings up some of the minimum frame rates.

Pulling out the results where the difference exceeds 3%, just to see what the exact numbers are:

On the face of it, the Hybrid result does not seem that different, whereas a full minute in Photoscan or 10 seconds in our WinRAR test feels like a real difference. In the gaming tests, moving nearer to 120 FPS or 60 FPS, especially in both of the minimum frame rate tests, is an important jump, and it happens with DDR4.
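The 3% cut-off used above is just a relative-difference filter; a minimal sketch of how such results would be pulled out, using hypothetical scores rather than our actual data:

```python
# Hypothetical (benchmark, DDR3 score, DDR4 score) triples; higher is better.
results = [
    ("Cinebench R15", 680.0, 678.0),
    ("WinRAR",         96.0, 100.0),
    ("Photoscan",     290.0, 302.0),
]

def over_threshold(results, threshold=3.0):
    """Return benchmarks where DDR4 differs from DDR3 by more than threshold percent."""
    picked = []
    for name, ddr3, ddr4 in results:
        delta = (ddr4 - ddr3) / ddr3 * 100.0  # signed percent change vs DDR3
        if abs(delta) > threshold:
            picked.append((name, round(delta, 1)))
    return picked

print(over_threshold(results))
```

With these hypothetical numbers, only the WinRAR and Photoscan entries clear the 3% bar; the Cinebench delta is well under 1% and drops out, which mirrors how most of our real results cluster inside the noise.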

Overall, comparing DDR4 to DDR3, there is little to separate the two. In a couple of small instances one is better than the other, but on those edge cases it is prudent to say that we cannot make a final decision until we can equalize the rest of the system, such as the CPU cache sizes. When we can perform such tests, we will run more numbers.



The Future of DDR4

DDR4 first launched in the enthusiast space for several reasons. On the server side, any opportunity to use less power and drive cooling costs down is a positive, so aiming at Xeons and the high-end consumer platform was priority number one. The big players in the datacenter space most likely had hardware in and running for several months before the consumer arms got hold of it. Being such a new element in the shifting dynamics of the memory market, the modules command a premium and the big purchasers got first pick. The downside comes when that premium shifts to consumers, where budgets are tighter and some of the intended benefits of DDR4, such as lower power, matter less. When we first published our Haswell-E review, the cost of a 4x4GB kit of JEDEC DRAM for even a basic eight-core system was over $250, and not much has changed since. Memory companies keep stock levels low, driving up the cost, and will only make and sell more if people start buying. At this point, Haswell-E and DDR4 are really restricted to early adopters and those with a professional requirement to go down this route.

DDR4 will start to get interesting when it reaches the mainstream consumer level, meaning regular Core i3/i5 desktops and eventually SO-DIMM variants in notebooks. The big question, as always, is when. If you believe the leaks, all arrows point towards a launch with Skylake on the Intel side, after Broadwell; most analysts agree, with the open question being how long the Broadwell platform will last on the desktop. The 14nm process node had plenty of issues, and in Q1 2015 we have started to see more Core M (Broadwell-Y) products in notebooks along with the launch of Broadwell-U, aimed at the AIO and mini-PC market (such as the NUC and BRIX) as well as laptops. This staggered launch suggests that Broadwell on the desktop should be due in the next few months, but there is no official indication as to when Skylake will hit the market, or in what form first. As always, Intel does not comment on unreleased products.

On the AMD side of the equation, despite talk of a Kaveri refresh popping up in our forums and discussions of Carrizo focusing only on the sub-45W market with Excavator cores, we look to the talk surrounding Zen, K12 and everything that points to AMD's architecture refresh, with Jim Keller at the helm, sometime around 2016. In a recent round table, Jim Keller described Zen as scaling from tablet to desktop while also probing servers. One would hope (as well as predict and imagine) that AMD is aiming for DDR4 with this platform. It makes sense to approach the memory subsystem of the new architecture from this angle, although for any official confirmation we might have to wait a few months at the earliest, when AMD starts releasing more information.

When DDR4 comes to the desktop, we will start to see a shift in the market share split between DDR4 and DDR3. The bulk memory market for desktop designs and mini-PCs is the key demographic: it will move towards an even DDR3/DDR4 split, and we can hope for price parity before then. For mainstream DDR4 adoption, the bulk markets have to be interested in the performance of the platforms that specifically require DDR4, and those platforms must remain price competitive. It essentially means that companies like G.Skill, which rely on DRAM sales for the bulk of their revenue, have to make predictions about the performance of platforms like Skylake in order to tell their investors how quickly DDR4 will take over the market. It could be the difference between 10% and 33% adoption by the end of 2015.

One question that sometimes appears alongside DDR4 is 'what about DDR5?'. It looks like there are no plans to pursue a DDR5 standard, for a number of reasons.

First, and perhaps a minor point, is the nature of the DRAM interface. It relies on a parallel connection, and if other standards are any indication of direction, a successor would probably move to a serial connection, similar to how PCI evolved into PCI Express and PATA into SATA in order to increase throughput while decreasing pin counts and simplifying design for the same bandwidth.

Second, and more importantly, there are other memory standards currently being explored in the research labs. Rather than copy a piece from ExtremeTech verbatim, I'll summarize it here. The three standards of interest, while mostly mobile focused, are:

Wide I/O 2: Designed to be placed directly on top of processors, using a large number of I/O pins connected by through-silicon vias (TSVs) and keeping frequencies down in order to reduce heat generation. This has benefits in industries where space is at a premium, saving PCB area in exchange for processor Z-height.

Hybrid Memory Cube (HMC): Similar to current monolithic DRAM dies but using stacked slices over a logic base, allowing for much higher density and much higher bandwidth within a single module. This also increases energy efficiency per bit, but introduces higher cost and higher power consumption per module.

High Bandwidth Memory (HBM): Almost a combination of the two above, aimed more at graphics, with multiple DRAM dies stacked on or near the memory controller to increase density and bandwidth. It is best described as a specialized implementation of Wide I/O 2, and should afford up to 256GB/s of bandwidth on a 128-bit bus with 4-8 stacks on a single interface.

Image from ExtremeTech

Moving some of the memory power consumption onto the processor package raises thermal issues, which means that memory bandwidth and cost might be improved at the expense of operating frequencies. Adding packages onto the processor also introduces a heavy element of cost, which might leave these specialist technologies to the super-early adopters to begin with.

Given the time from DDR4 first being considered to it actually entering the desktop market, we can safely say that DDR4 will become the standard memory option over the next four years, just as DDR3 is right now. Beyond DDR4 is harder to predict, and depends on how Intel and AMD want to approach a solution that offers higher memory bandwidth, and at what cost. Both companies will be looking at how their integrated graphics perform, as that is ultimately the biggest beneficiary of such a design. AMD has some leverage in the discrete GPU space and will be able to transfer that knowledge over to the CPU side, but Intel has a big wallet. Both Intel and AMD have experimented with eDRAM/ESRAM as extra cache levels, with Crystal Well and the Xbox One respectively, which puts less stress on external memory when it comes to processor graphics, and that leads me to predict that DDR4 will be in the market longer than DDR3 or DDR2 were.

If any of the major CPU/SoC manufacturers want to invest heavily in Wide I/O 2, HBM or HMC, we will have to wait and see. If it changes what we see on the desktop, in the mini-PC or in the laptop, we might have to wait even longer.



Conclusions on Haswell-E DDR4 Scaling

When we first start testing for a piece, it is very important to keep an open mind and not presuppose the end results. Ideally we would go double blind, but in the tech review industry that is not always possible. We knew from our DDR3 testing that, outside of integrated graphics, there are only a few edge cases where upgrading to faster memory makes sense, but that falling into the trap of slow base memory can have a noticeable impact on the overall system - as long as XMP is enabled, of course.

Because Haswell-E does not have any form of integrated graphics, the results today are fairly muted. In some ways they mirror the results we saw on DDR3, but are more indicative of the faster frequency memory at hand.

For the most part, the base advice is: aim for DDR4-2400 CL15 or better.

DDR4-2133 CL15, which has a performance index of 142, has a few benchmarks where it comes out 3-10% slower than the rest of the field. Cases in point include video conversion (HandBrake at 4K60), fluid dynamics, complex web code, and minimum frame rates in certain games.
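The performance index quoted here is simply the data rate divided by the CAS latency; a quick sketch of the arithmetic:

```python
def performance_index(data_rate_mts: int, cas_latency: int) -> int:
    """Performance index as used in the text: DRAM data rate (MT/s) over CAS latency."""
    return round(data_rate_mts / cas_latency)

print(performance_index(2133, 15))  # 142: the JEDEC baseline kit
print(performance_index(2400, 15))  # 160: the suggested minimum
```

A higher index means a better blend of bandwidth and latency, which is why DDR4-2400 CL15 (160) clears the baseline DDR4-2133 CL15 (142) on both counts.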

For professional users, we saw a number of benefits in moving to the higher memory ranges, although the gains were mostly minor: Cinebench R15 gained 2%, 7-Zip gained 2%, and our Linux fluid dynamics benchmark was up 4.3%. The only benchmark where DDR4-2800+ memory made a significant difference was Redis, a scalable in-memory key-value store. Only users with needs like this would need to consider it.
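Workloads like Redis are dominated by scattered lookups across a large working set, which is exactly where DRAM latency and bandwidth show up; a minimal illustrative sketch (not the actual Redis benchmark) of such a latency-bound access pattern:

```python
import random
import time

def pointer_chase(n=1_000_000):
    """Follow a randomly shuffled index chain: each access depends on the
    previous one, so the CPU cannot prefetch ahead and DRAM latency dominates,
    much like hash-table lookups in an in-memory key-value store."""
    chain = list(range(n))
    random.shuffle(chain)
    idx = 0
    start = time.perf_counter()
    for _ in range(n):
        idx = chain[idx]
    return time.perf_counter() - start

print(f"{pointer_chase():.3f} s for one million dependent lookups")
```

Sequential, streaming workloads hide memory latency behind prefetching; chains of dependent accesses like this do not, which is why memory speed matters far more to the database-style tests than to most of the rest of our suite.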

There is one other group of individuals where super-high frequency memory on Haswell-E makes sense – the sub-zero overclockers. For these people, relying on the best synthetic test results can mean the difference between #5 and #20 in the world rankings. The only issue here is that these individuals or teams are often seeded the best memory already. This relegates high end memory sales to system integrators who can sell it at a premium.

Personally, DDR4 offers three elements of interest. First is design: finding good-looking memory to match a system you want to show off can be a critical element when choosing components. Second is density: given that Haswell-E currently supports four memory channels at two modules per channel, a whiff of 16GB modules could be a boon for high-memory-capacity prosumers. The third element is integrated graphics, where the need for faster memory can actually greatly improve performance; unfortunately, we will have to wait for the industry to catch up on that one.
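The capacity headroom here is simple arithmetic: four channels at two modules per channel gives eight slots, so hypothetical 16GB modules would work out as:

```python
channels = 4             # Haswell-E memory channels
modules_per_channel = 2  # two DIMMs per channel
module_capacity_gb = 16  # hypothetical 16GB DDR4 modules

total_gb = channels * modules_per_channel * module_capacity_gb
print(f"{total_gb} GB maximum across {channels * modules_per_channel} slots")
```

That is double the 64GB ceiling the platform reaches with today's 8GB modules, which is the appeal for high-memory-capacity prosumers.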

At this point in time, our DDR4 testing is not yet complete. Over the next couple of weeks we will be reviewing these memory kits individually, comparing results, pricing, styling and overclockability for what it is worth. Our recent run of DDR4-3400 news from Corsair and G.Skill has also got some of the memory manufacturers interested in seeing even higher performance kits on the test bed, so we are looking forward to that. I also need to contact Mushkin and Kingston to see if those CL12/CL13 memory kits could pose a threat to the status quo.
Edit: Mushkin actually emailed me this morning about getting some product for review.

We have a couple of updates to our testing suite in mind as well, particularly on the gaming side, and are waiting for new SSDs and GPUs to arrive before switching some of our game tests over to something more recent, perhaps at a higher resolution as well. When that happens, we will post more numbers to digest.

 
