CFD Online Discussion Forums


ghost82 September 27, 2013 10:00

ECC vs. non ECC ram: My opinion
 
Hi cfd users!
I would like to share my opinion about ECC vs. non-ECC RAM.

I recently bought a new workstation:

- double intel xeon E5-2630
- Asus Z9PE-D8 WS
- Nvidia quadro 600
- 64 gb ram (I got ecc and non ecc to test them)

Non-ecc ram: Corsair valueselect 8x8gb (cmv8gx3m1a1333c9)
ecc ram: Samsung 8x8gb (M393B1K70CH0-CH9)

Both types are DDR3 and run at 1333 MHz (PC3-10600).

I read in this forum that non-ECC RAM works well for CFD and that ECC is not a must.

On the internet I read about where ECC is useful, and about cosmic rays, so my first impression was that ECC does not offer much over non-ECC.

But in my opinion, and from my tests, ECC RAM is a must:
with my system and the latest ansys 14.7, running in parallel on all 12 physical cores with a mesh of about 1.5 million cells, Fluent crashed every 2-3 hours; the errors in the log file were very generic.
However, a couple of hours of memtest86+ on the non-ECC RAM showed no errors.

Then I switched to the ECC RAM: same mesh and same cores, and no errors at all after 3 continuous days.

So, in my opinion, if you buy a new workstation: go for ECC RAM!!!

Daniele

HMN September 27, 2013 11:07

How can you be sure that the problem comes from the non-ECC memory modules?
ECC memory modules need extra storage for parity bits that check the integrity of the data and can correct some errors...
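
To make the "extra storage for parity bits" concrete, here is a toy sketch in Python of single-bit error correction with a Hamming(7,4) code. It is only an illustration: real ECC DIMMs protect 64-bit words with 8 check bits using a wider code, and the function names below are made up for this example.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); position 0 means no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # 1-based index of the flipped bit
    if position:
        c[position - 1] ^= 1  # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], position


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1  # simulate one bit flipping while the word sits in memory
recovered, where = hamming74_decode(stored)
print(recovered == word, where)
```

Three check bits protect four data bits here, so any single flipped bit is located and repaired. Two flipped bits would be mis-corrected by this toy code; ECC memory adds a further parity bit so that double errors are at least detected.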

Is it really necessary? I use ansys 14.5.7 on a computer without ECC memory and get no errors.


By the way, you cannot have ansys 14.7. I think you mean 14.5.7. :D

ghost82 September 27, 2013 11:24

Quote:

Originally Posted by HMN (Post 453896)
How can you be sure that the problem comes from the non-ECC memory modules?
ECC memory modules need extra storage for parity bits that check the integrity of the data and can correct some errors...

Is it really necessary? I use ansys 14.5.7 on a computer without ECC memory and get no errors.


By the way, you cannot have ansys 14.7. I think you mean 14.5.7. :D

Yes, ansys 14.5.7 :)
I'm sure because I ran the same case on the same hardware several times, changing only the memory modules.
I noticed that in serial mode I don't get any errors with the non-ECC modules; the problems begin with parallel calculation.
For that particular case ECC is a must for me, as I cannot restart the simulation every 2-3 hours.

Daniele

kyle September 27, 2013 21:25

I run a cluster with 15 quad core i7 CPUs, and it seems like 1 crash a week is of the "random" variety. These are crashes that don't happen again when you restart the run. I have about 50% utilization.

Even if all of those crashes are due to non-ECC memory, it still isn't enough to justify the additional cost and slower speed of ECC memory.
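
For a rough sense of scale, the crash rates mentioned in this thread can be put on a common footing as busy core-hours per crash. This is only back-of-the-envelope arithmetic: it assumes "50% utilization" means half of all core-hours are in use, and it takes ghost82's figure of one crash every 2-3 hours on 12 cores at face value.

```python
HOURS_PER_WEEK = 7 * 24


def core_hours_per_crash(cores, hours_between_crashes, utilization=1.0):
    """Busy core-hours accumulated, on average, between two crashes."""
    return cores * hours_between_crashes * utilization


# 15 quad-core nodes, about one "random" crash per week, 50% utilization
cluster = core_hours_per_crash(15 * 4, HOURS_PER_WEEK, utilization=0.5)

# 12 cores, one crash every 2 to 3 hours
workstation_low = core_hours_per_crash(12, 2)
workstation_high = core_hours_per_crash(12, 3)

print(cluster, workstation_low, workstation_high)
print(cluster / workstation_high, cluster / workstation_low)
```

Under these assumptions the cluster gets about 5040 core-hours per crash and the workstation 24 to 36, so the workstation was failing roughly 140 to 210 times more often per core-hour. That gap suggests something other than the occasional random bit flip.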

wyldckat September 28, 2013 10:12

Greetings to all!

Quote:

Originally Posted by kyle (Post 453954)
Even if all of those crashes are due to non-ECC memory, it still isn't enough to justify the additional cost and slower speed of ECC memory.

It's just a matter of weighing the costs against the benefits. The experience in our office is that results are always needed with the utmost urgency, so a crash overnight or over the weekend is simply unacceptable.

And it's bad enough that machines can crash on their own for some hardware reason or other (example: http://whatif.xkcd.com/63/, section "10 Exabytes"). Having non-ECC RAM cause additional frequent crashes might not be acceptable in some situations.

But hey, few people know that the quality of the electricity supply can play a very important role in cluster environments. ;)

As for the original post: the problem might have been something that wasn't properly configured in the BIOS, or perhaps the RAM modules simply were not compatible with the motherboard (yes, that can happen!).
And memtest86+ is no longer an accurate way to assess if RAM is OK or not. This is why Google has made available the stressapptest utility: http://code.google.com/p/stressapptest/

Best regards,
Bruno

kyle September 28, 2013 14:20

You could just have an extremely simple script to restart from the last save file. If it crashes on the same iteration as before, then give up.

If your runs are urgent then that is all the more reason not to buy ECC memory and the incredibly expensive CPUs and motherboards you need to use it. For any given hardware budget you can, conservatively, get at least double the speed if you do not purchase enterprise class hardware.

This starts to break down once you get to a massive system where data is hopping across multiple switches, but unless you are Boeing or Lockheed, you probably aren't working at that scale. <400 cores, I'd stick with i7's and overclocked low-latency non-ECC memory.
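
The restart script mentioned above could look something like the following Python sketch. It is not tied to any particular solver: `run_solver` is a hypothetical callable that launches the solver from a given saved iteration and returns its exit code, and the save-file naming (`...-<iteration>.dat`) is an assumption. The crash point is approximated by the newest save file, so "crashed on the same iteration as before" becomes "no new save file since the previous crash".

```python
import re
from pathlib import Path
from typing import Callable, Optional

SAVE_NAME = re.compile(r"-(\d+)\.dat$")  # assumed naming, e.g. "case-1200.dat"
_NO_CRASH_YET = object()


def latest_saved_iteration(save_dir: Path) -> Optional[int]:
    """Highest iteration number found in the save-file names, or None."""
    iterations = []
    for path in Path(save_dir).iterdir():
        match = SAVE_NAME.search(path.name)
        if match:
            iterations.append(int(match.group(1)))
    return max(iterations, default=None)


def run_with_restarts(run_solver: Callable[[Optional[int]], int],
                      save_dir: Path, max_restarts: int = 20) -> bool:
    """Rerun the solver from the last save until it finishes or gets stuck."""
    previous_crash = _NO_CRASH_YET
    start_from = latest_saved_iteration(save_dir)
    for _ in range(max_restarts + 1):
        if run_solver(start_from) == 0:
            return True  # clean exit: the run finished
        crashed_after = latest_saved_iteration(save_dir)
        if crashed_after == previous_crash:
            return False  # no progress since the last crash: give up
        previous_crash = crashed_after
        start_from = crashed_after
    return False  # restart budget used up
```

In practice `run_solver` would wrap whatever command starts the solver in batch mode, for example through `subprocess.run(...).returncode`; the exact command line depends on the solver and is deliberately left out here.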

HMN September 30, 2013 11:19

Quote:

Originally Posted by kyle (Post 454076)
You could just have an extremely simple script to restart from the last save file. If it crashes on the same iteration as before, then give up.

If your runs are urgent then that is all the more reason not to buy ECC memory and the incredibly expensive CPUs and motherboards you need to use it. For any given hardware budget you can, conservatively, get at least double the speed if you do not purchase enterprise class hardware.

This starts to break down once you get to a massive system where data is hopping across multiple switches, but unless you are Boeing or Lockheed, you probably aren't working at that scale. <400 cores, I'd stick with i7's and overclocked low-latency non-ECC memory.

Sorry for the newbie question, but what should the script look like?
Is it something that you can set up automatically for every project? I am still a newbie and don't use scripts. :o

Can this code be called from my Visual Basic/Excel application?

Thanks

JBeilke September 30, 2013 16:34

Quote:

Originally Posted by ghost82 (Post 453898)
Yes, ansys 14.5.7 :)
I'm sure because I ran the same case on the same hardware several times, changing only the memory modules.
I noticed that in serial mode I don't get any errors with the non-ECC modules; the problems begin with parallel calculation.
For that particular case ECC is a must for me, as I cannot restart the simulation every 2-3 hours.

Daniele

There might just be a problem with one of your non ECC modules. I had similar problems some time ago. After running stressapptest and replacing the broken module I had no more crashes.

siefdi September 30, 2013 22:28

Quote:

I recently bought a new workstation:

- double intel xeon E5-2630
- Asus Z9PE-D8 WS
- Nvidia quadro 600
- 64 gb ram (I got ecc and non ecc to test them)

Non-ecc ram: Corsair valueselect 8x8gb (cmv8gx3m1a1333c9)
ecc ram: Samsung 8x8gb (M393B1K70CH0-CH9)

Both types are DDR3 and run at 1333 MHz (PC3-10600).

Well, if you have this board (Z9PE-D8 WS) and Samsung ECC DDR3 1333 MHz RAM, I would recommend overclocking the memory to 1600 MHz through the BIOS settings (it runs stably in my system, which has almost the same configuration as yours, and I get about a 30% performance increase in my OpenFOAM calculations). Strangely enough (at least to me), I could not do the same with the non-ECC modules, even though they are originally rated for up to 1866 MHz.

+1 for ECC :)

Regards,
siefdi

evcelica October 1, 2013 09:25

Quote:

Originally Posted by JBeilke (Post 454336)
There might just be a problem with one of your non ECC modules. I had similar problems some time ago. After running stressapptest and replacing the broken module I had no more crashes.

Correct, this may be a problem with the memory modules themselves, not so much ECC vs non-ECC in general.
Crucial does make some ECC memory rated for 1866 MHz, with a CAS latency (CL) of 13.

wazoo42 October 1, 2013 16:13

You should have options to turn several ecc options off in the bios. Then you could run the tests again with the ECC ram and see if it crashes.

evcelica October 3, 2013 06:49

Quote:

Originally Posted by wazoo42 (Post 454564)
You should have options to turn several ecc options off in the bios. Then you could run the tests again with the ECC ram and see if it crashes.

That's actually an excellent idea. It would show a real ECC vs non-ECC with the same memory sticks.

ghost82 October 3, 2013 09:10

Quote:

Originally Posted by wazoo42 (Post 454564)
You should have options to turn several ecc options off in the bios. Then you could run the tests again with the ECC ram and see if it crashes.

Unfortunately in the bios I can see "ECC Enabled", but I cannot modify it :(

ghost82 October 7, 2013 10:52

Quote:

Originally Posted by siefdi (Post 454363)
Well, if you have this board (Z9PE-D8 WS) and Samsung ECC DDR3 1333 MHz RAM, I would recommend overclocking the memory to 1600 MHz through the BIOS settings (it runs stably in my system, which has almost the same configuration as yours, and I get about a 30% performance increase in my OpenFOAM calculations). Strangely enough (at least to me), I could not do the same with the non-ECC modules, even though they are originally rated for up to 1866 MHz.

+1 for ECC :)

Regards,
siefdi

But my processors only support 1333 MHz, so I think it is not useful.
What is/are your cpu(s)?

siefdi October 7, 2013 20:07

Quote:

But my processors only support 1333 MHz, so I think it is not useful.
What is/are your cpu(s)?

Ah, my bad. Sorry, I didn't check your CPU's spec before I wrote my previous comment. I am working with the E5-2660, which supports 1600 MHz.

Regards,
siefdi

ghost82 October 11, 2013 11:43

I noticed that I have some errors in the cortexerror.log file:

Code:

Error [cortex] [time 10/7/13 0:29:23]
C:\PROGRA~1\ANSYSI~1\v145\fluent\fluent14.5.7\win64\3ddp\fl1457s.exe received fatal signal ()
1. Note exact events leading to error.
2. Save case/data under new name.
3. Exit program and restart to continue.
4. Report error to your distributor.

Error [cortex] [time 10/7/13 0:32:33]
C:\PROGRA~1\ANSYSI~1\v145\fluent\fluent14.5.7\win64\3ddp\fl1457s.exe received fatal signal ()
1. Note exact events leading to error.
2. Save case/data under new name.
3. Exit program and restart to continue.
4. Report error to your distributor.

Error [cortex] [time 10/7/13 0:52:45] ‡fl

Error [cortex] [time 10/7/13 1:34:47] ‡fl

Error [cortex] [time 10/7/13 1:46:29] ‡fl

Error [cortex] [time 10/7/13 1:56:52] ‡fl

Error [cortex] [time 10/7/13 2:8:49] ‡fl

Error [cortex] [time 10/7/13 2:11:16]
C:\PROGRA~1\ANSYSI~1\v145\fluent\fluent14.5.7\cortex\win64\cx1457.exe received fatal signal ()
1. Note exact events leading to error.
2. Save case/data under new name.
3. Exit program and restart to continue.
4. Report error to your distributor.

Error [cortex] [time 10/7/13 19:51:13] ‡fl

Error [cortex] [time 10/7/13 19:57:50] ‡fl

Error [cortex] [time 10/8/13 23:55:1] ‡fl

Error [cortex] [time 10/9/13 7:55:24]
C:\PROGRA~1\ANSYSI~1\v145\fluent\fluent14.5.7\cortex\win64\cx1457.exe received fatal signal ()
1. Note exact events leading to error.
2. Save case/data under new name.
3. Exit program and restart to continue.
4. Report error to your distributor.

This type of error

C:\PROGRA~1\ANSYSI~1\v145\fluent\fluent14.5.7\win64\3ddp\fl1457s.exe received fatal signal ()
1. Note exact events leading to error.
2. Save case/data under new name.
3. Exit program and restart to continue.
4. Report error to your distributor.

sometimes appears when I'm exiting: the window closes and everything seems OK, but this error is written to the log file.

The second type of error

Error [cortex] [time 10/7/13 19:57:50] ‡fl

appears at random times.

When I had the non-ECC RAM, Fluent crashed with this type of error; now, with ECC, the simulation continues without problems and the error is only logged in the file.

Am I being hit by cosmic rays?? :)

Daniele
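
To keep track of how often each kind of entry shows up, a log like the one above could be tallied with a few lines of Python. This is only a sketch: the function name and the two labels are made up, and it assumes every entry starts with an `Error [cortex] [time ...]` line, with the "received fatal signal" text, if present, on the line right after it.

```python
import re
from collections import Counter

ENTRY = re.compile(r"^Error \[cortex\] \[time [^\]]+\]")


def tally_cortex_errors(log_text):
    """Count log entries with and without a 'received fatal signal' line."""
    counts = Counter()
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if not ENTRY.match(line):
            continue  # not the start of an entry
        following = lines[i + 1] if i + 1 < len(lines) else ""
        if "received fatal signal" in following:
            counts["fatal signal"] += 1
        else:
            counts["bare entry"] += 1
    return counts
```

Applied to the excerpt in the Code block above, this should count 4 fatal-signal entries and 8 bare ones.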

JBeilke October 11, 2013 13:37

Did you check your non ecc rams using stressapptest? It is meaningless to compare broken non ecc ram modules to anything else.

jmcentee October 23, 2013 06:49

I think the intel xeon only supports ecc ram.

wyldckat October 26, 2013 05:40

Quote:

Originally Posted by jmcentee (Post 458481)
I think the intel xeon only supports ecc ram.

I guess that it's best to quote the manufacturer on this one. Here's an example: http://www.intel.com/cd/channel/rese...eon/440799.htm - "DDR3 Memory for the Intel Xeon Processor 5600 Series"
Quote:

Multiple DDR3 DIMM types are supported:
  • Registered DIMM (RDIMM)
  • Unbuffered DIMM (UDIMM) Error-Correcting Code (ECC)
  • Unbuffered DIMM (UDIMM) Non Error-Correcting Code (Non-ECC)

More specifically, for the CPU reported by the original poster, the specs page for E5-2630 is this: http://ark.intel.com/products/64593/...-QPI?q=e5-2630 - it indicates that ECC is supported and that it will only work if both the CPU and the chipset support it.

The chipset is embedded into the motherboard, so the limitation might actually come from said motherboard, in either direction, i.e. ECC only or non-ECC only.
Another limitation in some cases is that certain memory modules are not compatible with the motherboard. This is why motherboard vendors usually publish a list of compatible memory modules for each motherboard.

Let me see if I can find a motherboard that specifically says that only ECC is supported... mmm... apparently there shouldn't exist such a motherboard/chipset, as indicated here: http://www.intel.com/support/motherb.../cs-009023.htm


I did a bit more research and found out that the RAM that the original poster used is meant for dual- and triple-channel motherboards: http://www.corsair.com/en/memory-by-...m1a1333c9.html
Quote:

Designed for use with all DDR3 motherboards with two or three memory channels

While the CPU is quad-channel: http://ark.intel.com/products/64593
Quote:

# of Memory Channels 4

So perhaps this is the real reason why it doesn't work on his box. The RAM simply wasn't designed for quad-channel.

ghost82 February 13, 2014 10:32

Updates on this topic:
since I upgraded my workstation to 2x Xeon E5-2687W, I have read some useful info about my motherboard, the Asus Z9PE-D8 WS: several users around the internet report problems with non-ECC RAM on this board, even though Asus claims that it is compatible with non-ECC memory.
So my problem could be related to my motherboard/BIOS version and not to ECC vs. non-ECC RAM.
Anyway, the non-ECC RAM was sold and the buyers are still happy with it.

Daniele


All times are GMT -4.