Inconsistent SIGSEGV: memory access exception

hiddenbunny · April 18, 2012, 09:41

Hi guys,

I have read through another thread about SIGSEGV: memory access exception and have checked my mesh and cell sizes. I believe it's a different cause than others have encountered.

I have 2 machines which run identical meshes but different BCs for parametric analysis, and this error only occurs on one machine, and only sometimes. I mean that one run will fail with this error, then if I reload the case and run it again, it might work. Sometimes I get 3-4 failures in a row, all at different iterations/time steps; sometimes I won't get it at all for 3-4 runs. It's driving me nuts...

Is it possible we just have a bad stick in there, and whenever the process accesses that bad sector it crashes? At this point I am fairly certain it is not my model, since this error never happens on the other machine I am running. Again, all the cases have the same mesh, just different BCs.

But I don't want to go out and buy RAM just on a hunch here.

Any suggestions would help.

thanks!

- James

ryancoe · April 18, 2012, 10:31

I have also had this error occur seemingly randomly at times. I'm curious if anyone else out there has a solution.

rwryne · April 18, 2012, 11:43

This error is probably the most frustrating one to me, because it doesn't provide any hints as to what is going wrong... and to top it off, there are lots of things that can cause it.

Examples:

One of my simulations was having this only yesterday. I spent a whole week trying to figure out what was wrong. In the end, the culprit was... a mass flow report, which was set to "initial surface" rather than "volume mesh".

Another one was having a similar situation to yours: if I submitted it across say X cores, it would not run. But with X/2 cores, it would. I had no explanation for this.

Yet another one would not run at all on my cluster. I opened it locally, and did a single iteration locally then submitted it to cluster. Ran fine after that.

arjun · April 18, 2012, 19:03

Quote:

Originally Posted by hiddenbunny

Hi guys,

I have read through another thread about SIGSEGV: memory access exception and have checked my mesh and cell sizes. I believe it's a different cause than others have encountered.

I have 2 machines which run identical meshes but different BCs for parametric analysis, and this error only occurs on one machine, and only sometimes. I mean that one run will fail with this error, then if I reload the case and run it again, it might work. Sometimes I get 3-4 failures in a row, all at different iterations/time steps; sometimes I won't get it at all for 3-4 runs. It's driving me nuts...

Is it possible we just have a bad stick in there, and whenever the process accesses that bad sector it crashes? At this point I am fairly certain it is not my model, since this error never happens on the other machine I am running. Again, all the cases have the same mesh, just different BCs.

But I don't want to go out and buy RAM just on a hunch here.

Any suggestions would help.

thanks!

- James

If this happens then there is a good chance that this is a bug in the program. You should report this to your local support.

Based on my experience with programming CFD codes, my guess is that this happens when STAR-CCM+ tries to delete memory that was never allocated. Some compilers and build settings happen to initialize variables to 0. Others do not.
So if you try to delete an array whose pointer was not initialized to 0, it might create a problem.

This would work something like this.

real *Array;   // should be initialized to 0

delete [] Array;
(works fine if Array was initialized to 0; gives the error you mentioned if it was not)

If you are wondering why anyone would write it like this: the first time, the array is only declared as real *Array; but in the remaining iterations the code might be:

delete [] Array;
Array = new real [ size ] ;

All this is system and compiler dependent.
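To make the pattern above concrete, here is a minimal C++ sketch. It is not STAR-CCM+ code, only an illustration of the general idea. In standard C++, delete [] on a null pointer is a no-op, whereas delete [] on a pointer that was never initialized is undefined behaviour, so it may crash in one run and appear to work in the next. The names WorkBuffer and make_buffer, and the alias real = double, are assumptions made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using real = double;  // assumption: "real" stands for a floating-point type

// Hypothetical per-iteration work buffer that follows the delete/new pattern.
class WorkBuffer {
public:
    WorkBuffer() = default;
    ~WorkBuffer() { delete[] data_; }           // safe even if resize() was never called
    WorkBuffer(const WorkBuffer&) = delete;     // avoid double deletes through copies
    WorkBuffer& operator=(const WorkBuffer&) = delete;

    // Free the old storage (a no-op on the first call, because data_ is
    // nullptr) and allocate a zero-filled array of the requested size.
    void resize(std::size_t size) {
        delete[] data_;
        data_ = nullptr;
        size_ = 0;
        data_ = new real[size]();               // () value-initializes: all zeros
        size_ = size;
    }

    real* data() { return data_; }
    std::size_t size() const { return size_; }

private:
    real* data_ = nullptr;  // explicitly initialized, never left indeterminate
    std::size_t size_ = 0;
};

// Same idea with std::vector, which allocates and frees its storage itself.
inline std::vector<real> make_buffer(std::size_t size) {
    return std::vector<real>(size, 0.0);
}
```

Calling resize() once per iteration, for example buf.resize(4) followed later by buf.resize(8), then never touches an indeterminate pointer. Using std::vector removes the manual delete [] altogether.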

abdul099 · April 28, 2012, 07:03

Quote:

Originally Posted by arjun

Based on my experience with programming CFD codes, my guess is that this happens when STAR-CCM+ tries to delete memory that was never allocated.

I think not only when deleting some memory, but also when trying to access memory which isn't already allocated. E.g. one thread allocates a huge array. Another thread tries to put a value or get a value from an address which is not yet completely initialised by the first thread.

I usually get this error when a lot changes, e.g. an interface needs an update due to moving meshes, so the face count or the vertex count of a cell or boundary changes.

I totally agree with arjun that it is most likely a bug which should be reported.
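The two-thread scenario described above can be sketched in C++ as follows. This is only an illustration under assumptions, not how STAR-CCM+ is implemented: the function name sum_after_handoff and the variable names are invented here. One thread allocates and fills an array, and the other reads it only after a flag has been published, so it never touches storage that is still being set up.

```cpp
#include <atomic>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical demonstration: the consumer waits for the "ready" flag
// instead of reading the array while the producer is still building it.
inline double sum_after_handoff(std::size_t n) {
    std::vector<double> field;        // shared storage, filled by the producer
    std::atomic<bool> ready{false};   // publication flag
    double total = 0.0;

    std::thread producer([&] {
        field.assign(n, 1.0);                          // allocate and fill completely
        ready.store(true, std::memory_order_release);  // publish only afterwards
    });

    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();                 // wait rather than read early
        }
        total = std::accumulate(field.begin(), field.end(), 0.0);
    });

    producer.join();
    consumer.join();
    return total;                                      // safe to read after both joins
}
```

Without the flag, the consumer could read the array while it is still being allocated, which is undefined behaviour and can show up as an intermittent segmentation fault that depends on thread timing.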

Loothin · August 13, 2012, 13:42

But if this is a program bug, why would it fail on one machine and not the other? I am currently experiencing the same issue, and we have two machines that are exactly the same. We just launched the exact same run on both: on one it failed with this error, and on the other it did not. We have had this issue for the past two weeks and it is always the same machine that fails. One time it ran 267 iterations, then last Friday it died within the first 100.

Any thoughts?

rwryne · August 13, 2012, 14:40

Quote:

Originally Posted by Loothin

But if this is a program bug, why would it fail on one machine and not the other? I am currently experiencing the same issue, and we have two machines that are exactly the same. We just launched the exact same run on both: on one it failed with this error, and on the other it did not. We have had this issue for the past two weeks and it is always the same machine that fails. One time it ran 267 iterations, then last Friday it died within the first 100.

Any thoughts?

Bad memory, perhaps? Try running memtest86+.

arjun · August 15, 2012, 07:36

Quote:

Originally Posted by Loothin

But if this is a program bug, why would it fail on one machine and not the other? I am currently experiencing the same issue, and we have two machines that are exactly the same. We just launched the exact same run on both: on one it failed with this error, and on the other it did not. We have had this issue for the past two weeks and it is always the same machine that fails. One time it ran 267 iterations, then last Friday it died within the first 100.

Any thoughts?

It should be compiler dependent and not dependent on the machine.

wyldckat · August 15, 2012, 09:31

Greetings to all!

We've had these kinds of problems in our office machines and have realized that memtest86+ isn't enough to catch most of these memory errors on the latest hardware (since 2008-2009).
We've only been able to detect these issues when using stressapptest: http://code.google.com/p/stressapptest/ - it's already available in most of the latest Linux distros.

Example of commands for properly testing RAM:

Code:

stressapptest -W --cc_test
stressapptest -W --cc_test -M 5000

The second one uses only 5GB of RAM.

Best regards,
Bruno

JBeilke · August 17, 2012, 04:07

Hi Bruno,

Thanks very much for the hint. I had some very strange prostar behaviour after upgrading from 16GB to 32GB of memory. Prostar crashed without any error message while running a long postprocessing script. The crashes occurred randomly on just one machine. stressapptest reported some problems.

Did you see differences between normal RAM and ECC RAM?

arjun · August 17, 2012, 07:31

Can you guys provide any test case that I could run to reproduce the issue?

JBeilke · August 17, 2012, 09:27

What testcase do you want? The output of a failed stressapptest run?

Code:

> stressapptest -W --cc_test -M 20000
...
Report Error: miscompare : DIMM Unknown : 1 : 6s
Hardware Error: miscompare on CPU 2(0x2) at 0x7f6cae2b6798(0x0:DIMM Unknown): read:0xe9e9e9e8e9e9e9e9, reread:0xe9e9e9e8e9e9e9e9 expected:0xe9e9e9e9e9e9e9e9
Log: Seconds remaining: 10
Stats: CC Thread(0): Time=20033474 us, Increments=1396261000, Increments/sec = 69696399.136765
Stats: CC Thread(1): Time=20034022 us, Increments=1167366000, Increments/sec = 58269178.300793
Stats: CC Thread(2): Time=20033434 us, Increments=935507000, Increments/sec = 46697286.146748
Stats: CC Thread(3): Time=19973276 us, Increments=1269766000, Increments/sec = 63573246.572070
Log: Thread 3 found 715 hardware incidents
Stats: Found 715 hardware incidents
Stats: Completed: 11814.00M in 21.00s 562.53MB/s, with 715 hardware incidents, 0 errors
Stats: Memory Copy: 11814.00M at 590.10MB/s
Stats: File Copy: 0.00M at 0.00MB/s
Stats: Net Copy: 0.00M at 0.00MB/s
Stats: Data Check: 0.00M at 0.00MB/s
Stats: Invert Data: 0.00M at 0.00MB/s
Stats: Disk: 0.00M at 0.00MB/s

Status: FAIL - test discovered HW problems

arjun · August 17, 2012, 09:35

Quote:

Originally Posted by JBeilke

What testcase do you want? The output of a failed stressapptest run?

Any sim file that I could run on my machine and that shows the problem. I can run STAR-CCM+ but so far have not encountered this problem, and if the problem does not show up it is really hard to fix.

Note: I now work with CD-adapco, so it kinda interests me more now.

rwryne · August 17, 2012, 09:41

Quote:

Originally Posted by arjun

Any sim file that I could run on my machine and that shows the problem. I can run STAR-CCM+ but so far have not encountered this problem, and if the problem does not show up it is really hard to fix.

Note: I now work with CD-adapco, so it kinda interests me more now.

From his post above, it looks to be a hardware issue, not a simulation issue. That's why it was only occurring on one machine.

JBeilke · August 17, 2012, 16:55

It can also be a software problem if it only occurs on one machine (unless you have an identical configuration).

Thanks to Bruno's hint with stressapptest, I was able to find out that it is a hardware problem.

In the meantime, other people have also come across the same prostar crashes. It will be interesting to see what results we get from the test on these machines.

wyldckat · August 18, 2012, 04:29

Quote:

Originally Posted by JBeilke

Did you see differences between normal RAM and ECC RAM?

So far the experience I've had is that:

Normal RAM is very susceptible to electrical quality. Overclocking and a weak/cheap power supply can lead to the occasional damaged module.
Even when electrical quality isn't the issue, normal RAM has seemed so far to be more prone to the occasional hiccup, i.e., has to go back to the store.
ECC is considered better because it can do hardware based self-correction, while normal RAM sometimes uses software based correction. For more on ECC: http://en.wikipedia.org/wiki/ECC_memory
For more on software based correction... sorry, don't have a reference on this one; it's sort-of a gut feeling from past experiences, but I don't have technical evidence.
Either way, it's good to buy RAM in a single combo package. For example, for normal RAM, it's best to fill all slots with modules from a single package purchase, because those modules have been tested to perform well as a team.
When it comes to multi-socket motherboards (more than one CPU), it seems that you can progressively fill memory slots, but the combo package criteria still stands. You can buy 1 package of combos per CPU socket and install one combo (2,3,4 or 6 modules for each combo) at a time. Asymmetry here is always very bad.
After all of this, always remember to check that the voltage, CL and other specs of the RAM modules (such as rank) are all the same, otherwise there might be incompatibilities between them that the motherboard cannot resolve automagically.

padimgr · February 6, 2015, 08:11

Quote:

Originally Posted by rwryne

This error is probably the most frustrating one to me, because it doesn't provide any hints as to what is going wrong... and to top it off, there are lots of things that can cause it.

Examples:

One of my simulations was having this only yesterday. I spent a whole week trying to figure out what was wrong. In the end, the culprit was... a mass flow report, which was set to "initial surface" rather than "volume mesh".

Another one was having a similar situation to yours: if I submitted it across say X cores, it would not run. But with X/2 cores, it would. I had no explanation for this.

Yet another one would not run at all on my cluster. I opened it locally, and did a single iteration locally then submitted it to cluster. Ran fine after that.

Hi rwryne,

I am getting exactly the same memory access error as you did, so I am wondering if you have managed to solve the problem.
The funny thing is that I only get the error occasionally, and only when trying to load a simulation on multiple cores, not on a single core.
I don't think this is a matter of hardware, as my computer has 2 processors, 24 cores at 2.3GHz, and 64GB RAM.

Many thanks,
Pavlos
