CFD Online Discussion Forums

hiddenbunny April 18, 2012 09:41

Inconsistent SIGSEGV: memory access exception
 
Hi guys,

I have read through another thread about SIGSEGV: memory access exception and have checked my mesh and cell sizes. I believe it's a different cause from what others have encountered.

I have 2 machines which run an identical mesh but with different BCs for a parametric analysis, and this error occurs only on one machine, and only sometimes. That is, one run will fail with this error, but if I reload the case and run it again, it might work. Sometimes I get 3-4 failures in a row, all at different iterations/time steps; sometimes I won't get it at all for 3-4 runs. It's driving me nuts...

Is it possible we just have a bad RAM stick in there, and the process crashes whenever it accesses the bad region? At this point I am fairly certain it is not my model, since this error never happens on the other machine I am running. Again, all the cases have the same mesh, just different BCs.

But I don't want to go out and buy RAM just on a hunch here.

Any suggestions would help.

Thanks!

- James

ryancoe April 18, 2012 10:31

I have also had this error occur seemingly randomly at times. I'm curious if anyone else out there has a solution.

rwryne April 18, 2012 11:43

This error is probably the most frustrating one to me, because it doesn't provide any hints as to what is going wrong... and to top it off, there are lots of things that can cause it.

Examples:

One of my simulations was hitting this as recently as yesterday; I spent a whole week trying to figure out what was wrong. In the end, the culprit was... a mass flow report, which was set to "initial surface" rather than "volume mesh".

Another one was in a similar situation to yours: if I submitted it across, say, X cores, it would not run. But with X/2 cores, it would. I had no explanation for this.

Yet another one would not run at all on my cluster. I opened it locally, ran a single iteration, then submitted it to the cluster. It ran fine after that.

arjun April 18, 2012 19:03

If this happens then there is a good chance that this is a bug in the program. You should report this to your local support.

Based on my experience with programming CFD codes, my guess is that this happens when STAR-CCM+ tries to delete some memory which was never allocated. On many compilers and systems an uninitialized variable happens to hold the value 0; on others it does not.
So if you try to delete an array whose pointer was not initialized to 0, it might create a problem.

This would work something like this.

real *Array;  // should be initialized to 0

delete [] Array;  // works fine if Array was initialized to 0; gives the error you mentioned if not


If you are wondering why anyone would write it like this: the first time, the array is only declared as real *Array; but in the remaining iterations the code might do:

delete [] Array;
Array = new real [ size ] ;

All this is system and compiler dependent.
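
The pattern described above can be sketched in C++ as follows. This is only an illustration of the general failure mode, not STAR-CCM+ source code; `real`, `Field`, `resize` and `resize_twice_and_sum` are hypothetical names. In standard C++, `delete[]` on a null pointer is a no-op, while `delete[]` on an uninitialized pointer is undefined behavior and may or may not crash depending on what the pointer happens to contain, which would fit a crash that comes and goes.

```cpp
#include <cassert>
#include <cstddef>

using real = double;  // hypothetical alias, mirroring the 'real' type in the post

// Hypothetical per-iteration buffer that is freed and reallocated on each resize.
struct Field {
    real* data = nullptr;   // explicit initialization: the first delete[] is then a no-op
    std::size_t size = 0;

    Field() = default;
    // Copying would free the same buffer twice, so forbid it in this sketch.
    Field(const Field&) = delete;
    Field& operator=(const Field&) = delete;
    ~Field() { delete[] data; }

    void resize(std::size_t n) {
        delete[] data;          // safe even on the first call because data == nullptr
        data = new real[n]();   // value-initialized, so every entry starts at 0
        size = n;
    }
};

// Resizes a Field twice and returns the sum of its entries (0 if the buffer is zeroed).
real resize_twice_and_sum(std::size_t first, std::size_t second) {
    Field f;
    f.resize(first);
    f.resize(second);
    assert(f.size == second);
    real sum = 0;
    for (std::size_t i = 0; i < f.size; ++i) sum += f.data[i];
    return sum;
}
```

In new code, a `std::vector<real>` would avoid the manual `delete[]` altogether.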

abdul099 April 28, 2012 07:03

Quote:

Originally Posted by arjun (Post 355481)
Based on my experience with programming CFD codes, my guess is that this happens when starccm tries to delete some memory which is not already allocated.

I think this happens not only when deleting memory, but also when trying to access memory which hasn't been allocated yet. E.g. one thread allocates a huge array, and another thread tries to write or read a value at an address that the first thread has not yet completely initialised.

I usually get this error when a lot changes, e.g. an interface needs an update due to moving meshes, so the face count or the vertex count for a cell or boundary changes.

I totally agree with arjun that it is most likely a bug which should be reported.
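
The threading scenario above can be illustrated with a minimal C++ sketch. It shows the safe version, in which the reading thread waits until the writing thread has published a fully initialised buffer; without that wait, the reader could touch memory that does not exist yet. `SharedBuffer` and `produce_and_consume` are hypothetical names, and nothing here is taken from STAR-CCM+ itself.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical shared state: a buffer plus a flag saying it is fully initialised.
struct SharedBuffer {
    std::vector<double> values;
    std::atomic<bool> ready{false};
};

// The producer fills the buffer and then publishes it; the consumer waits for the flag.
double produce_and_consume(std::size_t n) {
    SharedBuffer shared;
    double sum = 0.0;

    std::thread producer([&shared, n] {
        shared.values.assign(n, 1.0);                         // allocate and initialise
        shared.ready.store(true, std::memory_order_release);  // publish only when complete
    });

    std::thread consumer([&shared, &sum] {
        // Without this wait, the loop below could read a buffer that is still being built.
        while (!shared.ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        for (double v : shared.values) sum += v;
    });

    producer.join();
    consumer.join();
    assert(shared.ready.load());
    return sum;  // read only after both threads have finished
}
```

Depending on the toolchain, this may need to be built with thread support enabled (for example `-pthread` with GCC or Clang).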

Loothin August 13, 2012 13:42

But if this is a program bug, why would it fail on one machine and not the other? I am currently experiencing the same issue, and we have two machines that are exactly the same. We just launched the exact same run on both; on one it failed with this error, and on the other it did not. We have had this issue for the past two weeks, and it is always the same machine that fails. One time it ran 267 iterations, then last Friday it died within the first 100.

Any thoughts?

rwryne August 13, 2012 14:40

Bad memory mayhaps? Try running memtest86+.

arjun August 15, 2012 07:36

A bug like that should be compiler dependent, not machine dependent.

wyldckat August 15, 2012 09:31

Greetings to all!

We've had these kinds of problems on our office machines and have realized that memtest86+ isn't enough to catch most of these memory errors on the latest hardware (since 2008-2009).
We've only been able to detect these issues when using stressapptest: http://code.google.com/p/stressapptest/ - it's already available in most of the latest Linux distros.

Example of commands for properly testing RAM:
Code:

stressapptest -W --cc_test
stressapptest -W --cc_test -M 5000

The second one tests only 5 GB of RAM (-M takes the size in MB).
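
For readers wondering what such a test does in principle: it writes known patterns to memory, reads them back, and reports every mismatch. Below is a minimal C++ sketch of that idea only. It is not how stressapptest is implemented, `count_miscompares` is a hypothetical name, and a small user-space loop like this mostly exercises the CPU caches, so it is no substitute for the real tool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Writes `pattern` into a buffer of `words` 64-bit words, reads it back,
// and returns how many words differ from what was written.
std::size_t count_miscompares(std::size_t words, std::uint64_t pattern) {
    assert(words > 0);
    std::vector<std::uint64_t> buffer(words);
    // volatile keeps the compiler from optimizing the writes and reads away.
    volatile std::uint64_t* mem = buffer.data();

    for (std::size_t i = 0; i < words; ++i) mem[i] = pattern;

    std::size_t miscompares = 0;
    for (std::size_t i = 0; i < words; ++i) {
        if (mem[i] != pattern) ++miscompares;
    }
    return miscompares;
}
```

On healthy hardware the function returns 0 for any pattern.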

Best regards,
Bruno

JBeilke August 17, 2012 04:07

Hi Bruno,

Thanks very much for the hint. I had very strange Prostar behaviour after upgrading from 16GB to 32GB memory. Prostar crashed without any error message while running a long postprocessing script. The crashes occurred randomly, and only on one machine. stressapptest reported some problems.

Did you see differences between normal RAM and ECC RAM?

arjun August 17, 2012 07:31

Can you guys provide any test case that I could run to reproduce the issue?

JBeilke August 17, 2012 09:27

What testcase do you want? The output of a failed stressapptest run?
Code:

> stressapptest -W --cc_test -M 20000
...
Report Error: miscompare : DIMM Unknown : 1 : 6s
Hardware Error: miscompare on CPU 2(0x2) at 0x7f6cae2b6798(0x0:DIMM Unknown): read:0xe9e9e9e8e9e9e9e9, reread:0xe9e9e9e8e9e9e9e9 expected:0xe9e9e9e9e9e9e9e9
Log: Seconds remaining: 10
Stats: CC Thread(0): Time=20033474 us, Increments=1396261000, Increments/sec = 69696399.136765
Stats: CC Thread(1): Time=20034022 us, Increments=1167366000, Increments/sec = 58269178.300793
Stats: CC Thread(2): Time=20033434 us, Increments=935507000, Increments/sec = 46697286.146748
Stats: CC Thread(3): Time=19973276 us, Increments=1269766000, Increments/sec = 63573246.572070
Log: Thread 3 found 715 hardware incidents
Stats: Found 715 hardware incidents
Stats: Completed: 11814.00M in 21.00s 562.53MB/s, with 715 hardware incidents, 0 errors
Stats: Memory Copy: 11814.00M at 590.10MB/s
Stats: File Copy: 0.00M at 0.00MB/s
Stats: Net Copy: 0.00M at 0.00MB/s
Stats: Data Check: 0.00M at 0.00MB/s
Stats: Invert Data: 0.00M at 0.00MB/s
Stats: Disk: 0.00M at 0.00MB/s

Status: FAIL - test discovered HW problems
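
The miscompare line above can be decoded by XOR-ing the value read with the value expected: every 1 bit in the result is a flipped bit. A small C++ sketch of that arithmetic, with hypothetical helper names `flipped_bit_count` and `lowest_flipped_bit`:

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Returns the number of bits that differ between the value read and the value expected.
int flipped_bit_count(std::uint64_t read, std::uint64_t expected) {
    return static_cast<int>(std::bitset<64>(read ^ expected).count());
}

// Returns the index (0 = least significant) of the lowest differing bit, or -1 if none.
int lowest_flipped_bit(std::uint64_t read, std::uint64_t expected) {
    std::uint64_t diff = read ^ expected;
    if (diff == 0) return -1;
    int index = 0;
    while ((diff & 1u) == 0) {
        diff >>= 1;
        ++index;
    }
    return index;
}
```

For the values in the log, read 0xe9e9e9e8e9e9e9e9 and expected 0xe9e9e9e9e9e9e9e9, the XOR is 0x0000000100000000: exactly one bit differs, namely bit 32.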


arjun August 17, 2012 09:35

Quote:

Originally Posted by JBeilke (Post 377494)
What testcase do you want? The output of a failed stressapptest run?


Any sim file that I could run on my machine and that shows the problem. I can run STAR-CCM+, but so far I have not encountered this problem, and if the problem does not show up it is really hard to fix.

Note: I now work with CD-adapco, so it kinda interests me more now.

rwryne August 17, 2012 09:41

From his post above, it looks to be a hardware issue, not a simulation issue. That's why it was only occurring on one machine.

JBeilke August 17, 2012 16:55

It can also be a software problem even if it only occurs on one machine (unless the machines have an identical configuration).

Thanks to Bruno's hint about stressapptest, I was able to find out that it is a hardware problem.

In the meantime, other people have also come across the same Prostar crashes. It will be interesting to see what results we get from the test on these machines.

wyldckat August 18, 2012 04:29

Quote:

Originally Posted by JBeilke (Post 377461)
Did you see differences between normal RAM and ECC RAM?

So far the experience I've had is that:
  • Normal RAM is very susceptible to electrical quality. Overclocking and a weak/cheap power supply can lead to the occasional damaged module.
  • Even when electrical quality isn't the issue, normal RAM has so far seemed more prone to the occasional hiccup, i.e., it has to go back to the store.
  • ECC is considered better because it can do hardware-based self-correction, while normal RAM sometimes uses software-based correction. For more on ECC: http://en.wikipedia.org/wiki/ECC_memory
    For more on software-based correction... sorry, I don't have a reference on this one; it's sort of a gut feeling from past experience, and I don't have technical evidence.
  • Either way, it's good to buy RAM in a single combo package. For example, for normal RAM, it's best to fill all slots with modules from a single package purchase, because those modules have been tested to perform well as a team.
  • When it comes to multi-socket motherboards (more than one CPU), it seems that you can fill memory slots progressively, but the combo package criterion still stands. You can buy one combo package per CPU socket and install one combo (2, 3, 4 or 6 modules per combo) at a time. Asymmetry here is always very bad.
  • After all of this, always check that the voltage, CL and other specs of the RAM modules (such as rank) are all the same, otherwise there might be incompatibilities between them that the motherboard cannot resolve automagically.

padimgr February 6, 2015 08:11

Hi rwryne,

I am getting exactly the same memory access error as you did, so I am wondering if you managed to solve the problem.
Funny thing is that I get the error only occasionally, and only when trying to load a simulation on multiple cores, not on a single core.
I don't think this is a hardware matter, as my computer has 2 processors, 24 cores at 2.3GHz, and 64GB RAM.

Many thanks,
Pavlos


All times are GMT -4.