Inconsistent SIGSEGV: memory access exception
Hi guys,
I have read through another thread about the SIGSEGV: memory access exception and have checked my mesh and cell sizes; I believe the cause is different from what others encountered. I have two machines running an identical mesh but different BCs for a parametric analysis, and this error occurs only on one machine, and only sometimes. I mean that one run will fail with this error, and if I reload the case and run it again, it might work. Sometimes I get 3-4 failures in a row, all at different iterations/time steps; sometimes I won't see it at all for 3-4 runs. It's driving me nuts... Is it possible we just have a bad stick of RAM in there, and the process crashes whenever it accesses that bad region? At this point I am fairly certain it is not my model, since this error never happens on the other machine I am running. Again, all the cases have the same mesh, just different BCs. But I don't want to go out and buy RAM on a hunch; any suggestion would help. Thanks! - James |
I have also had this error occur seemingly randomly at times. I'm curious if anyone else out there has a solution.
|
This error is probably the most frustrating one to me, because it doesn't provide any hints as to what is going wrong... and to top it off, there are lots of things that can cause it.
Examples: one of my simulations was having this just yesterday, and I spent a whole week trying to figure out what was wrong. In the end, the culprit was... a mass flow report, which was set to "initial surface" rather than "volume mesh". Another case was in a situation similar to yours: if I submitted it across, say, X cores, it would not run, but with X/2 cores it would. I had no explanation for this. Yet another one would not run at all on my cluster. I opened it locally, ran a single iteration, then submitted it to the cluster. It ran fine after that. |
Quote:
If this happens, there is a good chance that this is a bug in the program. You should report it to your local support. Based on my experience programming CFD codes, my guess is that this happens when STAR-CCM+ tries to delete memory which was never allocated. Most compilers initialize variables to 0; some do not. So if you try to delete an array through a pointer that was not initialized to 0, it can create exactly this problem. It would work something like this: real *Array; /* should be initialized to 0 */ followed by delete [] Array; (works fine if Array was initialized to 0, gives the error you mentioned if it was not). If you are wondering why anyone would write it like this: the first time through, the array is only declared, real *Array; but on subsequent iterations the code does delete [] Array; Array = new real [ size ]; All of this is system- and compiler-dependent. |
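The delete-before-new pattern described above can be sketched as follows. This is a minimal illustration, not STAR-CCM+ code: the names real, Array, and resize are made up for the example. The key point is that delete[] on a null pointer is a guaranteed no-op, while delete[] on an uninitialized (indeterminate) pointer is undefined behavior and can crash with exactly this kind of SIGSEGV:

```cpp
#include <cstddef>

typedef double real;  // stand-in for the solver's floating-point typedef

// Must start out as nullptr, not indeterminate; otherwise the first
// delete[] below is undefined behavior and may segfault.
static real *Array = nullptr;
static std::size_t ArraySize = 0;

void resize(std::size_t size) {
    delete[] Array;            // no-op on the first call because Array is nullptr
    Array = new real[size]();  // reallocate, value-initializing elements to 0
    ArraySize = size;
}
```

If Array were declared without the = nullptr initializer, the very first delete[] would operate on a garbage address, which matches the compiler-dependent, intermittent behavior described in the quote.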
Quote:
I usually get this error when a lot changes, e.g. an interfaces needs update due to moving meshes, therefore the face count or the vertex count for a cell or boundary changes. I totally agree to arjun that it is mostly a bug which should be reported. |
But if this is a program bug, why would it fail on one machine and not the other? I am currently experiencing the same issue, and we have two machines that are exactly the same. We just launched the exact same run on both; on one it failed with this error, and on the other it did not. We have had this issue for the past two weeks, and it is always the one machine that fails. One time it ran 267 iterations; then last Friday it died within the first 100.
Any thoughts? |
Quote:
It should be compiler-dependent, not machine-dependent. |
Greetings to all!
We've had these kinds of problems on our office machines and have realized that memtest86+ isn't enough to catch most of these memory errors on recent hardware (since 2008-2009). We've only been able to detect these issues using stressapptest: http://code.google.com/p/stressapptest/ - it's already available in most recent Linux distros. Example command for properly testing RAM: Code:
stressapptest -W --cc_test
Best regards, Bruno |
Hi Bruno,
thanks very much for the hint. I saw very strange prostar behaviour after upgrading from 16GB to 32GB of memory: prostar crashed while running a long post-processing script, without any error message. The crashes occurred randomly, and only on one machine. stressapptest reported some problems. Did you see differences between normal RAM and ECC RAM? |
Can you guys provide any test case that I could run to reproduce the issue?
|
What test case do you want? The output of a failed stressapptest run?
Code:
> stressapptest -W --cc_test -M 20000 |
Quote:
Any sim file that I could run on my machine that shows the problem. I can run STAR-CCM+, but so far I have not encountered this problem, and if the problem does not show up it is really hard to fix. Note: I now work with CD-adapco, so this kind of thing interests me more now. |
Quote:
From his post above, it looks to be a hardware issue, not a simulation issue. That's why it was only occurring on one machine. |
It can also be a software problem if it only occurs on one machine (unless you have an identical configuration).
Thanks to Bruno's hint about stressapptest, I was able to find out that it is a hardware problem. In the meantime, other people have also come across the same prostar crashes. It will be interesting to see what results we get from running the test on those machines. |
Quote:
I am getting exactly the same memory access error as you did, so I am wondering if you have managed to solve the problem. The funny thing is that I only get the error occasionally, and only when trying to load a simulation on multiple cores, never on a single core. I don't think this is a hardware matter, as my computer has 2 processors, 24 cores at 2.3GHz, and 64GB of RAM. Many thanks, Pavlos |