CFX Solver does not write the results file and returns with error code 1

zeeshans · May 3, 2016, 03:05

Hi,

I am running an unsteady simulation of a turbine blade passage with the Fourier Transformation method in ANSYS CFX 15.0.
Some related parameters of interest are:
- Frequency: 1294.85 Hz
- Interblade Phase Angle (-72) / Nodal Diameter -16 (corresponding to a set of 80 blades)
- No. of Time Steps per period: 16
- Total Number of Periods: 6
- No. of Fourier Coefficients: 3

I have prescribed the desired mode shape onto the CFD mesh and specified the steady-state solution as the initial condition.

The solver starts fine and keeps solving until the last time step of the simulation, but when it comes to writing the results, it returns error code 1 with this message: " An error has occurred in cfx5solve: The ANSYS CFX solver exited with return code 1. No results file has been created. "

Then it says " The following user files have been saved in the directory
zFas3A4b, mon "

I have made several attempts but face the same issue every time. Can anyone please suggest a way to overcome it? Also, how can I use the two files created above at the end?

I shall be very grateful.

Regards,
Zeeshan Saeed

-Maxim- · May 3, 2016, 03:30

please post the complete error message/out-file.

zeeshans · May 3, 2016, 04:06

Hi, please see the error message here as well as the text file (read from the .out file). Since the file was larger than the forum allows, I have deleted the time steps and coefficient loop iterations. I hope it still serves the purpose.

+--------------------------------------------------------------------+
| An error has occurred in cfx5solve: |
| |
| The ANSYS CFX solver exited with return code 1. No results file |
| has been created. |
+--------------------------------------------------------------------+

End of solution stage.

+--------------------------------------------------------------------+
| The following user files have been saved in the directory |
| C:\Users\zeeshans\Desktop\CoarseMesh_IBPA(72)\B_1B _IBPA(72)_001: |
| |
| zFas3A4b, mon |
+--------------------------------------------------------------------+

This run of the ANSYS CFX Solver has finished.

-Maxim- · May 3, 2016, 04:48

That's quite strange - usually there's more information about the error. And your 'return code 1' shows up when CFX is writing the result file to your hard drive.
Do you have enough disk space? Are you allowed to write big files on that drive/folder? (not sure how big the result file will be in your simple case though)
In case this is a computer from the university - some schools have a file size/space limit in the home folders such as the desktop...

Besides that, I am out of ideas - might have to wait for the geniuses

zeeshans · May 3, 2016, 11:02

Memory does not seem to be a problem.
Thanks for your comments, anyways.

ghorrocks · May 3, 2016, 21:22

As Maxim says, the run appears to have completed successfully but when it is writing the results file it fails. So I would suggest (and some of these are repeated from Maxim):

* You ran out of disk space, or your quota filled up or something stopped you writing the file to disk.
* You lost the network at the end when writing the result file
* Writing the results file involved calculating a variable which could not be evaluated (divide by zero, undefined variable etc). Could you have a user defined variable which has a divide by zero?
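
The first point above can be checked quickly from a terminal. This is only a minimal sketch, assuming a POSIX shell on Linux; the function names and the probe file name are made up for illustration and are not part of CFX:

```shell
#!/bin/sh
# Report free space (in KiB) on the filesystem that holds directory $1.
free_kib() {
    df -Pk "$1" | awk 'NR == 2 {print $4}'
}

# Return success if a small file can be created and removed in directory $1.
# The subshell keeps a failed redirection from aborting the calling shell.
can_write() {
    probe="$1/.cfx_write_probe.$$"   # hypothetical probe file name
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    return 1
}

run_dir="${1:-.}"   # assumption: pass the CFX working directory as the first argument
echo "Free space in $run_dir: $(free_kib "$run_dir") KiB"
if can_write "$run_dir"; then
    echo "Directory is writable"
else
    echo "Cannot write to $run_dir"
fi
```

This only shows that a tiny file can be written right now. It will not catch a per-user quota that trips once a large results file is being written, so on a managed machine it is still worth asking the administrators about quotas.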

n0ukh3z007 · June 22, 2016, 02:59

Hi guys,

Just to share my experience.
I am working on flow inside the centrifugal compressor stage in steady condition in CFX.

I had the same issue today: no .res or .bak file was being saved for post-processing. I searched the CFD forums for a solution but could not find one.

There is also a huge file, 'zFas3A4b', saved in the working directory, which turned out to be very useful. I copied that file, added a .res extension to the copy, and opened it in CFD-Post, and the problem was solved. It has all the data saved in it and post-processing works fine as well.

One more important thing is the solver memory allocation factor. I think it needs to be increased, to 1.5 in my case, though the right value depends on how much memory the solver needs for the simulation. My reason for suggesting this: I am running 5 other simulations with a solver memory allocation factor of 1.5 and they run fine and save their data. However, I forgot to specify the factor for two other simulations and those did not save their results. So specifying the solver memory allocation factor matters.
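
The copy-and-rename step described above could be scripted roughly like this. It is a sketch only, assuming a POSIX shell; `recover_res` and the output name `recovered.res` are hypothetical, and as a later post in this thread shows, the copied file does not always contain the solution variables:

```shell
#!/bin/sh
# Copy the leftover solver file to a name with a .res extension.
# Copying (rather than renaming) leaves the original zFas3A4b untouched.
recover_res() {
    src="$1/zFas3A4b"
    dst="$1/recovered.res"     # hypothetical output name
    [ -f "$src" ] || { echo "no zFas3A4b in $1" >&2; return 1; }
    cp -p "$src" "$dst" && echo "$dst"
}

run_dir="${1:-.}"   # assumption: the CFX working directory is the first argument
if out=$(recover_res "$run_dir" 2>/dev/null); then
    echo "Try opening $out in CFD-Post"
fi
```

As far as I know, the memory allocation factor can also be passed on the command line, e.g. `cfx5solve -def case.def -size 1.5` (with `case.def` standing in for your own definition file), but check `cfx5solve -help` for your version before relying on that.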

I hope this resolves your issue as well, Zeeshan.

Regards,
Noukhez

jgross · July 25, 2018, 07:48

Hi everyone,

Sorry to hijack this post, but it seems to be fairly relevant to an issue I am having.

Similar to Zeeshan's original post, the solver for my problem runs perfectly smoothly until it comes to writing the results. I should note that this problem only occurs in parallel; in serial it writes the results file just fine. I am using the Intel MPI local parallel method, as this is the only method that seems to "work" at all.

Furthermore, the process never gives an error message. It simply hangs at this step, with no progress in the out file after it prints the Variable Range Information. Eventually I have to kill the solver processes manually using

Code:

kill -9 <PID>

I have included the out file in case it helps with resolving the issue. I have also tried turning verbose mode on to see if further information could be acquired, but it does not print anything once the calculations begin or when they finish.

Afterwards, I am left with the same zFas3A4b file as well as mon and pids files. I attempted to copy the zFas3A4b to zFas3A4b.res to see if the results were there, as Noukhez suggested. However, when opening this file in Post, it has not written any of the variables as only the variables X, Y and Z are available for viewing.

I should also note that I checked whether any lock files were causing the program to stop. I noticed two such lock files: sm.<USER_ID>.<PID>.lock, present while the solver was running, and another lock file on zFas3A4b. I deleted both to see if they were the cause, but the problem persisted.

Does anyone have any suggestions or experience with similar issues? This has been a problem for the last few days and it seems I am no closer to solving the issue. Any help is greatly appreciated.

Regards,
James

ghorrocks · July 25, 2018, 08:04

I cannot tell you what the problem is, only to suggest some ideas:

* If you run a tutorial example in serial, local parallel and distributed parallel - does that complete OK?
* If you run this example in distributed parallel - does that complete OK?
* Have you tried the other parallel options?
* If you save a backup file during a run does it crash when writing that?
* Are you sure your network is fast, stable and not flooded with junk packets from other users?
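
For the first three checks, the runs can be started from a terminal so that only the run mode changes between them. The sketch below just prints the commands instead of running them. The `-def`, `-par-local` and `-partition` options are what I understand `cfx5solve` to accept, so treat them as assumptions and confirm with `cfx5solve -help`; `BluntBody.def` and the helper name `solver_cmd` are placeholders:

```shell
#!/bin/sh
# Print (not run) the cfx5solve command for a given run mode.
# $1 = definition file, $2 = mode (serial|local), $3 = number of partitions
solver_cmd() {
    case "$2" in
        serial) echo "cfx5solve -def $1" ;;
        local)  echo "cfx5solve -def $1 -par-local -partition ${3:-2}" ;;
        *)      echo "unknown mode: $2" >&2; return 1 ;;
    esac
}

# Same case, two run modes, so any difference in behaviour comes from the mode.
for mode in serial local; do
    solver_cmd BluntBody.def "$mode" 4
done
```

Distributed parallel needs a host list on top of this, which depends on the site setup, so it is left out here.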

jgross · July 25, 2018, 11:43

Hi Glenn,

Thank you for taking the time to reply to my post. I greatly appreciate your suggestions.

Quote:

* If you run a tutorial example in serial, local parallel and distributed parallel - does that complete OK?

I have used the blunt body case (BluntBody.cfx) from the tutorials as a test case. As before, it runs perfectly smoothly in serial, however it trips up in the same place (right after writing the Variable Range Information in the out file) for parallel runs, for both local and distributed methods. That is, it finishes calculations, but then never writes the results file. The working directory contains the exact same files as with my case.

Code:

ccl  cfx5.mms  cfx5.tt  def  gui-err.txt  mms  mms.setup  mms.setup.attrb  mon  mon.old  mpd.hosts  out  par  pids  sm.jg847pc.14441.lock  zFas3A4b  zFas3A4b.lck

Quote:

* If you run this example in distributed parallel - does that complete OK?

Running the original case using distributed parallel gives the same problem, with seemingly no difference in the output file or even in any of the files contained in the working directory.

Quote:

* Have you tried the other parallel options?

My set up only has Intel MPI and IBM MPI as available options. Using the IBM MPI from the CFX GUI gives an exit code of 127. The output from the terminal is:

Code:

Permission denied, please try again.
/ansys_inc/v182/CFX/bin/linux-amd64/ifort/solver-mpi.exe: error while loading shared libraries: libmport.so: cannot open shared object file: No such file or directory
/ansys_inc/v182/CFX/bin/linux-amd64/ifort/solver-mpi.exe: error while loading shared libraries: libmport.so: cannot open shared object file: No such file or directory
MPI Application rank 0 exited before MPI_Init() with status 127
mpirun: Broken pipe

Indeed the libmport.so library was not available in the <Ansys_Root>/v182/commonfiles/MPI/IBM/9.1.4.2/linx64/lib directory. Googling the libmport.so library wasn't much help either as there didn't seem to be anything about it online.
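
A quick way to list every shared library the solver binary cannot resolve (not just the first one the loader complains about) is to filter the output of `ldd`. A minimal sketch, assuming Linux with `ldd` and `awk` available; the helper names are made up, and the solver path is simply the one from the error message above:

```shell
#!/bin/sh
# Read ldd-style output on stdin and print the names of unresolved libraries.
filter_missing() {
    awk '/not found/ {print $1}'
}

# List unresolved shared libraries for the binary given as $1.
missing_libs() {
    ldd "$1" 2>/dev/null | filter_missing
}

solver=/ansys_inc/v182/CFX/bin/linux-amd64/ifort/solver-mpi.exe
if [ -x "$solver" ]; then
    missing_libs "$solver"
else
    echo "solver binary not found at $solver"
fi
```

An empty result means the loader can resolve everything with the current environment. Any names printed are libraries that must be installed or made visible to the loader before that start method can work.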

Quote:

If you save a backup file during a run does it crash when writing that?

I set the tutorial case from before to save a backup file every 20 iterations during an Intel MPI local parallel run. Interestingly enough, the same thing occurs when attempting to write the bak file. I suppose this confirms my belief that the issue is with communication between the master and slave nodes during the writing process. However, I am still unsure how to solve the problem.

Quote:

* Are you sure your network is fast, stable and not flooded with junk packets from other users?

To be honest, I am a little unsure of what you mean here. I am using my own university desktop with an Intel 8-Core i7-7820X CPU @ 3.60GHz, 64GB of RAM, and plenty of hard drive space. I will say it has hyperthreading on, so it is possible that the issue raised in the thread "Problem: Very long "write" time (~2h-3h) for results and transient results" is the cause. In general, I have left the processors to their own devices for probably an hour or two at most before manually killing the process. I will leave it overnight to see if that is indeed the issue, but I would expect it to have been resolved by V18.2. Furthermore, the problem is the same for 2, 4 or 6 cores, which seems to go against that hypothesis.

Is there anything else I could do to test what is wrong with my set up? It seems strange that this has been such a massive issue.

Regards,
James

ghorrocks · July 25, 2018, 19:26

Most importantly - CFX is a thoroughly tested piece of software. It definitely does not have this problem in CFX itself. There is something unique in your setup which is causing the problem. So you have to find what is unique in your setup and fix it.

You should always run the most recent version of CFX, as it fixes many bugs and adds new features. So definitely go to the most recent version. This problem may magically disappear if you do.

But you appear to be running Linux, and debugging this sort of problem on Linux is always a nightmare. Linux has such a convoluted network of libraries and background applications to get this sort of stuff working that it is impossible for mere mortals to figure it out. So my only recommendation here is to do a full update of all your Linux libraries to the latest versions.

Also - make sure you have read and followed the instructions in the CFX installation documentation. There are a few special tasks you need to do for parallel operation.

jgross · July 27, 2018, 04:54

Hi Glenn,

Thank you for your response.

Unfortunately, I am unable to upgrade to V19, as V18.2 seems to be the latest version my university has the installation media for.

I will look through the Ansys installation guide again and see if there is something I missed.

Regards,
James

ghorrocks · July 27, 2018, 07:51

You download the installation from the ANSYS Customer website. I have not seen physical media for CFX or ANSYS for decades.

There is a special section in the installation notes about parallel setup, especially for distributed parallel.

jgross · July 27, 2018, 07:54

We're not using a physical installation media, but we have it on our shared server available to mount. I'm not sure why the newest version is not available there, but I will ask.

James

ghorrocks · July 27, 2018, 07:58

Oh yes: What linux distribution are you using?

The CFX forum thread on getting CFX working on unsupported Linux distributions is the longest and most read thread on the forum: https://cfd-online.com/Forums/cfx/25...istros-14.html

jgross · July 27, 2018, 11:05

I'm using Ubuntu 16.04. I used the thread "Ansys 18.2 on UBuntu 16.04: Installation Guide" as a guide for installation, but I will also sift through the thread you sent to see if anyone has had a similar issue.

Thanks again for your help Glenn. Hopefully I can sort something out.

James

Sidharthkp · February 21, 2020, 11:53

Hello everyone,
I am facing the same problem now. I know this is an old post, but has anyone found a solution to this problem?
The error message I get in the black graphics window while running CFX 18.2 on Windows 10 in local parallel MPI mode is this:

"application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
An error has occurred in cfx5solve:"

I tried initialising the case several times with many modifications, thinking I had done something wrong, but the same file runs smoothly on another workstation.

drmatth · November 16, 2023, 11:11

Hello all, lonnnnnnnnng time reader, first time caller.

I know this is now many years too late, but it doesn't seem like a solution has been pinned down, and I wanted to chime in for posterity. A while ago I was having an issue very similar to the one described in this thread.

I was fortunate to work with my university's very talented research computing personnel and I was able to figure out the problem in my case.

I no longer have much idea where all of the out files and logs are, but I will do my best from memory with some generalities.

I was attempting to run a multistage, full-annulus turbomachinery model (~600m elements), and am fortunate enough to have access to quite a lot of computational horsepower (clearly) and as a grad student my time is still very cheap so literally over a hundred run attempts for troubleshooting were done.

The crux of my issue was that once I finally got the memory correct for the behemoth to run, it would crash when trying to save the .res files. I believe this is the common thread in this discussion (zeeshans, jgross).

I, too, was convinced it couldn't be a memory problem. We use the Slurm scheduler for submitting batch jobs to our computing cluster. I was able to learn some things (with a lot of help) using some native Slurm commands and a few special ones included by our research computing staff. Every time the model crashed during writing results, I would get the incredibly frustrating and useless Ansys error messages that we all know and love. Usually returning with maybe a code 1 or a code 2, always with the zFas3A4b (what an odd series of characters for this) "result/backup" file.

I could always see that when my jobs failed (with the Slurm tools) that my memory usage was well under the maximum available. So I continued allocating more and more trying to solve the issue. Same result over and over.

I reduced the size of my model by half (to ~600m elements... did I mention I am a graduate student with stupid amounts of cores available who maybe did not have the best decision-making at the time... my embarrassment is relevant and funny, I suppose) and the issues continued. What finally pulled back the curtain on this incredibly frustrating issue was noticing in the Slurm out file (not the Ansys one...) that the jobs were exiting with "return code 9". Ansys continues with its terrible error reporting of the same thing over and over, with no description or documentation to help troubleshoot.

Here was a section from one of these out files:

Code:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 93089 RUNNING AT cluster-a052
=   EXIT CODE: 9
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================

This put me on the path of it possibly being a Linux OS (possibly Out of Memory) issue and I reached out to research computing staff for help. All credit here goes to "LG" if he ever happens to find his way to these forums.
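
For anyone who wants to see what a hard kill looks like from the shell's side, here is a small self-contained sketch, assuming bash or a similar POSIX shell. It starts a dummy process, kills it with signal 9, and decodes the status the shell reports, which by convention is 128 plus the signal number. The function name is made up for illustration:

```shell
#!/bin/sh
# Decode a shell exit status: values above 128 mean
# "terminated by signal (status - 128)".
describe_status() {
    if [ "$1" -gt 128 ]; then
        echo "killed by signal $(( $1 - 128 ))"
    else
        echo "exited with code $1"
    fi
}

# Start a dummy background process and hard-kill it, as an OOM watchdog would.
sleep 30 &
pid=$!
kill -9 "$pid"

# Collect the status without letting a non-zero value abort the script.
status=0
wait "$pid" || status=$?
describe_status "$status"
```

The MPI banner above reports the signal number directly (9), while a plain shell reports the same event as 128 + 9, so the two views describe the same kind of termination.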

Here was his response (I've obfuscated identifying details of institution, etc.):

"Process death with signal 9 (unequivocal hard kill) is most commonly associated with an out-of-memory watchdog. I agree with you that this is more of a Slurm issue than an Ansys issue (though both are at fault). Specifically, the way we configured Slurm to watch for per-core memory. You may have tamed your model to fit in the confines, but I bet there's something else inside Ansys that tries to reach for just a bit more - and it gets shot by Slurm.

Looking at your recent jobs that touched cluster-a052, I see two:

Code:

    $ sacct -X -u myusername -S 2021-07-01 -N cluster-a052 
      JobID    JobName  Partition Account AllocCPUS      State ExitCode 
    ------- ---------- ---------- ------- --------- ---------- -------- 
    3979729 myjob+     clus-a     mygrp    2560       FAILED      2:0 
    3979731 myjob+     clus-a     mygrp    2560   OUT_OF_ME+ 0:125

The state code for the second one is telling 🙂 Looking at both of them with 'jobinfo', both have the same default "Mem reserved" (1992M/core), and quite high Max Mem used values:

Code:

    3979729  Max Mem used:  241.20G (clus-a052,clus-a300,clus-a368) 
    3979731  Max Mem used:  242.78G (clus-a212,clus-a052,clus-a300)

Which kind of confirms the "uses a lot and tries to grab more than has been given" theory. The catch is that Max Mem used is the last _valid_ (within boundaries) value that Slurm got to register. It does not say anything about how much past the walled garden the process(es) tried to reach the following moment.

You might try adding '--mem=250000M' or '--mem=249G' to your script (I am not sure which one gets converted into larger value) to see if you could squeeze couple extra GBs for the garden, and whether this could be enough for things to proceed. If it works, this could be an easy (albeit potentially fragile) way out. If not... then you'd have to go back to the magic Ansys switches to tell it to limit its appetite somewhat."
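
On the question of which of the two `--mem` values is larger: a small sketch, assuming Slurm reads the suffixes as binary multiples (1G = 1024M) and treats a bare number as megabytes, which is my understanding of its behaviour. The helper name `to_mib` is made up:

```shell
#!/bin/sh
# Convert a Slurm-style memory string (e.g. 249G, 250000M) to MiB,
# assuming binary multiples (1G = 1024M).
to_mib() {
    num=${1%[MG]}              # strip a trailing M or G, if present
    case "$1" in
        *G) echo $(( num * 1024 )) ;;
        *M) echo "$num" ;;
        *)  echo "$1" ;;       # no suffix: assumed to be megabytes already
    esac
}

for m in 250000M 249G; do
    echo "--mem=$m -> $(to_mib "$m") MiB"
done
```

Under that assumption `--mem=249G` is the larger request (254976 MiB against 250000 MiB). Note that `--mem` is a per-node request. After a run, something like `sacct -j <jobid> --format=JobID,State,ExitCode,ReqMem,MaxRSS` shows the requested and recorded memory side by side, with the caveat from the quote that the recorded maximum can miss the final spike.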

This was the first real confirmation to me that there were, indeed, memory issues occurring, just not being reported by Ansys in any way, or when they were reported, I assumed that, like usual, I needed to bump the allocation factors.

As a result of this exchange, I started over with my memory multipliers. I resubmitted after satisfying each memory error (these finally started being reported correctly) by iterating on the multipliers, e.g., bump -size-nr by 0.1, new error... bump -size-ni by 0.1... new error, bump -size-nr by 0.1... until.... it worked!

The final memory allocation looked like this:

Code:

 -size-cat 2.0x -size-nr 1.9x -size-ni 2.3x -single -size-interp-cat 10.0x -large

Each -size factor was added in response to a new error that appeared after satisfying the previous one. When I watch the job as it saves, there is a quite massive jump in memory that is not recorded by the scheduler before the process crashes.
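
The bump-and-resubmit loop is easy to get wrong by hand over dozens of attempts, so here is a rough sketch of keeping the multipliers in variables and stepping them by 0.1. It assumes a POSIX shell with `awk`; `bump` and `case.def` are made-up names, and the command is only printed, not run:

```shell
#!/bin/sh
# Add a step (default 0.1) to a multiplier such as "1.8x" and print the result.
bump() {
    awk -v f="${1%x}" -v s="${2:-0.1}" 'BEGIN { printf "%.1fx\n", f + s }'
}

# Current multipliers; bump whichever one the latest solver error names.
cat_f=2.0x
nr=1.8x
ni=2.3x

nr=$(bump "$nr")   # e.g. after an error about the -size-nr allocation

# Print the command line for the next attempt.
echo "cfx5solve -def case.def -size-cat $cat_f -size-nr $nr -size-ni $ni"
```

Only the three factors that were iterated on are shown. The other switches from the final allocation above (`-single`, `-size-interp-cat`, `-large`) would be appended unchanged.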

When it finally began working and saving, I could see this happening, and it would jump from like 180GB/node to about 220-230 GB/node. Incredibly frustrating, but alas, I now know about this. Sorry for my long story, but I hope this can help someone else not spend months going down the wrong path...

May 3, 2016, 03:05	CFX Solver does not write the results file and returns with error code 1	#1
zeeshans New Member Zeeshan Saeed Join Date: May 2016 Posts: 3 Rep Power: 10	Hi, I am running an unsteady simulation of a turbine blade passage using Fourier Transformation method using ANSYS CFX 15.0. Some related parameters of interest are: - Frequency: 1294.85 Hz - Interblade Phase Angle (-72) / Nodal Diameter -16 (corresponding to a set of 80 blades) - No. of Time Steps per period: 16 - Total Number of Periods: 6 - No. of Fourier Coefficients: 3 I have prescribed the desired mode shape onto CFD mesh and specified the steady state solution as initial condition. The solver starts fine and keeps solving till the last time step of the simulation and when it comes to writing the results, it return the error code 1 with this message: " An error has occurred in cfx5solve: The ANSYS CFX solver exited with return code 1. No results file has been created. " The it says " The following user files have been saved in the directory zFas3A4b, mon " I have made several attempts but every time face the same issue. Can anyone please suggest a way to overcome this issue. Also how can I use the two filed created above at the end? I shall be very grateful. Regards, Zeeshan Saeed

June 22, 2016, 02:59	Problem with saving .mon file instead of .res file.	#7
n0ukh3z007 New Member Noukhez Ahmed Join Date: Mar 2014 Posts: 9 Rep Power: 12	Hi guys, Just to share my experience. I am working on flow inside the centrifugal compressor stage in steady condition in CFX. I have had the same issue today that the .res file or a .bak file was not saving to post-process the results. I was looking online on CFD forum to solve this issue, but I could not find any solution. There is a huge memory file saved in there as well 'zFas3A4b', in the working directory, which is very useful, I guess. What I did is, I have copied the same file and put an extension of .res next to it and opened it in CFD post and the problem is solved. It has all the data saved in it and post processing is working fine as well. One more thing that is important is specification of solver memory allocation factor. I think we need to increase it to 1.5 I guess for my simulation, but it depends to upon the memory the solver takes to numerically simulate the data. The reason of giving this idea is I am running 5 other simulations with Solver memory allocation factor of 1.5 on them and they are working fine and saving the data. However, I forgot to specify the solver memory allocation factor to two other simulations and the result was not saving the data. Therefore, specification of solver memory allocation factor is important. I hope this resolve your issue as well Zeeshan. Regards, Noukhez marcogarutti likes this.

July 25, 2018, 08:04		#9
ghorrocks Super Moderator Glenn Horrocks Join Date: Mar 2009 Location: Sydney, Australia Posts: 17,781 Rep Power: 143	I cannot tell you what the problem is, only to suggest some ideas: * If you run a tutorial example in serial, local parallel and distributed parallel - does that complete OK? * If you run this example in distributed parallel - does that complete OK? * Have you tried the other parallel options? * If you save a backup file during a run does it crash when writing that? * Are you sure your network is fast, stable and not flooded with junk packets from other users? __________________ Note: I do not answer CFD questions by PM. CFD questions should be posted on the forum.

July 25, 2018, 19:26		#11
ghorrocks Super Moderator Glenn Horrocks Join Date: Mar 2009 Location: Sydney, Australia Posts: 17,781 Rep Power: 143	Most importantly - CFX is a thoroughly tested piece of software. It definitely does not have this problem in CFX itself. There is something unique in your setup which is causing the problem. So you have to find what is unique in your setup and fix it. You should always run the most recent version of CFX, as it fixes many bugs and adds new features. So definitely go to the most recent version. This problem may magically disappear if you do. But you appear to be running linux, and debugging these sort of problems on linux is always a nigthmare. Linux has such a convoluted network of libraries and background applications to get this sort of stuff working that it is impossible for mere mortals to figure it out. So my only recommendation here is to do a full update of all your linux libraries to the latest versions. Also - make sure you have read and followed the instructions on the CFX installation documentation. There are a few special tasks you need to do for parallel operation. __________________ Note: I do not answer CFD questions by PM. CFD questions should be posted on the forum.

July 27, 2018, 07:51		#13
ghorrocks Super Moderator Glenn Horrocks Join Date: Mar 2009 Location: Sydney, Australia Posts: 17,781 Rep Power: 143	You download the installation from the ANSYS Customer website. I have not seen physical media for CFX or ANSYS for decades. There is a special section in the installation notes about parallel setup, especially for distributed parallel. __________________ Note: I do not answer CFD questions by PM. CFD questions should be posted on the forum.

May 3, 2016, 03:30		#2
-Maxim- Senior Member Maxim Join Date: Aug 2015 Location: Germany Posts: 415 Rep Power: 12	please post the complete error message/out-file.

May 3, 2016, 04:48		#4
-Maxim- Senior Member Maxim Join Date: Aug 2015 Location: Germany Posts: 415 Rep Power: 12	That's quite strange - usually there's more information about the error. And your 'return code 1' shows up when CFX is writing the result file to your hard drive. Do you have enough disk space? Are you allowed to write big files on that drive/folder? (not sure how big the result file will be in your simple case though) In case this is a computer from the university - some schools have a file size/space limit in the home folders such as the desktop... Besides that, I am out of ideas - might have to wait for the geniuses

May 3, 2016, 11:02		#5
zeeshans New Member Zeeshan Saeed Join Date: May 2016 Posts: 3 Rep Power: 10	Memory does not seem to be a problem. Thanks for your comments, anyways.

May 3, 2016, 21:22		#6
ghorrocks Super Moderator Glenn Horrocks Join Date: Mar 2009 Location: Sydney, Australia Posts: 17,781 Rep Power: 143	As Maxim says, the run appears to have completed successfully but when it is writing the results file it fails. So I would suggest (and some of these are repeated from Maxim): * You ran out of disk space, or your quota filled up or something stopped you writing the file to disk. * You lost the network at the end when writing the result file * Writing the results file involved calculating a variable which could not be evaluated (divide by zero, undefined variable etc). Could you have a user defined variable which has a divide by zero?

July 27, 2018, 04:54		#12
jgross Member James Gross Join Date: Nov 2017 Posts: 77 Rep Power: 8	Hi Glenn, Thank you for your response. Unfortunately, I am unable to upgrade to V19, as V18.2 seems to be the latest version my university has the installation media for. I will look through the Ansys installation guide again and see if there is something I missed. Regards, James

July 27, 2018, 07:54		#14
jgross Member James Gross Join Date: Nov 2017 Posts: 77 Rep Power: 8	We're not using a physical installation media, but we have it on our shared server available to mount. I'm not sure why the newest version is not available there, but I will ask. James

July 27, 2018, 07:58		#15
ghorrocks Super Moderator Glenn Horrocks Join Date: Mar 2009 Location: Sydney, Australia Posts: 17,781 Rep Power: 143	Oh yes: What linux distribution are you using? The CFX forum thread on getting CFX working on unsupported Linux distributions is the longest and most read thread on the forum: https://cfd-online.com/Forums/cfx/25...istros-14.html __________________ Note: I do not answer CFD questions by PM. CFD questions should be posted on the forum.

July 27, 2018, 11:05		#16
jgross Member James Gross Join Date: Nov 2017 Posts: 77 Rep Power: 8	I'm using Ubuntu 16.04. I used Ansys 18.2 on UBuntu 16.04: Installation Guide as a guide for installation, but I will also sift through the post you sent to see if anyone has had a similar issue. Thanks again for your help Glenn. Hopefully I can sort something out. James

February 21, 2020, 11:53	Same Issue , any solution found...	#17
Sidharthkp New Member Sidharth K PIllai Join Date: Aug 2019 Location: INDIA Posts: 12 Rep Power: 6	Hellow everyone, I am met with the same problem now. I know this is an old post, but have anyone got the solution for this problem... The error message I get in the black graphics window while running cfx 18.2 in windows 10 in local parallel MPi mode is this... "application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0 An error has occurred in cfx5solve:" I tried initialising the case several times with many modifications thinking I have made something wrong, But the same file runs smoothly on another workstation. Stethedoctor likes this.

November 16, 2023, 11:11		#18
drmatth New Member Join Date: Feb 2015 Posts: 1 Rep Power: 0	Hello all, lonnnnnnnnng time reader, first time caller. I know this is now many years too late, but it doesn't seem like a solution has been pinned down, and I wanted to chime in for posterity. I was having a very similar issue described in this thread awhile ago. I was fortunate to work with my university's very talented research computing personnel and I was able to figure out the problem in my case. I no longer have much idea where all of the out files and logs are, but I will do my best from memory with some generalities. I was attempting to run a multistage, full-annulus turbomachinery model (~600m elements), and am fortunate enough to have access to quite a lot of computational horsepower (clearly) and as a grad student my time is still very cheap so literally over a hundred run attempts for troubleshooting were done. The crux of my issue, was once I finally got the memory correct for the behemoth to run, it would crash when trying to save the .res files. I believe this is the common string in this thread (zeeshans, jgross). I, too, was convinced it couldn't be a memory problem. We use the Slurm scheduler for submitting batch jobs to our computing cluster. I was able to learn some things (with a lot of help) using some native Slurm commands and a few special ones included by our research computing staff. Every time the model crashed during writing results, I would get the incredibly frustrating and useless Ansys error messages that we all know and love. Usually returning with maybe a code 1 or a code 2, always with the zFas3A4b (what an odd series of characters for this) "result/backup" file. I could always see that when my jobs failed (with the Slurm tools) that my memory usage was well under the maximum available. So I continued allocating more and more trying to solve the issue. Same result over and over. I reduced the size of my model by half (to ~600m elements... 
did I mention I am a graduate student with stupid amounts of cores available and maybe not had the best decision-making at the time... my embarrassment is relevant and funny, I suppose) and the issues continued. What finally unveiled the curtain of this incredibly frustrating issue, was noticing in the Slurm out file (not the Ansys one...) that the jobs were exiting with "return code 9". Ansys continues with it's terrible error reporting of the same thing over and over with no description or documentation to help troubleshoot . Here was a section from one of these out files: Code: /===================================================================================// //= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES// //= PID 93089 RUNNING AT cluster-a052// //= EXIT CODE: 9// //= CLEANING UP REMAINING PROCESSES// //= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES// //===================================================================================// // Intel(R) MPI Library troubleshooting guide:// // https://software.intel.com/node/561764// //===================================================================================/ This put me on the path of it possibly being a Linux OS (possibly Out of Memory) issue and I reached out to research computing staff for help. All credit here goes to "LG" if he ever happens to find his way to these forums. Here was his response (I've obfuscated identifying details of institution, etc.): "Process death with signal 9 (unequivocal hard kill) is most commonly associated with an out-of-memory watchdog. I agree with you that this is more of a Slurm issue than an Ansys issue (though both are at at fault). Specifically, the way we configured Slurm to watch for per-core memory. You may have tamed your model to fit in the confines, but I bet there's something else inside Ansys that tries to reach for just a bit more - and it gets shot by Slurm. 
Looking at your recent jobs that touched cluster-a052, I see two:

Code:
$ sacct -X -u myusername -S 2021-07-01 -N cluster-a052
JobID    JobName    Partition  Account AllocCPUS State      ExitCode
-------  ---------- ---------- ------- --------- ---------- --------
3979729  myjob+     clus-a     mygrp   2560      FAILED     2:0
3979731  myjob+     clus-a     mygrp   2560      OUT_OF_ME+ 0:125

The state code for the second one is telling 🙂

Looking at both of them with 'jobinfo', both have the same default "Mem reserved" (1992M/core), and quite high Max Mem used values:

Code:
3979729 Max Mem used: 241.20G (clus-a052,clus-a300,clus-a368)
3979731 Max Mem used: 242.78G (clus-a212,clus-a052,clus-a300)

Which kind of confirms the "uses a lot and tries to grab more than has been given" theory. The catch is that Max Mem used is the last _valid_ (within boundaries) value that Slurm got to register. It does not say anything about how much past the walled garden the process(es) tried to reach the following moment.

You might try adding '--mem=250000M' or '--mem=249G' to your script (I am not sure which one gets converted into the larger value) to see if you could squeeze a couple of extra GBs for the garden, and whether this could be enough for things to proceed. If it works, this could be an easy (albeit potentially fragile) way out. If not... then you'd have to go back to the magic Ansys switches to tell it to limit its appetite somewhat."

This was the first real confirmation to me that memory issues were indeed occurring, just not being reported by Ansys in any way; or, when they were reported, I assumed that, as usual, I just needed to bump the allocation factors.

As a result of this exchange, I started over with my memory multipliers, resubmitting after satisfying each memory issue (which finally started being reported correctly) by iterating on the multipliers, e.g., bump -size-nr by 0.1, new error... bump -size-ni by 0.1... new error, bump -size-nr by 0.1... until.... it worked!
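On LG's question of which of '--mem=250000M' and '--mem=249G' ends up larger: here is a minimal Python sketch of the arithmetic. It assumes Slurm treats the K/M/G/T suffixes as powers of 1024 with megabytes as the default unit, which is my understanding of the sbatch documentation but worth confirming on your own cluster. The helper name mem_to_mib is made up for this illustration and is not a Slurm tool.

```python
# Assumption: Slurm memory suffixes are powers of 1024, default unit is megabytes.
UNIT_TO_MIB = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}


def mem_to_mib(spec: str) -> float:
    """Convert a --mem style string such as '249G' into mebibytes (hypothetical helper)."""
    spec = spec.strip().upper()
    if spec[-1] in UNIT_TO_MIB:
        return float(spec[:-1]) * UNIT_TO_MIB[spec[-1]]
    return float(spec)  # no suffix: treated as megabytes


if __name__ == "__main__":
    for spec in ("250000M", "249G"):
        print(f"--mem={spec} -> {mem_to_mib(spec):.0f} MiB")
    extra = mem_to_mib("249G") - mem_to_mib("250000M")
    print(f"249G is larger by {extra:.0f} MiB")
```

Under that assumption, 249G works out to 254976 MiB, so it is the larger of the two requests by 4976 MiB.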
The final memory allocation looked like this:

Code:
-size-cat 2.0x -size-nr 1.9x -size-ni 2.3x -single -size-interp-cat 10.0x -large

where each -size factor was added as the result of a new error that appeared after satisfying the previous one.

When I watched the job as it saved, there was a quite massive jump in memory that is not recorded by the scheduler before the process crashes. When it finally began working and saving, I could see this happening: it would jump from about 180 GB/node to about 220-230 GB/node.

Incredibly frustrating, but alas, I now know about this. Sorry for my long story, but I hope this can help someone else not spend months going down the wrong path...
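For anyone who wants to repeat the "bump one factor by 0.1, resubmit, read the new error" loop, here is a rough Python sketch of the bookkeeping. Everything in it is hypothetical: run_job stands in for submitting the job and working out which -size flag the solver complained about (that part is not shown and depends on your out files), and fake_run just simulates a solver that needs the two factors I ended up with.

```python
from typing import Callable, Dict, Optional

# Hypothetical callback: takes the current factors and returns the flag that
# ran out of memory (e.g. "-size-nr"), or None if the run wrote its results.
RunJob = Callable[[Dict[str, float]], Optional[str]]


def tune_size_factors(run_job: RunJob, start: Dict[str, float],
                      step: float = 0.1, max_attempts: int = 100) -> Dict[str, float]:
    """Bump whichever factor the last run complained about until a run succeeds."""
    factors = dict(start)
    for _ in range(max_attempts):
        failing_flag = run_job(factors)
        if failing_flag is None:
            return factors
        # Round so repeated 0.1 steps do not accumulate floating-point noise.
        factors[failing_flag] = round(factors.get(failing_flag, 1.0) + step, 2)
    raise RuntimeError("no working set of factors found within max_attempts")


def to_flags(factors: Dict[str, float]) -> str:
    """Format the factors the way they appear on the solver command line."""
    return " ".join(f"{flag} {value:.1f}x" for flag, value in factors.items())


def fake_run(factors: Dict[str, float]) -> Optional[str]:
    """Stand-in for a real submission: pretends these minimum factors are needed."""
    needed = {"-size-nr": 1.9, "-size-ni": 2.3}
    for flag, minimum in needed.items():
        if factors.get(flag, 1.0) < minimum - 1e-9:
            return flag
    return None


if __name__ == "__main__":
    result = tune_size_factors(fake_run, {"-size-nr": 1.0, "-size-ni": 1.0})
    print(to_flags(result))
```

With a real job, each call to run_job is a full submission, so this is more about keeping track of which factor changed on which attempt than about saving time.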
