CFD Online Discussion Forums


John April 15, 2005 23:19

Parallel startup trouble!
 
Hello all!

I have a definition file made using CFX-5.6 that has run successfully on a Sun SMP machine using the 5.6 solver. Now I am trying to run it on a Linux cluster with the 5.7.1 solver and the default PVM executable. The solver seems to have trouble starting: running top on the compute nodes shows the CPU usage surging up and down. This all occurs after the mesh has been partitioned with no problems whatsoever.

The compute processes (pvmsolve) die off one by one, and the output file ends before reporting the first iteration results.

I thought that this might be due to a poorly defined mesh or partition, but I can't understand why it would run on one machine and not another. Any clues as to the nature of this problem would be greatly appreciated!!

Thanks,

John

Moyo April 17, 2005 15:06

Re: Parallel startup trouble!
 
I think your cluster or PVM setup has problems. I had a problem like this before with a case I meshed using Build and ran in parallel: it started fine and then each node died off. Try this on each of your Linux nodes: type 'top', which lists all the processes with their CPU and memory usage. See if pvmsolve is there and how long it lasts before it is removed. I bet it won't be long.
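
The per-node check described above can also be scripted rather than typed by hand. What follows is only a minimal sketch under stated assumptions: passwordless ssh to every node, a procps-style `ps` on each node, and host names that are placeholders. The function names `parse_ps`, `check_node` and `report` are hypothetical helpers and not part of CFX or PVM.

```python
import subprocess


def parse_ps(ps_output, name="pvmsolve"):
    """Return (pid, elapsed, cpu_percent) for every process named `name`.

    Expects the text printed by `ps -eo pid,etime,pcpu,comm`.
    """
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) == 4 and fields[3].strip() == name:
            pid, etime, pcpu, _ = fields
            matches.append((int(pid), etime, float(pcpu)))
    return matches


def check_node(host, name="pvmsolve"):
    """Run ps on `host` over ssh (assumed to work without a password)."""
    result = subprocess.run(
        ["ssh", host, "ps", "-eo", "pid,etime,pcpu,comm"],
        capture_output=True, text=True, timeout=30,
    )
    return parse_ps(result.stdout, name)


def report(hosts, name="pvmsolve"):
    """Print one line per node saying whether the solver is still alive."""
    for host in hosts:
        procs = check_node(host, name)
        if procs:
            for pid, etime, pcpu in procs:
                print(f"{host}: {name} pid={pid} elapsed={etime} cpu={pcpu}%")
        else:
            print(f"{host}: no {name} process found")
```

Calling `report(["node-a", "node-b"])` (placeholder host names) every few seconds while the job starts would show which node loses its solver process first.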

Glenn Horrocks April 17, 2005 18:25

Re: Parallel startup trouble!
 
Hi,

Moyo is right, there is something wrong with your machine. If the problem were a poor mesh or an incorrect problem setup, the solver would either give an error message or diverge.

Glenn Horrocks

John April 17, 2005 21:35

Re: Parallel startup trouble!
 
I agree with both assertions... it seems unlikely that it would run on one machine and not another if there were a bad mesh or definition file.

On the other hand, other .def files of similar size run without any problems whatsoever: top shows pvmsolve steadily taking up ~99% CPU on all involved nodes. In this case, when I top a node the CPU fluctuates between 1% and 90%. It did not behave this way on the multi-processor Sun machine.
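
The contrast between a steady ~99% and a reading that swings between 1% and 90% can be put into numbers by collecting a handful of %CPU readings per node and looking at their spread. This is a rough sketch only: `summarize_cpu` is a hypothetical helper, and the 30-point `surge_range` threshold is an arbitrary assumption chosen for illustration, not a value taken from CFX or PVM.

```python
from statistics import mean


def summarize_cpu(samples, surge_range=30.0):
    """Summarize %CPU readings taken from one solver process.

    `surge_range` is an assumed threshold: if the gap between the lowest
    and highest reading exceeds it, the process is flagged as surging.
    """
    if not samples:
        raise ValueError("need at least one sample")
    lowest, highest = min(samples), max(samples)
    return {
        "min": lowest,
        "max": highest,
        "mean": mean(samples),
        "surging": (highest - lowest) > surge_range,
    }
```

Readings like `[99.0, 98.5, 99.2]` come back as not surging, while `[1.0, 90.0, 45.0]` are flagged, which matches the difference between the healthy runs and the failing one.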

BTW, this 5.7.1 installation is over the new ROCKS distro. Are there any known bugs specific to implementations of CFX on ROCKS? The ROCKS usergroups are, of course, filled with instances of network troubles... after all, what do you want for free??

Thanks!!

John


Santhosh April 18, 2005 03:11

Re: Parallel startup trouble!
 
Hi,

What's the error message that you get when CFX exits?

I experience the same problem, i.e. the CPU surges up and down and then CFX exits with return code 255. This happens during the 1st iteration, and most of the time a bad mesh is the cause. I use WinXP.

Santhosh


John April 18, 2005 09:54

Re: Parallel startup trouble!
 
That's the trouble... I get no error message at all. Like your case, it happens during the 1st iteration.

John

Santhosh April 18, 2005 10:31

Re: Parallel startup trouble!
 
You might have to wait an awfully long time to get that error message. I usually kill the job when this happens, especially if it is in the 1st iteration, and get back to meshing as soon as I can. Have you tried to improve your mesh?

Santhosh


John April 18, 2005 21:13

Re: Parallel startup trouble!
 
No, I haven't tried any mesh improvement. Is mesh improvement available in CFX-Build?

John

Julian April 22, 2005 11:52

Re: Parallel startup trouble!
 
John,

Just a thought. Have you installed CFX-5.7 and the PVM libraries on all of the nodes in the Linux cluster, or are you getting them from a server (i.e. could it be a dodgy NFS share)?

Have you created the machines file (or is that for MPI?) in your home directory listing all of the machines that the code can run on? I ask because you say that it does the partitioning (serial) and then fails in iteration 1 (parallel). If PVM is not on every machine, the processes can't communicate with each other and the run will fail when they try to.

Have you looked in the /tmp or /var/tmp directories of each machine in the cluster to see if there are any messages either relating to cfx or pvm?
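
Looking through the temporary directories can be sped up with a small script that lists any file whose name mentions cfx or pvm, together with its modification time. This is a sketch under assumptions: the keyword list and the helper name `find_logs` are illustrative, and the script only inspects the machine it runs on, so it would need to be run on each node in turn (or over ssh).

```python
import os
import time


def find_logs(directories=("/tmp", "/var/tmp"), keywords=("cfx", "pvm")):
    """Return (path, modification time) for files whose names contain a keyword.

    The match is case-insensitive and looks only at file names, not contents.
    Directories that cannot be read are skipped silently.
    """
    hits = []
    for directory in directories:
        try:
            names = os.listdir(directory)
        except OSError:
            continue  # missing or unreadable directory
        for name in sorted(names):
            if any(keyword in name.lower() for keyword in keywords):
                path = os.path.join(directory, name)
                try:
                    modified = os.path.getmtime(path)
                except OSError:
                    continue  # file vanished between listing and stat
                hits.append((path, time.ctime(modified)))
    return hits
```

Files modified around the time the solver processes died are the ones worth opening first.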

Julian

John April 22, 2005 20:17

Re: Parallel startup trouble!
 
Thanks to everyone for their input. I believe that Santhosh has the best answer: it was most likely due to a poorly constructed mesh. Some planar surfaces were not well parameterized. I added some new edge points and split these ugly surfaces into a larger number of better-looking planar surfaces. I still get some CPU "surging", i.e. the processors do not run steadily at near 100%, but the solution does come out and the forces converge rapidly. Now I'll try more mesh controls and increase the element count in the wake region until the solution appears to be mesh independent.

Thanks again, everyone!

John


All times are GMT -4.