Problem with parallelization speedup using many CPUs

aunola · January 19, 2009, 05:01

It seems you have a bandwidth problem on your interconnect. One possible explanation is that your case is too small for an efficient parallel run using more than 2 CPUs. Small cases require low-latency interconnects due to heavy network traffic. I am not familiar with the benchmark case, but my suggestion is to make it larger. Keep us posted.

andreas_haakansson · January 19, 2009, 05:13

Thanks for the reply, I will try increasing the size of the case and report what happens.

/Andreas

velan · January 19, 2009, 05:22

Hi

Why not try smaller cases with grid sizes of
* 16*16*16
* 16*128*128

Decompose the grid in x direction, and if possible compare the results like

nproc realtime
1 ??secs
2 ??
4 ??

and the same thing for the second case (128 cells in x-direction).

I found a problem between AMD and Intel on this issue (bandwidth); I will post the results later.

andreas_haakansson · January 19, 2009, 11:32

Martin: I have now tried expanding the case from its original 12 000 cells to about 300 000 cells. I have run this for 2 h on 1, 2 and 4 CPUs to see what happens.
My results are then:
1 CPU Clock=7222 Execution=7207.49 Speedup 1.00
2 CPU Clock=7187 Execution=3610.08 Speedup 1.00
4 CPU Clock=7187 Execution=1804.03 Speedup 0.99

This means that the execution/clock time ratio now drops even earlier (already at 2 CPUs) and I get no speedup even between 1 and 2 CPUs. When I try this on my PC there is a clear speedup of between 1.5 and 2.
I also wonder if this is the result of slow networking; can it be fixed or do I need to use another cluster?

Velan: Thanks for your reply, but I am not sure exactly what you mean. As I understand your posting, you want me to check a very small case to see if the problem has to do with low memory or something like that. I will try following your instructions and post results tomorrow.

/Andreas

andreas_haakansson · January 20, 2009, 05:53

I have now done some asking around at the cluster support. The network uses Gigabit Ethernet (1000Base-T) with a theoretical speed of 1000 Mbit/s, which according to support should imply that it transfers data at 125 Mbytes/s between processors IF the switch is not overloaded. Time to initiate communication between nodes (latency) is 0.35 ms.
Do these numbers say anything to anyone? My main concern is whether the cluster is fast enough...
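One rough way to read the two network figures is a simple "latency plus size over bandwidth" cost per message. The sketch below uses only the 1000 Mbit/s and 0.35 ms values quoted above; the cost model itself and the example message (one 8-byte value per face of a 16x16 processor patch) are assumptions for illustration, not something measured on this cluster.

```python
# Figures quoted from cluster support in this post.
BANDWIDTH_BYTES_PER_S = 1000e6 / 8   # 1000 Mbit/s expressed in bytes per second
LATENCY_S = 0.35e-3                  # 0.35 ms to initiate a transfer


def transfer_time(message_bytes):
    """Assumed cost model: fixed latency plus size divided by bandwidth."""
    return LATENCY_S + message_bytes / BANDWIDTH_BYTES_PER_S


def latency_share(message_bytes):
    """Fraction of the modelled transfer time that is pure latency."""
    return LATENCY_S / transfer_time(message_bytes)


# Message size at which the bandwidth term equals the latency term.
break_even_bytes = LATENCY_S * BANDWIDTH_BYTES_PER_S

# Hypothetical message: one 8-byte double per face of a 16x16 processor patch.
patch_message_bytes = 16 * 16 * 8

print(f"break-even message size: {break_even_bytes:.0f} bytes")
print(f"latency share for a {patch_message_bytes}-byte message: "
      f"{latency_share(patch_message_bytes):.2f}")
```

Under this model, any message much smaller than the break-even size (roughly 44 kB with these figures) costs about the same as an empty one, which is why small cases are dominated by latency rather than bandwidth.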

I have also tried out the case suggested by Velan. The time simulated when run for 2 h decreases (i.e. speedup lower than 1) when comparing 1 and 2 or 4 CPUs even for this small case. The ratio between execution time and clock time also decreases below 50% for 2 CPUs and becomes even lower for 4 CPUs.

For an oodles case with 16x16x16 cells:
CPUs RelSimTime ExecutionTime ClockTime
1 1.000 6512 7226
2 0.733 3255 7249
4 0.532 1653 7235

Thanks in advance for any reply
Andreas

caw · January 20, 2009, 06:04

Some comments on scaling behavior:

Intel Harpertown and Woodcrest CPUs (quad-core Xeons) have a significant memory bottleneck between cores and RAM. I work on a cluster with dual quad-core nodes connected by InfiniBand.
Result: only use half of the cores per node! If I use all cores the ExecutionTime doubles! Otherwise scaling is good enough.

Regarding your problem: bandwidth is not that important for CFD (but it does not hurt), latency is the key.

You might want to use a low-latency MPI implementation for Gigabit Ethernet: Gamma
http://www.disi.unige.it/project/gam...mma/index.html

I haven't tried it myself, but they have performance data for OF 1.4 which looks quite nice.

Have a look at OpenFOAM-1.5.x/src/Pstream/gamma

Best regards
Christian

aunola · January 20, 2009, 06:17

Can this be all latency induced? It appears that for N>1 CPUs, each executes for about the time one would expect but sits doing nothing for (N-1)/N of the total wall clock time.

As I understand it, this is a cluster of single-core, single-CPU nodes with a Gbit interconnect. This is exactly what I'm planning to invest in for OF, albeit for turbFoam. However, I suspect turbFoam and oodles only differ in the turbulence models, so this result is not at all encouraging.

I have no further ideas at this time, other than maybe trying other cases that use other sparse matrix solvers.
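The (N-1)/N observation can be checked directly against the two tables Andreas posted. This is a small sketch that treats "idle" as the share of ClockTime not reported as ExecutionTime; that definition is an assumption made here for illustration, while the numbers are copied from the thread.

```python
# (CPUs, ExecutionTime [s], ClockTime [s]) copied from the two tables in the thread.
runs = {
    "16x16x16 cells": [(1, 6512, 7226), (2, 3255, 7249), (4, 1653, 7235)],
    "about 300 000 cells": [(1, 7207.49, 7222), (2, 3610.08, 7187), (4, 1804.03, 7187)],
}


def idle_fraction(execution_time, clock_time):
    """Share of wall-clock time not accounted for as execution time."""
    return 1.0 - execution_time / clock_time


for case, rows in runs.items():
    print(case)
    for n, execution, clock in rows:
        measured = idle_fraction(execution, clock)
        expected = (n - 1) / n
        print(f"  N={n}: idle {measured:.3f}, (N-1)/N = {expected:.3f}")
```

For the larger case the measured idle share lands within about one percent of (N-1)/N for both 2 and 4 CPUs, which is a suspiciously exact pattern for something that is supposed to be caused by network delays.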

andreas_haakansson · January 20, 2009, 08:25

Thanks for your answers, I will check this with the latency.
Martin: Yes the cluster consists of 200 AMD Opteron 148 single-cpu nodes. I will report on any progress.

Regards
Andreas

aunola · January 20, 2009, 11:27

Having thought a bit about it, the fact that you see no difference using floatTransfer might indicate that this is not a bandwidth problem; rather it might indeed be a latency problem. You might want to examine your network performance.

Also, have a look at these threads for possible further assistance:

http://www.cfd-online.com/OpenFOAM_D...es/1/2970.html

http://www.cfd-online.com/OpenFOAM_D...es/1/5473.html

(posts from Sep 27 and onwards in the latter).

carsten · January 21, 2009, 09:32

Dear Andreas,

maybe I didn't understand your table:

> My results are then:
> 1 CPU Clock=7222 Execution=7207.49 Speedup 1.00
> 2 CPU Clock=7187 Execution=3610.08 Speedup 1.00
> 4 CPU Clock=7187 Execution=1804.03 Speedup 0.99

But this looks like rather perfect speed-up to me?! 7207.4s on 1 CPU, 3610.08s on 2 CPUs, 1804.03s on 4 CPUs. Perfect. Dumb question: are you sure that you're interpreting the numbers correctly?

Bye,

Carsten

andreas_haakansson · January 21, 2009, 10:21

@Martin:
Thanks for the info. I will continue to check out what I can do about the MPI and latency.

@Carsten:
Sorry, maybe my table was not very clear. What I did was to run each case for 2 h (giving ClockTime approx. 7200 for all cases). Still, the 4 CPU case only spends 1804.03s computing while the 1 CPU case spends almost the whole 2 h computing. The remaining time is probably spent waiting for communication between the nodes. What I mean in the last column is that the different cases reach almost the same simulation time at the end of the 2 h.
So, as I see this, I would have a very good speedup if the nodes didn't spend so much time waiting, i.e. if ExecutionTime were equal to ClockTime, but it is not.

Regards
Andreas

ngj · January 21, 2009, 13:12

Hi Andreas

Somewhere on the Forum (cannot recall where), I read that you needed to have O(1e4) cells per processor to get reasonable results, otherwise transfer of BCs from processor to processor would eat all your time.

I have a rather pragmatic way of looking at it:

surface area / volume, i.e. the number of processor patch faces divided by the number of cells on each processor. If this is large (and the effort in solving the Poisson eq. is small), then you must expect a rather large amount of idle time when running on multiple processors.

Thus I would like to suggest that you run the exact same case, just with 32 * 32 * 32. For the 2 processor case, you would get half of "surface area / volume", thus hopefully better scaling.
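The "surface area / volume" ratio is easy to work out for a box mesh. The sketch below assumes a structured nx*ny*nz grid cut into equal slabs along x (as suggested earlier in the thread) and counts the processor patch faces of the busiest slab; the function name and the slab assumption are for illustration only.

```python
def patch_faces_per_cell(nx, ny, nz, n_procs):
    """Processor patch faces divided by cells for the busiest slab of an x-wise split."""
    if nx % n_procs != 0:
        raise ValueError("nx must be divisible by n_procs for equal slabs")
    if n_procs == 1:
        return 0.0  # no processor patches in a serial run
    cells_per_proc = (nx // n_procs) * ny * nz
    # With 2 slabs each has one neighbour; with more, interior slabs have two.
    neighbours = 1 if n_procs == 2 else 2
    return neighbours * ny * nz / cells_per_proc


for n in (16, 32):
    for procs in (2, 4):
        ratio = patch_faces_per_cell(n, n, n, procs)
        print(f"{n}x{n}x{n} on {procs} processors: {ratio:.4f} patch faces per cell")
```

Going from 16 to 32 cells per side halves the ratio for the same number of processors, which is the effect described above.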

I am not at all an expert, but I hope it is helpful.

Best regards,

Niels

olwi · January 21, 2009, 16:09

Hi,

I have no benchmark with OF, but we run Fluent (similar algorithm and MPI impl.) on a cluster with 50 nodes, 2 AMD Opterons per node, and Gbit Ethernet. Our rule of thumb is to have a minimum of 200-300 thousand cells per CPU, otherwise latency will kill us...

/Ola

carsten · January 22, 2009, 02:54

Hi Andreas, I don't want to sound stubborn, but I think you are misinterpreting your results, or your set-up is wrong, or I still don't understand your set-up.

What you're saying is that for a mesh of 16x16x16=4096 cells you have a speedup of

CPUs RelSimTime ExecutionTime ClockTime
1 1.000 6512 7226
2 0.733 3255 7249
4 0.532 1653 7235

Then, for 300000 cells you have a speedup of
1 CPU Clock=7222 Execution=7207.49 Speedup 1.00
2 CPU Clock=7187 Execution=3610.08 Speedup 1.00
4 CPU Clock=7187 Execution=1804.03 Speedup 0.99

From your interpretation this means that the speed-up is worse for larger grids. In my experience this is very, very improbable.

So, I would try the following:

- use a test case that is big enough (1e6 cells if you have enough RAM; this reduces latency impact)
- adjust the test case so that it produces as little result output as possible (IO can slow everything down)
- adjust "endTime" of the test case so that it runs ~30 min on 1 CPU (what did you mean by "giving ClockTime"?)
- execute all runs with the "time" command in front of the foam solver and in front of mpirun
(e.g. "time mpirun -np 4 -hostfile mymachines time /pathtofoam/mysolver -case mycase -parallel" , syntax may differ for you)
- decompose and run it for 2 and 4 CPUs.
- post the results

Maybe you can log in on each of your client nodes while the job is running and execute "top" on it. Check how much time is spent on your job, on system, on idle, on wait. Maybe activating "Sleeping in Function" in "top" may help (hit f y in top) to identify the culprit. As already stated by Martin, latency can be a big issue on gigabit Ethernet networks. But I wouldn't expect this to be so severe.

Bye,

Carsten

caw · January 22, 2009, 03:20

Hi,

I have the feeling that your job is running on a single node. That would explain everything!

In detail: you have four nodes with a single CPU core in each node. Now you start your job with 4 processes. The idea is that your system distributes those 4 processes to the 4 physical CPU cores. If this distribution is not done properly, all 4 processes are assigned to ONE CPU core.

Then you get the behavior you see: each process uses up a fourth of the whole clock time.

I made this mistake once myself. That would be a bug in your job submission script or in the batch scheduling system.

So you should check whether each node is running your job. Just log in during the run using ssh and do "top".

Best regards
Christian

caw · January 22, 2009, 03:26

Some more comments on cell numbers per process and speedup:

I have some cases that I partitioned to have as few as 10,000 cells per CPU and even they show reasonable speedup with OF (also with Fluent, but not as good as OF).

But the bigger the chunks of the mesh per CPU, the better the scale-up, as stated before.

Best regards
Christian

andreas_haakansson · January 22, 2009, 05:06

Hi again,

@Carsten: I think I understand now what you mean. In each table I have normalized the simulation time to that of 1 CPU for comparison; clock time is posted in order to compare with execution time. Looking at the actual numbers, the small case has a much longer simulation time.
I have made a 1M-cell case with low output. Then I set the maximum wall time when submitting to the cluster to 2 h for all cases and see how far they get in simulation time.
I have not used the "time" command. What is it for? Also see below.

@Christian: Maybe you are right, I also recently got an email from another user who suggested this. It would most definitely explain my low efficiency. My log files say that more than one CPU is used:

bench_doc_1.o693801

bench_doc_2.o693802
(Only have 1 and 2 CPUs yet but for them I still have the problem)
but I do not know if they are really sharing the load. The cluster support has told me that they do not allow access to single nodes, so I cannot go there directly and use "top" as I do on my own computer. It would probably be possible to get the information from the running program, but I do not know how. Any suggestions would be very welcome.

Regards
Andreas

caw · January 22, 2009, 05:33

OK, your job is running on a single CPU :-))

Have a look at your log:

First it says:
...
names of assigned nodes dn209 dn208 echo
dn209 is main node
...

And then from OF:
[0] Date : Jan 22 2009
[0] Time : 08:25:34
[0] Host : dn209
[0] PID : 20938
[1] Date : Jan 22 2009
[1] Time : 08:25:34
[1] Host : dn209
[1] PID : 20939
[1] Root : /disk/global1/andhak/OpenFOAM/andhak-1.4.1/run
[0] Root : /disk/global1/andhak/OpenFOAM/andhak-1.4.1/run
[0] Case : myBench2/oodles_pitzDaily_2
[0] Nprocs : 2
[0] Slaves :
[0] 1
[0] (
[0] dn209.20939
[0] )
[0]
[1] Case : myBench2/oodles_pitzDaily_2
[1] Nprocs : 2

The Host process is on dn209 and the slave process is on dn209 as well!!!, but should be on dn208, right?
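When logging in to the nodes is not allowed, the same check can be done from the solver log alone. Below is a minimal sketch that groups the process numbers by the host reported in "[n] Host : ..." lines; the sample text is taken from the log lines quoted above, while the regular expression and function name are assumptions about the header layout rather than anything OpenFOAM provides.

```python
import re
from collections import defaultdict

# Header lines copied from the log quoted in this post.
log_text = """\
[0] Host : dn209
[0] PID : 20938
[1] Host : dn209
[1] PID : 20939
"""

# Matches lines such as "[1] Host : dn209" and captures rank and host name.
HOST_LINE = re.compile(r"^\[(\d+)\]\s*Host\s*:\s*(\S+)", re.MULTILINE)


def ranks_per_host(text):
    """Map each host name to the list of process numbers that report it."""
    hosts = defaultdict(list)
    for rank, host in HOST_LINE.findall(text):
        hosts[host].append(int(rank))
    return dict(hosts)


placement = ranks_per_host(log_text)
print(placement)

n_ranks = sum(len(ranks) for ranks in placement.values())
if n_ranks > 1 and len(placement) == 1:
    print("All processes report the same host.")
```

On a cluster of single-core nodes, a healthy run should show one process number per host.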

Could you mail your Job script?

Best regards
Christian

andreas_haakansson · January 22, 2009, 05:42

Great, I hope you are right; then I have hopes of fixing this!
My submit script is:

bench2.scr

Regards
Andreas

caw · January 22, 2009, 06:02

Hi Andreas,

Try this command:

mpirun -np $nrnodes -hostfile $PBS_NODEFILE $solver $PBS_O_WORKDIR $casename -parallel

In my experience the -hostfile option is not actually needed with Open MPI, but you should try it anyway. You are using LAM, right?
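Before launching, it can help to look at what the scheduler actually handed out. The sketch below reads a PBS-style node file (one host name per allocated slot), lists the distinct hosts, and assembles the same argument order as the command above. The helper names, the stand-in node file, the solver name and the root path are hypothetical; inside a real job the path would come from $PBS_NODEFILE.

```python
import os
import tempfile


def read_nodefile(path):
    """Return the host names in a node file, one entry per allocated slot."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


def build_mpirun_args(nodefile, solver, root, case):
    """Assemble the mpirun argument list in the order used in the post above."""
    slots = read_nodefile(nodefile)
    return ["mpirun", "-np", str(len(slots)), "-hostfile", nodefile,
            solver, root, case, "-parallel"]


# Stand-in node file so the sketch runs outside a batch job (hypothetical content).
with tempfile.NamedTemporaryFile("w", suffix=".nodes", delete=False) as tmp:
    tmp.write("dn209\ndn208\n")
    demo_nodefile = tmp.name

distinct_hosts = sorted(set(read_nodefile(demo_nodefile)))
args = build_mpirun_args(demo_nodefile, "oodles", "/path/to/run",
                         "myBench2/oodles_pitzDaily_2")

print("distinct hosts:", distinct_hosts)
print(" ".join(args))

os.remove(demo_nodefile)  # clean up the stand-in file
```

If the number of distinct hosts is smaller than expected, the problem lies in the allocation rather than in the mpirun call.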

Best regards
Christian

