CFD Online Discussion Forums

Paraview in Parallel (server-client) (http://www.cfd-online.com/Forums/openfoam-paraview/63715-paraview-prallel-server-client.html)

prapanj April 17, 2009 01:05

Paraview in Parallel (server-client)
 
Hi,

I use ParaView without the foam reader, so I convert my cases to VTK using foamToVTK. I built ParaView myself on a multiprocessor machine (8 processors). Now I have these basic questions:

1. Do I have to decomposePar the case before opening it in paraview?
2. If yes, then when I run foamToVTK, the time directories in the processor* folders are not written into the VTK folder. Is there a way to write the subdomains to VTK?
3. When I do mpirun -np 8 paraview, does it work for a non-decomposed case too?
4. Or do I have to install the foam reader to do this?

PS: I have not installed any parallel reader after building paraview.

wolle1982 April 21, 2009 06:24

hi,

if you run mpirun -np 8 paraFoam (or paraview), ParaView will open 8 times, but there is no benefit from the extra processors: it still runs on one processor. Decomposing the case doesn't help either.

Maybe someone knows how to do it (I am also struggling with a huge case).

brosemu April 25, 2009 21:37

MPI version and Libraries?
 
Hi,
I'm trying to compile ParaView 3.5 with MPI support. I'm on a Mac and have tried Open MPI, MPICH, and MPICH2 with various combinations of:

MPI_LIBRARY
MPI_COMPILER
MPI_EXTRA_LIBRARY
MPI_INCLUDE_PATH

but the build fails at some point or other, depending on the combination of MPI settings. Does anyone have a tried-and-true way of installing on an Intel Mac?

thanks,
Bill

carl July 28, 2010 12:38

Paraview parallel postprocessing
 
Hello everybody,

I posted this elsewhere two weeks ago, but moved it here because I think it fits better here.

Basically, I am trying to postprocess a Star-CD case (saved as EnSight data)
with ParaView 3.8.0, and I want it to compute in parallel. I have already
compiled ParaView with MPI and Mesa support on Ubuntu 10.04.

The build is described in the ParaView wiki:
http://www.paraview.org/Wiki/ParaView/Git
http://www.paraview.org/Wiki/ParaView:Build_And_Install


In my first experiments, done on a 2-core machine, I got as far as
this:

- in Paraview, connect to server
- Server setup as follows:
localhost, port 11111
Command: "mpirun -np 2 pvserver (optional: --off-screen-rendering)"

This starts the 2 extra windows, with (or without, respectively) the
rendered image.

The ParaView processing works all right, yet the parallel job is no
faster than plain ParaView on a single core; it is actually slightly
slower. Nor can I find a significant influence of OpenGL rendering
(OK, maybe my case is too simple for that):

Case 1
paraview only: avg. 16 sec. / frame


Case 2
paraview client, server on localhost,
"mpirun -np 2 pvserver" (rendering via OpenGL):
avg. 17 sec. / frame


Case 3
paraview client, server on localhost,
"mpirun -np 2 pvserver --off-screen-rendering":
avg. 17 sec. / frame


Timing was done via "Save Animation" by comparing the file timestamps,
so the values are rounded to seconds. The animation was 5 frames long,
and the time per frame did not vary by more than one second. For details, see below.
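A minimal Python sketch of this kind of measurement, assuming the animation is saved as one PNG file per frame with a hypothetical name pattern such as `anim.*.png`. Note that with n files the first-to-last difference spans only n - 1 frame intervals, and the time needed to produce the first frame is not included.

```python
import glob
import os


def seconds_per_frame(pattern):
    """Average seconds per frame from the modification times of saved frames.

    `pattern` is a glob such as "anim.*.png" (hypothetical name). With n files
    the span from the first to the last write covers n - 1 frame intervals;
    the time needed for the first frame itself is not included.
    """
    files = sorted(glob.glob(pattern), key=os.path.getmtime)
    if len(files) < 2:
        raise ValueError("need at least two frames to measure an interval")
    span = os.path.getmtime(files[-1]) - os.path.getmtime(files[0])
    return span / (len(files) - 1)
```

Since file timestamps are read back with whole-second or finer resolution depending on the file system, this has the same rounding limitation mentioned above.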


Ideas that I have so far:

- parallelizing does not help in my specific case (2-stroke engine,
moving mesh, filters are: "Cell Data to Point Data", slices and
iso-Surfaces), so I should use another testcase?

- the EnSight reader is not parallelized?

- on a 2-core machine, effects are too small to be visible?

- do I need to decompose the computation domain somehow to tell
ParaView how to separate the data for parallelization?

- computation time is very small compared to the time needed to fetch data
from RAM, and since both CPUs rely on the same RAM, there is nearly
no speed-up?

- the whole thing is not properly parallelized and effectively uses
only one thread?


Next Steps:
Concerning the parallelized reader, I have read somewhere that the
OpenFOAM reader is parallelized, so I will try at least an OpenFOAM
case. Do I need a decomposed case, or can I just take any case and take
advantage of some automatic parallelization nonetheless?

Is there any possibility to do some profiling of Paraview, like logging
the time a filter is taking, so to see at least where paraview
is taking the most time?

Am I missing something else? Any comments?

Thanks! Carl


ps: Could you please fix the spelling in the title of this thread, so that it will be found by search engines?

carl July 28, 2010 12:56

All right, so I tried a few test cases with the EnSight reader,
the foam reader and the VTK reader, all to no avail.

In the end I found a note on the ParaView wiki stating
that the readers (except maybe the "new VTK" reader) are not parallel,
and that every pvserver instance reads the whole data set from the hard disk.
This tightens the reader bottleneck even further, since the data
has to be read once per process, resulting in a performance
degradation.

Since most modern computers are multi-core (except the iPad maybe...),
I think it's sensible for ParaView to use them, as also noted on:
http://paraview.uservoice.com/forums...chin?ref=title

For now, I see no option other than waiting or coding it myself...

so here are my benchmarks:

wyldckat July 28, 2010 17:52

Greetings Carl,

OK, let's examine one point at a time:
  • When you say you used the "native OpenFOAM reader", which one exactly did you use? In other words, which file extension did you open? Was it ".OpenFOAM" or ".foam"? If it was ".OpenFOAM", then try the other one :)
  • Opening in parallel in the same machine can only be advantageous if:
    1. your system has two disks in RAID 0 or 4 disks in RAID 10... or SSD drive(s), which allow high-speed read/write access;
    2. you have more than one graphics card, and you point each pvserver to each independent card;
      A side-note for this point: you might get some speed-up on the same machine with two graphics cards if you use the pvservers only as rendering servers!
    3. you have a reaaaally overclocked multicore machine (4.5-5GHz?), so fast, that it beats the living daylights out of the graphics card :)
So, to sum up:
  • Parallel processing in ParaView is mostly useful only if you have really large cases split between multiple machines; that way it is faster to open one part at a time on each machine, rather than having to join the whole data into a single workstation.
  • The other advantage is that if you don't have enough RAM to process all of the data on a single machine, you can spread the case over multiple machines, thus having the joint memory of all machines available!
Now, if I'm not completely mistaken, the internal ".foam" reader available in ParaView 3.8.0 (and above), can open the decomposed case with just the built-in server and still use multicore at the same time for processing the data!

The ".OpenFOAM" reader from OpenFOAM, on the other hand, I'm not so sure about its capabilities...

Best regards,
Bruno

7islands July 29, 2010 12:06

Hi Carl, I'm the main developer of the builtin OF reader in PV 3.8 (the reader that is invoked by opening a file with .foam extension).

Quote:

Originally Posted by carl (Post 269187)
Concerning the parallelized reader, I have read somewhere that the
OpenFOAM reader is parallelized, so I will try at least an OpenFOAM
case. Do I need a decomposed case, or can I just take any case and take
advantage of some automatic parallelization nonetheless?

Yes and yes and no, the builtin OpenFOAM reader is parallelized and requires a decomposed case to work in parallel and has no goodies such as automatic parallelization. Also, here are things to be noted:
  • Don't forget to set Case Type to Decomposed Case in the reader GUI panel.
  • You can check if the parallelization is working (which region of the mesh is read by which process) by Filters -> Process Id Scalars.
  • You would want a fast disk subsystem (at least a RAID or a fast SSD) for parallelization to be effective.

Quote:

Originally Posted by carl (Post 269187)
Is there any possibility to do some profiling of Paraview, like logging
the time a filter is taking, so to see at least where paraview
is taking the most time?

Yes, try accessing Timer Log from the Tools menu of the PV GUI.

Hi Bruno,

Quote:

Originally Posted by wyldckat (Post 269236)
Now, if I'm not completely mistaken, the internal ".foam" reader available in ParaView 3.8.0 (and above), can open the decomposed case with just the built-in server and still use multicore at the same time for processing the data!

Ur, not quite right: the reader can only read decomposed regions serially when in built-in server mode. Just as Carl did, you have to launch pvservers with mpirun for parallel reading.

Takuya

carl July 30, 2010 10:13

Hi,

thank you very much for your replies! It's always much easier when you get a new hint, and it's so good for motivation...

Concerning the advantages of parallel processing:
I understand that parallelizing only makes sense if most of the time is used up
by the computation. Neither data fetching from the HDD nor the rendering
will get any speed-up. However, HDD speed (>50 MB/s) does not seem to be the limiting factor: a time-step folder adds up to 1.2 MB, and the time per frame is around 3 s per step. Rendering is rather fast; I switched off LOD because rotation is still quick enough. So at ~10 frames/s this does not seem to be the limit either. I will try some profiling and parallel execution of a decomposed case...
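The back-of-the-envelope check above can be written out as follows. This sketch assumes a sustained sequential read rate and ignores seek time, OS caching and decompression cost; all numbers are the ones quoted in this post.

```python
def read_time_s(size_mb, throughput_mb_s):
    """Lower bound on the time to read size_mb megabytes at a sustained rate."""
    return size_mb / throughput_mb_s


# Numbers quoted in the post: 1.2 MB per time step, disk faster than 50 MB/s,
# about 3 s per frame.
t_read = read_time_s(1.2, 50.0)   # roughly 0.024 s
share = t_read / 3.0              # fraction of the frame time spent on reading
print(f"read: {t_read:.3f} s, {100 * share:.1f} % of a 3 s frame")
```

Under these assumptions reading accounts for well under 1 % of the frame time, which supports the conclusion that the disk is not the bottleneck for this small case.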

The last tests were carried out with ParaView 3.9; I didn't compile the reader that is delivered with OpenFOAM. The file is named "<file>.foam". I will try the other one too, and I will try some profiling.

Regards,

Carl

carl August 25, 2010 12:15

DepthCharge Benchmarks (8M cells)
 
Hello everybody,

finally I got some time to do some further benchmarking...


Paraview parallelization benchmarks
===================================


1) Test Description

The OpenFOAM tutorial case "depthCharge3D" was chosen as the test case, since it is
easily accessible to everybody.
~/OpenFOAM-1.7.x/tutorials/multiphase/compressibleInterFoam/les/depthCharge3D/

Cell numbers have been doubled in each direction in blockMeshDict, so the whole
case consists of about 8 million cells. This was done to increase the computation
time of the postprocessing and also to check the scalability of the
parallelization (depthCharge3Dfine). 10 time steps have been used for the
evaluation (0.08 s ... 0.17 s), with writeInterval=0.01 and purgeWrite=10, to keep the
total size reasonable. As in the original, the case was decomposed into 4
regions.
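The arithmetic behind this setup, as a small sketch: refining a hexahedral mesh by a factor of 2 in each of the three directions multiplies the cell count by 2^3 = 8, so "about 8 million" corresponds to an original mesh of roughly one million cells, and a write interval of 0.01 s from 0.08 s to 0.17 s gives 10 write times. The function names and the one-million input are illustrative only.

```python
def refined_cells(n_cells, factor=2, dims=3):
    """Cell count after refining a hex mesh by `factor` in each of `dims` directions."""
    return n_cells * factor ** dims


def n_write_times(t_start, t_end, interval):
    """Number of write times from t_start to t_end, both included."""
    # round() absorbs floating-point noise such as 9.000000000000002
    return round((t_end - t_start) / interval) + 1


print(refined_cells(1_000_000))         # 8000000 (doubling in x, y and z)
print(n_write_times(0.08, 0.17, 0.01))  # 10
```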

Tests have been carried out for the EnSight reader and the (native) OpenFOAM
reader (opening *.foam files by calling paraview --data="$foldername.foam").

To check the EnSight reader, the data was converted beforehand using the
foamToEnsight utility (unfortunately, I forgot to time this).

For evaluation, a single paraview state file was set up and adapted to the three
following cases:

- reading the Ensight data
- reading the OF-data as reconstructed case
(which shouldn't improve speed by using more server processes)
- reading the OF-data from the decomposed case results


file sizes:
reconstructed case, timestep 0.17:
28M alpha1.gz
24M p.gz
122M phi.gz
24M p_rgh.gz
17M rho.gz
117M U.gz
8,0K uniform

decomposed case, processor0, timestep 0.17:
4,3M alpha1.gz
8,9M p.gz
31M phi.gz
8,6M p_rgh.gz
5,7M rho.gz
30M U.gz
8,0K uniform

The evaluation is mainly a clip and a threshold showing alpha values greater
than 0.5, so you see the water splashing.

2) Test Environment

server:
running Ubuntu 10.04, 64-bit, on a 12-core machine with 24 GB RAM; data is on
a local HDD (hdparm reported 120 MB/s, so the ~350 MB of one time step should be
read in about 3 s).
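Summing the file listings above gives a rough disk-time estimate. This is only a lower bound under stated assumptions: the sizes are gzipped, so decompression and ASCII parsing come on top, and several pvserver processes reading from the same disk share its 120 MB/s rather than each getting it. The assumption that every processor* directory is about as large as processor0 is labeled in the code.

```python
# Gzipped field sizes in MB for time step 0.17, copied from the listings above
# (the 8 KB "uniform" directory is neglected).
reconstructed = {"alpha1": 28, "p": 24, "phi": 122, "p_rgh": 24, "rho": 17, "U": 117}
processor0 = {"alpha1": 4.3, "p": 8.9, "phi": 31, "p_rgh": 8.6, "rho": 5.7, "U": 30}

DISK_MB_S = 120.0  # hdparm figure quoted above


def disk_time_s(sizes_mb, rate_mb_s=DISK_MB_S, readers=1):
    """Lower bound on read time when `readers` equally sized parts share one disk."""
    return sum(sizes_mb.values()) * readers / rate_mb_s


print(f"reconstructed, 1 reader: {disk_time_s(reconstructed):.2f} s")
# Assumption: the other processor* directories are about as large as processor0.
print(f"decomposed, 4 readers:   {disk_time_s(processor0, readers=4):.2f} s")
```

Under these assumptions, decomposing does not reduce the total disk time on a single drive, so any parallel speed-up presumably comes mainly from dividing the CPU work (decompression, parsing, filtering) among processes.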

client on a notebook:
2-core core2duo, nvidia GeForce 9600M GT, ubuntu 10.04 (32bit version)

paraview 3.9.0,
32-bit on client, 64-bit on server, compiled with MPI and Mesa support

execution via:
carl@server$ mpirun -np [1/2/4] pvserver
carl@client$ paraview
-> connect to server ...

100Mbit ethernet connection between the two.

3) Test Results
Similar to last time: 11 frames evaluated; time measured via "Save Animation"
by looking at the timestamps of the first and last PNG files.

Ensight Reader
mpirun -np 1 pvserver => 214 s
mpirun -np 2 pvserver => 175 s
mpirun -np 4 pvserver (run 1) => 149 s
mpirun -np 4 pvserver (run 2) => 145 s
mpirun -np 4 pvserver (run 3) => 178 s

OF Reader (native paraview, .foam)

new test
decomposed case
mpirun -np 1 pvserver => 537 s
mpirun -np 2 pvserver => 303 s
mpirun -np 4 pvserver => 145 s

old test
(not sure if trustworthy: I ran the tests several times and the results were
always in this range, but they seemed somewhat strange. Maybe there were
networking problems or something else. Please note that the "new"
test above uses a postprocessing evaluation identical to this "old" one.)

decomposed case
mpirun -np 1 pvserver => 769 s
mpirun -np 2 pvserver => 580 s
mpirun -np 4 pvserver => 507 s

reconstructed case
mpirun -np 1 pvserver => 606 s
mpirun -np 2 pvserver => 642 s
mpirun -np 4 pvserver => 755 s
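To compare these runs, the usual measures are speedup S(p) = T(1)/T(p) and parallel efficiency E(p) = S(p)/p. A short sketch using the timings above; for EnSight only run 1 of the -np 4 cases is used, and the other two runs would give different values.

```python
def speedup_table(timings):
    """Map process count p -> (speedup, efficiency) relative to the 1-process run.

    Speedup S(p) = T(1) / T(p); efficiency E(p) = S(p) / p.
    """
    t1 = timings[1]
    return {p: (t1 / tp, t1 / tp / p) for p, tp in sorted(timings.items())}


# Wall times in seconds from the results above.
ensight = {1: 214, 2: 175, 4: 149}               # -np 4: run 1 only
of_decomposed_new = {1: 537, 2: 303, 4: 145}
of_reconstructed_old = {1: 606, 2: 642, 4: 755}

for name, timings in [("EnSight", ensight),
                      ("OF decomposed (new)", of_decomposed_new),
                      ("OF reconstructed (old)", of_reconstructed_old)]:
    for p, (s, e) in speedup_table(timings).items():
        print(f"{name:24s} np={p}: S={s:.2f}  E={e:.2f}")
```

For the new decomposed-case run this gives S(4) ≈ 3.7 and E(4) ≈ 0.93, for EnSight run 1 S(4) ≈ 1.44, while the reconstructed case has S(4) ≈ 0.80, i.e. it gets slower with more processes, as anticipated above for the reconstructed case.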


4) Comments

The EnSight reader
does its own domain splitting, so I get 4 vertical slices (z-normal),
while the OF reader uses the OpenFOAM decomposition, which is horizontal (y-normal).
There might be a small performance difference due to that, but maybe not
much.

Most important:
The time improvements from parallelization are significant with either of the two
readers.
With the OF reader, scaling is nearly linear, which may be due to the large
number of cells. With the EnSight reader, the speedup is still significant.
I am wondering why the EnSight reader is so much faster when using only one
process?


I evaluated some of the timer logs too, but only for "load state",
because for the animation the logs became too lengthy. Also, for a
non-developer the timer logs are somewhat hard to understand. I have the
impression that the hierarchical listing sometimes, but not always, shows in
the parent entry the sum of the CPU time of the child entries below it in the
tree. So merely summing up the time values does not really give the total
computation time, not even when using only one process. Am I wrong?
Additionally, it is hard to judge whether jobs are really executed in parallel,
or whether job-1 is waiting for job-5 to finish. CPU load doesn't tell you that
either; it's always at 100% (as described somewhere).
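If a parent entry's time already includes its children (an inclusive time), adding up every line of the log counts the child time again. A minimal sketch with a hypothetical tree structure (not the actual format of ParaView's Timer Log), contrasting the naive sum with the exclusive (self) time:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    """One line of a hierarchical timer log (hypothetical structure)."""
    name: str
    inclusive_s: float                       # time including all children
    children: List["Entry"] = field(default_factory=list)


def exclusive_s(entry):
    """Time spent in the entry itself, excluding its children."""
    return entry.inclusive_s - sum(c.inclusive_s for c in entry.children)


def naive_sum(entry):
    """Adding up every line of the log: child time is counted again at each level."""
    return entry.inclusive_s + sum(naive_sum(c) for c in entry.children)


# Hypothetical numbers, for illustration only.
log = Entry("Load State", 10.0, [Entry("read", 6.0), Entry("filters", 3.0)])
print(naive_sum(log), log.inclusive_s, exclusive_s(log))  # 19.0 10.0 1.0
```

If a log mixes inclusive and exclusive entries, as suspected here, neither sum is reliable unless one knows which kind each entry is.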

The timer logs are thus not particularly interesting; I wanted to post them, but they weren't accepted due to their file size.
It might help a bit if the steps in the timer log had not only the duration, but
also a begin and end timestamp.
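With begin and end timestamps per step, as suggested, one could check whether steps on different processes actually overlapped. A sketch, assuming a hypothetical list of (begin, end) pairs in seconds; this is not a feature of the ParaView Timer Log discussed in this thread.

```python
def max_concurrency(intervals):
    """Maximum number of (begin, end) intervals active at the same instant.

    1 means the steps ran strictly one after another; a value equal to the
    number of processes means all of them were busy at the same moment.
    """
    events = []
    for begin, end in intervals:
        events.append((begin, 1))    # interval starts
        events.append((end, -1))     # interval ends
    # At equal times, ends (-1) sort before starts (+1), so touching
    # intervals such as (0, 2) and (2, 4) do not count as overlapping.
    events.sort()
    active = best = 0
    for _, delta in events:
        active += delta
        best = max(best, active)
    return best


# Hypothetical (begin, end) times in seconds for one step on four processes.
serial = [(0, 2), (2, 4), (4, 6), (6, 8)]
parallel = [(0, 2), (0.1, 2.1), (0.2, 2.2), (0.1, 1.9)]
print(max_concurrency(serial), max_concurrency(parallel))  # 1 4
```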

The processing was rather slow (which is rather OK at 8M cells)

I found it strange that when running e.g. 4 server processes, 4 windows
open on the client side, and that during rotation not only the LOD but also the
resolution is reduced, so I figure it's using Mesa on my client? That leads to two things:

i) when using mesa, it could be run on the server (maybe I need
--offScreenRendering for that),

ii)
I thought that when using pvserver, it runs the data server and the render
server on the server? Then why does it open the VTK windows on my client?
And when it's opening the windows on my client, why doesn't it use GL? And what
is the IceT dev renderer?



ok. maybe again too much documentation for some little result,
but anything to improve the tools! ;)

Greetings, Carl

7islands August 27, 2010 03:02

Hi Carl,
Many thanks for sharing your observations. I am not familiar with the EnSight format, but on the whole your results look reasonable if the following hypotheses are true:
  1. The main reason for the OpenFOAM reader being slow compared to the EnSight reader is that the depthCharge3D case is in gzipped ASCII format, which is the most complex and thus slowest format to handle, whereas the EnSight data written by foamToEnsight is in single-precision binary format (unless the -ascii option is specified), which is the fastest. (If we go into details I can point out a lot more "excuses", though.)
  2. The OF reader catching up with the EnSight reader by increasing the number of processes is perhaps because the domain splitting of the EnSight reader is becoming the bottleneck.
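To illustrate hypothesis 1, the following sketch decodes the same numbers from a gzipped ASCII stream and from raw single-precision binary. It is a simplified stand-in, not the actual OpenFOAM or EnSight file layout, and no particular speed ratio is claimed; timings depend on the machine.

```python
import gzip
import struct
import time


def encode_ascii_gz(values):
    """Newline-separated ASCII numbers, gzip-compressed (like a *.gz field file)."""
    text = "\n".join(repr(v) for v in values)
    return gzip.compress(text.encode("ascii"))


def decode_ascii_gz(blob):
    """Decompress, then parse every number from text."""
    return [float(tok) for tok in gzip.decompress(blob).decode("ascii").split()]


def encode_binary_f32(values):
    """Raw little-endian single-precision floats (4 bytes per value)."""
    return struct.pack(f"<{len(values)}f", *values)


def decode_binary_f32(blob):
    """Reinterpret the bytes directly; no decompression or text parsing."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def best_time_s(decode, blob, repeat=3):
    """Best-of-`repeat` wall time for one decode call."""
    best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        decode(blob)
        best = min(best, time.perf_counter() - t0)
    return best


if __name__ == "__main__":
    values = [i * 0.001 for i in range(200_000)]
    t_ascii = best_time_s(decode_ascii_gz, encode_ascii_gz(values))
    t_bin = best_time_s(decode_binary_f32, encode_binary_f32(values))
    print(f"gzipped ASCII: {t_ascii:.4f} s, binary float32: {t_bin:.4f} s")
```

Note that single precision trades accuracy for speed: values not exactly representable in 32 bits come back slightly changed.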
I'll test it myself and post the results once the simulation run of your modified depthCharge3D case is finished (running an 8M case with 4 processors is taking a long time!).

As to why the pvservers are opening the VTK windows on the client, I have no idea. Perhaps you can discuss it better on the ParaView list.

Takuya

madad2005 August 27, 2010 03:55

Quote:

Originally Posted by 7islands (Post 273000)
As to why the pvservers are opening the VTK windows on the client, I have no idea. Perhaps you can discuss it better on the ParaView list.


This can be prevented by compiling ParaView yourself and applying a patch that can be obtained from the mailing list. There is no need for these windows to pop-up, as far as I'm aware.

carl September 24, 2010 07:12

Hello everybody,

Lately I have reviewed my original case with the performance problems, the one I posted at the end of July. It is actually a 2-stroke scavenging simulation, quite a small model with about 200k cells, divided into several blocks. The parallelization strategy of the EnSight reader divides every block into N pieces for N server processes, so in the end there are a multitude of blocks with very few cells. I guess that's the main limitation for the parallel performance. I found similar notes on some mailing lists.
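The effect described here can be illustrated with a small sketch of per-block splitting. The block sizes are hypothetical (chosen to add up to the 200k cells mentioned), and this is not the EnSight reader's actual partitioning code; it only reproduces the "every block into N pieces" behaviour described above.

```python
def split_blocks(block_cells, n_procs):
    """Split every block into n_procs nearly equal pieces.

    Returns, for each process rank, the list of piece sizes it receives.
    """
    per_proc = [[] for _ in range(n_procs)]
    for cells in block_cells:
        base, extra = divmod(cells, n_procs)
        for rank in range(n_procs):
            per_proc[rank].append(base + (1 if rank < extra else 0))
    return per_proc


# Hypothetical block sizes adding up to 200,000 cells (illustration only).
blocks = [100_000, 50_000, 20_000, 10_000, 10_000, 5_000, 5_000]
for n in (1, 2, 8):
    pieces = split_blocks(blocks, n)
    smallest = min(min(p) for p in pieces)
    print(f"{n} procs: {len(pieces[0])} pieces per process, smallest {smallest} cells")
```

With 8 processes each rank still receives 7 pieces, but the smallest holds only 625 cells, so any fixed per-piece overhead presumably makes up a growing share of the work.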

I haven't tried applying the patch yet, partly because I don't really need the client/server setup of ParaView. I will have a look at it some time, though.

Thanks again for all the suggestions,
Carl

