Paraview in Parallel (server-client)
I use paraview without the foam reader, so I convert to VTK using foamToVTK. I have paraview installed (built myself) on a multiprocessor machine (8 processors). Now I have these basic questions.
1. Do I have to decomposePar the case before opening it in paraview?
2. If yes, then when I do foamToVTK, I don't get the time directories in the processor* folders written into the VTK folder. Is there a way I can write the subdomains into VTK?
3. When I do mpirun -np 8 paraview, does it work for a non-decomposed case too?
4. Or should I mandatorily install the foam reader to do this?
PS: I have not installed any parallel reader after building paraview.
If you run mpirun -np 8 paraFoam (or paraView), paraView will open (8 times), but there is no benefit from more processors. It still runs on one processor. Decomposing the case doesn't help either.
Maybe someone knows how to do it (I also struggle with a huge case).
MPI version and Libraries?
I'm trying to compile ParaView3.5 with MPI support. I'm on a Mac and have tried openmpi, MPICH, and MPICH2 with various combinations of:
and while it is building it will error out at some point or other depending on the combination of MPI settings. Does anyone have a tried and true way of installing on an Intel Mac?
Paraview parallel postprocessing
I have had this post somewhere else already 2 weeks ago but moved it here because I feel it fits better here.
Basically I try postprocessing a Star-CD Case (actually saved as Ensight data)
with paraview 3.8.0, and I want to make it compute in parallel. I have
paraview already compiled with MPI and MESA support on Ubuntu 10.04
It is described in the paraview wiki:
My first experiments were done on a 2-core machine. I got as far as
- in Paraview, connect to server
- Server setup as follows:
localhost, port 11111
Command: "mpirun -np 2 pvserver (optional: --off-screen-rendering)"
This is starting the 2 extra windows with (or without, respectively) the
The paraview processing works all right, yet the performance of the
parallel job is not any faster than when I just use paraview on a
single core; rather, it is slightly slower. Neither can I find a significant
influence of the openGL-rendering. OK, so maybe my case is too simple
paraview only: avg. 16 sec. / frame
paraview client, server on localhost,
"mpirun -np 2 pvserver" (rendering via openGL:)
avg. 17 sec. / frame
paraview client, server on localhost,
"mpirun -np 2 pvserver --off-screen-rendering":
avg. 17 sec. / frame
Timing was done via "save animation" and comparing the
timestamps, so the values are rounded to seconds. The animation was 5 frames long;
the time per frame did not vary by more than one sec. For details, see below.
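The timestamp method described above can be scripted. A minimal sketch, assuming the frames are the PNG files written by "save animation" and that file modification times are an acceptable proxy for when each frame finished (the function name is made up for illustration):

```python
import os


def seconds_per_frame(frame_paths):
    """Average seconds per frame, estimated from file modification times.

    The first file's timestamp marks the end of the first frame, so the
    span between the oldest and newest file covers len(frame_paths) - 1 frames.
    """
    mtimes = sorted(os.path.getmtime(p) for p in frame_paths)
    if len(mtimes) < 2:
        raise ValueError("need at least two frames to measure an interval")
    return (mtimes[-1] - mtimes[0]) / (len(mtimes) - 1)
```

It would be called with the list of frame files, e.g. collected with glob. Since the timestamps only resolve to about a second, short animations give a coarse average at best.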
Ideas that I have so far:
- parallelizing does not help in my specific case (2-stroke engine,
moving mesh, filters are: "Cell Data to Point Data", slices and
iso-Surfaces), so I should use another testcase?
- the ENSIGHT reader is not parallelized?
- on a 2-core machine, effects are too small to be visible?
- do I need to decompose the computation domain somehow to tell
paraview how to separate the data for the parallelization?
- computation time is very small compared to the time to fetch data
from RAM, and since both CPUs rely on the same RAM, there is nearly
- the whole thing is not properly parallelized and effectively uses
only one thread?
Concerning the parallelized reader, I have read somewhere that the
OpenFOAM reader is parallelized, so I will try at least an OpenFOAM
case. Do I need a decomposed case, or can I just take any case and take
advantage of some automatic parallelization nonetheless?
Is there any possibility to do some profiling of Paraview, like logging
the time a filter is taking, so to see at least where paraview
is taking the most time?
Am I missing something else? Any comments?
PS: Could you please fix the spelling in the title of this thread, so that it will be found by search engines?
All right, so I tried a few testcases concerning the ensight reader,
the foam-reader and the VTK reader, all to no avail.
In the end I found a note somewhere on the paraview wiki stating
that the readers (maybe except the "new VTK" reader) are not parallel,
and every instance of pvserver reads the whole data from the hard disk,
with the result that the reader bottleneck is tightened
even more, since the data then has to be read for each process, resulting
in a performance degradation.
Since most modern computers are multi-core (except the iPad maybe...),
I think it's sensible for paraview to use them, as also noted on:
for now, I see no other chance than to wait or code the thing myself...
so here are my benchmarks:
OK, let's examine one point at a time:
The ".OpenFOAM" reader from OpenFOAM, on the other hand, I'm not so sure about its capabilities...
Hi Carl, I'm the main developer of the builtin OF reader in PV 3.8 (the reader that is invoked by opening a file with .foam extension).
Thank you very much for your replies! It's always much easier if you get a new hint, and it's so good for the motivation...
Concerning the advantages of parallel processing:
I understand that parallelizing only makes sense if most of the time is used up
by the computational stuff. Neither data fetching from the HDD nor the rendering
will see any speedup. However, an HDD speed of more than 50 MB/s does not seem to be the limiting factor: a timestep folder sums up to 1.2 MB, and the time per frame is around 3 sec/step. Rendering is rather fast; I shut off the LOD stuff because rotation is still quick enough. So at ~10 frames/sec this does not seem to be the limit either. I will try some profiling and parallel execution of a decomposed case...
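The point that only the computational part can benefit from more processes is what Amdahl's law expresses. A minimal sketch of that standard formula (it is not something used elsewhere in this thread, and the parallel fractions below are illustration values, not measurements):

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Upper bound on speedup when only part of the runtime parallelizes."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_procs)


# Illustration values only, not measured in this thread.
for fraction in (0.5, 0.9):
    for n in (2, 4):
        print(f"parallel fraction {fraction}, {n} procs: "
              f"at most {amdahl_speedup(fraction, n):.2f}x")
```

If reading and rendering together took half of the time per frame, four server processes could not give more than 1.6x, however well the filters themselves scale.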
The last tests have been carried out with paraview3.9, I didn't compile the reader that was delivered by OpenFOAM. The file is named "<file>.foam". I will try the other one too, and I will try some profiling.
DepthCharge Benchmarks (8M cells)
Finally I got some time to do some further benchmarking...
Paraview parallelization benchmarks
1) Test Description
The OpenFOAM tutorial case "depthCharge3D" was chosen as testcase, since it can
be referred to easily by everybody.
Cell numbers have been doubled in each direction in blockMeshDict, so the whole
case consists of about 8 million cells. This has been done to increase the computation
time of the postprocessing and also to check the scalability of the
parallelization (depthCharge3Dfine). 10 timesteps have been used for the
evaluation (0.08s ... 0.17s), with writeInterval=0.01 and purgeWrite=10 to keep the
total size reasonable. As in the original, the decomposition was done for 4
Tests have been carried out for the Ensight reader and the (native) OpenFOAM
reader (by opening *.foam files by calling paraview --data="$foldername.foam" )
To check the ensight Reader, the data was converted beforehand using the
foamToEnsight utility (unfortunately, I forgot to take the time of this).
For evaluation, a single paraview state file was set up and adapted to the three
- reading the Ensight data
- reading the OF-data as reconstructed case
(which shouldn't improve speed by using more server processes)
- reading the OF-data from the decomposed case results
reconstructed case, timestep 0.17:
decomposed case, processor0, timestep 0.17:
The evaluation is mainly a clip and a threshold showing alpha values greater
than 0.5, so you see the water splashing.
2) Test Environment
running under ubuntu 10.04, 64bit, on a 12-core machine with 24G RAM, data is on
a local hdd (hdparm reported 120MB/s, so the ~350 MB of one timestep should be
read in about 3 sec.)
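The disk estimate can be checked with a few lines. A small sketch using only the figures quoted in this post (about 350 MB per timestep, 120 MB/s from hdparm, 10 timesteps); it assumes purely sequential reads at the reported rate, which a real reader may not reach:

```python
def read_time_s(size_mb, rate_mb_per_s):
    """Seconds needed to read size_mb at a sustained rate_mb_per_s."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_mb / rate_mb_per_s


STEP_SIZE_MB = 350      # approximate size of one timestep (from the post)
DISK_RATE_MB_S = 120    # sequential rate reported by hdparm (from the post)
N_STEPS = 10            # timesteps used in the evaluation (from the post)

per_step = read_time_s(STEP_SIZE_MB, DISK_RATE_MB_S)
print(f"one timestep: {per_step:.1f} s")
print(f"{N_STEPS} timesteps: {N_STEPS * per_step:.1f} s")
```

That gives just under 3 s per timestep and roughly half a minute for all ten, which is a lower bound for the pure disk reading part of a run.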
client on a notebook:
2-core core2duo, nvidia GeForce 9600M GT, ubuntu 10.04 (32bit version)
32bit on client, 64bit on server, compiled with mpi and mesa support
carl@server$ mpirun -np [1/2/4] pvserver
-> connect to server ...
100Mbit ethernet connection between the two.
3) Test Results
similar to last time: 11 frames evaluated, time measurement by "save animation"
and looking at the timestamps of the first and last png files
mpirun -np 1 pvserver => 214 s
mpirun -np 2 pvserver => 175 s
mpirun -np 4 pvserver (run 1) => 149 s
mpirun -np 4 pvserver (run 2) => 145 s
mpirun -np 4 pvserver (run 3) => 178 s
OF Reader (native paraview, .foam)
mpirun -np 1 pvserver => 537 s
mpirun -np 2 pvserver => 303 s
mpirun -np 4 pvserver => 145 s
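To put the OF reader numbers into perspective, speedup and parallel efficiency can be computed from the measured times. A minimal sketch (the function name is made up; the timings are the three OF reader values listed above):

```python
def scaling_table(times_by_procs):
    """Return {n: (speedup, efficiency)} relative to the 1-process time."""
    t1 = times_by_procs[1]
    table = {}
    for n, t in sorted(times_by_procs.items()):
        speedup = t1 / t
        table[n] = (speedup, speedup / n)
    return table


# OF reader (.foam) timings in seconds, as listed above.
of_reader_times = {1: 537, 2: 303, 4: 145}

for n, (speedup, efficiency) in scaling_table(of_reader_times).items():
    print(f"-np {n}: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

For these timings the speedup comes out at about 1.8 on two processes and about 3.7 on four, i.e. an efficiency of roughly 90%.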
(Not sure if trustworthy. I ran the tests several times and they were
all in this region, but they seemed somehow strange. Maybe there were
problems concerning networking or something else. Please note, the "new"
test above is an identical postprocessing evaluation to this "old" one.)
mpirun -np 1 pvserver => 769 s
mpirun -np 2 pvserver => 580 s
mpirun -np 4 pvserver => 507 s
mpirun -np 1 pvserver => 606 s
mpirun -np 2 pvserver => 642 s
mpirun -np 4 pvserver => 755 s
The ensight reader
does some domain-splitting by itself, so I get 4 vertical slices (z-normal),
while the OF reader uses the OF decomposition, which is horizontal (y-normal).
There might be a little performance difference due to that, but maybe not too
Time improvements using parallelization are significant, using any of the two
Using the OF reader, scaling is nearly linear, which may be due to the large
number of cells. Using the ensight reader, the speedup is still significant.
I am wondering why the ensight reader is so much faster when using only one
I evaluated some of the time logs, too, but always only the "load state",
because for the animation the time logs became too lengthy. Additionally, as
a non-developer, I find the time logs somewhat hard to understand. I have the
impression that the hierarchical listing sometimes, but not always, shows in the
parent process the sum of the CPU time consumed by the child processes below it
in the tree. So merely summing up the time values does not really
give the total computation time, not even when using only one process. Am I
Additionally, it is hard to judge whether jobs are really executed in parallel,
or if job-1 is waiting for job-5 to finish. CPU load doesn't tell you that
either, it's always at 100% (as described somewhere).
The timelogs are thus not particularly interesting, I wanted to post them but they weren't accepted due to file size.
It might help a bit if the steps in the time log had not only the duration, but
also a begin and end timestamp.
The processing was rather slow (which is rather OK at 8M cells)
I found it strange that when running e.g. 4 server processes, 4 windows
open on the client side, and that during rotation not only the LOD but also the
resolution is reduced, so I figure it's using mesa on my client? That leads to two things:
i) when using mesa, it could be run on the server (maybe I need
--offScreenRendering for that),
I thought that when using pvserver, it runs the data server and the render
server on the server? Then why does it open the VTK windows on my client?
And when it's opening the windows on my client, why doesn't it use GL? And what
is the IceT dev renderer?
ok. maybe again too much documentation for some little result,
but anything to improve the tools! ;)
Many thanks for sharing your observations. I am not familiar with the EnSight format, but on the whole your results look reasonable if the following hypotheses are true:
As to why the pvservers are opening the VTK windows on the client, I have no idea. Perhaps you can discuss it better on the ParaView list.
Lately I have reviewed my original case with the performance problems, the one I posted at the end of July. It is actually a 2-stroke scavenging simulation, quite a small model with about 200k cells, divided into several blocks. The parallelization strategy of the ensight reader divides every block into N pieces for N server processes, so in the end there is a multitude of blocks with very few cells. I guess that's the main limitation for the parallel performance. I found similar notes in some mailing lists.
I haven't tried applying the patch yet, also because I don't really need the server/client setup of paraview. I will have a look at it though some time.
Thanks again for all the suggestions,