Problem: Very long "write" time (~2h-3h) for results and transient results
I'm running a ~10M element simulation in CFX13 and my "write" times are incredibly long: 2 hours or more to write a 4GB results file. I'm running the job in 12-way parallel, all local, on a 6-core i7-3930k CPU with hyperthreading. I have 32GB of memory, of which ~26GB gets used by the simulation, so I should not be hitting the page file yet. My OS is Win7-64 and my disk drive is a WD Green 1.5TB. I've tried both no compression and high compression for the results; the "write" time is essentially unchanged, but the results file is 20GB without compression. Copying the results file from the drive to itself takes about 2-3 minutes, so given these two facts I don't think the bottleneck is the HDD itself.
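As a rough sanity check on that reasoning, here is a minimal Python sketch of the arithmetic. It assumes the 4GB (compressed) file is the one that was copied, takes 1 GB = 1000 MB, and uses the upper 3 minute figure for the copy; the function name is only illustrative.

```python
def effective_mb_per_s(size_gb: float, seconds: float) -> float:
    """Average throughput in MB/s, taking 1 GB = 1000 MB."""
    return size_gb * 1000.0 / seconds

# Solver "write" step: 4 GB results file in roughly 2 hours.
solver_write = effective_mb_per_s(4.0, 2 * 3600)

# Same-drive copy in roughly 3 minutes
# (assumption: the 4 GB file was the one copied).
disk_copy = effective_mb_per_s(4.0, 3 * 60)

print(f"solver write: {solver_write:.2f} MB/s")
print(f"disk copy:    {disk_copy:.1f} MB/s")
print(f"copy is about {disk_copy / solver_write:.0f}x faster")
```

Under these assumptions the solver writes at well under 1 MB/s while the drive sustains tens of MB/s on a copy, which is consistent with the disk not being the limiting factor.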
During the write process I have one core of my cpu working near 100%. My .out file shows that this is most likely the master partition since it has a few hours more cpu time (according to the .out file) than all other 11 partitions, which show only small variation in their cpu times. Does anyone know what the master partition is doing for so long?
I ran the same job on a 16-core cluster node and writing the results files took much, much less time. The cluster storage is likely some sort of RAID setup and the operating system is HP CentOS 5.4 (unix/linux). I read that in version 11 of CFX there was a 2GB limit on results file size in Windows. I don't see this note for version 13, but am I running into a by-product of a workaround for this limitation? Perhaps a "fix" that allows for larger files in Windows but at the expense of ridiculously long writing times?
I had considered getting an SSD but I really don't think that my storage speed itself is the issue anymore. Since the partitioning behaves similarly (one core for a very long time), I almost wonder if the cpu time for the "write" is spent de-partitioning the mesh, but I don't even know if that makes any sense. Any insight would be greatly appreciated.
More recent versions of CFX have improved this area considerably, so I recommend you go to the latest version. I gather you are on V13; try upgrading to V14, as improvements were made to large cluster performance which may assist you.
I doubt an upgraded HD will make much difference. The bottleneck is getting the data from the slave nodes to the master, and the master assembling this data into the final results file. The data communication is a network issue (if a distributed cluster) or a multi-CPU, multi-thread performance issue (if local parallel), and the results file assembly is limited by single-core CPU speed.
The strangest thing about this is that the cluster, which is also V13, did not exhibit the same behaviour. I found this very odd since each core on the cluster is a 2.4GHz Xeon core with a Core2 architecture, whereas my local system is a new i7-3930k running at ~4.2GHz. The cluster is also 16 partitions whereas my local system is 12 partitions. I figured my local system would have the advantage here.
Maybe as an experiment try doing a distributed run but only using nodes on your local machine.
Also, you could have a look at the partitions from the two runs. There is a small chance the partitions are different enough to cause a problem.
But at the end of the day diagnosing this sort of stuff is very difficult and minor differences in machines make a big difference. I once had a simulation software (not CFX) which for distributed operation specified the network cards and motherboards you needed. It was easier to say "just use this system as we know it works" than to delve into the depths of computer performance. That is unless you want to chase up motherboard bus sizes and speeds, chip buffer sizes, network latency and all sorts of esoteric performance data.
I probably wasn't entirely clear on the network setup. I ran the simulation either exclusively on the single node on the cluster or on my local i7 machine. The partitions would have been different since the job I ran on the cluster used 16 cores/partitions as compared to 12 cores/partitions on my local machine. So I would expect some differences, but they should favour my local machine since it has fewer partitions to reconstitute.
Fortunately it means I don't have to deal with any network variables since the simulation never leaves the node or my local machine.
As for looking at the partitions, to be honest I don't know how to see what the partitions look like.
I have been thinking that maybe I should try 6 partitions and turn off hyperthreading, but by running the CFX benchmark.def on a separate computer using 16GB memory and an i7-875k cpu, I actually see a very healthy speedup in solver time using hyperthreading:
1 partition: 53s
2 partitions: 43s
4 partitions: 24s
8 partitions: 13s
This leads me to believe that I will run much faster with 12 partitions.
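For reference, the timings above can be turned into speedup and parallel efficiency figures. This is a minimal sketch using only the four numbers posted; the function name is illustrative, and the i7-875k is taken to have 4 physical cores, so the 8-partition run relies on hyperthreading.

```python
# benchmark.def solver times on the i7-875k (HT on), seconds per partition count
timings_s = {1: 53, 2: 43, 4: 24, 8: 13}

def speedup_table(timings):
    """Map partition count -> (speedup vs. 1 partition, parallel efficiency)."""
    base = timings[1]
    return {n: (base / t, base / t / n) for n, t in sorted(timings.items())}

table = speedup_table(timings_s)
for n, (speedup, efficiency) in table.items():
    print(f"{n} partitions: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

A speedup above 4 at 8 partitions is what suggests some benefit from the extra hardware threads here, although it says nothing yet about how 4 partitions would perform with hyperthreading switched off.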
I suppose I just need to find some way to speed up the de-partitioning or figure out why it is so slow on my new cpu as compared to a cpu that is more than 4 years older. What I may end up doing is progressing my solution on my local machine until the initial conditions have washed out then continue the simulation on the cluster node once I want to start writing some transient results.
I have submitted a support request to Ansys, so maybe they can offer some suggestions as well. If I find anything interesting I will report back here.
To view the partitions, load the results file in CFD-Post and look at the "real partition number" variable. It is under the "..." button.
CFX has not been compiled to use hyperthreading (at least up to V11 it was not) so I am surprised you report a speed up when you are using virtual cores.
All I saw for the real partition number was just a number, 12, for my local machine. Or is that all I'm supposed to see, just the number of partitions that were used? Is there a way to actually view the individual partitions?
Does CFX really need to be compiled for hyperthreading use? My understanding of hyperthreading is that it is hardware related. In fact, the cpu actually has specific sections that track the additional threads, kind of like two buffers for each core so that the math unit of each core can stay fully utilized and not sit idle waiting for the next instruction from only a single buffer. It doesn't have the same benefit as a full additional core but it is not entirely virtual. I believe that it may have been entirely virtual in hyperthreading's first implementation on older P4 and similar processors, hence not being very effective.
In fact, when I'm running sufficiently small jobs, it works quite well to run as many serial jobs as the CPU can track threads for, so 12 on the i7-3930k system. The performance hit comes when running a parallel job on all 12 threads, since there are some communication requirements between the partitions. Using all 12 is still faster than using 6 though (or using 8 instead of 4, as I posted earlier regarding my 875k computer). Still amazes me sometimes how far and how fast processing power has increased.
To make full use of hyperthreading the software needs to be optimised to make use of it. Testing I did of hyperthreading years ago (V11?) showed the virtual cores did speed the simulation slightly but not very much and certainly not enough to be worth paying the cost of extra parallel licenses. If you have spare parallel licenses to use then go for it but the price for performance equation certainly did not stack up last time I checked.
I thought I would try with HT enabled and disabled on my i7-3930k and think I may have found my problem:
HT enabled:
1 core: 46s
2 cores: 19s
3 cores: 11s
4 cores: 9s
5 cores: 8s
6 cores: 168s***
12 cores: 294s
HT disabled:
1 core: 26s
2 cores: 13s
3 cores: 10s
4 cores: 8s
5 cores: 6s
6 cores: 6s
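To make the cliff in these numbers explicit, here is a small Python sketch that looks for the first partition count whose run time is worse than the previous one. It assumes the first list (the one with the 12-core entry) is the HT-enabled run and the second is the HT-disabled run; the helper name is illustrative.

```python
# i7-3930k benchmark times in seconds, keyed by partition count.
# Assumption: the list containing the 12-core entry is the HT-enabled run.
ht_on = {1: 46, 2: 19, 3: 11, 4: 9, 5: 8, 6: 168, 12: 294}
ht_off = {1: 26, 2: 13, 3: 10, 4: 8, 5: 6, 6: 6}

def first_regression(timings):
    """Return the first partition count that runs slower than the
    previous (smaller) count, or None if times never increase."""
    counts = sorted(timings)
    for prev, cur in zip(counts, counts[1:]):
        if timings[cur] > timings[prev]:
            return cur
    return None

print("HT on :", first_regression(ht_on))   # 6
print("HT off:", first_regression(ht_off))  # None
```

With HT enabled the run time jumps by a factor of about 20 between 5 and 6 partitions, while with HT disabled the times never get worse as partitions are added.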
I was getting a warning in the solver manager command prompt window that my 3930k CPU contained NUMA-like packaging but my OS didn't recognize it as NUMA. Looks like the CPU architecture may be too new for V13 CFX. I had a similar issue a couple of versions ago with a C2Q 8200 CPU, in that I couldn't even run the CFX solver without an update to a specific file related to CPU architecture. Up to 5 partitions on the 3930k things run well, but from 6 onwards it grinds to a halt...well...almost.
My 875k CPU is a little older and doesn't exhibit this issue until I go over the HT limit of 8 threads to 9, which destroys the performance on that CPU also. I'm going to see if we have V14 available yet, and see if that resolves the architecture issue. Otherwise, I will run with HT off. It seems that in some cases HT can provide a benefit, and at other times it can severely hurt performance. Once the current simulation finishes on my 875k system, I think I will try the benchmark with HT off and see what the outcome is. Since I had never turned off HT on the 875k, and had always seen improved performance up to the maximum number of threads, I assumed that I was in fact getting some benefit from using HT, but I had never directly compared it to the performance with HT off. Seems there can be a significant difference.
So it turns out that somehow my slow processing time is related to hyperthreading. Using CFX V13 with HT on, I could only get good speedups until 5 partitions on my 3930k cpu. 6 and above would kill the performance. Turning HT off worked just fine.
I was also able to try CFX V14, and had some odd results. Again, up to 5 partitions worked fine with HT on, but 6 would kill the performance. Oddly, using 12 partitions worked fine again, but the speedup compared to 6 partitions without HT was only about 4%, so definitely not worth using the extra licenses.
With HT off, my cpu usage was 100%, so it really wasn't that surprising that HT had no performance gain since each cpu was already fully occupied and spent very very little time idle waiting for instructions.
I also re-ran the benchmark simulation on my i7-875k computer with HT off, with the following results:
1 core: 38s
2 cores: 21s
3 cores: 15s
4 cores: 13s
If you compare with my previous post, 4 cores with HT off is equivalent to 8 cores with HT on. The most important comparison for me is to look at the completion time for a single core, since I like to be able to use my computer for other things while simulations are running and not have 100% cpu utilization. According to the results, I could run 3 sequential sets of 4 jobs with HT on in about the same time as I could run 4 sequential sets of 3 jobs with HT off. Both cases give me 12 jobs, but the case without HT takes just a little bit less time (~96%) and only requires 3 licenses.
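The ~96% figure can be reproduced with a short calculation. This sketch assumes each concurrently running serial job takes the single-core benchmark time (53s with HT on from the earlier post, 38s with HT off) and that sets run strictly one after another; that appears to be the arithmetic behind the comparison, but it is an assumption. The helper name is illustrative.

```python
def batch_time(sets: int, single_job_s: float) -> float:
    """Wall time for `sets` back-to-back rounds of concurrent serial jobs,
    assuming every job in a round takes `single_job_s` seconds."""
    return sets * single_job_s

# Assumed single-core times on the i7-875k: 53 s (HT on), 38 s (HT off).
ht_on_total = batch_time(3, 53)    # 3 rounds of 4 jobs = 12 jobs
ht_off_total = batch_time(4, 38)   # 4 rounds of 3 jobs = 12 jobs

print(f"HT on : {ht_on_total:.0f} s")
print(f"HT off: {ht_off_total:.0f} s")
print(f"ratio : {ht_off_total / ht_on_total:.0%}")
```

Under these assumptions the HT-off schedule finishes the same 12 jobs in 152s against 159s, while holding one fewer license at a time.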
Overall, I blame my original problem on hyperthreading and on the CPU architecture being too new for the version of CFX that I was using. Although V14 seems to remedy my problem, I still find it odd that the performance turned so terrible so quickly using 6 partitions with HT enabled on the i7-3930k instead of 5 with HT enabled, but 12 partitions was fine again. There must be something funky going on with the thread scheduling and assignments or something of that nature.
I am therefore in complete agreement with Glenn that hyperthreading is not beneficial for CFX. The extra licenses would be much better used to run a simulation on a separate system.
Having said that, I think my next project will be to see if I can get my i7-875k and my i7-3930k communicating over a gigE connection since the extra licenses will now be available.
Well spotted. This reminds me that CFX selects solvers based on CPU types and if your CPU type is not recognised (eg if your CPU is newer than the software) then it just defaults to a generic solver and can be a lot slower. This is another reason to ensure you have the latest software.
And finally, I would also make sure your BIOS, network card driver, hard drive firmware and all the rest are up to date. BIOS especially: if your CPU is not supported by the BIOS, it will fall back to a generic mode which does not support all the CPU features and is much slower.
What you said about CFX not recognizing the CPU depending on the ANSYS version is very interesting. I just got a new workstation with an Intel Xeon E5-2650 2.6GHz CPU and 64GB of memory. I tried running a simple calculation which takes 20 mins from the settings to the end of the calculation on a 32-bit machine. The time it takes to open the solver manager after pre-processing is RIDICULOUS... I am using ANSYS 14.5, which is not so old, but considering that 14.5.7 and 15 have come out... any idea?
ANSYS CFX can recognise the CPUs available at the time it was released, and the common CPUs released beforehand. CPUs released after that version of CFX was released have a chance of not being recognised and not running an appropriate solver for that CPU - the result being highly unoptimised software which does not run very well.
This is not a problem I have come across recently because I keep my CFX and computers current. But if you have an old version of CFX and a new CPU you might come across it.
CPU run time for CFX-Pre and CFD-Post is very different to the solver. The CPU recognition thing only applies to the solver; there are no CPU-optimised versions of Pre or Post available, to my knowledge. So if CFX-Pre loads slowly on a machine which should be fast, then I would check that your OpenGL is OK and that your virus checker is not going berserk.
The Intel Xeon CPU E5-2650 is an SB-E chip released in Q1'12, so I would expect version 14.5 to support it. When you say the time it takes to 'open' solver manager (SM) after pre-processing is long, what do you mean by 'open'? Do you mean to launch SM, for SM to read your .def file, for parts of the CFX solver/partitioner/interpolator to run, or something else?