CFD Online Discussion Forums - Single Core vs. Multi Core Issue

I don't clearly understand your situation - is it a personal computer or a cluster? If it is a personal computer, only "1" and "4" apply. "4" is the most probable in your case.
1) If you often read/write different case/data files (rather than continuing one calculation), the slowdown appears because the solution has to be partitioned across the cores (time expense) and then gathered back from the parts (time expense). Mesh partitioning is an operation in which one core (the host, main, head or master process...) divides the work among many, using an algorithm to balance the load. MPI is then used to send these parts through the network interface to the other cores. Later the main process collects them again with some kind of MPI receive function (gathering).
Solution: small meshes don't need partitioning/parallelization by domain decomposition (the method widely used in mesh solvers).
2) The other issue could be the interconnect throughput/latency. A relatively slow interconnect + the large network traffic generated by FLUENT (each core gets a small part of the work, so iterations finish very quickly) => bad performance. You could even get worse performance than on a single core.
Solution: choose a proper interconnect. InfiniBand is supported by FLUENT and is very fast - use it instead of Ethernet, if you have it.

Code:

-pinfiniband

Passing this option at startup will help (the interconnect itself should be properly tuned).
See also the solution for "1" (on a personal computer it is the only solution that applies to "2").
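To make points "1" and "2" a bit more concrete, here is a rough toy model of one solver iteration under domain decomposition. It is only a sketch: the function names and every constant (time per cell, latency, bandwidth, bytes per interface face, messages per iteration) are my own illustrative assumptions, not FLUENT measurements, and the one-time partition/gather cost from "1" is not modeled at all.

```python
# Toy cost model for one solver iteration under domain decomposition.
# Every constant below is an illustrative assumption, NOT a FLUENT measurement.

def iteration_time(cells, cores, t_cell=1e-6, latency=50e-6,
                   bandwidth=100e6, bytes_per_face=64, msgs_per_iter=20):
    """Estimated wall time in seconds of one iteration on `cores` processes."""
    compute = cells * t_cell / cores
    if cores == 1:
        return compute  # a single partition has nothing to exchange
    # Assume roughly cubic partitions: interface size ~ (cells per core)^(2/3).
    halo_faces = 6 * (cells / cores) ** (2.0 / 3.0)
    # Each message pays a fixed latency plus a size-dependent transfer time.
    comm = msgs_per_iter * (latency + halo_faces * bytes_per_face / bandwidth)
    return compute + comm


def speedup(cells, cores, **kwargs):
    """Single-core time divided by the time on `cores` processes."""
    return iteration_time(cells, 1, **kwargs) / iteration_time(cells, cores, **kwargs)


if __name__ == "__main__":
    for cells in (1_000, 100_000, 1_000_000):
        slow = speedup(cells, 4)                                # assumed slow link
        fast = speedup(cells, 4, latency=2e-6, bandwidth=3e9)   # assumed fast link
        print(f"{cells:>9} cells: speedup on 4 cores "
              f"slow link {slow:.2f}, fast link {fast:.2f}")
```

With these assumed numbers, a very small mesh comes out slower on 4 cores than on 1, because the communication term dominates the tiny compute term, while a large mesh benefits, and benefits more on the faster interconnect. Only the trend matters here, not the exact values.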
3) (for clusters only) The third thing to mention is the speed of your data storage system. A slow storage system + frequent disk reads/writes => bad performance.
Solution: use a good data storage system.
4) If you run out of RAM, your calculation proceeds partially in swap, which is hard disk drive space. When you use a single core, a single data stream is written to the HDD; when you use four, four data streams are written simultaneously. But your HDD can't read/write 4 streams simultaneously (assuming you don't have a parallel r/w storage system): the read/write heads will go back and forth reading/writing pieces of data. So you can't exceed the HDD speed in that case, and with the partitioning issues from "1" on top, you would even slow down your solution by adding another core.
Please correct me if I'm wrong somewhere.
Solution: increase RAM. Use distributed memory systems (clusters, supercomputers).
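For point "4", a quick back-of-the-envelope check of whether a case is likely to spill into swap could look like the sketch below. The helper `fits_in_ram` is hypothetical, and both default figures (bytes per cell and fixed overhead per process) are placeholder assumptions that you should replace with values observed for your own solver and models.

```python
# Rough check of whether a parallel run is likely to fit in physical RAM.
# bytes_per_cell and overhead_per_process are ASSUMED placeholder values;
# measure them for your own solver setup before trusting the result.

def fits_in_ram(cells, processes, ram_bytes,
                bytes_per_cell=2_000,
                overhead_per_process=200 * 2**20):
    """Return (estimated_bytes_needed, fits) for the given run."""
    needed = cells * bytes_per_cell + processes * overhead_per_process
    return needed, needed <= ram_bytes


if __name__ == "__main__":
    ram = 8 * 2**30  # example machine with 8 GiB of RAM
    for cells in (1_000_000, 5_000_000):
        for processes in (1, 4):
            needed, ok = fits_in_ram(cells, processes, ram)
            verdict = "fits" if ok else "will probably swap"
            print(f"{cells} cells, {processes} process(es): "
                  f"{needed / 2**30:.1f} GiB needed, {verdict}")
```

In this sketch every extra process adds its own fixed overhead, so a case that barely fits on one core can start swapping on four, which is consistent with the slowdown described in "4".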