EM64t w/STAR problems
My company recently opened up their frugal coffers and began leasing 10 Dell blade EM64T 2-CPU nodes, running Red Hat Enterprise Linux AS 3 Update 5, as an addition/replacement to my CFD cluster. It has been anything but a turnkey solution.
Since the release of STAR v3.2, the solver has become significantly more sensitive to the hardware and operating system. The current problem I'm struggling with is this – STAR is timing out on ssh and terminating mid-way through the simulation. The problem is intermittent and not specific to any one node.
CD-adapco support gave us a band-aid with the -notracker option. This seems to reduce the amount of ssh'ing STAR does during a simulation. It has allowed some of my analyses to run to completion, but others are now failing at startup.
I don't believe that it's necessarily a STAR issue because we can get ssh, and rsh, to hang by just typing in the command at the command prompt. Just as within the STAR environment, these failures are random.
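For anyone who wants to reproduce this outside STAR, here is a rough sketch of the kind of probe I mean: it runs a trivial command over ssh on each node many times and counts how often it succeeds, errors out, or hangs past a timeout. The function names, the 15-second timeout, and the number of rounds are my own choices for illustration, and it assumes passwordless ssh and a Python 3 interpreter on the machine running the probe.

```python
import subprocess
import time


def probe(host, timeout=15.0, run=subprocess.run):
    """Run a trivial remote command once; return (status, seconds).

    status is "ok", "error" (ssh returned non-zero) or "hang"
    (no answer within `timeout` seconds). `run` can be replaced
    for testing without touching the network.
    """
    # BatchMode=yes stops ssh from waiting on a password prompt,
    # so an authentication problem shows up as an error, not a hang.
    cmd = ["ssh", "-o", "BatchMode=yes", host, "true"]
    start = time.monotonic()
    try:
        result = run(cmd, stdout=subprocess.DEVNULL,
                     stderr=subprocess.DEVNULL, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "hang", time.monotonic() - start
    status = "ok" if result.returncode == 0 else "error"
    return status, time.monotonic() - start


def survey(hosts, rounds=20, timeout=15.0, run=subprocess.run):
    """Probe every host `rounds` times and tally the outcomes per host."""
    counts = {h: {"ok": 0, "error": 0, "hang": 0} for h in hosts}
    for _ in range(rounds):
        for h in hosts:
            status, _elapsed = probe(h, timeout, run)
            counts[h][status] += 1
    return counts
```

Calling survey() with the list of node names and printing the returned dictionary shows whether the hangs cluster on particular nodes or really are spread randomly, which is what we see.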
Does anyone have any thoughts on how to overcome this problem? Any suggestion would be very much appreciated.
Enthusiastically Dejected Paulh
Re: EM64t w/STAR problems
On our cluster the default Linux configuration (presumably set for interactive graphics use) was actually unstable on about half our headless nodes. This was obvious because half the nodes were hot when idling. It was fixed by passing a kernel parameter at boot time. The real point here is that the default configuration for Linux from most distributors is often inappropriate for a cluster.
What does your blade server use for an interconnect? Our cluster uses gigabit Ethernet, and the default parameters in the Ethernet driver were also spectacularly bad for cluster use, which needs reasonable performance with lots of small messages rather than streaming large files. A few minutes of experimentation brought large improvements.
A few minutes chatting to someone familiar with setting up Linux clusters for numerical simulation might cure your problems. Does your supplier have this knowledge? Be careful, because what is appropriate for numerical simulation does not follow from what is appropriate for data farms and other common uses of clusters.
If you want to post questions about the setup and configuration this is probably a good place:
Re: EM64t w/STAR problems
When we have had rsh hangs, it was often because the user's home directory was not mounted or could not be mounted (i.e. because of NFS or automount or, rarely, NIS problems). In other cases the user's .cshrc file referenced something on a filesystem that couldn't be mounted (again because of NFS or automount). ssh is even worse, because the keys that it needs to pass back and forth are usually stored in the home directory, so if they are not there (because the home isn't mounted) then it almost always hangs. You could log into a node in advance as root and see whether the home directory gets mounted or not as the user tries to rsh/ssh in. You could also strip the .cshrc to nothing and build it up again bit by bit until you see the problem.
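A rough sketch of that mount check, to be run as root on the node while the user tries to get in. It reads text in /proc/mounts format and reports which mount actually covers a given path. The function names are made up for illustration, and it assumes the usual automounter layout where the parent directory shows up as type autofs until the real home is mounted; it does not handle mount points with escaped spaces.

```python
import os


def parse_mounts(text):
    """Return (mount_point, fs_type) pairs from /proc/mounts-style text."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, filesystem type, options, ...
        if len(fields) >= 3:
            entries.append((fields[1], fields[2]))
    return entries


def covering_mount(path, entries):
    """Return (mount_point, fs_type) of the deepest mount containing path.

    Returns None if nothing matches. When two entries share a mount
    point, the later one wins, since it is mounted on top.
    """
    path = os.path.normpath(path)
    best = None
    for mount_point, fs_type in entries:
        mp = mount_point.rstrip("/") or "/"
        inside = path == mp or path.startswith(mp.rstrip("/") + "/")
        if inside and (best is None or len(mp) >= len(best[0])):
            best = (mp, fs_type)
    return best
```

Feed it the contents of /proc/mounts and the user's home path. If the answer comes back as nfs, the home really is mounted; if it comes back as autofs or as the local root filesystem, the home was not there at that moment, which would fit the hangs described above.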