I stumbled across some open source software called Apache Hadoop and wanted to know if any member here has experience with it. Apparently it is 'software for reliable, scalable, distributed computing'.
Is it worth giving it a try?
AFAIK, Hadoop was developed for a very different kind of distributed computing than CFD. It was built for maintaining web-based platforms, such as social websites, financial stock trading platforms and other applications dealing with highly inter-related metadata.
Using it for CFD would likely only be useful as a highly distributed job scheduling system (http://en.wikipedia.org/wiki/Job_scheduler): interconnecting millions of clusters around the world, solving independent problems on each cluster, cataloguing each simulation performed, and then gathering the inputs, outputs and post-processing into one gigantic library of results, including inter-relational connections between those simulations. Sort of like a very big fingerprint database.
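To make the cataloguing idea a bit more concrete, here is a toy sketch of the map-reduce pattern that Hadoop is built around, applied to a made-up simulation catalogue. This is plain Java with no Hadoop dependency at all; the record fields (solver, case name, converged flag) and the values are invented purely for illustration, not any real Hadoop API:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SimulationCatalog {
    // "Reduce" step: count the converged runs per solver, after a
    // conceptual "map" step has emitted (solver, converged) records
    // from many clusters. Record layout: {solver, caseName, converged}.
    static Map<String, Long> convergedPerSolver(List<String[]> records) {
        return records.stream()
            .filter(r -> Boolean.parseBoolean(r[2]))                      // keep converged runs
            .collect(Collectors.groupingBy(r -> r[0],                     // group by solver name
                                           Collectors.counting()));      // count per group
    }

    public static void main(String[] args) {
        List<String[]> records = List.of(
            new String[]{"simpleFoam", "airfoil2D",    "true"},
            new String[]{"simpleFoam", "backwardStep", "false"},
            new String[]{"pisoFoam",   "cavity3D",     "true"},
            new String[]{"simpleFoam", "airfoil2D",    "true"}
        );
        System.out.println(convergedPerSolver(records));
    }
}
```

In real Hadoop the grouping and counting would be split across Mapper and Reducer classes and distributed over the cluster's filesystem, but the shape of the computation (emit key/value pairs, group by key, aggregate) is the same.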
If you only have access to one or two clusters, Hadoop is massive overkill, unless you want to build a platform for a university or some other teaching facility, where the platform could warn students that a particular simulation will never work, because students in previous years had already attempted it and failed.
Thanks for your reply.
Well, if I understand it correctly, does this mean I could actually use it to access remote HPC clusters? My access rights to some of the HPC clusters I had previously used have expired, so I am finding it hard to run simulations in parallel.
So if Hadoop could actually allow me to access and use multiple clusters, or even one cluster, that would be immensely beneficial for my research, I guess.
I came across this forum quite by chance as a result of your Hadoop questions. I am a Hadoop newbie and am looking for a Hadoop-related forum here.
Any suggestions would be appreciated
Greetings to all!
So I found out recently, thanks to Lorena Barba re-tweeting about this, that MPI is apparently getting too old and that Hadoop/Spark is one of a few technologies likely to replace MPI sometime in the future:
Therefore, I'm bringing up the original post here on this thread:
Essentially, Hadoop/Spark is pretty much a platform in its own right. It's mostly written in Java, a language that (AFAIK) is rarely used for programming CFD software, because Java runs on a virtual machine and, even with JIT compilation, generally won't be as run-time efficient as C/C++/FORTRAN. But as the blog post argues, with today's CPUs and how things have evolved, this language overhead might no longer be what's stopping us; what matters is how long things take to code. In fact, these platforms already have optimization strategies embedded into them that we are unlikely to be able to reproduce in C/C++/FORTRAN without considerable effort (or at least without searching for the right library).
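As a small illustration of the "optimization already embedded in the language" point: the JVM gives you a parallel reduction across all cores in essentially one line, something that in C/C++/FORTRAN would usually mean OpenMP pragmas or explicit threading. A minimal sketch (the function and values are my own, not from the blog post):

```java
import java.util.stream.LongStream;

public class ParallelSum {
    // Sum of squares 1^2 + 2^2 + ... + n^2, computed in parallel.
    // .parallel() hands the range to the JVM's fork-join pool, which
    // splits it across the available cores automatically.
    static long sumOfSquares(long n) {
        return LongStream.rangeClosed(1, n)
                         .parallel()
                         .map(x -> x * x)
                         .sum();
    }

    public static void main(String[] args) {
        // Closed form for checking: n(n+1)(2n+1)/6
        System.out.println(sumOfSquares(1_000));
    }
}
```

Whether the JVM's automatic scheduling beats a hand-tuned OpenMP loop is exactly the kind of trade-off the blog post is talking about: you give up some peak performance for a lot less code.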
Then there is the other detail: at least in theory, to make the most of the Hadoop platform, it's best to write the CFD software directly in Java, linked directly against Hadoop's libraries, which in most cases implies re-writing the whole code.
Using C++ and other languages to connect to Hadoop is also possible, but after a quick search, it seems to require some investigation into which base library should really be used for making the connection; MapReduce-MPI, Hadoop Pipes and MR4C (Google's implementation) are just a few names, out of the few dozen that already exist.
Then there are also complete alternatives to any of the above, such as:
All of this is just to say that using Hadoop as a building block for creating CFD applications is something that might happen 3-5 years from now, or it may end up being used in the back office of cloud services that provide CFD software as an online service, without us even knowing about it.