With the rapid increase in computational power available on desktop machines, users are demanding ever-larger computational grids (domains). Even on the fastest machines, the time required to run some of these cases to convergence can run into many hours. PHOENICS offers the option of reducing solution times by splitting the computational domain into smaller sub-domains, each of which is solved in a separate process, thus significantly reducing the time to the overall solution. This is what is known as parallel PHOENICS.
Historically, a user who wanted to make effective parallel calculations would have needed to create what was known as a PC-cluster of networked computers. Today, sufficient parallel computing power can often be obtained within a single workstation. If a traditional cluster of multiple machines is needed, they should be connected via a fast Ethernet switch and configured to allow TCP/IP socket connections between all computers. For efficient parallel computation it is preferable for all machines in the cluster to have the same specification.
The hardware available changes so quickly that it is difficult to make specific recommendations; these will also depend on the budget available and the preferences of your organisation. PHOENICS in general, and the solver in particular, does not require any special graphics capabilities, so in choosing a machine we would recommend spending the extra on faster processors and additional RAM.
When choosing hardware for PHOENICS, a metric that generally gives a good indication of performance is the available memory bandwidth per CPU core, followed by the number of CPU cores and then other factors such as clock frequency or the age of the hardware. PHOENICS will run on one core or on however many cores you can provide. Memory use is proportional to the number of control volumes and the number of variables solved for, so hardware requirements are highly problem-dependent.
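As a rough illustration of how memory demand scales with problem size, the sketch below multiplies cells by solved variables and by an assumed storage cost. The 8 bytes per value (double precision) and the overhead multiplier of 2.0 for auxiliary and geometry storage are assumptions for illustration only; the actual per-cell storage in PHOENICS depends on the model settings.

```python
def estimate_memory_gb(ncells, nvars, bytes_per_value=8, overhead=2.0):
    """Crude memory estimate: cells x variables x bytes per value,
    with an assumed multiplier for auxiliary/geometry storage.
    Both bytes_per_value and overhead are illustrative guesses."""
    return ncells * nvars * bytes_per_value * overhead / 1e9

# Example: 10 million control volumes, 10 solved variables
print(f"{estimate_memory_gb(10_000_000, 10):.1f} GB")  # prints "1.6 GB"
```

Doubling the grid in each direction multiplies the cell count (and hence the estimate) by eight, which is why memory per node becomes the limiting factor for large domains.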
For cluster operation the choice of interconnect may also be very important, but we have neither direct experience nor data to quantify its effect. As a guideline one may look to, for example, university clusters or supercomputers, which almost all employ some sort of high-speed, low-latency interconnect such as InfiniBand or modern Ethernet (> 20 Gbps).
For reference, in our office we currently have some 16-core AMD Threadripper based systems that perform well. What we looked for when purchasing these CPUs was quad-channel memory support, with at least one stick of RAM per channel. Current server platforms such as Intel's Xeon Scalable and AMD's Epyc have six and eight memory channels respectively, and would likely be good choices with a balance of 2-4 cores per memory channel.
When users connect their machines on a local network, they hope the parallel program will accelerate their calculations. Often this is the case, but sometimes a deceleration is seen instead. To understand why, the user should know something about parallel programming. Every parallel CFD program decomposes the computational domain into several sub-domains, usually as many as there are processors available, and each sub-domain exchanges data with its neighbours. If many data items are exchanged, the processors spend much of their time on communication and only a small fraction on useful calculation; this is an ineffective use of parallel computing. Parallel calculation is effective when the computation time is significantly greater than the data-exchange time. Users should therefore choose decompositions that place fewer cells on inter-domain boundaries. The data-exchange time depends on the latency and per-message transfer time, which are characteristics of the PC-cluster.
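The point about boundary cells can be made concrete by counting them for two ways of splitting the same grid among the same number of processes. This sketch is illustrative only (it is not part of PHOENICS, and the grid sizes are arbitrary): it compares a one-dimensional slab decomposition with a three-dimensional block decomposition, both using eight sub-domains.

```python
def slab_halo_cells(nx, ny, nz, nproc):
    """Split along x into nproc slabs: each of the (nproc - 1) internal
    interfaces is a full ny*nz plane of cells, exchanged in both
    directions."""
    return 2 * (nproc - 1) * ny * nz

def block_halo_cells(nx, ny, nz, px, py, pz):
    """Split into px*py*pz blocks: count the cells on internal faces
    in each coordinate direction."""
    halo = 2 * (px - 1) * ny * nz   # faces normal to x
    halo += 2 * (py - 1) * nx * nz  # faces normal to y
    halo += 2 * (pz - 1) * nx * ny  # faces normal to z
    return halo

nx = ny = nz = 120  # about 1.7 million cells in total
print("slab  1x1x8:", slab_halo_cells(nx, ny, nz, 8), "halo cells")    # 201600
print("block 2x2x2:", block_halo_cells(nx, ny, nz, 2, 2, 2), "halo cells")  # 86400
```

For the same eight processes, the 2x2x2 block decomposition exchanges less than half the data of the 1x1x8 slab decomposition, which is why decompositions with a smaller surface-to-volume ratio per sub-domain generally parallelise more efficiently.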
Items to note when running parallel PHOENICS: