Introduction
In recent discussions about Dymola as a co-simulation platform, the question of running FMUs in parallel has come up. We have run some experiments to investigate the performance.
To enable parallelization, set the flag:
Advanced.Translation.ParallelizeCode = true.
All experiments are done with default settings; the co-simulation FMUs are simply dragged and dropped into the master model. The only exception is when several instances of the same FMU are used; in that case the option
Advanced.FMI.LockFMUInstance = true
is set to lock each instance separately (as recommended by the translation log).
For the case of FMUs of similar costs and equal communication interval, the results are very encouraging.
Setup
We use Dymola 2024x and a simple setup: one FMU is instantiated several times. The FMU is created in Dymola from a model consisting of a large nonlinear system of size N = 200 with as many dynamic states. Since sparse solvers are not used, each FMU evaluation is expensive.
The model has one input and one output. The FMUs are connected in a loop. With the default option fmi_UsePreOnInputSignals = true, parallelization still works even when there is direct feed-through in the FMU. Example with six FMUs:
The computer used has 6 physical cores, meaning 12 logical cores.
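To make the role of fmi_UsePreOnInputSignals concrete, here is a toy master loop in Python. This is a structural sketch only, not Dymola's implementation: the MockFMU class and all names below are invented for illustration. The key point is that every FMU reads its loop neighbour's output from the previous communication point, so all doStep calls within one communication interval are independent and can be dispatched in parallel. (With Python threads, a real speedup would rely on the FMU's native doStep releasing the GIL; here the threads only illustrate the scheduling.)

```python
# Toy co-simulation master for N mock "FMUs" connected in a loop.
# Inputs are frozen at the previous communication point, mimicking the
# effect of fmi_UsePreOnInputSignals = true.
from concurrent.futures import ThreadPoolExecutor

class MockFMU:
    """Stands in for an FMU instance; a first-order lag toward its input."""
    def __init__(self, state=0.0):
        self.state = state
        self.output = state

    def do_step(self, u, h):
        self.state += h * (u - self.state)  # simple internal dynamics
        self.output = self.state

def simulate(n_fmus=6, n_steps=100, h=0.01):
    # Seed the first FMU so a signal propagates around the loop.
    fmus = [MockFMU(1.0 if i == 0 else 0.0) for i in range(n_fmus)]
    with ThreadPoolExecutor(max_workers=n_fmus) as pool:
        for _ in range(n_steps):
            # Inputs taken from the *previous* communication point:
            # FMU i reads the last output of its loop neighbour i-1
            # (index -1 closes the loop).
            inputs = [fmus[i - 1].output for i in range(n_fmus)]
            # All doStep calls in this interval are now independent.
            list(pool.map(lambda p: p[0].do_step(p[1], h),
                          zip(fmus, inputs)))
    return [f.output for f in fmus]
```

Calling simulate() returns the six loop outputs after 100 communication steps; because the inputs are frozen before stepping, the parallel result is identical to stepping the FMUs one after the other.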
Experiment 1
We keep the number of cores equal to the number of FMUs. Perfect parallelization would then mean that the execution time stays constant as the number of FMUs increases. Here we use fmi_NumberOfSteps = 100.
The result is good: only 29 % longer simulation time for six FMUs on six cores compared to one FMU on one core.
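In numbers: since perfect scaling keeps the wall time constant, the overhead can be read off as the time ratio minus one. The absolute times below are hypothetical (the post only reports the relative figure):

```python
# Illustrative only: the 29 % figure is from the post, the absolute
# times are assumed for the sake of the arithmetic.
t_one_fmu_one_core = 10.0    # hypothetical wall time, seconds
t_six_fmus_six_cores = 12.9  # hypothetical wall time, seconds

# Perfect scaling would keep the time constant; the overhead is the excess.
overhead = t_six_fmus_six_cores / t_one_fmu_one_core - 1.0
print(f"{overhead:.0%}")  # 29%
```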
Experiment 2
We vary the size of the nonlinear system (N) to investigate the importance of costly steps. The setup is as above, but always using two FMUs and either enabling or disabling parallelization (on two cores). The parameter fmi_NumberOfSteps is increased to 500 to get measurable times also for the smaller sizes.
The plot presents the ratio of the simulation time on two cores to the simulation time on one core; optimal parallelization performance thus corresponds to a ratio of 50 %.
When the system size is small, the FMU is too cheap and the overhead has an adverse effect. For N = 12 the full co-simulation takes only about 0.1 s. For sizes N = 100 (about 8 s on two cores) and above, the parallelization performance is close to optimal.
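Writing the plotted quantity out explicitly: the post gives roughly 8 s on two cores for N = 100 with close-to-optimal scaling, which would imply a one-core time of about 16 s. The 16 s figure is an inference for illustration, not a reported measurement:

```python
def parallel_ratio(t_two_cores: float, t_one_core: float) -> float:
    """Two-core simulation time over one-core time; 0.5 is optimal."""
    return t_two_cores / t_one_core

# Hypothetical one-core time of ~16 s, consistent with the statement that
# the N = 100 case (about 8 s on two cores) scales close to optimally.
ratio = parallel_ratio(8.0, 16.0)
print(f"{ratio:.0%}")  # 50%
```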
