PETSc in Docker: container shared memory (solving "caught signal number 7 BUS")

Since last May I have been using Underworld2, through its nice Python API, to run thermo-mechanical models of rifted margins through time. More on that in a following post.

The suggested deployment of Underworld for “usage on personal computers” is via Docker. The infrastructure I am using is on a slightly larger scale - a 36-CPU virtual machine - which we may still consider a PC for these purposes.

I was skeptical about Docker at the beginning, for no valid reason apart from my own ignorance. It has proven fit for this task, pending migration to HPC infrastructure (which will involve me learning a bit about building a Singularity image).

In the setup I have adopted, “running a model” boils down to setting up a prototype model run in a Jupyter notebook, converting it to a plain Python script with jupytext, and then calling the script with mpirun on a given number of processes, e.g. mpirun -np 8 (a minimal sketch of the workflow follows below). I will not digress on this, which deserves its own post, but carrying out the whole setup phase in notebooks provides a hassle-free way to define and test complex starting conditions. The jupytext-converted notebooks are also copied into the model output directory and rendered (using papermill), ensuring complete reproducibility and documentation of the model starting conditions.
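As a rough sketch, the command-line side of that workflow looks roughly like this (the file and directory names here are placeholders, not the actual model scripts):

jupytext --to py model_prototype.ipynb
mpirun -np 8 python model_prototype.py
papermill model_prototype.ipynb output/model_prototype.ipynb

The first command converts the notebook to a plain Python script, the second runs that script on 8 MPI processes, and the third renders a documented copy of the notebook into the model output directory.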

While looking for the optimal number of processes in this setup, I ran into the following error whenever -np was set larger than roughly 10 (the actual threshold depending on the model mesh size):

[3]PETSC ERROR: Caught signal number 7 BUS: Bus Error, possibly illegal memory access

Here is a larger portion of the error printout, with the repeated prefix omitted:

---------------------------------------------------------
Caught signal number 7 BUS: Bus Error, possibly illegal memory access
Try option -start_in_debugger or -on_error_attach_debugger
or see https://petsc.org/release/faq/#valgrind
or try http://valgrind.org on GNU/linux and Apple MacOS to find memory corruption errors
configure using --with-debugging=yes, recompile, link, and run 
to get more information on the crash.
--------------------- Error Message ---------------------
Signal received
See https://petsc.org/release/faq/ for trouble shooting.
Petsc Release Version 3.17.1, Apr 28, 2022 
03_ext.py on a  named 7e468e5db1bb by Unknown Wed Nov 16 13:16:10 2022
Configure options --with-debugging=0 --prefix=/usr/local --COPTFLAGS="-g -O3" --CXXOPTFLAGS="-g -O3" --FOPTFLAGS="-g -O3" --with-petsc4py=1 --with-zlib=1 --download-hdf5=1 --download-mumps=1 --download-parmetis=1 --download-metis=1 --download-superlu=1 --download-hypre=1 --download-scalapack=1 --download-superlu_dist=1 --download-ctetgen --download-eigen --download-triangle --useThreads=0 --download-superlu=1 --with-shared-libraries --with-cxx-dialect=C++11 --with-make-np=8
#1 User provided function() at unknown file:0
Run with -malloc_debug to check if memory corruption is causing the crash.
application called MPI_Abort(MPI_COMM_WORLD, 59) - process 3

The underlying cause turned out to be trivial; it still took a while to troubleshoot:

(GIF: status of the container's shared memory during a test run)

What I was running into is that the container was allocated only 64 MB of shared memory by default (see the Docker run reference), and this ran out as the number of processes increased. This resource proved to be quite helpful: datawookie.dev: Shared Memory & Docker [archive.org archived link].
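A quick way to see this for yourself, assuming you have a shell inside the running container, is to watch the shared memory filesystem:

watch -n 1 df -h /dev/shm

With the default allocation, /dev/shm shows up as 64M, and it fills up as the number of MPI processes grows.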

Increasing the amount of shared memory allocated to the container (--shm-size option in docker run) solved the issue.
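For example (the image name is just a placeholder; pick a size that fits your model and machine):

docker run --shm-size=2g -it underworldcode/underworld2

The --shm-size option accepts the usual byte-unit suffixes (b, k, m, g).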
