
Signal 15: job termination (SIGTERM)

Let us illustrate the cause of signal 15 (SIGTERM) with an example.

After running for some time, or right at startup, the program writes output like the following to its execution log:

-----------------------------------------------------------------------------

[n3104.icyb:18312] [0,1,23]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3104.icyb:18307] [0,1,18]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3104.icyb:18308] [0,1,19]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3104.icyb:18309] [0,1,20]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3104.icyb:18310] [0,1,21]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3104.icyb:18311] [0,1,22]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3144.icyb:18302] [0,1,34]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3144.icyb:18303] [0,1,35]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3144.icyb:18305] [0,1,37]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3144.icyb:18306] [0,1,38]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3144.icyb:18307] [0,1,39]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3103.icyb:19463] [0,1,9]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3103.icyb:19464] [0,1,10]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3103.icyb:19465] [0,1,11]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3103.icyb:19467] [0,1,13]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3103.icyb:19468] [0,1,14]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3103.icyb:19469] [0,1,15]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3105.icyb:18223] [0,1,25]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3105.icyb:18224] [0,1,26]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3105.icyb:18225] [0,1,27]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3105.icyb:18227] [0,1,29]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3105.icyb:18228] [0,1,30]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3105.icyb:18229] [0,1,31]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
Signal 15(-621085248)!
Software termination signal from kill.
mpirun noticed that job rank 16 with PID 18305 on node n3104 exited on signal 11 (Segmentation fault).
43 processes killed (possibly by Open MPI)

This execution log records the following sequence of events:

The process with rank 16 terminates on signal 11 (segmentation fault):
mpirun noticed that job rank 16 with PID 18305 on node n3104 exited on signal 11 (Segmentation fault).
After that, the other processes start reporting communication errors:
[n3104.icyb:18312] [0,1,23]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3104.icyb:18307] [0,1,18]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
...
[n3105.icyb:18229] [0,1,31]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)


The Open MPI runtime detects that one of the processes has crashed and terminates the remaining processes
(this is standard Open MPI behavior):
Signal 15(-621085248)!
...
43 processes killed (possibly by Open MPI)
As you can see, 43 of the 44 processes are killed, because one had already terminated earlier.

It follows that signal 15 is only a consequence here: the root cause is the segmentation fault in rank 16, which most likely points to a bug in the application software used by the job.
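
The analysis above can be partly automated. The following is a minimal sketch in Python, assuming the log uses the same mpirun message wording as in the example. The names find_root_cause and SAMPLE_LOG are hypothetical helpers introduced for illustration, and other Open MPI versions may word these messages differently.

```python
import re

# Summary line that mpirun prints about the rank that died first
# (wording taken from the example log; other versions may differ).
ROOT_CAUSE = re.compile(
    r"mpirun noticed that job rank (?P<rank>\d+) with PID (?P<pid>\d+) "
    r"on node (?P<node>\S+) exited on signal (?P<signal>\d+) \((?P<name>[^)]*)\)"
)
# Line reporting how many remaining processes were killed afterwards.
KILLED = re.compile(r"(?P<count>\d+) processes killed")


def find_root_cause(log_text):
    """Return details of the rank that died first, or None if not found."""
    match = ROOT_CAUSE.search(log_text)
    if match is None:
        return None
    info = {
        "rank": int(match["rank"]),
        "pid": int(match["pid"]),
        "node": match["node"],
        "signal": int(match["signal"]),
        "signal_name": match["name"],
    }
    killed = KILLED.search(log_text)
    info["killed_afterwards"] = int(killed["count"]) if killed else None
    return info


# Shortened copy of the example log (hypothetical test input).
SAMPLE_LOG = """\
[n3104.icyb:18312] [0,1,23]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
Signal 15(-621085248)!
Software termination signal from kill.
mpirun noticed that job rank 16 with PID 18305 on node n3104 exited on signal 11 (Segmentation fault).
43 processes killed (possibly by Open MPI)
"""

if __name__ == "__main__":
    print(find_root_cause(SAMPLE_LOG))
```

If the function returns a signal other than 15 for the first failed rank (here 11, a segmentation fault), that signal is the one to investigate in the application; the SIGTERM messages are only the subsequent cleanup.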

