Signal 15: job termination (SIGTERM 15)

At startup, or some time after the computation begins, the following entries appear in the job's execution log:

-----------------------------------------------------------------------------

[n3104.icyb:18312] [0,1,23]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3104.icyb:18307] [0,1,18]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3104.icyb:18308] [0,1,19]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3104.icyb:18309] [0,1,20]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3104.icyb:18310] [0,1,21]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3104.icyb:18311] [0,1,22]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3144.icyb:18302] [0,1,34]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3144.icyb:18303] [0,1,35]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3144.icyb:18305] [0,1,37]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3144.icyb:18306] [0,1,38]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3144.icyb:18307] [0,1,39]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3103.icyb:19463] [0,1,9]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3103.icyb:19464] [0,1,10]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3103.icyb:19465] [0,1,11]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3103.icyb:19467] [0,1,13]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3103.icyb:19468] [0,1,14]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3103.icyb:19469] [0,1,15]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3105.icyb:18223] [0,1,25]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3105.icyb:18224] [0,1,26]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3105.icyb:18225] [0,1,27]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3105.icyb:18227] [0,1,29]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3105.icyb:18228] [0,1,30]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3105.icyb:18229] [0,1,31]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
Signal 15(-621085248)!
Software termination signal from kill.
mpirun noticed that job rank 16 with PID 18305 on node n3104 exited on signal 11 (Segmentation fault).
43 processes killed (possibly by Open MPI)

The log above reflects the following sequence of events. Note that the lines are not printed in the order in which the events occurred.

The process with rank 16 terminates:
mpirun noticed that job rank 16 with PID 18305 on node n3104 exited on signal 11 (Segmentation fault).
After that, communication errors appear in the other processes:
[n3104.icyb:18312] [0,1,23]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3104.icyb:18307] [0,1,18]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
...
[n3105.icyb:18229] [0,1,31]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)

Open MPI (mpirun) detects that one of the processes has crashed and terminates the remaining processes (this is standard Open MPI behavior):
Signal 15(-621085248)!
...
43 processes killed (possibly by Open MPI)
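As the log shows, 43 of the 44 processes are killed, because one of them had already exited earlier.
This most likely indicates a bug in the application software used by the job: the first process died with a segmentation fault (signal 11), and signal 15 is only a consequence of that crash.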
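
The same reading of the log can be sketched as a short Python script. This is only an illustration, not a tool provided by the cluster: the function name `summarize_log`, the `SAMPLE_LOG` excerpt, and the regular expressions are assumptions based on the exact message wording shown above, and other Open MPI versions may word these messages differently. It finds the process that crashed first, counts the "Connection reset by peer" errors per host, and reconstructs the total process count as "killed + 1", following the reasoning above.

```python
import re
from collections import Counter

# Patterns are assumptions derived from the log lines quoted in this article.
CRASH_RE = re.compile(
    r"mpirun noticed that job rank (\d+) with PID (\d+) on node (\S+) "
    r"exited on signal (\d+) \((.+?)\)"
)
KILLED_RE = re.compile(r"(\d+) processes killed")
RESET_RE = re.compile(r"^\[([^:\]]+):\d+\] .*Connection reset by peer")

# A shortened excerpt of the log above, used only for demonstration.
SAMPLE_LOG = """\
[n3104.icyb:18312] [0,1,23]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3104.icyb:18307] [0,1,18]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
[n3105.icyb:18229] [0,1,31]-[0,1,0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
Signal 15(-621085248)!
Software termination signal from kill.
mpirun noticed that job rank 16 with PID 18305 on node n3104 exited on signal 11 (Segmentation fault).
43 processes killed (possibly by Open MPI)
"""


def summarize_log(text):
    """Extract the first crash, the kill count, and reset errors per host."""
    crash = None
    killed = None
    resets = Counter()
    for raw in text.splitlines():
        line = raw.strip()
        m = CRASH_RE.search(line)
        if m:
            rank, pid, node, signo, name = m.groups()
            crash = {
                "rank": int(rank),
                "pid": int(pid),
                "node": node,
                "signal": int(signo),
                "signal_name": name,
            }
            continue
        m = KILLED_RE.search(line)
        if m:
            killed = int(m.group(1))
            continue
        m = RESET_RE.match(line)
        if m:
            resets[m.group(1)] += 1
    # Assumption taken from the text: the crashed process is not counted
    # among the killed ones, so the job had killed + 1 processes in total.
    total = killed + 1 if crash is not None and killed is not None else None
    return {
        "crash": crash,
        "killed": killed,
        "total": total,
        "resets_per_host": dict(resets),
    }


if __name__ == "__main__":
    print(summarize_log(SAMPLE_LOG))
```

If the reported crash signal is something other than 15 (here it is 11, a segmentation fault), the SIGTERM lines are a symptom, and the application code running in the crashed rank is the place to start looking.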
