ttcn3 connect() on port triggering kernel OOM killer on component
Summary
At osmocom we have a set of testsuites in TTCN3 which tests several telecommunications nodes in this repository [1]. Most tetsuite are run nightly with a prepared containerized setup under docker available at [2].
The PGW_Tests setup (docker-playground.git/ttcn3-pgw-test/) starts a open5gs core network (specifically UPF and SMF) and tests creating 256 GTPv2 concurrent sessions against it (TC_createSession_ping4_256), launching 256 ping commands on one tun dev per conn. TTCN3 emulates OCS and PFC (Gx, Gy interface) as well as SGW (GTP). The most relevant code can be found at [3].
[1] https://gitea.osmocom.org/ttcn3/osmo-ttcn3-hacks [2] https://gitea.osmocom.org/osmocom/docker-playground [3] https://gitea.osmocom.org/ttcn3/osmo-ttcn3-hacks/src/branch/master/pgw/PGW_Tests.ttcn
In TTCN3, we have 1 component per GTP session, each in turn has ports connected to a Gx DIAMETER Gx and Gy DIAMETER components through component ports.
Steps and/or TTCN-3 code to reproduce
Today, the PGW_Tests testsuite started to fail within a specific test: TC_createSession_ping4_256 Looking at the logs it seems the first 3-4 GTP sessions (ttcn3 components) are created fine and everything works fine, but at around the 4th-5th one it fails due to one component detecting a peer component closed the unix socket uses to intercommunicate with it (I also attah full merged log of all components during the failing test runPGW_Tests.TC_createSession_ping4_256.pcap.gzPGW_Tests.TC_createSession_ping4_256.merged.gz):
17:00:55.894883 22 - Connection of port DIAMETER to TC_createSession_ping4_256(4):DIAMETER_CLIENT was closed unexpectedly by the peer.
17:00:55.894983 19 PGW_Tests.ttcn:85 Connection of port DIAMETER_PROC to TC_createSession_ping4_256(4):DIAMETER_PROC was closed unexpectedly by the peer.
17:00:55.895109 16 PGW_Tests.ttcn:85 Connection of port DIAMETER_PROC to TC_createSession_ping4_256(4):DIAMETER_PROC was closed unexpectedly by the peer.
17:00:55.895114 13 PGW_Tests.ttcn:85 Connection of port DIAMETER_PROC to TC_createSession_ping4_256(4):DIAMETER_PROC was closed unexpectedly by the peer.
17:00:55.895215 10 PGW_Tests.ttcn:85 Connection of port DIAMETER_PROC to TC_createSession_ping4_256(4):DIAMETER_PROC was closed unexpectedly by the peer.
17:00:55.895219 7 PGW_Tests.ttcn:85 Connection of port DIAMETER_PROC to TC_createSession_ping4_256(4):DIAMETER_PROC was closed unexpectedly by the peer.
17:00:55.895260 22 - Port DIAMETER was disconnected from TC_createSession_ping4_256(4):DIAMETER_CLIENT.
17:00:55.895350 19 PGW_Tests.ttcn:85 Port DIAMETER_PROC was disconnected from TC_createSession_ping4_256(4):DIAMETER_PROC.
17:00:55.895386 mtc PGW_Tests.ttcn:275 Connection of port Gx_PROC to TC_createSession_ping4_256(4):DIAMETER_PROC was closed unexpectedly by the peer.
17:00:55.895423 16 PGW_Tests.ttcn:85 Port DIAMETER_PROC was disconnected from TC_createSession_ping4_256(4):DIAMETER_PROC.
17:00:55.895427 13 PGW_Tests.ttcn:85 Port DIAMETER_PROC was disconnected from TC_createSession_ping4_256(4):DIAMETER_PROC.
17:00:55.895509 10 PGW_Tests.ttcn:85 Port DIAMETER_PROC was disconnected from TC_createSession_ping4_256(4):DIAMETER_PROC.
17:00:55.895511 7 PGW_Tests.ttcn:85 Port DIAMETER_PROC was disconnected from TC_createSession_ping4_256(4):DIAMETER_PROC.
17:00:55.895743 mtc PGW_Tests.ttcn:275 Port Gx_PROC was disconnected from TC_createSession_ping4_256(4):DIAMETER_PROC.
17:00:55.896229 19 PGW_Tests.ttcn:85 Connection of port DIAMETER to TC_createSession_ping4_256(4):DIAMETER_CLIENT was reset by the peer.
17:00:55.896294 16 PGW_Tests.ttcn:85 Connection of port DIAMETER to TC_createSession_ping4_256(4):DIAMETER_CLIENT was closed unexpectedly by the peer.
17:00:55.896298 13 PGW_Tests.ttcn:85 Connection of port DIAMETER to TC_createSession_ping4_256(4):DIAMETER_CLIENT was closed unexpectedly by the peer.
17:00:55.896378 10 PGW_Tests.ttcn:85 Connection of port DIAMETER to TC_createSession_ping4_256(4):DIAMETER_CLIENT was closed unexpectedly by the peer.
17:00:55.896509 19 PGW_Tests.ttcn:85 Warning: The last outgoing messages on port DIAMETER may be lost.
17:00:55.896530 7 PGW_Tests.ttcn:85 Connection of port DIAMETER to TC_createSession_ping4_256(4):DIAMETER_CLIENT was closed unexpectedly by the peer.
17:00:55.896582 16 PGW_Tests.ttcn:85 Port DIAMETER was disconnected from TC_createSession_ping4_256(4):DIAMETER_CLIENT.
17:00:55.896586 13 PGW_Tests.ttcn:85 Port DIAMETER was disconnected from TC_createSession_ping4_256(4):DIAMETER_CLIENT.
17:00:55.896601 mtc PGW_Tests.ttcn:275 Connection of port Gx_UNIT to TC_createSession_ping4_256(4):DIAMETER_UNIT was closed unexpectedly by the peer.
17:00:55.896640 10 PGW_Tests.ttcn:85 Port DIAMETER was disconnected from TC_createSession_ping4_256(4):DIAMETER_CLIENT.
17:00:55.896762 7 PGW_Tests.ttcn:85 Port DIAMETER was disconnected from TC_createSession_ping4_256(4):DIAMETER_CLIENT.
17:00:55.896827 19 PGW_Tests.ttcn:85 Port DIAMETER was disconnected from TC_createSession_ping4_256(4):DIAMETER_CLIENT.
17:00:55.896851 mtc PGW_Tests.ttcn:275 Port Gx_UNIT was disconnected from TC_createSession_ping4_256(4):DIAMETER_UNIT.
17:00:55.897238 mtc PGW_Tests.ttcn:275 Dynamic test case error: Error message was received from MC: Establishment of port connection 4:DIAMETER_PROC - 22:DIAMETER_PROC failed because test component 4 has terminated during connection setup.
So looking at the TTCN3 logs (attached), I can compare the previous 3-4 created sessions with the last failing one, and it seems the "closed unexpectedly" happens while the port of the component is in the process of being connected() to the remote port (not yet completed).
When those errors show up, I get following output in "dmesg":
[28567.554950] __vm_enough_memory: pid: 233484, comm: PGW_Tests, no enough memory for the allocation
[28567.555002] __vm_enough_memory: pid: 233484, comm: PGW_Tests, no enough memory for the allocation
[28567.555066] __vm_enough_memory: pid: 233484, comm: PGW_Tests, no enough memory for the allocation
That's due to sysctl vm.overcommit_memory being 0 in my system (https://www.kernel.org/doc/html/latest/mm/overcommit-accounting.html). Disabling it (sysctl -w vm.overcommit_memory=1) makes it trigger the OOM killer every time I run the test, which kills the "PGW_Tests" binary containing the compiled TTCN3.
So I then run the "ttcn3_start" binary under "strace -ff -o /data/strace.log" to find out more. I get lots of strace.log, one per component (process). I grep for "unexpectedly" to find out the failing component, and look around to find out what's going on. According to the code, this happens because the unix socket recv() is returning 0. strace confirms that:
set_robust_list(0x7f9141acdee0, 24) = 0
close(3) = 0
epoll_create(16) = 3
close(4) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 4, 0x7ffc6c88dd5c) = -1 EBADF (Bad file descriptor)
fcntl(4, F_GETFD) = -1 EBADF (Bad file descriptor)
close(6) = 0
openat(AT_FDCWD, "PGW_Tests-TC_createSession_ping4_256-7b8d6c76ffd5-22.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = 4
lseek(4, 0, SEEK_END) = 0
fcntl(4, F_GETFD) = 0
fcntl(4, F_SETFD, FD_CLOEXEC) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
fstat(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
write(4, "17:00:54.715547 - Warning: A plu"..., 110) = 110
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:54.715905 - TTCN-3 Paralle"..., 235) = 235
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:54.716072 - TTCN Logger v2"..., 289) = 289
socket(AF_UNIX, SOCK_STREAM, 0) = 6
connect(6, {sa_family=AF_UNIX, sun_path="/tmp/ttcn3-mctr-33961"}, 110) = 0
fcntl(6, F_GETFD) = 0
fcntl(6, F_SETFD, FD_CLOEXEC) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 6, {EPOLLIN, {u32=6, u64=6}}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:54.716422 - Connected to M"..., 35) = 35
sendto(6, "\2\4\26", 3, 0, NULL, 0) = 3
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:54.716616 - Initializing v"..., 152) = 152
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:54.717908 - Port DIAMETER "..., 45) = 45
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:54.718042 - Port DIAMETER_"..., 50) = 50
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:54.718148 - Port DIAMETER_"..., 52) = 52
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:54.718230 - Component type"..., 81) = 81
epoll_wait(3, [{EPOLLIN, {u32=6, u64=6}}], 64, -1) = 1
recvfrom(6, "\200T\r\10DIAMETER\4\32TC_createSession_p"..., 1000, 0, NULL, NULL) = 86
socket(AF_UNIX, SOCK_STREAM, 0) = 7
connect(7, {sa_family=AF_UNIX, sun_path="/tmp/ttcn3-portconn-c554be55"}, 110) = 0 //<--- "/tmp/ttcn3-portconn-c554be55", fd=7
fcntl(7, F_GETFD) = 0
fcntl(7, F_SETFD, FD_CLOEXEC) = 0
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(7, F_SETFL, O_RDWR|O_NONBLOCK) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:54.718682 - Port DIAMETER "..., 141) = 141
epoll_wait(3, [{EPOLLIN|EPOLLHUP, {u32=7, u64=7}}], 64, -1) = 1 //<----- POLLED FOR READ!
recvfrom(7, "", 1000, 0, NULL, NULL) = 0 //<-------- recv() returns 0!!!!!! peer closed!
sendto(6, "\33\20\10DIAMETER\4\17DIAMETER_CLIENT", 28, 0, NULL, 0) = 28
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:55.894883 - Connection of "..., 132) = 132
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:55.895260 - Port DIAMETER "..., 101) = 101
epoll_ctl(3, EPOLL_CTL_DEL, 7, 0x7ffc6c88ddec) = 0
close(7) = 0
epoll_wait(3, [{EPOLLIN, {u32=6, u64=6}}], 64, -1) = 1
recvfrom(6, "\1\26", 1000, 0, NULL, NULL) = 2
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:55.904796 - Kill was reque"..., 68) = 68
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:55.905412 - Terminating co"..., 77) = 77
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:55.906088 - Port DIAMETER "..., 45) = 45
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:55.906693 - Port DIAMETER_"..., 50) = 50
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:55.907299 - Port DIAMETER_"..., 52) = 52
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:55.907881 - Component type"..., 122) = 122
sendto(6, "\3\27\0\0", 4, 0, NULL, 0) = 4
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:55.908685 - Final verdict "..., 45) = 45
sendto(6, "&\1\0\206\237\370\346G\267\273\r0\32Final verdict of PT"..., 39, 0, NULL, 0) = 39
shutdown(6, SHUT_WR) = 0
recvfrom(6, "", 1024, 0, NULL, NULL) = 0
close(6) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 6, 0x7ffc6c88dafc) = -1 EBADF (Bad file descriptor)
fcntl(6, F_GETFD) = -1 EBADF (Bad file descriptor)
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:55.958218 - Disconnected f"..., 40) = 40
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:55.958534 - TTCN-3 Paralle"..., 59) = 59
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f913d896d60}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f913d896d60}, NULL, 8) = 0
close(4) = 0
munmap(0x7f9141aae000, 32880) = 0
close(3) = 0
exit_group(0) = ?
+++ exited with 0 +++
So, the peer component is closing its socket, why? probably because it's crashing and the kernel is closing resources. Since the unix socket path is know from the above ("/tmp/ttcn3-portconn-c554be55"), we can easily identify the peer over all the other strace logs. Here it is (I attach the full file toostrace.log.46.gz):
// This shows it's the component running DIAMETER_Emulation.main():
write(4, "17:00:53.675978 DIAMETER_Emulati"..., 133) = 133
sendto(6, "\200e\1\0\206\237\370\346E\251\241\n-\200XCreated Expect[0]"..., 103, 0, NULL, 0) = 103
[...]
// unix socket to which the component exiting with "unexepctedly closed by peer" is connecting:
socket(AF_UNIX, SOCK_STREAM, 0) = 11
getpid() = 46
bind(11, {sa_family=AF_UNIX, sun_path="/tmp/ttcn3-portconn-c554be55"}, 110) = 0
listen(11, 0) = 0
fcntl(11, F_GETFD) = 0
fcntl(11, F_SETFD, FD_CLOEXEC) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 11, {EPOLLIN, {u32=11, u64=11}}) = 0
sendto(6, "9\f\17DIAMETER_CLIENT\26\10DIAMETER\2\34/t"..., 58, 0, NULL, 0) = 58
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:54.717919 DIAMETER_Emulati"..., 183) = 183
[...]
//peer connects to us:
epoll_wait(3, [{EPOLLIN, {u32=11, u64=11}}], 64, -1) = 1
accept(11, NULL, NULL) = 22
fcntl(22, F_GETFD) = 0
fcntl(22, F_SETFD, FD_CLOEXEC) = 0
fcntl(22, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(22, F_SETFL, O_RDWR|O_NONBLOCK) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 11, 0x7ffc6c88cb3c) = 0
getsockname(11, {sa_family=AF_UNIX, sun_path="/tmp/ttcn3-portconn-c554be55"}, [110->31]) = 0
unlink("/tmp/ttcn3-portconn-c554be55") = 0
close(11) = 0
epoll_ctl(3, EPOLL_CTL_ADD, 22, {EPOLLIN, {u32=22, u64=22}}) = 0
sendto(6, "\33\r\17DIAMETER_CLIENT\26\10DIAMETER", 28, 0, NULL, 0) = 28
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:54.727273 DIAMETER_Emulati"..., 139) = 139
epoll_wait(3, [{EPOLLIN, {u32=6, u64=6}}], 64, -1) = 1
recvfrom(6, ":\f\rDIAMETER_PROC\26\32TC_createSessi"..., 1000, 0, NULL, NULL) = 59
socket(AF_UNIX, SOCK_STREAM, 0) = 11
getpid() = 46
bind(11, {sa_family=AF_UNIX, sun_path="/tmp/ttcn3-portconn-df4eb11c"}, 110) = 0 //<---- HERE ANOTHER SOCKET WITH DIFFERENT PATH IS CREATED
listen(11, 0) = 0
fcntl(11, F_GETFD) = 0
fcntl(11, F_SETFD, FD_CLOEXEC) = 0
mmap(NULL, 17179869184, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory) //<----- HERE MEM ALLOC OF 16GB!!!!!!! IT FAILS!!!!
brk(0x563e6dfce000) = 0x563a6dfbc000
mmap(NULL, 17180004352, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
futex(0x7f9141b5f1e0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(4, "17:00:54.732595 - Fatal error. A"..., 51) = 51 //<----- FAILURE TO ALLOC TRIGGERS TC_Error, ABORTS PROGRAM
sendto(6, ",\1\0\206\237\370\346F\254\3333\6 Fatal error. Aborti"..., 45, 0, NULL, 0) = 45
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f913d896d60}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f913d896d60}, NULL, 8) = 0
close(4) = 0
munmap(0x7f9141aae000, 32880) = 0
close(3) = 0
sendto(6, "\30\20\rDIAMETER_UNIT\1\7Gx_UNIT", 25, 0, NULL, 0) = 25
epoll_ctl(-1, EPOLL_CTL_DEL, 7, 0x7ffc6c88d6dc) = -1 EBADF (Bad file descriptor)
fcntl(7, F_GETFD) = 0x1 (flags FD_CLOEXEC)
write(2, "terminate called after throwing "..., 48) = 48
write(2, "TC_Error", 8) = 8
write(2, "'\n", 2) = 2
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
getpid() = 46
gettid() = 46
tgkill(46, 46, SIGABRT) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=46, si_uid=0} ---
So allegadely according to logs and strace output, this mem allocation of 16GB (interestingly happens to be the memory on my system) which crashes the program happens when in PORT::connect_listen_unix_stream(), AFTER set_close_on_exec() is called (see strace) and before anything else is logged, which potentially means during calls to any of PORT::add_connection(), TTCN_Communication::send_connect_listen_ack_unix_stream() or TTCN_Logger::log_port_misc().
What is the current bug behavior?
A TTCN3 component crashes trying to allocate all the memory in my system and obviously fails to do so, crashing.
What is the expected correct behavior?
The TTCN3 component should not be attempting that big allocation of memory.
Relevant logs and/or screenshots
Attached as explained above.
Possible fixes
Couldn't find a fix for this.
Titan version
titan.core version: 8.3.0
Platform details (OS type and version)
linux: Linux 6.1.12-arch1-1 #1 SMP PREEMPT_DYNAMIC Tue, 14 Feb 2023 22:08:08 +0000 x86_64 GNU/Linux
/cc @aknappqwt @mmagyari