Bug #430
closed
Component gets stuck after a while: events (disconnect, connect, buttons and axes) are no longer detected
Description
Hi Anthony,
I finally managed to find out why the joystick component was getting stuck for me. In this issue, I try to explain the cause, how I debugged it, and the possible workarounds I found to solve it.
Introduction
Problem description
In my case, the component "sometimes" gets stuck and is unable to detect any event coming from the connected joystick devices. The only way to get it to work again is to kill the process and re-run it.
Once the component is stuck, I am sure that the problem is not related to the connected joystick device but rather to the component, since I can see the axes and buttons values change in `jstest-gtk` but not on the component output port.
Cause of the problem
It seems that this problem occurs when the event queue (internal to the SDL library) gets full.
In such a case, no more events are added to the queue (as it is full), and therefore the component gets stuck and unresponsive to newly arriving events (such as axis movements and button presses).
How I debugged it
Here (branch `debug-sdl` of a fork of the original openrobots repository), you can find the modified version of the component that I used to debug the problem and to test several solutions to this issue.
In this modified version, I introduced several attributes to set some flags at runtime, and an intermediate state (`check`) in the `publish` task. The state `check` sits between the states `start` and `poll`; instead of looping back on itself, `poll` returns to `check`. In this way, I could set the value of the additional flags with attributes and use them within the async codel `joystick_wait_event`. The purpose of these flags is either to trigger some debugging messages or to enable some countermeasures aimed at solving (or mitigating) the issue.
Lastly, I added a print to the SDL library (statically compiled and dynamically linked), since I tested several versions of the SDL library (which I installed from source, as explained later on).
Original component workflow
The component waits for an event to arrive for a given amount of time. If events arrive, it processes all the events in the queue. Otherwise, it goes back to waiting.
See codel `joystick_wait_event` at line 121 in joystick_publish_codels.c.
There, the function `waitEventTimeout` is used to wait for an event: it returns true if one arrives, false otherwise.
Findings
To understand what the issue was and what was going on, I added a service `set_debug`, which enables, on demand and at runtime, some debug messages that allowed me to inspect the component behavior while it was stuck.
Hint: call `set_debug` with input argument `1` to enable debugging messages, and with `0` to disable them.
Among the messages, I print the latest error message of the SDL library by using the function `SDL_GetError`, which returns the string of the latest error. Moreover, I clear all errors after getting the last one, so that I am sure the printed error is really the latest one.
From the debugging messages it is clear that the event queue is full (as reported by the latest error message) and that new events are not registered.
This explains why, when the component gets "stuck", it can no longer detect axis movements and button presses, which in turn makes `waitEventTimeout` always return false (the timeout always expires).
Of course, once in that state, cleaning the event queue unsticks the component, which then processes new events if any arrive.
To do that, it is possible to call `SDL_FlushEvents(<min>, <max>)`, where `<min>` and `<max>` can be set to delete all the events that have an id in the range `[min, max]`.
Therefore, I added a service to do that, namely `set_flush_events`. Once the component is stuck and `set_flush_events` is called, the component gets unstuck and returns to the expected behavior (processing new events).
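To make the range semantics concrete, here is a minimal sketch of a flush over a type range. It uses a plain array and hypothetical names (`toy_event`, `toy_flush_events`); it only mirrors the behavior described above and is not SDL's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy event record (hypothetical); only the type matters in this sketch. */
typedef struct {
    uint32_t type;
} toy_event;

/*
 * Remove every event whose type lies in [min_type, max_type] (inclusive)
 * and compact the remaining events to the front of the array.
 * Returns the number of events left in the queue.
 */
size_t toy_flush_events(toy_event *queue, size_t count,
                        uint32_t min_type, uint32_t max_type)
{
    size_t kept = 0;
    for (size_t i = 0; i < count; i++) {
        if (queue[i].type < min_type || queue[i].type > max_type) {
            queue[kept++] = queue[i];   /* outside the range: keep it */
        }
    }
    return kept;
}
```

With `min_type == max_type == 0x7F00`, only events carrying that code are dropped and everything else stays queued. In SDL itself, passing `SDL_FIRSTEVENT` and `SDL_LASTEVENT` as the two bounds should cover every event type.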
At this point, a question arises: why do the events filling the queue not make the function `waitEventTimeout` return true?
As the current implementation does not parse the events continuously (but only when `waitEventTimeout` returns true), I added a way to enable the polling of the events in the opposite case (i.e. when `waitEventTimeout` returns false because the timeout expired).
To enable that, I introduced the attribute `set_poll_events`. If `1` is passed as argument, the component polls the events even when the timeout expires (which occurs whenever the component is "stuck").
Once the component is stuck and event polling is enabled, the code `0x7F00` is printed in the terminal. By looking for this code in the file `SDL_events.h`, I found that it corresponds to the event `SDL_POLLSENTINEL`.
When the events are polled from the queue even in the latter case (`waitEventTimeout` returning false), the event queue does not fill up and the component is prevented from getting stuck.
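The mechanism described above can be illustrated with a toy model. Everything here is hypothetical (the names, the tiny capacity, and in particular the assumption that each timed-out wait leaves one sentinel behind); it is a sketch of the observed behavior, not of SDL's internals.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_CAPACITY 8        /* deliberately tiny, hypothetical value */
#define TOY_SENTINEL 0x7F00u  /* the code printed when the component is stuck */
#define TOY_AXIS     0x600u   /* stand-in for a joystick axis event */

typedef struct {
    unsigned type[TOY_CAPACITY];
    size_t count;
} toy_queue;

/* A push fails (the event is lost) once the queue is full. */
bool toy_push(toy_queue *q, unsigned type)
{
    if (q->count == TOY_CAPACITY)
        return false;
    q->type[q->count++] = type;
    return true;
}

/* Model assumption: sentinels never wake the waiter, and a timed-out
 * wait leaves one more sentinel in the queue. */
bool toy_wait_timeout(toy_queue *q)
{
    for (size_t i = 0; i < q->count; i++)
        if (q->type[i] != TOY_SENTINEL)
            return true;
    (void)toy_push(q, TOY_SENTINEL);
    return false;
}

/* Empty the queue as a poll loop would; returns the number of real events. */
size_t toy_drain(toy_queue *q)
{
    size_t real = 0;
    for (size_t i = 0; i < q->count; i++)
        if (q->type[i] != TOY_SENTINEL)
            real++;
    q->count = 0;
    return real;
}

/* Idle for `cycles` timed-out waits, then try to deliver one axis event.
 * Returns true if the queue accepted the event. */
bool toy_idle_then_axis(size_t cycles, bool drain_on_timeout)
{
    toy_queue q = { .count = 0 };
    for (size_t i = 0; i < cycles; i++) {
        if (!toy_wait_timeout(&q) && drain_on_timeout)
            toy_drain(&q);
    }
    return toy_push(&q, TOY_AXIS);
}
```

In this model, `toy_idle_then_axis(100, false)` loses the axis event because the queue has filled with sentinels, whereas `toy_idle_then_axis(100, true)` accepts it, which matches the effect of enabling `set_poll_events`.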
What is SDL_POLLSENTINEL used for?
This event is used internally by the library, and it seems to have been introduced to deal with high-frequency devices (mice) that produce many events, as explained here and here.
Therefore, I think it is not relevant for our application.
Why are other people not facing the same issue?
On my PC, I have SDL2 `2.0.20`, which is the version installed by default by `apt` on Ubuntu 22.04.
It seems that the introduction of the internal event `SDL_POLLSENTINEL` brought some issues into the SDL library, which have been solved in the newest releases.
Conversely, in older releases this problem does not appear, as this internal event was not yet implemented.
Therefore, I started looking at some bugfix commits related to this event type and to the functions `waitEvent` and `waitEventTimeout`.
In the component, the function `waitEventTimeout` is called with the 1st argument set to `NULL`.
If you look at the implementation of `waitEvent` in `SDL_events.c`, both at release-2.30.8 (the latest) and release-2.0.20 (the one I have installed), you can see that internally it calls `waitEventTimeout`, passing its own 1st parameter as 1st argument and `0` as second argument.
I found the following commits:
- Commit 8432026, which is included in `release-2.0.22`.
- I found this issue on GitHub, which discusses another bug with `SDL_WaitEvent(NULL)` and points to another bug fix, namely the one introduced by commit 00b87f1. This commit is included in `release-2.28.0`.
I have tested the component with several SDL versions, which I installed separately in the path `<sdllib_install_dir> = $SOFTWARE/sdl` and linked manually with the component by running `configure` with the following options:
$ cd robotpkg/hardware/joystick-genom3
$ make checkout .
$ cd work.<hostname>/joystick-genom3-<release>
$ ./bootstrap
$ mkdir build && cd build
$ ../configure --prefix=<install_dir> --with-templates=pocolibs/server,pocolibs/client/c CFLAGS='-g' PKG_CONFIG_LIBDIR=<sdllib_install_dir>/lib/pkgconfig
# then run make clean prior to any new run of configure.
Here are some results:
| SDL release | Installed how | Result | Considerations |
|---|---|---|---|
| 2.0.10 | apt | working | One user at UTwente uses it and has encountered no problems. |
| 2.0.12 | -- | -- | Not tested. |
| 2.0.14 | -- | -- | Not tested. |
| 2.0.16 | -- | -- | Not tested. |
| 2.0.18 | -- | -- | Not tested. |
| 2.0.18 | source | working | This release includes commit 8bf32e1, which introduces the event type `SDL_POLLSENTINEL`. |
| 2.0.20 | source & apt | not working | In this release, the file src/events/SDL_events.c was modified in relation to those functions and that event type, which may explain why it does not work, unlike the previous release. |
| 2.0.22 | manual | not working | This release includes commit 8432026, mentioned above. |
| 2.24.0 | -- | -- | The version numbering scheme changes starting with this release. Not tested. |
| 2.26.0 | source | not working | -- |
| 2.28.0 | source | working | This release includes commit 00b87f1, mentioned above, which seems effective in solving the problems related to `waitEventTimeout` and `SDL_POLLSENTINEL`. |
| 2.30.0 | -- | -- | Not tested. |
One way to inspect your installed version is to run the executable `sdl2-config` in a terminal with the option `--version`.
This could explain why I did not encounter this issue at LAAS, where I was using Ubuntu 18 and 20, and why some other people (e.g. at UTwente) have not encountered it either. Here at IRISA, we mostly use Ubuntu 22.04 and install `libsdl2` with `apt`.
How I installed a given version of SDL from source
Reference: https://wiki.libsdl.org/SDL2/Installation#linuxunix
$ cd $SOFTWARE
$ mkdir sdl src
$ cd src
$ git clone https://github.com/libsdl-org/SDL.git
$ cd SDL
$ git checkout release-<version>
$ mkdir build && cd build
$ ../configure --prefix=$SOFTWARE/sdl
$ make
$ make install
Only for release `2.0.18`, I had to add some more options to `configure` in order to configure and build it without errors, as follows:
../configure --prefix=$SOFTWARE/sdl --enable-video-wayland=no --enable-video-wayland-qt-touch=no --enable-wayland-shared=no
as explained here.
How to replicate the issue
Here I list the steps that I think are necessary to replicate the issue:
- Install, and link to the component, a version of `SDL` that can be affected by this issue (e.g. `2.0.20`).
- When configuring the component, link this version of `SDL`.
- Run `h2 init` and `joystick-pocolibs`.
- Wait a while so that the event queue can fill up; it usually takes a few minutes (1 to 3).
- Try either to disconnect (connect) a (new) device, or to read the axes/buttons values from the output port related to a connected device.
Side effect when the component is stuck (queue is full)
Consider the following actions, performed in this order:
- Use the latest release (and not the latest commit) of the component.
- Get the component stuck (event queue is full).
- Disconnect and reconnect a connected joystick device.
  The OS registers the event (`DEVICEREMOVED`) and may increment some internal device id, while the component does not record this event, as it is stuck and the event queue is full.
- Clear the event queue (i.e. call `SDL_FlushEvents(min, max)`).
- Disconnect the same device.
  This time both the OS and the component register the event.
If these actions are performed in the sequence above, the component crashes with a segmentation fault.
The new event has an id attribute (`ev.jdevice.which`) that is no longer found in the device list. The component scans all the devices without finding a match, since none of the previously stored device ids matches the new one (the component was stuck and could not register the new id). Therefore, the `dev` pointer ends up outside the array boundaries.
The segmentation fault occurs in the call to the function `memmove` at line 323.
NB: This is the code version of the latest release and not of the latest commit ;)
Once the situation described above occurs, the variables `i` and `devices->_length` both equal 2 (because I tested with 2 controllers connected), thus the 3rd parameter of `memmove` is negative.
Therefore, we are passing a negative size to `memmove`, which is then converted internally to the unsigned type `size_t` (see here). This (probably) yields a very large size, which in turn leads to touching memory areas we are not supposed to access.
This issue does not occur if the latest commit of the component is used (and the same steps are repeated), since it is prevented by the fix introduced in commit 84dde240.
Therefore, the latter commit is important.
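The failure mode can be reproduced in isolation. The sketch below uses hypothetical names (`toy_device`, `toy_remove_device`, `toy_as_size`) and is not the component's code nor the actual fix from the commit mentioned above; it only shows how a negative count turns into a huge `size_t`, and one way to guard the removal when no device matches.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical device record; only the id matters here. */
typedef struct {
    int id;
} toy_device;

/* What a size_t parameter receives when it is handed a negative count. */
size_t toy_as_size(long n)
{
    return (size_t)n;   /* -1 wraps around to SIZE_MAX */
}

/*
 * Remove the device with the given id from devs[0..length).
 * Returns the new length; the array is left untouched when no id matches.
 */
size_t toy_remove_device(toy_device *devs, size_t length, int id)
{
    size_t i = 0;
    while (i < length && devs[i].id != id)
        i++;

    if (i == length)    /* no match: without this guard, length - i - 1 underflows */
        return length;

    /* Shift the tail one slot to the left over the removed entry. */
    memmove(&devs[i], &devs[i + 1], (length - i - 1) * sizeof devs[0]);
    return length - 1;
}
```

With two devices stored and an unknown id, the scan ends with `i == length == 2`; an unguarded `length - i - 1` would then be the `-1` case shown by `toy_as_size`.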
Just to give you some more context, I first ran my tests using the latest release (since I had installed it with `robotpkg`). Then, I moved to the latest commit, repeated the same debugging sequence, and applied the same changes to the code.
Indeed, the branch `fix-stuck` of the fork repository branches out from the latest commit of the `master` branch.
Fixing the issue
Failed attempt to solve the issue
I tried to manually update the device state and push the events into the queue by calling the function `SDL_PumpEvents`.
This is why I introduced the attribute `set_pump_events`, which can be set at runtime.
Conclusion: it does not help to unstick the component.
How to successfully solve this problem
Several solutions work.
- Clear the event queue by calling the function `SDL_FlushEvents`.
  - Optionally, clear only the events that we are not interested in by playing with the values of `<min>` and `<max>`.
  - Introduced attribute `set_flush_events` to do that at runtime.
  - Conclusion: it works to unstick the component.
    - It may be a little tricky to detect when the component gets stuck. One possibility could be to monitor the error message returned by `SDL_GetError`, but it is not good practice, as explained here.
- Set a custom filter to make the library add to the event queue only the events we are interested in. This can be done by using the function `SDL_SetEventFilter`: https://wiki.libsdl.org/SDL2/SDL_SetEventFilter.
  - NB: once called, `SDL_SetEventFilter` internally calls `SDL_FlushEvents`, which cleans the event queue. See line 1229 of `SDL_events.c` (i.e. the definition of `SDL_SetEventFilter`).
  - Introduced attribute `set_custom_filter` to set a custom event-type filter at runtime.
  - Conclusion: it works. The component does not seem to get stuck.
    - It is not very practical to manually include all the events we are interested in within the filter function definition.
      - Error prone.
      - Tedious.
    - `SDL_POLLSENTINEL` will be excluded as it is the source of the problem, while it may be helpful for future developments of this component (or high-frequency devices?)
- Poll events continuously, i.e. call the function `PollEvent` without waiting for the events to arrive.
  - This will also poll the event `SDL_POLLSENTINEL`.
  - Conclusion: this requires extra CPU, but it prevents the event queue from filling up.
- It is possible to ignore events of a given type (e.g. `SDL_POLLSENTINEL`). This can be done by using `SDL_EventState` with arguments `SDL_POLLSENTINEL` and `SDL_DISABLE`.
  - Introduced attribute `set_state_pollsentinel` to enable (1) or disable (0) the state for event type `SDL_POLLSENTINEL`.
  - Conclusion: ignoring this event type prevents the queue from filling up and the component from getting stuck.
    - This is similar to setting a custom filter, but it does not require manually listing the interesting event types within the filter function definition each time.
    - To still use the event type `SDL_POLLSENTINEL`, it would be possible to ignore this event type while no events arrive, and then re-enable it as soon as we detect a new event.
      - This could still allow using high-frequency devices.
My solution
I decided to disable the event type `SDL_POLLSENTINEL` for the following reasons:
- It is related to high-frequency devices (mice), and thus out of scope for our application (reading inputs from joystick devices).
- The fix is not release-dependent (an alternative could have been a preprocessor macro condition that introduces the fix only for a given set of releases).
- The fix is only one line.
- The fix introduces one additional function call, only at component startup.
  - Compared to enabling/disabling at runtime (which may call the function `SDL_EventState` many times), it is a more performant solution.
- According to my tests, the CPU usage of the component is lowered.
  - With version 2.0.20, the component was using ~12% of CPU, while with this fix (ignoring the event type `SDL_POLLSENTINEL`) it uses only ~2-3%.
  - These results may change across releases, since some bug fixes have been introduced in newer versions to address CPU usage.
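For comparison, the release-dependent alternative mentioned in the list above would need a version test along these lines. This is only a sketch with hypothetical names; the bounds come from the results table (2.0.20 first failing, 2.28.0 working again), and it assumes that the untested releases in between are affected as well.

```c
#include <stdbool.h>

/* Pack a version triple into one comparable number (hypothetical helper). */
static unsigned long toy_version_num(unsigned vmajor, unsigned vminor,
                                     unsigned vpatch)
{
    return vmajor * 1000000UL + vminor * 1000UL + vpatch;
}

/*
 * True when the given SDL release falls in the range observed as
 * "not working": from 2.0.20 (included) up to 2.28.0 (excluded).
 * Assumption: untested releases inside the range behave like their
 * tested neighbours.
 */
bool toy_sdl_needs_workaround(unsigned vmajor, unsigned vminor,
                              unsigned vpatch)
{
    unsigned long v = toy_version_num(vmajor, vminor, vpatch);
    return v >= toy_version_num(2, 0, 20) && v < toy_version_num(2, 28, 0);
}
```

The version triple could come from the output of `sdl2-config --version` or from the version macros in the SDL headers. Disabling the event type unconditionally avoids having to maintain such a range at all, which is one reason I preferred it.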
I pushed my solution to the branch `main` of the fork repository mentioned above.