Bug #430

Component gets stuck after a while: events (disconnect, connect, buttons and axes) are no more detected

Added by Gianluca Corsini 3 months ago. Updated 3 months ago.

Status: Closed
Priority: Normal

Description

Hi Anthony,

I finally managed to find out why the joystick component was getting stuck for me. In this issue, I try to explain the cause, how I debugged it, and the possible workarounds I found to solve it.

Introduction

Problem description

In my case, the component seems to get "sometimes" stuck and unable to detect any event coming from the connected joystick devices. The only way to get it to work again is to kill the process and re-run it.
Once the component is stuck, I am sure that the problem is not related to the connected joystick device but rather to the component, since I can see in jstest-gtk the axes and buttons values change but not in the component output port.

Cause of the problem

It seems that this problem occurs when the event queue (internal to the SDL library) gets full.
In such a case, no more events are added to the queue (as it is full), and therefore the component gets stuck and unresponsive to new events (such as axis movements and button presses).

How I debugged it

Here (branch debug-sdl of a fork of the original openrobots repository), you can find the modified version of the component that I used to debug it and to test several solutions to this issue.
In this modified version, I introduced several attributes to set some flags at runtime, and an intermediate state (check) in the publish task. The state check sits between the states start and poll, where the latter, instead of looping back on itself, returns to check. In this way, I could set the value of the additional flags with attributes and use them within the async codel state joystick_wait_event. The purpose of these flags is either to trigger some debugging messages or to enable some countermeasures aimed at solving (or mitigating) the issue.
Lastly, I added a print to the SDL library (statically compiled and dynamically linked), since I tested several versions of the SDL library (that I installed from source as explained later on).

Original component workflow

In any case, the component waits for an event to arrive for a given amount of time. If events arrive, it processes all the events in the queue. Otherwise, it goes back to waiting.
See codel joystick_wait_event at line 121 in joystick_publish_codels.c.
There, the function waitEventTimeout is used to wait for an event to arrive, as it returns true in such a case, false otherwise.

Findings

To understand what the issue was and what was going on, I added a service set_debug, which enables, on demand and at runtime, some debug messages that let me inspect the component behavior while it is stuck.

Hint: call set_debug with input argument 1 to enable debugging messages, with 0 to disable.

Among the messages, I print the latest error message of the SDL library by using the function SDL_GetError, which returns the string of the latest error. Moreover, I make sure to clear all errors after getting the last one so that I am sure that the printed error is really the latest one.

From the debugging messages, it is clear that the event queue is full (as reported by the latest error message) and that no new event is registered.

This explains why, when the component gets "stuck", it can no longer detect axis movements and button presses, which in turn makes waitEventTimeout always return false (the timeout always expires).

Of course, once in that state, clearing the event queue unsticks the component, which then processes new events if some are arriving.
To do that, it is possible to call SDL_FlushEvents(<min>, <max>), where <min> and <max> select the events to delete: all those whose type lies in the range [min, max].
Therefore, I added a service to do that, namely set_flush_events. Once the component is stuck and this attribute is called, the component gets unstuck and returns to the expected behavior (processing new events).
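The stuck-then-flushed behavior can be pictured with a toy bounded queue. This is only a sketch of the failure mode, not SDL's implementation: the names toy_queue, toy_push and toy_flush and the capacity of 4 are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_CAPACITY 4  /* hypothetical; chosen small to keep the model readable */

struct toy_queue {
    unsigned events[TOY_CAPACITY]; /* queued event type codes */
    size_t count;                  /* number of events currently queued */
};

/* Append an event; returns false and drops the event when the queue is full. */
bool toy_push(struct toy_queue *q, unsigned type)
{
    if (q->count == TOY_CAPACITY)
        return false;
    q->events[q->count++] = type;
    return true;
}

/* Discard everything that is queued, as a full-range flush would. */
void toy_flush(struct toy_queue *q)
{
    q->count = 0;
}
```

Once four undrained events sit in the queue, every further toy_push is rejected until toy_flush is called, which mirrors what happens when set_flush_events is called on the stuck component.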

At this point, why do the events filling the queue not make the function waitEventTimeout return true?

As the current implementation does not parse the events continuously (but only when waitEventTimeout returns true), I added a way to enable the polling of the events in the opposite case (i.e., when waitEventTimeout returns false because the timeout expired).
To enable that, I introduced the attribute set_poll_events. If 1 is passed as argument, then the component polls the events even when the timeout expires (which occurs whenever the component is "stuck").

Once the component is stuck, and the event poll is enabled, then it is possible to see that the code 0x7F00 is printed in the terminal. By looking for this code in the file SDL_events.h, I found that it corresponds to the event SDL_POLLSENTINEL.

Since the events are polled from the queue even in the latter case (waitEventTimeout is returning false), the event queue does not get full and the component is prevented from getting stuck.

What is SDL_POLLSENTINEL used for?

This event is used internally by the library, and it seems to have been introduced to deal with high-frequency devices (mice) that produce many events, as explained here and here.

Therefore, I think it is not relevant for our application.

Why other people are not facing the same issue?

On my PC, I have SDL2 2.0.20, which is the version installed by default by apt on Ubuntu 22.04.

It seems that the introduction of this internal event SDL_POLLSENTINEL brought in some issues within the SDL library, which have been solved in the newest releases.
Conversely, for older releases, this problem does not appear as this internal event was not yet implemented.

Therefore, I have started looking at some commits of bugfixes related to this event type and the functions waitEvent and waitEventTimeout.

In the component, the function waitEventTimeout is called with the 1st argument set to NULL.
If you look at the implementation of waitEvent in SDL_events.c, both at release-2.30.8 (the latest) and release-2.0.20 (the one I have installed), you can see that it internally calls waitEventTimeout, passing along its own 1st parameter as the 1st argument and 0 as the 2nd argument.

I found the following commits:

I have tested the component with several SDL versions which I have installed separately in a certain path <sdllib_install_dir> = $SOFTWARE/sdl and linked manually with the component by running configure with the following options:

$ cd robotpkg/hardware/joystick-genom3
$ make checkout .
$ cd work.<hostname>/joystick-genom3-<release>
$ ./bootstrap
$ mkdir build && cd build
$ ../configure --prefix=<install_dir> --with-templates=pocolibs/server,pocolibs/client/c CFLAGS='-g' PKG_CONFIG_LIBDIR=<sdllib_install_dir>/lib/pkgconfig
# Then run make clean prior to any new run of configure.

Here are some results:

| SDL release | Installed how | Result | Considerations |
| 2.0.10 | apt | working | One user at UTwente uses it and has encountered no problems. |
| 2.0.12 | -- | -- | Not tested. |
| 2.0.14 | -- | -- | Not tested. |
| 2.0.16 | -- | -- | Not tested. |
| 2.0.18 | -- | -- | Not tested. |
| 2.0.18 | source | working | This release includes commit#8bf32e1, which introduces the event type SDL_POLLSENTINEL. |
| 2.0.20 | source & apt | not working | In this release, the file src/events/SDL_events.c was modified in relation to those functions and that event type, which may explain why it does not work while the previous release does. |
| 2.0.22 | manual | not working | This release contains the commit 8432026, which I mentioned above. |
| 2.24.0 | -- | -- | The version numbering scheme changes starting from this release. Not tested. |
| 2.26.0 | source | not working | -- |
| 2.28.0 | source | working | This release includes the commit 00b87f1, which I mentioned above and which seems effective in solving the problems related to waitEventTimeout and SDL_POLLSENTINEL. |
| 2.30.0 | -- | -- | Not tested. |

One way to check your installed version is to run sdl2-config --version in a terminal.

This could explain why I did not encounter this issue at LAAS, where I was using Ubuntu 18 and 20, and why some other people (e.g. at UTwente) have not encountered it either. Here at IRISA, we mostly use Ubuntu 22.04 and install libsdl2 with apt.
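The test results above can be condensed into a small version check. This is a sketch under assumptions: the function name sdl_needs_sentinel_workaround is hypothetical, and the untested releases (e.g. 2.24.0) are assumed to behave like their tested neighbours, which the table does not confirm.

```c
#include <stdbool.h>

/* Returns true when the release falls in the range where the stuck queue
 * was observed: from 2.0.20 up to, but excluding, 2.28.0.
 * Assumption: untested releases inside that range are affected as well. */
bool sdl_needs_sentinel_workaround(int major, int minor, int patch)
{
    if (major != 2)
        return false;
    /* Old numbering scheme (2.0.x): not working from patch level 20 on. */
    if (minor == 0)
        return patch >= 20;
    /* New numbering scheme (2.24.0 and later): working again as of 2.28.0. */
    return minor < 28;
}
```

At runtime, the three numbers could be read with SDL_GetVersion, which fills the major, minor and patch fields of an SDL_version structure with the version of the library that is actually linked.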

How I installed a given version of SDL from source

Reference: https://wiki.libsdl.org/SDL2/Installation#linuxunix

$ cd $SOFTWARE
$ mkdir sdl src
$ cd src
$ git clone https://github.com/libsdl-org/SDL.git
$ cd SDL
$ git checkout release-<version>
$ mkdir build && cd build
$ ../configure --prefix=$SOFTWARE/sdl
$ make
$ make install

Only for release 2.0.18 I had to add some more options to the configure in order to be able to configure and build it without errors, as follows:

../configure --prefix=$SOFTWARE/sdl --enable-video-wayland=no --enable-video-wayland-qt-touch=no --enable-wayland-shared=no

as explained here.

How to replicate the issue

Here I list the steps that I think are necessary to replicate the issue:

  • Install a version of SDL that is affected by this issue (e.g. 2.0.20).
  • When configuring the component, link it against this version of SDL.
  • Run h2 init and joystick-pocolibs.
  • Wait for the event queue to fill up; this usually takes a few minutes (1 to 3).
  • Try either to disconnect (connect) a (new) device, or to read the axes/buttons values from the output port related to a connected device.

Side effect problem when component is stuck (queue is full)

Consider the following actions performed in the following sequence:

- Use the latest release (and not the latest commit) of the component.
- Have the component stuck (event queue is full).
- Disconnect and reconnect a connected joystick device.
  The OS registers the event (DEVICEREMOVED) and may increment some internal device id, while the component does not record this event, as it is stuck and the event queue is full.
- Clear the event queue (i.e. call SDL_FlushEvents(min, max)).
- Disconnect the same device.
  This time, both the OS and the component register this event.

If these actions are performed in the sequence above, then the component crashes with a segmentation fault.

The new event has an id attribute (ev.jdevice.which) that is no longer found in the device list. The lookup scans all the devices without finding a match, since none of the previously stored device ids matches the new one (the component was stuck and could not register the new id). Therefore, the dev pointer ends up outside the array boundaries.

The segmentation fault occurs in the call to the function memmove at line 323.

NB: This is the code version in the latest release and not of the latest commit ;)

Once the situation described above occurs, the variables i and devices->_length equal 2 (because I tested with 2 controllers connected), thus the 3rd parameter of memmove is negative.
Therefore, we are passing a negative size to memmove, which is then implicitly converted to an unsigned type, size_t (see here), thus (probably) yielding a very large size, which in turn leads to accessing memory areas we are not supposed to access.
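A generic way to guard against this kind of crash is to check the index before shifting the tail of the array. The sketch below is not the change made in commit#84dde240 and does not use the component's data structures; the function remove_at and its int array are hypothetical stand-ins.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Remove the element at index i from an array holding *n ints, shifting the
 * tail left by one. Returns false and leaves the array untouched when i is
 * out of range, i.e. when the lookup that produced i found no match. */
bool remove_at(int *items, size_t *n, size_t i)
{
    if (i >= *n)
        return false;  /* without this check, *n - i - 1 would wrap around */
    memmove(&items[i], &items[i + 1], (*n - i - 1) * sizeof items[0]);
    (*n)--;
    return true;
}
```

With i equal to the length (2 and 2 in the case described above), the function returns false instead of computing a wrapped-around size.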

This issue is not occurring if the latest commit of the component is used (and the same steps repeated) since this issue is prevented when introducing the fix of commit#84dde240.
Therefore, the latter commit is important.

Just to give you some more context, I first ran my tests using the latest release (since I had installed it with robotpkg). Then, I moved to the latest commit, repeated the same debugging sequence, and applied the same changes to the code.
Indeed, the branch fix-stuck of the fork repository branches out from the latest commit of the master branch.

Fixing the issue

Failed attempt to solve the issue

I tried to manually update the device state and push the events in the queue by calling the function SDL_PumpEvents.

This is why I introduced the attribute set_pump_events, which can be set at runtime.

Conclusion: it does not help to unstick the component.

How to successfully solve this problem

Several solutions work.

  1. Clear the event queue by calling the function SDL_FlushEvents.
    • Optionally, clear only the events that we are not interested in by playing with the values of <min> and <max>.
    • Introduced attribute set_flush_events to do that at runtime.
    • Conclusion: it works to unstick the component.
      • It may be a little tricky to detect when the component gets stuck. One possibility would be to monitor the error message returned by SDL_GetError, but it is not good practice, as explained here.
  2. Set a custom filter to make the library add to the event queue only the events we are interested in. This can be done by using the function SDL_SetEventFilter: https://wiki.libsdl.org/SDL2/SDL_SetEventFilter.
    • NB: once called, SDL_SetEventFilter internally calls SDL_FlushEvents, which clears the event queue.
      See line 1229 of SDL_events.c (i.e. the definition of SDL_SetEventFilter).
    • Introduced attribute set_custom_filter to set a custom event-type filter at runtime.
    • Conclusion: it works. The component does not seem to get stuck.
      • Not very practical to manually include all the events we are interested in within the filter function definition.
        • Error prone.
        • Tedious.
      • SDL_POLLSENTINEL will be excluded as it is the source of the problem, while it may be helpful for future developments of this component (or high-frequency devices?)
  3. Poll events continuously, i.e. call the function PollEvent without waiting for the events to arrive.
    • This will also poll the event SDL_POLLSENTINEL.
    • Conclusion: this requires extra CPU, but it prevents the event queue from filling up.
  4. It is possible to ignore some events of a given type (e.g., SDL_POLLSENTINEL). This can be done by using SDL_EventState, with arguments SDL_POLLSENTINEL and SDL_DISABLE.
    • Introduced attribute set_state_pollsentinel to enable (1) or disable (0) the state for event type SDL_POLLSENTINEL.
    • Conclusion: ignoring this event type prevents the queue from filling up and the component from getting stuck.
      • This is similar to setting a custom filter, but it does not require to manually include each time the interesting event types within the filter function definition.
      • To still use the event type SDL_POLLSENTINEL, it would be possible to ignore this event type if no events arrive, and then re-enable it back as soon as we detect a new event.
        • This could allow still using high-frequency devices.
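The difference between options 2 and 4 comes down to an allow-list versus a deny-list on the event type. The sketch below expresses both as plain predicates. The numeric values are assumed to match SDL2's SDL_events.h (0x7F00 is the code reported above), the macro and function names are hypothetical, and the idea that the component only needs joystick events is an assumption as well.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies for illustration only; real code should use the
 * SDL_JOYAXISMOTION, SDL_JOYDEVICEREMOVED and SDL_POLLSENTINEL enumerators. */
#define EV_JOY_FIRST    0x600u   /* assumed value of SDL_JOYAXISMOTION */
#define EV_JOY_LAST     0x606u   /* assumed value of SDL_JOYDEVICEREMOVED */
#define EV_POLLSENTINEL 0x7F00u  /* SDL_POLLSENTINEL */

/* Option 2 (allow-list): keep joystick events, drop everything else. */
bool keep_allowlist(uint32_t type)
{
    return type >= EV_JOY_FIRST && type <= EV_JOY_LAST;
}

/* Option 4 (deny-list): drop only the sentinel, keep everything else. */
bool keep_denylist(uint32_t type)
{
    return type != EV_POLLSENTINEL;
}
```

With SDL, the allow-list would live in a callback registered through SDL_SetEventFilter, which returns 1 to queue an event and 0 to drop it. The deny-list needs no callback, since a single call to SDL_EventState with SDL_POLLSENTINEL and SDL_DISABLE should have a comparable effect for that one type.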

My solution

I decided to opt for disabling the event type SDL_POLLSENTINEL for the following reasons:

  • It is related to high-frequency devices (mice), and thus out of scope for our application (reading inputs from joystick devices).
  • The fix is not release-dependent (an alternative would have been a preprocessor condition that applies the fix only for a given set of releases).
  • The fix is only one line.
  • The fix introduces one additional function call, made only at component startup.
    • Compared to enabling/disabling at runtime (which may call the function SDL_EventState many times), it is a more performant solution.
  • According to my tests, the CPU usage of the component is lowered.
    • With version 2.0.20, the component was using ~12% of CPU, while with this fix (ignoring the event type SDL_POLLSENTINEL) it uses only ~2-3%.
    • These results may change according to the releases, since some bug fixes have been introduced in newer versions to address cpu usage.

I pushed my solution in the branch main of the fork repository mentioned above.

Actions #1

Updated by Gianluca Corsini 3 months ago

  • Status changed from New to Closed
Actions #2

Updated by Anthony Mallet 3 months ago

Thanks for the detailed report!

I merged your commit, with just a slight change that fixes only the affected versions.

Actions #3

Updated by Gianluca Corsini 3 months ago

Perfect, thanks a lot. I agree with you that disabling it only for the affected versions is the best way to go :)
