XForms: Big time lag using XCopyArea, and get_next_event() event backlog problem.

From: jac@casurgica.com
Date: Wed Apr 23 2003 - 00:03:00 EDT

  • Next message: jac@casurgica.com: "XForms: Re: Big time lag using XCopyArea, and get_next_event() event backlog problem."

    # To subscribers of the xforms list from jac@casurgica.com :

    I just read this email after I finished typing it. Turned into a small
    book. Oh well. Here it is:

    Ok, I've been trying to track down this problem all day. We (the people at
    this company and I) have been experiencing a lot of bad UI response in all
    of our applications in certain situations. I wrote a test program and
    figured out that a call to XCopyArea from a fairly complex part of our
    code was directly or indirectly causing this problem. Basically, the test
    program does this:

    - Set up a form with a canvas, a button that does nothing, an input box
    (to make sure key events are lagging, too), and a quit button.
    - Enter into a very simple loop, like this:

    while (fl_check_forms() != test->quit) {
       FunctionThatCallsXCopyArea();
    };

    We know it's the XCopyArea call because when we comment out just that
    call, or replace it with something like XFillRectangle, everything is
    smooth. The UI sluggishness we are experiencing is on the order of about
    3/4 of a second, but sometimes it can be up to 7 or 8 seconds (note: the
    program and x server are both running on the local machine).

    I've traced the problem down to somewhere in get_next_event() in forms.c.
    Here are the steps I took to narrow it down to that (without getting too
    descriptive, it took me all day):

    - Figured out how much time was spent in fl_check_forms, and how much time
    was spent in the image calls (the parts that end up calling XCopyArea).
    - I wanted to know if the problem was either 1) XCopyArea taking a long
    time and lagging things or 2) XCopyArea creating some conflict somewhere
    causing events to be dropped or delayed. I figured out the average time
    that one call to our imaging functions took. It was 1200 microseconds.
    - I replaced the image calls with a loop that delayed the program for 1200
    usecs, and calculated the function timings again.
    - With the delay instead of the image calls, the loop was running much
    slower (surprise!). The reason for this is that the fl_check_forms() call
    was taking almost 60x longer to execute than it was with the actual
    imaging calls instead. What this suggested is that when XCopyArea was
    being called, fl_check_forms() was returning a lot sooner than it should,
    apparently before the events were processed or without processing the
    events at all.
    - More timing and experimenting narrowed it down to the
    fl_treat_interaction_events() call, then to the fl_do_interaction_step()
    call (rather than form_event_queued()).
    - In do_interaction_step(), it turned out that when we used the 1200usec
    delay, the ratio of null to non-null returns from get_next_event() was
    about 68:1 with no activity, and 1:1 with lots of activity. By contrast,
    when we used XCopyArea (so the UI was real laggy), the ratio was 2277:1
    with no activity, and 68:1 with lots of activity (mouse movements and
    such). Although these ratios varied a lot, the trend was the same, and
    suggested that get_next_event() was returning 0 a lot more than it should
    when we used XCopyArea (which would definitely cause the UI to be very
    unresponsive).

    Here is what I understand about get_next_event() (directly copied from the
    notes I was typing at the time):

    - ok here, get_next_event kind of works like this (it's real hard to look
    at):

      + check for an event, if none: flush buffer, read new ones
      + if there's an event, process it like so:
        + if it's MappingNotify, handle_global_event() then return 0
        + if it's not meant for a form, it does some weird stuff
    (fl_compress_event)
          that i can't figure out, then ungets the event.
        + if it was meant for a form then it got one so return !0.
      + now check again, if none: flush buffer, read new ones
      + if there's an event, then don't do anything, just return 0.
      + if there's still no event, call fl_watch_io. this also happens regardless
        of whether there's an event or not every 11 events, to keep non-x things
        flowing smoothly. fl_watch_io does a bunch of socket crap.

    There are a lot of odd things about the get_next_event() call. Lots of
    sloppy coding and some minor hacks here and there. Here are the few things
    that stood out to me (they may or may not suggest the real cause of this
    problem, I dunno):

    By the way, "point 3" refers to the 3rd "return" statement in
    get_next_event(), the one here:

        if ((dox % 11) && XEventsQueued(flx->display, QueuedAfterFlush)) {
          return has_event;
        };

    This is significant because, out of the 4 return statements in the
    function, when the program was run for about 8 seconds with lots of
    activity, it went like this (the first number is the 1200usec delay one,
    the second number is with the real image calls):

    Returned from 1st one: 0 no imaging, 0 with imaging
    Returned from 2nd one: 1012 no imaging, 288 with imaging
    Returned from 3rd one: 5 no imaging, 5810 with imaging
    Returned from 4th one: 1905 no imaging, 618 with imaging

    See how many more times it returned from "point 3" when we used the delay
    instead of the XCopyArea call? Anyway, notes ("point 2" is the return in
    the "if (has_event || do_x_only)" conditional"):

    - note that get_next_event() returned from point 3 lots of times when the iv
      stuff was there, but not many times when the delay was there.
    - point 3 is around forms.c:2626. odd things about that code:
      - when it gets to point 3 it returns has_event, but to get to point 3
        has_event must be 0 (otherwise it would have returned at point 2).
        this seems like sloppy coding.
      - has_event will always be 0 for point 1, which handles global events,
        of which the only one is MappingNotify, because has_event is init'd
        to 0 and never gets set before that point. also seems sloppy, why
        not just return 0?
      - the only time has_event gets modified is just before point 1, as a
        result of calling select_form_event(). the does not happen every 11th
        event (i'm sure there's a reason for this), and does not happen if
        there are no xevents queued.
      - do_x_only is a global variable used to skip the fl_watch_io() call.
        do_x_only is only used in a small bit of code in close_form_win().
        incidently, it's not even used here because it's in the #else of
        a "#if 1". looks like a hack that got #if'd away by more sloppiness.
    - the x reference index says XEventsQueued is on page 468, but it's
      actually on 249 (468 is XPeekEvent). that's annoying.

    Another very odd thing that I noticed is that by the time get_next_event()
    has reached point 3, it has called XEventsQueued twice with
    QueuedAfterFlush (which flushes the event buffer and checks for new events
    if there's no events in the buffer when it is called). Now, since
    get_next_event() will normally be getting called over and over again with
    not too much lower level xevent handling stuff in between, I know this
    isn't really a problem. It still seemed strange to me, though. Earlier on
    (it's worth mentioning now), I changed my test loop in my program to this
    (and used the actual XCopyArea stuff):

    while (fl_check_forms() != test->quit) {
       if (fl_check_forms() == test->quit)
         break;
       FunctionThatCallsXCopyArea();
    };

    Calling fl_check_forms() twice completely solved the lag problem. I think
    that it is weird that calling fl_check_forms() twice instead of once
    eliminates UI lags that would normally be over 3 or 4 seconds long and
    makes the application completely responsive. Calling fl_check_forms()
    twice in many of our loops, however, is an unacceptable hack. Because of
    the market we develop our applications for, stableness and cleanness are
    crucial, so it is very important that I found the real cause of this
    problem.

    I removed the second fl_check_forms() call and continued playing around.
    As an experiment, I modified get_next_event() so that it behaved like
    this:

    - If it was about to return from point 3, jump back up to the top of the
    function and (without incrementing dox or calling XEventsQueued a third
    time) try to deal with the event again. If, during this second time
    around, it got to point 3 again, just return. Don't try it again.

    This also completely solved the problem, suggesting something... I have no
    idea what, though. I'm primarily a Windows and console-only developer and
    I am not yet very familiar with X or the internal workings of the forms
    library. For some reason, it appeared as though every other call to
    get_next_event was missing events, perhaps because of some weird logic in
    get_next_event(). But, more experimenting suggests otherwise.

    Moving on, I read this comment in get_next_events():

    /* if incoming XEvent has already being pumped from the socket,
       watch_io() will time out, causing a bad delay in handling xevent.
       Make sure there is no event in the X event queue before we go into
       watch_io() */

    This comment is just before the "point 3" return statement. It is another
    oddity of get_next_events(), as the actual code is:

       if ((dox % 11) && XEventsQueued(flx->display, QueuedAfterFlush)) {
          return has_event;
       };

    Which means that every 11th call to get_next_events() should proceed to
    the fl_watch_io() stuff regardless of whether or not there is an event in
    the queue (this is not what the comment indicates that it should do).

    This supports the findings I had above when I made get_next_event() try
    processing the event twice, I think. Here is my logic: If, every 11th
    event, there is an event in the queue, fl_watch_io() will still happen
    "causing a bad delay in handling xevents". The comment indicates that
    there should be no XEvent in the queue when fl_watch_io() is called,
    therefore if we are on an 11th event, and there is an XEvent in the queue,
    WE SHOULD PROCESS IT BEFORE PROCEEDING TO fl_watch_io(), rather than
    leaving it in there. That seems to make sense, and I have already seen
    that by processing this event I fix the problem.

    Here is a very small portion of output from a log file that is filled with
    some information during get_next_event() calls. The first number on each
    line is the number of events in the queue reported by XEventsQueued at the
    beginning of the function, before the event processing might occur. The
    second number is the number of queued events just before point 3, where
    XEventsQueued is called to determine if the function should return before
    calling fl_watch_io(). The string "GOT ONE!!!" is printed if has_event is
    non-zero and the function returns before point 3. The string "TO WATCH
    IO!" is printed just before fl_watch_io() is called.

           293 292
           292 291
           291 291 TO WATCH IO!
           291 GOT ONE!!!
           325 324
           325 324
           324 323
           323 322
           322 321
           321 GOT ONE!!!
           324 323
           324 323
           323 322
           322 322 TO WATCH IO!

    Keeping in mind that file comment from above, notice how once every 11
    calls to the function, fl_watch_io() is called with about 300 events in
    the queue! This is during a period of lots of mouse and keyboard activity.

    Here is the an excerpt from the log with my "event reprocessing" hack in
    place (which eliminates the lag completely):

             1 0 TO WATCH IO!
             1 1 TO WATCH IO!
             1 0 REHANDLING GOT ONE!!!
             2 1 REHANDLING 0 TO WATCH IO!
             1 0 TO WATCH IO!
             1 0 TO WATCH IO!
             1 0 TO WATCH IO!
             1 0 TO WATCH IO!
             1 0 TO WATCH IO!
             1 GOT ONE!!!
             1 0 TO WATCH IO!
             1 0 TO WATCH IO!
             1 1 TO WATCH IO!
             1 0 TO WATCH IO!
             1 GOT ONE!!!
             2 1 REHANDLING 0 TO WATCH IO!
             2 1 REHANDLING GOT ONE!!!

    The string "REHANDLING" is printed when get_next_event() would normally
    return without processing the event from point 3, but instead went back to
    process the event that is still in the queue. The number after that string
    but just before "TO WATCH IO!" is the number of events in the queue when
    fl_watch_io() is about to be called. This part of the log represents the
    peak of the queues capacity. 1 or 2 events, as it should be.

    So, some weird logic in get_next_event() is leading to a backlog of
    hundreds of unprocessed events in the queue. Here is why I believe this
    backlog is happening (talking about get_next_event() without my hacks and
    modifications):

    - When more than one event is in the queue when get_next_event() is
    called, it only processes one. It leaves the rest of the events in the
    queue to be handled the next time get_next_event() is called. When only
    one event is in the queue, it processes it and leaves none in the queue.
    This is all good and well.
    - Every 11th event, if you look at get_next_event(), you see that event
    processing is completely disabled. So every 11th event, one extra event
    remains unprocessed in the queue. These eventually add up and create a lag
    problem. I think they peak around 320 because perhaps that is when the
    queue is full.

    Imagine it like this:

    Say you have a bunch of candy. At regular intervals you add a single piece
    of candy to an initially empty bowl. Each time you do that, a child walks
    by and takes one piece of candy. So, you never really have more than one
    piece of candy in the bowl.

    Situation 1: You are doing this, but every once in a while you put 2
    pieces of candy in the bowl instead of one. Each child, however, is only
    taking one piece of candy. So the extra pieces of candy slowly add up, and
    eventually you have a bowl full of candy (because you're putting it in
    faster than the children are taking it out -- for every N>=10 pieces you
    put in, only 10 get removed).

    Situation 2: You are doing this, but every 11th child does not take a
    piece of candy. These pieces of candy add up, also leaving you with a bowl
    full of candy (once again, because for every 11 pieces you put in, only 10
    get removed).

    This situation: Not only are you putting in 2 pieces of candy sometimes,
    but every 11th child isn't taking a piece. So the bowl gets filled even
    quicker.

    It's almost like get_next_event() is experiencing two logic errors (I
    guess those are called 'bugs', heh) at once, both resulting in a backlog
    of xevents in the queue.

    Here are some other things I found that solve the problem, in addition to
    the things I mentioned above:

    - As I stated, calling fl_check_forms() twice solves the problem (perhaps
    this is because they events are getting handled faster relative to how
    fast you, the human, are causing them to be generated; since this also
    results in every 11th call happening twice as much).
    - Also as I stated, processing the "leftover" event in get_next_event()
    fixes the problem, too. This may or may not be a real fix, since if you
    started putting 3 or more events in the queue, the backlog would happen
    again.
    - Commenting out the fl_watch_io() stuff, or just forcing a return from
    "point 3" fixes this. But it also causes fl_watch_io() to not be called,
    which is bad, I think.
    - Running a loop with the imaging calls in it on a separate thread than
    the fl_check_forms() calls are on (using proper X threading techniques
    [i.e. XInitThreads, XLockDisplay, XUnlockDisplay], of course). I don't
    know why this works, maybe it has something to do with different event
    queues for different threads.

    It is also beyond me as to why the XCopyArea call alone indirectly causes
    this problem. I have some theories but I'd like to test them out first.
    XFillRectangle does not do this, nor do any of the other drawing calls I
    tested.

    Point is, the lag is definitely being causes by a backlog of events due to
    a problem in the way get_next_events() removes events from the queue. But
    I don't think that that is the root cause of the problem, because the
    question remains: Why are there so many more events being generated when
    XCopyArea is called rather than when we are not calling any X routines,
    and simply delaying for the same amount of time that the X calls would
    normally take?

    I have a theory that involves select_form_event and events sent to canvas
    windows. But I'm not sure if this makes any sense, and I really need to
    look into it a little further, so don't attack that just yet.

    Any thoughts or anything would be appreciated...

    Jason Cipriani

    _________________________________________________
    To unsubscribe, send the message "unsubscribe" to
    xforms-request@bob.usuhs.mil or see
    http://bob.usuhs.mil/mailserv/xforms.html
    XForms Home Page: http://world.std.com/~xforms
    List Archive: http://bob.usuhs.mil/mailserv/list-archives/
    Development: http://savannah.nongnu.org/files/?group=xforms



    This archive was generated by hypermail 2b29 : Wed Apr 23 2003 - 00:05:28 EDT