XForms: Re: Big time lag using XCopyArea, and get_next_event() event backlog problem.

From: jac@casurgica.com
Date: Wed Apr 23 2003 - 00:06:52 EDT

  • Next message: jac@casurgica.com: "XForms: Re: Big time lag using XCopyArea, and get_next_event() event backlog problem."

    # To subscribers of the xforms list from jac@casurgica.com :

    One very important thing I forgot to mention was that at the beginning,
    when we replaced the imaging stuff with the 1200usec delay, there was no
    longer any lag and the UI was as responsive as it should have been.

    Jason

    > I just read this email after I finished typing it. Turned into a small
    > book. Oh well. Here it is:
    >
    > Ok, I've been trying to track down this problem all day. We (the people
    > at this company and I) have been experiencing a lot of bad UI response
    > in all of our applications in certain situations. I wrote a test program
    > and figured out that a call to XCopyArea from a fairly complex part of
    > our code was directly or indirectly causing this problem. Basically, the
    > test program does this:
    >
    > - Set up a form with a canvas, a button that does nothing, an input box
    > (to make sure key events are lagging, too), and a quit button.
    > - Enter into a very simple loop, like this:
    >
    > while (fl_check_forms() != test->quit) {
    > FunctionThatCallsXCopyArea();
    > };
    >
    > We know it's the XCopyArea call because when we comment out just that
    > call, or replace it with something like XFillRectangle, everything is
    > smooth. The UI sluggishness we are experiencing is on the order of about
    > 3/4 of a second, but sometimes it can be up to 7 or 8 seconds (note: the
    > program and x server are both running on the local machine).
    >
    > I've traced the problem down to somewhere in get_next_event() in
    > forms.c. Here are the steps I took to narrow it down to that (without
    > getting too descriptive, it took me all day):
    >
    > - Figured out how much time was spent in fl_check_forms, and how much
    > time was spent in the image calls (the parts that end up calling
    > XCopyArea). - I wanted to know if the problem was either 1) XCopyArea
    > taking a long time and lagging things or 2) XCopyArea creating some
    > conflict somewhere causing events to be dropped or delayed. I figured
    > out the average time that one call to our imaging functions took. It was
    > 1200 microseconds. - I replaced the image calls with a loop that delayed
    > the program for 1200 usecs, and calculated the function timings again.
    > - With the delay instead of the image calls, the loop was running much
    > slower (surprise!). The reason for this is that the fl_check_forms()
    > call was taking almost 60x longer to execute than it was with the actual
    > imaging calls instead. What this suggested is that when XCopyArea was
    > being called, fl_check_forms() was returning a lot sooner than it
    > should, apparently before the events were processed or without
    > processing the events at all.
    > - More timing and experimenting narrowed it down to the
    > fl_treat_interaction_events() call, then to the fl_do_interaction_step()
    > call (rather than form_event_queued()).
    > - In do_interaction_step(), it turned out that when we used the 1200usec
    > delay, the ratio of null to non-null returns from get_next_event() was
    > about 68:1 with no activity, and 1:1 with lots of activity. By contrast,
    > when we used XCopyArea (so the UI was real laggy), the ratio was 2277:1
    > with no activity, and 68:1 with lots of activity (mouse movements and
    > such). Although these ratios varied a lot, the trend was the same, and
    > suggested that get_next_event() was returning 0 a lot more than it
    > should when we used XCopyArea (which would definitely cause the UI to be
    > very unresponsive).
    >
    > Here is what I understand about get_next_event() (directly copied from
    > the notes I was typing at the time):
    >
    > - ok here, get_next_event kind of works like this (it's real hard to
    > look at):
    >
    > + check for an event, if none: flush buffer, read new ones
    > + if there's an event, process it like so:
    > + if it's MappingNotify, handle_global_event() then return 0
    > + if it's not meant for a form, it does some weird stuff
    > (fl_compress_event)
    > that i can't figure out, then ungets the event.
    > + if it was meant for a form then it got one so return !0.
    > + now check again, if none: flush buffer, read new ones
    > + if there's an event, then don't do anything, just return 0.
    > + if there's still no event, call fl_watch_io. this also happens
    > regardless
    > of whether there's an event or not every 11 events, to keep non-x
    > things flowing smoothly. fl_watch_io does a bunch of socket crap.
    >
    > There are a lot of odd things about the get_next_event() call. Lots of
    > sloppy coding and some minor hacks here and there. Here are the few
    > things that stood out to me (they may or may not suggest the real cause
    > of this problem, I dunno):
    >
    > By the way, "point 3" refers to the 3rd "return" statement in
    > get_next_event(), the one here:
    >
    > if ((dox % 11) && XEventsQueued(flx->display, QueuedAfterFlush)) {
    > return has_event;
    > };
    >
    > This is significant because, out of the 4 return statements in the
    > function, when the program was run for about 8 seconds with lots of
    > activity, it went like this (the first number is the 1200usec delay one,
    > the second number is with the real image calls):
    >
    > Returned from 1st one: 0 no imaging, 0 with imaging
    > Returned from 2nd one: 1012 no imaging, 288 with imaging
    > Returned from 3rd one: 5 no imaging, 5810 with imaging
    > Returned from 4th one: 1905 no imaging, 618 with imaging
    >
    > See how many more times it returned from "point 3" when we used the
    > delay instead of the XCopyArea call? Anyway, notes ("point 2" is the
    > return in the "if (has_event || do_x_only)" conditional"):
    >
    > - note that get_next_event() returned from point 3 lots of times when
    > the iv
    > stuff was there, but not many times when the delay was there.
    > - point 3 is around forms.c:2626. odd things about that code:
    > - when it gets to point 3 it returns has_event, but to get to point 3
    > has_event must be 0 (otherwise it would have returned at point 2).
    > this seems like sloppy coding.
    > - has_event will always be 0 for point 1, which handles global events,
    > of which the only one is MappingNotify, because has_event is init'd
    > to 0 and never gets set before that point. also seems sloppy, why
    > not just return 0?
    > - the only time has_event gets modified is just before point 1, as a
    > result of calling select_form_event(). the does not happen every
    > 11th event (i'm sure there's a reason for this), and does not happen
    > if there are no xevents queued.
    > - do_x_only is a global variable used to skip the fl_watch_io() call.
    > do_x_only is only used in a small bit of code in close_form_win().
    > incidently, it's not even used here because it's in the #else of a
    > "#if 1". looks like a hack that got #if'd away by more sloppiness.
    > - the x reference index says XEventsQueued is on page 468, but it's
    > actually on 249 (468 is XPeekEvent). that's annoying.
    >
    > Another very odd thing that I noticed is that by the time
    > get_next_event() has reached point 3, it has called XEventsQueued twice
    > with
    > QueuedAfterFlush (which flushes the event buffer and checks for new
    > events if there's no events in the buffer when it is called). Now, since
    > get_next_event() will normally be getting called over and over again
    > with not too much lower level xevent handling stuff in between, I know
    > this isn't really a problem. It still seemed strange to me, though.
    > Earlier on (it's worth mentioning now), I changed my test loop in my
    > program to this (and used the actual XCopyArea stuff):
    >
    > while (fl_check_forms() != test->quit) {
    > if (fl_check_forms() == test->quit)
    > break;
    > FunctionThatCallsXCopyArea();
    > };
    >
    > Calling fl_check_forms() twice completely solved the lag problem. I
    > think that it is weird that calling fl_check_forms() twice instead of
    > once eliminates UI lags that would normally be over 3 or 4 seconds long
    > and makes the application completely responsive. Calling
    > fl_check_forms() twice in many of our loops, however, is an unacceptable
    > hack. Because of the market we develop our applications for, stableness
    > and cleanness are crucial, so it is very important that I found the real
    > cause of this problem.
    >
    > I removed the second fl_check_forms() call and continued playing around.
    > As an experiment, I modified get_next_event() so that it behaved like
    > this:
    >
    > - If it was about to return from point 3, jump back up to the top of the
    > function and (without incrementing dox or calling XEventsQueued a third
    > time) try to deal with the event again. If, during this second time
    > around, it got to point 3 again, just return. Don't try it again.
    >
    > This also completely solved the problem, suggesting something... I have
    > no idea what, though. I'm primarily a Windows and console-only developer
    > and I am not yet very familiar with X or the internal workings of the
    > forms library. For some reason, it appeared as though every other call
    > to get_next_event was missing events, perhaps because of some weird
    > logic in get_next_event(). But, more experimenting suggests otherwise.
    >
    > Moving on, I read this comment in get_next_events():
    >
    > /* if incoming XEvent has already being pumped from the socket,
    > watch_io() will time out, causing a bad delay in handling xevent.
    > Make sure there is no event in the X event queue before we go into
    > watch_io() */
    >
    > This comment is just before the "point 3" return statement. It is
    > another oddity of get_next_events(), as the actual code is:
    >
    > if ((dox % 11) && XEventsQueued(flx->display, QueuedAfterFlush)) {
    > return has_event;
    > };
    >
    > Which means that every 11th call to get_next_events() should proceed to
    > the fl_watch_io() stuff regardless of whether or not there is an event
    > in the queue (this is not what the comment indicates that it should do).
    >
    > This supports the findings I had above when I made get_next_event() try
    > processing the event twice, I think. Here is my logic: If, every 11th
    > event, there is an event in the queue, fl_watch_io() will still happen
    > "causing a bad delay in handling xevents". The comment indicates that
    > there should be no XEvent in the queue when fl_watch_io() is called,
    > therefore if we are on an 11th event, and there is an XEvent in the
    > queue, WE SHOULD PROCESS IT BEFORE PROCEEDING TO fl_watch_io(), rather
    > than leaving it in there. That seems to make sense, and I have already
    > seen that by processing this event I fix the problem.
    >
    > Here is a very small portion of output from a log file that is filled
    > with some information during get_next_event() calls. The first number on
    > each line is the number of events in the queue reported by XEventsQueued
    > at the beginning of the function, before the event processing might
    > occur. The second number is the number of queued events just before
    > point 3, where XEventsQueued is called to determine if the function
    > should return before calling fl_watch_io(). The string "GOT ONE!!!" is
    > printed if has_event is non-zero and the function returns before point
    > 3. The string "TO WATCH IO!" is printed just before fl_watch_io() is
    > called.
    >
    > 293 292
    > 292 291
    > 291 291 TO WATCH IO!
    > 291 GOT ONE!!!
    > 325 324
    > 325 324
    > 324 323
    > 323 322
    > 322 321
    > 321 GOT ONE!!!
    > 324 323
    > 324 323
    > 323 322
    > 322 322 TO WATCH IO!
    >
    > Keeping in mind that file comment from above, notice how once every 11
    > calls to the function, fl_watch_io() is called with about 300 events in
    > the queue! This is during a period of lots of mouse and keyboard
    > activity.
    >
    > Here is the an excerpt from the log with my "event reprocessing" hack in
    > place (which eliminates the lag completely):
    >
    > 1 0 TO WATCH IO!
    > 1 1 TO WATCH IO!
    > 1 0 REHANDLING GOT ONE!!!
    > 2 1 REHANDLING 0 TO WATCH IO!
    > 1 0 TO WATCH IO!
    > 1 0 TO WATCH IO!
    > 1 0 TO WATCH IO!
    > 1 0 TO WATCH IO!
    > 1 0 TO WATCH IO!
    > 1 GOT ONE!!!
    > 1 0 TO WATCH IO!
    > 1 0 TO WATCH IO!
    > 1 1 TO WATCH IO!
    > 1 0 TO WATCH IO!
    > 1 GOT ONE!!!
    > 2 1 REHANDLING 0 TO WATCH IO!
    > 2 1 REHANDLING GOT ONE!!!
    >
    > The string "REHANDLING" is printed when get_next_event() would normally
    > return without processing the event from point 3, but instead went back
    > to process the event that is still in the queue. The number after that
    > string but just before "TO WATCH IO!" is the number of events in the
    > queue when fl_watch_io() is about to be called. This part of the log
    > represents the peak of the queues capacity. 1 or 2 events, as it should
    > be.
    >
    > So, some weird logic in get_next_event() is leading to a backlog of
    > hundreds of unprocessed events in the queue. Here is why I believe this
    > backlog is happening (talking about get_next_event() without my hacks
    > and modifications):
    >
    > - When more than one event is in the queue when get_next_event() is
    > called, it only processes one. It leaves the rest of the events in the
    > queue to be handled the next time get_next_event() is called. When only
    > one event is in the queue, it processes it and leaves none in the queue.
    > This is all good and well.
    > - Every 11th event, if you look at get_next_event(), you see that event
    > processing is completely disabled. So every 11th event, one extra event
    > remains unprocessed in the queue. These eventually add up and create a
    > lag problem. I think they peak around 320 because perhaps that is when
    > the queue is full.
    >
    > Imagine it like this:
    >
    > Say you have a bunch of candy. At regular intervals you add a single
    > piece of candy to an initially empty bowl. Each time you do that, a
    > child walks by and takes one piece of candy. So, you never really have
    > more than one piece of candy in the bowl.
    >
    > Situation 1: You are doing this, but every once in a while you put 2
    > pieces of candy in the bowl instead of one. Each child, however, is only
    > taking one piece of candy. So the extra pieces of candy slowly add up,
    > and eventually you have a bowl full of candy (because you're putting it
    > in faster than the children are taking it out -- for every N>=10 pieces
    > you put in, only 10 get removed).
    >
    > Situation 2: You are doing this, but every 11th child does not take a
    > piece of candy. These pieces of candy add up, also leaving you with a
    > bowl full of candy (once again, because for every 11 pieces you put in,
    > only 10 get removed).
    >
    > This situation: Not only are you putting in 2 pieces of candy sometimes,
    > but every 11th child isn't taking a piece. So the bowl gets filled even
    > quicker.
    >
    > It's almost like get_next_event() is experiencing two logic errors (I
    > guess those are called 'bugs', heh) at once, both resulting in a backlog
    > of xevents in the queue.
    >
    > Here are some other things I found that solve the problem, in addition
    > to the things I mentioned above:
    >
    > - As I stated, calling fl_check_forms() twice solves the problem
    > (perhaps this is because they events are getting handled faster relative
    > to how fast you, the human, are causing them to be generated; since this
    > also results in every 11th call happening twice as much).
    > - Also as I stated, processing the "leftover" event in get_next_event()
    > fixes the problem, too. This may or may not be a real fix, since if you
    > started putting 3 or more events in the queue, the backlog would happen
    > again.
    > - Commenting out the fl_watch_io() stuff, or just forcing a return from
    > "point 3" fixes this. But it also causes fl_watch_io() to not be called,
    > which is bad, I think.
    > - Running a loop with the imaging calls in it on a separate thread than
    > the fl_check_forms() calls are on (using proper X threading techniques
    > [i.e. XInitThreads, XLockDisplay, XUnlockDisplay], of course). I don't
    > know why this works, maybe it has something to do with different event
    > queues for different threads.
    >
    > It is also beyond me as to why the XCopyArea call alone indirectly
    > causes this problem. I have some theories but I'd like to test them out
    > first. XFillRectangle does not do this, nor do any of the other drawing
    > calls I tested.
    >
    > Point is, the lag is definitely being causes by a backlog of events due
    > to a problem in the way get_next_events() removes events from the queue.
    > But I don't think that that is the root cause of the problem, because
    > the question remains: Why are there so many more events being generated
    > when XCopyArea is called rather than when we are not calling any X
    > routines, and simply delaying for the same amount of time that the X
    > calls would normally take?
    >
    > I have a theory that involves select_form_event and events sent to
    > canvas windows. But I'm not sure if this makes any sense, and I really
    > need to look into it a little further, so don't attack that just yet.
    >
    > Any thoughts or anything would be appreciated...
    >
    > Jason Cipriani

    _________________________________________________
    To unsubscribe, send the message "unsubscribe" to
    xforms-request@bob.usuhs.mil or see
    http://bob.usuhs.mil/mailserv/xforms.html
    XForms Home Page: http://world.std.com/~xforms
    List Archive: http://bob.usuhs.mil/mailserv/list-archives/
    Development: http://savannah.nongnu.org/files/?group=xforms



    This archive was generated by hypermail 2b29 : Wed Apr 23 2003 - 00:07:36 EDT