XForms: Big time lag using XCopyArea, and get_next_event() event backlog problem.

From: jac@casurgica.com
Date: Wed Apr 23 2003 - 00:03:00 EDT

Next message: jac@casurgica.com: "XForms: Re: Big time lag using XCopyArea, and get_next_event() event backlog problem."

Previous message: Angus Leeming: "XForms: autoconf/automake patch"
Next in thread: jac@casurgica.com: "XForms: Re: Big time lag using XCopyArea, and get_next_event() event backlog problem."
Reply: jac@casurgica.com: "XForms: Re: Big time lag using XCopyArea, and get_next_event() event backlog problem."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

# To subscribers of the xforms list from jac@casurgica.com :

I just read this email after I finished typing it. Turned into a small
book. Oh well. Here it is:

Ok, I've been trying to track down this problem all day. We (the people at
this company and I) have been experiencing a lot of bad UI response in all
of our applications in certain situations. I wrote a test program and
figured out that a call to XCopyArea from a fairly complex part of our
code was directly or indirectly causing this problem. Basically, the test
program does this:

- Set up a form with a canvas, a button that does nothing, an input box
(to make sure key events are lagging, too), and a quit button.
- Enter into a very simple loop, like this:

while (fl_check_forms() != test->quit) {
FunctionThatCallsXCopyArea();
};

We know it's the XCopyArea call because when we comment out just that
call, or replace it with something like XFillRectangle, everything is
smooth. The UI sluggishness we are experiencing is on the order of about
3/4 of a second, but sometimes it can be up to 7 or 8 seconds (note: the
program and x server are both running on the local machine).

I've traced the problem down to somewhere in get_next_event() in forms.c.
Here are the steps I took to narrow it down to that (without getting too
descriptive, it took me all day):

- Figured out how much time was spent in fl_check_forms, and how much time
was spent in the image calls (the parts that end up calling XCopyArea).
- I wanted to know if the problem was either 1) XCopyArea taking a long
time and lagging things or 2) XCopyArea creating some conflict somewhere
causing events to be dropped or delayed. I figured out the average time
that one call to our imaging functions took. It was 1200 microseconds.
- I replaced the image calls with a loop that delayed the program for 1200
usecs, and calculated the function timings again.
- With the delay instead of the image calls, the loop was running much
slower (surprise!). The reason for this is that the fl_check_forms() call
was taking almost 60x longer to execute than it was with the actual
imaging calls instead. What this suggested is that when XCopyArea was
being called, fl_check_forms() was returning a lot sooner than it should,
apparently before the events were processed or without processing the
events at all.
- More timing and experimenting narrowed it down to the
fl_treat_interaction_events() call, then to the fl_do_interaction_step()
call (rather than form_event_queued()).
- In do_interaction_step(), it turned out that when we used the 1200usec
delay, the ratio of null to non-null returns from get_next_event() was
about 68:1 with no activity, and 1:1 with lots of activity. By contrast,
when we used XCopyArea (so the UI was real laggy), the ratio was 2277:1
with no activity, and 68:1 with lots of activity (mouse movements and
such). Although these ratios varied a lot, the trend was the same, and
suggested that get_next_event() was returning 0 a lot more than it should
when we used XCopyArea (which would definitely cause the UI to be very
unresponsive).

Here is what I understand about get_next_event() (directly copied from the
notes I was typing at the time):

- ok here, get_next_event kind of works like this (it's real hard to look
at):

  + check for an event, if none: flush buffer, read new ones
  + if there's an event, process it like so:
    + if it's MappingNotify, handle_global_event() then return 0
    + if it's not meant for a form, it does some weird stuff
(fl_compress_event)
      that i can't figure out, then ungets the event.
    + if it was meant for a form then it got one so return !0.
  + now check again, if none: flush buffer, read new ones
  + if there's an event, then don't do anything, just return 0.
  + if there's still no event, call fl_watch_io. this also happens regardless
    of whether there's an event or not every 11 events, to keep non-x things
    flowing smoothly. fl_watch_io does a bunch of socket crap.

There are a lot of odd things about the get_next_event() call. Lots of
sloppy coding and some minor hacks here and there. Here are the few things
that stood out to me (they may or may not suggest the real cause of this
problem, I dunno):

By the way, "point 3" refers to the 3rd "return" statement in
get_next_event(), the one here:

    if ((dox % 11) && XEventsQueued(flx->display, QueuedAfterFlush)) {
      return has_event;
    };

This is significant because, out of the 4 return statements in the
function, when the program was run for about 8 seconds with lots of
activity, it went like this (the first number is the 1200usec delay one,
the second number is with the real image calls):

Returned from 1st one: 0 no imaging, 0 with imaging
Returned from 2nd one: 1012 no imaging, 288 with imaging
Returned from 3rd one: 5 no imaging, 5810 with imaging
Returned from 4th one: 1905 no imaging, 618 with imaging

See how many more times it returned from "point 3" when we used the delay
instead of the XCopyArea call? Anyway, notes ("point 2" is the return in
the "if (has_event || do_x_only)" conditional"):

- note that get_next_event() returned from point 3 lots of times when the iv
  stuff was there, but not many times when the delay was there.
- point 3 is around forms.c:2626. odd things about that code:
  - when it gets to point 3 it returns has_event, but to get to point 3
    has_event must be 0 (otherwise it would have returned at point 2).
    this seems like sloppy coding.
  - has_event will always be 0 for point 1, which handles global events,
    of which the only one is MappingNotify, because has_event is init'd
    to 0 and never gets set before that point. also seems sloppy, why
    not just return 0?
  - the only time has_event gets modified is just before point 1, as a
    result of calling select_form_event(). the does not happen every 11th
    event (i'm sure there's a reason for this), and does not happen if
    there are no xevents queued.
  - do_x_only is a global variable used to skip the fl_watch_io() call.
    do_x_only is only used in a small bit of code in close_form_win().
    incidently, it's not even used here because it's in the #else of
    a "#if 1". looks like a hack that got #if'd away by more sloppiness.
- the x reference index says XEventsQueued is on page 468, but it's
  actually on 249 (468 is XPeekEvent). that's annoying.

Another very odd thing that I noticed is that by the time get_next_event()
has reached point 3, it has called XEventsQueued twice with
QueuedAfterFlush (which flushes the event buffer and checks for new events
if there's no events in the buffer when it is called). Now, since
get_next_event() will normally be getting called over and over again with
not too much lower level xevent handling stuff in between, I know this
isn't really a problem. It still seemed strange to me, though. Earlier on
(it's worth mentioning now), I changed my test loop in my program to this
(and used the actual XCopyArea stuff):

while (fl_check_forms() != test->quit) {
   if (fl_check_forms() == test->quit)
     break;
   FunctionThatCallsXCopyArea();
};

Calling fl_check_forms() twice completely solved the lag problem. I think
that it is weird that calling fl_check_forms() twice instead of once
eliminates UI lags that would normally be over 3 or 4 seconds long and
makes the application completely responsive. Calling fl_check_forms()
twice in many of our loops, however, is an unacceptable hack. Because of
the market we develop our applications for, stableness and cleanness are
crucial, so it is very important that I found the real cause of this
problem.

I removed the second fl_check_forms() call and continued playing around.
As an experiment, I modified get_next_event() so that it behaved like
this:

- If it was about to return from point 3, jump back up to the top of the
function and (without incrementing dox or calling XEventsQueued a third
time) try to deal with the event again. If, during this second time
around, it got to point 3 again, just return. Don't try it again.

This also completely solved the problem, suggesting something... I have no
idea what, though. I'm primarily a Windows and console-only developer and
I am not yet very familiar with X or the internal workings of the forms
library. For some reason, it appeared as though every other call to
get_next_event was missing events, perhaps because of some weird logic in
get_next_event(). But, more experimenting suggests otherwise.

Moving on, I read this comment in get_next_events():

/* if incoming XEvent has already being pumped from the socket,
   watch_io() will time out, causing a bad delay in handling xevent.
   Make sure there is no event in the X event queue before we go into
   watch_io() */

This comment is just before the "point 3" return statement. It is another
oddity of get_next_events(), as the actual code is:

   if ((dox % 11) && XEventsQueued(flx->display, QueuedAfterFlush)) {
      return has_event;
   };

Which means that every 11th call to get_next_events() should proceed to
the fl_watch_io() stuff regardless of whether or not there is an event in
the queue (this is not what the comment indicates that it should do).

This supports the findings I had above when I made get_next_event() try
processing the event twice, I think. Here is my logic: If, every 11th
event, there is an event in the queue, fl_watch_io() will still happen
"causing a bad delay in handling xevents". The comment indicates that
there should be no XEvent in the queue when fl_watch_io() is called,
therefore if we are on an 11th event, and there is an XEvent in the queue,
WE SHOULD PROCESS IT BEFORE PROCEEDING TO fl_watch_io(), rather than
leaving it in there. That seems to make sense, and I have already seen
that by processing this event I fix the problem.

Here is a very small portion of output from a log file that is filled with
some information during get_next_event() calls. The first number on each
line is the number of events in the queue reported by XEventsQueued at the
beginning of the function, before the event processing might occur. The
second number is the number of queued events just before point 3, where
XEventsQueued is called to determine if the function should return before
calling fl_watch_io(). The string "GOT ONE!!!" is printed if has_event is
non-zero and the function returns before point 3. The string "TO WATCH
IO!" is printed just before fl_watch_io() is called.

       293 292
       292 291
       291 291 TO WATCH IO!
       291 GOT ONE!!!
       325 324
       325 324
       324 323
       323 322
       322 321
       321 GOT ONE!!!
       324 323
       324 323
       323 322
       322 322 TO WATCH IO!

Keeping in mind that file comment from above, notice how once every 11
calls to the function, fl_watch_io() is called with about 300 events in
the queue! This is during a period of lots of mouse and keyboard activity.

Here is the an excerpt from the log with my "event reprocessing" hack in
place (which eliminates the lag completely):

         1 0 TO WATCH IO!
         1 1 TO WATCH IO!
         1 0 REHANDLING GOT ONE!!!
         2 1 REHANDLING 0 TO WATCH IO!
         1 0 TO WATCH IO!
         1 0 TO WATCH IO!
         1 0 TO WATCH IO!
         1 0 TO WATCH IO!
         1 0 TO WATCH IO!
         1 GOT ONE!!!
         1 0 TO WATCH IO!
         1 0 TO WATCH IO!
         1 1 TO WATCH IO!
         1 0 TO WATCH IO!
         1 GOT ONE!!!
         2 1 REHANDLING 0 TO WATCH IO!
         2 1 REHANDLING GOT ONE!!!

The string "REHANDLING" is printed when get_next_event() would normally
return without processing the event from point 3, but instead went back to
process the event that is still in the queue. The number after that string
but just before "TO WATCH IO!" is the number of events in the queue when
fl_watch_io() is about to be called. This part of the log represents the
peak of the queues capacity. 1 or 2 events, as it should be.

So, some weird logic in get_next_event() is leading to a backlog of
hundreds of unprocessed events in the queue. Here is why I believe this
backlog is happening (talking about get_next_event() without my hacks and
modifications):

- When more than one event is in the queue when get_next_event() is
called, it only processes one. It leaves the rest of the events in the
queue to be handled the next time get_next_event() is called. When only
one event is in the queue, it processes it and leaves none in the queue.
This is all good and well.
- Every 11th event, if you look at get_next_event(), you see that event
processing is completely disabled. So every 11th event, one extra event
remains unprocessed in the queue. These eventually add up and create a lag
problem. I think they peak around 320 because perhaps that is when the
queue is full.

Imagine it like this:

Say you have a bunch of candy. At regular intervals you add a single piece
of candy to an initially empty bowl. Each time you do that, a child walks
by and takes one piece of candy. So, you never really have more than one
piece of candy in the bowl.

Situation 1: You are doing this, but every once in a while you put 2
pieces of candy in the bowl instead of one. Each child, however, is only
taking one piece of candy. So the extra pieces of candy slowly add up, and
eventually you have a bowl full of candy (because you're putting it in
faster than the children are taking it out -- for every N>=10 pieces you
put in, only 10 get removed).

Situation 2: You are doing this, but every 11th child does not take a
piece of candy. These pieces of candy add up, also leaving you with a bowl
full of candy (once again, because for every 11 pieces you put in, only 10
get removed).

This situation: Not only are you putting in 2 pieces of candy sometimes,
but every 11th child isn't taking a piece. So the bowl gets filled even
quicker.

It's almost like get_next_event() is experiencing two logic errors (I
guess those are called 'bugs', heh) at once, both resulting in a backlog
of xevents in the queue.

Here are some other things I found that solve the problem, in addition to
the things I mentioned above:

- As I stated, calling fl_check_forms() twice solves the problem (perhaps
this is because they events are getting handled faster relative to how
fast you, the human, are causing them to be generated; since this also
results in every 11th call happening twice as much).
- Also as I stated, processing the "leftover" event in get_next_event()
fixes the problem, too. This may or may not be a real fix, since if you
started putting 3 or more events in the queue, the backlog would happen
again.
- Commenting out the fl_watch_io() stuff, or just forcing a return from
"point 3" fixes this. But it also causes fl_watch_io() to not be called,
which is bad, I think.
- Running a loop with the imaging calls in it on a separate thread than
the fl_check_forms() calls are on (using proper X threading techniques
[i.e. XInitThreads, XLockDisplay, XUnlockDisplay], of course). I don't
know why this works, maybe it has something to do with different event
queues for different threads.

It is also beyond me as to why the XCopyArea call alone indirectly causes
this problem. I have some theories but I'd like to test them out first.
XFillRectangle does not do this, nor do any of the other drawing calls I
tested.

Point is, the lag is definitely being causes by a backlog of events due to
a problem in the way get_next_events() removes events from the queue. But
I don't think that that is the root cause of the problem, because the
question remains: Why are there so many more events being generated when
XCopyArea is called rather than when we are not calling any X routines,
and simply delaying for the same amount of time that the X calls would
normally take?

I have a theory that involves select_form_event and events sent to canvas
windows. But I'm not sure if this makes any sense, and I really need to
look into it a little further, so don't attack that just yet.

Any thoughts or anything would be appreciated...

Jason Cipriani

_________________________________________________
To unsubscribe, send the message "unsubscribe" to
xforms-request@bob.usuhs.mil or see
http://bob.usuhs.mil/mailserv/xforms.html
XForms Home Page: http://world.std.com/~xforms
List Archive: http://bob.usuhs.mil/mailserv/list-archives/
Development: http://savannah.nongnu.org/files/?group=xforms

Next message: jac@casurgica.com: "XForms: Re: Big time lag using XCopyArea, and get_next_event() event backlog problem."
Previous message: Angus Leeming: "XForms: autoconf/automake patch"
Next in thread: jac@casurgica.com: "XForms: Re: Big time lag using XCopyArea, and get_next_event() event backlog problem."
Reply: jac@casurgica.com: "XForms: Re: Big time lag using XCopyArea, and get_next_event() event backlog problem."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

This archive was generated by hypermail 2b29 : Wed Apr 23 2003 - 00:05:28 EDT