Bug #5558
linuxdvb status / diseqc - global_lock problems with complex setup and/or slow drivers
Description
With tvheadend 4.3 I noticed many problems with data corruption,
which occur when background streams are running or an epg scan is active.
This produces problems such as:
- Continuity counter errors in the log file
- mpegts: too much queued table input data (over 2MB)
- epg data which populates slowly and is incomplete
- minor block distortion in streams, often coinciding with epggrab tuning to a different transponder or a background stream starting
- rarely: completely corrupted (unwatchable) streams when tuning while an epg grab is in progress (only on some transponders, so this could be a separate problem)
Similar problems have been reported by others in various tickets (some still open).
I know that these problems can be caused by faulty hardware or a poor signal,
but I do not think that is the case here: the problems occur on strong transponders,
and they also occur when the transponders involved do not share any LNB or switch.
Based on a hunch that some threads cannot keep up with the incoming data,
I instrumented the code to find out how long threads hold the global_lock
and/or how long they wait to acquire it.
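The instrumentation is essentially a wrapper that measures how long acquiring the
mutex takes and logs slow acquisitions. The idea is roughly the following
(simplified sketch; the function name and the 200ms threshold are illustrative,
not my exact code):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

/* Measure how long it takes to acquire a mutex and log acquisitions
 * slower than a threshold. */
static void timed_mutex_lock(pthread_mutex_t *mtx, const char *where)
{
  struct timespec t0, t1;
  long wait_ms;

  clock_gettime(CLOCK_MONOTONIC, &t0);
  pthread_mutex_lock(mtx);
  clock_gettime(CLOCK_MONOTONIC, &t1);
  wait_ms = (t1.tv_sec - t0.tv_sec) * 1000 +
            (t1.tv_nsec - t0.tv_nsec) / 1000000;
  if (wait_ms > 200)
    fprintf(stderr, "global_lock: waited %ld ms in %s\n", wait_ms, where);
}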
My tests are not finished: I have not been able to instrument all occurrences
of global_lock, and similar problems could occur for other locks.
However, I have already discovered several problems:
1. mpegts_input_table_thread has two phases: wait for data, then process it.
The processing phase calls tvh_mutex_lock(&global_lock). This call often takes
a very long time: in my current log (covering 1 day) I have 100 cases where it
takes longer than 1 second, 14 cases where it takes over 4 seconds, and 3 cases
where it takes 13 seconds! This is a time-critical thread and these wait times
are much too long. They could explain why epg data is lost. More importantly,
they show that some thread(s) keep global_lock locked too long, and/or that the
lock is too heavily contended. A good result should be well below 200ms
(preferably much less).
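To make the pattern concrete, here is a simplified sketch of the two-phase
structure (illustrative names, not the real tvheadend code). While the thread is
blocked in phase 2 waiting for global_lock, the input side keeps queueing table
sections, which is how the queue can grow past its 2MB limit:

#include <pthread.h>

extern pthread_mutex_t global_lock;     /* stands in for tvheadend's global_lock */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queue_cond = PTHREAD_COND_INITIALIZER;
static int queued_sections;             /* stands in for the real table queue */

static void *table_thread(void *aux)
{
  for (;;) {
    /* phase 1: wait until table data has been queued */
    pthread_mutex_lock(&queue_lock);
    while (queued_sections == 0)
      pthread_cond_wait(&queue_cond, &queue_lock);
    queued_sections--;
    pthread_mutex_unlock(&queue_lock);

    /* phase 2: process the data under the global lock; this is the
     * acquisition that I measured taking up to 13 seconds */
    pthread_mutex_lock(&global_lock);
    /* ... dispatch the table section ... */
    pthread_mutex_unlock(&global_lock);
  }
  return NULL;
}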
2. mtimer_thread runs delayed callbacks. I do not understand this code in detail,
but I see that these callbacks are often launched in bursts (e.g., at the end of
scanning one epg transponder).
The callbacks are started as follows:
tvh_mutex_lock(&global_lock);
dttime_init();
if (mtimer_running == mti) {
  tprofile_start(&mtimer_profile, id);
  cb(mti->mti_opaque);
  tprofile_finish(&mtimer_profile);
}
dttime(500);
tvh_mutex_unlock(&global_lock);
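(The dttime_init()/dttime() calls are the timing instrumentation mentioned above;
the idea is roughly the following simplified sketch, not the exact code. The
argument is the threshold in milliseconds above which a message is logged.)

#include <stdio.h>
#include <time.h>

/* dttime_init() records a start timestamp; dttime(limit) logs a message
 * when more than `limit` milliseconds have passed since dttime_init()
 * in the same block. */
#define dttime_init() \
  struct timespec _dt_start; clock_gettime(CLOCK_MONOTONIC, &_dt_start)

#define dttime(limit) do { \
    struct timespec _dt_now; \
    long _dt_ms; \
    clock_gettime(CLOCK_MONOTONIC, &_dt_now); \
    _dt_ms = (_dt_now.tv_sec - _dt_start.tv_sec) * 1000 + \
             (_dt_now.tv_nsec - _dt_start.tv_nsec) / 1000000; \
    if (_dt_ms > (limit)) \
      fprintf(stderr, "mtimer callback took %ld ms (limit %d ms)\n", \
              _dt_ms, (limit)); \
  } while (0)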
In 50 cases, my logs show times longer than 500ms for some callbacks.
As these callbacks run with global_lock held, they delay other threads.
Even if a single run takes only a few hundred ms, remember that these calls
occur in bursts, so when an epg scan ends, many of these brief delays add up
to a very large one (for example, a burst of 20 callbacks that each hold the
lock for 300ms can keep a waiting thread blocked for up to 6 seconds in total).
In fact, in 42 cases the delay for a single callback is longer than 1 second,
and in 6 cases it is longer than 4 seconds.
These times are far too long. This could be what causes the delays seen in
mpegts_input_table_thread.
3. I also see other delays (long mutex hold times) in save_thread (idnode.c), but
these could be false alarms, as the locking/unlocking there is a bit complicated
and my timing code might measure the wrong thing. So I still need to check those.
My current conclusion is that some callbacks run by mtimer_thread are taking too
much time and cause some or all of the problems I have noticed.
As the global_lock is used in many places, it is essential that it is never held
for a long period of time. As it is also taken very often, there is a big risk of
livelock.
Note that in some parts of the code, lock/unlock pairs occur close to each other,
which is good: it helps to identify problems and makes it obvious that global_lock
is taken.
However, in other parts of the code, global_lock is acquired and released in
non-obvious places (e.g., in separate functions). This is quite risky, as it hides
the fact that the lock is held.
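For example (simplified, self-contained sketch with made-up function names;
plain pthread mutexes stand in for tvh_mutex here):

#include <pthread.h>

static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in */

/* Clear: lock and unlock are visible in the same function, so the extent
 * of the critical section is obvious to the reader. */
static void update_status(void)
{
  pthread_mutex_lock(&global_lock);
  /* ... short critical section ... */
  pthread_mutex_unlock(&global_lock);
}

/* Risky: the lock is taken in one helper and released in another, so a
 * reader of do_work() cannot easily see that global_lock is held across
 * the whole (possibly long-running) middle part. */
static void begin_work(void)  { pthread_mutex_lock(&global_lock); }
static void finish_work(void) { pthread_mutex_unlock(&global_lock); }

static void do_work(void)
{
  begin_work();
  /* ... long-running code here silently holds global_lock ... */
  finish_work();
}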
Files