Bug #5366
EPG text is badly encoded and needs cleaning
Description
EPG text supplied by Freeview in the UK contains some illegal characters, apparently caused by broken encoding systems somewhere in the transmission path. The most common example is the occurrence of the byte 0x19, which appears to be the low byte of the Unicode character U+2019 (right single quotation mark), sent without the high byte and in the wrong character set.
To avoid propagating these errors to other parts of the system, the encoding should be corrected/substituted as early in its path through TVH as possible. There is a thread on the forum https://tvheadend.org/boards/5/topics/35265?r=35325 that explains more about the issue, gives a list of the most likely incorrect character codes and contains references to external sources that explain the problem in detail.
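As an illustration of the kind of substitution being asked for, here is a minimal Python sketch that works on already-decoded EPG strings (for example, text fetched through the API). It is not tvheadend code: the function names and the mapping table are hypothetical, and the single 0x19 entry simply follows the guess above that the byte stands for U+2019. Other codes listed in the forum thread would have to be added to the table by hand.

```python
# Hypothetical mapping from stray control characters to the intended
# punctuation. Only 0x19 comes from this report; treating it as U+2019
# is an assumption.
SUSPECT_MAP = {0x19: "\u2019"}

# Control characters that are normal in plain text and are left alone.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def find_suspect_chars(text):
    """Return (index, code point) for each unexpected C0 control character."""
    return [
        (i, ord(ch))
        for i, ch in enumerate(text)
        if ord(ch) < 0x20 and ord(ch) not in ALLOWED_CONTROLS
    ]


def clean_epg_text(text):
    """Substitute known stray characters; leave everything else untouched."""
    return "".join(SUSPECT_MAP.get(ord(ch), ch) for ch in text)


sample = "A basketball player\x19s lover"
print(find_suspect_chars(sample))
print(clean_epg_text(sample))
```

Keeping detection and substitution separate means unknown control characters can still be reported without being silently altered.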
History
Updated by Jaroslav Kysela almost 6 years ago
src/epggrab/support/freesat_huffman.c
src/input/mpegts/dvb_support.c / dvb_get_string()
Updated by Dave H almost 6 years ago
Jaroslav Kysela wrote:
src/epggrab/support/freesat_huffman.c
src/input/mpegts/dvb_support.c / dvb_get_string()
I'm not quite sure what these mean, but the problem is [apparently] broken technology in the terrestrial Freeview chain. The errors do not occur in the Freesat stream. So I'm dubious whether freesat_huffman.c would be the place to do anything, since it is presumably used for Freesat decoding as well as perhaps for some Freeview T2 encoding that also uses Huffman. dvb_get_string also seems likely to be more widely applied than just Freeview, but I haven't looked in detail. Apologies if my noobiness is embarrassing.
Updated by Jaroslav Kysela almost 6 years ago
OK, it should be checked whether there are strings coded using the Huffman code and whether the problem is there, or whether the string conversion routines (the second source file) should be updated to skip broken characters. I just pointed to the source code that should be checked.
Updated by Jaroslav Kysela almost 6 years ago
Perhaps, it might be better to show me the input MPEG-TS stream (save the mux for cca 60 seconds).
Updated by Dave H almost 6 years ago
Jaroslav Kysela wrote:
Perhaps, it might be better to show me the input MPEG-TS stream (save the mux for cca 60 seconds).
Sorry, I have no idea how to do that. If you can explain, or point me to an explanation for tyros, I'll have a go.
Updated by Pablo R. almost 6 years ago
Dave H wrote:
Jaroslav Kysela wrote:
Perhaps, it might be better to show me the input MPEG-TS stream (save the mux for cca 60 seconds).
Sorry, I have no idea how to do that. If you can explain, or point me to an explanation for tyros, I'll have a go.
With wget or curl and the stream url
e.g.
wget http://<IP>:9981/stream/service/7a3e8b99bb1d52d90c1d435fc19605a4?profile=pass
--> You can find this URL by the service/mux grid on tvh editing that file
for 60 seconds, then stop (ctrl+C) and upload.
Updated by saen acro almost 6 years ago
Dave H wrote:
Sorry, I have no idea how to do that. If you can explain, or point me to an explanation for tyros, I'll have a go.
curl -m 60 --output tvh_.ts http://tvh.ip:9981/stream/channelid/1792384735
Updated by Dave H almost 6 years ago
Pablo R. wrote:
--> You can find this URL by the service/mux grid on tvh editing that file
Sorry, I don't understand 'service/mux grid'. I capture the EPG using the API and my program can tell me which programme information has bad characters in it.
saen acro wrote:
curl -m 60 --output tvh_.ts http://tvh.ip:9981/stream/channelid/1792384735
With tvh.ip == localhost that produces
$ cat tvh_.ts
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>400 Bad Request</TITLE>
</HEAD><BODY>
<H1>400 Bad Request</H1>
</BODY></HTML>
Updated by saen acro almost 6 years ago
Dave H wrote:
Pablo R. wrote:
--> You can find this URL by the service/mux grid on tvh editing that file
Sorry, I don't understand 'service/mux grid'. I capture the EPG using the API and my program can tell me which programme information has bad characters in it.
saen acro wrote:
curl -m 60 --output tvh_.ts http://tvh.ip:9981/stream/channelid/1792384735
With tvh.ip == localhost that produces
$ cat tvh_.ts
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>400 Bad Request</TITLE>
</HEAD><BODY>
<H1>400 Bad Request</H1>
</BODY></HTML>
Do you know what an example is?
Adapt the example to your own case.
Updated by Dave Pickles almost 6 years ago
Hopefully this is what's needed.
The problem shows in the Short Event Descriptor for the event at 2018/12/01 13:30:00 on "Sony Crime Channel". In the phrase "A basketball players lover", the apostrophe between the 'r' and 's' of 'players' has been encoded as 0x19.
Dave H: This is the command I used:
wget http://user:[email protected]:9981/stream/service/c90f9607480f802d0103e799cc30d585?profile=pass
To find the uuid I used
curl http://user:[email protected]:9981/api/mpegts/service/grid | json_pp
and hunted for the correct service name.
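The hunting step could be scripted. The following Python sketch assumes the grid response is a JSON object with an "entries" list whose items carry a "uuid" key; those key names, the "svcname" field, and the sample data are assumptions made for illustration and should be checked against real output. To stay independent of the exact name field, it searches every string value in each entry.

```python
import json


def find_service_uuids(grid, name):
    """Return (uuid, matching value) pairs for entries mentioning `name`."""
    needle = name.lower()
    matches = []
    for entry in grid.get("entries", []):
        for value in entry.values():
            # Compare case-insensitively against any string field.
            if isinstance(value, str) and needle in value.lower():
                matches.append((entry.get("uuid"), value))
                break
    return matches


# Made-up fragment shaped like the assumed grid output (not real data).
example = json.loads("""
{"entries": [
  {"uuid": "00000000000000000000000000000001", "svcname": "Example One"},
  {"uuid": "00000000000000000000000000000002", "svcname": "Sony Crime Channel"}
], "total": 2}
""")
print(find_service_uuids(example, "sony crime"))
```

In practice the dictionary would come from `json.load` applied to the saved curl output rather than from the inline sample.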
Updated by Jaroslav Kysela almost 6 years ago
Save the MUX (use mux play URL from the mux grid from webui - left icon).
Updated by Dave Pickles almost 6 years ago
- File sample2a.ts added
It seems large files hit a server limit. Trying again.
Updated by Jaroslav Kysela almost 6 years ago
Thanks. Tvheadend uses the standard ISO6937 conversion routines for this, and the text really does contain an invalid 0x19 character, for example:
2018-12-01 19:26:44.701 [ TRACE]:tbl-eit: 42 6F 6F 6B 3A 20 41 20 62 61 73 6B 65 74 62 61 Book: A basketba
2018-12-01 19:26:44.701 [ TRACE]:tbl-eit: 6C 6C 20 70 6C 61 79 65 72 19 73 20 6C 6F 76 65 ll player.s love
2018-12-01 19:26:44.701 [ TRACE]:tbl-eit: 72 20 6B 69 6C 6C 73 20 68 69 73 20 67 69 72 6C r kills his girl
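The trace can be checked by hand. This small Python sketch takes the hex bytes copied from the trace above, locates the stray byte, and shows the text with an apostrophe substituted. Replacing 0x19 with U+2019 is the assumption discussed in this issue, not something the stream itself confirms.

```python
# Hex bytes copied from the three tbl-eit trace lines.
TRACE_HEX = (
    "42 6F 6F 6B 3A 20 41 20 62 61 73 6B 65 74 62 61 "
    "6C 6C 20 70 6C 61 79 65 72 19 73 20 6C 6F 76 65 "
    "72 20 6B 69 6C 6C 73 20 68 69 73 20 67 69 72 6C"
)

data = bytes.fromhex(TRACE_HEX)

# Offset of the stray byte within the dumped fragment.
offset = data.index(0x19)

# Every other byte in this fragment is plain ASCII, so after swapping in
# the UTF-8 form of U+2019 the whole fragment decodes as UTF-8.
repaired = data.replace(b"\x19", "\u2019".encode("utf-8"))
text = repaired.decode("utf-8")

print(offset, text)
```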
Updated by Jaroslav Kysela almost 6 years ago
I added a filter for the wrong characters to the latest code: https://tvheadend.org/projects/tvheadend/repository/revisions/3ae6d947a4d074b3498e59f82d5a860273b0ae7f/diff
The question is whether there is demand for a new ISO6937 conversion routine that also accepts those broken characters. Something like 'ISO6937-extra' or so.
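To make the two options concrete, here is a rough Python sketch of the difference between filtering stray bytes and accepting them through an extra table. It is not based on the linked commit. The names `EXTRA_BYTES` and `decode_extra` are hypothetical, mapping 0x19 to an ASCII apostrophe is an assumption, and `raw` is assumed to be the text payload after any leading character-table selector byte. Python's standard library has no ISO 6937 codec, so Latin-1 is used purely as a stand-in decoder for the demonstration.

```python
# Hypothetical extra table: stray byte -> replacement bytes, applied
# before the real conversion. 0x19 -> apostrophe is an assumption.
EXTRA_BYTES = {0x19: b"'"}


def decode_extra(raw, decode, policy="map"):
    """Pre-process stray bytes below 0x20, then hand the rest to `decode`.

    policy "map"  substitutes bytes listed in EXTRA_BYTES and drops the rest;
    policy "drop" removes every stray byte below 0x20.
    """
    out = bytearray()
    for b in raw:
        if b < 0x20:
            if policy == "map" and b in EXTRA_BYTES:
                out += EXTRA_BYTES[b]
            # Any other stray byte is skipped.
        else:
            out.append(b)
    return decode(bytes(out))


def latin1(payload):
    """Stand-in for a real ISO 6937 conversion routine."""
    return payload.decode("latin-1")


print(decode_extra(b"player\x19s", latin1))
print(decode_extra(b"player\x19s", latin1, policy="drop"))
```

With the "drop" policy the apostrophe is lost and "player's" becomes "players", which is the trade-off behind the question of a separate tolerant routine.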