Bug #5366

EPG text is badly encoded and needs cleaning

Added by Dave H almost 6 years ago. Updated almost 6 years ago.

Status: New
Priority: Normal
Assignee:
Category: EPG
Target version: -
Start date: 2018-11-29
Due date:
% Done: 0%
Estimated time:
Found in version: 4.2.6 and others
Affected Versions:

Description

EPG text supplied by Freeview in the UK contains some illegal characters, apparently caused by broken encoding systems somewhere in the transmission path. The most common example is the byte 0x19, which appears to be the low byte of the Unicode character U+2019 (right single quotation mark), sent without its high byte and in the wrong character set.

To avoid propagating these errors to other parts of the system, the encoding should be corrected/substituted as early in its path through TVH as possible. There is a thread on the forum https://tvheadend.org/boards/5/topics/35265?r=35325 that explains more about the issue, gives a list of the most likely incorrect character codes and contains references to external sources that explain the problem in detail.
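
As a rough illustration of the kind of early substitution requested here, the following Python sketch replaces suspect control bytes with ASCII punctuation before any character set conversion takes place. It is not tvheadend code: the function name and the table are hypothetical, and only the 0x19 entry comes from this report. The other entries are assumptions made by analogy (low byte of a U+20xx punctuation character).

```python
# Low byte of U+2019 when encoded as UTF-16 big-endian (bytes 0x20 0x19).
LOW_BYTE = "\u2019".encode("utf-16-be")[1]

# Hypothetical substitution table, applied to the raw bytes.
SUBSTITUTIONS = {
    0x19: b"'",  # reported here: low byte of U+2019 (right single quote)
    0x18: b"'",  # assumption: low byte of U+2018 (left single quote)
    0x1C: b'"',  # assumption: low byte of U+201C (left double quote)
    0x1D: b'"',  # assumption: low byte of U+201D (right double quote)
}


def clean_epg_bytes(raw: bytes) -> bytes:
    """Replace suspect control bytes; pass every other byte through unchanged."""
    out = bytearray()
    for b in raw:
        out += SUBSTITUTIONS.get(b, bytes([b]))
    return bytes(out)


if __name__ == "__main__":
    print(hex(LOW_BYTE))
    print(clean_epg_bytes(b"player\x19s love"))
```

Working on bytes rather than decoded text keeps the sketch independent of whichever character set conversion runs afterwards.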


Files

sample.ts (14.1 MB) Dave Pickles, 2018-12-01 10:48
sample2a.ts (93.1 MB) 30 second MUX sample, Dave Pickles, 2018-12-01 18:03

History

#1

Updated by Jaroslav Kysela almost 6 years ago

src/epggrab/support/freesat_huffman.c
src/input/mpegts/dvb_support.c / dvb_get_string()

#2

Updated by Dave H almost 6 years ago

Jaroslav Kysela wrote:

src/epggrab/support/freesat_huffman.c
src/input/mpegts/dvb_support.c / dvb_get_string()

I'm not quite sure what these mean, but the problem is [apparently] broken technology in the terrestrial Freeview chain. The errors do not occur in the Freesat stream. So I'm dubious whether freesat_huffman.c would be the place to do anything, since it is presumably used for Freesat decoding as well as perhaps for some Freeview T2 encoding that also uses Huffman. dvb_get_string also seems likely to be more widely applied than just Freeview, but I haven't looked in detail. Apologies if my noobiness is embarrassing.

#3

Updated by Jaroslav Kysela almost 6 years ago

OK, it should be checked whether any strings are coded using the Huffman code and whether the problem lies there, or whether the string conversion routines (the second source file) should be updated to skip broken characters. I just pointed to the source code that should be checked.

#4

Updated by Jaroslav Kysela almost 6 years ago

Perhaps it would be better to show me the input MPEG-TS stream (save the mux for about 60 seconds).

#5

Updated by Dave H almost 6 years ago

Jaroslav Kysela wrote:

Perhaps it would be better to show me the input MPEG-TS stream (save the mux for about 60 seconds).

Sorry, I have no idea how to do that. If you can explain, or point me to an explanation for tyros, I'll have a go.

#6

Updated by Pablo R. almost 6 years ago

Dave H wrote:

Jaroslav Kysela wrote:

Perhaps it would be better to show me the input MPEG-TS stream (save the mux for about 60 seconds).

Sorry, I have no idea how to do that. If you can explain, or point me to an explanation for tyros, I'll have a go.

Use wget or curl with the stream URL, e.g.

wget http://<IP>:9981/stream/service/7a3e8b99bb1d52d90c1d435fc19605a4?profile=pass

--> You can find this URL in the service/mux grid in tvh by editing the entry.

Let it run for 60 seconds, then stop it (Ctrl+C) and upload the file.
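
For anyone who prefers a script to stopping wget by hand, here is a minimal Python sketch of the same time-limited capture. The port and the /stream/service/<uuid>?profile=pass path are taken from the example above. The helper names are hypothetical, and the sketch assumes the server needs no authentication.

```python
import time
import urllib.request


def service_stream_url(host: str, uuid: str, port: int = 9981) -> str:
    """Build the pass-through stream URL for one service."""
    return f"http://{host}:{port}/stream/service/{uuid}?profile=pass"


def capture(url: str, out_path: str, seconds: float = 60.0,
            chunk_size: int = 188 * 512) -> int:
    """Copy the stream to out_path until the time limit or end of stream.

    Returns the number of bytes written.
    """
    deadline = time.monotonic() + seconds
    written = 0
    with urllib.request.urlopen(url) as resp, open(out_path, "wb") as out:
        while time.monotonic() < deadline:
            chunk = resp.read(chunk_size)
            if not chunk:  # the server closed the stream
                break
            out.write(chunk)
            written += len(chunk)
    return written
```

A call would look like capture(service_stream_url("<IP>", "<uuid>"), "sample.ts"), with the real address and service uuid filled in.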

#7

Updated by saen acro almost 6 years ago

Dave H wrote:

Sorry, I have no idea how to do that. If you can explain, or point me to an explanation for tyros, I'll have a go.

curl -m 60 --output tvh_.ts http://tvh.ip:9981/stream/channelid/1792384735

#8

Updated by Dave H almost 6 years ago

Pablo R. wrote:

--> You can find this URL in the service/mux grid in tvh by editing the entry.

Sorry, I don't understand 'service/mux grid'. I capture the EPG using the API and my program can tell me which programme information has bad characters in it.

saen acro wrote:

curl -m 60 --output tvh_.ts http://tvh.ip:9981/stream/channelid/1792384735

With tvh.ip == localhost that produces
$ cat tvh_.ts
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>400 Bad Request</TITLE>
</HEAD><BODY>
<H1>400 Bad Request</H1>
</BODY></HTML>

#9

Updated by saen acro almost 6 years ago

Dave H wrote:

Pablo R. wrote:

--> You can find this URL in the service/mux grid in tvh by editing the entry.

Sorry, I don't understand 'service/mux grid'. I capture the EPG using the API and my program can tell me which programme information has bad characters in it.

saen acro wrote:

curl -m 60 --output tvh_.ts http://tvh.ip:9981/stream/channelid/1792384735

With tvh.ip == localhost that produces
$ cat tvh_.ts
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML><HEAD>
<TITLE>400 Bad Request</TITLE>
</HEAD><BODY>
<H1>400 Bad Request</H1>
</BODY></HTML>

Do you know what an example is? :D
Adapt the example to your own case.

#10

Updated by Dave Pickles almost 6 years ago

Hopefully this is what's needed.

The problem shows in the Short Event Descriptor for the event at 2018/12/01 13:30:00 on "Sony Crime Channel". In the phrase "A basketball player's lover", the apostrophe between the 'r' and 's' of "player's" has been encoded as 0x19.

Dave H: This is the command I used:
wget http://user:<password>@<IP>:9981/stream/service/c90f9607480f802d0103e799cc30d585?profile=pass

To find the uuid I used
curl http://user:<password>@<IP>:9981/api/mpegts/service/grid | json_pp
and hunted for the correct service name.
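
The hunting step can also be scripted. The Python sketch below searches an already parsed grid response for a service name. It assumes the response is a JSON object with an "entries" list whose items carry "uuid" and "svcname" fields. Those field names are an assumption and should be checked against the actual json_pp output. The function name and the sample data are made up for illustration.

```python
import json


def find_service_uuids(grid: dict, name: str) -> list:
    """Return the uuids of all entries whose service name contains `name`."""
    wanted = name.lower()
    return [
        entry["uuid"]
        for entry in grid.get("entries", [])  # assumed top-level key
        if wanted in str(entry.get("svcname", "")).lower()  # assumed field name
    ]


if __name__ == "__main__":
    # Made-up sample in the assumed shape, not real server output.
    sample = json.loads("""
    {"entries": [
        {"uuid": "00000000000000000000000000000001", "svcname": "Example One"},
        {"uuid": "00000000000000000000000000000002", "svcname": "Sony Crime Channel"}
    ]}
    """)
    print(find_service_uuids(sample, "sony crime"))
```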

#11

Updated by Jaroslav Kysela almost 6 years ago

Save the whole MUX (use the mux play URL from the mux grid in the web UI; it is the left icon).

#12

Updated by Dave Pickles almost 6 years ago

Apologies, try this. The same example is included.

#13

Updated by Dave Pickles almost 6 years ago

It seems large files hit a server limit. Trying again.

#14

Updated by Jaroslav Kysela almost 6 years ago

Thanks. tvheadend uses the standard ISO6937 conversion routines for this, and the text really does contain an invalid 0x19 character, for example:

2018-12-01 19:26:44.701 [  TRACE]:tbl-eit: 42 6F 6F 6B 3A 20 41 20 62 61 73 6B 65 74 62 61 Book: A basketba
2018-12-01 19:26:44.701 [  TRACE]:tbl-eit: 6C 6C 20 70 6C 61 79 65 72 19 73 20 6C 6F 76 65 ll player.s love
2018-12-01 19:26:44.701 [  TRACE]:tbl-eit: 72 20 6B 69 6C 6C 73 20 68 69 73 20 67 69 72 6C r kills his girl

https://en.wikipedia.org/wiki/ISO/IEC_6937
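
To check a trace like the one above without reading the hex by eye, a small Python sketch can rebuild the bytes and list every value below 0x20. The hex rows are copied from the trace. The helper name is hypothetical, and treating every byte below 0x20 as suspect is a simplifying assumption for this sample.

```python
# Hex columns copied from the three tbl-eit trace lines above.
HEX_ROWS = [
    "42 6F 6F 6B 3A 20 41 20 62 61 73 6B 65 74 62 61",
    "6C 6C 20 70 6C 61 79 65 72 19 73 20 6C 6F 76 65",
    "72 20 6B 69 6C 6C 73 20 68 69 73 20 67 69 72 6C",
]


def find_control_bytes(data: bytes) -> list:
    """Return (offset, value) pairs for every byte below 0x20."""
    return [(i, b) for i, b in enumerate(data) if b < 0x20]


data = bytes.fromhex(" ".join(HEX_ROWS))

if __name__ == "__main__":
    for offset, value in find_control_bytes(data):
        print(f"offset {offset}: 0x{value:02X}")
```

On this sample the only hit is the 0x19 at offset 25, between "player" and "s".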

#15

Updated by Jaroslav Kysela almost 6 years ago

I added a filter for the wrong characters to the latest code: https://tvheadend.org/projects/tvheadend/repository/revisions/3ae6d947a4d074b3498e59f82d5a860273b0ae7f/diff

The question is whether there is demand for a new ISO6937 conversion routine that also accepts those broken characters. Something like 'ISO6937-extra' or so.
