Errors in JSON
Added by Dave H almost 6 years ago
I'm experimenting with using the API but am having a problem when I retrieve the EPG. (UK Freeview OTA EPG). My JSON decoder is refusing to parse the EPG because of an invalid character. Specifically there's currently a section of the EPG that reads 'Yasser(0x19)s got drivetime'. Now 0x19 is not a valid JSON character.
Note: the web software (sensibly?) appears not to include the actual character so I've added (0x19) to indicate where it is.
My guess is that one of the TV channels is broadcasting this character in a string, but whatever TVH is using to encode the strings as JSON is failing to escape it properly.
Section 7 of RFC 8259 says:
'All Unicode characters may be placed within the
quotation marks, except for the characters that MUST be escaped:
quotation mark, reverse solidus, and the control characters (U+0000
through U+001F).'
Is this correct and can it be fixed?
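For illustration, here is a minimal Python sketch (with a made-up payload) of why a strict JSON parser rejects the feed, and how the same character is fine once escaped per RFC 8259:

```python
import json

# A string containing a raw U+0019 control character, as seen in the EPG feed.
bad = '{"title": "Yasser\x19s got drivetime"}'

try:
    json.loads(bad)  # strict mode (the default) rejects unescaped control chars
except json.JSONDecodeError as e:
    print("rejected:", e.msg)

# The same character is legal JSON once escaped as \u0019:
ok = json.loads('{"title": "Yasser\\u0019s got drivetime"}')
print(ok["title"] == "Yasser\x19s got drivetime")  # True
```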
Replies (16)
RE: Errors in JSON - Added by Em Smith almost 6 years ago
What does your proper (non-tvh/non-kodi) TV guide show?
Sure, tvh should encode those characters, but it shouldn't be receiving characters in that particular byte range.
RE: Errors in JSON - Added by Dave H almost 6 years ago
What does your proper (non-tvh/non-kodi) TV guide show?
I'm not sure what you mean by this? You mean if I look at my TV or something?
If so, I have no idea. I don't know which channel is showing a programme sometime in the next week with that text in some field or other, so I can't find it. My guess would be that the TV filters it to something sensible to display.
Sure, tvh should encode those characters, but it shouldn't be receiving characters in that particular byte range.
Agreed, but the important point is that TVH should be filtering them before/whilst encoding.
RE: Errors in JSON - Added by Em Smith almost 6 years ago
But you said the dialog in the UI wasn't showing the character? So, the dialog has start time and channel.
Type "Yasser.*got drivetime" into the tvh EPG filter and it will narrow it down to the programmes, then go to your TV and see what it displays. I'd assume it's an apostrophe, but it would be interesting to know if that has a bad character. I know my old TV would sometimes display a bad-character symbol.
Unfortunately I can't check.
There's probably no point in tvh sending an encoded control character, since nothing can display it and it just causes problems downstream, so it should probably be dropped. But there's no real point dropping it in the JSON encoder, since the XMLTV encoder will likely have the same problem.
So, the question is whether it's bad data being broadcast, or if it's good data being mis-interpreted or incorrectly decoded.
RE: Errors in JSON - Added by Dave Pickles almost 6 years ago
I spotted this a few months ago. I had planned to raise a bug report against TVHeadend, on the basis that it should not be emitting unparseable JSON, but looking deeper it does seem to be the broadcasters' fault. The Unicode code points for left and right single quotes are U+2018 and U+2019 respectively, and the low bytes of these are somehow getting into the EPG. It seems to affect the Summary field of the 'minor' channels.
Before parsing EPG JSON I strip out characters \x00 - \x0f.
EDIT: The problem is worse than I remembered. There are currently 680 corrupted entries on the Freeview EPG, including many from the BBC radio channels.
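A minimal Python sketch of that kind of workaround (hypothetical function name; note this strips the whole C0 range U+0000-U+001F, which covers the bytes seen here):

```python
import json
import re

# Strip C0 control characters from the raw feed before parsing.
# Strict JSON forbids them unescaped inside strings anyway.
CTRL = re.compile(r'[\x00-\x1f]')

def parse_lenient(raw: str):
    return json.loads(CTRL.sub('', raw))

print(parse_lenient('{"summary": "Yasser\x19s got drivetime"}'))
```

An alternative is `json.loads(raw, strict=False)`, which tells Python's parser to tolerate raw control characters inside strings instead of removing them.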
RE: Errors in JSON - Added by Dave H almost 6 years ago
But you said the dialog in the UI wasn't showing the character?
No, I meant this web software - the forum. What's mildly interesting is that it does show in the email notifications; just not in the forum web software itself.
Type "Yasser.*got drivetime" in to the tvh EPG filter
Ah yes, thanks Em. My brain wasn't working last night. It shows up the same way in the TVH web UI and in my Humax's EPG: they just drop the bad character, which as Dave says is presumably meant to be an apostrophe, but however they're doing the encoding is broken.
But following Postel's law implies that filtering these bad characters out should be done once where they're encountered, not repeatedly in every piece of software using this supposedly JSON feed. Nowadays it's good practice to sanitise all input anyway, in case it's part of an injection attack.
RE: Errors in JSON - Added by Em Smith almost 6 years ago
That's interesting about U+2018 and U+2019 being the apostrophe (or rather the right single quotation mark). Given that 0x20 is an ASCII space, and the generated description string has no space there, that suggests to me that if the DVB stream is sending 0x2019 to represent an apostrophe, then we are probably AND-ing it with 0xFF, stripping the 0x20 part and keeping just the 0x19, which is then causing the problems.
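That theory is easy to check with a couple of lines of Python:

```python
# If U+2019 (right single quotation mark) is masked down to one byte,
# only the low byte 0x19 survives; the discarded high byte is 0x20.
quote = 0x2019
low_byte = quote & 0xFF
high_byte = quote >> 8

print(hex(low_byte))   # 0x19, the invalid control character seen in the EPG
print(hex(high_byte))  # 0x20, which happens to be ASCII space
```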
I had a play setting my charset for that service to utf-8, auto, and iso-8859-1, and doing a re-scan, but it didn't seem to make a difference to me.
I couldn't see an obvious place in the code where we were stripping the character, and, unfortunately, dvbsnoop doesn't work on my system so I couldn't look at a raw capture to see what is being sent.
So, one for Jaroslav.
RE: Errors in JSON - Added by Dave Pickles almost 6 years ago
The naked \x18 and \x19 are present in the EIT tables. This is from a data capture using 'dvbepg' - not sure where I got it from, possibly it was part of dvb-tools at one time:
"2018-12-1 13:30:00",23808,1800,"Bizarre Murders","Little Black Book: A basketball player^Ys lover kills his girlfriend to eliminate her as a love rival, but is caught confessing details of the murder to him. (S4 Ep8)",3,7942,"/E4202019","/S4975814"
The ^Y is the right-single-quote character.
EDIT: According to ETSI EN 300 468, the broadcaster should be sending \xA9 and \xB9 for left and right single quotes.
RE: Errors in JSON - Added by Em Smith almost 6 years ago
That's interesting (table A.1). But, if the character set were something different (table A.3) then it might explain the 0x19.
I don't know if you can get the raw data? I'd be happy to just map 0x19 to an ascii apostrophe, but that's because I think the official "curly" apostrophe character is difficult for anyone to type!
RE: Errors in JSON - Added by Dave Pickles almost 6 years ago
The EPG entry I captured above is for Sony Crime channel. My (Panasonic) TV just drops the invalid character.
I recall something similar to this at work, when folks tried to edit code with M$ Word and then wondered why the compiler choked on the funny quote marks...
RE: Errors in JSON - Added by Dave Pickles almost 6 years ago
Here are few comments from someone else trying to understand the problem:
https://ukfree.tv/extras/inreplyto/836083
RE: Errors in JSON - Added by Dave H almost 6 years ago
I don't know if you can get the raw data? I'd be happy to just map 0x19 to an ascii apostrophe, but that's because I think the official "curly" apostrophe character is difficult for anyone to type!
I recall something similar to this at work, when folks tried to edit code with M$ Word then wondered why the compiler spat out the funny quote marks...
I suspect Dave may be close to the truth. I wouldn't be surprised if the text is typed by a person in a word processor like MS Word with 'smart quotes' switched on, then passed (cut-and-paste?) to some probably old submission script that understands only ASCII, not Unicode, and just passes through the bottom byte. TV and radio stations have some very smart tech, but they also have some real kludges.
I think deleting \x18 and \x19 would be OK, or converting them both to a single apostrophe \x27 would also be OK and doesn't require shuffling the buffer contents, so would be a little faster.
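A sketch of that same-length substitution in Python (hypothetical function name; TVHeadend itself is C, but the idea is the same):

```python
# Replace the stray low bytes of the curly single quotes (0x18, 0x19)
# with a plain ASCII apostrophe. A one-for-one substitution keeps the
# string the same length, so nothing has to be shuffled.
FIX = {0x18: 0x27, 0x19: 0x27}  # both map to "'"

def fix_quotes(s: str) -> str:
    return s.translate(FIX)

print(fix_quotes("Yasser\x19s got drivetime"))  # Yasser's got drivetime
```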
RE: Errors in JSON - Added by Dave H almost 6 years ago
Thanks Dave, the second link is extremely useful since it includes a table of the characters:
https://ukfree.tv/extras/inreplyto/842218
https://ukfree.tv/transmitters/tv/Winter_Hill/PGSTART1680/irt842218#b842218
RE: Errors in JSON - Added by Em Smith almost 6 years ago
For future reference, from that link:
"By observation, it has become evident that the invalid codes are the Least Significant Bytes of a group of Unicode punctuation characters."
0x10 = LSB of Unicode Character 'HYPHEN' (U+2010)
0x13 = LSB of Unicode Character 'EN DASH' (U+2013)
0x14 = LSB of Unicode Character 'EM DASH' (U+2014)
0x18 = LSB of Unicode Character 'LEFT SINGLE QUOTATION MARK' (U+2018)
0x19 = LSB of Unicode Character 'RIGHT SINGLE QUOTATION MARK' (U+2019)
0x1C = LSB of Unicode Character 'LEFT DOUBLE QUOTATION MARK' (U+201C)
0x1D = LSB of Unicode Character 'RIGHT DOUBLE QUOTATION MARK' (U+201D)
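Since every character in that table is U+20xx, the corruption is in principle reversible: adding 0x2000 back to a stray byte restores the intended punctuation. A hedged Python sketch (only safe for bytes known to come from this particular corruption):

```python
# Low bytes observed in the corrupted EPG, all LSBs of U+20xx punctuation.
PUNCT_LSBS = {0x10, 0x13, 0x14, 0x18, 0x19, 0x1C, 0x1D}

def restore(s: str) -> str:
    # Map each known stray byte back to its U+20xx character.
    return ''.join(chr(0x2000 + ord(c)) if ord(c) in PUNCT_LSBS else c
                   for c in s)

print(restore("Yasser\x19s got drivetime"))  # restores the curly apostrophe
```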
RE: Errors in JSON - Added by Dave H almost 6 years ago
Yes, starting from that and with the help of a Unicode table to catch a few more possibles I've added this hack to my prog:
# Map the stray low bytes 0x10-0x1F to plausible ASCII stand-ins:
# 0x10-0x15 -> '-', 0x16 -> '|', 0x17 -> '=', 0x18-0x1B -> "'", 0x1C-0x1F -> '"'.
# tr/// returns the number of characters translated, not just apostrophes.
my $erc = $encoded_result->content;
my $apostrophecount = ($erc =~ tr/\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F/------|=''''""""/);
RE: Errors in JSON - Added by Em Smith almost 6 years ago
Can someone raise a bug asking for this type of mapping to be added to tvh?
(If I raise it, it would look like I was going to fix it, but it's one for Jaroslav since I can see places it might be fixed, but don't know for sure).