Discussion on an 'Extensible Genre' Architecture
Added by Delta Mike Charlie over 1 year ago
I would like to share my thoughts on the state of the genre system within tvheadend, especially regarding EPG data sourced via XMLTV, in the hopes of initiating a discussion that will eventuate in a new sub-module for handling EPG genre information.
Firstly, this proposal is designed to be 100% backwards compatible with the existing system. Its implementation should, in theory, break nothing.
Internally TVH appears to use the DVB/ETSI genre classification system. In addition to this, it is only capable of ingesting a subset of genre values.
There are multiple Digital TV systems in the world with a variety of accompanying EPG genre systems. In addition to this, there may be a number of third party EPG providers with their own proprietary systems.
For simplification, I will consider the three stages of EPG processing as follows:
1) Acquisition
2) Storage
3) Output
I would like to propose a proprietary extension to the XMLTV schema, as well as some internal EPG storage and processing changes in order to accommodate a more customisable genre data handling mechanism.
ACQUISITION
An example of the existing XMLTV 'category' syntax is as follows:
<category lang="en">Documentary</category>
TVH reads the value of this tag and compares it to an internal list of pre-defined text values in order to arrive at a 1 byte code that will be used to store this genre.
Until this point, this is still 100% backwards compatible with the existing system. This system, however, is not extensible.
My first proposal is to add a 'system' attribute to the 'category' tag as follows:
<category system="NAME" lang="en">Documentary</category>
If present, the 'system' attribute, along with the 'lang' attribute will be used to identify a list of extensible genre acquisition rules to be used as a replacement for the existing hard-coded rules.
Together, the 'system' and 'lang' values will be used to identify a JSON file within the TVH XMLTV directory structure. There can be multiple JSON files to support multiple systems and multiple languages. These can be bundled with TVH or user-defined.
The JSON file will consist of an array of rules for matching a 'category' string to a value code. Each rule would consist of the following 4 properties:
1) 'regex' - This is a regular expression used for case-insensitive matching of a category string. It should be flexible enough to cater for dialect or punctuation differences, for example, 'programme' and 'program' or 'Science/Technology' and 'Science / Technology', yet not so flexible that it produces too many false positives. Should a category string match multiple rules, the first match will take priority.
2) 'description' - This is the actual genre description to use. It could be the same as the regex, but in most cases should not be.
3) 'code' - This is the proprietary numerical code for this 'system' that matches the text/regex provided.
4) 'etsi' - This is the equivalent ETSI numerical code that matches the text/regex provided.
Example:
XML
<category system="atsc" lang="en">Hunting/Fishing/Outdoors</category>
JSON Rules
{
"regex": "Hunting[ ]{0,1}[-/\\\\][ ]{0,1}Fishing[ ]{0,1}[-/\\\\][ ]{0,1}Outdoors",
"description": "Hunting/Fishing/Outdoors",
"code": "0x94",
"etsi": "0xA0"
}
The rule regex would match any of the following descriptions: 'Hunting/Fishing/Outdoors', 'Hunting / Fishing / Outdoors', 'Hunting - Fishing - Outdoors', 'Hunting-Fishing-Outdoors', 'Hunting\Fishing\Outdoors', etc.
A regex need not be as precise as the example provided; 'hunt|fish|outdoor' would also yield an adequate match.
Once matched, the code for this 'system' '0x94' would be stored in the 'extensible genre' field of the EPG record. The ETSI code '0xA0' (Leisure Hobbies (general)) would be stored in the existing 'genre' field of the EPG record.
Note: '<category system="atsc" lang="es">Caza/Pesca/Al aire libre</category>' would look up the Spanish rules file, match a Spanish regex, but still result in '0x94' and '0xA0' being stored as the relevant codes.
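To make the lookup concrete, here is a minimal Python sketch of the matching step (TVH itself is written in C, so this is illustrative only). The RULES table and match_category() are hypothetical names, the rules would really be loaded from the per-system, per-language JSON files, and the Spanish regex is an assumed example of what such a file might contain.

```python
import re

# Hypothetical rules keyed by (system, lang); in the proposal these would be
# loaded from JSON files in the TVH XMLTV directory structure.
RULES = {
    ("atsc", "en"): [
        {"regex": r"Hunting ?[-/\\] ?Fishing ?[-/\\] ?Outdoors",
         "description": "Hunting/Fishing/Outdoors",
         "code": "0x94", "etsi": "0xA0"},
    ],
    ("atsc", "es"): [
        {"regex": r"Caza ?[-/\\] ?Pesca ?[-/\\] ?Al aire libre",
         "description": "Caza/Pesca/Al aire libre",
         "code": "0x94", "etsi": "0xA0"},
    ],
}

def match_category(system, lang, text):
    """Return (system code, ETSI code) for the first matching rule, else None."""
    for rule in RULES.get((system, lang), []):
        # Case-insensitive match; the first rule that matches wins.
        if re.search(rule["regex"], text, re.IGNORECASE):
            return int(rule["code"], 16), int(rule["etsi"], 16)
    return None
```

Both the English and Spanish category strings resolve to the same pair of codes, which is the point of keeping the codes separate from the language-specific text.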
The use of regular expressions would come at the cost of additional CPU overhead during the XMLTV import process. In order to minimise this overhead, the JSON rules file could be sorted so that the most common genres (most likely to match) appear first. This could even be statistically determined as the XML EPG is loaded and then retained for future imports.
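A rough sketch of the statistical reordering idea, in Python for brevity. The function name reorder_by_hits() is hypothetical, and the rules are assumed to be dictionaries with a 'regex' key as described above.

```python
import re
from collections import Counter

def reorder_by_hits(rules, categories):
    """Return the rules sorted so that the most frequently matched come first."""
    hits = Counter()
    for text in categories:
        for index, rule in enumerate(rules):
            if re.search(rule["regex"], text, re.IGNORECASE):
                hits[index] += 1
                break  # first match wins, as during the import itself
    # sorted() is stable, so rules with equal hit counts keep their file order.
    order = sorted(range(len(rules)), key=lambda i: -hits[i])
    return [rules[i] for i in order]
```

One caveat: because the first match takes priority, reordering is only safe when no category string can match two rules with different codes. Overlapping rules would need to keep their relative order.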
My next proposal would result in regular expressions being avoided completely with the use of an additional custom attribute, 'code'.
Together, the 'system' and 'code' attributes could be used to bypass the regex process entirely and save the provided codes directly.
Example: <category system="atsc" code="0x3B" lang="en">Documentary</category>
The JSON rules file would be used to match ATSC '0x3B' with ETSI '0x23' directly. This would also make the tag text optional, as only the attributes would be necessary.
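As an illustration of the regex-free path, the following Python sketch reads the proposed 'system' and 'code' attributes and looks up the ETSI equivalent directly. CODE_TO_ETSI and resolve_category() are hypothetical names; the table would be built from the JSON rules files.

```python
import xml.etree.ElementTree as ET

# Hypothetical tables built from the JSON rules files:
# system name -> {system code -> ETSI code}.
CODE_TO_ETSI = {"atsc": {0x3B: 0x23, 0x94: 0xA0}}

def resolve_category(xml_text):
    """Return (system, code, etsi) from a <category> tag without using a regex."""
    elem = ET.fromstring(xml_text)
    system = elem.get("system")
    code = elem.get("code")
    if system is None or code is None:
        return None  # caller falls back to matching the tag text
    code = int(code, 16)
    # etsi is None when the rules file has no entry for this code.
    etsi = CODE_TO_ETSI.get(system, {}).get(code)
    return system, code, etsi
```

A tag with no text at all, such as '<category system="atsc" code="0x94"/>', resolves the same way, since only the attributes are consulted.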
STORAGE
The existing genre storage mechanism would remain unchanged.
An additional 'extensible genre' array will be created to store a tuple for each extensible genre: 1 byte for the 'system' and 1 byte for the 'code'.
It will be necessary to store both the system and code values as a single TVH instance may be receiving XMLTV EPG data from multiple sources. For example, XMLTV for a DVB satellite service may supply ETSI genre codes whereas XMLTV for an ATSC OTA system may supply ATSC genre codes.
Each 'system' will be assigned a unique persistent 8 bit integer when encountered for the first time on each TVH instance. This number will be used in the 'extensible genre' table. This table will map the 'system' names to their internal number as well as hold pointers to the regex, descriptions and code values for each language that has been used. Some regex libraries can pre-parse the regex text prior to use. Where this is possible, the pre-parsed value can also be saved to increase performance.
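The persistent numbering could work along these lines. This is a Python sketch under my own assumptions: the SystemRegistry class is hypothetical, and persistence is shown as a JSON string rather than whatever storage TVH would actually use.

```python
import json

class SystemRegistry:
    """Maps 'system' names to 8-bit integers, assigned on first encounter."""

    def __init__(self, saved=None):
        # 'saved' is a previously persisted name -> id mapping, if any.
        self.ids = dict(saved or {})

    def id_for(self, name):
        if name not in self.ids:
            used = set(self.ids.values())
            # Pick the lowest free value that fits in one byte.
            free = next((i for i in range(256) if i not in used), None)
            if free is None:
                raise OverflowError("no free 8-bit system id left")
            self.ids[name] = free
        return self.ids[name]

    def dump(self):
        """Serialise the mapping so the ids survive a restart."""
        return json.dumps(self.ids, sort_keys=True)
```

Because the numbers are assigned per instance, two TVH installations may number the same system differently, which is why clients would need an API call to resolve them.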
The 'extensible genre' table should be able to cater for instances where a single channel broadcasts programmes in multiple languages with the EPG data for each programme matching its language. For example, one programme may have the genre text 'Informations (général)' and another 'Nachrichten/Aktuelles'. These should both be interpreted and stored as ETSI '0x20'.
Whichever regular expression library is used, it should be Unicode-aware. If, during the research phase, some 'systems' are found that use more than 256 genres, 16 bits should be allowed for the genre code from the outset.
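For the 16-bit variant, each tuple would occupy 3 bytes. A small Python sketch of one possible packing follows; the little-endian layout and the function names are assumptions for illustration, not a statement about TVH's internal format.

```python
import struct

# 1-byte system id followed by a 2-byte genre code, little-endian, no padding.
ENTRY = struct.Struct("<BH")

def pack_genres(genres):
    """Pack a list of (system, code) tuples into a byte string."""
    return b"".join(ENTRY.pack(system, code) for system, code in genres)

def unpack_genres(blob):
    """Inverse of pack_genres()."""
    return list(ENTRY.iter_unpack(blob))
```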
It is also possible that the existing OTA grabbers could be modified to utilise the genre information contained within the JSON rules file and eventually do away with their hard-coded rules.
OUTPUT
EPG data is available via the HTSP and JSON APIs as well as the Web UI.
A new field will be added to both APIs to handle the new extended genre information. Something short like 'extgenre' would be preferable.
This new EPG data field would consist of an array of objects, one for each 'extensible genre' present for that programme. Each object would consist of a 'system' integer, and a 'code' integer.
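To show the intended shape of the field, here is a small Python sketch that adds 'extgenre' to an event before it is serialised. The event dictionary (including its 'title' and 'genre' keys) is only an illustrative stand-in for a real API response, and event_with_extgenre() is a hypothetical helper.

```python
import json

def event_with_extgenre(event, ext_genres):
    """Return a copy of the event with an 'extgenre' array of objects added."""
    out = dict(event)
    out["extgenre"] = [{"system": s, "code": c} for s, c in ext_genres]
    return out

# Illustrative event: ETSI 0xA0 in the existing field, ATSC 0x94 as system 1.
event = {"title": "Fly Fishing Today", "genre": [0xA0]}
print(json.dumps(event_with_extgenre(event, [(1, 0x94)])))
```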
Perhaps an API request parameter could be added to include the 'system' name as well as the extended genre code and perhaps even the genre description with the data returned.
A new API call will be required to extract the map between the 'system' code and the 'system' name. This call could return 1 or all of the 'systems' depending upon the parameters used.
The existing JSON API 'epg/content_type/list' call will be modified to accept an additional argument of either the system name or the system code. If these additional arguments are present, the API will return the contents of the JSON rules file excluding the 'regex' property.
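Dropping the 'regex' property from the response is simple. A Python sketch, with public_rules() as a hypothetical helper name:

```python
def public_rules(rules):
    """Return the rules as they would be sent to clients, without 'regex'."""
    return [{k: v for k, v in rule.items() if k != "regex"} for rule in rules]
```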
In order to display the 'extensible genre' Description, a client would have to first obtain the EPG data. For each 'system' in the 'extgenre', the client would then have to execute an API call to retrieve all of the description texts for that system. A client would be expected to adopt a caching strategy that limited the number of calls made.
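A client-side cache could be as simple as the following Python sketch. The DescriptionCache class is hypothetical, and 'fetch' stands in for whatever function performs the real API call and returns a code-to-description mapping for one system.

```python
class DescriptionCache:
    """Fetches the descriptions for a system once, then answers from memory."""

    def __init__(self, fetch):
        self.fetch = fetch   # callable: system id -> {code: description}
        self.cache = {}

    def describe(self, system, code):
        if system not in self.cache:
            self.cache[system] = self.fetch(system)  # one API call per system
        return self.cache[system].get(code)
```

With this approach the number of API calls is bounded by the number of distinct systems in the EPG rather than the number of programmes.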
Translation note: I realise that TVH already has a method for providing translations. However, due to the extensibility aspect, it may not be practical to use TVH translations for every extensible system/regex/description combination. More thought may be required here.
CONCLUSION
I do not intend for this to be 'the' solution that gets implemented. I personally only use DVB/modified-ETSI in my location so it is possible, if not probable, that I am unaware of some of the nuances of other digital TV systems with respect to genre handling.
According to Wikipedia, there appear to be 5 digital TV systems in operation around the world: DVB, ATSC, ISDB, DTMB and DMB. I'm not sure which of these systems TVH currently supports; however, it would be good to get feedback from users of every system that TVH is able to support and, more importantly, on the genre classification system used. Perhaps these core 'systems' could have hard-coded 'system' integer codes.
I know of a 3rd party EPG information provider here in Australia, so I expect that there will be many more worldwide. The extensible genre system should be flexible enough to support the published standards as well as 3rd party EPG information providers where possible.
I look forward to a healthy and respectful discussion.
Replies (5)
RE: Discussion on an 'Extensible Genre' Architecture - Added by Dave H over 1 year ago
Delta Mike Charlie wrote:
I would like to share my thoughts on the state of the genre system within tvheadend, especially regarding EPG data sourced via XMLTV, in the hopes of initiating a discussion that will eventuate in a new sub-module for handling EPG genre information.
Firstly, this proposal is designed to be 100% backwards compatible with the existing system. Its implementation should, in theory, break nothing.
Internally TVH appears to use the DVB/ETSI genre classification system. In addition to this, it is only capable of ingesting a subset of genre values.
There are multiple Digital TV systems in the world with a variety of accompanying EPG genre systems. In addition to this, there may be a number of third party EPG providers with their own proprietary systems.
For simplification, I will consider the three stages of EPG processing as follows:
1) Acquisition
2) Storage
3) Output
I would like to propose a proprietary extension to the XMLTV schema, as well as some internal EPG storage and processing changes in order to accommodate a more customisable genre data handling mechanism.
ACQUISITION
An example of the existing XMLTV 'category' syntax is as follows:
<category lang="en">Documentary</category>
TVH reads the value of this tag and compares it to an internal list of pre-defined text values in order to arrive at a 1 byte code that will be used to store this genre.
Until this point, this is still 100% backwards compatible with the existing system. This system, however, is not extensible.
My first proposal is to add a 'system' attribute to the 'category' tag as follows:
<category system="NAME" lang="en">Documentary</category>
I don't know anything about XMLTV - I just use terrestrial broadcast systems - so my question may be stupid. But I don't understand how you can change the incoming XML. Isn't that provided by the EPG source? And isn't XMLTV standardised using "xmltv.dtd"? So any change to the category tag would need a change to XMLTV itself, not to some part of TVH?
RE: Discussion on an 'Extensible Genre' Architecture - Added by Delta Mike Charlie over 1 year ago
Dave H wrote:
I don't know anything about XMLTV - I just use terrestrial broadcast systems - so my question may be stupid. But I don't understand how you can change the incoming XML. Isn't that provided by the EPG source? And isn't XMLTV standardised using "xmltv.dtd"? So any change to the category tag would need a change to XMLTV itself, not to some part of TVH?
Thanks for your feedback.
You are correct that TVH cannot, and will not, modify the incoming XMLTV data.
The proposal is to modify TVH to allow for additional genre data to be provided via the XMLTV feed in a manner that is both more flexible and precise. Suppliers of XMLTV data could then modify their data feed with this enhancement.
The flexibility comes from the use of regular expressions to match the genre text as well as the ability to process 3rd party customisations and genre systems other than ETSI.
The precision comes from being able to specify the exact genre code and coding system within the tag instead of having to rely on a text match. Specifying the genre code directly should also provide a modest performance boost.
This proposed change to the XMLTV schema would be non-standard. However, if the change were deemed useful enough, perhaps it could be added to the standard in the future.
RE: Discussion on an 'Extensible Genre' Architecture - Added by Dave H over 1 year ago
It seems to me that proposing changes to just one client of a schema is not the best way to seek to get changes to the schema. Why not propose your changes to the schema itself, and then all clients will naturally implement the changes over time?
RE: Discussion on an 'Extensible Genre' Architecture - Added by Delta Mike Charlie over 1 year ago
Dave H wrote:
It seems to me that proposing changes to just one client of a schema is not the best way to seek to get changes to the schema. Why not propose your changes to the schema itself, and then all clients will naturally implement the changes over time?
Thanks for your feedback.
There are really two parts to what I am proposing:
1) I am proposing that tvheadend be modified to accept some form of extensible genre information as opposed to being hard-coded to ETSI codes.
2) I am proposing that since XMLTV is the method by which tvheadend can receive external (non-OTA) EPG data, then modifying that mechanism would be the way to achieve this.
If there is no appetite among the user base for extensible genre descriptions (part 1), then the modification would be a waste of time.
With respect to changing the schema (part 2), I am currently in the process of crafting a proposal for just that. Once complete, I will present that proposal to the XMLTV team for their assessment.
RE: Discussion on an 'Extensible Genre' Architecture - Added by Delta Mike Charlie over 1 year ago
Dynamic Genre Descriptions
I have another proposal that is self-contained within TVH; however, it will require some additional processing overhead as well as some optional manual user post-processing when ingesting XMLTV EPG data. This mechanism can also be designed to work in a manner that is compatible with, and part of, the previous proposal. This mechanism would be disabled by default and, when enabled, only triggered when the existing ETSI description code matching fails.
The proposal involves TVH storing a record of every XMLTV genre description that it encounters.
ACQUISITION
When a genre description is encountered, TVH will look up all of the other saved genre descriptions to see if it has encountered that description before. If the genre description is new, a new UUID will be allocated; if the genre description exists, the existing UUID will be used.
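The learn-or-reuse step might look like this Python sketch. LearnedGenres is a hypothetical class, and normalising the text (trimming whitespace and ignoring case) before the lookup is my assumption rather than a fixed part of the proposal.

```python
import uuid

class LearnedGenres:
    """Remembers every genre description seen and gives each one a UUID."""

    def __init__(self):
        self.by_text = {}

    def learn(self, description):
        # Assumption: descriptions differing only in case or surrounding
        # whitespace are treated as the same genre.
        key = description.strip().casefold()
        if key not in self.by_text:
            self.by_text[key] = uuid.uuid4()  # first sighting: new UUID
        return self.by_text[key]
```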
Optionally, via the WebUI, the user will be able to assign an ETSI genre description to the dynamic descriptions that TVH has learned. The WebUI will provide a list of learned genres. Each of the learned genres will have a drop-down list containing the ETSI genre names known to TVH. In this way, dynamic descriptions like 'Cricket' or 'Test Cricket' or 'ODI Cricket' or 'T20 Cricket' or 'The Ashes' could be mapped to ETSI 'Sport'.
Perhaps a user with sufficient IT knowledge could set up a manual map using a regex for matching purposes or assign a 'weighting' to more popular genres.
If the 'lang' attribute is provided with the genre in the XML feed, then this will be stored too. Perhaps there could be a mechanism where 'en'+'Tennis' and 'vi'+'quần vợt' and 'Wimbledon' can be declared to be aliases and all mapped to ETSI 0x44 'Tennis'.
Perhaps a JSON API call could be created so that 3rd party EPG vendors could send a list of their genres to TVH so that it can pre-learn genres and ETSI associations prior to receiving live EPG data.
STORAGE
Each learned genre description will be stored in its own JSON file. Perhaps, to speed up lookups, the file name could be a hex string representing the UTF-8 encoding of the genre description instead of the UUID.
The JSON file will contain the genre description text, the ETSI code (if assigned by the user) and potentially some grouping or aliasing information.
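A Python sketch of the hex file naming and a possible record layout follows. The function names and the 'aliases' key are hypothetical; only the description text and the optional ETSI code come from the proposal itself.

```python
import json

def genre_filename(description):
    """File name derived from the UTF-8 bytes of the description, in hex."""
    return description.encode("utf-8").hex() + ".json"

def description_from_filename(name):
    """Recover the description text from such a file name."""
    return bytes.fromhex(name[:-len(".json")]).decode("utf-8")

def genre_record(description, etsi=None, aliases=()):
    """JSON body for one learned genre; 'etsi' stays null until assigned."""
    return json.dumps(
        {"description": description, "etsi": etsi, "aliases": list(aliases)},
        ensure_ascii=False,
    )
```

Hex encoding doubles the length of the name, so on filesystems that limit file names to 255 bytes, very long descriptions would need a fallback such as the UUID.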
The 'extensible genre' array as described above would only require minor modification. The 'system' integer could be hard-coded to something like 0x00 or 0xFF and the 'code' integer could become the UUID instead.
OUTPUT
The changes to the JSON API output described above would also be modified slightly to provide the UUIDs as the 'code' instead of the integer.
The existing TVH WebUI EPG page will be capable of displaying the dynamic genres, however, because filtering happens at the ETSI group level, filtering will not be possible for dynamic genres not assigned an ETSI equivalent.
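The filtering limitation can be illustrated with a short Python sketch. In the ETSI coding, the group is the upper 4 bits of the 1 byte code (for example, 0x44 belongs to group 0x4, Sports). The event dictionaries and function names here are hypothetical.

```python
def etsi_group(code):
    """Top-level ETSI genre group: the upper 4 bits of the 1 byte code."""
    return code >> 4

def filter_by_group(events, group):
    """Keep events whose ETSI code falls in the given group.

    Events with no ETSI code (unassigned dynamic genres) never match.
    """
    return [e for e in events
            if e.get("etsi") is not None and etsi_group(e["etsi"]) == group]
```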
CONCLUSION
TVH will learn gradually over time as it acquires new genre descriptions. As TVH builds up a list of dynamic genres, the user will be able to reassign dynamic genres to traditional genres in a way that suits them best.
EPG clients will still have access to the learned genres and be able to resolve the codes/UUIDs back into meaningful names.