eit scraper questions

Added by T J over 3 years ago

Hello.

I'm working on a new EIT scraper for Swedish DVB-T (Boxer).
I've got most of it figured out, but there are a few things I need clarification on.
I have read the epggrab/eit/scrape/README file a few times.

When looking at the EPG and its data, tvheadend seems to have both something called "summary" and something called "desc" (or description).
And this is in addition to subtitle.
I don't get what the difference is between summary and description, can someone clarify?

The reason for asking is that on Swedish DVB the EIT data is a bit screwed up (something I found out a long time ago when trying to do something similar in another application). The short event descriptor in EIT is not a subtitle; it is the beginning of the description (or whatever you decide to call it), and the rest of the description is in the extended event descriptor.
So short + extended together make up the description.
Most of the time this gets completely screwed up in applications, so you get a funny-looking subtitle that is just garbage and a cut-off description.

It looks like the first part (the short event descriptor) ends up in the summary and the rest in the description.
How can I convince tvheadend to merge them, either via options in tvheadend or via the EIT scraper scripts?


Replies (7)

RE: eit scraper questions - Added by T J over 3 years ago

One more question.
If I do filtering in the EIT scraper config, is there any relation between, for example, filtering done by the subtitle regex and the summary?
For example, if I remove some junk in scrape_subtitle, is that carried over to scrape_summary, or do I have to repeat it?

RE: eit scraper questions - Added by T J over 3 years ago

One more thing: if you make a syntax error in the scraper file, sometimes you don't get anything useful in the log other than that the file doesn't exist.
It took me some time to figure out that the error about a missing file is really a syntax error,
for example when the JSON file can't be parsed because I misplaced or forgot a comma.

RE: eit scraper questions - Added by Mark Clarkstone over 3 years ago

Here in the UK AFAIK

Subtitle: I don't think it's used here, the parser for the UK one pulls it from the description.
Summary: Basically a short version of the description, or sometimes just the actual episode title or numbering. It has also been used for the likes of just shoving "new" somewhere.
Description: The full description of the programme, including, if provided, release dates (xxxx), director and actors, and often additional features in brackets: [AD] (audio description), [S] (subtitling), [R] (repeat).

HTH

RE: eit scraper questions - Added by T J over 3 years ago

The problem I have (or one of them, anyway) is that Summary and Description must be merged together to make sense.

Here is an example from a real entry; unfortunately it's all in Swedish.

Title: Mästaren
Subtitle: Vill du bli mer kreativ?
(This one comes from my scraping file, so it can mostly be ignored for now.)

Summary: "Vill du bli mer kreativ? Lär av mästarna! Säsong 3. Eva Dahlgren. Eva Dahlgren slog igenom som tonåring, deltog i Melodifestivalen två gånger och vann fem grammisar för albumet En blekt blondins"

Description: " hjärta. Sen insåg hon att hon ville göra helt annan musik och gå sina egna vägar. Nu delar hon med sig av erfarenheter och tips. Hon berättar om hur idén till en ny låt uppstår, och om hur hon disciplinerat går till sin arbetslokal varje dag. Del 1 av 6. HD. Sänds med 5.1 ljud. Även i SVT1 6/7, SVT1 7/7, SVT24 9/7 och SVT1 10/7. Textat sid. 199."

Note that the Summary is cut off mid-sentence.
So I need to somehow get Summary + Description to end up in the same field, either Description or Summary, and then leave the other blank.

The contents of Summary come from the short event descriptor in EIT.
This is often used as a subtitle or similar, but here in Sweden it is a bit screwed up, and I don't think they are following the standard properly.
So here the short event descriptor is treated as the text that goes with the program, and it then continues in the extended descriptor.

I was checking the code in one of the eit files yesterday: the short and extended descriptors are stored in different fields, which makes this hard, if not impossible, to fix from the scraper rules as things are right now.
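
To make the wanted result concrete, here is a minimal Python sketch of the merge. It is not tvheadend code: `merge_descriptors` is a made-up name for illustration, and it assumes the short and extended texts are already available as two plain strings. The sample strings are shortened pieces of the Summary and Description quoted above.

```python
def merge_descriptors(short_text: str, extended_text: str) -> str:
    """Join the short event text and the extended event text into one description.

    Hypothetical helper for illustration only; not part of tvheadend.
    """
    # Concatenate as-is: in the sample above the extended part already starts
    # with the space that separates it from the last word of the short part.
    merged = short_text + extended_text
    # Collapse any runs of whitespace and trim both ends.
    return " ".join(merged.split())


summary = "vann fem grammisar för albumet En blekt blondins"
description = " hjärta. Sen insåg hon att hon ville göra helt annan musik"
print(merge_descriptors(summary, description))
```

Plain concatenation is used on purpose: if the broadcaster ever splits the text in the middle of a word, inserting a space at the join would break that word.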

RE: eit scraper questions - Added by Em Smith over 3 years ago

Take a look at the uk file (and the README), in particular "scrape_subtitle" with the "filter" option and search the README.uk for "...". I think that is probably similar to what you want since in the UK it will merge and replace the title/subtitle if one ends in "..." and the other starts with "...". So, you can probably do something similar with summary and description but just always merge. There's a test harness that you can use to test your changes rather than having to await an epg grab.
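
As a rough illustration of the "..." merge described above, here is a small Python sketch. It only mirrors the behaviour as stated in this post (join the two fields when one ends in "..." and the other starts with "..."); it is not the actual tvheadend implementation, and `merge_on_ellipsis` and the sample strings are invented for the example.

```python
from typing import Optional

ELLIPSIS = "..."


def merge_on_ellipsis(first: str, second: str) -> Optional[str]:
    """Return the joined text if `first` ends with "..." and `second` starts with "...".

    Returns None when the two fields do not look like one split sentence.
    Hypothetical helper; the name and behaviour are assumptions for illustration.
    """
    a, b = first.strip(), second.strip()
    if a.endswith(ELLIPSIS) and b.startswith(ELLIPSIS):
        # Drop both ellipsis markers and join the halves with a single space.
        return a[: -len(ELLIPSIS)].rstrip() + " " + b[len(ELLIPSIS):].lstrip()
    return None


# Invented sample strings, not real EIT data.
print(merge_on_ellipsis("The Long Road...", "...Home"))
print(merge_on_ellipsis("The Long Road", "Home"))
```

For the Swedish case there is no "..." marker to test for, so the condition would have to be dropped and the two fields always joined.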

The code should be logging a warning of "failed to load config" and "Invalid JSON", but it looks like we don't pass in a buffer or log it. So, the best thing to do is to validate the file before starting the server. For example, "jq . data/conf/epggrab/eit/scrape/se" (note the period) will dump the file to the screen if it is valid.
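
If jq is not installed, the same check can be sketched with Python's standard `json` module, which also reports the line and column where parsing failed. The helper name `find_json_error` is made up for this example, and the broken fragment is invented to show a missing comma.

```python
import json
from typing import Optional


def find_json_error(text: str) -> Optional[str]:
    """Return a description of the first JSON syntax error in `text`, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno point at the place where the parser gave up.
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# A scraper-style fragment with the comma missing after the "pattern" entry.
broken = '{\n  "scrape_summary": [\n    { "pattern": "^Serie[.] (.*)"\n      "filter": 1 }\n  ]\n}'
print(find_json_error(broken))

# To check a real file, read it first, for example:
# print(find_json_error(open("data/conf/epggrab/eit/scrape/se", encoding="utf-8").read()))
```

This only checks JSON syntax; it says nothing about whether tvheadend accepts the keys or the regular expressions inside the file.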

RE: eit scraper questions - Added by T J about 3 years ago

Continuing this thread...

I have been working on the scraping a bit more, also with other providers.
Getting season, episode and airdate to work is usually quite easy, but I have not been able to get scrape_summary to do anything at all, across several providers.

If I create some test data and use the eitscrape_test.py script, I can get the summary filtering to do the right thing, but against real data it doesn't do anything.

Here is a partial example of the scraping script.
Can someone tell me why this will not remove the text (excluding the quotation marks) "Series. " from the beginning of the description?

{
  "scrape_summary": [
    {
      "pattern": "^.+ serie[.] (.*)",
      "filter": 1
    },
    {
      "pattern": "^Serie[.] (.*)",
      "filter": 1
    },
    {
      "pattern": "^.+ film från [0-9]{4}[.] (.*)",
      "filter": 1
    },
    {
      "pattern": "^Film från [0-9]{4}[.] (.*)",
      "filter": 1
    },
    "(.*)"
  ]
}
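
One way to narrow this down outside tvheadend is to run the same patterns against text copied from the real EPG. The sketch below uses Python's `re` module and assumes the patterns are tried in order with the first match winning; tvheadend's own regex engine and matching rules may differ, and the `"filter"` option is not modelled. The function name `scrape` and the sample strings are invented for the example.

```python
import re

# Patterns copied from the scrape_summary list above, in the same order.
PATTERNS = [
    r"^.+ serie[.] (.*)",
    r"^Serie[.] (.*)",
    r"^.+ film från [0-9]{4}[.] (.*)",
    r"^Film från [0-9]{4}[.] (.*)",
    r"(.*)",
]


def scrape(text: str) -> str:
    """Return group 1 of the first pattern that matches (assumed first-match-wins)."""
    for pattern in PATTERNS:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return text


# Invented sample strings, not real EIT data.
print(repr(scrape("Serie. Del 1 av 6.")))
print(repr(scrape(" Serie. Del 1 av 6.")))
```

In this sketch the second sample falls through to the catch-all "(.*)" because of its leading space, so nothing is removed. Whether something similar happens with the real data (a leading space, different casing, or the text arriving in another field than the one scrape_summary sees) would need to be checked against the actual strings.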

RE: eit scraper questions - Added by T J about 3 years ago

I just realized another problem...
How do I run different EIT scrapers against different providers? In this case it is DVB-T vs. DVB-C, which need different rules applied to get them cleaned up.

I don't think this is possible, since I can't create a duplicate of "Over-the-air: EIT: EPG Grabber" with a different name and different settings.
