Feature #4795: eit: scraper regex processing should use PCRE/PCRE2 if available - Tvheadend

Feature #4795

closed

eit: scraper regex processing should use PCRE/PCRE2 if available

Added by Jim Hague over 7 years ago. Updated over 7 years ago.

Status: Fixed
Priority: Normal
Assignee: Adam Sutton
Category: EPG - Grabbers
Target version:
Start date: 2017-12-14
Due date:
% Done:
Estimated time:

Description

PCRE/PCRE2 (if already in use) would make writing some scraper patterns easier.

Right now I'm thinking of being able to have a grouping subpattern that is not captured. So you could, say, use

^(?:[.][.][.][^:.]*[.:]|[0-9]+/[0-9]+[.]) ([^:]*):

when matching for a subtitle.
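To make the idea concrete, here is a minimal sketch in Python re (not the tvheadend scraper itself), taking the pattern to be `^(?:[.][.][.][^:.]*[.:]|[0-9]+/[0-9]+[.]) ([^:]*):`. The sample summaries and the helper name `scrape_subtitle` are made up for illustration. The point is that the `(?:...)` group handles the two possible prefixes without being captured, so the subtitle remains the first (and only) captured group.

```python
import re

# The subtitle pattern from the description. The leading alternation is a
# non-capturing group, so the only capturing group is the subtitle itself.
SUBTITLE_RE = re.compile(r"^(?:[.][.][.][^:.]*[.:]|[0-9]+/[0-9]+[.]) ([^:]*):")


def scrape_subtitle(summary):
    """Return the subtitle captured from an EIT summary, or None if no match.

    Hypothetical helper for illustration only.
    """
    match = SUBTITLE_RE.search(summary)
    return match.group(1) if match else None


# Made-up sample summaries, not taken from any real broadcast data.
SAMPLES = [
    "...continued. The Big Episode: Rest of the summary text.",
    "3/6. Winter's Tale: A family gathers for the holidays.",
    "A summary with no recognised prefix: nothing to scrape.",
]

if __name__ == "__main__":
    for sample in SAMPLES:
        print(repr(scrape_subtitle(sample)))
```

With a POSIX-only engine the alternation would have to be written as an ordinary `(...)` group, which would then be captured as well.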

Updated by Jim Hague over 7 years ago

(PR In progress)

Updated by Jaroslav Kysela over 7 years ago

I think that we have a problem now. Because of the slightly different regex syntax of PCRE, PCRE2 and glibc regex, it might be a bit difficult to create "universal" rules now.

Updated by Jim Hague over 7 years ago

This is true, and to some extent the problem already exists in the regex use for autorecs, since the user has to know which regex library is in use in order to use advanced regex features. I guess that case differs a little, as any autorec regex is user-entered and not part of the distribution. Also, the test script for scraper regexes uses Python re, which is neither POSIX nor PCRE, but is closer to PCRE than to POSIX, certainly for basic usage.

I'm not proposing to alter the existing supplied regexes (which work the same with all engines), though I could argue this could be done for files distributed with packages where, like the Debian package, we know that PCRE will be used. Rather, I'm trying to give the home tinkerer (like myself) some more flexibility. Also, I think that in practice problems with universal rules (i.e. rules that will run on any engine) are not likely to happen, because the major differences involve handling of items in (), and that's restricted by the scraper only looking at the first (or now possibly second) () match.
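As a rough illustration of why the handling of () matters, the sketch below compares two spellings of the same subtitle rule in Python re: one using only a plain capturing group, which any of the engines should accept, and one using a non-capturing `(?:...)` group. The names `PORTABLE_RE`, `PCRE_STYLE_RE` and `captured_groups` are hypothetical, and this only shows how group numbering shifts; it does not reproduce how the scraper itself picks its match.

```python
import re

# Shared prefix alternation: "...something." / "...something:" or "N/M."
PREFIX = r"[.][.][.][^:.]*[.:]|[0-9]+/[0-9]+[.]"

# Portable spelling: the prefix needs an ordinary (capturing) group,
# so the subtitle ends up as the second captured group.
PORTABLE_RE = re.compile(r"^(" + PREFIX + r") ([^:]*):")

# PCRE-style spelling: the prefix group is not captured,
# so the subtitle is the first captured group.
PCRE_STYLE_RE = re.compile(r"^(?:" + PREFIX + r") ([^:]*):")


def captured_groups(pattern, text):
    """Return the tuple of captured groups, or an empty tuple if no match."""
    match = pattern.search(text)
    return match.groups() if match else ()


# Made-up sample summary for illustration.
SAMPLE = "3/6. Winter's Tale: A family gathers for the holidays."

if __name__ == "__main__":
    print(captured_groups(PORTABLE_RE, SAMPLE))
    print(captured_groups(PCRE_STYLE_RE, SAMPLE))
```

Both spellings match the same text; they differ only in which group number holds the subtitle.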

Prompted by a GitHub request, I'm going to add documentation on how to make your own scraper regex file to the README file in data/conf/epggrab/eit/scrape, and I will mention the POSIX/PCRE issue. Perhaps I should also add a wiki page under Technical documentation?
