Feature #4795
eit: scraper regex processing should use PCRE/PCRE2 if available
% Done: 0%
Description
Using PCRE/PCRE2 for scraper regex processing (if it is already in use) would make some scraper patterns easier to write.
Right now I'm thinking of being able to have a grouping subpattern that is not captured, i.e. a non-capturing group written as (?:...). So you could, say, use
(?:[.][.][.][:.]*[.:]|[0-9]+/[0-9]+[.]) ([^:]*):
when matching for a subtitle.
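A minimal sketch of what the non-capturing group buys, using Python re (which also supports the (?:...) syntax). The function name `extract_subtitle` and the sample summaries are made up for illustration; this is not the scraper's own code.

```python
import re

# Pattern from the description. (?:...) groups the two prefix alternatives
# without capturing them, so the subtitle is the only captured group.
SUBTITLE_RE = re.compile(r"(?:[.][.][.][:.]*[.:]|[0-9]+/[0-9]+[.]) ([^:]*):")


def extract_subtitle(summary):
    """Return the text captured by group 1, or None if nothing matches.

    Hypothetical helper for illustration only.
    """
    match = SUBTITLE_RE.search(summary)
    return match.group(1) if match else None


# Made-up sample summaries, not real EIT data.
print(extract_subtitle("...: The Subtitle: rest of the summary"))
print(extract_subtitle("2/6. Pilot: rest of the summary"))
```

Because the prefix alternatives are not captured, the subtitle stays in the first group whichever alternative matched.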
History
Updated by Jaroslav Kysela almost 7 years ago
I think that we have a problem now. Because of the slightly different regex syntax of PCRE, PCRE2 and glibc regex, it might be a bit difficult to create "universal" rules.
Updated by Jim Hague almost 7 years ago
This is true, and to some extent the problem already exists in the regex use for autorecs, where the user has to know which regex library is in use before relying on advanced regex features. That case differs a little, though, as any autorec regex is user-entered and not part of the distribution. Also, the test script for scraper regexes uses Python re, which is neither POSIX nor PCRE, but is closer to PCRE than to POSIX, certainly for basic usage.
I'm not proposing to alter the existing supplied regexes (which work the same with all engines), though I could argue this could be done for files distributed with packages where, like the Debian package, we know that PCRE will be used. Rather, I'm trying to give the home tinkerer (like myself) some more flexibility. Also, I think that in practice problems with universal rules (i.e. rules that will run on any engine) are not likely to happen, because the major differences involve handling of items in (), and that's restricted by the scraper only looking at the first (or now possibly second) () match.
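To illustrate the point about () handling, here is a rough sketch in Python re rather than in the scraper itself. It compares the PCRE-style pattern from the description with a variant that uses an ordinary capturing group and only constructs that POSIX extended regexes also define. The names `NON_CAPTURING`, `CAPTURING` and `capture_count` and the sample string are made up for illustration.

```python
import re

# PCRE-style form from the description: the prefix group is not captured.
NON_CAPTURING = r"(?:[.][.][.][:.]*[.:]|[0-9]+/[0-9]+[.]) ([^:]*):"
# Variant with a plain capturing group around the prefix alternatives.
CAPTURING = r"([.][.][.][:.]*[.:]|[0-9]+/[0-9]+[.]) ([^:]*):"


def capture_count(pattern):
    """Return how many capturing groups the pattern defines."""
    return re.compile(pattern).groups


sample = "2/6. Pilot: rest of the summary"  # made-up sample text

# One captured group: the subtitle is group 1.
print(capture_count(NON_CAPTURING), re.search(NON_CAPTURING, sample).group(1))
# Two captured groups: the prefix takes group 1, the subtitle moves to group 2.
print(capture_count(CAPTURING), re.search(CAPTURING, sample).group(2))
```

The capturing variant stays usable only as long as the consumer can read the second group, which is where the difference between the two forms shows up.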
Prompted by a GitHub request, I'm going to add documentation on how to make your own scraper regex file to the README file in data/conf/epggrab/eit/scrape, and I will mention the POSIX/PCRE issue. Perhaps I should also add a wiki page under Technical documentation?