Autorecs duplicate handling, description question
Added by Oleg _ almost 5 years ago
Build: 4.2.8-31~g0a60f73ee (2019-07-01T01:03:03+0200)
Autorecs duplicate handling: All: Record if different description
Hello,
Some recent recordings are duplicates of ones made a few weeks ago. I searched the dvr log directory and found 2 entries: one is the history of a removed recording, and the other is a new recording that still has its file.
I copied the files and compared them in Notepad++, but it didn't show any differences in the description tag.
Question:
Which tags does the description comparison include? If it includes the title, then that would explain it.
Could Polish characters cause a problem when comparing descriptions, and how can I test that?
Thank you in advance for any suggestions/feedback.
Replies (17)
RE: Autorecs duplicate handling, description question - Added by Joe User almost 5 years ago
I have not used 4.2 in a long, long time, so I am not sure about it. I know there have been many fixes/enhancements for autorec in 4.3 though.
Maybe because the "title"s are different?
Also the "episode" field is different - is the EPG from different sources? If so, while they appear the same, possibly the descriptions have different character encodings???
RE: Autorecs duplicate handling, description question - Added by Oleg _ almost 5 years ago
Joe User wrote:
I have not used 4.2 in a long, long time, so I am not sure about it. I know there have been many fixes/enhancements for autorec in 4.3 though.
Maybe because the "title"s are different?
Also the "episode" field is different - is the EPG from different sources? If so, while they appear the same, possibly the descriptions have different character encodings???
Thank you for pointing me to the area to check: the title. Today I sorted removed recordings by subtitle to see if there were any duplicates, and there were many. The difference was that some titles contained season and episode numbers, and they were from different channels. But the descriptions had the same encoding and were identical in the dvr log files. The EPG is from the same source. The dvr log file encoding is UTF-8.
I would really like to keep the current setup and just adjust the autorec settings if possible. Do you think it is possible that Tvheadend uses the title regexp one way when looking for TV shows to record and another way when looking for duplicates? I was thinking that adding ".*" after the title in the regexp might exclude it, but then I am not sure whether the description setting also includes the title. The only way to find out is to apply and test, or to look at the source code, but I do not have the knowledge for that.
Will Tvheadend preserve all config from 4.2.8 when upgrading to 4.3?
Please see the attached image of the most recent duplicate, which shows possible reasons. My autorec rule has title (regexp) "Nasz nowy dom".
Thank you
RE: Autorecs duplicate handling, description question - Added by Em Smith almost 5 years ago
https://github.com/tvheadend/tvheadend/blob/v4.2.8/src/dvr/dvr_db.c#L1494
The code says "if titles are not defined or do not match, don't dedup"
It is the same in 4.3, so upgrading won't change the behaviour for you.
https://github.com/tvheadend/tvheadend/blob/master/src/dvr/dvr_db.c#L1804
If you want only "Nasz nowy dom" then add "$" after the title instead of ".*" (since all regex already have ".*").
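A minimal sketch of what the trailing "$" changes, using grep -E as a stand-in for the autorec matcher. The three sample titles are made up for illustration, and Tvheadend's own regex engine may differ in details, so this only shows the general anchoring effect.

```shell
#!/bin/bash
# Hypothetical EPG titles, one per line (not taken from a real grab).
titles='Nasz nowy dom
Nasz nowy dom, odc. 96
Nasz nowy dom 9, odc. 96'

# Without "$": every title that contains the rule text matches.
unanchored=$(printf '%s\n' "$titles" | grep -c -E 'Nasz nowy dom')

# With "$": only the bare title matches.
anchored=$(printf '%s\n' "$titles" | grep -c -E 'Nasz nowy dom$')

echo "unanchored=$unanchored anchored=$anchored"
# prints: unanchored=3 anchored=1
```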
RE: Autorecs duplicate handling, description question - Added by Joe User almost 5 years ago
Thanks for code info.
Em Smith wrote:
If you want only "Nasz nowy dom" then add "$" after the title instead of ".*" (since all regex already have ".*").
Unfortunately what he wants is any "Nasz nowy dom*" but not if the description is the same. If you see his picture it shows the same episode, but with slightly different titles - he doesn't care which title gets recorded, but doesn't want two of the same episode.
It would be possible to check if the descriptions are exactly the same and a regex comparison of the titles matches, then they are dupes. But then you would have problems with something like:
Title:"News at 5" desc:"News" Title:"News live" desc:"News"
In this case the user wants a regex of "News" to record both, but they would not both be recorded.
I am not feeling clever enough at the moment to think of a way to satisfy both...
Also note the episode information is different:
"episode": "Season 9.Episode 96" "episode": "S09E96"
Does the code in 4.3 for "record if different episode" consider the two the same? ie. does it parse each to values for season and episode, or a simple compare?
RE: Autorecs duplicate handling, description question - Added by Em Smith almost 5 years ago
It's possible to do it with 4.3, but not easy.
There are scraper configurations such as data/conf/epggrab/eit/scrape/uk. These are applied to the epg grabber and can manipulate the epg fields before they are stored and used.
For example, in the UK some broadcasters prepend "New: " to the title, so the regex strips it from the title before using it.
Assuming the epg grabber is used, then a config could strip the "odcinek [0-9]+" from the title (using the UK config as a guide on writing the config file), and that should solve the problems.
Re: episode, yes I noticed that. I'd have to re-check the code to see how that is extracted since 4.3 does it very differently and stores the fields separately so that it can be passed better to Kodi.
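The title clean-up described above can be tried outside Tvheadend first. The sketch below is not a scraper configuration file; it is only a sed approximation of stripping "odcinek [0-9]+" from the end of a title. The function name strip_odcinek and the sample titles are assumptions for illustration.

```shell
#!/bin/bash
# strip_odcinek: remove a trailing "odcinek <number>" (with an optional
# leading comma) from a title passed as the first argument.
strip_odcinek() {
  printf '%s\n' "$1" |
    sed -E 's/,?[[:space:]]*odcinek[[:space:]]+[0-9]+[[:space:]]*$//'
}

strip_odcinek "Nasz nowy dom odcinek 96"    # prints: Nasz nowy dom
strip_odcinek "Nasz nowy dom, odcinek 96"   # prints: Nasz nowy dom
strip_odcinek "Nasz nowy dom"               # prints: Nasz nowy dom
```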
RE: Autorecs duplicate handling, description question - Added by Oleg _ almost 5 years ago
Joe User wrote:
Thanks for code info.
Em Smith wrote:
If you want only "Nasz nowy dom" then add "$" after the title instead of ".*" (since all regex already have ".*").
Unfortunately what he wants is any "Nasz nowy dom*" but not if the description is the same. If you see his picture it shows the same episode, but with slightly different titles - he doesn't care which title gets recorded, but doesn't want two of the same episode.
It would be possible to check if the descriptions are exactly the same and a regex comparison of the titles matches, then they are dupes. But then you would have problems with something like:
[...]In this case the user wants a regex of "News" to record both, but they would not be.
I am not feeling clever enough at the moment to think of a way to satisfy both...
Also note the episode information is different:
[...]Does the code in 4.3 for "record if different episode" consider the two the same? ie. does it parse each to values for season and episode, or a simple compare?
Hello Joe,
Thank you for clarifying my intentions for Em and detailing scenarios!
RE: Autorecs duplicate handling, description question - Added by Oleg _ almost 5 years ago
Em Smith wrote:
It's possible to do it with 4.3, but not easy.
There are scraper configurations such as data/conf/epggrab/eit/scrape/uk. These are applied to the epg grabber and can manipulate the epg fields before they are stored and used.
For example, in the UK some broadcasters prepend "New: " to the title, so the regex strips it from the title before using it.
Assuming the epg grabber is used, then a config could strip the "odcinek [0-9]+" from the title (using the UK config as a guide on writing the config file), and that should solve the problems.
Re: episode, yes I noticed that. I'd have to re-check the code to see how that is extracted since 4.3 does it very differently and stores the fields separately so that it can be passed better to Kodi.
Thanks a lot for confirming how the code does it; otherwise it is guesswork to run tests to identify what information it uses and how.
I am using an external XMLTV grabber: the EPG is downloaded from the IPTV provider and passed on to Tvheadend through a short script. The line below is the most important.
cat /sharedfolders/Appdata/tvheadend42/scripts/epg.xml | socat - UNIX-CONNECT:/sharedfolders/Appdata/tvheadend42/epggrab/xmltv.sock
So I was thinking of adding a step to my script that removes anything after the comma in the title using a regex; some titles use abbreviations and some use just numbers, but most use a comma. I am not using DVB EPG for this source, and I will also do the same on existing recordings in the dvr log files. I will have to take a backup and practice on a few to start with.
I will also update the autorec title regex to add $ at the end.
perl -pi -e 's/(^\s*\<title\ lang\=\"pl\"\>.*)(\,.*\<)/$1\</g' epg.xml
perl -pi -e 's/\s*\<\/title\>/\<\/title\>/g' epg.xml #to remove any spaces in case there were any before comma
Epg xml file example entry
Programme start="20191128132000 +0100" stop="20191128135000 +0100" channel="HGTV"> <title lang="pl">Odlotowy ogród 4, odc. 5/6</title> <desc lang="pl">W miejscowości Modliczki niedaleko Krakowa mieszka małżeństwo z pięciorgiem dzieci i psem. Pani Kasia jest psychologiem i doradcą zawodowym, a pan Paweł listonoszem. Marzą, by ich ogródek stał się miejscem wymarzonego relaksu, w którym każdy znajdzie swój ulubiony zakątek. Obecnie dostęp do ogrodu jest mocno utrudniony. Bezładnie porozmieszczane są tam piaskownica, trampolina i huśtawka. W dodatku pies skutecznie niszczy trawę. Dominik Strzelec proponuje rodzinie ciekawe rozwiązania</desc>
After further thought I decided to apply the above method to each problematic recording, with a separate command for each. I just looked in the EPG for titles with commas and there were 1245 over 3 days (not on the channels I mainly use), but in most of them the comma was part of the description.
I also found some titles that contain the season number before the comma, so for each problematic TV show I would have to copy and amend the command below, e.g. for "Odlotowy ogród 4, odc. 5/6".
perl -pi -e 's/(^\s*\<title\ lang\=\"pl\"\>Odlotowy\ ogród)(.*\<)/$1\</g' epg.xml
I will try renaming all the dvr logs, but that will probably be in 2 days' time.
Once more, thank you Em and Joe for helping me with this. Please let me know if my approach is not appropriate.
RE: Autorecs duplicate handling, description question - Added by Em Smith almost 5 years ago
Looks good. We are pleased to help.
I would be wary of assuming a comma means episode since there might be cases where a title contains a comma? So a film called "A, B, C" would strip the ", C".
I would use:
perl -pi -e 's/(^\s*\<title\ lang\=\"pl\"\>.*)(, odc. \d.*\<)/$1\</g' epg.xml
Perhaps this regex will help for the easy cases with a series, so, if we have a series number followed by episode then we strip them:
perl -p -e 's/(^\s*\<title\ lang\=\"pl\".*)( \d, odc. \d.*\<)/$1\</g'
RE: Autorecs duplicate handling, description question - Added by Em Smith almost 5 years ago
That should be:
perl -p -e 's/(^\s*\<title\ lang\=\"pl\".*)( \d+, odc. \d.*\<)/$1\</g'
(with the extra "+" after the series), just in case you're recording series 10 of something.
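For a quick check without touching epg.xml, the same pattern can be run on single lines. This is a sed -E approximation of the perl one-liner above (POSIX classes instead of \s and \d), wrapped in a hypothetical helper named strip_series_episode; it is a sketch, not a replacement for testing against the real file.

```shell
#!/bin/bash
# strip_series_episode: read XMLTV lines on stdin and drop a trailing
# " <series>, odc. <episode>..." from Polish <title> elements.
strip_series_episode() {
  sed -E 's/^([[:space:]]*<title lang="pl".*) [0-9]+, odc\. [0-9].*</\1</'
}

# Title taken from the EPG example earlier in the thread.
printf '%s\n' '<title lang="pl">Odlotowy ogród 4, odc. 5/6</title>' |
  strip_series_episode
# prints: <title lang="pl">Odlotowy ogród</title>

# A title that merely contains commas is left alone.
printf '%s\n' '<title lang="pl">A, B, C</title>' | strip_series_episode
# prints: <title lang="pl">A, B, C</title>
```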
Also, what "episode-num" "system" tag does your xmltv contain? Since your new and old recordings have different episode information, it suggests something else was different.
The best supported format is xmltv_ns for episodes. If your epg has something else such as "onscreen" (that is consistent) then it may be possible to re-format it as part of your script to be xmltv_ns, which then means Kodi will see proper season/episode details (though possibly not until you have a newer version of Tvheadend). If you post a few examples I'll take a look.
RE: Autorecs duplicate handling, description question - Added by Oleg _ almost 5 years ago
Em Smith wrote:
That should be:
[...]
(with the extra "+" after the series), just in case you're recording series 10 of something.
Also, what "episode-num" "system" tag does your xmltv contain? Since your new and old recordings have different episode information, it suggests something else was different.
The best supported format is xmltv_ns for episodes. If your epg has something else such as "onscreen" (that is consistent) then it may be possible to re-format it as part of your script to be xmltv_ns, which then means Kodi will see proper season/episode details (though possibly not until you have a newer version of Tvheadend). If you post a few examples I'll take a look.
Thank you for the suggested regex changes, which I will be able to apply. I just looked at a few shows that have the same description and for which I use "All: Record if different episode number", but I couldn't find any duplicates, possibly because no shows were listed in the EPG with titles without episode numbers. Does it use the same dedup logic, "// if titles are not defined or do not match, don't dedup"?
I previously had different IPTV provider packages that possibly had a different EPG.
I had a look at the EPG and the current one has "<episode-num system="xmltv_ns">" tags, but they do not show the correct season, the episode is minus 1, and they are not on all programs, as some have "<episode-num system="onscreen">".
I have attached parts of epg.xml.
The deeper I look into the EPG, the harder it gets, because so many naming conventions are used. What I thought would be best is to pull the title from all autorecs in dvr/autorec and just strip the episode or season from those titles in epg.xml. That will let me decide within Tvheadend whether to dedup on description, episode or subtitle, without editing any scripts.
The script below is just the part that pulls the autorec information and modifies the XML file. I tested it and it works in my case; the only limitation is when there is a leading space in the title, either in epg.xml or in the autorecs.
I still need to write a script of how to rename existing dvr log entries.
#!/bin/bash
# set -x
#declare file/path variables
autorecFiles=/sharedfolders/Appdata/tvheadend42/dvr/autorec/*
epgFile=/sharedfolders/Appdata/tvheadend42/scripts/epg.xml
#temp file to store autorec titles
autorecShows=/tmp/autorecShows.txt
rm -f $autorecShows
#extract Title from autorec files
for f in $autorecFiles
do
cat $f|sed -n -Ee 's/(^\s*\"title\"\:\ \")(.*)(\"\,)/\2/p'>>$autorecShows
done
#remove any spaces at the end
perl -pi -e 's/ +$//' $autorecShows
#for each show in epg keep title as in autorec entry, use double quotes and \1 instead of $1
cat $autorecShows | while read line || [[ -n $line ]];
do
perl -pi -e "s/(^\s*\<title\ lang\=\"pl\"\>${line})(.*\<)/\1\</g" $epgFile
done
RE: Autorecs duplicate handling, description question - Added by Joe User almost 5 years ago
Oleg _ wrote:
I still need to write a script of how to rename existing dvr log entries.
Just remember that Tvheadend only reads the dvr log files on start up. So Tvheadend will have to be restarted in order for it to see any changes you make to the files - and probably best to stop Tvheadend before making any outside changes to the files...
RE: Autorecs duplicate handling, description question - Added by Em Smith almost 5 years ago
Your script looks ok to me.
The xmltv_ns season/episode are zero-based (for odd reasons),
https://github.com/XMLTV/xmltv/blob/master/xmltv.dtd#L367
To convert onscreen to xmltv_ns, this appears to work for the examples you have:
perl -pi -e 'if(/(<episode-num system=)"onscreen">(\d+), odc. (\d+)\/?(\d+)?/) {my $s=$2-1; my $e=$3-1; print qq!$1"xmltv_ns"> $s . $e . </episode-num>\n!}' epg.xml
I deliberately left the xmltv_ns tag without indentation so it is easy for you to see in the converted file. Tvheadend will parse both onscreen and xmltv_ns, but should prefer xmltv_ns. (If not, then assign to $_ instead of doing a print).
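The zero-based numbering is just "on-screen value minus one" for both fields. A minimal bash sketch of that arithmetic, with a hypothetical helper name to_xmltv_ns; the 10# prefix is there so that zero-padded input such as 09 is not read as octal.

```shell
#!/bin/bash
# to_xmltv_ns: convert 1-based on-screen season/episode numbers into the
# zero-based "season . episode ." form used by the xmltv_ns system.
to_xmltv_ns() {
  local season=$1 episode=$2
  printf '%d . %d .\n' "$(( 10#$season - 1 ))" "$(( 10#$episode - 1 ))"
}

to_xmltv_ns 4 5     # prints: 3 . 4 .
to_xmltv_ns 09 96   # prints: 8 . 95 .
```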
RE: Autorecs duplicate handling, description question - Added by Oleg _ almost 5 years ago
Joe User wrote:
Oleg _ wrote:
I still need to write a script of how to rename existing dvr log entries.
Just remember that Tvheadend only reads the dvr log files on start up. So Tvheadend will have to be restarted in order for it to see any changes you make to the files - and probably best to stop Tvheadend before making any outside changes to the files...
Thank you Joe for pointing it out. I restart tvheadend on nightly basis so could incorporate into that script, stop service, apply changes and start again. My whole aim is to have it as little maintenance as possible and nightly restart does help with that. I am not watching tv myself and do not monitor anything until some occasions when i happen to ask my partner if she saw this recording before. I have weekly backups of config directories so it shouldn't be a problem restoring if something breaks.
RE: Autorecs duplicate handling, description question - Added by Oleg _ almost 5 years ago
Em Smith wrote:
Your script looks ok to me.
The xmltv_ns season/episode are zero-based (for odd reasons),
https://github.com/XMLTV/xmltv/blob/master/xmltv.dtd#L367
To convert onscreen to xmltv_ns, this appears to work for the examples you have:
[...]
I deliberately left the xmltv_ns tag without indentation so it is easy for you to see in the converted file. Tvheadend will parse both onscreen and xmltv_ns, but should prefer xmltv_ns. (If not, then assign to $_ instead of doing a print).
Thank you Em for clarifying the xmltv_ns naming standard, season.episode.part (the part number, if there are multiple, in the format "episode part"/"total episode parts"), and for providing a way of manipulating it.
I think I will stick to renaming the title, as it should be less error-prone, and will stay with the current Tvheadend build. I am not good at programming or shell scripting; I just copy and paste from online forums and modify a bit.
I have managed to get the replacement part working with a $title variable, which will be retrieved from the autorecs. I still need to find a way to get a list of files that match a multi-line regex: "^\s*\"title\"\:\ \{" on one line and, on the next line, a title that matches. But that is for another day.
sed -i -Ee "/^\s*\"title\"\:\ \{/{n;s/(^\s*\".*\"\:\ \")(.*)/\1${title}\"/;}" $logFile
Have a good weekend
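One possible way to do the two-line check mentioned above is to let awk remember whether the previous line opened the "title" object. This is only a sketch: the helper name log_has_title is made up, and it assumes the dvr log layout is exactly the two-line form described in this thread, with the title text on the line right after `"title": {`. Titles containing backslashes would need extra care because awk -v interprets escape sequences.

```shell
#!/bin/bash
# log_has_title SHOW FILE
# Succeeds (exit 0) if the line directly after a  "title": {  line holds
# a value that starts with SHOW; fails (exit 1) otherwise.
log_has_title() {
  awk -v show="$1" '
    after_title && index($0, "\": \"" show) > 0 { found = 1 }
    { after_title = ($0 ~ /^[ \t]*"title": [{]/) }
    END { exit (found ? 0 : 1) }
  ' "$2"
}
```

It could then be used in place of grep -li, for example `log_has_title "$show" "$file" && echo "$file"`, so that a show name appearing only in the description does not select the file.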
RE: Autorecs duplicate handling, description question - Added by Oleg _ almost 5 years ago
Em Smith wrote:
Your script looks ok to me.
Hello Em,
Just an update on what I have done: I tested both scripts and they do what they are supposed to, but only time will tell whether they help. Tvheadend keeps its EPG, so I will only find out in the next few days by comparing the unmodified and modified scheduled entries in Tvheadend.
I am sure the scripts below are full of possible problems, but they do what I need for now and hopefully for a long time to come.
The one below does the nightly restart of Tvheadend and updates dvr log entries with the autorec Title (regexp) value.
#!/bin/bash
#set -x
#stop tvheadend service first, uncomment the one that works in shell
docker stop --time 30 tvheadend4.2
#service tvheadend stop
#systemctl stop tvheadend
#declare file/path variables, directory has to end with /
autorecDir=/sharedfolders/Appdata/tvheadend42/dvr/autorec/
dvrlogDir=/sharedfolders/Appdata/tvheadend42/dvr/log/
#temp files to store autorec titles and file paths
autorecShows=/tmp/autorecShows.txt
dvrFiles=/tmp/dvrLogFiles.txt
logFile=/tmp/logfile.txt
#delete temp files
rm -f $autorecShows
rm -f $dvrFiles
rm -f $logFile
#extract Title from autorec files
for f in $autorecDir*
do
cat $f|sed -n -Ee 's/(^\s*\"title\"\:\ \")(.*)(\"\,)/\2/p'>>$autorecShows
done
#remove any spaces at the end
perl -pi -e 's/ +$//' $autorecShows
#for each show scan dvr/log files and record them,
#they will match any file that contains title from autorecs anywhere in file
#couldn't find easy way of doing it correctly but next step with sed will match and replace accurately.
cat $autorecShows | while read show || [[ -n $show ]];
do
rm -f $dvrFiles
find $dvrlogDir -type f |xargs grep -li "$show" >>$dvrFiles
#for each log file entry compare with current show name, match exactly after title and trim anything after
#the format it is looking for is the 2 lines below, with spaces at the start.
# "title": {
# "any characters": "Title with any characters, that will be replaced with title from autorec"
cat $dvrFiles | while read filePath || [[ -n $filePath ]];
do
sed -i -Ee "/^\s*\"title\"\:\ \{/{n;s/(^\s*\".*\"\:\ \")(.*)/\1${show}\"/;}" $filePath
echo $show $filePath>>$logFile
done
done
#start tvheadend service, uncomment the one that works in shell
docker start tvheadend4.2
#service tvheadend start
#systemctl start tvheadend
Below is the one I use to download the EPG and modify titles to make them identical to the autorec Title (regexp).
#!/bin/bash
#script downloads epg and makes titles identical to those in autorec; this is needed if tv shows are recorded based
#on different description, as tvheadend compares titles first and, if they differ, doesn't compare the description of the already recorded one.
#declare variables, work directory, etc
epgFile=/tmp/epg.xml
logFile=/tmp/epg.log
autorecFiles=/sharedfolders/Appdata/tvheadend42/dvr/autorec/*
#external xmltv epg grabber path
xmltvGrabber=/sharedfolders/Appdata/tvheadend42/epggrab/xmltv.sock
#temp file to store autorec titles
autorecShows=/tmp/autorecShows.txt
#cleanup temp files
rm -f $autorecShows
rm -f $epgFile
rm -f $logFile
echo "Download starting..." >> $logFile
echo "Time: $(date)" >> $logFile
#download epg
wget http://somewebsite.org -O $epgFile --no-check-certificate --no-proxy
#extract Title from autorec files
for f in $autorecFiles
do
cat $f|sed -n -Ee 's/(^\s*\"title\"\:\ \")(.*)(\"\,)/\2/p'>>$autorecShows
done
#remove any spaces at the end
perl -pi -e 's/ +$//' $autorecShows
#for each show in epg keep title as in autorec entry, use double quotes and \1 instead of $1
cat $autorecShows | while read line || [[ -n $line ]];
do
perl -pi -e "s/(^\s*\<title\ lang\=\"pl\"\>${line})(.*\<)/\1\</g" $epgFile
done
# give it a second to finish
sleep 1
#update tvheadend
cat $epgFile | socat - UNIX-CONNECT:$xmltvGrabber
echo "Finished Time: $(date)" >> $logFile
RE: Autorecs duplicate handling, description question - Added by Em Smith almost 5 years ago
They look reasonable to me.
However, I've been thinking that it may have problems in the future with some series. For example, if you have a recording rule for "CSI" then I think your scripts will correctly alter "CSI 3, odc. 5" to be "CSI", but would also change "CSI: Miami" to be "CSI".
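One way to reduce that risk would be to rewrite a title only when whatever follows the rule text looks like an episode suffix. The bash sketch below assumes the suffix always has the "[ N], odc. N" shape seen in this thread; normalize_title is a hypothetical helper, not part of the scripts above.

```shell
#!/bin/bash
# normalize_title RULE TITLE
# Prints RULE when TITLE is RULE itself or RULE followed by an episode
# suffix such as " 3, odc. 5" or ", odc. 5"; otherwise prints TITLE as is.
normalize_title() {
  local rule=$1 title=$2
  local suffix=${title#"$rule"}        # text after the rule, if any
  local re='^( [0-9]+)?, odc\. [0-9]'  # assumed episode-suffix shape
  if [[ $title == "$rule"* && ( -z $suffix || $suffix =~ $re ) ]]; then
    printf '%s\n' "$rule"
  else
    printf '%s\n' "$title"
  fi
}

normalize_title "CSI" "CSI 3, odc. 5"   # prints: CSI
normalize_title "CSI" "CSI: Miami"      # prints: CSI: Miami
```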
RE: Autorecs duplicate handling, description question - Added by Oleg _ almost 5 years ago
Em Smith wrote:
They look reasonable to me.
However, I've been thinking that it may have problems in the future with some series. For example, if you have a recording rule for "CSI" then I think your scripts will correctly alter "CSI 3, odc. 5" to be "CSI", but would also change "CSI: Miami" to be "CSI".
It is a very valid point. I will keep it in mind when scheduling recordings and, in case of any problems, check the original EPG entries and compare them to the amended ones.
Thank you for all the help!