Can not see PID's in .ts file
Added by Anonymous over 12 years ago
Hi,
I am currently trying to use a post-processing script that uses ProjectX to extract teletext subtitles into a text format. I am using the tvheadend build that allows .ts recording, and for some reason the different PIDs in the file don't get identified the way they are in recordings made with other apps such as MythTV or DVBViewer. My script needs to be able to identify the teletext PID in order to work, even though a manual run of ProjectX can extract the subtitles.
Any way around this?
Thanx
Dave
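One way to check which PIDs are physically present in the recording, independent of ProjectX, is to read the transport stream packet headers directly. The following is a minimal sketch in Python 3, assuming the file consists of plain 188-byte packets and starts on a packet boundary. `count_pids` and `make_packet` are hypothetical helper names, not part of the script or of tvheadend, and the sketch only lists PIDs; it does not say which one carries teletext.

```python
from collections import Counter

TS_PACKET_SIZE = 188  # fixed MPEG transport stream packet length (assumed)
SYNC_BYTE = 0x47      # first byte of every transport packet


def count_pids(data: bytes) -> Counter:
    """Count packets per PID in a buffer of whole 188-byte TS packets."""
    counts = Counter()
    for offset in range(0, len(data) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        packet = data[offset:offset + TS_PACKET_SIZE]
        if packet[0] != SYNC_BYTE:
            continue  # skip packets that are out of sync
        # The PID is 13 bits: the low 5 bits of byte 1, then all of byte 2.
        pid = ((packet[1] & 0x1F) << 8) | packet[2]
        counts[pid] += 1
    return counts


def make_packet(pid: int) -> bytes:
    """Build a dummy packet for a PID (demo helper, payload is padding)."""
    header = bytes([SYNC_BYTE, (pid >> 8) & 0x1F, pid & 0xFF, 0x10])
    return header + b"\xff" * (TS_PACKET_SIZE - len(header))


if __name__ == "__main__":
    # Synthetic data stands in for the bytes of a real recording.
    sample = make_packet(0x0000) + make_packet(0x0100) + make_packet(0x0100)
    for pid, n in sorted(count_pids(sample).items()):
        print(hex(pid), n)
```

For a real file, the bytes would come from something like `open(path, "rb").read()`. If the teletext PID shows up here but ProjectX does not report it, the stream data itself is probably present and the problem is more likely in how the stream is announced.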
Replies (4)
RE: Can not see PID's in .ts file - Added by Hein Rigolo over 12 years ago
The .ts recording branch for tvheadend is not supported at the moment. But if you use the mkv recording option, the teletext subtitle stream is already available as a simple ASCII text stream in the mkv file. You could just extract that and re-process it for conversion to srt. Or you could take the S_DVB_SUB data stream for the DVB subtitles and use OCR and other script magic to convert that to an srt subtitle file.
Hein
RE: Can not see PID's in .ts file - Added by Anonymous over 12 years ago
I didn't realise you could extract the teletext subtitle stream from mkv. Is it in the same format as an .srt with the timestamps and everything?
I suppose the script could be altered to use a program other than ProjectX to get them, but I wouldn't know how :(. Also, the script cleans up the text to remove the word duplicates and unnecessary spacing that you get from live subtitles. All this needs to be done so an app I have can read them more easily.
The script can also perform OCR, but I don't want to use that feature as it takes up too many system resources and is prone to error.
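The clean-up of live subtitles mentioned here can be illustrated with a small sketch. It is written in Python 3 and only mirrors the general idea (drop text at the end of one cue that the next cue repeats, and collapse runs of spaces). `trim_overlap` and `collapse_spaces` are hypothetical names chosen for the example, and the sample strings are made up.

```python
def trim_overlap(previous: str, current: str) -> str:
    """Return `previous` without the trailing text that `current` repeats."""
    overlap = 0
    for i in range(1, min(len(previous), len(current)) + 1):
        # Compare the last i characters of the old cue with the
        # first i characters of the new one; keep the longest match.
        if previous[-i:] == current[:i]:
            overlap = i
    return previous[:len(previous) - overlap]


def collapse_spaces(text: str) -> str:
    """Replace every run of whitespace with a single space and trim the ends."""
    return " ".join(text.split())


if __name__ == "__main__":
    first = "the  quick   brown"
    second = "brown fox jumps"
    cleaned = trim_overlap(collapse_spaces(first), collapse_spaces(second))
    print(repr(cleaned))
```

Matching on raw characters is simple but can cut inside a word when a cue happens to end with the letters the next one starts with; comparing whole words instead would be a stricter variant.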
RE: Can not see PID's in .ts file - Added by Anonymous over 12 years ago
Here is a copy of the code. Any advice / tips on how to get the thing to work with TVheadend (just for teletext subtitles not the mentioned DVB subpict ones) would be greatly appreciated.
#!/usr/bin/python
# coding: iso8859-1
"""
uksub2srt
-----------
This script converts .son bitmap subtitles to the text-based .srt
format. It can be used to process subtitle files generated with the ProjectX
DVB demuxing tool. With a good symbol database the number of errors can easily
be under 10 in a two-hour movie. However, corrupted subtitle images will cause
problems.
Requires:
  - Custom gocr 0.48

Usage: uksub2srt.py [-v] [-d path [-a] [-c]] [-s px] -i sub.son -o sub.srt

Where:
  -d path      Force use of a particular database. Not recommended.
  -a           Build a new database or add data to an existing one. The
               user will be prompted for new symbols. This is not needed
               for normal operation.
  -c           Correct usual mistakes made by gocr. Because the
               filter rules depend on subtitle font and language, they
               should be in file correct.py in the database directory.
               Recommended.
  -s <px>      Space width. Defaults to 11 for UK DVB subtitles
  -i sub.son   The input SON file.
  -o sub.srt   Output SRT.
  -n           Do not delete intermediate files
  -v           Print some info to standard out
  -h, --help   Show this information.

Example: python uksub2srt.py -i sub.son -o sub.srt
"""
# CONFIGURATION ############################################################
# Used as constants
GOCR = "uksubgocr"
CORRECTION_MODULE = "correct"
FAILURE_DIR = "failure1"
FAIL_CHAR = '^'
# Globals, used to store cmd-line arguments
db = None
dbdir = "./db/"
builddb = False
space_width = 11
verbose = False
correction = False
correct = None
sondir = "./"
nodelete = False
mergeduplicates = True
utils = None
##############################################################################
import sys
import getopt
import os
import re
import subprocess
import colorsys
import pdb
import shutil
import logging
import time
import hashlib
import shlex
import glob
import tempfile
from PIL import Image, ImageChops
from operator import itemgetter
# Global constants
hist_re = re.compile(r"^ 00([\da-f]{2}) 00([\da-f]{2}) 00([\da-f]{2})")
son_re = re.compile(r"^(\d+)\s+(\d{2}):(\d{2}):(\d{2}):(\d{2})"
                    r"\s+(\d{2}):(\d{2}):(\d{2}):(\d{2})\s+(.+)$")
NOSTATE = 0
VIDEO = 1
AUDIO = 2
TELETEXT = 3
SUBPICT = 4
NUMBER = 0
TIMESTAMP = 1
TEXT = 2
# Exceptions
class ExternalError(Exception):
    def __init__(self, cmd, err, stdout, stderr, returncode):
        self.cmd = cmd
        self.err = err
        self.stdout = stdout
        self.stderr = stderr
        self.returncode = returncode

    def __str__(self):
        return "Command: %s\nError: %s\nStdout:\n%s\nStderr:\n%s\nReturn: %d\n" %\
            (self.cmd, self.err, self.stdout, self.stderr, self.returncode)
###########################
# Records
class OCROutput():
    def __init__(self, start, end, lines):
        self.start = start
        self.end = end
        self.lines = lines
# Utility functions
class Utils:
    def __init__(self, verbose, remove_files):
        self.verbose = verbose
        self.remove_files = remove_files
        self.script_dir = None
        self.command_times = {}

    def get_script_dir(self):
        if not self.script_dir:
            self.script_dir = sys.path[0]
        return self.script_dir

    def run_command_real(self, cmd, err, display=False):
        logging.info("Running %s" % cmd)
        if self.verbose:
            print cmd
        if display:
            proc = subprocess.Popen(cmd, shell=True)
            proc.wait()
        else:
            proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                    stderr=subprocess.PIPE, shell=True)
        stdout, stderr = proc.communicate()
        if proc.returncode != 0:
            raise ExternalError(cmd, err, stdout, stderr, proc.returncode)

    def run_command(self, cmd, err, display=False):
        time_start = time.time()
        self.run_command_real(cmd, err, display)
        time_end = time.time()
        k = cmd.split(' ')[0]
        if not self.command_times.has_key(k):
            self.command_times[k] = []
        self.command_times[k].append(time_end - time_start)

    def get_output_real(self, cmd, err):
        logging.info("Getting output of %s" % cmd)
        if self.verbose:
            print cmd
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE, shell=True)
        stdout, stderr = proc.communicate()
        if proc.returncode != 0:
            raise ExternalError(cmd, err, stdout, stderr, proc.returncode)
        return stdout

    def get_output(self, cmd, err):
        time_start = time.time()
        output = self.get_output_real(cmd, err)
        time_end = time.time()
        k = cmd.split(' ')[0]
        if not self.command_times.has_key(k):
            self.command_times[k] = []
        self.command_times[k].append(time_end - time_start)
        return output

    def log_command_timings(self):
        for cmd in self.command_times:
            avg = sum(self.command_times[cmd])
            logging.warn("Command %s took %f in total" % (cmd, avg))

    def rm(self, fname):
        logging.debug("Removing %s" % fname)
        if self.remove_files:
            try:
                os.remove(fname)
            except:
                logging.error("Failed to delete: %s" % fname)

    def v_print(self, s):
        if self.verbose:
            print s
def load_correct(db):
    sys.path.append(db)
    try:
        mod = __import__(CORRECTION_MODULE, globals(), locals(), \
                         ["correct"])
        return getattr(mod, "correct")
    except:
        logging.critical("Error: Unable to load correction filter " + \
                         db + "/" + CORRECTION_MODULE + ".py\n")
        logging.shutdown()
        sys.exit(1)

def get_correct(db):
    if not get_correct.cache.has_key(db):
        get_correct.cache[db] = load_correct(db)
    return get_correct.cache[db]
get_correct.cache = {}

def save_failures(files):
    for f in files:
        shutil.copy(f, FAILURE_DIR)
def getTsPids(file):
    pids = {}
    streams = utils.get_output('projectx -demux -id 0 "%s"' % file,
                               'Error running projectx: check it is on your path')
    streams = streams.split('\n')
    state = NOSTATE
    for line in streams:
        line = line.rstrip()
        m = re.search("PID: (0x[\dABCDEF]+)", line)
        if m:
            if state == VIDEO:
                pids['video'] = m.group(1)
            elif state == AUDIO:
                pids['audio'] = m.group(1)
            elif state == TELETEXT:
                pids['teletext'] = m.group(1)
            elif state == SUBPICT:
                pids['subpict'] = m.group(1)
            state = NOSTATE
        if line == "Video:":
            state = VIDEO
        elif line == "Audio:":
            state = AUDIO
        elif line == "Teletext:":
            state = TELETEXT
        elif line == "Subpict.:":
            state = SUBPICT
    return pids
class HashStore:
    def __init__(self):
        self.store = {}
        self.hits = 0

    def get_hash(self, im):
        m = hashlib.md5()
        m.update(im.tostring())
        # Hash the palette contents; call getpalette() rather than
        # stringifying the bound method
        m.update(str(im.getpalette()))
        return m.digest()

    def add(self, h, data):
        self.store[h] = data

    def check(self, h):
        if self.store.has_key(h):
            self.hits += 1
            return self.store[h]
        else:
            return None

hashstore = HashStore()
###############################
# Main function to parse a line
# of the .son file
def convertline(line): # parse .son times and the bmp file name
    m = son_re.search(line)
    if m:
        start = m.group(2) + ":" + m.group(3) + ":" + m.group(4) + \
                "," + m.group(5) + "0"
        end = m.group(6) + ":" + m.group(7) + ":" + m.group(8) + \
              "," + m.group(9) + "0"
        bmpfile = (sondir + m.group(10)).rstrip()
        text = ""
        db = None
        bmpfileq = '"%s"' % bmpfile
        ppmfile = '%s.ppm' % bmpfile
        ppmfileq = '"%s"' % ppmfile
        im = Image.open(bmpfile)
        assert im.mode == "P"
        width = im.size[0]
        height = im.size[1]
        factor = 0
        for x in range(30,40,2):
            if height % x == 0:
                height /= x
                factor = x
                break
        if (height > 10):
            height = 1
        linetexts = []
        for i in range(height):
            ppmline = '%s.line%d.ppm' % (bmpfile, i)
            ppmlineq = '"%s"' % ppmline
            if factor: # Multiple lines
                # TODO: Copy seems to be necessary, not really sure why
                # Not a bottleneck though..
                im_line = im.crop((0, i*factor, width, (i+1)*factor-1)).copy()
            else:
                im_line = im
            # Test if hashes match
            linehash = hashstore.get_hash(im_line)
            hashtext = hashstore.check(linehash)
            if hashtext:
                linetexts.append(hashtext)
                continue
            db = flatten(im_line)
            linetext = ""
            if db:
                im_line_rgb = im_line.convert("RGB")
                im_line_rgb.save(ppmline)
                linetext = ocr(ppmline, db)
                if FAIL_CHAR in linetext or linetext == "":
                    logging.warning("ocr(%s, %s) contained an error or no output" % (ppmline, db))
                    save_failures([ppmline, bmpfile])
                utils.rm(ppmline)
                hashstore.add(linehash, linetext)
                linetexts.append(linetext)
            else:
                logging.warning("flatten_im(%s) failed" % ppmline)
                save_failures([bmpfile])
        return OCROutput(start, end, linetexts)
    else:
        return None
def flatten(im):
    # TODO: Could cache resulting image transformation and apply to those
    # images with the same palette
    p = im.getpalette()
    colours = im.getcolors()
    hues = {}
    for _, c in colours:
        r,g,b = p[c*3], p[c*3+1], p[c*3+2]
        if r == 0 and g == 0 and b == 0:
            # Change black to white
            p[c*3] = 255
            p[c*3+1] = 255
            p[c*3+2] = 255
        elif r == 0 and g == 0 and b == 96:
            # Change dark blue to white
            p[c*3] = 255
            p[c*3+1] = 255
            p[c*3+2] = 255
        elif r == 31 and g == 31 and b == 31:
            # Change dark grey to white
            p[c*3] = 255
            p[c*3+1] = 255
            p[c*3+2] = 255
        else:
            h,s,v = colorsys.rgb_to_hsv(float(r)/255,
                                        float(g)/255,
                                        float(b)/255)
            huename = "%.1f" % h
            if hues.has_key(huename):
                hues[huename].append((c, v))
            else:
                hues[huename] = [(c, v)]
    # Check we can deal with image
    if len(hues) < 1:
        logging.error("Image with 0 hues")
        return None
    firsthuelen = len(hues.values()[0])
    if firsthuelen == 4:
        cutoff = 2
        db = "%s/db5_3" % utils.get_script_dir()
    elif firsthuelen == 6:
        cutoff = 3
        db = "%s/db7_4" % utils.get_script_dir()
    else:
        logging.error("Image with hues not length 4 or 6 (one of %d)" % firsthuelen)
        logging.error(str(hues))
        return None
    for h in hues.values():
        if not len(h) == firsthuelen:
            logging.error("Image with different hue lengths (%d and %d)" % (len(h), firsthuelen))
            return None
    for huecolours in hues.values():
        huecolours.sort(key=itemgetter(1))
        for i, (c, _) in enumerate(huecolours):
            if i < cutoff: # Change to white (background)
                p[c*3] = 255
                p[c*3+1] = 255
                p[c*3+2] = 255
            else: # Change to black (text)
                p[c*3] = 0
                p[c*3+1] = 0
                p[c*3+2] = 0
    im.putpalette(p)
    return db
def ocr(file, db):
    ocrparam = 2 + 8 + 256 + (128 if builddb else 0)
    if db: # Use that specified on the cmdline
        db = dbdir
    if not os.access("%s/db.lst" % db, os.F_OK):
        os.mkdir(db)
        open("%s/db.lst" % db, "a").close()
    fileq = '"%s"' % file
    txtfile = '%s.txt' % file
    txtfileq = '"%s"' % txtfile
    cmd = '%s -m %d -s %d -u "%c" -a 95 -d 0 -p "%s/" -i %s -o %s' % \
          (GOCR, ocrparam, space_width, FAIL_CHAR, db, fileq, txtfileq)
    if builddb:
        returncode = subprocess.call(shlex.split(cmd))
    else:
        devnull = open(os.devnull, 'a+')
        returncode = subprocess.call(shlex.split(cmd),
                                     stdout=devnull,
                                     stderr=subprocess.STDOUT)
        devnull.close()
    text = ""
    try:
        textfile = open(file + ".txt", "r")
        text = textfile.read().rstrip()
        textfile.close()
    except:
        logging.warning("Unable to read OCR results from " + file + ".txt\n")
    utils.rm(txtfile)
    return text
def mergeLive(prevoutput, ocroutput):
    outputnow = True
    if mergeduplicates:
        if ocroutput.start == prevoutput.start:
            outputnow = False
            if ocroutput.lines[0] == prevoutput.lines[0]:
                prevoutput.lines += ocroutput.lines[1:]
            else:
                prevoutput.lines += ocroutput.lines
            prevoutput.end = ocroutput.end
        else:
            prevtext = " ".join(prevoutput.lines).rstrip()
            ocrtext = " ".join(ocroutput.lines).rstrip()
            overlap = 0
            for i in range(1, len(prevtext)+1):
                if prevtext[-i:] == ocrtext[:i]:
                    overlap = i
            if overlap != 0:
                prevoutput.lines = [prevtext[:-overlap]]
    return prevoutput, outputnow

def outputOCR(f, ocroutput):
    if len(ocroutput.lines) > 0 and \
       any(map(lambda s: len(s) != 0, ocroutput.lines)):
        for l in ocroutput.lines:
            if '\n' in l:
                logging.warning("\\n found at %s: %s" % (ocroutput.start, l))
        f.write("%s --> %s: %s\n" % \
                (ocroutput.start, ocroutput.end, (" ".join(ocroutput.lines).rstrip().replace("\n", " "))))
def extractTeletextSubtitles(mpgfile, pids, output):
    curdir = os.getcwd()
    os.chdir(tempfile.gettempdir())
    inifile = os.path.join(tempfile.gettempdir(), "X.ini")
    f = open(inifile, "w")
    f.write("SubtitlePanel.TtxPage1=888\n")
    f.write("SubtitlePanel.SubtitleExportFormat=srt\n")
    f.close()
    utils.run_command('projectx "%s" -log -ini %s -out %s ' % (mpgfile, inifile, tempfile.gettempdir()) + \
                      "-demux -id %s " % (pids['teletext']),
                      "Error running projectx: check it is on your path",
                      display=verbose)
    mpgroot, _ = os.path.splitext(os.path.basename(mpgfile))
    telesrt = os.path.join(tempfile.gettempdir(), mpgroot + "[888].srt")
    f = open(telesrt)
    g = open(output, "w")
    state = NUMBER
    prevoutput = OCROutput("", "", [])
    for line in f.xreadlines():
        line = line.rstrip()
        if state == NUMBER:
            state = TIMESTAMP
        elif state == TIMESTAMP:
            ocroutput = OCROutput("", "", [])
            ocroutput.start, ocroutput.end = line.split(" --> ")
            ocroutput.lines = []
            state = TEXT
        elif state == TEXT:
            if line == "":
                state = NUMBER
                prevoutput, outputnow = mergeLive(prevoutput, ocroutput)
                if outputnow:
                    outputOCR(g, prevoutput)
                    prevoutput = ocroutput
            else:
                ocroutput.lines.append(line)
    outputOCR(g, prevoutput)
    g.close()
    f.close()
    os.chdir(curdir)
    return [inifile, os.path.join(tempfile.gettempdir(), mpgroot + "[888].srt")]
def extractDVBSubtitles(mpgfile, pids):
    curdir = os.getcwd()
    os.chdir(tempfile.gettempdir())
    inifile = os.path.join(tempfile.gettempdir(), "X.ini")
    f = open(inifile, "w")
    f.write("SubtitlePanel.SubtitleExportFormat=SON\n")
    f.write("SubtitlePanel.SubtitleExportFormat_2=null\n")
    f.close()
    utils.run_command('projectx "%s" -log -ini %s -out . ' % (mpgfile, inifile) + \
                      "-demux -id %s " % (pids['subpict']),
                      "Error running projectx: check it is on your path",
                      display=verbose)
    os.chdir(curdir)
    mpgroot, _ = os.path.splitext(mpgfile)
    return [inifile, mpgroot + ".spf", mpgroot + ".son"] + glob.glob(mpgroot + "*.bmp")
def convertSonFile(input, output):
    try:
        sonfile = open(input, "r")
        srtfile = open(output, "w")
        prevoutput = OCROutput("", "", [])
        for line in sonfile.xreadlines():
            try:
                ocroutput = convertline(line)
            except ExternalError as e:
                logging.error(str(e))
            if ocroutput:
                prevoutput, outputnow = mergeLive(prevoutput, ocroutput)
                # Output
                if outputnow:
                    outputOCR(srtfile, prevoutput)
                    prevoutput = ocroutput
        outputOCR(srtfile, prevoutput)
    except IOError, e:
        logging.critical("Fatal IO Error: " + str(e))
        logging.shutdown()
        raise
def main():
    # parse command line options
    try:
        opts, args = getopt.getopt(sys.argv[1:], "hvd:acns:i:o:", ["help", "no-merge"])
    except getopt.error, msg:
        print msg
        print "Parameter --help prints usage information."
        sys.exit(2)
    input = ""
    output = ""
    global db
    global dbdir
    global builddb
    global correction
    global correct
    global space_width
    global verbose
    global nodelete
    global mergeduplicates
    # process options
    for o, a in opts:
        if o in ("-h", "--help"):
            print __doc__
            sys.exit(0)
        if o in ("-d"):
            db = True
            dbdir = a + "/"
        if o in ("-a"):
            builddb = True
        if o in ("-c"):
            correction = True
        if o in ("-s"):
            # getopt returns strings; the gocr command line formats this with %d
            space_width = int(a)
        if o in ("-i"):
            input = a
        if o in ("-n"):
            nodelete = True
        if o in ("--no-merge"):
            mergeduplicates = False
        if o in ("-o"):
            output = os.path.abspath(a)
        if o in ("-v"):
            verbose = True
    global utils
    utils = Utils(verbose, not nodelete)
    start = time.time()
    global sondir
    sondir = os.path.dirname(input) + "/"
    if sondir == "/":
        sondir = "./"
    os.chdir(sondir)
    if not os.path.exists(FAILURE_DIR):
        os.makedirs(FAILURE_DIR)
    logfile = "%s/%s.txt" % (FAILURE_DIR, os.path.basename(input))
    logging.basicConfig(filename=logfile,
                        format="%(asctime)s - %(levelname)s - %(message)s",
                        level=logging.INFO)
    logging.info("Starting subtitle conversion")
    logging.info("Version: $Revision: 519 $")
    logging.info("Options: %s" % str(sys.argv))
    logging.info("Input: %s" % input)
    logging.info("Output: %s" % output)
    inroot, ext = os.path.splitext(input)
    cruft = []
    if len(input) > 0 and len(output) > 0:
        if ext == ".ts" or ext == ".mpg":
            logging.info("Input is a video file, extracting")
            try:
                pids = getTsPids(input)
                cruft.append(inroot + "_log.txt")
                logging.info("Pids: %s" % str(pids))
                if pids.has_key('teletext'):
                    logging.info("Teletext subtitles found")
                    cruft.extend(extractTeletextSubtitles(input, pids, output))
                elif pids.has_key('subpict'):
                    logging.info("DVB subtitles found")
                    cruft.extend(extractDVBSubtitles(input, pids))
                    convertSonFile(inroot + ".son", output)
                else:
                    logging.error("No subtitles found!")
            except ExternalError as e:
                logging.error(str(e))
        elif ext == ".son":
            logging.info("Input is a subtitle file")
            convertSonFile(input, output)
    else:
        print "Failed to parse parameters. Try --help."
    for f in cruft:
        utils.rm(f)
    utils.log_command_timings()
    logging.info("Hash hits: %d" % hashstore.hits)
    logging.info("Completed subtitle conversion")
    end = time.time()
    logging.info("Total time: %fs" % (end - start))
    logging.shutdown()
if __name__ == "__main__":
    try:
        main()
    except Exception as e:
        logging.critical("uncaught exception: %s" % str(e))
        logging.shutdown()
        raise
RE: Can not see PID's in .ts file - Added by Hein Rigolo over 12 years ago
dave c wrote:
I didn't realise you could extract the teletext subtitle stream from mkv. Is it in the same format as an .srt with the timestamps and everything?
I suppose the script could be altered to use a program other than ProjectX to get them, but I wouldn't know how :(. Also, the script cleans up the text to remove the word duplicates and unnecessary spacing that you get from live subtitles. All this needs to be done so an app I have can read them more easily.
The script can also perform OCR, but I don't want to use that feature as it takes up too many system resources and is prone to error.
Just make a recording of a show with teletext subtitles and have a look at the recorded .mkv and the text stream in there to see how it is structured and what you can do with it. I do not think the script can be adjusted as easily as you think.
The stream is not in .srt format, but based on the timestamps in the .mkv you can probably convert it and write it out as an .srt file.
Hein
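The conversion step suggested here could look roughly like the following Python 3 sketch. It assumes the subtitle text and its timestamps have already been pulled out of the .mkv as `(start_ms, end_ms, text)` tuples; how tvheadend lays out the text stream inside the file is not covered, and `srt_timestamp` and `to_srt` are hypothetical names used only for this example.

```python
def srt_timestamp(ms: int) -> str:
    """Format a time in milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)


def to_srt(cues) -> str:
    """Render (start_ms, end_ms, text) tuples as numbered SRT entries."""
    blocks = []
    for index, (start_ms, end_ms, text) in enumerate(cues, start=1):
        blocks.append("%d\n%s --> %s\n%s\n" % (
            index, srt_timestamp(start_ms), srt_timestamp(end_ms), text))
    # Entries are separated by one blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    # Made-up cues standing in for data read from a recording.
    demo = [(0, 1500, "Hello"), (2000, 3000, "World")]
    print(to_srt(demo))
```

Any clean-up of live-subtitle duplicates would happen on the tuples before they are passed to `to_srt`.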