Can not see PID's in .ts file

Added by Anonymous about 13 years ago

Hi,

I am currently trying to use a post-processing script that uses ProjectX to extract teletext subtitles into a text format. I am using the tvheadend build that allows .ts recording, and for some reason the different PIDs in the file don't get identified the way they are when the file is recorded with other apps such as MythTV or DVBViewer. My script needs to be able to identify the teletext PID in order to work, even though a manual run of ProjectX does let you extract the subtitles.

Any way around this?

Thanks
Dave
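
For reference, every 188-byte transport stream packet carries its PID in the header, so the PIDs present in a recording can be listed without ProjectX. The following is a minimal sketch in Python 3; the helper names `count_pids` and `make_packet` are made up for illustration, and the sample data is synthetic. It only shows which PIDs occur. It does not say which one is teletext, because that mapping lives in the PAT/PMT tables, which this sketch does not parse.

```python
from collections import Counter

TS_PACKET_SIZE = 188   # fixed MPEG-TS packet length in bytes
SYNC_BYTE = 0x47       # first byte of every TS packet


def count_pids(data):
    """Count packets per PID in a buffer of aligned 188-byte TS packets."""
    counts = Counter()
    for offset in range(0, len(data) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        packet = data[offset:offset + TS_PACKET_SIZE]
        if packet[0] != SYNC_BYTE:
            continue  # skip packets that are out of sync
        # The PID is 13 bits: low 5 bits of byte 1 plus all of byte 2.
        pid = ((packet[1] & 0x1F) << 8) | packet[2]
        counts[pid] += 1
    return counts


def make_packet(pid):
    """Build a dummy TS packet with the given PID (illustration only)."""
    header = bytes([SYNC_BYTE, (pid >> 8) & 0x1F, pid & 0xFF, 0x10])
    return header + bytes(TS_PACKET_SIZE - len(header))


# Synthetic sample: one packet on PID 0x0000 (PAT), two on PID 0x0100.
sample = make_packet(0x0000) + make_packet(0x0100) + make_packet(0x0100)
pid_counts = count_pids(sample)
for pid, n in sorted(pid_counts.items()):
    print("PID 0x%04X: %d packets" % (pid, n))
```

On a real recording you would read the file in chunks that are a multiple of 188 bytes and feed them to `count_pids`.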

Replies (4)

RE: Can not see PID's in .ts file - Added by Hein Rigolo about 13 years ago

The .ts recording branch of tvheadend is not supported at the moment. But if you use the mkv recording option, the teletext subtitle stream is already available as a simple ASCII text stream in the mkv file. You could just extract that and re-process it for conversion to srt. Or you could take the S_DVB_SUB data stream for the DVB subtitles and use OCR and other script magic to convert that to an srt subtitle file.

Hein
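
As a rough illustration of the "re-process it for conversion to srt" step, here is a minimal Python 3 sketch. It assumes the text cues have already been pulled out of the mkv by some other tool and are available as `(start_ms, end_ms, text)` tuples; that tuple layout and the names `srt_timestamp` and `cues_to_srt` are assumptions made for this example, not something tvheadend provides.

```python
def srt_timestamp(ms):
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(ms, 3600000)
    minutes, rest = divmod(rest, 60000)
    seconds, millis = divmod(rest, 1000)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, millis)


def cues_to_srt(cues):
    """Turn (start_ms, end_ms, text) tuples into the text of an .srt file."""
    blocks = []
    for index, (start_ms, end_ms, text) in enumerate(cues, start=1):
        blocks.append("%d\n%s --> %s\n%s\n" % (
            index, srt_timestamp(start_ms), srt_timestamp(end_ms),
            text.strip()))
    # SRT entries are separated by a blank line.
    return "\n".join(blocks)


# Hypothetical cues standing in for what was extracted from the mkv.
cues = [(1500, 4000, "Good evening."), (4200, 7250, "Here is the news.")]
srt_text = cues_to_srt(cues)
print(srt_text)
```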

RE: Can not see PID's in .ts file - Added by Anonymous about 13 years ago

I didn't realise you could extract the teletext subtitle stream from mkv. Is it in the same format as an .srt, with the timestamps and everything?

I suppose the script could be altered to use a program other than ProjectX to get them, but I wouldn't know how :(. Also, the script cleans up the text to remove the word duplicates and unnecessary spacing that you get from live subtitles. All this needs to be done so an app I have can read them more easily.

The script can also perform OCR, but I don't want to use that feature as it takes up too many system resources and is prone to error.
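
The clean-up described above (live subtitles repeating the tail of the previous cue, plus stray spacing) can be sketched roughly as follows in Python 3. This is in the spirit of the script's `mergeLive` step but is not the same algorithm; `merge_overlap` and `tidy` are names invented for the example.

```python
def tidy(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return " ".join(text.split())


def merge_overlap(previous, current):
    """Join two cues, dropping text at the start of `current` that
    repeats the end of `previous` (longest overlap wins)."""
    max_len = min(len(previous), len(current))
    for size in range(max_len, 0, -1):
        if previous.endswith(current[:size]):
            return previous + current[size:]
    # No overlap at all: just join with a space.
    return (previous + " " + current).strip()


merged = merge_overlap(tidy("the   weather  today"),
                       tidy("today will be  sunny"))
print(merged)
```

Matching on characters rather than whole words is a simplification; a word-based comparison would avoid accidental matches on a single shared letter.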

RE: Can not see PID's in .ts file - Added by Anonymous about 13 years ago

Here is a copy of the code. Any advice / tips on how to get the thing to work with TVheadend (just for teletext subtitles not the mentioned DVB subpict ones) would be greatly appreciated.

#!/usr/bin/python
# coding: iso8859-1

"""
uksub2srt
-----------

This script converts .son bitmap subtitles to the text-based .srt
format. It can be used to process subtitle files generated with the ProjectX
DVB demuxing tool. With a good symbol database the number of errors can easily
be under 10 in a two-hour movie. However, corrupted subtitle images will cause
problems.

Requires:
- Custom gocr 0.48

Usage: uksub2srt.py [-v] [-d path [-a] [-c]] [-s px] [-n] -i sub.son -o sub.srt
Where:
    -d path     Force use of a particular database. Not recommended.

    -a          Build a new database or add data to an existing one. The
                user will be prompted for new symbols. This is not needed
                for normal operation.

    -c          Correct usual mistakes made by gocr. Because the
                filter rules depend on subtitle font and language, they
                should be in file correct.py in the database directory.
                Recommended.

    -s px       Space width. Defaults to 11 for UK DVB subtitles

    -i sub.son  The input SON file.

    -o sub.srt  Output SRT.

    -n          Do not delete intermediate files

    -v          Print some info to standard out

    -h, --help  Show this information.

Example: python uksub2srt.py -i sub.son -o sub.srt

"""

# CONFIGURATION ############################################################

# Used as constants
GOCR = "uksubgocr"
CORRECTION_MODULE = "correct"
FAILURE_DIR = "failure1"
FAIL_CHAR = '^'

# Globals, used to store cmd-line arguments
db = None
dbdir = "./db/"
builddb = False
space_width = 11
verbose = False
correction = False
correct = None
sondir = "./"
nodelete = False
mergeduplicates = True

utils = None

##############################################################################

import sys
import getopt
import os
import re
import subprocess
import colorsys
import pdb
import shutil
import logging
import time
import hashlib
import shlex
import glob
import tempfile
from PIL import Image, ImageChops
from operator import itemgetter

###########################

# Global constants

hist_re = re.compile(r"^ 00([\da-f]{2}) 00([\da-f]{2}) 00([\da-f]{2})")
son_re = re.compile(r"^(\d+)\s+(\d{2}):(\d{2}):(\d{2}):(\d{2})"
r"\s+(\d{2}):(\d{2}):(\d{2}):(\d{2})\s+(.+)$")

NOSTATE = 0
VIDEO = 1
AUDIO = 2
TELETEXT = 3
SUBPICT = 4

NUMBER = 0
TIMESTAMP = 1
TEXT = 2

###########################

# Exceptions

class ExternalError(Exception):
    def __init__(self, cmd, err, stdout, stderr, returncode):
        self.cmd = cmd
        self.err = err
        self.stdout = stdout
        self.stderr = stderr
        self.returncode = returncode

    def __str__(self):
        return "Command: %s\nError: %s\nStdout:\n%s\nStderr:\n%s\nReturn: %d\n" %\
            (self.cmd, self.err, self.stdout, self.stderr, self.returncode)

###########################

# Records

class OCROutput():
    def __init__(self, start, end, lines):
        self.start = start
        self.end = end
        self.lines = lines

###########################

# Utility functions

class Utils:

    def __init__(self, verbose, remove_files):
        self.verbose = verbose
        self.remove_files = remove_files
        self.script_dir = None
        self.command_times = {}

    def get_script_dir(self):
        if not self.script_dir:
            self.script_dir = sys.path[0]
        return self.script_dir

    def run_command_real(self, cmd, err, display=False):
        logging.info("Running %s" % cmd)
        if self.verbose:
            print cmd
        if display:
            # Let the command write straight to the terminal
            proc = subprocess.Popen(cmd, shell=True)
            proc.wait()
        else:
            proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                    stderr=subprocess.PIPE, shell=True)
            stdout, stderr = proc.communicate()

            if proc.returncode != 0:
                raise ExternalError(cmd, err, stdout, stderr, proc.returncode)

    def run_command(self, cmd, err, display=False):
        time_start = time.time()
        self.run_command_real(cmd, err, display)
        time_end = time.time()

        # Record the elapsed time under the program name
        k = cmd.split(' ')[0]
        if not self.command_times.has_key(k):
            self.command_times[k] = []
        self.command_times[k].append(time_end - time_start)

    def get_output_real(self, cmd, err):
        logging.info("Getting output of %s" % cmd)
        if self.verbose:
            print cmd
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE, shell=True)
        stdout, stderr = proc.communicate()
        if proc.returncode != 0:
            raise ExternalError(cmd, err, stdout, stderr, proc.returncode)
        return stdout

    def get_output(self, cmd, err):
        time_start = time.time()
        output = self.get_output_real(cmd, err)
        time_end = time.time()

        k = cmd.split(' ')[0]
        if not self.command_times.has_key(k):
            self.command_times[k] = []
        self.command_times[k].append(time_end - time_start)
        return output

    def log_command_timings(self):
        for cmd in self.command_times:
            total = sum(self.command_times[cmd])
            logging.warn("Command %s took %f in total" % (cmd, total))

    def rm(self, fname):
        logging.debug("Removing %s" % fname)

        if self.remove_files:
            try:
                os.remove(fname)
            except OSError:
                logging.error("Failed to delete: %s" % fname)

    def v_print(self, s):
        if self.verbose:
            print s

def load_correct(db):
    sys.path.append(db)
    try:
        mod = __import__(CORRECTION_MODULE, globals(), locals(),
                         ["correct"])
        return getattr(mod, "correct")
    except Exception:
        logging.critical("Error: Unable to load correction filter " +
                         db + "/" + CORRECTION_MODULE + ".py\n")
        logging.shutdown()
        sys.exit(1)

def get_correct(db):
    # Cache the loaded filter per database directory
    if not get_correct.cache.has_key(db):
        get_correct.cache[db] = load_correct(db)
    return get_correct.cache[db]
get_correct.cache = {}

def save_failures(files):
    for f in files:
        shutil.copy(f, FAILURE_DIR)

def getTsPids(file):
    # Ask ProjectX for its stream summary and pick out the PID listed
    # under each "Video:", "Audio:", "Teletext:" and "Subpict.:" heading.
    pids = {}
    streams = utils.get_output('projectx -demux -id 0 "%s"' % file,
                               'Error running projectx: check it is on your path')
    streams = streams.split('\n')
    state = NOSTATE
    for line in streams:
        line = line.rstrip()
        m = re.search(r"PID: (0x[\dABCDEF]+)", line)
        if m:
            if state == VIDEO:
                pids['video'] = m.group(1)
            elif state == AUDIO:
                pids['audio'] = m.group(1)
            elif state == TELETEXT:
                pids['teletext'] = m.group(1)
            elif state == SUBPICT:
                pids['subpict'] = m.group(1)

            state = NOSTATE

        if line == "Video:":
            state = VIDEO
        elif line == "Audio:":
            state = AUDIO
        elif line == "Teletext:":
            state = TELETEXT
        elif line == "Subpict.:":
            state = SUBPICT

    return pids

class HashStore:

    def __init__(self):
        self.store = {}
        self.hits = 0

    def get_hash(self, im):
        m = hashlib.md5()
        m.update(im.tostring())
        # Hash the palette contents (call the method, not its repr)
        m.update(str(im.getpalette()))
        return m.digest()

    def add(self, h, data):
        self.store[h] = data

    def check(self, h):
        if self.store.has_key(h):
            self.hits += 1
            return self.store[h]
        else:
            return None

hashstore = HashStore()

###############################

# Main function to parse a line
# of the .son file

def convertline(line):
    # parse .son times and the bmp file name
    m = son_re.search(line)
    if m:
        start = m.group(2) + ":" + m.group(3) + ":" + m.group(4) + \
                "," + m.group(5) + "0"
        end = m.group(6) + ":" + m.group(7) + ":" + m.group(8) + \
              "," + m.group(9) + "0"
        bmpfile = (sondir + m.group(10)).rstrip()

        text = ""
        db = None

        bmpfileq = '"%s"' % bmpfile
        ppmfile  = '%s.ppm' % bmpfile
        ppmfileq = '"%s"' % ppmfile

        im = Image.open(bmpfile)
        assert im.mode == "P"

        # Work out how many text lines the bitmap holds
        width  = im.size[0]
        height = im.size[1]
        factor = 0
        for x in range(30,40,2):
            if height % x == 0:
                height /= x
                factor = x
                break
        if (height > 10):
            height = 1

        linetexts = []
        for i in range(height):
            ppmline  = '%s.line%d.ppm' % (bmpfile, i)
            ppmlineq = '"%s"' % ppmline

            if factor: # Multiple lines
                # TODO: Copy seems to be necessary, not really sure why
                #       Not a bottleneck though..
                im_line = im.crop((0, i*factor, width, (i+1)*factor-1)).copy()
            else:
                im_line = im

            # Test if hashes match
            linehash = hashstore.get_hash(im_line)
            hashtext = hashstore.check(linehash)

            if hashtext:
                linetexts.append(hashtext)
                continue

            db = flatten(im_line)

            linetext = ""
            if db:
                im_line_rgb = im_line.convert("RGB")
                im_line_rgb.save(ppmline)

                linetext = ocr(ppmline, db)

                if FAIL_CHAR in linetext or linetext == "":
                    logging.warning("ocr(%s, %s) contained an error or no output" % (ppmline, db))
                    save_failures([ppmline, bmpfile])

                utils.rm(ppmline)
                hashstore.add(linehash, linetext)
                linetexts.append(linetext)
            else:
                logging.warning("flatten_im(%s) failed" % ppmline)
                save_failures([bmpfile])

        return OCROutput(start, end, linetexts)
    else:
        return None

def flatten(im):
    # TODO: Could cache resulting image transformation and apply to those
    # images with the same palette
    p = im.getpalette()
    colours = im.getcolors()

    hues = {}
    for _, c in colours:
        r,g,b = p[c*3], p[c*3+1], p[c*3+2]
        if r == 0 and g == 0 and b == 0:
            # Change black to white
            p[c*3]   = 255
            p[c*3+1] = 255
            p[c*3+2] = 255
        elif r == 0 and g == 0 and b == 96:
            # Change dark blue to white
            p[c*3]   = 255
            p[c*3+1] = 255
            p[c*3+2] = 255
        elif r == 31 and g == 31 and b == 31:
            # Change dark grey to white
            p[c*3]   = 255
            p[c*3+1] = 255
            p[c*3+2] = 255
        else:
            h,s,v = colorsys.rgb_to_hsv(float(r)/255,
                                        float(g)/255,
                                        float(b)/255)
            huename = "%.1f" % h
            if hues.has_key(huename):
                hues[huename].append((c, v))
            else:
                hues[huename] = [(c, v)]

    # Check we can deal with image
    if len(hues) < 1:
        logging.error("Image with 0 hues")
        return None

    firsthuelen = len(hues.values()[0])
    if firsthuelen == 4:
        cutoff = 2
        db = "%s/db5_3" % utils.get_script_dir()
    elif firsthuelen == 6:
        cutoff = 3
        db = "%s/db7_4" % utils.get_script_dir()
    else:
        logging.error("Image with hues not length 4 or 6 (one of %d)" % firsthuelen)
        logging.error(str(hues))
        return None

    for h in hues.values():
        if not len(h) == firsthuelen:
            logging.error("Image with different hue lengths (%d and %d)" % (len(h), firsthuelen))
            return None

    for huecolours in hues.values():
        # Sort each hue's colours by brightness
        huecolours.sort(key=itemgetter(1))

        for i, (c, _) in enumerate(huecolours):
            if i < cutoff: # Change to white (background)
                p[c*3]   = 255
                p[c*3+1] = 255
                p[c*3+2] = 255
            else:          # Change to black (text)
                p[c*3]   = 0
                p[c*3+1] = 0
                p[c*3+2] = 0
    im.putpalette(p)
    return db

def ocr(file, db):
    ocrparam = 2 + 8 + 256 + (128 if builddb else 0)
    # NOTE: `db` here is the function argument, which shadows the global
    # `db` flag set by -d, so this test is true whenever a path is passed.
    if db: # Use that specified on the cmdline
        db = dbdir

    if not os.access("%s/db.lst" % db, os.F_OK):
        os.mkdir(db)
        open("%s/db.lst" % db, "a").close()

    fileq    = '"%s"'   % file
    txtfile  = '%s.txt' % file
    txtfileq = '"%s"'   % txtfile

    cmd = '%s -m %d -s %d -u "%c" -a 95 -d 0 -p "%s/" -i %s -o %s' % \
        (GOCR, ocrparam, space_width, FAIL_CHAR, db, fileq, txtfileq)

    if builddb:
        # Interactive: gocr prompts the user for unknown symbols
        returncode = subprocess.call(shlex.split(cmd))
    else:
        devnull = open(os.devnull, 'a+')
        returncode = subprocess.call(shlex.split(cmd),
                                     stdout=devnull,
                                     stderr=subprocess.STDOUT)
        devnull.close()

    text = ""
    try:
        textfile = open(file + ".txt", "r")
        text = textfile.read().rstrip()
        textfile.close()
    except IOError:
        logging.warning("Unable to read OCR results from " + file + ".txt\n")

    utils.rm(txtfile)
    return text

def mergeLive(prevoutput, ocroutput):
    outputnow = True

    if mergeduplicates:
        if ocroutput.start == prevoutput.start:
            # Same start time: fold the new lines into the previous entry
            outputnow = False
            if ocroutput.lines[0] == prevoutput.lines[0]:
                prevoutput.lines += ocroutput.lines[1:]
            else:
                prevoutput.lines += ocroutput.lines

            prevoutput.end = ocroutput.end
        else:
            prevtext = " ".join(prevoutput.lines).rstrip()
            ocrtext  = " ".join(ocroutput.lines).rstrip()

            # Find the longest tail of the previous text that the new
            # text starts with, and cut it from the previous entry
            overlap = 0
            for i in range(1, len(prevtext)+1):
                if prevtext[-i:] == ocrtext[:i]:
                    overlap = i

            if overlap != 0:
                prevoutput.lines = [prevtext[:-overlap]]

    return prevoutput, outputnow

def outputOCR(f, ocroutput):
    if len(ocroutput.lines) > 0 and \
       any(map(lambda s: len(s) != 0, ocroutput.lines)):
        for l in ocroutput.lines:
            if '\n' in l:
                logging.warning("\\n found at %s: %s" % (ocroutput.start, l))

        f.write("%s --> %s: %s\n" % \
                (ocroutput.start, ocroutput.end, (" ".join(ocroutput.lines).rstrip().replace("\n", " "))))

def extractTeletextSubtitles(mpgfile, pids, output):
    curdir = os.getcwd()
    os.chdir(tempfile.gettempdir())

    # Tell ProjectX to export teletext page 888 as srt
    inifile = os.path.join(tempfile.gettempdir(), "X.ini")
    f = open(inifile, "w")
    f.write("SubtitlePanel.TtxPage1=888\n")
    f.write("SubtitlePanel.SubtitleExportFormat=srt\n")
    f.close()

    utils.run_command('projectx "%s" -log -ini %s -out %s ' % (mpgfile, inifile, tempfile.gettempdir()) + \
                      "-demux -id %s " % (pids['teletext']),
                      "Error running projectx: check it is on your path",
                      display=verbose)

    mpgroot, _ = os.path.splitext(os.path.basename(mpgfile))

    telesrt = os.path.join(tempfile.gettempdir(), mpgroot + "[888].srt")

    f = open(telesrt)
    g = open(output, "w")

    # Walk the srt produced by ProjectX: number, timestamp, text, blank line
    state = NUMBER
    prevoutput = OCROutput("", "", [])
    for line in f.xreadlines():
        line = line.rstrip()
        if state == NUMBER:
            state = TIMESTAMP
        elif state == TIMESTAMP:
            ocroutput = OCROutput("", "", [])
            ocroutput.start, ocroutput.end = line.split(" --> ")
            ocroutput.lines = []
            state = TEXT
        elif state == TEXT:
            if line == "":
                state = NUMBER
                prevoutput, outputnow = mergeLive(prevoutput, ocroutput)

                if outputnow:
                    outputOCR(g, prevoutput)
                    prevoutput = ocroutput

            else:
                ocroutput.lines.append(line)

    outputOCR(g, prevoutput)
    g.close()
    f.close()

    os.chdir(curdir)
    return [inifile, telesrt]

def extractDVBSubtitles(mpgfile, pids):
    curdir = os.getcwd()
    os.chdir(tempfile.gettempdir())

    # Tell ProjectX to export the DVB subpictures as SON + bitmaps
    inifile = os.path.join(tempfile.gettempdir(), "X.ini")
    f = open(inifile, "w")
    f.write("SubtitlePanel.SubtitleExportFormat=SON\n")
    f.write("SubtitlePanel.SubtitleExportFormat_2=null\n")
    f.close()

    utils.run_command('projectx "%s" -log -ini %s -out . ' % (mpgfile, inifile) + \
                      "-demux -id %s " % (pids['subpict']),
                      "Error running projectx: check it is on your path",
                      display=verbose)

    os.chdir(curdir)
    mpgroot, _ = os.path.splitext(mpgfile)
    return [inifile, mpgroot + ".spf", mpgroot + ".son"] + glob.glob(mpgroot + "*.bmp")

def convertSonFile(input, output):
    try:
        sonfile = open(input, "r")
        srtfile = open(output, "w")

        prevoutput = OCROutput("", "", [])
        for line in sonfile.xreadlines():
            try:
                ocroutput = convertline(line)
            except ExternalError as e:
                logging.error(str(e))
                continue  # skip this line rather than reuse stale output

            if ocroutput:
                prevoutput, outputnow = mergeLive(prevoutput, ocroutput)

                # Output
                if outputnow:
                    outputOCR(srtfile, prevoutput)
                    prevoutput = ocroutput

        outputOCR(srtfile, prevoutput)
    except IOError as e:
        logging.critical("Fatal IO Error: " + str(e))
        logging.shutdown()
        raise

def main():
    # parse command line options
    try:
        opts, args = getopt.getopt(sys.argv[1:], "hvd:acns:i:o:", ["help", "no-merge"])
    except getopt.error as msg:
        print msg
        print "Parameter --help prints usage information."
        sys.exit(2)

    input = ""
    output = ""

    global db
    global dbdir
    global builddb
    global correction
    global correct
    global space_width
    global verbose
    global nodelete
    global mergeduplicates

    # process options
    for o, a in opts:
        if o in ("-h", "--help"):
            print __doc__
            sys.exit(0)
        if o == "-d":
            db = True
            dbdir = a + "/"
        if o == "-a":
            builddb = True
        if o == "-c":
            correction = True
        if o == "-s":
            # Must be an int: it is formatted with %d in ocr()
            space_width = int(a)
        if o == "-i":
            input = a
        if o == "-n":
            nodelete = True
        if o == "--no-merge":
            mergeduplicates = False
        if o == "-o":
            output = os.path.abspath(a)
        if o == "-v":
            verbose = True

    global utils
    utils = Utils(verbose, not nodelete)

    start = time.time()

    global sondir
    sondir = os.path.dirname(input) + "/"
    if sondir == "/":
        sondir = "./"
    os.chdir(sondir)
    if not os.path.exists(FAILURE_DIR):
        os.makedirs(FAILURE_DIR)

    logfile = "%s/%s.txt" % (FAILURE_DIR, os.path.basename(input))
    logging.basicConfig(filename=logfile,
                        format="%(asctime)s - %(levelname)s - %(message)s",
                        level=logging.INFO)
    logging.info("Starting subtitle conversion")
    logging.info("Version: $Revision: 519 $")
    logging.info("Options: %s" % str(sys.argv))
    logging.info("Input: %s" % input)
    logging.info("Output: %s" % output)
    inroot, ext = os.path.splitext(input)

    cruft = []
    if len(input) > 0 and len(output) > 0:
        if ext == ".ts" or ext == ".mpg":
            logging.info("Input is a video file, extracting")
            try:
                pids = getTsPids(input)
                cruft.append(inroot + "_log.txt")
                logging.info("Pids: %s" % str(pids))

                if pids.has_key('teletext'):
                    logging.info("Teletext subtitles found")
                    cruft.extend(extractTeletextSubtitles(input, pids, output))

                elif pids.has_key('subpict'):
                    logging.info("DVB subtitles found")
                    cruft.extend(extractDVBSubtitles(input, pids))

                    convertSonFile(inroot + ".son", output)

                else:
                    logging.error("No subtitles found!")
            except ExternalError as e:
                logging.error(str(e))

        elif ext == ".son":
            logging.info("Input is a subtitle file")
            convertSonFile(input, output)
    else:
        print "Failed to parse parameters. Try --help."

    # Remove intermediate files (unless -n was given)
    for f in cruft:
        utils.rm(f)

    utils.log_command_timings()
    logging.info("Hash hits: %d" % hashstore.hits)
    logging.info("Completed subtitle conversion")
    end = time.time()
    logging.info("Total time: %fs" % (end - start))
    logging.shutdown()

if __name__ == "__main__":
    try:
        main()
    except Exception as e:
        logging.critical("uncaught exception: %s" % str(e))
        logging.shutdown()
        raise

RE: Can not see PID's in .ts file - Added by Hein Rigolo about 13 years ago

dave c wrote:

I didn't realise you could extract the teletext subtitle stream from mkv. Is it in the same format as an .srt, with the timestamps and everything?

I suppose the script could be altered to use a program other than ProjectX to get them, but I wouldn't know how :(. Also, the script cleans up the text to remove the word duplicates and unnecessary spacing that you get from live subtitles. All this needs to be done so an app I have can read them more easily.

The script can also perform OCR, but I don't want to use that feature as it takes up too many system resources and is prone to error.

Just make a recording of a show with teletext subtitles and have a look at the recorded .mkv and the text stream in there to see how it is structured and what you can do with it. I do not think the script can be adjusted as easily as you think.
The stream is not in .srt format, but based on the timestamps in the .mkv you can probably convert it and write it out as an .srt.

Hein
