Video Text Detection and Recognition

This is an implementation of an end-to-end pipeline to detect and recognize text in YouTube videos. Text detection is based on SSD: Single Shot MultiBox Detector, retrained on a single text class using the COCO-Text dataset. Text recognition is based on the Convolutional Recurrent Neural Network (CRNN) described in "An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition".

Slide deck

Update: the code has been removed from this repo.

Please see the Demo notebook as a starting point. Use it with your YouTube URL to either:

  1. Get text detection/recognition results in JSON format, or
  2. Generate a new video with overlaid bounding boxes for all detected text and their respective transcriptions (see the sketch below).
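For illustration, the overlay step amounts to drawing the detected boxes and their transcriptions on each frame with OpenCV. The sketch below assumes detections are available per frame as ((x1, y1, x2, y2), text) tuples in pixel coordinates; the function name and data layout are placeholders, not the notebook's actual API.

    import cv2

    def overlay_detections(in_path, out_path, detections_per_frame):
        """Draw boxes and transcriptions on every frame of a video.

        detections_per_frame: list indexed by frame number; each entry is a
        list of ((x1, y1, x2, y2), text) tuples in pixel coordinates
        (a hypothetical layout, for illustration only).
        """
        cap = cv2.VideoCapture(in_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

        frame_idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            dets = detections_per_frame[frame_idx] if frame_idx < len(detections_per_frame) else []
            for (x1, y1, x2, y2), text in dets:
                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
                cv2.putText(frame, text, (x1, max(y1 - 5, 0)),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
            writer.write(frame)
            frame_idx += 1

        cap.release()
        writer.release()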

Requirements

All requirements are captured in requirements.txt. Activate a virtual environment of your choice and install them (pip install -r requirements.txt).

Directory structure:

Detection

Our detection model is based on TensorFlow's object detection models and the detection model zoo.

We use transfer learning on the SSD MobileNet network. The original network was trained on the COCO dataset (natural objects) for an object detection task. We retrain it for text detection (a single class) using the COCO-Text dataset.
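Concretely, the main change against the stock pipeline config is reducing the number of classes to one and pointing the input readers at the COCO-Text label map and TFRecords. A rough fragment is shown below (paths are placeholders; see the configs under detection/ for the full files):

    model {
      ssd {
        num_classes: 1   # a single "text" class instead of the full COCO class set
        ...
      }
    }
    train_input_reader {
      label_map_path: "path/to/text.pbtxt"
      tf_record_input_reader {
        input_path: "path/to/cocotext_train.record"
      }
    }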

Inference

detection.py loads a frozen TensorFlow inference graph and runs inference on our data.
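As a rough sketch of what that involves (not the repo's exact code; the graph filename and score threshold are placeholder assumptions), a frozen Object Detection API graph can be loaded and queried with the TensorFlow 1.x API like this:

    import numpy as np
    import tensorflow as tf  # TensorFlow 1.x API

    # Load the frozen inference graph exported by the Object Detection API.
    graph = tf.Graph()
    with graph.as_default():
        graph_def = tf.GraphDef()
        with tf.gfile.GFile("weights/frozen_inference_graph.pb", "rb") as f:
            graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name="")

    with tf.Session(graph=graph) as sess:
        frame = np.zeros((300, 300, 3), dtype=np.uint8)  # replace with a real video frame
        # Standard output tensors of an exported detection graph.
        boxes, scores, classes, num = sess.run(
            ["detection_boxes:0", "detection_scores:0",
             "detection_classes:0", "num_detections:0"],
            feed_dict={"image_tensor:0": np.expand_dims(frame, 0)})
        # Keep detections above a confidence threshold; boxes are normalized
        # [ymin, xmin, ymax, xmax] coordinates.
        text_boxes = boxes[0][scores[0] > 0.5]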

Training

Please follow the instructions for TensorFlow's Object Detection API, along with the scripts and configs provided in the detection/ folder.
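Depending on the version of the Object Detection API you have installed, training is typically launched along these lines (the paths below are placeholders, not the repo's exact layout):

$ python object_detection/model_main.py \
    --pipeline_config_path=detection/ssd_mobilenet_v1_coco.config \
    --model_dir=training/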

Model configs

We have also experimented with Faster R-CNN pretrained on COCO, for which we provide a config file as well:

  1. ssd_mobilenet_v1_coco.config
  2. faster_rcnn_resnet101_pets_coco.config

Class definition

See text.pbtxt
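For reference, a single-class label map in the Object Detection API's pbtxt format typically contains just one item; the exact class name used in text.pbtxt may differ:

    item {
      id: 1
      name: 'text'
    }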

Generate TFRecords

The script used to generate TFRecords for use with this model is at coco-text/Coco-Text%20to%20TFRecords.ipynb.
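As an illustration of the record format the Object Detection API expects (a sketch, not the notebook's exact code; image loading and box extraction are left out), each image and its normalized boxes are serialized into a tf.train.Example:

    import tensorflow as tf

    def make_tf_example(encoded_jpeg, width, height, boxes):
        """boxes: list of (xmin, ymin, xmax, ymax) in normalized [0, 1] coordinates."""
        xmins = [b[0] for b in boxes]
        ymins = [b[1] for b in boxes]
        xmaxs = [b[2] for b in boxes]
        ymaxs = [b[3] for b in boxes]
        feature = {
            "image/encoded": tf.train.Feature(bytes_list=tf.train.BytesList(value=[encoded_jpeg])),
            "image/format": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"jpeg"])),
            "image/width": tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
            "image/height": tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
            "image/object/bbox/xmin": tf.train.Feature(float_list=tf.train.FloatList(value=xmins)),
            "image/object/bbox/ymin": tf.train.Feature(float_list=tf.train.FloatList(value=ymins)),
            "image/object/bbox/xmax": tf.train.Feature(float_list=tf.train.FloatList(value=xmaxs)),
            "image/object/bbox/ymax": tf.train.Feature(float_list=tf.train.FloatList(value=ymaxs)),
            # Single "text" class, matching text.pbtxt.
            "image/object/class/text": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[b"text"] * len(boxes))),
            "image/object/class/label": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[1] * len(boxes))),
        }
        return tf.train.Example(features=tf.train.Features(feature=feature))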

Recognition

We leverage a Convolutional Recurrent Neural Network (CRNN) for the recognition task.

Inference

recognition.py holds helper functions for the recognition task. It loads the CRNN weights file and runs inference. It is adapted from the Caffe implementation by the paper authors (Shi et al.) and the PyTorch implementation by @meijieru. See the crnn.pytorch folder for more details, and the original implementation for training instructions.
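At inference time the flow is roughly the following. The sketch assumes the standard CRNN setup of 32-pixel-high grayscale crops and greedy CTC decoding; the alphabet, input width, and function names below are placeholder assumptions rather than the repo's exact API:

    import torch
    from torchvision import transforms

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # index 0 is the CTC blank

    def greedy_ctc_decode(logits):
        """logits: (T, num_classes) tensor of per-timestep class scores."""
        best = logits.argmax(dim=1).tolist()
        chars, prev = [], 0
        for idx in best:
            if idx != 0 and idx != prev:   # drop blanks and repeated symbols
                chars.append(ALPHABET[idx - 1])
            prev = idx
        return "".join(chars)

    def recognize(model, crop):
        """crop: PIL image of a detected text region."""
        preprocess = transforms.Compose([
            transforms.Grayscale(),
            transforms.Resize((32, 100)),  # CRNN expects 32-pixel-high inputs
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5]),
        ])
        x = preprocess(crop).unsqueeze(0)  # shape (1, 1, 32, 100)
        with torch.no_grad():
            logits = model(x)              # CRNN outputs (T, batch, num_classes)
        return greedy_ctc_decode(logits.squeeze(1))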

Web server

We provide a basic web server for serving video analysis requests. To start it, execute the following in this directory:

$ python flask_server.py
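The general shape of such a server is sketched below; the route, request field, and pipeline stub are hypothetical and only illustrate the request/response flow, not flask_server.py's actual API:

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    def run_pipeline(url):
        # Placeholder stub for the detection + recognition pipeline.
        return {"youtube_url": url, "detections": []}

    @app.route("/analyze", methods=["POST"])   # hypothetical endpoint
    def analyze():
        data = request.get_json(silent=True) or {}
        url = data.get("youtube_url")
        if not url:
            return jsonify({"error": "youtube_url is required"}), 400
        return jsonify(run_pipeline(url))

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)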

Files:

Data exploration and evaluation

See directory data_explore_eval/

COCO-Text:

SynthText:

It also contains a script to generate submissions for ICDAR17 and run evaluations offline.

Unit Tests

Unit tests for the video functionality are included; more tests still need to be added. To run them:

$ python -m pytest test_utilities.py

Assets

Download the weights from Google Drive and put them into a folder named weights/.