Comparison of Recurrent Neural Networks for Slovak Punctuation Restoration

Daniel Hládek, Ján Staš and Stanislav Ondáš

Technical University of Košice


Background




Motivation

Punctuation restoration is a post-processing step applied to the output of a speech recognition system.

     where to add punctuation    in subtitles
                              ^                ^
                              |                |
                 (positions of missing punctuation marks)

Punctuation restoration

adds the missing punctuation marks to unpunctuated recognized text:


Context to class mapping


ahoj ako sa máš ("hi how are you")  ->  one of:  ?  .  ,

The training data

stovky miliárd korún . dnes vo fonde , kde sa potrebné peniaze 

stovky miliárd korún      ->    PERIOD
miliárd korún .           ->    NO
korún . dnes              ->    NO
dnes vo fonde             ->    COMMA
vo fonde ,                ->    NO
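A minimal sketch of how such (context, label) training pairs can be generated from punctuated text by sliding a window over the token stream; the window size of three tokens and the label names follow the example above, the function itself is an assumption:

```python
# Illustrative sketch: generate (context, label) pairs from punctuated text.
# The label is the punctuation mark (if any) that follows the window.
PUNCT = {".": "PERIOD", ",": "COMMA"}

def make_examples(tokens, window=3):
    """Slide a fixed-size window over the tokens; label each window
    with the punctuation class of the token that follows it."""
    examples = []
    for i in range(len(tokens) - window):
        ctx = tokens[i:i + window]
        nxt = tokens[i + window]
        examples.append((ctx, PUNCT.get(nxt, "NO")))
    return examples

text = "stovky miliárd korún . dnes vo fonde , kde sa potrebné peniaze"
pairs = make_examples(text.split())
# pairs[0] -> (['stovky', 'miliárd', 'korún'], 'PERIOD')
```

Applied to the sentence above, this reproduces the five labeled windows shown on the slide.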

Context

 ahoj   ako   sa   máš    words

   12    56   78   123    integers

 0.21  0.34 0.35  0.87    embedding
 0.01  0.35 0.45  0.51    vectors
 0.61  0.74 0.15  0.87
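The word -> integer -> embedding pipeline above can be sketched as a simple table lookup; the vocabulary, the embedding dimension of 3, and the random matrix are illustrative assumptions:

```python
import numpy as np

# Illustrative sketch: map words to integer ids, then ids to rows of
# an embedding matrix. Vocabulary ids follow the slide; the matrix
# values are random placeholders for learned embeddings.
vocab = {"ahoj": 12, "ako": 56, "sa": 78, "máš": 123}
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((200, 3)).astype(np.float32)  # 200 words, dim 3

def encode(words):
    ids = [vocab[w] for w in words]   # words -> integer ids
    return embeddings[ids]            # ids -> matrix of shape (len(words), 3)

X = encode("ahoj ako sa máš".split())  # shape (4, 3)
```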

The recurrent neural network model

maps a matrix of embedding vectors to one of the possible punctuation classes.
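As a rough sketch of this mapping, a plain tanh recurrent cell can consume the embedding vectors one word at a time and classify from the final hidden state; all dimensions and the simple-RNN cell are illustrative, not the exact LSTM/GRU architecture evaluated in the paper:

```python
import numpy as np

# Illustrative sketch: a recurrent network maps a sequence of embedding
# vectors to probabilities over punctuation classes (NO, COMMA, PERIOD).
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
EMB, HID, CLASSES = 3, 8, 3                      # assumed dimensions
Wx = rng.standard_normal((HID, EMB)) * 0.1       # input weights
Wh = rng.standard_normal((HID, HID)) * 0.1       # recurrent weights
Wo = rng.standard_normal((CLASSES, HID)) * 0.1   # output weights

def rnn_classify(embedding_matrix):
    """Run a tanh RNN over the context and classify from the final state."""
    h = np.zeros(HID)
    for x in embedding_matrix:        # one embedding vector per word
        h = np.tanh(Wx @ x + Wh @ h)
    return softmax(Wo @ h)            # probability per punctuation class

probs = rnn_classify(rng.standard_normal((4, EMB)))  # 4-word context
```

In practice the weights are trained by backpropagation, and the LSTM/GRU cells compared in the results replace the plain tanh update with gated updates.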


RNN Architecture


[Figure: LSTM cell diagram. Image by Guillaume Chevalier, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=71836793]


The training data size

Dataset      # Tokens    Sentences     Size (B)
train      43,693,819    2,587,896  305,660,294
test            6,168          346       43,372

Uni-Directional Results

After 20 training epochs

Network   Comma F1   Period F1
LSTM          0.81        0.65
GRU           0.81        0.67

Bi-Directional Results

After 10 training epochs

Network   Comma F1   Period F1
LSTM          0.82        0.68
GRU           0.88        0.86
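The per-class F1 scores in the tables above can be computed as the harmonic mean of precision and recall for each punctuation class; a small sketch with toy gold/predicted labels (the labels here are illustrative, not the paper's test data):

```python
# Illustrative sketch: per-class F1 from gold and predicted labels.
def f1(gold, pred, cls):
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = ["NO", "COMMA", "NO", "PERIOD", "COMMA", "NO"]
pred = ["NO", "COMMA", "COMMA", "PERIOD", "NO", "NO"]
# f1(gold, pred, "COMMA") -> 0.5; f1(gold, pred, "PERIOD") -> 1.0
```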

Problems


Future research



https://marhula.fei.tuke.sk/sarra/

https://nlp.kemt.fei.tuke.sk