Comparison of Recurrent Neural Networks for Slovak Punctuation Restoration

Daniel Hládek, Ján Staš and Stanislav Ondáš

Technical University of Košice


Background




Motivation

Punctuation restoration is a post-processing step applied to the output of a speech recognition system.

     where to add punctuation    in subtitles
                              ^                ^
                              |                |
                 (positions of missing punctuation marks)

Punctuation restoration

adds the missing punctuation marks to unpunctuated recognized text:


Context to class mapping


ahoj ako sa máš ("hi how are you")  ->  one of:  ?  .  ,

The training data

stovky miliárd korún . dnes vo fonde , kde sa potrebné peniaze 

stovky miliárd korún      ->    PERIOD
miliárd korún .           ->    NO
korún . dnes              ->    NO
dnes vo fonde             ->    COMMA
vo fonde ,                ->    NO
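A minimal sketch of how such (context, label) training pairs can be generated from punctuated text by sliding a window over the token stream; the window size of three tokens and the label names follow the example above, the function itself is an assumption:

```python
# Illustrative sketch: generate (context, label) pairs from punctuated text.
# The label is the punctuation mark (if any) that follows the window.
PUNCT = {".": "PERIOD", ",": "COMMA"}

def make_examples(tokens, window=3):
    """Slide a fixed-size window over the tokens; label each window
    with the punctuation class of the token that follows it."""
    examples = []
    for i in range(len(tokens) - window):
        ctx = tokens[i:i + window]
        nxt = tokens[i + window]
        examples.append((ctx, PUNCT.get(nxt, "NO")))
    return examples

text = "stovky miliárd korún . dnes vo fonde , kde sa potrebné peniaze"
pairs = make_examples(text.split())
# pairs[0] -> (['stovky', 'miliárd', 'korún'], 'PERIOD')
```

Applied to the sentence above, this reproduces the five labeled windows shown on the slide.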

Context

 ahoj   ako   sa   máš    words

   12    56   78   123    integers

 0.21  0.34 0.35  0.87    embedding
 0.01  0.35 0.45  0.51    vectors
 0.61  0.74 0.15  0.87
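The word -> integer -> embedding pipeline above can be sketched as a simple table lookup; the vocabulary, the embedding dimension of 3, and the random matrix are illustrative assumptions:

```python
import numpy as np

# Illustrative sketch: map words to integer ids, then ids to rows of
# an embedding matrix. Vocabulary ids follow the slide; the matrix
# values are random placeholders for learned embeddings.
vocab = {"ahoj": 12, "ako": 56, "sa": 78, "máš": 123}
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((200, 3)).astype(np.float32)  # 200 words, dim 3

def encode(words):
    ids = [vocab[w] for w in words]   # words -> integer ids
    return embeddings[ids]            # ids -> matrix of shape (len(words), 3)

X = encode("ahoj ako sa máš".split())  # shape (4, 3)
```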

The recurrent neural network model

maps a matrix of embedding vectors to one of the possible punctuation classes.
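As a rough sketch of this mapping, a plain tanh recurrent cell can consume the embedding vectors one word at a time and classify from the final hidden state; all dimensions and the simple-RNN cell are illustrative, not the exact LSTM/GRU architecture evaluated in the paper:

```python
import numpy as np

# Illustrative sketch: a recurrent network maps a sequence of embedding
# vectors to probabilities over punctuation classes (NO, COMMA, PERIOD).
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
EMB, HID, CLASSES = 3, 8, 3                      # assumed dimensions
Wx = rng.standard_normal((HID, EMB)) * 0.1       # input weights
Wh = rng.standard_normal((HID, HID)) * 0.1       # recurrent weights
Wo = rng.standard_normal((CLASSES, HID)) * 0.1   # output weights

def rnn_classify(embedding_matrix):
    """Run a tanh RNN over the context and classify from the final state."""
    h = np.zeros(HID)
    for x in embedding_matrix:        # one embedding vector per word
        h = np.tanh(Wx @ x + Wh @ h)
    return softmax(Wo @ h)            # probability per punctuation class

probs = rnn_classify(rng.standard_normal((4, EMB)))  # 4-word context
```

In practice the weights are trained by backpropagation, and the LSTM/GRU cells compared in the results replace the plain tanh update with gated updates.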


RNN Architecture


[Figure: LSTM cell diagram. Image by Guillaume Chevalier, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=71836793]


The training data size

Dataset      # Tokens    Sentences     Size (B)
train      43,693,819    2,587,896  305,660,294
test            6,168          346       43,372

Uni-Directional Results

After 20 training epochs

Network   Comma F1   Period F1
LSTM          0.81        0.65
GRU           0.81        0.67

Bi-Directional Results

After 10 training epochs

Network   Comma F1   Period F1
LSTM          0.82        0.68
GRU           0.88        0.86
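The per-class F1 scores in the tables above can be computed as the harmonic mean of precision and recall for each punctuation class; a small sketch with toy gold/predicted labels (the labels here are illustrative, not the paper's test data):

```python
# Illustrative sketch: per-class F1 from gold and predicted labels.
def f1(gold, pred, cls):
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = ["NO", "COMMA", "NO", "PERIOD", "COMMA", "NO"]
pred = ["NO", "COMMA", "COMMA", "PERIOD", "NO", "NO"]
# f1(gold, pred, "COMMA") -> 0.5; f1(gold, pred, "PERIOD") -> 1.0
```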

Problems


Future research



https://marhula.fei.tuke.sk/sarra/

https://nlp.kemt.fei.tuke.sk