Daniel Hládek, Ján Staš and Stanislav Ondáš
Technical University of Košice
Post-processing of the output of the speech recognition system.
Where to add punctuation in subtitles
Post-processing adds missing punctuation by mapping a context to a class:
ahoj ako sa máš ("hi, how are you") -> one of: ? . ,
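Training examples are taken from punctuated text with a sliding window of three tokens. The class says which punctuation mark, if any, follows the window:
stovky miliárd korún . dnes vo fonde , kde sa potrebné peniaze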
stovky miliárd korún -> PERIOD
miliárd korún . -> NO
korún . dnes -> NO
dnes vo fonde -> COMMA
vo fonde , -> NO
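The window-to-class listing above can be reproduced with a short sketch. It assumes a window of three tokens in which punctuation marks count as tokens, and it labels each window with the token that follows it. The function name `make_samples` and the class name `QUESTION` are illustrative and do not come from the slides. This sketch also emits every window, including `. dnes vo`, which the listing above does not show.

```python
# Illustrative mapping from punctuation tokens to class names.
# PERIOD, COMMA and NO appear in the slides; QUESTION is an assumed name.
PUNCT_CLASSES = {".": "PERIOD", ",": "COMMA", "?": "QUESTION"}


def make_samples(tokens, window=3):
    """Pair each window of `window` tokens with the class of the next token."""
    samples = []
    for end in range(window, len(tokens)):
        context = tokens[end - window:end]
        # The label is the punctuation class of the following token, else NO.
        label = PUNCT_CLASSES.get(tokens[end], "NO")
        samples.append((context, label))
    return samples


text = "stovky miliárd korún . dnes vo fonde , kde sa potrebné peniaze"
samples = make_samples(text.split())
for context, label in samples:
    print(" ".join(context), "->", label)
```

At prediction time the same windows would be built from unpunctuated recognizer output, and the predicted class would decide whether a mark is inserted after the window.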
words: ahoj ako sa máš
integers: 12 56 78 123
embedding vectors:
0.21 0.34 0.35 0.87
0.01 0.35 0.45 0.51
0.61 0.74 0.15 0.87
The neural network maps a matrix of embedding vectors into possible punctuation marks.
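The step from words to integers to embedding vectors can be sketched in plain Python. Everything named here is an assumption for illustration: the helper names, the `<unk>` entry for unknown words, the three-dimensional vectors, and the random initial values. In a real system the embedding table would be learned together with the network.

```python
import random


def build_vocab(tokens):
    """Assign each distinct token an integer id; id 0 is kept for unknown words."""
    vocab = {"<unk>": 0}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Replace each token with its id, falling back to the <unk> id."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


def init_embeddings(vocab_size, dim, seed=0):
    """Create a table with one random vector per vocabulary id."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(vocab_size)]


words = "ahoj ako sa máš".split()
vocab = build_vocab(words)
ids = encode(words, vocab)
table = init_embeddings(len(vocab), dim=3)

# Looking up each id gives the matrix that is passed to the network:
# one row of `dim` numbers per word in the context.
matrix = [table[i] for i in ids]
print(ids)
print(len(matrix), "x", len(matrix[0]))
```

The ids produced here differ from the ones on the slide because the toy vocabulary holds only the four example words.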
By Guillaume Chevalier - Own work, CC BY 4.0, https://commons.wikimedia.org/w/index.php?curid=71836793
Dataset | # Tokens | Sentences | Size (B) |
---|---|---|---|
train | 43,693,819 | 2,587,896 | 305,660,294 |
test | 6,168 | 346 | 43,372 |
After 20 training epochs
Network | Comma F1 | Period F1 |
---|---|---|
LSTM | 0.81 | 0.65 |
GRU | 0.81 | 0.67 |
After 10 training epochs
Network | Comma F1 | Period F1 |
---|---|---|
LSTM | 0.82 | 0.68 |
GRU | 0.88 | 0.86 |
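The slides do not say how the F1 scores were computed. A common way, sketched here under that assumption, is to compute precision and recall separately for each punctuation class over all window positions. The function name and the toy label sequences are hypothetical and are not taken from the test set.

```python
def f1_for_class(gold, predicted, target):
    """F1 score of one class, from parallel lists of gold and predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == target and p == target)
    fp = sum(1 for g, p in pairs if g != target and p == target)
    fn = sum(1 for g, p in pairs if g == target and p != target)
    if tp == 0:
        # No correct prediction of this class: precision or recall is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels for six window positions (hypothetical data).
gold = ["NO", "COMMA", "NO", "PERIOD", "COMMA", "NO"]
pred = ["NO", "COMMA", "COMMA", "PERIOD", "NO", "NO"]

comma_f1 = f1_for_class(gold, pred, "COMMA")
period_f1 = f1_for_class(gold, pred, "PERIOD")
print("Comma F1:", comma_f1)
print("Period F1:", period_f1)
```

Scoring each class on its own keeps the very frequent NO class from dominating the result, which is one reason to report comma and period scores in separate columns.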
https://marhula.fei.tuke.sk/sarra/
https://nlp.kemt.fei.tuke.sk