Slovak Categorized News Corpus

09 Sep

This corpus is a first attempt at a representative sample of contemporary Slovak, drawn from various domains and prepared for easy searching and automated processing. It contains a selection of news articles processed by our NLP tools.

The corpus consists of two parts. The first part contains text files and annotations:

Token boundary identification
Sentence boundary identification
Stop-Words
Morphological Analysis
Named Entity Recognition
Named Entity Transcription
Lemma
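
To make the annotation layers listed above more concrete, here is a minimal Python sketch of how one annotated sentence might be held in memory. It is an illustration only: the corpus's actual file format is not described on this page, so the class names, field names, tag values ("NOUN", "VERB", "LOC") and the helper `content_lemmas` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


# Hypothetical in-memory model; the real corpus format may differ.
@dataclass
class Token:
    text: str                                   # surface form (token boundary layer)
    lemma: str                                  # lemma layer
    morph_tag: str                              # morphological analysis (placeholder tag set)
    is_stop_word: bool = False                  # stop-word layer
    entity_type: Optional[str] = None           # named entity recognition layer
    entity_transcription: Optional[str] = None  # named entity transcription layer


@dataclass
class Sentence:
    tokens: List[Token]                         # sentence boundary layer groups tokens


def content_lemmas(sentence: Sentence) -> List[str]:
    """Return the lemmas of all tokens that are not marked as stop words."""
    return [t.lemma for t in sentence.tokens if not t.is_stop_word]


# Toy sentence "Košice sú mesto" with made-up annotation values.
example = Sentence(tokens=[
    Token("Košice", "Košice", "NOUN", entity_type="LOC"),
    Token("sú", "byť", "VERB", is_stop_word=True),
    Token("mesto", "mesto", "NOUN"),
])

print(content_lemmas(example))  # ['Košice', 'mesto']
```

Keeping every layer on the token makes simple filters, such as dropping stop words before indexing, a one-line operation.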

The second part contains an evaluation set for information retrieval.
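
As a rough illustration of how an information retrieval evaluation set is typically used, the Python sketch below scores one ranked result list against a set of relevant documents. The metrics shown (precision at k and average precision) are standard, but this page does not say which metrics or file format the evaluation set uses, so the function names, document identifiers and data layout here are assumptions.

```python
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document appears,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical query result and relevance judgments.
ranked = ["doc3", "doc1", "doc7", "doc2"]
relevant = {"doc1", "doc2"}

print(precision_at_k(ranked, relevant, 2))   # 0.5
print(average_precision(ranked, relevant))   # 0.5
```

Averaging `average_precision` over all queries in an evaluation set gives mean average precision, a common single-number summary of retrieval quality.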

Downloads

Slovak Categorized News Corpus
Information Retrieval Evaluation Set for Slovak Categorized News Corpus

Bibliography

D. Hládek, J. Staš, J. Juhár: Slovak Categorized News Corpus. LREC 2014, pp. 1705–1708, 2014.
D. Hládek, J. Staš, J. Juhár: Evaluation Set for Slovak News Information Retrieval. LREC 2016, 2016.