This corpus is a first attempt at building a representative sample of contemporary Slovak from various domains, prepared for easy searching and automated processing. It contains a selection of news articles processed by our NLP tools.
The corpus consists of two parts. The first part contains text files and annotations:
- Token boundary identification
- Sentence boundary identification
- Stop-Words
- Morphological Analysis
- Named Entity Recognition
- Named Entity Transcription
- Lemma
The second part contains an evaluation set for information retrieval.
Downloads
Bibliography
- D. Hládek, J. Staš, J. Juhár: "Slovak Categorized News Corpus." LREC 2014, pp. 1705–1708.
- D. Hládek, J. Staš, J. Juhár: "Evaluation Set for Slovak News Information Retrieval." LREC 2016.