Slovak Categorized News Corpus

This corpus is a first attempt to create a representative sample of contemporary Slovak from various domains that is easy to search and to process automatically.

It contains a selection of news articles processed by our NLP tools, which provide:

  • Token boundary identification
  • Sentence boundary identification
  • Stop-word identification
  • Morphological analysis
  • Named entity recognition
  • Named entity transcription
  • Lemmatization

The second part of the effort is the information retrieval evaluation set for the corpus.

Information Retrieval Evaluation Set for Slovak Categorized News Corpus

This is the first Slovak information retrieval evaluation set. It contains a set of queries (information needs), each paired with the corresponding relevant documents from the Slovak Categorized News Corpus.
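
An evaluation set of this kind (queries paired with relevant documents) is typically used to score the ranked output of a retrieval system. The following is a minimal sketch of that idea using mean average precision. The query and document identifiers, the `qrels` and `runs` dictionaries, and the function names are all hypothetical; the actual file format of the evaluation set is not described here.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked result list against a set of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where a relevant document was found.
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           qrels: Dict[str, Set[str]]) -> float:
    """Mean of the average precision over all queries in the evaluation set."""
    scores = [average_precision(runs.get(query, []), relevant)
              for query, relevant in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical relevance judgments: query id -> ids of relevant documents.
qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
# Hypothetical system output: query id -> ranked list of retrieved document ids.
runs = {"q1": ["d1", "d2", "d3"], "q2": ["d3", "d2"]}

print(f"MAP = {mean_average_precision(runs, qrels):.3f}")
```

Queries for which the system returns nothing score zero, so they still count against the mean.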


Please send a request for the download link.


D. Hládek, J. Staš, J. Juhár: Slovak Categorized News Corpus. LREC 2014.

D. Hládek, J. Staš, J. Juhár: Evaluation Set for Slovak News Information Retrieval. LREC 2016.