This corpus is intended as a first attempt at building a representative sample of contemporary Slovak from various domains that is easy to search and to process automatically. It contains a selection of news articles processed with our NLP tools.

The corpus consists of two parts. The first part contains text files with the following annotations:

  • Token boundary identification
  • Sentence boundary identification
  • Stop words
  • Morphological analysis
  • Named entity recognition
  • Named entity transcription
  • Lemmas
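
To illustrate how annotation layers like these could be combined per token, here is a minimal Python sketch. The `Token` class, its field names, the coarse tags, and the sample sentence are all assumptions made for illustration; they do not describe the actual file format or tagset of the corpus.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    """One token with hypothetical annotation layers (not the corpus format)."""
    text: str                      # surface form after token boundary identification
    lemma: str                     # dictionary form
    morph: str                     # coarse placeholder tag, not the corpus tagset
    is_stop_word: bool             # stop-word flag
    entity: Optional[str] = None   # named entity label, if any


def content_lemmas(sentence: List[Token]) -> List[str]:
    """Return the lemmas of all tokens that are not stop words."""
    return [t.lemma for t in sentence if not t.is_stop_word]


def named_entities(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Return (surface form, label) pairs for tokens tagged as named entities."""
    return [(t.text, t.entity) for t in sentence if t.entity is not None]


# Invented example sentence; the annotations are illustrative only.
sentence = [
    Token("Bratislava", "Bratislava", "NOUN", False, "LOCATION"),
    Token("je", "byť", "VERB", True),
    Token("hlavné", "hlavný", "ADJ", False),
    Token("mesto", "mesto", "NOUN", False),
]

print(content_lemmas(sentence))
print(named_entities(sentence))
```

Keeping every layer on the token makes it simple to filter on one annotation (for example, stop words) while reading another (for example, lemmas).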

The second part contains an evaluation set for information retrieval.
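
As a rough idea of how an evaluation set of this kind is commonly used, the sketch below scores one query's retrieved documents against a set of relevant documents using precision and recall. It assumes the set provides, for each query, the IDs of the relevant documents; that layout, the function name `precision_recall`, and the document IDs are assumptions, not a description of the actual data.

```python
from typing import Iterable, Set, Tuple


def precision_recall(retrieved: Iterable[str], relevant: Set[str]) -> Tuple[float, float]:
    """Compute set-based precision and recall for a single query."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs for one query.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = {"d1", "d3", "d5"}

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

These are set-based measures that ignore ranking order; rank-aware measures would need the position of each retrieved document as well.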

Downloads

Bibliography

  • D. Hládek, J. Staš, J. Juhár: "Slovak Categorized News Corpus." LREC 2014, pp. 1705–1708.
  • D. Hládek, J. Staš, J. Juhár: "Evaluation Set for Slovak News Information Retrieval." LREC 2016.
