Slovak Question Answering Dataset

Downloads

License: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Motivation

The goal of the project was to write a large number of questions about Slovak Wikipedia articles and to mark the answers in the article text. A neural network trained on this database learns to answer questions it has not seen before. This will enable Slovak companies and organizations to automatically interpret a question and search their own texts for the answer, much as a person would.

A custom neural network has several advantages over large-scale models such as ChatGPT. It is safer, because the model can run directly inside the organization and the data does not have to be sent to a foreign company. A custom model can also be better adapted to a specific task and can run on a commonly available machine.

We created the database so that it is possible to measure and compare how many Slovak or English questions a system can answer. Such measurements make it possible to improve existing systems that understand Slovak or English.

In media

Description

SK-QuAD is the first manually annotated dataset of questions and answers in Slovak. It consists of more than 91k factual questions and answers from various fields. Each question has its answer marked in the corresponding paragraph. The dataset also contains negative examples in the form of “unanswerable questions” and “plausible answers”. It is published free of charge for scientific use. We aim to contribute to the creation of Slovak or multilingual systems that generate an answer to a question posed in natural language. The paper provides an overview of the existing datasets for question answering, describes the annotation process, and statistically analyzes the created content. The dataset expands the possibilities for training and evaluating multilingual language models. Experiments show that models trained on the dataset achieve state-of-the-art results for Slovak and improve question answering for other languages in zero-shot learning. We compare the effect of machine-translated data with that of manually annotated data. Additional data improve the modeling for low-resource languages.
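The description above can be illustrated with a minimal sketch of one paragraph record. It assumes that the dataset uses a SQuAD 2.0-style layout, with fields named `context`, `qas`, `answers`, `answer_start`, `is_impossible`, and `plausible_answers`; these field names are an assumption, not something stated on this page. The paragraph, questions, and identifiers are made up for illustration and are not taken from the dataset.

```python
from typing import Optional

# Hypothetical paragraph, NOT taken from SK-QuAD.
context = "Košice sú druhé najväčšie mesto na Slovensku."


def make_answer(text: str) -> dict:
    """Build an answer entry; the offset is computed, not hard-coded."""
    return {"text": text, "answer_start": context.index(text)}


# Assumed SQuAD 2.0-style record: one answerable and one unanswerable question.
paragraph = {
    "context": context,
    "qas": [
        {
            "id": "example-1",  # hypothetical identifier
            "question": "Ktoré mesto je druhé najväčšie na Slovensku?",
            "is_impossible": False,
            "answers": [make_answer("Košice")],
        },
        {
            "id": "example-2",  # hypothetical identifier
            # The paragraph does not say which city is the largest, so the
            # question is a negative example with a merely plausible answer.
            "question": "Ktoré mesto je najväčšie na Slovensku?",
            "is_impossible": True,
            "answers": [],
            "plausible_answers": [make_answer("Košice")],
        },
    ],
}


def gold_span(ctx: str, qa: dict) -> Optional[str]:
    """Return the marked answer span, or None for an unanswerable question."""
    if qa["is_impossible"]:
        return None
    answer = qa["answers"][0]
    start = answer["answer_start"]
    span = ctx[start:start + len(answer["text"])]
    if span != answer["text"]:
        raise ValueError(f"answer offset does not match text for {qa['id']}")
    return span


spans = {qa["id"]: gold_span(paragraph["context"], qa) for qa in paragraph["qas"]}
print(spans)
```

Storing the character offset alongside the answer text is what lets an extractive model be trained to point at a span in the paragraph. The unanswerable question keeps only a plausible span, which a model should learn not to return.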

Bibliography

Credits

The authors thank:

  • Deutsche Telekom IT Solutions Slovakia for fruitful cooperation, personal and financial support,
  • the Ministry of Education, Science, Research and Sport of the Slovak Republic and the Slovak Academy of Sciences under Project VEGA2/0165/2
  • the Slovak Research and Development Agency through the Project of Bilateral Cooperation under Grant APVV-SK-TW-21-0002,
  • all annotators, part-timers, and students of the Department of Electronics and Multimedia Communications, Faculty of Electrical Engineering and Informatics, Technical University of Košice,
  • the students of the Elementary School Belehradská 21, Košice, under the supervision of Dr. Lenka Macková, Ph.D., and their friends,