License: Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
The goal of the project was to write a large number of questions about Slovak Wikipedia articles and to mark the answers in the article text. This database will be used to train a neural network to automatically answer questions it has not seen before. Slovak companies and organizations will then be able to build systems that understand a question and search their own texts for the answer, much as a person would.
A custom neural network has several advantages over large general-purpose models such as ChatGPT. It is safer, because the model can be run directly inside the organization and the data does not have to be sent to a foreign company. A custom model can also be better adapted to a specific task and can run on a commonly available machine.
We created the database so that it is possible to measure and compare how many Slovak or English questions a system can answer. This makes it possible to improve existing systems that understand Slovak or English.
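As an illustration of how such a comparison might be scored, here is a minimal Python sketch of two measures commonly used for extractive question answering: exact match and token-level F1. The function names and the normalization rules (lowercasing, stripping ASCII punctuation, collapsing whitespace) are assumptions made for this example, not the project's official evaluation script.

```python
import string
from collections import Counter
from typing import List


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed rules)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: List[str]) -> bool:
    """True if the prediction equals any gold answer after normalization.

    An empty gold list stands for a question with no answer in the paragraph;
    the system is then correct only if it returns an empty prediction.
    """
    if not gold_answers:
        return normalize(prediction) == ""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between prediction and gold."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    common = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Averaging `exact_match` over all questions gives the share of questions a system answers exactly, while `token_f1` gives partial credit when the predicted span only overlaps the marked answer.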
SK-QuAD is the first manually annotated dataset of questions and answers in Slovak. It consists of more than 91k factual questions and answers from various fields. Each question has an answer marked in the corresponding paragraph. It also contains negative examples in the form of “unanswered questions” and “plausible answers”. The dataset is published free of charge for scientific use. We aim to contribute to the creation of Slovak or multilingual systems for generating an answer to a question in a natural language. The paper provides an overview of the existing datasets for question answering. It describes the annotation process and statistically analyzes the created content. The dataset expands the possibilities of training and evaluation of multilingual language models. Experiments show that models trained on the dataset achieve state-of-the-art results for Slovak and improve question answering for other languages in zero-shot learning. We compare the effect of machine-translated data with that of manually annotated data. Additional data improves modeling for low-resource languages.
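The annotation scheme described above (an answer marked in a paragraph, plus negative examples with plausible answers) can be pictured with a small Python sketch. The class and field names below are hypothetical and loosely follow the layout of SQuAD-style datasets; the Slovak sentences are made up for illustration and are not taken from SK-QuAD.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    text: str   # the marked answer text
    start: int  # character offset of the answer within the paragraph


@dataclass
class QAExample:
    question: str
    context: str                                                # the paragraph
    answers: List[Span] = field(default_factory=list)           # empty for negative examples
    plausible_answers: List[Span] = field(default_factory=list)

    @property
    def is_answerable(self) -> bool:
        return bool(self.answers)


def mark(context: str, text: str) -> Span:
    """Mark the first occurrence of `text` in `context` as a span."""
    return Span(text=text, start=context.index(text))


def span_is_valid(context: str, span: Span) -> bool:
    """Check that the stored offset really points at the stored text."""
    return context[span.start:span.start + len(span.text)] == span.text


# Illustrative paragraph (invented for this sketch).
paragraph = "Bratislava je hlavné mesto Slovenska. Leží na rieke Dunaj."

# A question whose answer is marked in the paragraph.
answerable = QAExample(
    question="Aké je hlavné mesto Slovenska?",
    context=paragraph,
    answers=[mark(paragraph, "Bratislava")],
)

# A negative example: the paragraph does not answer the question,
# but it contains a span that looks like a plausible answer.
negative = QAExample(
    question="Na ktorej rieke ležia Košice?",
    context=paragraph,
    plausible_answers=[mark(paragraph, "Dunaj")],
)
```

Storing character offsets alongside the answer text lets a loader verify every annotation with `span_is_valid` before training, and an empty `answers` list is enough to tell negative examples apart.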
The authors thank: