TY  - JOUR
T1  - Korean Paper Based Retrieval Augmented Generation Dataset
AU  - Han, Junho
AU  - Choi, Minjun
AU  - Kim, Keunha
AU  - Ko, Youngjoong
JO  - Journal of KIISE, JOK
PY  - 2026
DA  - 2026/01/14
DO  - 10.5626/JOK.2026.53.3.205
KW  - large language model
KW  - retrieval-augmented generation
KW  - information retrieval
KW  - keyphrase extraction
KW  - response generation evaluation
AB  - Large language models (LLMs) trained on general-domain data have limitations in specialized fields that are rich in information and technical terminology. Retrieval-augmented generation (RAG) improves answer accuracy and reliability by referencing external knowledge, making it particularly effective in specialized domains where pre-training data is scarce. However, public datasets for Korean specialized domains are lacking, which highlights the need for a dedicated retrieval-augmented generation dataset. This paper introduces a new Korean RAG dataset based on scientific and technical papers to support research in this area. We preprocessed existing document-query data to create a searchable corpus and extracted key phrases and key sentences suited for specialized applications. Additionally, we conducted a comprehensive quantitative evaluation of the dataset's quality. By reflecting the unique characteristics of scientific and technical papers, this dataset serves as a robust foundation for Korean RAG systems.