TY  - JOUR
T1  - Korean Paper Based Retrieval Augmented Generation Dataset
AU  - Han, Junho
AU  - Choi, Minjun
AU  - Kim, Keunha
AU  - Ko, Youngjoong
JO  - Journal of KIISE, JOK
PY  - 2026
DA  - 2026/1/14
DO  - 10.5626/JOK.2026.53.3.205
KW  - large language model
KW  - retrieval-augmented generation
KW  - information retrieval
KW  - keyphrase extraction
KW  - response generation evaluation
AB  - Large language models (LLMs) trained on general domain data have limitations in specialized fields that are rich in information and technical terminology. Retrieval-augmented generation (RAG) improves answer accuracy and reliability by referencing external knowledge, making it particularly effective in specialized domains where pre-training data is scarce. However, there is a lack of public datasets for Korean specialized domains, highlighting the need for a dedicated retrieval-augmented generation dataset. This paper introduces a new Korean RAG dataset based on scientific and technical papers to support research in this area. We preprocessed existing document-query data to create a searchable corpus and extracted key phrases and key sentences suited for specialized applications. Additionally, we conducted a comprehensive quantitative evaluation of the dataset's quality. By reflecting the unique characteristics of scientific and technical papers, this dataset serves as a robust foundation for Korean RAG systems.