Search : [ keyword: Information Retrieval ] (10)

Efficient Large Language Model Based Passage Re-Ranking Using Single Token Representations

Jeongwoo Na, Jun Kwon, Eunseong Choi, Jongwuk Lee

http://doi.org/10.5626/JOK.2025.52.5.395

In information retrieval systems, document re-ranking reorders a set of candidate documents according to their relevance to a given query. Numerous studies have leveraged the extensive natural language understanding capabilities of large language models (LLMs) for document re-ranking and have demonstrated groundbreaking performance. However, studies utilizing LLMs focus solely on improving re-ranking performance, which degrades efficiency because of excessively long input sequences and the need for repeated inference. To address these limitations, we propose ListT5++, a novel model that represents the relevance between a query and a passage using a single token embedding and significantly improves the efficiency of LLM-based re-ranking through a single-step decoding strategy that minimizes the decoding process. Experimental results showed that ListT5++ maintains accuracy comparable to existing methods while reducing inference latency by a factor of 29.4 relative to the baseline. Moreover, our approach is robust in that it is insensitive to the initial ordering of candidate documents, which makes it highly practical for real-time retrieval environments.

Improving Retrieval Models through Reinforcement Learning with Feedback

Min-Taek Seo, Joon-Ho Lim, Tae-Hyeong Kim, Hwi-Jung Ryu, Du-Seong Chang, Seung-Hoon Na

http://doi.org/10.5626/JOK.2024.51.10.900

Open-domain question answering involves retrieving clues through search in order to solve problems. In such tasks, it is crucial that the search model provide appropriate clues, as this directly impacts the final performance. Moreover, information retrieval is an important function frequently used in everyday life. Recognizing the significance of these challenges, this paper aims to improve the performance of search models. Just as recent work adjusts the outputs of decoder models using Reinforcement Learning from Human Feedback (RLHF), this study seeks to enhance search models through reinforcement learning. Specifically, we defined two rewards: the loss of the answer model and the similarity between the retrieved documents and the correct document. Based on these, we applied reinforcement learning to adjust the probability score of the top-ranked document in the search model's document probability distribution. Through this approach, we confirmed the generality of the reinforcement learning method and its potential for further performance improvements.

Information Retrieval-based Bug Localization for Korean Bug Reports using Translation

Misoo Kim

http://doi.org/10.5626/JOK.2024.51.9.827

Information retrieval-based bug localization techniques use bug reports as queries to automatically identify faulty source files, significantly reducing the time developers spend locating bugs. The core of these techniques lies in calculating the text similarity between bug reports and source files. However, for bug reports written in Korean, text similarity might not be effective because it is difficult to match their words with source code written primarily in English. This study proposes an information retrieval-based bug localization technique for Korean bug reports using translation, enabling Korean developers to use such techniques effectively. We also applied a soft voting method to leverage the outputs of multiple translators. To validate the performance of the proposed technique, we collected 269 Korean bug reports and conducted experiments using three translators and two ranking models. Experimental results showed that the proposed method improved bug localization performance by 44% compared to baselines.
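The soft voting step can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so this assumes that each translated version of a bug report yields one similarity score per source file, that scores are min-max normalized per translator, and that the final score is their mean. The function name, file names, and scores are hypothetical.

```python
from collections import defaultdict


def soft_vote(score_lists):
    """Average min-max normalized scores across several rankers.

    score_lists: list of dicts mapping source file -> similarity score,
    one dict per translated version of the bug report (assumed setup).
    Returns (file, mean score) pairs sorted best first.
    """
    totals = defaultdict(float)
    for scores in score_lists:
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero when all scores tie
        for path, score in scores.items():
            totals[path] += (score - lo) / span
    n = len(score_lists)
    averaged = {path: total / n for path, total in totals.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores obtained from three translators for one bug report.
translator_a = {"Parser.java": 0.9, "Lexer.java": 0.4, "Main.java": 0.1}
translator_b = {"Parser.java": 0.5, "Lexer.java": 0.7, "Main.java": 0.2}
translator_c = {"Parser.java": 0.8, "Lexer.java": 0.3, "Main.java": 0.3}

ranking = soft_vote([translator_a, translator_b, translator_c])
ranked_files = [path for path, _ in ranking]
```

Averaging normalized scores, rather than counting top-1 votes, lets a file that is ranked consistently high by most translators outrank one that a single translator favors strongly.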

A Comparative Study on the Performance of Named Entity Recognition in Materials and Chemistry Fields through Multiple Embedding Combination Based on a Pre-trained Neural Network Language Model

Myunghoon Lee, Hyeonho Shin, Hong-Woo Chun, Jae-Min Lee, Taehyun Ha, Sung-Pil Choi

http://doi.org/10.5626/JOK.2021.48.6.696

Recently, with the rapid development of the materials and chemistry fields, the volume of academic literature has increased exponentially. Accordingly, studies are being conducted to extract meaningful information from the accumulated data, and Named Entity Recognition (NER) is one of the methodologies being used. NER in the materials and chemistry fields is the task of extracting standardized entities such as materials, material property information, and experimental conditions from academic literature and classifying the types of those entities. In this paper, we studied NER in the materials and chemistry fields by combining embeddings from existing published language models with a bidirectional LSTM-CRF model, without pre-training a neural network language model ourselves. As a result, we found the best performing embedding combinations and analyzed their performance. Additionally, the pre-trained language model was fine-tuned as an NER model to compare performance. The results showed that using a public pre-trained language model for embedding combinations can yield meaningful results for NER in the materials and chemistry fields.

2-Phase Passage Re-ranking Model based on Neural-Symbolic Ranking Models

Yongjin Bae, Hyun Kim, Joon-Ho Lim, Hyun-ki Kim, Kong Joo Lee

http://doi.org/10.5626/JOK.2021.48.5.501

Previous research on QA systems has focused on extracting exact answers for given questions and passages. However, when the problem is expanded from machine reading comprehension to open-domain question answering, finding the passage that contains the correct answer is as important as machine reading comprehension itself. DrQA reported that Exact Match@Top1 performance decreased from 69.5 to 27.1 when the QA system included an initial search step. In the present work, we propose a 2-phase passage re-ranking model to improve the performance of the question answering system. The proposed model integrates the results of a symbolic ranking model and a neural ranking model and re-ranks the passages once more. The symbolic ranking model was trained with the CatBoost algorithm on manually designed features of the question and passage. The neural model was trained by fine-tuning the KorBERT model. The second-stage model was trained as a neural regression model. We maximized performance by combining ranking models with different characteristics. Finally, the proposed model achieved 85.8% MRR and 82.2% BinaryRecall@Top1 on an evaluation of 1,000 questions. These are improvements of 17.3% (MRR) and 22.3% (BR@Top1) over the baseline model.
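The two reported measures can be computed as in the following minimal sketch. MRR is the standard mean reciprocal rank; BinaryRecall@Top1 is interpreted here, as an assumption, as the fraction of questions whose top-ranked passage contains the answer. The passage identifiers and rankings are hypothetical.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Mean of 1/rank of the first relevant passage (0 if none is found)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, passage_id in enumerate(ranked, start=1):
            if passage_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


def binary_recall_at_k(ranked_lists, relevant_sets, k=1):
    """Fraction of questions with at least one relevant passage in the top k."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(passage_id in relevant for passage_id in ranked[:k])
    )
    return hits / len(ranked_lists)


# Hypothetical re-ranked passages for three questions.
rankings = [["p1", "p2", "p3"], ["p4", "p5", "p6"], ["p7", "p8", "p9"]]
answers = [{"p1"}, {"p5"}, {"p0"}]

mrr = mean_reciprocal_rank(rankings, answers)
recall_at_1 = binary_recall_at_k(rankings, answers, k=1)
```

In this toy example the reciprocal ranks are 1, 1/2, and 0, so the MRR is 0.5, while only the first question has an answer-bearing passage at rank 1.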

Passage Re-ranking Method Based on Sentence Similarity Through Multitask Learning

Youngjin Jang, Hyeon-gu Lee, Jihyun Wang, Chunghee Lee, Harksoo Kim

http://doi.org/10.5626/JOK.2020.47.4.416

A machine reading comprehension (MRC) system is a question answering system in which a computer understands a given passage and responds to questions about it. Recently, with the development of deep neural networks, research on MRC systems has been actively conducted, and work is in progress on open-domain MRC systems that identify the correct answer from the results of an information retrieval (IR) model rather than from a given passage. However, if the IR model fails to identify a passage containing the correct answer, the MRC system cannot respond to the question. That is, the performance of an open-domain MRC system depends on the performance of the IR model. Thus, for an open-domain MRC system to achieve high performance, it must be preceded by a high-performance IR model. Previous work on IR models has studied query expansion and re-ranking. In this paper, we propose a re-ranking method using deep neural networks. The proposed model re-ranks the retrieval results (passages) through multi-task learning-based sentence similarity and, in experiments on 58,980 pairs of MRC data, improves performance by approximately 8% over the existing IR model.

Biomedical Named Entity Recognition using Multi-head Attention with Highway Network

Minsoo Cho, Jinuk Park, Jihwan Ha, Chanhee Park, Sanghyun Park

http://doi.org/10.5626/JOK.2019.46.6.544

Biomedical named entity recognition (BioNER) is the process of extracting biomedical entities such as diseases, genes, proteins, and chemicals from biomedical literature. BioNER is an indispensable technique for extracting meaningful data from biomedical domains. The proposed model employs a deep learning-based Bi-LSTM-CRF model, which eliminates the need for hand-crafted feature engineering. Additionally, the model contains multi-head attention to capture the relevance between words, which is used when predicting the label of each input token. In the input embedding layer, the model integrates character-level embedding with word-level embedding and passes the combined word embedding through a highway network to adaptively carry each embedding to the input of the Bi-LSTM model. Two English biomedical benchmark datasets were used to evaluate performance. The proposed model achieved a higher F1-score than previously studied models. The results demonstrate the effectiveness of the proposed methods for biomedical named entity recognition.
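The adaptive carrying behavior of a highway network can be sketched as follows. This is a generic single highway layer, y = t * H(x) + (1 - t) * x, written in plain Python for illustration; it is not the authors' implementation, and the ReLU transform, the weights, and the input vector are assumptions chosen for the example.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def affine(weights, bias, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer: gate t mixes the transformed input with the raw input."""
    transformed = [max(0.0, v) for v in affine(w_h, b_h, x)]  # H(x), ReLU assumed
    gate = [sigmoid(v) for v in affine(w_t, b_t, x)]          # transform gate t
    return [t * h + (1.0 - t) * xi for t, h, xi in zip(gate, transformed, x)]


# Hypothetical 2-dimensional combined (character + word) embedding.
embedding = [0.5, -1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# A strongly negative gate bias closes the gate, so the input is carried through.
carried = highway_layer(embedding, identity, [0.0, 0.0], zeros, [-100.0, -100.0])
# A strongly positive gate bias opens the gate, so the transform output is used.
transformed = highway_layer(embedding, identity, [0.0, 0.0], zeros, [100.0, 100.0])
```

With a closed gate the layer returns the embedding unchanged, and with an open gate it returns the ReLU-transformed embedding; a trained gate falls between these extremes per dimension.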

Bug Report Quality Prediction for Enhancing Performance of Information Retrieval-based Bug Localization

Misoo Kim, June Ahn, Eunseok Lee

http://doi.org/10.5626/JOK.2017.44.8.832

Bug reports are essential documents for developers to localize and fix bugs. These reports contain information regarding software bugs or failures that occur during the software operation and maintenance phase. Information Retrieval-based Bug Localization (IR-BL) techniques have been proposed to reduce the time and cost it takes for developers to resolve bug reports. However, if a low-quality bug report is submitted, the performance of such techniques can be significantly degraded. To address this problem, we propose a quality prediction method that selects low-quality bug reports. This process defines a Quality property of a Bug report as a Query (Q4BaQ) and predicts the quality of bug reports using machine learning. We evaluated the proposed method with three open-source projects. The results of the experiment show that the proposed method achieved an average F-measure of 87.31% and outperformed previous prediction techniques by up to 6.62% in F-measure. Finally, a combination of the proposed method and a traditional automatic query reformulation method improved the MRR and MAP by 0.9% and 1.3%, respectively.

Inverse Document Frequency-Based Word Embedding of Unseen Words for Question Answering Systems

Wooin Lee, Gwangho Song, Kyuseok Shim

http://doi.org/

A question answering (QA) system is a system that finds an actual answer to the question posed by a user, whereas a typical search engine only finds links to relevant documents. Recent work on open-domain QA systems is receiving much attention in the fields of natural language processing, artificial intelligence, and data mining. However, prior work on QA systems simply replaces all words that are not in the training data with a single token, even though such unseen words are likely to play crucial roles in differentiating the candidate answers from the actual answers. In this paper, we propose a method to compute vectors of such unseen words by taking into account the context in which the words have occurred. We also propose a model that utilizes inverse document frequencies (IDF) to efficiently process unseen words by expanding the system’s vocabulary. Finally, we validate through experiments that the proposed method and model improve the performance of a QA system.
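One way to picture the idea is the sketch below, which builds a vector for an unseen word as the IDF-weighted average of the vectors of its known context words. The abstract does not specify how context and IDF are combined, so this weighting scheme, the function names, the tiny corpus, and the 2-dimensional vectors are all assumptions made for illustration.

```python
import math
from collections import Counter


def inverse_document_frequency(documents):
    """IDF(w) = log(N / df(w)) over a corpus of tokenized documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def embed_unseen_word(context, vectors, idf):
    """IDF-weighted average of the vectors of known context words (assumed scheme)."""
    dim = len(next(iter(vectors.values())))
    weighted = [0.0] * dim
    weight_sum = 0.0
    for word in context:
        if word in vectors:
            weight = idf.get(word, 0.0)
            weight_sum += weight
            for i, value in enumerate(vectors[word]):
                weighted[i] += weight * value
    if weight_sum == 0.0:
        return weighted  # no informative context: fall back to the zero vector
    return [value / weight_sum for value in weighted]


# Hypothetical corpus and embeddings.
corpus = [
    ["the", "capital", "of", "france"],
    ["the", "capital", "of", "korea"],
    ["the", "river", "in", "france"],
    ["the", "city"],
]
vectors = {"the": [4.0, 4.0], "capital": [1.0, 0.0], "france": [0.0, 1.0]}

idf = inverse_document_frequency(corpus)
# "paris" is unseen; it appeared in the context "the capital ... france".
paris_vector = embed_unseen_word(["the", "capital", "france"], vectors, idf)
```

Because "the" occurs in every document, its IDF is zero and it contributes nothing, so the resulting vector is driven by the rarer, more informative context words.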

A Semi-automatic Construction method of a Named Entity Dictionary Based on Wikipedia

Yeongkil Song, Seokwon Jeong, Harksoo Kim

http://doi.org/

A named entity (NE) dictionary is an important resource for the performance of NE recognition. However, it is not easy to construct an NE dictionary manually since human annotation is time-consuming and labor-intensive. To save construction time and reduce human labor, we propose a semi-automatic system for the construction of an NE dictionary. The proposed system constructs a pseudo-document with Wiki-categories per NE class by using an active learning technique. Then, it calculates similarities between Wiki entries and pseudo-documents using the BM25 model, a well-known information retrieval model. Finally, it classifies each Wiki entry into NE classes based on these similarities. In experiments with three different types of NE class sets, the proposed system showed high performance (a macro-average F1-score of 0.9028 and a micro-average F1-score of 0.9554).
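The similarity and classification steps can be sketched with a standard Okapi BM25 scorer. This is a minimal illustration under stated assumptions: the Wiki entry's category tokens act as the query, each NE class's pseudo-document acts as the document, and k1 = 1.2 and b = 0.75 are common default values rather than settings from the paper. The class names and tokens are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, documents, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1.0) / (tf + length_norm)
        scores.append(score)
    return scores


# Hypothetical pseudo-documents built from Wiki-categories, one per NE class.
pseudo_documents = {
    "PERSON": ["births", "living", "people", "actors"],
    "LOCATION": ["cities", "countries", "capitals", "provinces"],
    "ORGANIZATION": ["companies", "universities", "agencies", "clubs"],
}
# Hypothetical category tokens of one Wiki entry.
entry_categories = ["capitals", "cities", "asia"]

classes = list(pseudo_documents)
scores = bm25_scores(entry_categories, [pseudo_documents[c] for c in classes])
best_class = classes[scores.index(max(scores))]
```

The entry is assigned to the class whose pseudo-document scores highest; a real system would also need a rule for entries whose scores are all zero or nearly tied.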


Journal of KIISE

  • ISSN : 2383-630X(Print)
  • ISSN : 2383-6296(Electronic)
  • KCI Accredited Journal

Editorial Office

  • Tel. +82-2-588-9240
  • Fax. +82-2-521-1352
  • E-mail. chwoo@kiise.or.kr