Search : [ keyword: neural machine translation (신경 기계 번역) ] (4)

Mini-Batching with Similar-Length Sentences to Quickly Train NMT Models

Daniela N. Rim, Richard Kimera, Heeyoul Choi

http://doi.org/10.5626/JOK.2023.50.7.614

The Transformer model has revolutionized Natural Language Processing tasks such as Neural Machine Translation, and many efforts have been made to study its architecture to increase efficiency and accuracy. One potential area for improvement is the computation of empty (padding) tokens, which the Transformer processes only to discard later, creating an unnecessary computational burden. To address this, we propose an algorithm that sorts translation sentence pairs by length before batching, so that each mini-batch contains sentences of similar length and little computing power is wasted on padding. Since sorting the entire dataset could violate the independent and identically distributed (i.i.d.) data assumption, we sort the data only partially. In experiments, we apply the proposed method to English-Korean and English-Luganda machine translation and show gains in computational time while maintaining performance. Our method is independent of the architecture, so it can be easily integrated into any training process with variable-length data.
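A minimal sketch of the idea of partially sorted, length-based mini-batching: sentences are sorted by length only within shuffled pools, so batches contain similar-length pairs while the data stays approximately i.i.d. Function names, the pool size, and the toy data are assumptions for illustration, not the authors' implementation.

```python
import random

def make_length_sorted_batches(pairs, batch_size, pool_size=1024, seed=0):
    """Group (source, target) pairs into mini-batches of similar-length
    sentences by sorting only inside shuffled pools (hypothetical sketch)."""
    rng = random.Random(seed)
    pairs = pairs[:]          # copy so the caller's list is untouched
    rng.shuffle(pairs)        # global shuffle first

    batches = []
    for start in range(0, len(pairs), pool_size):
        pool = pairs[start:start + pool_size]
        # sort only within the pool, by the longer side of each pair
        pool.sort(key=lambda p: max(len(p[0].split()), len(p[1].split())))
        for b in range(0, len(pool), batch_size):
            batches.append(pool[b:b + batch_size])

    rng.shuffle(batches)      # shuffle batch order each epoch
    return batches

# Toy usage: lengths within a batch stay close, so little padding is needed
pairs = [("this is a short sentence", "짧은 문장입니다"),
         ("a much longer english sentence with many more tokens in it",
          "토큰이 훨씬 더 많은 긴 문장")] * 8
for batch in make_length_sorted_batches(pairs, batch_size=4):
    print([len(src.split()) for src, _ in batch])
```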

Korean-English Neural Machine Translation Using Korean Alphabet Characteristics and Honorific Expressions

Jeonghui Kim, Jaemu Heo, Joowhan Kim, Heeyoul Choi

http://doi.org/10.5626/JOK.2022.49.11.1017

Recently, deep learning has improved the performance of machine translation, but in most cases it does not reflect the characteristics of the languages involved. In particular, Korean has unique word- and expression-level features, which can cause mistranslation. For example, when translating Korean to English with Google Translate, mistranslations occur when a Korean noun is followed by a postposition (josa) in the form of a single consonant. Also, in English-Korean translations, honorific and casual expressions are mixed in the output. This is because the alphabetic characteristics and honorifics of the Korean language are not reflected. In this paper, to address these problems, we propose training the model on sub-words composed of letter units (jamo) and unifying honorific and casual expressions in the corpus. The experimental results confirm that the proposed method resolves the problems mentioned above and achieves a BLEU score similar to or slightly higher than that of the existing method and corpus.
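A minimal sketch of decomposing Hangul syllables into letter units (jamo) via standard Unicode arithmetic, the kind of preprocessing step the abstract describes before sub-word segmentation. The function name and the decision to fall back to the original character for non-Hangul input are assumptions, not the paper's exact pipeline.

```python
def to_jamo(text: str) -> str:
    """Decompose precomposed Hangul syllables (U+AC00..U+D7A3) into
    conjoining jamo: initial (choseong), medial (jungseong), final (jongseong)."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:
            offset = code - 0xAC00
            initial = offset // (21 * 28)          # 19 initial consonants
            medial = (offset % (21 * 28)) // 28    # 21 vowels
            final = offset % 28                    # 27 final consonants + none
            out.append(chr(0x1100 + initial))
            out.append(chr(0x1161 + medial))
            if final:
                out.append(chr(0x11A7 + final))
        else:
            out.append(ch)                          # keep non-Hangul characters
    return "".join(out)

print(to_jamo("한국어"))  # 한국어 -> letter-level units for sub-word training
```

The same decomposition can also be obtained with unicodedata.normalize("NFD", text); the explicit arithmetic above only makes the syllable structure visible.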

Building a Parallel Corpus and Training Translation Models Between Luganda and English

Richard Kimera, Daniela N. Rim, Heeyoul Choi

http://doi.org/10.5626/JOK.2022.49.11.1009

Neural machine translation (NMT) has recently achieved great success, but it requires large datasets and is therefore largely premised on high-resource languages. This continues to disadvantage low-resource languages such as Luganda, which lack high-quality parallel corpora; even Google Translate did not serve Luganda at the time of this writing. In this paper, we build a parallel corpus of 41,070 sentence pairs for Luganda and English, based on three different open-source corpora. We then train NMT models on the dataset with hyper-parameter search. Experiments yield a BLEU score of 21.28 from Luganda to English and 17.47 from English to Luganda, and sample translations show high translation quality. We believe that our model is the first Luganda-English NMT model. The bilingual dataset we built will be made available to the public.
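For readers unfamiliar with how corpus-level BLEU scores such as those above are typically computed, here is a minimal sketch using the sacrebleu library. The paper does not state which evaluation tool it used, and the sentences below are invented placeholders, not examples from the Luganda-English corpus.

```python
import sacrebleu

# Hypothetical model outputs and aligned references (illustrative only)
hypotheses = ["the farmer went to the market",
              "the children are playing near the lake"]
references = [["the farmer went to the market",
               "the children play by the lake"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```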

Kor-Eng NMT using Symbolization of Proper Nouns

Myungjin Kim, Junyeong Nam, Heeseok Jung, Heeyoul Choi

http://doi.org/10.5626/JOK.2021.48.10.1084

Neural machine translation has made progress, but there are still cases where sentences containing proper nouns, such as names, new words, and words used only within a specific group, are translated inaccurately. To handle such cases, this paper uses a Korean-English proper noun dictionary and a symbolization method on top of the recently proposed Transformer translation model. In the proposed method, some of the words in the training sentences are symbolized using the proper noun dictionary, and the translation model is trained on sentences containing the symbolized words. A new sentence is translated by symbolizing it, translating it, and then desymbolizing the output. The proposed method was compared with a model without symbolization, and in some cases the improvement was quantitatively confirmed with the BLEU score. Several translation examples are also presented along with results from commercial services.
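A minimal sketch of the symbolize/translate/desymbolize pipeline described above. The placeholder token format, the dictionary entry, and the stand-in model output are assumptions for illustration; the paper's actual symbols and dictionary are not specified here.

```python
def symbolize(sentence, noun_dict):
    """Replace known source-language proper nouns with placeholder symbols
    and remember the target-language form for each symbol (hypothetical)."""
    mapping = {}
    for i, noun in enumerate(noun_dict):
        if noun in sentence:
            symbol = f"__NE{i}__"
            sentence = sentence.replace(noun, symbol)
            mapping[symbol] = noun_dict[noun]
    return sentence, mapping

def desymbolize(translation, mapping):
    """Restore the target-language proper nouns in the translated output."""
    for symbol, target_noun in mapping.items():
        translation = translation.replace(symbol, target_noun)
    return translation

# Toy Korean-English proper noun dictionary (illustrative entry only)
noun_dict = {"한동대학교": "Handong Global University"}

src, mapping = symbolize("나는 한동대학교 에 다닌다", noun_dict)
# ... src would be passed to the trained Transformer model here ...
fake_output = "I attend __NE0__"          # stand-in for the model's output
print(desymbolize(fake_output, mapping))  # I attend Handong Global University
```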

