Affiliation:
1. School of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450002, China
2. School of Software, Zhengzhou University of Light Industry, Zhengzhou 450002, China
Abstract
Existing research on multimodal machine translation (MMT) has typically enhanced bilingual translation by introducing aligned visual information. However, the requirement for paired images in multimodal datasets imposes significant constraints on the development of MMT, because it demands alignment among the image, the source text, and the target text. This limitation is compounded by the fact that aligned images are not directly available at inference time in a conventional neural machine translation (NMT) setup. We therefore propose an MMT framework, the DSKP-MMT model, that supports translation in the absence of images by combining knowledge distillation with feature refinement. The model first generates multimodal features from the source text alone; a multimodal feature generator and a knowledge distillation module then purify these features, and image-feature enhancement refines them further. Finally, the fused image–text features serve as input to a transformer-based translation model. On the Multi30K test set, the DSKP-MMT model achieves a BLEU score of 40.42 and a METEOR score of 58.15, demonstrating its ability to improve translation quality and facilitate cross-lingual communication.
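The pipeline described above — generating pseudo-visual features from text, distilling them against real image features at training time, and fusing them with the text representation for translation — can be illustrated with a minimal sketch. This is not the authors' implementation: all function names, dimensions, and the linear "generator" are hypothetical stand-ins (a real system would use transformer encoders and a learned projection), shown only to make the data flow concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(token_ids, d_model=8):
    # Toy embedding lookup standing in for a transformer text encoder.
    table = rng.normal(size=(100, d_model))
    return table[token_ids]  # shape: (seq_len, d_model)

def multimodal_feature_generator(text_feats, d_img=8):
    # "Hallucinate" a visual feature from text alone (hypothetical linear map),
    # so no image is needed at inference time.
    w = rng.normal(size=(text_feats.shape[1], d_img)) / np.sqrt(text_feats.shape[1])
    return text_feats.mean(axis=0) @ w  # pooled pseudo-image feature, shape: (d_img,)

def distillation_loss(student_feat, teacher_feat):
    # Training-time only: match generated features to real image features
    # (the teacher) with a simple MSE objective.
    return float(np.mean((student_feat - teacher_feat) ** 2))

def fuse(text_feats, img_feat):
    # Concatenation-style image-text fusion: broadcast the (pseudo-)image
    # feature onto every token position.
    tiled = np.tile(img_feat, (text_feats.shape[0], 1))
    return np.concatenate([text_feats, tiled], axis=1)

token_ids = np.array([5, 17, 42])
text_feats = text_encoder(token_ids)
pseudo_img = multimodal_feature_generator(text_feats)
teacher_img = rng.normal(size=pseudo_img.shape)      # real image feature (training only)
loss = distillation_loss(pseudo_img, teacher_img)
fused = fuse(text_feats, pseudo_img)                  # input to the translation decoder
print(fused.shape)
```

At inference, only `text_encoder`, `multimodal_feature_generator`, and `fuse` are needed; the teacher image features and the distillation loss are used solely during training, which is what removes the image requirement from the deployed translator.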
Funder
Henan Provincial Science and Technology Research Project
Zhengzhou University of Light Industry Science and Technology Innovation Team Program Project