Image Classification Integrating Transformer with Convolutional Neural Network

Authors

  • Yulin Peng

DOI:

https://doi.org/10.56028/aetr.6.1.621.2023

Keywords:

Classification; Transformer; Convolutional Neural Network.

Abstract

Convolutional neural networks (CNNs) are among the most widely used deep learning methods in computer vision. They effectively extract local spatial information from images but lack global understanding and dependency modelling of image features, so contextual information cannot be fully exploited by the network. For example, on position-sensitive tasks (such as object detection and image generation), a CNN may fail to accurately locate or reconstruct the position and shape of objects. In contrast to traditional CNN models such as ResNet, Transformers rely on a global attention mechanism to capture long-distance dependencies between patches. This paper presents a lightweight method that integrates a Transformer with five convolutional layers. The combined CNN-Transformer model is evaluated on two benchmark datasets, MNIST and CIFAR-10. After only a few epochs, the model converges and reaches accuracies of 99.34% on MNIST and 92.04% on CIFAR-10. It outperforms a single CNN and several state-of-the-art models on both datasets, especially in distinguishing similar images such as '6' and '9', or 'bird' and 'plane'. These results indicate the model's good robustness and generality.
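The abstract describes a pipeline in which convolutional layers extract local features and a Transformer-style attention layer then mixes them globally. The following NumPy sketch illustrates that data flow only: five valid 3x3 convolutions on an MNIST-sized input, the resulting feature map flattened into patch tokens, and a single-head self-attention layer before a linear classifier. The channel widths, random weights, and single attention head are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w):
    """Valid 2-D convolution (cross-correlation) with ReLU.
    x: (H, W, Cin), w: (k, k, Cin, Cout)."""
    k = w.shape[0]
    H, W, _ = x.shape
    out = np.empty((H - k + 1, W - k + 1, w.shape[3]))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Contract the (k, k, Cin) window against the kernel.
            out[i, j] = np.tensordot(x[i:i + k, j:j + k], w, axes=3)
    return np.maximum(out, 0.0)

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over patch tokens."""
    q, k_, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k_.T / np.sqrt(q.shape[1])
    # Row-wise softmax: every token attends to every other token.
    a = np.exp(scores - scores.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    return a @ v

# Five convolutional layers (channel widths are assumptions).
channels = [1, 8, 8, 16, 16, 32]
x = rng.standard_normal((28, 28, 1))  # one MNIST-sized grayscale image
for cin, cout in zip(channels, channels[1:]):
    x = conv2d(x, rng.standard_normal((3, 3, cin, cout)) * 0.1)

# Flatten: each remaining spatial position becomes one token.
tokens = x.reshape(-1, x.shape[2])
d = tokens.shape[1]
mixed = self_attention(tokens,
                       rng.standard_normal((d, d)) * 0.1,
                       rng.standard_normal((d, d)) * 0.1,
                       rng.standard_normal((d, d)) * 0.1)

# Mean-pool the tokens and project to 10 class logits.
logits = mixed.mean(axis=0) @ (rng.standard_normal((d, 10)) * 0.1)
print(logits.shape)  # (10,)
```

With random weights the logits are meaningless; the point is how local convolutional features become tokens that attention can relate across the whole image, which is the mechanism the abstract credits for separating look-alike classes.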

Published

2023-08-01