Hierarchical Vision Transformer Model-based Lung Cancer Detection with Multiscale Patch Embedding and Cross-Attention Fusion

K Yogeswara Rao, K Srinivasa Rao

Abstract

Lung cancer remains one of the most difficult malignancies to diagnose early, particularly with CT imaging, owing to the intricate and unpredictable appearance of malignant patterns. Vision Transformers (ViTs) substantially improve feature extraction by capturing the global context of complex images for accurate diagnosis. However, extracting local spatial features of small nodules while preserving global features is challenging, because the patch merging in hierarchical ViTs is ineffective for such diverse images. This work therefore introduces a hybrid model that combines a Convolutional Neural Network (CNN) with a hierarchical Vision Transformer (ViT) for lung cancer detection, enriched by multiscale patch embedding and cross-attention fusion to improve feature extraction and analysis of lung PET/CT images. The proposed approach first applies preprocessing and augmentation to improve generalization for the lung cancer detection task. In the hybrid model, the CNN extracts local spatial features from the fused multimodal PET/CT images and partitions the resulting feature maps into multiple scales, which are fed to the hierarchical ViT through multiscale patch embedding and positional encoding. The cross-attention fusion designed into the hierarchical ViT then combines the multiscale information, allowing the model to concentrate on relevant patterns and improve diagnostic accuracy. Experimental results show that, by efficiently merging multiscale embeddings, the proposed model outperforms existing lung cancer detection approaches, particularly in cases with small or indistinct lesions.
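
To make the described architecture concrete, the sketch below shows one plausible way to wire a CNN stem, multiscale patch embedding with positional encoding, and cross-attention fusion together in PyTorch. It is a minimal illustration under stated assumptions rather than the authors' implementation: the two-channel PET/CT input, the two patch scales (8 and 16), the embedding dimension, and the module names (CNNStem, MultiScalePatchEmbed, CrossAttentionFusion, HybridLungModel) are all hypothetical choices, and the per-scale transformer encoder blocks of a full hierarchical ViT are omitted for brevity.

```python
import torch
import torch.nn as nn

class CNNStem(nn.Module):
    """Small convolutional stem extracting local spatial features
    from a 2-channel (PET + CT) input slice."""
    def __init__(self, in_ch=2, out_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)  # (B, out_ch, H, W)

class MultiScalePatchEmbed(nn.Module):
    """Projects the CNN feature map into token sequences at several patch
    sizes and adds a learnable positional embedding per scale."""
    def __init__(self, in_ch=64, dim=128, img_size=128, patch_sizes=(8, 16)):
        super().__init__()
        self.projs = nn.ModuleList(
            nn.Conv2d(in_ch, dim, kernel_size=p, stride=p) for p in patch_sizes
        )
        self.pos = nn.ParameterList(
            nn.Parameter(torch.zeros(1, (img_size // p) ** 2, dim)) for p in patch_sizes
        )

    def forward(self, feat):
        tokens = []
        for proj, pos in zip(self.projs, self.pos):
            t = proj(feat).flatten(2).transpose(1, 2)  # (B, N_p, dim)
            tokens.append(t + pos)
        return tokens  # list of per-scale token sequences

class CrossAttentionFusion(nn.Module):
    """Fuses two scales: fine-scale tokens act as queries attending to
    coarse-scale tokens; the pooled result is classified."""
    def __init__(self, dim=128, heads=4, num_classes=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, fine, coarse):
        fused, _ = self.attn(query=fine, key=coarse, value=coarse)
        fused = self.norm(fine + fused)      # residual connection
        return self.head(fused.mean(dim=1))  # mean-pool tokens -> logits

class HybridLungModel(nn.Module):
    """CNN stem -> multiscale patch embedding -> cross-attention fusion."""
    def __init__(self, img_size=128, num_classes=2):
        super().__init__()
        self.stem = CNNStem()
        self.embed = MultiScalePatchEmbed(img_size=img_size)
        self.fusion = CrossAttentionFusion(num_classes=num_classes)

    def forward(self, x):
        fine, coarse = self.embed(self.stem(x))
        return self.fusion(fine, coarse)

model = HybridLungModel()
logits = model(torch.randn(2, 2, 128, 128))  # batch of fused PET/CT slices
print(logits.shape)  # torch.Size([2, 2])
```

In this sketch the fine scale supplies the queries so that small-nodule detail is preserved while the coarse scale contributes global context through the attention keys and values, mirroring the local/global trade-off discussed in the abstract.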
