← Back

Anti-melanoma effect of ruthenium(II)-diphosphine complexes containing naphthoquinone ligand.

PMID: 34090039
Brain Tumor Classification from MRI using Vision Transformers Ensembling Sudhakar Tummala (  sudhakar.t@srmap.edu.in ) SRM University AP https://orcid.org/0000-0001-5735-9418 Research Article Keywords: brain tumor, MRI, vision transformer, diagnosis Posted Date: April 26th, 2022 DOI: https://doi.org/10.21203/rs.3.rs-1593662/v1 License:   This work is licensed under a Creative Commons Attribution 4.0 International License. Read Full License Page 1/17 Abstract Automated classification of brain tumors plays an important role in supporting radiologists in decision making. Recently, Vision Transformer (ViT) based deep neural network architectures have gained attention in the computer vision research domain owing to the tremendous success of transformer models in natural language processing. However, studies involving vision transformers for various tasks in the medical imaging domain, including in the field of neuroimaging, are still growing. Many methods have been developed for the classification of brain tumors using traditional machine learning and deep learning methods. In particular, there are several convolutional neural network based transfer learning approaches for achieving good tumor classification accuracy. In this study, pretrained and finetuned ViT models on the ImageNet were adopted for the classification task. A brain tumor dataset from figshare consisting of 3064 T1-weighted contrast-enhanced (CE) magnetic resonance imaging (MRI) slices with meningioma, glioma, and pituitary tumor was used for cross-validation and testing of ensembled ViT models ability for 3-class classification task. The ensemble of all four ViT models B/16, B/32, L/16, and L/32, has demonstrated an overall testing accuracy of 98.7% at 384 × 384 resolution. Therefore, an ensemble of ViT models could be deployed for the computer-aided diagnosis of brain tumors based on T1w CE MRI leading to radiologist relief. Introduction Brain tumors (BT) are characterized by the abnormal growth of neural and glial cells in the brain. BT causes several medical conditions including loss of sensation, hearing and vision problems, headaches, nausea and seizures [1, 2]. There exist several types of brain tumors and the most prevalent cases include meningioma (originates from the membrane surrounding the brain) which is non-cancerous, glioma (starts from glial cells and spinal cord) and glioblastoma (grows from the brain) which are cancerous [3, 4]. Sometimes cancer can spread from other parts of the body which is called brain metastasis [5]. Pituitary tumor is another type of brain tumor that develops in the pituitary gland in the brain which primarily regulates other glands of the body [6]. Magnetic resonance imaging (MRI) is a versatile imaging method that enables to visualize inside the body noninvasively and it is in extensive use in the field of neuroimaging [7]. There exist several structural MRI protocols to visualize inside the brain but prime modalities include T1-weighted (T1w), T2-weighted, and T1w contrast-enhanced (CE) MRI. BTs appear with different pixel intensity contrasts in structural MRI images compared with neighboring normal tissues enabling clinical radiologists to diagnose the tumor [8]. There were several studies to classify brain tumors automatically using MRI images starting with traditional machine learning classifiers such as support vector machines (SVM), k-nearest-neighbor (kNN), and Random Forest from hand crafted features of the MRI slices [9–12]. With the rise of convolutional neural network (CNN) deep learning model architectures since 2012 and along with emerging advanced computational resources such as GPUs and TPUs, during the past decade, several methods have been proposed for the classification of brain tumors based on finetuning the existing stateof-the-art CNN models such as AlexNet, VGG16, ResNets, Inception, DenseNets, Xception, which were Page 2/17 already successful for various computer vision tasks [13–22]. These aforementioned pretrained CNN models based on localized convolutions demonstrated excellent performance in the brain tumor classification that were tested on different datasets [23–26]. CNNs generally have inductive bias i.e., the translation equivariance of the local receptive field, due to the inductive bias, the CNN models have issues learning long range information, and moreover, data augmentation is generally required for CNNs to improve their performance due to their dependency on local pixel variations during learning. Lately, transformers [27] have become the de facto models for natural language processing. An adapted version of the transformer for images, the vision transformer (ViT), has been proposed in [28] and it seemingly performed superior to CNN models under a huge data regime as demonstrated by its improved performance when it was trained on JFT dataset with 300M images [28]. Therefore, to fully exploit the power of ViTs, a large amount of data is required and it may not be possible in medical imaging domains to collect such an amount of data. To deal with this, transfer learning approaches can be applied using pretrained and finetuned ViT models. These approaches were already successful in a few existing medical imaging diagnostics [29–33]. Hence, in this work, the ability of pretrained and finetuned ViT models both individually and in an ensemble manner is evaluated for the classification of meningioma, glioma and pituitary tumors from T1w CE MRI at 224 × 224 and 384 × 384 resolutions. Related Work ML and CNN based networks Before the feasibility of using deep CNN based models in brain tumors classification, ML classifiers based on feature engineering from MRI images are the standard. In [9], several texture features extracted from MRI images were used to train SVM, kNN and extreme learning machine classifiers. In another study [11], tumor classification was conducted based on tumor shape, image intensity characteristics and rotation invariant texture features along with an SVM classifier and this method was applied to classify 102 types of brain tumors. In a study based on features extracted from structural MRI, diffusion-weighted MRI, and perfusion MRI, a four-class tumor classification system was developed using an SVM classifier [12]. With the rise of several state-of-the-art deep CNN models and the advent of transfer learning, neural network architectures have emerged as the standard for brain tumor classification from MRI. In [10], ensemble deep features extracted from 13 pretrained CNN models along with 9 machine learning classifiers are employed for improved classification. A Siamese network based tumor identification was performed based on GoogLeNet encodings and contrastive loss in a medical image retrieval study [13]. Similar GoogLeNet encodings along with ML classifiers were employed for brain tumor classification on the internet of medical things setup in other studies [22, 24]. Glioma classification using the data from the multimodal brain tumor image segmentation benchmark 2018 and the cancer imaging archive low grade glioma was performed using 2D mask regional CNN and 3D CNN models [15]. Since CNN models are also data hungry, variational autoencoders along with generative adversarial networks were used for synthetic Page 3/17 data generation and ResNet50 for tumor classification in a very recent study [18]. Using the figshare brain tumor dataset, transfer learning from VGG16, VGG19, ResNet50, and DenseNet21 models with four different optimization algorithms were implemented and the authors concluded that ResNet50 with Adadelta performed better among all [19]. Despite the success of CNN models, they have an inherent inductive bias which limits their performance towards unseen data when the object of interest in the image has different orientations and scales. ViT based networks ViT models proposed by [28] have a less inductive bias due to global patch-based learning and they learn more appropriate inductive biases specific to the requirement. Usage of ViT models for medical imaging diagnostics is sparse and still in its infancy because ViTs were recently introduced and they require large amounts of data and higher computational resources for training to exhibit their full potential. In [34], several pretrained and finetuned models on ImageNet21k and ImageNet2012 datasets with various patch sizes and the different number of multi-head self-attention layers allowing finetuning to a downstream task are provided and are openly available. In [35], ViTs ability to classify breast cancer from ultrasound image is presented where the authors compared the performance of several pretrained and finetuned models and concluded that ViTs performed better than conventional CNNs; in particular, ViT-B/32 achieved superior performance among all. In another recent work [36], a ViT based explainable covid-19 and pneumonia classification model was developed from chest X-rays and computed tomography images. To address ViT demands in terms of large data and computational resources, a data efficient transformer was introduced based on regularization and data augmentation methods similar to CNNs [37]. Swin transformer is another variant of ViT based on shifted windows technique [38] and recently a Swin-Unet has been proposed for multiorgan and cardiac image segmentation tasks [39]. In another recent work [40], the Segtran model was developed for medical image segmentation tasks such as optic disc segmentation in digital fundus images, polyp segmentation in colonoscopy images and brain tumor segmentation using MRI based on squeeze-and-expansion transformers. More recent advances in ViT based models in various fields including the medical imaging field could be found here [41]. Methods This section describes the dataset, the vision transformer architecture, computational infrastructure for model training, hyperparameter tuning using the validation set, and testing. The ViT models ensembling and the performance metrics employed are also discussed. Dataset An openly available dataset from figshare consists of 3064 T1-weighted CE MRI slices from 233 patients with meningioma or glioma or pituitary tumors. The images are available in all sagittal, coronal and axial directions with spatial resolutions of 512 × 512 or 256 × 256. More details about the dataset are available Page 4/17 at [42, 43]. A few MRI images from the figshare dataset are illustrated in Fig. 1. Further, a brief clinical description of the three types of tumors is given below. Meningioma: these are mostly benign tumors originating from the arachnoid cap cells and often occur in older age individuals and females. These tumors account for 13–26% of all intracranial tumors [44]. Glioma: gliomas are the most frequent and primary intracranial tumors which are malignant. They represent 81% of all intracranial tumors which can cause significant mortality and morbidity [45]. Pituitary Tumor: it originates in the pituitary gland and is mostly benign. Since this gland regulates different hormones, tumors present in it may cause severe changes in the body. These tumors contribute to 10–15% of all intracranial tumors [3]. The number of images for each tumor category and the number of images used for training, validation and testing in a 70:10:20 ratio respectively are described in Table 1. Table 1 Figshare dataset showing the number of MRI slices for each tumor category. MRI: magnetic resonance imaging, BT: brain tumor, N: number of images. BT type Total Images Training Validation Testing Meningioma 708 502 75 131 Glioma 1426 988 148 290 Pituitary Tumor 930 647 91 192 Total (N) 3064 2137 314 613 Vision transformer The ViT proposed by [28] works by treating image patches as words to mimic the original transformer model developed for natural language processing tasks [27]. Although the original transformer model was the combination of both an encoder and a decoder, the ViT model has only an encoder in its architecture. In ViT, the input image I is R H × W × C, it is divided into N patches of size P × P × C where N= HW P2 (H: height, W: width, C: number of channels). Afterward, linear embeddings are computed for these image patches and position embeddings are added to them to keep the patch positional information (Fig. 2). An extra learnable patch embedding is added for final classification by a multilayer perceptron (MLP) head. Further, these combined patch and position embeddings are fed to the transformer encoder model which has alternating layers of multiheaded self-attention and MLP blocks (Fig. 3). In this work, the pretrained and finetuned ViT base (B) and large (L) models: B/16, L/16, B/32 and L/32 (16 and 32 indicate square patch size) on ImageNet-21k and ImageNet-1k datasets respectively were used. Hence, the MRI images were resized to the resolution of 224 × 224 and 384 × 384. Since these Page 5/17 pretrained ViT models require three channels in the input and as the MRI slice has a single channel, the same grayscale MRI image is copied into the other two channels. Like [class] in BERT [46], a learnable embedding is concatenated to the sequence of patch embeddings 0 (z0 = Iclass). Mathematically, the working principle of ViT is given below in Equations (1)-(4). In Eq. (1), E pos is the positional embeddings and x N p E is the embedding of patch N which was a learnable linear projection. The first block of the transformer encoder layer starts with layer normalization (LN) followed by multi-head self-attention (MSA), and a residual connection follows that; the second block also starts with the LN layer followed by an MLP and a residual connection as shown in Fig. 3 and Equations (2) and (3). The MLP in the transformer block contains two fully connected layers with GELU (gaussian error linear unit) nonlinearity. The output of the final transformer encoder layer will be z0 L which is further layer normalized as described in Eq. (4) to get the final latent representation y (with dimension D) of the input image I. The MLP head or the final classification head is attached to this final latent representation (Fig. 2) during both pretraining and finetuning. z0 = ' [I 1 2 N class ; x pE; x pE; …; x p E ( ( zl = MSA LN zl − 1 ]+E pos E ∈ R ( P2 . C ) × D , E pos ∈ R ( N + 1 ) × D ) ) + zl − 1l = 1…L ' (2) (3) ( ( )) + z l = 1…L y = LN (z ) zl = MLP LN zl (1) ' l (4) 0 L More details about the pretraining and finetuning of ViT models on larger datasets are described in detail in [28]. Computational infrastructure Google Colab Pro cloud environment which provides about 25 GB RAM along with nvidia T4 GPU accelerator was used. The model training, validation and testing were implemented in TensorFlow 2.8.0 which has Keras as a high-level API. The pretrained and finetuned ViT models available at the vit-keras module are used for the downstream task of 3-class classification of brain tumors from the figshare dataset. Custom Python scripts were written where and when necessary. Model ensembling To evaluate the ensemble models for class prediction, the procedure described in Equations (5) and (6) is followed. The softmax outputs of each model (softmaxi) are dot-wise added and finally divided by the number of individual models (N) to obtain the final output (softmaxe) of the ensemble classifier. Two Page 6/17 ensembling procedures are evaluated, where the first one is the ensemble of all models at 224 × 224 resolution and the second ensemble is combining all models at 384 × 384 resolutions. 1 (5) N softmax e = N ∑ i = 1softmax i ( finalclassprediction = argmax softmax e ) (6) Performance metrics Since it is a multi-class classification, sparse categorical cross-entropy was used as the loss metric and sparse categorical accuracy was used as the performance metric during training and validation. Confusion matrix and overall sparse categorical accuracy are used as model evaluation metrics during testing. The model’s hyperparameters that were tuned are optimizer (RMSprop/Adam/Adadelta), learning rate (lr), number of epochs (ne), and mini-batch size (mbs). Optimization of the hyperparameters was conducted using the validation set. For calculating performance metrics on the test set, the hyperparameters that gave the best accuracy values during 5-fold cross-validation are considered. Results Initially, the image intensities were rescaled to have values between − 1 and 1, which is a requirement for ViT models. During training, all parameters of the ViT models were allowed to be finetuned. For the input image resolution of 224 × 224, the optimized hyperparameters with respect to validation accuracy are Adam optimizer with lr = 0.0001, ne = 25, and mbs = 16. B/16 model performed best at this resolution with a validation accuracy of 97.83%. For the rest of the models, performance at different hyperparameter combinations is given in Table 2 and the best hyperparameters and accuracy values are highlighted. Similarly, at 384 × 384 resolution, the optimized hyperparameters for the best validation accuracy of 98.64% from L/16 model are Adadelta with lr = 0.1, ne = 10 and mbs = 8. Adadelta was solely the best optimizer at this resolution. The optimized hyperparameters and validation accuracies for all other models B/16, B/32, L/16, and L/32 are 98.10%, 98.04%, and 98.55% respectively. Due to computational constraints, training at 384 resolution is implemented with lower mbs values. Page 7/17 Table 2 Validation accuracy values for different optimizers and hyperparameters for ViT-B/16, ViT-B/32, ViTL/16 and ViT-L/32 for both input image resolutions of 224 × 224 and 384 × 384. ViT: vision transformer, ne = number of epochs, mbs = mini-batch size, lr = learning rate. B: base, L: large. Resolution Optimizers & Hyperparameters Validation accuracy in percentage ViT-B/16 224 × 224 RMSprop { lr = 0.0001, ne = 25, mbs = 16 lr = 0.0001, ne = 20, mbs = 32 lr = 0.00005, ne = 15, mbs = 32 Adam { lr = 0.0001, ne = 25, mbs = 16 lr = 0.0001, ne = 20, mbs = 32 lr = 0.00005, ne = 15, mbs = 32 Adadelta { 384 × 384 lr = 0.1, ne = 15, mbs = 16 lr = 0.1, ne = 20, mbs = 32 lr = 0.05, ne = 15, mbs = 32 RMSprop { lr = 0.0001, ne = 15, mbs = 8 lr = 0.0001, ne = 10, mbs = 16 lr = 0.00005, ne = 10, mbs = 8 Adam { lr = 0.0001, ne = 15, mbs = 8 lr = 0.0001, ne = 10, mbs = 16 lr = 0.00005, ne = 10, mbs = 8 Adadelta { lr = 0.1, ne = 10, mbs = 8 lr = 0.1, ne = 15, mbs = 16 lr = 0.05, ne = 10, mbs = 8 Page 8/17 ViT-B/32 ViT-L/16 ViT-L/32 { { { { 96.20 95.92 95.65 { { { { 97.25 96.20 97.25 { { { { 97.28 97.25 96.20 { { { { 96.51 96.60 97.60 { { { { 97.01 97.40 96.60 { { { { 98.55 97.60 98.01 96.20 96.41 97.06 97.83 97.55 96.47 97.25 97.01 97.55 97.31 96.60 97.63 97.30 97.54 96.90 97.10 97.80 98.10 97.28 97.01 96.47 95.92 96.74 96.74 96.01 96.01 96.20 97.55 97.21 96.74 97.11 96.65 97.01 98.04 97.83 96.84 96.10 96.47 95.92 96.82 96.40 96.50 97.28 97.25 97.55 97.40 96.95 97.60 96.82 97.40 97.70 97.90 97.50 98.64 The test accuracy values for both the input image resolutions of 224 × 224 and 384 × 384 for all ViT models are given in Table 3. ViT-B/16 performed well among all with an overall accuracy of 97.06% at 224 × 224. Similarly, at the resolution of 384 × 384, the ViT-L/32 emerged as the single best classifier with overall test accuracy of 98.21%. Table 3 Test accuracy values are given in percentages for ViT-B/16, ViT-B/32, ViT-L/16 and ViT-L/32 for both resolutions of 224 × 224 and 384 × 384. ViT: vision transformer, B: base, L: large. Resolution Test accuracy in percentage ViT-B/16 ViT-B/32 ViT-L/16 ViT-L/32 224 × 224 97.06 96.25 96.74 96.01 384 × 384 97.72 97.87 97.55 98.21 The performance of the average ensembling on the test set is given in Table 4. The ensembling of the models at 224 × 224 resolution resulted in an overall accuracy of 97.71% and the overall test accuracy of the ensemble model at 384 × 384 resolution is 98.7%. Table 4 Test accuracy are values given in percentages for ensemble classification at a) resolution of 224 × 224, b) resolution of 384 × 384. ViT: vision transformer. Ensembled Models Test accuracy in percentage All ViT models at 224 × 224 resolution 97.71 All ViT models at 384 × 384 resolution 98.70 The performance of ViT models on the test set in the form of confusion matrices is given in Figs. 4 and 5 for 224 × 224 and 384 × 384 resolutions respectively. The number of false predictions is higher for meningioma and glioma compared to the pituitary tumor. A similar trend was observed at the two resolutions. However, the number of false predictions is relatively lower at 384 × 384 resolution. Figure 6 shows the confusion matrices for the ensemble models performance at both resolutions on the test set. The number of false predictions for the ensemble model at 384 resolution was just eight and the ensemble model achieved 100% accuracy in the identification of glioma. Discussion In this study, the ability of pretrained and finetuned ViT models is investigated both individually and in an ensemble manner for 3-class classification of brain tumors including meningioma, glioma and pituitary tumor from T1w CE MRI. In general, all ViT models demonstrated the ability to classify with validation and test accuracies above 97% during most scenarios (refer to Tables 2 and 3). Based on the hyperparameter tuning using the validation set, the performance of all the models is good irrespective of Page 9/17 the choice of the model hyperparameters, such as optimizer, lr, ne, and mbs which indicates that the ViT models are robust across different hyperparameter settings; however, the Adadelta optimizer outperformed all other optimizers at 384 × 384 resolution. Nevertheless, to evaluate the performance of the models on the test set, the models that yielded the highest accuracy values on the validation set was considered which is the standard procedure. The individual model’s performance on both the validation and test sets is slightly better at the image resolution of 384 × 384 compared to 224 × 224, which could be because the general performance of the ViT models is better at higher resolutions, as indicated by the experiments from [28]. Similarly, the ensemble model’s performance at 384 × 384 was better than that of the ensemble model’s performance at 224 × 224 because average ensembling was used and the ensemble model’s performance depends on the individual model’s performance in the group. Comparing the performance of the ViT models with previous studies based on the same dataset given in Table 5, the ensemble of ViTs at 384 × 384 resolution performed better, with an overall test accuracy of 98.7%. Based on the confusion matrices on the test set from all the models at both input image resolutions (Figs. 4 and 5), meningioma has a higher number of misclassifications than glioma and pituitary tumors possibly because there could be feature overlapping between the image encodings of meningioma and glioma, as well as meningioma and pituitary tumor. Previous studies have documented a similar trend in misclassification in test set results [19, 22]. Our study outperformed all previous studies based on custom CNNs and transfer learning methods indicating that the pretrained and finetuned ViT models are superior to CNN based models. The only study that performed marginally better was from [19]; however, the number of false predictions in [19] was 9 whereas, in our study, the number of false predictions was 8 using ensemble model with 384 × 384 resolution, as shown in Fig. 6B. Table 5 Previous related work using figshare dataset and performance comparison in terms of overall test accuracy. ViT: vision transformer. Work Method Image resolution Training Test Accuracy J Cheng [42] GLCM-BoW 512 × 512 80% 91.28% MR Ismael [47] DWT-2D Gabor 512 × 512 70% 91.90% A Pashaei [48] CNN-ELM 512 × 512 70% 93.68% P Afshar [49] CapsuleNet 128 × 128 - 90.89% S Deepak [22] CNN-SVM-kNN 224 × 224 80% 97.80% O Polat [19] Transfer Learning 224 × 224 70% 99.02% B Ahmad [18] GAN-VAEs 512 × 512 60% 96.25% Our method Ensemble of ViTs 224 × 224 70% 97.71% 384 ×384 70% 98.70% Page 10/17 During training, all the model parameters starting from the patch embeddings layer were allowed to be finetuned since based on a few experiments conducted by freezing the initial layers including some transformer encoder block layers of the ViT models, the validation and test accuracies are around a couple of percentage points lower than the accuracy values obtained by unfreezing parameters of all layers. Even though the model’s performance improved at 384 × 384 resolution, training at this resolution was computationally demanding and hence implemented in a TPU environment. Further, the performance of the ViTs at the original input image resolution of 512 × 512 may become better and this hypothesis could be investigated in a high-level computing environment. Furthermore, the cross-validated models from this study can be finetuned to deal with other brain tumor datasets. In addition, in a future study, it could be interesting to investigate the ability of other vision transformer variants such as swin vision transformers [38], data-efficient vision transformers [37], and transformer in transformer models [50] for the brain tumor classification from MRI. A python notebook with specific code and cross-validated ViT models pertaining to this study can be provided upon reasonable request. Conclusions The performance of the ensemble model at 384 × 384 resolution is on par and better than previous CNN models for the classification of brain tumors from MRI achieving an overall test accuracy of 98.7%. Using the same ensemble model, the test classification accuracy for gliomas is 100%. Therefore, computeraided diagnosis of brain tumors from T1w CE MRI using the ensemble of finetuned ViT models can be an alternative to manual diagnosis by a clinical radiologist. Declarations Competing Interests: The authors have no competing interests to declare. Funding: No funding was received for conducting this study. Compliance with Ethical Standards This research study was conducted retrospectively using human subject data made available in open access by Figshare. Ethical approval was not required as confirmed by the license attached with the open access data. References 1. S. Rasheed, K. Rehman, M.S.H. Akash, An insight into the risk factors of brain tumors and their therapeutic interventions, Biomed. Pharmacother. 143 (2021). Page 11/17 https://doi.org/10.1016/J.BIOPHA.2021.112119. 2. I. Sánchez Fernández, T. Loddenkemper, Seizures caused by brain tumors in children, Seizure. 44 (2017) 98–107. https://doi.org/10.1016/J.SEIZURE.2016.11.028. 3. M. Chintagumpala, A. Gajjar, Brain tumors, Pediatr. Clin. North Am. 62 (2015) 167–178. https://doi.org/10.1016/J.PCL.2014.09.011. 4. K. Herholz, K.J. Langen, C. Schiepers, J.M. Mountz, Brain tumors, Semin. Nucl. Med. 42 (2012) 356– 370. https://doi.org/10.1053/J.SEMNUCLMED.2012.06.001. 5. A. Boire, P.K. Brastianos, L. Garzia, M. Valiente, Brain metastasis, Nat. Rev. Cancer. 20 (2020) 4–11. https://doi.org/10.1038/S41568-019-0220-Y. 6. G. Kontogeorgos, Classification and pathology of pituitary tumors, Endocrine. 28 (2005) 27–35. https://doi.org/10.1385/ENDO:28:1:027. 7. M. Viallon, V. Cuvinciuc, B. Delattre, L. Merlini, I. Barnaure-Nachbar, S. Toso-Patel, M. Becker, K.O. Lovblad, S. Haller, State-of-the-art MRI techniques in neuroradiology: principles, pitfalls, and clinical applications, Neuroradiology. 57 (2015) 441–467. https://doi.org/10.1007/S00234-015-1500-1. 8. J.E. Villanueva-Meyer, M.C. Mabray, S. Cha, Current Clinical Brain Tumor Imaging, Neurosurgery. 81 (2017) 397–415. https://doi.org/10.1093/NEUROS/NYX103. 9. K. Kavin Kumar, T. Meera Devi, S. Maheswaran, An Efficient Method for Brain Tumor Detection Using Texture Features and SVM Classifier in MR Images, Asian Pac. J. Cancer Prev. 19 (2018) 2789–2794. https://doi.org/10.22034/APJCP.2018.19.10.2789. 10. J. Kang, Z. Ullah, J. Gwak, MRI-Based Brain Tumor Classification Using Ensemble of Deep Features and Machine Learning Classifiers, Sensors (Basel). 21 (2021) 1–21. https://doi.org/10.3390/S21062222. 11. E.I. Zacharaki, S. Wang, S. Chawla, D.S. Yoo, R. Wolf, E.R. Melhem, C. Davatzikos, Classification of brain tumor type and grade using MRI texture and shape in a machine learning scheme, Magn. Reson. Med. 62 (2009) 1609. https://doi.org/10.1002/MRM.22147. 12. S. Shrot, M. Salhov, N. Dvorski, E. Konen, A. Averbuch, C. Hoffmann, Application of MR morphologic, diffusion tensor, and perfusion imaging in the classification of brain tumors using machine learning scheme, Neuroradiology. 61 (2019) 757–765. https://doi.org/10.1007/S00234-019-02195-Z. 13. S. Deepak, P.M. Ameer, Retrieval of brain MRI with tumor using contrastive loss based similarity on GoogLeNet encodings, Comput. Biol. Med. 125 (2020) 103993. https://doi.org/10.1016/J.COMPBIOMED.2020.103993. 14. Z.N.K. Swati, Q. Zhao, M. Kabir, F. Ali, Z. Ali, S. Ahmed, J. Lu, Brain tumor classification for MR images using transfer learning and fine-tuning, Comput. Med. Imaging Graph. 75 (2019) 34–46. https://doi.org/10.1016/J.COMPMEDIMAG.2019.05.001. 15. Y. Zhuge, H. Ning, P. Mathen, J.Y. Cheng, A. V. Krauze, K. Camphausen, R.W. Miller, Automated glioma grading on conventional MRI images using deep convolutional neural networks, Med. Phys. 47 (2020) 3044–3053. https://doi.org/10.1002/MP.14168. Page 12/17 16. R. Pomponio, G. Erus, M. Habes, J. Doshi, D. Srinivasan, E. Mamourian, V. Bashyam, I.M. Nasrallah, T.D. Satterthwaite, Y. Fan, L.J. Launer, C.L. Masters, P. Maruff, C. Zhuo, S.C. Johnson, J. Fripp, N. Koutsouleris, D.H. Wolf, R. Gur, R. Gur, J. Morris, M.S. Albert, H.J. Grabe, S.M. Resnick, R. Nick Bryan, D.A. Wolk, R.T. Shinohara, H. Shou, C. Davatzikos, Harmonization of large MRI datasets for the analysis of brain imaging patterns throughout the lifespan, (2019). https://doi.org/10.1016/j.neuroimage.2019.116450. 17. M.A. Naser, M.J. Deen, Brain tumor segmentation and grading of lower-grade glioma using deep learning in MRI images, Comput. Biol. Med. 121 (2020). https://doi.org/10.1016/J.COMPBIOMED.2020.103758. 18. B. Ahmad, J. Sun, Q. You, V. Palade, Z. Mao, Brain Tumor Classification Using a Combination of Variational Autoencoders and Generative Adversarial Networks, Biomedicines. 10 (2022). https://doi.org/10.3390/BIOMEDICINES10020223. 19. Ö. Polat, C. Güngen, Classification of brain tumors from MR images using deep transfer learning, J. Supercomput. 2021 777. 77 (2021) 7236–7252. https://doi.org/10.1007/S11227-020-03572-9. 20. H.A. Khan, W. Jue, M. Mushtaq, M.U. Mushtaq, H.A. Khan, W. Jue, M. Mushtaq, M.U. Mushtaq, Brain tumor classification in MRI image using convolutional neural network, Math. Biosci. Eng. 2020 56203. 17 (2020) 6203–6216. https://doi.org/10.3934/MBE.2020328. 21. M.M. Badža, M.C. Barjaktarović, Classification of Brain Tumors from MRI Images Using a Convolutional Neural Network, Appl. Sci. 2020, Vol. 10, Page 1999. 10 (2020) 1999. https://doi.org/10.3390/APP10061999. 22. S. Deepak, P.M. Ameer, Brain tumor classification using deep CNN features via transfer learning, Comput. Biol. Med. 111 (2019) 103345. https://doi.org/10.1016/J.COMPBIOMED.2019.103345. 23. E.U. Haq, H. Jianjun, K. Li, H.U. Haq, T. Zhang, An MRI-based deep learning approach for efficient classification of brain tumors, J. Ambient Intell. Humaniz. Comput. 2021. (2021) 1–22. https://doi.org/10.1007/S12652-021-03535-9. 24. A. Sekhar, S. Biswas, R. Hazra, A.K. Sunaniya, A. Mukherjee, L. Yang, Brain tumor classification using fine-tuned GoogLeNet features and machine learning algorithms: IoMT enabled CAD system, IEEE J. Biomed. Heal. Informatics. PP (2021). https://doi.org/10.1109/JBHI.2021.3100758. 25. N.S. Shaik, T.K. Cherukuri, Multi-level attention network: application to brain tumor classification, Signal, Image Video Process. 2021. (2021) 1–8. https://doi.org/10.1007/S11760-021-02022-0. 26. M.F. Alanazi, M.U. Ali, S.J. Hussain, A. Zafar, M. Mohatram, M. Irfan, R. Alruwaili, M. Alruwaili, N.H. Ali, A.M. Albarrak, Brain Tumor/Mass Classification Framework Using Magnetic-Resonance-ImagingBased Isolated and Developed Transfer Deep-Learning Model, Sensors (Basel). 22 (2022). https://doi.org/10.3390/S22010372. 27. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention Is All You Need, Adv. Neural Inf. Process. Syst. 2017-December (2017) 5999–6009. https://doi.org/10.48550/arxiv.1706.03762. Page 13/17 28. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, (2020). https://doi.org/10.48550/arxiv.2010.11929. 29. Y. Wu, S. Qi, Y. Sun, S. Xia, Y. Yao, W. Qian, A vision transformer for emphysema classification using CT images, Phys. Med. Biol. 66 (2021). https://doi.org/10.1088/1361-6560/AC3DC8. 30. B. Gheflati, H. Rivaz, Vision Transformer for Classification of Breast Ultrasound Images, (2021). https://doi.org/10.48550/arxiv.2110.14731. 31. F. Shamshad, S. Khan, S.W. Zamir, M.H. Khan, M. Hayat, F.S. Khan, H. Fu, Transformers in Medical Imaging: A Survey, (2022). https://doi.org/10.48550/arxiv.2201.09873. 32. J. Wang, Z. Fang, N. Lang, H. Yuan, M.Y. Su, P. Baldi, A multi-resolution approach for spinal metastasis detection using deep Siamese neural networks, Comput. Biol. Med. 84 (2017) 137–146. https://doi.org/10.1016/J.COMPBIOMED.2017.03.024. 33. Y. Dai, Y. Gao, F. Liu, TransMed: Transformers Advance Multi-modal Medical Image Classification, Diagnostics. 11 (2021). https://doi.org/10.48550/arxiv.2103.05940. 34. A. Steiner, A. Kolesnikov, X. Zhai, R. Wightman, J. Uszkoreit, L. Beyer, How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers, (n.d.). https://github.com/rwightman/pytorch-image-models. (accessed March 10, 2022). 35. B. Gheflati, H. Rivaz, VISION TRANSFORMERS FOR CLASSIFICATION OF BREAST ULTRASOUND IMAGES, (n.d.). 36. A.K. Mondal, A. Bhattacharjee, P. Singla, A.P. Prathosh, xViTCOS: Explainable Vision Transformer Based COVID-19 Screening Using Radiography, IEEE J. Transl. Eng. Heal. Med. 10 (2022). https://doi.org/10.1109/JTEHM.2021.3134096. 37. H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, H. Jégou, F. Ai, Training data-efficient image transformers & distillation through attention, (2020). https://doi.org/10.48550/arxiv.2012.12877. 38. Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin Transformer: Hierarchical Vision Transformer using Shifted Windows, (2021). https://doi.org/10.48550/arxiv.2103.14030. 39. H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, M. Wang, Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation, (2021). http://arxiv.org/abs/2105.05537 (accessed March 10, 2022). 40. S. Li, X. Sui, X. Luo, X. Xu, Y. Liu, R. Goh, Medical Image Segmentation Using Squeeze-and-Expansion Transformers, (2021). https://github.com/askerlee/segtran. (accessed March 10, 2022). 41. K. Islam, Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work, (2022). https://doi.org/10.48550/arxiv.2203.01536. 42. J. Cheng, W. Huang, S. Cao, R. Yang, W. Yang, Z. Yun, Z. Wang, Q. Feng, Enhanced Performance of Brain Tumor Classification via Tumor Region Augmentation and Partition, PLoS One. 10 (2015). https://doi.org/10.1371/JOURNAL.PONE.0140381. Page 14/17 43. J. Cheng, W. Yang, M. Huang, W. Huang, J. Jiang, Y. Zhou, R. Yang, J. Zhao, Y. Feng, Q. Feng, W. Chen, Retrieval of Brain Tumors by Adaptive Spatial Pooling and Fisher Vector Representation, PLoS One. 11 (2016). https://doi.org/10.1371/JOURNAL.PONE.0157112. 44. C. Marosi, M. Hassler, K. Roessler, M. Reni, M. Sant, E. Mazza, C. Vecht, Meningioma, Crit. Rev. Oncol. Hematol. 67 (2008) 153–171. https://doi.org/10.1016/J.CRITREVONC.2008.01.010. 45. Q.T. Ostrom, H. Gittleman, L. Stetson, S.M. Virk, J.S. Barnholtz-Sloan, Epidemiology of gliomas, Cancer Treat. Res. 163 (2015) 1–14. https://doi.org/10.1007/978-3-319-12048-5_1. 46. J. Devlin, M.W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL HLT 2019–2019 Conf. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang. Technol. - Proc. Conf. 1 (2018) 4171–4186. https://doi.org/10.48550/arxiv.1810.04805. 47. M.R. Ismael, I. Abdel-Qader, Brain Tumor Classification via Statistical Features and Back-Propagation Neural Network, IEEE Int. Conf. Electro Inf. Technol. 2018-May (2018) 252–257. https://doi.org/10.1109/EIT.2018.8500308. 48. A. Pashaei, H. Sajedi, N. Jazayeri, Brain tumor classification via convolutional neural network and extreme learning machines, 2018 8th Int. Conf. Comput. Knowl. Eng. ICCKE 2018. (2018) 314–319. https://doi.org/10.1109/ICCKE.2018.8566571. 49. P. Afshar, K.N. Plataniotis, A. Mohammadi, Capsule Networks for Brain Tumor Classification Based on MRI Images and Coarse Tumor Boundaries, ICASSP, IEEE Int. Conf. Acoust. Speech Signal Process. - Proc. 2019-May (2019) 1368–1372. https://doi.org/10.1109/ICASSP.2019.8683759. 50. K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, Y. Wang, Transformer in Transformer, (2021). https://doi.org/10.48550/arxiv.2103.00112. Figures Figure 1 MRI images from the figshare dataset are shown in sagittal, coronal and axial cut planes for meningioma, glioma and pituitary tumors. Page 15/17 Figure 2 Vision transformer model adopted for classification of brain tumors from MRI. MLP: multilayer perceptron. *is extra learnable patch embedding to be used by the final classification head. Figure 3 The vision transformer encoder with multi-head self-attention. LN: layer normalization, MLP: multilayer perceptron, Lx: transformer encoder ‘x’ at layer L. Figure 4 Confusion matrix for classification of three types of tumors on the test set using ViT models A) B/16, B) B/32, C) L/16, and D) L/32 at the image resolution of 224 × 224. Page 16/17 Figure 5 Confusion matrix for classification of three types of tumors on the test set using ViT models A) B/16, B) B/32, C) L/16, and D) L/32 at the image resolution of 384 × 384. Figure 6 Confusion matrix for classification of three types of tumors on the test set using an ensemble of ViT models B/16, B/32, L/16, and L/32 at a) 224 × 224 resolution, b) 384 × 384 resolution. Page 17/17