The architecture of a vision transformer. An input image is divided into patches, each of which is linearly mapped through a patch embedding layer before entering a standard Transformer encoder.

A vision transformer (ViT) is a transformer designed for computer vision.[1] A ViT decomposes an input image into a series of patches (rather than text into tokens), serializes each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.

ViTs were designed as alternatives to convolutional neural networks (CNNs) in computer vision applications. They have different inductive biases, training stability, and data efficiency.[2] Compared to CNNs, ViTs are less data efficient, but have higher capacity. Some of the largest modern computer vision models are ViTs, such as one with 22B parameters.[3][4]

Following its publication, many variants were proposed, including hybrid architectures combining features of ViTs and CNNs. ViTs have found application in image recognition, image segmentation, weather prediction, and autonomous driving.[5][6]

History

Transformers were introduced in Attention Is All You Need (2017),[7] and have found widespread use in natural language processing. A 2019 paper[8] applied ideas from the Transformer to computer vision. Specifically, the authors started with a ResNet, a standard convolutional neural network used for computer vision, and replaced all convolutional kernels with the self-attention mechanism found in a Transformer, resulting in superior performance. However, it is not a vision transformer.

In 2020, an encoder-only Transformer was adapted for computer vision, yielding the ViT, which reached state of the art in image classification, overcoming the previous dominance of CNNs.[1] The masked autoencoder (2022) extended ViT to work with unsupervised training. The vision transformer and the masked autoencoder, in turn, stimulated new developments in convolutional neural networks.[9][10]

Subsequently, there was cross-fertilization between the previous CNN approach and the ViT approach.

In 2021, several important variants of the vision transformer were proposed. These variants are mainly intended to be more efficient, more accurate, or better suited to a specific domain. Two studies[11][12] improved the efficiency and robustness of ViT by adding a CNN as a preprocessor. The Swin Transformer[13] achieved state-of-the-art results on some object detection datasets such as COCO by using a convolution-like sliding-window attention mechanism and a pyramidal feature hierarchy, as in classical computer vision.

Overview

Vision Transformer architecture, showing the encoder-only Transformer blocks inside

The basic architecture, used by the original 2020 paper,[1] is as follows. In summary, it is a BERT-like encoder-only Transformer.

The input image is of type $\mathbb{R}^{H \times W \times C}$, where $H, W, C$ are the height, width, and number of channels (e.g. 3 for RGB). It is then split into square-shaped patches, each of type $\mathbb{R}^{P \times P \times C}$.

Each patch is pushed through a linear operator to obtain a vector (the "patch embedding"). The position of the patch is also transformed into a vector by "position encoding". The two vectors are added, then pushed through several Transformer encoders.
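As a concrete illustration, the following is a minimal sketch of this patch-embedding step in PyTorch; it is not the reference implementation, and the image size, patch size, and embedding width are typical assumed values.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches, linearly project each, add position embeddings."""
    def __init__(self, img_size=224, patch_size=16, channels=3, dim=768):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Linear(patch_size * patch_size * channels, dim)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, images):                                # images: (B, C, H, W)
        B, C, H, W = images.shape
        P = self.patch_size
        patches = images.unfold(2, P, P).unfold(3, P, P)      # (B, C, H/P, W/P, P, P)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)
        return self.proj(patches) + self.pos_embed            # (B, num_patches, dim)

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))        # -> (2, 196, 768)
```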

The attention mechanism in a ViT repeatedly transforms representation vectors of image patches, incorporating more and more semantic relations between image patches in an image. This is analogous to how in natural language processing, as representation vectors flow through a transformer, they incorporate more and more semantic relations between words, from syntax to semantics.

The above architecture turns an image into a sequence of vector representations. To use these for downstream applications, an additional head needs to be trained to interpret them.

For example, to use it for classification, one can add a shallow MLP on top of it that outputs a probability distribution over classes. The original paper uses a linear-GeLU-linear-softmax network.[1]
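As an illustration, a minimal sketch of such a head in PyTorch; the hidden width and class count are assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

# Shallow linear-GELU-linear head followed by softmax, applied to one pooled token.
head = nn.Sequential(
    nn.Linear(768, 3072),    # hidden width: an assumed value
    nn.GELU(),
    nn.Linear(3072, 1000),   # e.g. 1000 ImageNet classes
)
cls_output = torch.randn(2, 768)             # pooled per-image representation
probs = head(cls_output).softmax(dim=-1)     # (2, 1000) class probabilities
```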

Variants

Original ViT

The original ViT was an encoder-only Transformer trained with supervision to predict the image label from the patches of the image. As in BERT, it uses a special token <CLS> on the input side, and the corresponding output vector is used as the only input to the final MLP head. The special token is an architectural hack that allows the model to compress all information relevant for predicting the image label into one vector.

Animation of ViT. The 0th token is the special <CLS>. The other 9 patches are projected by a linear layer before being fed into the Transformer encoder as input tokens 1 to 9.

Transformers found their initial applications in natural language processing tasks, as demonstrated by language models such as BERT and GPT-3. By contrast, the typical image processing system uses a convolutional neural network (CNN). Well-known projects include Xception, ResNet, EfficientNet,[14] DenseNet,[15] and Inception.[16]

Transformers measure the relationships between pairs of input tokens (words in the case of text strings), termed attention. The cost is quadratic in the number of tokens. For images, the basic unit of analysis is the pixel. However, computing relationships for every pixel pair in a typical image is prohibitive in terms of memory and computation. Instead, ViT computes relationships among pixels in various small sections of the image (e.g., 16x16 pixels), at a drastically reduced cost. The sections (with positional embeddings) are placed in a sequence. The embeddings are learnable vectors. Each section is flattened into a vector and multiplied by the embedding matrix. The result, together with its position embedding, is fed to the transformer.[16]

Architectural improvements

Pooling

After the ViT processes an image, it produces some embedding vectors. These must be converted to a single class probability prediction by some kind of network. In the original ViT and the masked autoencoder, a dummy [CLS] token is used, in emulation of the BERT language model. The output at [CLS] is the classification token, which is then processed by a LayerNorm-feedforward-softmax module into a probability distribution.

Global average pooling (GAP) does not use the dummy token, but simply takes the average of all output tokens as the classification token. It was mentioned in the original ViT as being equally good.[1]

Multihead attention pooling (MAP) applies a multiheaded attention block to pooling. Specifically, it takes as input a list of vectors $v_1, v_2, \ldots, v_n$, which might be thought of as the output vectors of a layer of a ViT. The output from MAP is $\mathrm{MultiheadedAttention}(Q, V, V)$, where $Q$ is a trainable query vector, and $V$ is the matrix with rows $v_1, v_2, \ldots, v_n$.[17] This was first proposed in the Set Transformer architecture.[18]
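A minimal PyTorch sketch of MAP, assuming torch.nn.MultiheadAttention; the embedding width and number of heads are illustrative.

```python
import torch
import torch.nn as nn

class MAPHead(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.q = nn.Parameter(torch.randn(1, 1, dim))            # trainable query
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):                                   # tokens: (B, N, dim)
        q = self.q.expand(tokens.size(0), -1, -1)                # one query per image
        pooled, _ = self.attn(q, tokens, tokens)                 # attend over all output tokens
        return pooled.squeeze(1)                                 # (B, dim) pooled representation

pooled = MAPHead()(torch.randn(2, 196, 768))                     # -> (2, 768)
```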

Later papers demonstrated that GAP and MAP both perform better than BERT-like pooling.[17][19] A variant of MAP was proposed as class attention, which applies MAP, then feedforward, then MAP again.[20]

Re-attention was proposed to allow training deep ViT. It changes the multiheaded attention module.[21]

Masked Autoencoder

Masked Autoencoder architecture

The Masked Autoencoder[22] took inspiration from denoising autoencoders and context encoders.[23] It has two ViTs put end-to-end. The first one ("encoder") takes in image patches with positional encoding, and outputs vectors representing each patch. The second one (called "decoder", even though it is still an encoder-only Transformer) takes in vectors with positional encoding and outputs image patches again. During training, both the encoder and the decoder ViTs are used. During inference, only the encoder ViT is used.

During training, each image is cut into patches, and their positional embeddings are added. Of these, only 25% of the patches are kept. The encoder ViT processes the kept patches; no mask tokens are used. Then, mask tokens are added back in, and positional embeddings are added again. These are processed by the decoder ViT, which outputs a reconstruction of the full image. The loss is the total mean-squared loss in pixel space over all masked patches (reconstruction loss is not computed for non-masked patches).
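The training step can be sketched as follows; this is a simplified illustration, with `encoder`, `decoder`, and `mask_token` as placeholders (the mask token is a learned vector), positional embeddings omitted, and the assumption that the decoder outputs one pixel vector per patch.

```python
import torch

def mae_step(patch_tokens, patch_pixels, encoder, decoder, mask_token, keep_ratio=0.25):
    # patch_tokens: (B, N, D) embedded patches; patch_pixels: (B, N, P*P*C) pixel targets
    B, N, D = patch_tokens.shape
    n_keep = int(N * keep_ratio)
    perm = torch.rand(B, N).argsort(dim=1)                     # random order per image
    keep, masked = perm[:, :n_keep], perm[:, n_keep:]

    visible = torch.gather(patch_tokens, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    encoded = encoder(visible)                                 # encoder sees no mask tokens

    full = mask_token.expand(B, N, D).clone()                  # start from all mask tokens
    full.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, D), encoded)
    recon = decoder(full)                                      # (B, N, P*P*C) reconstruction

    idx = masked.unsqueeze(-1).expand(-1, -1, recon.size(-1))
    return ((torch.gather(recon, 1, idx)
             - torch.gather(patch_pixels, 1, idx)) ** 2).mean()  # loss on masked patches only
```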

A similar architecture, BEiT (BERT pre-training of image Transformers), was published concurrently.[24]

DINO

Like the Masked Autoencoder, the DINO (self-distillation with no labels) method is a way to train a ViT by self-supervision.[25] DINO is a form of teacher-student self-distillation. In DINO, the student is the model itself, and the teacher is an exponential average of the student's past states. The method is similar to previous works like momentum contrast[26] and bootstrap your own latent (BYOL).[27]

The loss function used in DINO is the cross-entropy loss between the output of the teacher network ($f_{\theta'_t}$) and the output of the student network ($f_{\theta_t}$). The teacher network is an exponentially decaying average of the student network's past parameters: $\theta'_t = \alpha \theta_t + (1-\alpha)\theta'_{t-1}$. The inputs to the networks are two different crops of the same image, represented as $T(x)$ and $T'(x)$, where $x$ is the original image. The loss function is written as $L\big(f_{\theta'_t}(T(x)),\, f_{\theta_t}(T'(x))\big)$.

One issue is that the network can "collapse" by always outputting the same value, regardless of the input. To prevent this collapse, DINO employs two strategies:

  • Sharpening: The teacher network's output is sharpened using a softmax function with a lower temperature. This makes the teacher more "confident" in its predictions, forcing the student to learn more meaningful representations to match the teacher's sharpened output.
  • Centering: The teacher network's output is centered by averaging it with its previous outputs. This prevents the teacher from becoming biased towards any particular output value, encouraging the student to learn a more diverse set of features.
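A minimal sketch of the resulting update, with centering and sharpening; the temperatures, momentum values, and network interfaces are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def dino_loss(student, teacher, center, x1, x2, t_student=0.1, t_teacher=0.04, m=0.9):
    with torch.no_grad():
        t_out = teacher(x1)                                        # teacher sees one crop
        t_probs = F.softmax((t_out - center) / t_teacher, dim=-1)  # center, then sharpen
        new_center = m * center + (1 - m) * t_out.mean(dim=0)      # running mean of teacher outputs
    s_logp = F.log_softmax(student(x2) / t_student, dim=-1)        # student sees the other crop
    loss = -(t_probs * s_logp).sum(dim=-1).mean()                  # cross-entropy
    return loss, new_center

def ema_update(student, teacher, decay=0.996):
    # teacher weights are an exponential moving average of the student's past weights
    with torch.no_grad():
        for p_s, p_t in zip(student.parameters(), teacher.parameters()):
            p_t.mul_(decay).add_(p_s, alpha=1 - decay)
```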

In January 2024, Meta AI Research released an updated version called DINOv2,[28] with improvements in architecture, loss function, and optimization technique. It was trained on a larger and more diverse dataset. The features learned by DINOv2 are more transferable, giving better performance on downstream tasks.

Swin Transformer

The Swin Transformer ("Shifted windows")[13] took inspiration from standard CNNs:

  • Instead of performing self-attention over the entire sequence of tokens, one for each patch, it performs "shifted window based" self-attention, which means only performing attention over square-shaped blocks of patches. One block of patches is analogous to the receptive field of one convolution.
  • After every few attention blocks, there is a "merge layer", which merges neighboring 2x2 tokens into a single token. This is analogous to pooling (by 2x2 convolution kernels, with stride 2). Merging means concatenation followed by multiplication with a matrix.
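A minimal PyTorch sketch of the 2x2 merge layer; doubling the channel width follows the usual Swin convention, but the sizes are otherwise illustrative.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, dim=96):
        super().__init__()
        self.reduction = nn.Linear(4 * dim, 2 * dim)     # concatenate 4 tokens, project to 2x width

    def forward(self, tokens, H, W):                     # tokens: (B, H*W, dim)
        B, _, D = tokens.shape
        grid = tokens.view(B, H, W, D)
        merged = torch.cat([grid[:, 0::2, 0::2], grid[:, 1::2, 0::2],
                            grid[:, 0::2, 1::2], grid[:, 1::2, 1::2]], dim=-1)
        return self.reduction(merged.view(B, (H // 2) * (W // 2), 4 * D))

out = PatchMerging()(torch.randn(1, 56 * 56, 96), 56, 56)   # -> (1, 784, 192)
```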

Swin Transformer V2[29] improves on this design with a different attention mechanism[13] (Figure 1):

  • LayerNorm immediately after each attention and feedforward layer ("res-post-norm");
  • scaled cosine attention to replace the original dot product attention;
  • log-spaced continuous relative position bias, which allows transfer learning across different window resolutions.
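A minimal sketch of scaled cosine attention as described above; this is an illustrative reading of the second point, with the relative position bias omitted.

```python
import torch
import torch.nn.functional as F

def scaled_cosine_attention(q, k, v, tau):
    # q, k, v: (B, heads, N, d); tau: learnable per-head temperature, kept above 0.01
    sim = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
    attn = torch.softmax(sim / tau.clamp(min=0.01), dim=-1)
    return attn @ v

q = k = v = torch.randn(1, 4, 49, 32)
out = scaled_cosine_attention(q, k, v, tau=torch.full((1, 4, 1, 1), 0.1))   # (1, 4, 49, 32)
```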

TimeSformer

The TimeSformer[30] was designed for video understanding tasks, and it applied a factorized self-attention, similar to the factorized convolution kernels found in the Inception CNN architecture.[31] Schematically, it divides a video into frames, and each frame into a square grid of patches (same as ViT). Let each patch coordinate be denoted by $(x, y, t)$, denoting horizontal, vertical, and time.

  • A space attention layer is a self-attention layer where each query patch at $(x, y, t)$ attends only to the key and value patches at $(x', y', t')$ such that $t' = t$.
  • A time attention layer is one where the requirement is $x' = x, y' = y$ instead.

The TimeSformer also considered other attention layer designs, such as the "height attention layer" where the requirement is $x' = x, t' = t$. However, they found empirically that the best design interleaves one space attention layer and one time attention layer.
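A minimal PyTorch sketch of this factorized space-time attention; the two attention modules and the tensor sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def space_time_block(x, attn_space, attn_time):
    B, T, N, D = x.shape                            # T frames, N patches per frame
    s = x.reshape(B * T, N, D)                      # space attention: same t, all (x, y)
    s, _ = attn_space(s, s, s)
    t = s.reshape(B, T, N, D).transpose(1, 2).reshape(B * N, T, D)
    t, _ = attn_time(t, t, t)                       # time attention: same (x, y), all t
    return t.reshape(B, N, T, D).transpose(1, 2)    # back to (B, T, N, D)

attn_space = nn.MultiheadAttention(64, 4, batch_first=True)
attn_time = nn.MultiheadAttention(64, 4, batch_first=True)
out = space_time_block(torch.randn(2, 8, 196, 64), attn_space, attn_time)   # (2, 8, 196, 64)
```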

ViT-VQGAN

In ViT-VQGAN,[32] there are two ViT encoders and a discriminator. One encodes 8x8 patches of an image into a list of vectors, one for each patch. The vectors can only come from a discrete codebook, as in vector quantization. The other maps the quantized vectors back to image patches. The training objective tries to make the reconstructed image (the output image) faithful to the input image. The discriminator (usually a convolutional network, but other networks are allowed) tries to decide whether an image is an original real image or one reconstructed by the ViT.

The idea is essentially the same as vector quantized variational autoencoder (VQVAE) plus generative adversarial network (GAN).
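A minimal sketch of the vector-quantization step at the core of this idea; the straight-through gradient estimator and the GAN discriminator are omitted, and the sizes are illustrative.

```python
import torch

def quantize(z, codebook):
    # z: (B, N, D) encoder outputs; codebook: (K, D) discrete vocabulary of code vectors
    dist = (z.unsqueeze(-2) - codebook).pow(2).sum(-1)   # (B, N, K) squared distances
    ids = dist.argmin(dim=-1)                            # (B, N) symbol indices
    return codebook[ids], ids                            # quantized vectors and their codes

quantized, ids = quantize(torch.randn(2, 64, 256), torch.randn(512, 256))
```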

After such a ViT-VQGAN is trained, it can be used to code an arbitrary image into a list of symbols, and to decode an arbitrary list of symbols into an image. The list of symbols can be used to train a standard autoregressive transformer (like GPT) to generate images autoregressively. Further, one can take a list of caption-image pairs, convert the images into strings of symbols, and train a standard GPT-style transformer. Then, at test time, one can give an image caption and have it autoregressively generate the image. This is the structure of Google Parti.[33]

Others

Other examples include the visual transformer,[34] CoAtNet,[35] CvT,[36] the data-efficient ViT (DeiT),[37] etc.

In the Transformer in Transformer architecture, each layer applies a vision Transformer layer on each image patch embedding, adds the resulting tokens back to the embedding, then applies another vision Transformer layer.[38]

Comparison with CNNs

Typically, ViT uses patch sizes larger than standard CNN kernels (3x3 to 7x7). ViT is more sensitive to the choice of the optimizer, hyperparameters, and network depth. Preprocessing with a layer of smaller-size, overlapping (stride < size) convolutional filters helps with performance and stability.[12]
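A minimal sketch of such a convolutional stem; the channel widths and strides are assumptions for illustration, chosen so that the output matches a 16x16-patch ViT.

```python
import torch
import torch.nn as nn

# Four overlapping 3x3 convolutions with stride 2 replace the single 16x16 patch projection.
conv_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.GELU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.GELU(),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.GELU(),
    nn.Conv2d(256, 768, kernel_size=3, stride=2, padding=1),
)
x = conv_stem(torch.randn(1, 3, 224, 224))     # (1, 768, 14, 14)
tokens = x.flatten(2).transpose(1, 2)          # (1, 196, 768) transformer input
```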

This different behavior seems to derive from the different inductive biases they possess.

CNNs apply the same set of filters across the entire image. This makes them more data efficient and less sensitive to local perturbations.[2] ViTs apply self-attention, allowing them to easily capture long-range relationships between patches. ViTs require more data to train, but they can keep benefiting from additional training data, whereas a CNN may stop improving once the training dataset is large enough. ViTs also appear more robust to input image distortions such as adversarial patches or permutations.[39]

Applications

ViTs have been used in many computer vision tasks with excellent results, in some cases achieving state of the art, including image classification, object detection, video deepfake detection,[40] image segmentation,[41] anomaly detection, image synthesis, cluster analysis, and autonomous driving.[5][6]

ViTs have been used for image generation as backbones for GANs[42] and for diffusion models (diffusion transformer, or DiT).[43]

DINO[25] has been demonstrated to learn useful representations for clustering images and exploring morphological profiles on biological datasets, such as images generated with the Cell Painting assay.[44]

In 2024, a 113 billion-parameter ViT model was proposed (the largest ViT to date) for weather and climate prediction, and trained on the Frontier supercomputer with a throughput of 1.6 exaFLOPs.[45]

See also

References

  1. ^ a b c d e Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (2025-08-06). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv:2010.11929 [cs.CV].
  2. ^ a b Raghu, Maithra; Unterthiner, Thomas; Kornblith, Simon; Zhang, Chiyuan; Dosovitskiy, Alexey (2025-08-06). "Do Vision Transformers See Like Convolutional Neural Networks?". arXiv:2108.08810 [cs.CV].
  3. ^ Dehghani, Mostafa; Djolonga, Josip; Mustafa, Basil; Padlewski, Piotr; Heek, Jonathan; Gilmer, Justin; Steiner, Andreas; Caron, Mathilde; Geirhos, Robert (2025-08-06), Scaling Vision Transformers to 22 Billion Parameters, arXiv:2302.05442
  4. ^ "Scaling vision transformers to 22 billion parameters". research.google. Retrieved 2025-08-06.
  5. ^ a b Han, Kai; Wang, Yunhe; Chen, Hanting; Chen, Xinghao; Guo, Jianyuan; Liu, Zhenhua; Tang, Yehui; Xiao, An; Xu, Chunjing; Xu, Yixing; Yang, Zhaohui; Zhang, Yiman; Tao, Dacheng (2025-08-06). "A Survey on Vision Transformer". IEEE Transactions on Pattern Analysis and Machine Intelligence. 45 (1): 87–110. arXiv:2012.12556. doi:10.1109/TPAMI.2022.3152247. ISSN 0162-8828. PMID 35180075.
  6. ^ a b Khan, Salman; Naseer, Muzammal; Hayat, Munawar; Zamir, Syed Waqas; Khan, Fahad Shahbaz; Shah, Mubarak (2025-08-06). "Transformers in Vision: A Survey". ACM Comput. Surv. 54 (10s): 200:1–200:41. arXiv:2101.01169. doi:10.1145/3505244. ISSN 0360-0300.
  7. ^ Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, ?ukasz; Polosukhin, Illia (2017). "Attention is All you Need" (PDF). Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
  8. ^ Ramachandran, Prajit; Parmar, Niki; Vaswani, Ashish; Bello, Irwan; Levskaya, Anselm; Shlens, Jon (2019). "Stand-Alone Self-Attention in Vision Models". Advances in Neural Information Processing Systems. 32. Curran Associates, Inc. arXiv:1906.05909.
  9. ^ Liu, Zhuang; Mao, Hanzi; Wu, Chao-Yuan; Feichtenhofer, Christoph; Darrell, Trevor; Xie, Saining (2022). "A ConvNet for the 2020s": 11976–11986. arXiv:2201.03545. {{cite journal}}: Cite journal requires |journal= (help)
  10. ^ Woo, Sanghyun; Debnath, Shoubhik; Hu, Ronghang; Chen, Xinlei; Liu, Zhuang; Kweon, In So; Xie, Saining (2023). "ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders": 16133–16142. arXiv:2301.00808. {{cite journal}}: Cite journal requires |journal= (help)
  11. ^ Wu, Bichen; Xu, Chenfeng; Dai, Xiaoliang; Wan, Alvin; Zhang, Peizhao; Yan, Zhicheng; Masayoshi, Tomizuka; Gonzalez, Joseph; Keutzer, Kurt; Vajda, Peter (2020). "Visual Transformers: Token-based Image Representation and Processing for Computer Vision". arXiv:2006.03677 [cs.CV].
  12. ^ a b Xiao, Tete; Singh, Mannat; Mintun, Eric; Darrell, Trevor; Dollár, Piotr; Girshick, Ross (2025-08-06). "Early Convolutions Help Transformers See Better". arXiv:2106.14881 [cs.CV].
  13. ^ a b c Liu, Ze; Lin, Yutong; Cao, Yue; Hu, Han; Wei, Yixuan; Zhang, Zheng; Lin, Stephen; Guo, Baining (2025-08-06). "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows". arXiv:2103.14030 [cs.CV].
  14. ^ Tan, Mingxing; Le, Quoc (23 June 2021). "EfficientNetV2: Smaller Models and Faster Training" (PDF). Proceedings of the 38th International Conference on Machine Learning (PMLR). 139: 10096–10106. arXiv:2104.00298. Retrieved 31 October 2023.
  15. ^ Huang, Gao; Liu, Zhuang; van der Maaten, Laurens; Q. Weinberger, Kilian (28 Jan 2018). "Densely Connected Convolutional Networks". arXiv:1608.06993 [cs.CV].
  16. ^ a b Sarkar, Arjun (2025-08-06). "Are Transformers better than CNN's at Image Recognition?". Medium. Retrieved 2025-08-06.
  17. ^ a b Zhai, Xiaohua; Kolesnikov, Alexander; Houlsby, Neil; Beyer, Lucas (June 2022). "Scaling Vision Transformers". 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 1204–1213. arXiv:2106.04560. doi:10.1109/cvpr52688.2022.01179. ISBN 978-1-6654-6946-3.
  18. ^ Lee, Juho; Lee, Yoonho; Kim, Jungtaek; Kosiorek, Adam; Choi, Seungjin; Teh, Yee Whye (2025-08-06). "Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks". Proceedings of the 36th International Conference on Machine Learning. PMLR: 3744–3753. arXiv:1810.00825.
  19. ^ Karamcheti, Siddharth; Nair, Suraj; Chen, Annie S.; Kollar, Thomas; Finn, Chelsea; Sadigh, Dorsa; Liang, Percy (2025-08-06), Language-Driven Representation Learning for Robotics, arXiv:2302.12766
  20. ^ Touvron, Hugo; Cord, Matthieu; Sablayrolles, Alexandre; Synnaeve, Gabriel; Jégou, Hervé (2021). "Going Deeper With Image Transformers": 32–42. arXiv:2103.17239. {{cite journal}}: Cite journal requires |journal= (help)
  21. ^ Zhou, Daquan; Kang, Bingyi; Jin, Xiaojie; Yang, Linjie; Lian, Xiaochen; Jiang, Zihang; Hou, Qibin; Feng, Jiashi (2025-08-06), DeepViT: Towards Deeper Vision Transformer, arXiv:2103.11886
  22. ^ He, Kaiming; Chen, Xinlei; Xie, Saining; Li, Yanghao; Dollár, Piotr; Girshick, Ross (2021). "Masked Autoencoders Are Scalable Vision Learners". arXiv:2111.06377 [cs.CV].
  23. ^ Pathak, Deepak; Krahenbuhl, Philipp; Donahue, Jeff; Darrell, Trevor; Efros, Alexei A. (June 2016). "Context Encoders: Feature Learning by Inpainting". 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 2536–2544. arXiv:1604.07379. doi:10.1109/CVPR.2016.278. ISBN 978-1-4673-8851-1.
  24. ^ Bao, Hangbo; Dong, Li; Piao, Songhao; Wei, Furu (2025-08-06). "BEiT: BERT Pre-Training of Image Transformers". International Conference on Learning Representations. arXiv:2106.08254.
  25. ^ a b Caron, Mathilde; Touvron, Hugo; Misra, Ishan; Jegou, Herve; Mairal, Julien; Bojanowski, Piotr; Joulin, Armand (October 2021). "Emerging Properties in Self-Supervised Vision Transformers". 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE. pp. 9630–9640. arXiv:2104.14294. doi:10.1109/iccv48922.2021.00951. ISBN 978-1-6654-2812-5.
  26. ^ He, Kaiming; Fan, Haoqi; Wu, Yuxin; Xie, Saining; Girshick, Ross (2020). "Momentum Contrast for Unsupervised Visual Representation Learning": 9729–9738. arXiv:1911.05722. {{cite journal}}: Cite journal requires |journal= (help)
  27. ^ Grill, Jean-Bastien; Strub, Florian; Altché, Florent; Tallec, Corentin; Richemond, Pierre; Buchatskaya, Elena; Doersch, Carl; Avila Pires, Bernardo; Guo, Zhaohan; Gheshlaghi Azar, Mohammad; Piot, Bilal; kavukcuoglu, koray; Munos, Remi; Valko, Michal (2020). "Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning". Advances in Neural Information Processing Systems. 33. Curran Associates, Inc.: 21271–21284.
  28. ^ Oquab, Maxime; Darcet, Timothée; Moutakanni, Théo; Vo, Huy; Szafraniec, Marc; Khalidov, Vasil; Fernandez, Pierre; Haziza, Daniel; Massa, Francisco (2025-08-06). "DINOv2: Learning Robust Visual Features without Supervision". arXiv:2304.07193 [cs.CV].
  29. ^ Liu, Ze; Hu, Han; Lin, Yutong; Yao, Zhuliang; Xie, Zhenda; Wei, Yixuan; Ning, Jia; Cao, Yue; Zhang, Zheng; Dong, Li; Wei, Furu; Guo, Baining (2022). "Swin Transformer V2: Scaling Up Capacity and Resolution". Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12009–12019.
  30. ^ Bertasius, Gedas; Wang, Heng; Torresani, Lorenzo (2025-08-06). "Is Space-Time Attention All You Need for Video Understanding?". arXiv:2102.05095 [cs.CV].
  31. ^ Szegedy, Christian; Vanhoucke, Vincent; Ioffe, Sergey; Shlens, Jon; Wojna, Zbigniew (2016). "Rethinking the Inception Architecture for Computer Vision": 2818–2826. arXiv:1512.00567. {{cite journal}}: Cite journal requires |journal= (help)
  32. ^ Yu, Jiahui; Li, Xin; Koh, Jing Yu; Zhang, Han; Pang, Ruoming; Qin, James; Ku, Alexander; Xu, Yuanzhong; Baldridge, Jason; Wu, Yonghui (2021). "Vector-quantized Image Modeling with Improved VQGAN". arXiv:2110.04627 [cs.CV].
  33. ^ "Parti: Pathways Autoregressive Text-to-Image Model". sites.research.google. Retrieved 2025-08-06.
  34. ^ Wu, Bichen; Xu, Chenfeng; Dai, Xiaoliang; Wan, Alvin; Zhang, Peizhao; Yan, Zhicheng; Tomizuka, Masayoshi; Gonzalez, Joseph; Keutzer, Kurt (2025-08-06), Visual Transformers: Token-based Image Representation and Processing for Computer Vision, arXiv:2006.03677
  35. ^ Dai, Zihang; Liu, Hanxiao; Le, Quoc V.; Tan, Mingxing (2025-08-06). "CoAtNet: Marrying Convolution and Attention for All Data Sizes". arXiv:2106.04803 [cs.CV].
  36. ^ Wu, Haiping; Xiao, Bin; Codella, Noel; Liu, Mengchen; Dai, Xiyang; Yuan, Lu; Zhang, Lei (2025-08-06). "CvT: Introducing Convolutions to Vision Transformers". arXiv:2103.15808 [cs.CV].
  37. ^ Touvron, Hugo; Cord, Matthieu; Jégou, Hervé (2022). "DeiT III: Revenge of the ViT". In Avidan, Shai; Brostow, Gabriel; Cissé, Moustapha; Farinella, Giovanni Maria; Hassner, Tal (eds.). Computer Vision – ECCV 2022. Lecture Notes in Computer Science. Vol. 13684. Cham: Springer Nature Switzerland. pp. 516–533. doi:10.1007/978-3-031-20053-3_30. ISBN 978-3-031-20053-3.
  38. ^ Han, Kai; Xiao, An; Wu, Enhua; Guo, Jianyuan; XU, Chunjing; Wang, Yunhe (2021). "Transformer in Transformer". Advances in Neural Information Processing Systems. 34. Curran Associates, Inc.: 15908–15919.
  39. ^ Naseer, Muzammal; Ranasinghe, Kanchana; Khan, Salman; Hayat, Munawar; Khan, Fahad Shahbaz; Yang, Ming-Hsuan (2025-08-06). "Intriguing Properties of Vision Transformers". arXiv:2105.10497 [cs.CV].
  40. ^ Coccomini, Davide; Messina, Nicola; Gennaro, Claudio; Falchi, Fabrizio (2022). "Combining Efficient Net and Vision Transformers for Video Deepfake Detection". Image Analysis and Processing – ICIAP 2022. Lecture Notes in Computer Science. Vol. 13233. pp. 219–229. arXiv:2107.02612. doi:10.1007/978-3-031-06433-3_19. ISBN 978-3-031-06432-6. S2CID 235742764.
  41. ^ Kirillov, Alexander; Mintun, Eric; Ravi, Nikhila; Mao, Hanzi; Rolland, Chloe; Gustafson, Laura; Xiao, Tete; Whitehead, Spencer; Berg, Alexander C.; Lo, Wan-Yen; Dollar, Piotr; Girshick, Ross (2023). "Segment Anything": 4015–4026. {{cite journal}}: Cite journal requires |journal= (help)
  42. ^ Jiang, Yifan; Chang, Shiyu; Wang, Zhangyang (2021). "TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up". Advances in Neural Information Processing Systems. 34. Curran Associates, Inc.: 14745–14758. arXiv:2102.07074.
  43. ^ Peebles, William; Xie, Saining (March 2023). "Scalable Diffusion Models with Transformers". arXiv:2212.09748v2 [cs.CV].
  44. ^ Doron, Michael; Moutakanni, Théo; Chen, Zitong S.; Moshkov, Nikita; Caron, Mathilde; Touvron, Hugo; Bojanowski, Piotr; Pernice, Wolfgang M.; Caicedo, Juan C. (2025-08-06). "Unbiased single-cell morphology with self-supervised vision transformers". BioRxiv: The Preprint Server for Biology: 2023.06.16.545359. doi:10.1101/2023.06.16.545359. PMC 10312751. PMID 37398158. Retrieved 2025-08-06.
  45. ^ Wang, Xiao; Liu, Siyan; Tsaris, Aristeidis; Choi, Jong-Youl; Aji, Ashwin; Fan, Ming; Zhang, Wei; Yin, Junqi; Ashfaq, Moetasim; Lu, Dan; Balaprakash, Prasanna (2024). "ORBIT: Oak Ridge Base Foundation Model for Earth System Predictability". arXiv:2404.14712 [physics.ao-ph].

Further reading

  • Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "11.8. Transformers for Vision". Dive into deep learning. Cambridge New York Port Melbourne New Delhi Singapore: Cambridge University Press. ISBN 978-1-009-38943-3.
  • Steiner, Andreas; Kolesnikov, Alexander; Zhai, Xiaohua; Wightman, Ross; Uszkoreit, Jakob; Beyer, Lucas (June 18, 2021). "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers". arXiv:2106.10270 [cs.CV].