Topic · Updated 2026-07-14

Visual Representations and Backbones

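Compares how CNNs, U-Net, ViT, hierarchical Transformers, GCNs, and state space models encode local, global, multi-scale, and temporal information, and how to choose a backbone by task and budget.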

#near-cvpr-2025 #vision-foundations #representation-learning

Foundation

25 Source notes · 2 Open questions

Learning path

Definition: Clarify what it is
Mechanism: Follow the information flow
Evolution: Locate landmark papers
Research: Continue into active topics

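One-sentence definition

A vision backbone encodes images, video, keypoints, or latent variables into reusable representations; classification heads, detection heads, segmentation decoders, tracking association, generative processes, or multimodal modules then use those representations to complete the task.

The backbone is not the whole model, and no backbone is "absolutely best" independent of task and hardware. Local texture, global relations, multi-scale detail, long temporal range, memory, and latency push toward different architectural trade-offs.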

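The principle in one diagram

Why can one backbone be reused across multiple tasks?

The backbone sits between the input and the task head

The backbone turns raw input into hierarchical or token representations; the task head reads only the layers and scales it needs. The denoiser in a generative model can also be viewed as a task-specific backbone, but it additionally receives time and conditioning inputs.

Input and partitioning: pixels, patches, frames, keypoints, or latent tokens
Backbone encoding: representations formed by convolution, attention, graph propagation, or state updates
Multi-scale / temporal features: balance semantic abstraction, spatial detail, and temporal relations
Task adaptation: classification, boxes, masks, trajectories, conditional denoising, or cross-modal alignment

Diagram evidence: ResNet, ViT, U-Net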
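
The split between backbone and task head described above can be sketched in a few lines. This is a toy illustration only, not any cited model: `toy_backbone`, `classification_head`, and `dense_head` are hypothetical names, and average pooling stands in for learned layers. It assumes a square input whose side is divisible by 8.

```python
from typing import Dict, List, Tuple

Grid = List[List[float]]


def avg_pool2x2(x: Grid) -> Grid:
    """Halve height and width by averaging non-overlapping 2x2 blocks."""
    h, w = len(x) // 2, len(x[0]) // 2
    return [
        [
            (x[2 * i][2 * j] + x[2 * i][2 * j + 1]
             + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]


def toy_backbone(image: Grid, num_levels: int = 3) -> Dict[str, Grid]:
    """Return a feature pyramid keyed by stride: 'stride2', 'stride4', ..."""
    feats: Dict[str, Grid] = {}
    x = image
    for level in range(1, num_levels + 1):
        x = avg_pool2x2(x)  # stand-in for a learned stage
        feats[f"stride{2 ** level}"] = x
    return feats


def classification_head(feats: Dict[str, Grid]) -> float:
    """Read only the coarsest level and pool it to a single score."""
    coarsest = max(feats, key=lambda k: int(k[len("stride"):]))
    values = [v for row in feats[coarsest] for v in row]
    return sum(values) / len(values)


def dense_head(feats: Dict[str, Grid]) -> Tuple[int, int]:
    """Read only the finest level; return the spatial shape it would predict on."""
    fine = feats["stride2"]
    return len(fine), len(fine[0])


image = [[float(i + j) for j in range(16)] for i in range(16)]
features = toy_backbone(image)
shapes = {k: (len(v), len(v[0])) for k, v in features.items()}
print(shapes)
print(classification_head(features), dense_head(features))
```

Both heads consume the same `features` dictionary but read different levels, which is the reuse the diagram points at: the backbone is computed once, and each task picks the scale it needs.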

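Prerequisite concepts and terminology

Receptive field: the extent of the input region that can influence a given feature.
Downsampling: reduces spatial size to widen semantic range and save compute, but can lose small objects and boundaries.
Multi-scale features: representations kept at several resolutions at once; commonly used in detection, segmentation, and generation.
Patch / token: an image region mapped to a vector and processed as part of a sequence.
Self-attention: aggregates tokens dynamically based on content; the cost of global attention usually grows quadratically with the number of tokens.
Skip connection: passes features directly across layers; residual addition and U-Net concatenation are two different uses.
Normalization: controls activation scale; BatchNorm, LayerNorm, and others are suited to different training and inference protocols.
Pretrained representation: features learned first from large data; their transferability depends on the data, the objective, and domain match.
FLOPs, memory, and latency: theoretical operation count, peak memory, and real wall-clock time are not the same efficiency metric.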
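
To make the receptive field and downsampling entries concrete, here is a minimal sketch of the standard recurrence for a stack of convolution or pooling layers: each layer adds `(kernel - 1) * jump` to the receptive field, and the jump (the distance in input pixels between neighboring features) is multiplied by the stride. The function name and the layer lists are illustrative assumptions; dilation and padding effects at the borders are ignored.

```python
from typing import Iterable, Tuple


def receptive_field(layers: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (receptive_field, jump) for a stack of (kernel, stride) layers.

    Assumes 1D reasoning per axis, no dilation, and square kernels.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the view by (k - 1) jumps
        jump *= stride              # downsampling spreads later layers further apart
    return rf, jump


# Three 3x3 convolutions without downsampling.
plain = receptive_field([(3, 1), (3, 1), (3, 1)])

# Same depth, but the first layer downsamples by 2.
strided = receptive_field([(3, 2), (3, 1), (3, 1)])

print(plain, strided)
```

With the same three layers, moving a stride of 2 to the front grows the receptive field from 7 to 11 input pixels, which illustrates why downsampling widens semantic range while also coarsening the grid on which small objects must be represented.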

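Historical evolution

Why have vision backbones not converged on a single architecture?

From convolutional hierarchies to coexisting representation families

Each step rebalanced trainable depth, spatial detail, global relations, and compute. Later architectures usually inherit mechanisms from the previous generation rather than starting from zero.

2012 · Deep convolutional representations: AlexNet used large data and GPUs to demonstrate end-to-end hierarchical features
2015–2016 · Residuals and multi-scale: ResNet addressed the optimization of deep networks; U-Net used symmetric skip connections to recover spatial detail
2020–2021 · Visual tokens: ViT established global token representations; Swin introduced hierarchy and local windows
2022 · Architectural rebalancing: ConvNeXt showed that modernized convolutions can still compete with Transformers
Current · Task-specific backbones: GCN, DiT, video Transformers, and SSMs diverge according to structure, generation, and long-sequence needs

Diagram evidence: AlexNet, ResNet, U-Net, Swin, ConvNeXt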

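Core mechanisms

CNN / ConvNeXt: shared local convolution kernels naturally encode locality and translation structure; hierarchical downsampling suits multi-scale tasks.
Residual networks: each block learns an increment relative to the identity mapping, which improves the optimization of deep networks; the residual mechanism is also inherited by Transformers and generative models.
U-Net: the encoder abstracts semantics, the decoder restores resolution, and same-scale skip connections preserve detail; well suited to segmentation and conditional denoising.
ViT / Swin: convert the image into tokens; ViT emphasizes global attention and data scale, while Swin uses windows and a hierarchical structure to fit high-resolution dense tasks.
GCN / Skeleton Transformer: propagate information along human joints or other irregular structures; input topology and keypoint quality are preconditions for what the model can do.
SSM / Mamba-style backbones: use state updates or scans to reduce long-sequence cost; efficiency claims need validation on real hardware.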
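
The contrast between global and windowed attention in the ViT / Swin entry can be made tangible by counting query-key pairs. The sketch below is a rough cost model under stated assumptions, not a FLOP count for any specific implementation: it ignores heads, channel width, window shifting, and projection layers. The 224-pixel input, patch size 4, and window of 7 tokens are illustrative values chosen so the grid divides evenly.

```python
def token_grid(image_size: int, patch_size: int) -> int:
    """Tokens per side for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return image_size // patch_size


def global_attention_pairs(tokens_per_side: int) -> int:
    """Every token attends to every token: N * N pairs."""
    n = tokens_per_side ** 2
    return n * n


def window_attention_pairs(tokens_per_side: int, window: int) -> int:
    """Each token attends only inside its window: N * window^2 pairs."""
    if tokens_per_side % window != 0:
        raise ValueError("token grid must be divisible by the window size")
    n = tokens_per_side ** 2
    return n * window * window


side = token_grid(224, 4)                       # fine-grained token grid
global_pairs = global_attention_pairs(side)
window_pairs = window_attention_pairs(side, 7)
print(side * side, global_pairs, window_pairs, global_pairs // window_pairs)

# Doubling resolution at a fixed patch size quadruples tokens
# and multiplies global attention pairs by 16.
ratio = global_attention_pairs(token_grid(448, 16)) // global_attention_pairs(token_grid(224, 16))
print(ratio)
```

Under this model, windowed attention grows linearly with the token count while global attention grows quadratically, which is one reason a windowed, hierarchical design is attractive for high-resolution dense tasks. Pair counts are not latency, so the measured cost on target hardware still has to be checked.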

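Main method branches

Facing a vision task, which questions should be asked first to narrow the choice of backbone?

Choose the backbone by input structure and bottleneck

First determine whether the output requires fine spatial detail, long time spans, or structured topology, then look at data and deployment budget. Model popularity should rank after these constraints.

Real-time local perception: prefer CNN/ConvNeXt, and measure high-resolution throughput and small-object recall
Fine spatial recovery: use U-Net, a feature pyramid, or a hierarchical backbone, and keep shallow-layer detail
Global relations and large-scale pretraining: choose ViT/Swin, and check token count, data scale, and attention cost
Keypoints or graph structure: choose GCN/Skeleton Transformer, and first validate the quality of the pose front end
Long video and trajectories: compare video Transformers, TCN, and SSM/Mamba, and report effective temporal coverage
Diffusion denoising: compare U-Net, DiT, and SSM, and account for NFE (number of function evaluations), per-step cost, and the conditioning interface

Diagram evidence: ConvNeXt, Swin, ST-GCN, DiffuSSM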
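
The six branches above can be read as a lookup from task constraints to candidate families plus the check each branch asks for. The sketch below encodes exactly those six rows; `TaskProfile`, `RULES`, and `shortlist` are hypothetical names introduced here, and the output is a starting shortlist to benchmark, not a recommendation engine.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TaskProfile:
    """Hypothetical flags mirroring the six branches in the text."""
    realtime_local: bool = False
    fine_spatial_output: bool = False
    global_relations_large_pretraining: bool = False
    keypoint_or_graph_input: bool = False
    long_video_or_trajectory: bool = False
    diffusion_denoising: bool = False


# (flag name, candidate families, what to verify before committing)
RULES: List[Tuple[str, List[str], str]] = [
    ("realtime_local", ["CNN", "ConvNeXt"],
     "measure high-resolution throughput and small-object recall"),
    ("fine_spatial_output", ["U-Net", "feature pyramid", "hierarchical backbone"],
     "keep shallow-layer detail"),
    ("global_relations_large_pretraining", ["ViT", "Swin"],
     "check token count, data scale, and attention cost"),
    ("keypoint_or_graph_input", ["GCN", "Skeleton Transformer"],
     "validate pose front-end quality first"),
    ("long_video_or_trajectory", ["video Transformer", "TCN", "SSM/Mamba"],
     "report effective temporal coverage"),
    ("diffusion_denoising", ["U-Net", "DiT", "SSM"],
     "account for NFE, per-step cost, and conditioning interface"),
]


def shortlist(profile: TaskProfile) -> List[Tuple[List[str], str]]:
    """Return (candidate families, required check) for every active constraint."""
    return [(families, check) for flag, families, check in RULES
            if getattr(profile, flag)]


# Example: skeleton-based analysis over long clips.
profile = TaskProfile(keypoint_or_graph_input=True, long_video_or_trajectory=True)
result = shortlist(profile)
for families, check in result:
    print(", ".join(families), "->", check)
```

A task often activates several rows at once, as in the example. In that case the candidates from each row have to be compared under the same data and deployment budget rather than picked by popularity.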

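Guide to the foundational papers

AlexNet: the historical turning point for end-to-end deep convolutional representations.
ResNet: residual connections and deep optimization; take care to distinguish the mechanism from the specific model family.
U-Net: encoder-decoder structure and same-scale detail recovery.
ViT: patch tokens and a pure Transformer under large-data conditions.
Swin Transformer: window attention, cross-window connections, and hierarchical representations.
ConvNeXt: a modernized convolutional counterexample to the claim that attention is the only option.
ST-GCN: spatiotemporal representations on irregular skeleton graphs.
DiffuSSM: a state space model as a comparison backbone for high-resolution diffusion.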

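Common misconceptions and limitations

Misconception: the newer the backbone, the better. Differences in data volume, task head, resolution, and hardware can flip the ranking.
Misconception: lower FLOPs means faster. Memory access, operator fusion, batch size, input shape, and implementation equally determine latency.
Misconception: global attention necessarily understands global relations. Being able to form a connection does not mean the training objective demands correct attribute binding or temporal causality.
Misconception: U-Net is diffusion. U-Net is a network shape; diffusion is a probabilistic process together with its training and sampling mechanism.
Misconception: keypoint models are inherently interpretable. Pose-estimation errors, occlusion, and viewpoint bias propagate into the GCN, and joint heatmaps still need calibration against evidence.
Limitation: backbone comparisons are often unfair. Data augmentation, pretraining, resolution, training steps, and the downstream head must all be recorded together.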
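
One way to act on the fairness limitation is to record the five factors named above for every run and refuse to compare runs that differ in anything but the backbone. The sketch below is a minimal illustration: `BenchmarkProtocol` and `protocol_mismatches` are hypothetical names, and the field values are invented placeholders, not results or settings from any cited paper.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class BenchmarkProtocol:
    """The factors the text says must be recorded together."""
    augmentation: str
    pretraining: str
    resolution: int
    train_steps: int
    head: str


def protocol_mismatches(a: BenchmarkProtocol, b: BenchmarkProtocol) -> List[str]:
    """Names of protocol fields that differ between two runs, in declaration order."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Placeholder values for illustration only.
run_a = BenchmarkProtocol(
    augmentation="basic", pretraining="none",
    resolution=224, train_steps=100_000, head="linear",
)
run_b = BenchmarkProtocol(
    augmentation="basic", pretraining="large-scale",
    resolution=384, train_steps=100_000, head="linear",
)

mismatches = protocol_mismatches(run_a, run_b)
if mismatches:
    print("Not directly comparable; differing factors:", mismatches)
```

Here the two runs differ in pretraining and resolution, so any accuracy gap cannot be attributed to the backbone alone. An empty mismatch list is a necessary condition for a fair comparison, not a sufficient one, since hardware and implementation also affect measured cost.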

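Relation to research topics

topics/image-generation and topics/diffusion-efficiency-engineering compare the roles of U-Net, DiT, and SSM in generation quality and cost.
topics/video-understanding focuses on whether a backbone can extend from short clips to long-duration events and reasoning.
topics/sports-ai-video-understanding connects detection, trajectories, skeletons, and match state into a system.
topics/image-text-retrieval compares dual-encoder vision backbones on recall quality, local alignment, and indexing cost.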

Evidence base

CNNs and residuals: sources/2026-07-14-alexnet, sources/2026-07-14-resnet, sources/2026-07-14-convnext.
Encoder-decoder: sources/2026-07-14-u-net, sources/2026-04-14-latent-diffusion-models.
Vision Transformers: sources/2026-06-29-vision-transformer, sources/2026-07-14-swin-transformer, sources/2026-04-25-timesformer.
Structure and long sequences: sources/2026-04-25-st-gcn, sources/2026-04-23-sportmamba, sources/2026-04-15-diffusion-models-without-attention.

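Referenced by: 23

CLIP Family (Entity)
Index (Entry)
TrackMAE: learning motion-sensitive video representations with trajectory masking and prediction (Source note)
BlockGCN: preserving skeleton topology while modeling multiple joint relations at low cost (Source note)
ProtoGCN: amplifying local differences between similar skeleton actions through motion-prototype reconstruction (Source note)
SkateFormer: efficient joint attention via four types of skeletal-temporal partitions (Source note)
CLIP: learning transferable visual representations from natural-language supervision (Source note)
DETR: recasting object detection as Transformer set prediction (Source note)
Segment Anything: making segmentation a promptable vision foundation model (Source note)
Vision Transformer: a general vision backbone that cuts images into tokens (Source note)
ImageNet Classification with Deep Convolutional Neural Networks: AlexNet and the turning point for deep vision (Source note)
A ConvNet for the 2020s: ConvNeXt and modern convolutional backbones (Source note)
Deep Residual Learning for Image Recognition: ResNet and residual learning (Source note)
Swin Transformer: shifted windows and a hierarchical vision Transformer (Source note)
Computer Vision Overview (Topic)
Diffusion Models (Topic)
Image-Text Retrieval (Topic)
Sports AI Research Roadmap (Topic)
Sports AI Video Understanding (Topic)
Video Representation and Temporal Modeling (Topic)
Video Understanding (Topic)
Vision-Language and Multimodal Representation Foundations (Topic)
Foundations of Core Vision Tasks (Topic)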

Metadata

{
  "id": "topic-vision-backbones",
  "type": "topic",
  "topic_kind": "concept",
  "title": "图像表示与视觉主干网络",
  "title_en": "Visual Representations and Backbones",
  "nav_title": "视觉主干",
  "nav_title_en": "Vision Backbones",
  "status": "active",
  "created": "2026-05-23",
  "updated": "2026-07-14",
  "tags": [
    "near-cvpr-2025",
    "vision-foundations",
    "representation-learning"
  ],
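  "summary": "Compares how CNNs, U-Net, ViT, hierarchical Transformers, GCNs, and state space models encode local, global, multi-scale, and temporal information, and how to choose a backbone by task and budget.",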
  "source_notes": [
    "sources/2026-07-14-alexnet",
    "sources/2026-07-14-resnet",
    "sources/2026-06-29-vision-transformer",
    "sources/2026-07-14-swin-transformer",
    "sources/2026-07-14-convnext",
    "sources/2026-07-14-u-net",
    "sources/2026-04-14-latent-diffusion-models",
    "sources/2026-04-15-all-are-worth-words",
    "sources/2026-04-15-scalable-diffusion-models-with-transformers",
    "sources/2026-04-15-diffusion-models-without-attention",
    "sources/2026-04-23-sportmamba",
    "sources/2026-04-25-timesformer",
    "sources/2026-04-25-st-gcn",
    "sources/2026-04-25-videomae",
    "sources/2026-04-25-openpose",
    "sources/2026-04-25-mmpose",
    "sources/2026-05-05-tracknet-high-speed-tiny-objects",
    "sources/2026-05-12-trackmae",
    "sources/2026-05-16-tempose-badminton-fine-grained-motion",
    "sources/2026-05-16-blockgcn-topology-aware-skeleton-action-recognition",
    "sources/2026-05-16-skateformer-skeletal-temporal-transformer",
    "sources/2026-05-16-protogcn-skeleton-action-recognition",
    "sources/2026-06-29-clip",
    "sources/2026-06-29-detr",
    "sources/2026-06-29-segment-anything"
  ],
  "foundational_sources": [
    "sources/2026-07-14-alexnet",
    "sources/2026-07-14-resnet",
    "sources/2026-06-29-vision-transformer",
    "sources/2026-07-14-swin-transformer",
    "sources/2026-07-14-convnext",
    "sources/2026-07-14-u-net",
    "sources/2026-04-25-st-gcn",
    "sources/2026-04-15-diffusion-models-without-attention"
  ],
  "visuals": [
    "backbone-role",
    "backbone-history",
    "backbone-selection"
  ],
  "related_topics": [
    "topics/computer-vision-overview",
    "topics/vision-task-foundations",
    "topics/diffusion-models",
    "topics/video-representation-and-temporal-modeling",
    "topics/sports-ai-video-understanding"
  ],
  "related_entities": [
    "entities/diffusion-transformer"
  ],
  "open_questions": [
    "questions/question-data-vs-architecture-in-image-editing",
    "questions/question-badminton-stroke-correction-demo"
  ]
}
