Monash University

Efficient Transformers for Visual Recognition

thesis
posted on 2024-10-17, 05:21 authored by Zizheng Pan
Vision Transformer (ViT) is a type of deep neural network that adopts multi-head self-attention mechanisms. Benefiting from their capability to capture global dependencies, recent ViTs have shown strong performance on many computer vision tasks, such as image classification and object detection. However, ViTs are expensive to train and run at inference due to the quadratic complexity of self-attention, especially for high-resolution tasks, which typically incur high computational costs and carbon emissions. In this thesis, we focus on designing efficient ViTs through token merging, efficient architecture design and deployment strategies, aiming to enhance training and inference efficiency and promote GreenAI.
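
To illustrate the quadratic cost mentioned in the abstract, the following is a minimal sketch of standard scaled dot-product self-attention in PyTorch (not the thesis's specific methods); the function and variable names are illustrative assumptions. The (N, N) attention matrix is what grows quadratically with the number of tokens N.

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Standard scaled dot-product self-attention over N tokens.

    x: (N, d) token embeddings; w_q, w_k, w_v: (d, d) projection weights.
    The attention matrix is (N, N), hence the quadratic cost in N.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # (N, N)
    return attn @ v

# Example: for a 224x224 image split into 16x16 patches, N = 196 tokens;
# doubling the input resolution quadruples N and raises the attention
# cost roughly 16x, which motivates token merging and efficient designs.
```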

History

Campus location

Australia

Principal supervisor

Bohan Zhuang

Additional supervisor 1

Jianfei Cai

Year of Award

2024

Department, School or Centre

Data Science & Artificial Intelligence

Course

Doctor of Philosophy

Degree Type

DOCTORATE

Faculty

Faculty of Information Technology
