Monash University

Efficient Transformers for Visual Recognition

thesis
posted on 2024-10-17, 05:21 authored by Zizheng Pan
Vision Transformer (ViT) is a type of deep neural network that adopts multi-head self-attention mechanisms. Benefiting from their capability to capture global dependencies, recent ViTs have shown strong performance on many computer vision tasks, such as image classification and object detection. However, ViTs are expensive to train and run at inference due to the quadratic complexity of self-attention, especially for high-resolution tasks, which typically incur high computational costs and carbon emissions. In this thesis, we focus on designing efficient ViTs through token merging, efficient architecture design and deployment strategies, aiming to enhance training and inference efficiency and promote GreenAI.
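
To illustrate the quadratic cost mentioned in the abstract, the following is a minimal sketch of standard scaled dot-product self-attention in PyTorch (not the thesis's specific methods); the function and variable names are illustrative assumptions. The (N, N) attention matrix is what grows quadratically with the number of tokens N.

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Standard scaled dot-product self-attention over N tokens.

    x: (N, d) token embeddings; w_q, w_k, w_v: (d, d) projection weights.
    The attention matrix is (N, N), hence the quadratic cost in N.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # (N, N)
    return attn @ v

# Example: for a 224x224 image split into 16x16 patches, N = 196 tokens;
# doubling the input resolution quadruples N and raises the attention
# cost roughly 16x, which motivates token merging and efficient designs.
```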

History

Campus location

Australia

Principal supervisor

Bohan Zhuang

Additional supervisor 1

Jianfei Cai

Year of Award

2024

Department, School or Centre

Data Science & Artificial Intelligence

Course

Doctor of Philosophy

Degree Type

DOCTORATE

Faculty

Faculty of Information Technology
