The Vision Transformer (ViT) is a type of deep neural network that adopts multi-head self-attention mechanisms. Benefiting from its capability to capture global dependencies, recent ViTs have shown strong performance in many computer vision tasks, such as image classification and object detection. However, ViTs are expensive to train and to run at inference time due to the quadratic complexity of self-attention, especially for high-resolution tasks, which incur high computational costs and carbon emissions. In this thesis, we focus on designing efficient ViTs through token merging, efficient architecture design, and deployment strategies, aiming to enhance training and inference efficiency and promote GreenAI.
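To make the quadratic cost concrete, the sketch below implements a minimal single-head self-attention step in PyTorch (an illustrative simplification with identity projections, not the actual model used in this thesis): the N x N attention map is the term that makes compute and memory grow quadratically with the number of tokens, and hence rapidly with image resolution.

```python
import torch

def self_attention(x):
    """Minimal single-head self-attention over a token sequence.

    x: (N, d) tensor of N token embeddings of dimension d.
    The attention map is N x N, so cost grows quadratically in N.
    """
    d = x.shape[-1]
    q, k, v = x, x, x                               # identity projections for brevity
    scores = q @ k.transpose(-2, -1) / d ** 0.5     # (N, N): the quadratic term
    attn = scores.softmax(dim=-1)                   # row-wise attention weights
    return attn @ v                                 # (N, d) attended tokens

# A 224x224 image split into 16x16 patches yields N = 196 tokens; doubling
# the resolution quadruples N and multiplies the attention cost by ~16.
tokens = torch.randn(196, 64)
out = self_attention(tokens)
print(out.shape)  # torch.Size([196, 64])
```

Reducing the number of tokens N (as token merging does) therefore attacks this quadratic term directly.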