posted on 2025-05-22, 18:55authored byYuetian Weng
While Vision Transformers and Large Language Models have advanced video understanding, their computational demands have pose significant challenges. This thesis explores the temporal redundancy issue in video understanding tasks, developing efficient video understanding models that improve inference efficiency while maintaining competitive model performance for real-world applications.