Visual object detection and tracking is a complex computer vision problem
that often requires multiple components and can be decomposed in various
ways. Traditional deep learning methods divide the high-level problem into
separate components optimized for surrogate tasks and combine them
heuristically during inference. In this thesis, we explore a different methodology
that decomposes the problem so that as many of its aspects as possible
can be addressed by a single integrated machine learning process. As a result,
the model is trained end-to-end to optimize the high-level task directly.
This enables knowledge sharing between sub-task modules and streamlines
the model.