Authors - Sudha S K, Aji S

Abstract - Rapid advancements in video surveillance and analysis demand frameworks capable of detecting, segmenting, and tracking objects in complex, dynamic scenes. This paper introduces DySAMRefine, a novel dynamic scene-adaptive mask refinement strategy for robust video object segmentation and tracking (VOST) in dynamic environments. DySAMRefine builds on a Mask R-CNN pipeline for instance-level segmentation and incorporates a long short-term memory (LSTM) network to capture temporal dependencies, yielding smooth and consistent object tracking across frames. A spatio-temporal attention block (STAB) maintains temporal coherence, supported by a temporal consistency loss (TCL) that penalizes abrupt changes between the masks of consecutive frames. DySAMRefine dynamically adapts mask refinement to scene complexity through a deformable convolutional network (DCN), performing well in both static and highly dynamic environments. Training employs a mixed-precision scheme to reduce computational overhead, enabling real-time performance without sacrificing tracking precision. Extensive experiments and ablation analyses demonstrate that DySAMRefine enhances the accuracy and robustness of VOST, achieving superior J&F scores on benchmark datasets.
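The abstract does not give the exact form of the temporal consistency loss, so the following is a minimal PyTorch sketch of one plausible formulation: an L1 penalty between the soft masks of consecutive frames, assuming per-instance mask logits that have already been matched across frames. The function name temporal_consistency_loss and the weight lambda_tcl are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def temporal_consistency_loss(masks_t: torch.Tensor,
                              masks_prev: torch.Tensor) -> torch.Tensor:
    """Penalize abrupt changes between masks of consecutive frames.

    masks_t, masks_prev: (B, N, H, W) per-instance mask logits for the
    current and previous frames, with instances matched across frames.
    """
    probs_t = torch.sigmoid(masks_t)
    probs_prev = torch.sigmoid(masks_prev)
    # L1 difference between consecutive soft masks: the loss grows when a
    # mask changes sharply from one frame to the next, encouraging smoothness.
    return F.l1_loss(probs_t, probs_prev)

# Hypothetical usage: add the TCL term to the standard Mask R-CNN losses
# with a weighting coefficient (lambda_tcl is an assumed hyperparameter).
# total_loss = mask_rcnn_loss + lambda_tcl * temporal_consistency_loss(m_t, m_prev)
```

In a training loop, such a term would simply be summed with the detection and segmentation losses; the relative weight controls how strongly temporal smoothness is traded off against per-frame segmentation accuracy.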