Enhanced Kalman with Adaptive Appearance Motion SORT for Grounded Generic Multiple Object Tracking

Sample overview

Comparison between OneShot-GMOT (OS-GMOT) (left) and our Grounded-GMOT (right)

Introduction

In response to the limitations of traditional Multiple Object Tracking (MOT) systems, particularly their struggle with diverse and dynamic environments such as surveillance and autonomous driving, our research introduces three advances. Our motivation stems from the need for more adaptable, inclusive tracking technologies capable of handling the complexities of real-world scenarios. With this goal in mind, we present our contributions:

Pipeline Overview

Grounded-GMOT: This novel tracking paradigm leverages natural language processing, enabling the tracking of objects with varied attributes across different settings. This approach significantly expands the applicability of MOT by using descriptive language as a flexible, powerful tool for tracking.
G2MOT dataset: We unveil a large-scale dataset that surpasses existing collections in both diversity and size. The G2MOT dataset is designed to support the nuanced needs of Grounded-GMOT, providing a rich resource for developing and testing advanced tracking systems.
KAM-SORT: Our innovative tracking methodology combines camera motion analysis with adjustments in the balance between motion and appearance information. KAM-SORT represents a significant improvement in tracking accuracy and robustness, addressing the dynamic challenges of real-world object tracking.

Through these contributions, we aim to push the boundaries of what's possible in object tracking, offering solutions that are not only technically advanced but also broadly accessible and adaptable to the evolving demands of various applications.

G2MOT Dataset

Ensuring a fair assessment of GMOT methods demands a dataset of consistent quality, free from annotator bias, and with a clearly defined problem setup. To offer comprehensive coverage of real-world scenarios across different research domains, our released dataset embodies two characteristics:

  1. Diversity: integrating diverse object categories from various sources, encompassing a broad spectrum of classes and diverse properties such as motion, occlusion, appearance similarity, and density. Additionally, it employs high-level semantics like player, athlete, referee, etc., to describe objects in complex contexts, rather than using generic terms like person.
  2. Fine-Grained Annotation: alongside capturing detailed visual attributes like color, texture, and attachments, it offers extensive textual descriptions, pairing each caption with existing synonyms.
Sample overview

Examples to illustrate the efficacy of IE-Strategy. Left: Output from pre-trained VLM. Right: Output from IE-Strategy.

Module 1: Diversity. The dataset integrates a wide array of object categories from various sources. This diversity encompasses a broad spectrum of classes and properties, including motion, occlusion, appearance similarity, and density. Notably, it uses high-level semantics (like player, athlete, referee) to describe objects in complex contexts, moving beyond generic descriptors.

wordcloud

Statistical information of our proposed G2MOT dataset.



Module 2: Fine-Grained Annotation. Beyond visual attributes (color, texture, attachments), the dataset provides extensive textual descriptions and synonyms. These annotations offer a granular view of each object, enhancing the dataset's utility for rigorous GMOT method assessment.

anno_exp

Demonstration of superset and subset from horse_4 and car-3 in our proposed G2MOT dataset.
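To make the fine-grained annotation concrete, here is a hypothetical sketch of what one annotation entry might look like and how its attributes could be composed into a grounded tracking prompt. The field names and values are illustrative assumptions, not the dataset's released schema.

```python
# Hypothetical sketch of a G2MOT-style annotation entry; field names are
# illustrative assumptions, not the dataset's released format.
annotation = {
    "video": "horse_4",
    "caption": "brown horses running on a grass field",
    "class": "horse",                      # high-level semantic category
    "synonyms": ["horse", "equine", "pony"],
    "attributes": {                        # fine-grained visual attributes
        "color": "brown",
        "texture": "short-haired",
        "attachment": "saddle",
    },
}

def build_text_query(anno):
    """Compose a grounded tracking prompt from the attributes and class name."""
    attrs = " ".join(anno["attributes"].values())
    return f"{attrs} {anno['class']}"
```

A query such as `build_text_query(annotation)` would then yield a descriptive phrase ("brown short-haired saddle horse") that a grounded tracker can consume instead of a generic class label.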



KAM-SORT

The KAM-SORT method addresses tracking within the Generic Multiple Object Tracking (GMOT) setting. It computes the cost matrix between existing tracks 𝒯 and new detections 𝒟 using a novel approach that combines motion and visual appearance cues.
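The following is a minimal sketch of one plausible way to blend motion and appearance cues into a single cost matrix, using IoU for motion and cosine distance for appearance with a tunable weight `lam`. The function names and the specific weighting scheme are assumptions for illustration; they are not the paper's exact Equation 5.

```python
import numpy as np

def iou_matrix(tracks, dets):
    """Pairwise IoU between track and detection boxes in (x1, y1, x2, y2)."""
    t = tracks[:, None, :]  # shape (T, 1, 4)
    d = dets[None, :, :]    # shape (1, D, 4)
    x1 = np.maximum(t[..., 0], d[..., 0])
    y1 = np.maximum(t[..., 1], d[..., 1])
    x2 = np.minimum(t[..., 2], d[..., 2])
    y2 = np.minimum(t[..., 3], d[..., 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_t = (t[..., 2] - t[..., 0]) * (t[..., 3] - t[..., 1])
    area_d = (d[..., 2] - d[..., 0]) * (d[..., 3] - d[..., 1])
    return inter / (area_t + area_d - inter + 1e-9)

def combined_cost(track_boxes, det_boxes, track_emb, det_emb, lam=0.5):
    """Blend motion cost (1 - IoU) with appearance cost (cosine distance).

    lam weights motion vs. appearance; an adaptive tracker could shift lam
    toward motion when appearances are near-identical, as in GMOT.
    """
    motion_cost = 1.0 - iou_matrix(track_boxes, det_boxes)
    t = track_emb / np.linalg.norm(track_emb, axis=1, keepdims=True)
    d = det_emb / np.linalg.norm(det_emb, axis=1, keepdims=True)
    app_cost = 1.0 - t @ d.T  # cosine distance
    return lam * motion_cost + (1.0 - lam) * app_cost
```

The resulting matrix can be fed directly to a bipartite matcher such as `scipy.optimize.linear_sum_assignment` to produce track-detection assignments.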

Key contributions and advantages of this method include:

Kalman++: an enhanced Kalman association that uses an uncertainty revise factor α to validate second-round matches against the predicted error covariance.
Adaptive appearance-motion balance: the tracker adjusts the weight between motion and appearance cues, leaning on whichever signal is more reliable when generic objects look highly similar.
Camera motion analysis: compensating for camera movement keeps motion predictions stable under rapid scene changes.

Overall, the KAM-SORT method exhibits a strategic balance between visual and motion cues, addressing the challenges presented by the high similarity of objects in GMOT. This advancement represents a significant step forward in tracking technology, especially in scenarios with rapid motion and deformation.

kam-sort_compare

Tracking comparison on fast-moving objects between our KAM-SORT and SORT, OC-SORT, and DeepOCSORT on the video “insect-3”.


Algorithm: Kalman++
Data:
D, T: set of detection boxes at the current frame and tracks at the previous frame.
α: uncertainty revise factor.
Model:
C: score matrix defined in Equation 5.
M: bipartite matching function.
Kp, Ku: Kalman filter predict and update steps.
BC, IoU: functions computing the box center and the IoU.
Output:
T': set of updated tracks.
Algorithm Steps:
  1. ^x, P = Kp(T); // Estimated locations and error covariances for all tracks.
  2. S = C(^x, D); // Matching scores between estimates and detections.
  3. DTm, Dr, Tr = M(S); // 1st-round association: matched pairs DTm, unmatched detections Dr, unmatched tracks Tr.
  4. SIoU = IoU(Dr, Tr); // 2nd-round association over the remaining ones.
  5. DTr = M(SIoU); // Rematched pairs from the remaining detections and tracks.
  6. For (id, it) in DTr do
    • // id: detection index, it: track index.
    • cmin = ^x_it[:2] − α · sqrt(P_it[:2]);
    • cmax = ^x_it[:2] + α · sqrt(P_it[:2]);
    • c = BC(D_id);
    • If cmin < c < cmax (element-wise) then
    • DTm = DTm ∪ {(id, it)};
  7. T' = Ku(DTm); // Update the matched tracks.
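The uncertainty gate in step 6 can be sketched as follows: a rematched pair is accepted only if the detection's center lies within α standard deviations of the track's predicted center on both axes. The helper names here are illustrative, not the released implementation.

```python
import numpy as np

def box_center(box):
    """Center (cx, cy) of an (x1, y1, x2, y2) box."""
    return np.array([(box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0])

def uncertainty_gate(pred_center, P_diag_xy, det_box, alpha=2.0):
    """Accept a rematched pair only when the detection center falls inside
    pred_center ± alpha * sqrt(P) on both axes (Kalman++ step 6).

    pred_center: predicted (cx, cy) from the Kalman predict step.
    P_diag_xy:   diagonal of the error covariance for (cx, cy).
    """
    half = alpha * np.sqrt(P_diag_xy)
    cmin, cmax = pred_center - half, pred_center + half
    c = box_center(det_box)
    return bool(np.all(c > cmin) and np.all(c < cmax))
```

Pairs that pass the gate are appended to the matched set before the Kalman update; pairs that fail are discarded as likely spurious IoU matches.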