Enhanced Kalman with Adaptive Appearance Motion SORT for Grounded Generic Multiple Object Tracking

Dataset

Comparison of **existing datasets** for SOT, MOT, GSOT, and GMOT. "#" denotes the quantity of the respective item. **Cat.** and **Vid.** denote categories and videos. **Obj.**: average number of objects per frame. **App.**: appearance similarity (%) between objects in a frame, calculated as the average cosine similarity of objects in the same frame. **Den.**: density of objects in a frame, computed as the maximum number of objects at the same pixel. **Occ.**: occlusion between objects in a frame, represented by the average IoU (%) of the bounding boxes in the same frame. **Mot.**: motion speed of objects in a video, calculated as the average IoU (%) of the bounding boxes of the same track in consecutive frames.
Datasets	Task	NLP	#Cat.	#Vid.	#Frames	#Tracks	#Boxes	Obj.	App.	Den.	Occ.	Mot.
OTB2013	SOT	✗	10	51	29K	51	29K	--	--	--	--	--
VOT2017	SOT	✗	24	60	21K	60	21K	--	--	--	--	--
TrackingNet	SOT	✗	21	31K	14M	31K	14M	--	--	--	--	--
MOT17	MOT	✗	1	14	11.2K	1.3K	0.3M	39(35)	62(10)	3.85(1.50)	14(16)	94(11)
MOT20	MOT	✗	1	8	13.41K	3.45K	1.65M	150(70)	68(8)	6.42(1.20)	15(15)	96(4)
Omni-MOT	MOT	✗	1	--	14M+	250K	110M	--	--	--	--	--
DanceTrack	MOT	✗	1	100	105K	990	--	9(5)	77(7)	2.67(0.99)	21(17)	90(9)
TAO	MOT	✗	833	2.9K	2.6M	17.2K	333K	3(2)	69(7)	1.82(0.76)	11(14)	49(34)
SportsMOT	MOT	✗	1	240	150K	3.4K	1.62M	11(3)	73(8)	2.44(0.80)	18(17)	80(16)
GOT-10k	GSOT	✗	563	10K	1.5M	10K	1.5M	--	--	--	--	--
Fish	GSOT	✗	1	1.6K	527.2K	8.25K	516K	--	--	--	--	--
AnimalTrack	GMOT	✗	10	58	24.7K	1.92K	429K	17(9)	72(8)	3.13(1.22)	15(15)	91(11)
GMOT-40	GMOT	✗	10	40	9K	2.02K	256K	24(17)	71(9)	2.56(0.88)	11(12)	43(44)
LaSOT	SOT	coarse	70	1.4K	3.52M	1.4K	3.52M	--	--	--	--	--
TNL2K	SOT	coarse	--	2K	1.24M	2K	1.24M	--	--	--	--	--
Refer-KITTI	MOT	coarse	2	18	6.65K	637	28.72K	5(4)	65(6)	1.78(0.74)	11(11)	73(21)
G²MOT (Ours)	GMOT	fine	20	253	157.2K	5.84K	1.87M	12(5)	74(8)	2.65(0.95)	18(16)	84(14)
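
The two IoU-based statistics in the table caption (Occ. and Mot.) can be illustrated with a short Python sketch. This is only one reading of the caption, not the authors' script: the function names are hypothetical, boxes are assumed to be `(left, top, width, height)` tuples, and Occ. is assumed to average over all box pairs in a frame, including pairs that do not overlap.

```python
from itertools import combinations


def iou(a, b):
    """IoU of two boxes given as (left, top, width, height)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    inter_w = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    inter_h = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def occlusion(frame_boxes):
    """Mean pairwise IoU (%) over all boxes of one frame (assumed reading of Occ.)."""
    pairs = list(combinations(frame_boxes, 2))
    if not pairs:
        return 0.0
    return 100.0 * sum(iou(a, b) for a, b in pairs) / len(pairs)


def motion(track_boxes):
    """Mean IoU (%) between consecutive boxes of one track (assumed reading of Mot.)."""
    steps = list(zip(track_boxes, track_boxes[1:]))
    if not steps:
        return 0.0
    return 100.0 * sum(iou(a, b) for a, b in steps) / len(steps)
```

Under this reading, a higher Mot. value means more overlap between consecutive frames, i.e. slower apparent motion.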

Combining datasets for object tracking has two strategic advantages. First, individual tracking datasets each focus on specific challenges. Second, merging them yields a more diverse set of challenges, requiring tracking models to perform efficiently in varied scenarios. By combining datasets, we can therefore evaluate a tracking model's ability to handle diverse conditions, e.g. object motion, density, similar appearance, and occlusion, which is in line with the goal of the GMOT challenge. Our ultimate objective is to propose a new paradigm for GMOT and to create a challenging benchmark dataset covering demanding real-world scenarios.

Each video in these datasets has been carefully annotated with several details:

For text label:

class_name: Represents the common name of the object class.
type (superset|subset): Indicates whether the object belongs to a "superset" category, grouping "coarse category" (e.g., horse), or a "subset" category, allowing for finer categorization (e.g., horse on ground).
synonyms: Offers alternative terms or phrases for the class name.
definition: Describes the object's visual characteristics.
attributes: Lists the attributes used to distinguish objects within the "superset".
caption: Manually crafted comprehensive description providing detailed information about the tracked objects.
track_path: The exact tracking path is stored separately, following the standard format for multiple object tracking challenges.

For track label:

Each line contains 9 elements, separated by commas:

frame: index of the frame in the video sequence
id: ID of the object according to the tracker
bb_left: x coordinate of the top-left corner of the bounding box
bb_top: y coordinate of the top-left corner of the bounding box
bb_width: width of the bounding box
bb_height: height of the bounding box
conf: confidence score, set to 1 by default
x: set to 1 by default
y: set to 1 by default
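
As an illustration of the 9-field line format above, here is a minimal Python sketch that parses track lines and groups them by object ID. The names `TrackRow`, `parse_track_line`, and `group_by_track` are hypothetical helpers introduced for this example, not part of any released toolkit.

```python
from collections import defaultdict
from typing import NamedTuple


class TrackRow(NamedTuple):
    frame: int
    track_id: int
    bb_left: float
    bb_top: float
    bb_width: float
    bb_height: float
    conf: float
    x: float
    y: float


def parse_track_line(line: str) -> TrackRow:
    """Parse one comma-separated line with exactly 9 fields."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 9:
        raise ValueError(f"expected 9 fields, got {len(parts)}")
    # Accept "1" as well as "1.0" for the integer-valued columns.
    frame, track_id = int(float(parts[0])), int(float(parts[1]))
    return TrackRow(frame, track_id, *map(float, parts[2:]))


def group_by_track(lines):
    """Return {track_id: [TrackRow, ...]} with rows sorted by frame index."""
    tracks = defaultdict(list)
    for line in lines:
        if line.strip():  # skip blank lines
            row = parse_track_line(line)
            tracks[row.track_id].append(row)
    for rows in tracks.values():
        rows.sort(key=lambda r: r.frame)
    return dict(tracks)
```

Calling `group_by_track(open(path))` on a track file (where `path` is whatever `track_path` points to) would give one time-ordered list of boxes per object ID.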

The annotations are formatted in JSON, and we provide examples to illustrate how they are structured. This data, prepared by 4 annotators, will be shared publicly.

Text label for referring with specific attributes
{
    id: "",
    video_id: "",
    is_eval: "",
    type: "",
    superset_idx: "",
    class_name: "",
    synonyms: [],
    definition: "",
    attributes: [],
    track_path: "",
    caption: ""
}
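
A small Python sketch of how a text-label record could be checked against the template above. The key set is copied from the template and the allowed `type` values from the field description; the function name `check_text_label` and the sample values are hypothetical, and the released files may differ from this template.

```python
import json

# Keys taken from the text-label template.
REQUIRED_KEYS = {
    "id", "video_id", "is_eval", "type", "superset_idx", "class_name",
    "synonyms", "definition", "attributes", "track_path", "caption",
}
LIST_KEYS = {"synonyms", "attributes"}


def check_text_label(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record looks fine."""
    problems = []
    for key in sorted(REQUIRED_KEYS - set(record)):
        problems.append(f"missing key: {key}")
    for key in sorted(LIST_KEYS & set(record)):
        if not isinstance(record[key], list):
            problems.append(f"{key} should be a list")
    if record.get("type") not in ("superset", "subset"):
        problems.append("type should be 'superset' or 'subset'")
    return problems


# Hypothetical record, written as strict JSON (quoted keys) for json.loads.
sample = json.loads("""
{
    "id": "0", "video_id": "0", "is_eval": "", "type": "subset",
    "superset_idx": "0", "class_name": "car", "synonyms": ["vehicle"],
    "definition": "", "attributes": ["white headlight"],
    "track_path": "car_01.txt", "caption": ""
}
""")
print(check_text_label(sample))
```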

Track label for associating objects' IDs through time
1, 1, xl, yt, w, h, 1, 1, 1
1, 2, xl, yt, w, h, 1, 1, 1
1, 3, xl, yt, w, h, 1, 1, 1
2, 1, xl, yt, w, h, 1, 1, 1
2, 2, xl, yt, w, h, 1, 1, 1
2, 3, xl, yt, w, h, 1, 1, 1
3, 1, xl, yt, w, h, 1, 1, 1
3, 2, xl, yt, w, h, 1, 1, 1
3, 3, xl, yt, w, h, 1, 1, 1

video: "airplane-1",
label: {
    class_name: "helicopter",
    class_synonyms: ["airplane", "aircraft", "jet", "plane"],
    definition: "a vehicle designed for flight in the air",
    include_attributes: ["black", "flying"],
    exclude_attributes: [],
    caption: "Track all black flying helicopters",
    track_path: "airplane_01.txt"
}

video: "car-1",
label: {
    class_name: "car",
    class_synonyms: ["vehicle", "automobile", "auto", "transport", "transportation"],
    definition: "mechanical device designed for transportation, powered by an engine or motor, equipped with four wheels",
    include_attributes: ["white headlight", "oncoming traffic"],
    exclude_attributes: ["red taillight", "opposite traffic"],
    caption: "Track white headlight cars while excluding red taillight cars",
    track_path: "car_01.txt"
}

G2MOT demo website
