
Figure 1
This figure provides a high-level overview of the software architecture, organized by module functionality. Arrows denote invocation direction. Green boxes indicate core pipeline components responsible for a broad range of tasks. Blue boxes represent subsidiary subsystems with well-defined roles within the pipeline. Yellow boxes correspond to utility classes, including data-structure components and auxiliary widgets extending Tkinter functionality. The red box denotes the underlying tracking dependency (SAM2).

Figure 2
The graphical user interface consists of the control panel (left), the canvas (right), and the slider (below the canvas). The control panel contains the following modules: (A) loading and saving sessions, (B) settings and information buttons, (C) loading input media, (D) label management, (E) feature management, (F) annotation propagation, (G) toggle for visualization, (H) frame slider.

Figure 3
Control flow for annotating a single block: the input video is processed in blocks to enable efficient memory usage. Within each block, the user provides prompts on selected frames. We apply SAM2 to extend the prompts on those frames to masks, and propagate the masks to the remaining frames in the block. Finally, the system allows for saving the annotation configuration, the log, and the annotations.
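The block-wise control flow above can be sketched as follows (a minimal illustration; `split_into_blocks` and `block_size` are names chosen for this sketch, not SAMannot's actual API):

```python
def split_into_blocks(num_frames, block_size):
    """Partition frame indices 0..num_frames-1 into consecutive blocks,
    so that only one block of frames needs to reside in memory at a time."""
    blocks = []
    for start in range(0, num_frames, block_size):
        end = min(start + block_size, num_frames)
        blocks.append(list(range(start, end)))
    return blocks

# Example: a 10-frame clip processed in blocks of 4 frames.
blocks = split_into_blocks(10, 4)
# Each block would then be annotated in turn: the user prompts selected
# frames, SAM2 turns the prompts into masks, and the masks are propagated
# to the remaining frames of the block before moving on.
```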
Table 1
Quantitative evaluation of SAMannot on a subset of the DAVIS 2017 train-val dataset (480p resolution). The metrics represent the mean Intersection over Union (IoU), Dice coefficient, and Pixel Accuracy for each sequence.
| SEQUENCE NAME | FRAMES (#) | INSTANCES (#) | ALL MASKS (#) | MEAN IOU | MEAN DICE | PIXEL ACC. |
|---|---|---|---|---|---|---|
| Rhino | 90 | 1 | 90 | 0.9807 | 0.9902 | 0.9968 |
| Cows | 104 | 1 | 104 | 0.9710 | 0.9853 | 0.9966 |
| Bear | 82 | 1 | 82 | 0.9703 | 0.9849 | 0.9966 |
| Camel | 90 | 1 | 90 | 0.9691 | 0.9843 | 0.9962 |
| Dog | 60 | 1 | 60 | 0.9640 | 0.9817 | 0.9960 |
| Breakdance | 84 | 1 | 84 | 0.9593 | 0.9792 | 0.9963 |
| Breakdance-flare | 71 | 1 | 71 | 0.9586 | 0.9789 | 0.9974 |
| Tuk-tuk | 59 | 3 | 177 | 0.9515 | 0.9747 | 0.9783 |
| Blackswan | 50 | 1 | 50 | 0.9506 | 0.9746 | 0.9950 |
| Cat-girl | 89 | 2 | 178 | 0.9460 | 0.9722 | 0.9838 |
| Night-race | 46 | 2 | 83 | 0.9445 | 0.9656 | 0.9973 |
| Train | 80 | 4 | 320 | 0.9290 | 0.9631 | 0.9877 |
| Bus | 80 | 1 | 80 | 0.9295 | 0.9626 | 0.9883 |
| Classic-car | 63 | 3 | 189 | 0.9265 | 0.9579 | 0.9879 |
| Color-run | 84 | 3 | 217 | 0.9252 | 0.9579 | 0.9695 |
| Boxing-fisheye | 87 | 3 | 261 | 0.9099 | 0.9522 | 0.9948 |
| Bike-packing | 69 | 2 | 138 | 0.9026 | 0.9482 | 0.9825 |
| Pigs | 79 | 3 | 237 | 0.8914 | 0.9278 | 0.9885 |
| Boat | 75 | 1 | 75 | 0.8243 | 0.9036 | 0.9882 |
| Sheep | 68 | 5 | 340 | 0.8361 | 0.8945 | 0.9951 |
| Drone | 91 | 4 | 298 | 0.8188 | 0.8649 | 0.9627 |
| Schoolgirls | 80 | 7 | 560 | 0.7473 | 0.8214 | 0.9896 |
| Average | 76 | – | 172 | 0.9185 | 0.9512 | 0.9893 |
| Std dev. | 14.37 | – | 125.27 | 0.0606 | 0.0436 | 0.0093 |
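For reference, the three metrics reported in the tables can be computed from a predicted and a ground-truth mask as follows (a minimal, library-free sketch in which masks are represented as sets of foreground pixel coordinates; this illustrates the standard definitions and is not the evaluation script used here):

```python
def mask_metrics(pred, gt, total_pixels):
    """IoU, Dice, and Pixel Accuracy for binary masks given as sets of
    (row, col) foreground coordinates in a frame of `total_pixels` pixels."""
    inter = len(pred & gt)
    union = len(pred | gt)
    iou = inter / union if union else 1.0
    dice = 2 * inter / (len(pred) + len(gt)) if (pred or gt) else 1.0
    # Pixels counted as correct: true positives plus true negatives.
    correct = inter + (total_pixels - union)
    pixel_acc = correct / total_pixels
    return iou, dice, pixel_acc

# Toy example on a 4x4 frame (16 pixels):
pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
gt = {(0, 0), (0, 1), (1, 0)}
iou, dice, acc = mask_metrics(pred, gt, 16)
# iou = 3/4, dice = 6/7, pixel_acc = 15/16
```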
Table 2
Quantitative evaluation of SAMannot on a subset of the LVOS dataset [17]. The metrics represent the mean Intersection over Union (IoU), Dice coefficient, and Pixel Accuracy for each sequence.
| SEQUENCE NAME | FRAMES (#) | INSTANCES (#) | ALL MASKS (#) | MEAN IOU | MEAN DICE | PIXEL ACC. |
|---|---|---|---|---|---|---|
| 3bvEjhOT | 461 | 2 | 896 | 0.9585 | 0.9778 | 0.9967 |
| 7K7WVzGG | 617 | 2 | 1262 | 0.8030 | 0.8391 | 0.9989 |
| cUD1dwuP | 793 | 4 | 2723 | 0.9282 | 0.9588 | 0.9969 |
| EWCZAcdt | 1412 | 2 | 2056 | 0.8333 | 0.9010 | 0.9993 |
| HYSm91eM | 500 | 10 | 4992 | 0.7997 | 0.8452 | 0.9934 |
| Average | 757 | – | 2386 | 0.8645 | 0.9044 | 0.9970 |
| Std dev. | 388.45 | – | 1619.97 | 0.0739 | 0.0635 | 0.0023 |

Figure 4
Qualitative examples of segmentation results on images from the DAVIS 2017 dataset [16]. The columns display the original video frame (left), the ground truth (middle), and the masks predicted by SAMannot (right).

Figure 5
Comparison of semantic segmentation boundaries. From left to right: original frame of the Blackswan sequence, official DAVIS 2017 ground truth, and SAMannot prediction. Note the discrepancy regarding the swan’s feet: while the ground truth excludes them, SAMannot correctly identifies these regions as part of the semantic instance. Such differences contribute to a lower measured Mean IoU and Mean Dice, despite the model providing a more anatomically complete segmentation.

Figure 6
Illustrative frames from the DAVIS sequences analyzed for the performance metrics.

Figure 7
Illustrative frames from the LVOS sequences analyzed for the performance metrics, demonstrating the visual diversity of the dataset.
Table 3
Performance metrics and resource utilization during video annotation on videos from the DAVIS 2017 dataset [16]. Duration encompasses label definition, annotation, and the final data export.
| VIDEO NAME | INST. (#) | FRAMES (#) | Duration (mm:ss) | VRAMmin (MiB) | VRAMmax (MiB) | VRAM (MiB) |
|---|---|---|---|---|---|---|
| night-race | 2 | 46 | 0:47 | 1503 | 2354 | 851 |
| schoolgirls | 7 | 80 | 5:09 | 1536 | 2898 | 1362 |
| train | 4 | 80 | 4:09 | 1512 | 2711 | 1199 |
| tuk-tuk | 3 | 59 | 1:33 | 1509 | 2669 | 1160 |
| sheep | 5 | 68 | 1:35 | 1519 | 2711 | 1192 |
Table 4
Performance metrics and resource utilization during video annotation on videos from the LVOS dataset [17]. Duration encompasses label definition, annotation, and the final data export; ΔVRAM is the difference between the maximum and minimum VRAM usage.
| VIDEO NAME | INST. (#) | FRAMES (#) | DURATION (mm:ss) | VRAMmin (MiB) | VRAMmax (MiB) | ΔVRAM (MiB) |
|---|---|---|---|---|---|---|
| 3bvEjhOT | 2 | 461 | 7:42 | 1521 | 2713 | 1192 |
| 7K7WVzGG | 2 | 617 | 8:30 | 1523 | 2687 | 1164 |
| cUD1dwuP | 4 | 793 | 15:49 | 2112 | 2763 | 651 |
| EWCZAcdt | 2 | 1412 | 19:25 | 2150 | 2843 | 693 |
| HYSm91eM | 10 | 500 | 16:56 | 1530 | 2868 | 1338 |

Figure 8
Examples of ground-truth inconsistencies in LVOS (top) and the consistent annotation achieved with SAMannot (bottom): (a) merging the masks of two distinct players; (b) inclusion of objects that should remain unlabeled, which are moreover incorrectly assigned the same label as another instance; (c) temporal and structural inconsistency: the ball within the player's mask is labeled inconsistently across consecutive frames (alternating between allowing and preventing overlaps).
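A check of the kind that would flag case (c) can be sketched as follows (illustrative code, not part of SAMannot; masks are again sets of pixel coordinates):

```python
def overlapping_instance_pairs(frame_masks):
    """Given {instance_id: set_of_pixels} for one frame, return the
    instance pairs whose masks overlap. A consistent annotation policy
    should either always allow or always forbid such overlaps, so pairs
    that appear only on some frames of a sequence signal inconsistency."""
    ids = sorted(frame_masks)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if frame_masks[a] & frame_masks[b]:
                pairs.append((a, b))
    return pairs

# Frame where the ball's pixels lie inside the player's mask:
frame = {"player": {(5, 5), (5, 6), (6, 5)}, "ball": {(5, 6)}}
overlaps = overlapping_instance_pairs(frame)
```

Running the same check on every frame of a sequence and comparing the resulting pair lists reveals frames where the overlap policy flips.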

Figure 9
A system resource monitor, accessible via a pop-up from the main control window, provides real-time tracking of RAM usage, GPU utilization, and GPU VRAM occupancy.
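A monitor of this kind can obtain the GPU figures by polling `nvidia-smi`; the sketch below shows this for the first GPU (an illustration only, not SAMannot's implementation; the query flags assume an NVIDIA driver with `nvidia-smi` on the PATH):

```python
import subprocess

def parse_gpu_sample(csv_line):
    """Parse one line of `nvidia-smi --query-gpu=memory.used,utilization.gpu
    --format=csv,noheader,nounits`, e.g. "2354, 37" -> (2354, 37)."""
    used_mib, util_pct = (int(v.strip()) for v in csv_line.split(","))
    return used_mib, util_pct

def gpu_sample():
    """Query the first GPU; returns (VRAM used in MiB, utilization in %)."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.used,utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_sample(out.splitlines()[0])

# The parser can be exercised without a GPU:
sample = parse_gpu_sample("2354, 37")
```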

Figure 10
Illustration of the user guide windows (A).

Figure 11
Illustration of the user guide windows (B).

Figure 12
Qualitative examples of segmentation results on images from the DAVIS 2017 dataset [16]. The columns display the original video frame (left), the ground truth (middle), and the masks predicted by SAMannot (right).
