
Figure 1
This figure provides a high-level overview of the software architecture, organized by module functionality. Arrows denote invocation direction. Green boxes indicate core pipeline components responsible for a broad range of tasks. Blue boxes represent subsidiary subsystems with well-defined roles within the pipeline. Yellow boxes correspond to utility classes, including data-structure components and auxiliary widgets extending Tkinter functionality. The red box denotes the underlying tracking dependency (SAM2).

Figure 2
The graphical user interface consists of the control panel (left), the canvas (right), and the slider (below the canvas). The control panel contains the following modules: (A) loading and saving sessions, (B) settings and information buttons, (C) loading input media, (D) label management, (E) feature management, (F) annotation propagation, (G) toggle for visualization, (H) frame slider.

Figure 3
Control flow for annotating a single block: the input video is processed in blocks to enable efficient memory usage. Within each block, the user provides prompts on selected frames. We apply SAM2 to extend the prompts on those frames to masks, and propagate the masks to the remaining frames in the block. Finally, the system allows for saving the annotation configuration, the log, and the annotations.
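The block-wise control flow above can be sketched as follows (a minimal illustration; `split_into_blocks` and `block_size` are names chosen for this sketch, not SAMannot's actual API):

```python
def split_into_blocks(num_frames, block_size):
    """Partition frame indices 0..num_frames-1 into consecutive blocks,
    so that only one block of frames needs to reside in memory at a time."""
    blocks = []
    for start in range(0, num_frames, block_size):
        end = min(start + block_size, num_frames)
        blocks.append(list(range(start, end)))
    return blocks

# Example: a 10-frame clip processed in blocks of 4 frames.
blocks = split_into_blocks(10, 4)
# Each block would then be annotated in turn: the user prompts selected
# frames, SAM2 turns the prompts into masks, and the masks are propagated
# to the remaining frames of the block before moving on.
```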
Table 1
Quantitative evaluation of SAMannot on a subset of the DAVIS 2017 train-val dataset (480p resolution). The metrics represent the mean Intersection over Union (IoU), Dice coefficient, and Pixel Accuracy for each sequence.
| SEQUENCE NAME | FRAMES (#) | INSTANCES (#) | ALL MASKS (#) | MEAN IOU | MEAN DICE | PIXEL ACC. |
|---|---|---|---|---|---|---|
| Rhino | 90 | 1 | 90 | 0.9807 | 0.9902 | 0.9968 |
| Cows | 104 | 1 | 104 | 0.9710 | 0.9853 | 0.9966 |
| Bear | 82 | 1 | 82 | 0.9703 | 0.9849 | 0.9966 |
| Camel | 90 | 1 | 90 | 0.9691 | 0.9843 | 0.9962 |
| Dog | 60 | 1 | 60 | 0.9640 | 0.9817 | 0.9960 |
| Breakdance | 84 | 1 | 84 | 0.9593 | 0.9792 | 0.9963 |
| Breakdance-flare | 71 | 1 | 71 | 0.9586 | 0.9789 | 0.9974 |
| Tuk-tuk | 59 | 3 | 177 | 0.9515 | 0.9747 | 0.9783 |
| Blackswan | 50 | 1 | 50 | 0.9506 | 0.9746 | 0.9950 |
| Cat-girl | 89 | 2 | 178 | 0.9460 | 0.9722 | 0.9838 |
| Night-race | 46 | 2 | 83 | 0.9445 | 0.9656 | 0.9973 |
| Train | 80 | 4 | 320 | 0.9290 | 0.9631 | 0.9877 |
| Bus | 80 | 1 | 80 | 0.9295 | 0.9626 | 0.9883 |
| Classic-car | 63 | 3 | 189 | 0.9265 | 0.9579 | 0.9879 |
| Color-run | 84 | 3 | 217 | 0.9252 | 0.9579 | 0.9695 |
| Boxing-fisheye | 87 | 3 | 261 | 0.9099 | 0.9522 | 0.9948 |
| Bike-packing | 69 | 2 | 138 | 0.9026 | 0.9482 | 0.9825 |
| Pigs | 79 | 3 | 237 | 0.8914 | 0.9278 | 0.9885 |
| Boat | 75 | 1 | 75 | 0.8243 | 0.9036 | 0.9882 |
| Sheep | 68 | 5 | 340 | 0.8361 | 0.8945 | 0.9951 |
| Drone | 91 | 4 | 298 | 0.8188 | 0.8649 | 0.9627 |
| Schoolgirls | 80 | 7 | 560 | 0.7473 | 0.8214 | 0.9896 |
| Average | 76 | – | 172 | 0.9185 | 0.9512 | 0.9893 |
| Std dev. | 14.37 | – | 125.27 | 0.0606 | 0.0436 | 0.0093 |
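For reference, the three metrics reported in the tables can be computed from a predicted and a ground-truth mask as follows (a minimal, library-free sketch in which masks are represented as sets of foreground pixel coordinates; this illustrates the standard definitions and is not the evaluation script used here):

```python
def mask_metrics(pred, gt, total_pixels):
    """IoU, Dice, and Pixel Accuracy for binary masks given as sets of
    (row, col) foreground coordinates in a frame of `total_pixels` pixels."""
    inter = len(pred & gt)
    union = len(pred | gt)
    iou = inter / union if union else 1.0
    dice = 2 * inter / (len(pred) + len(gt)) if (pred or gt) else 1.0
    # Pixels counted as correct: true positives plus true negatives.
    correct = inter + (total_pixels - union)
    pixel_acc = correct / total_pixels
    return iou, dice, pixel_acc

# Toy example on a 4x4 frame (16 pixels):
pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
gt = {(0, 0), (0, 1), (1, 0)}
iou, dice, acc = mask_metrics(pred, gt, 16)
# iou = 3/4, dice = 6/7, pixel_acc = 15/16
```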
Table 2
Quantitative evaluation of SAMannot on a subset of the LVOS dataset [17]. The metrics represent the mean Intersection over Union (IoU), Dice coefficient, and Pixel Accuracy for each sequence.
| SEQUENCE NAME | FRAMES (#) | INSTANCES (#) | ALL MASKS (#) | MEAN IOU | MEAN DICE | PIXEL ACC. |
|---|---|---|---|---|---|---|
| 3bvEjhOT | 461 | 2 | 896 | 0.9585 | 0.9778 | 0.9967 |
| 7K7WVzGG | 617 | 2 | 1262 | 0.8030 | 0.8391 | 0.9989 |
| cUD1dwuP | 793 | 4 | 2723 | 0.9282 | 0.9588 | 0.9969 |
| EWCZAcdt | 1412 | 2 | 2056 | 0.8333 | 0.9010 | 0.9993 |
| HYSm91eM | 500 | 10 | 4992 | 0.7997 | 0.8452 | 0.9934 |
| Average | 757 | – | 2386 | 0.8645 | 0.9044 | 0.9970 |
| Std dev. | 388.45 | – | 1619.97 | 0.0739 | 0.0635 | 0.0023 |

Figure 4
Qualitative examples of segmentation results on images from the DAVIS 2017 dataset [16]. The columns display the original video frame (left), the ground truth (middle), and the masks predicted by SAMannot (right).

Figure 5
Comparison of semantic segmentation boundaries. From left to right: original frame of the Blackswan sequence, official DAVIS 2017 ground truth, and SAMannot prediction. Note the discrepancy regarding the swan’s feet: while the ground truth excludes them, SAMannot correctly identifies these regions as part of the semantic instance. Such differences contribute to a lower measured Mean IoU and Mean Dice, despite the model providing a more anatomically complete segmentation.

Figure 6
Illustrative frames from the DAVIS sequences analyzed for the performance metrics.

Figure 7
Illustrative frames from the LVOS sequences analyzed for the performance metrics, demonstrating the visual diversity of the dataset.
Table 3
Performance metrics and resource utilization during video annotation on videos from the DAVIS 2017 dataset [16]. Duration encompasses label definition, annotation, and the final data export.
| VIDEO NAME | INST. (#) | FRAMES (#) | Duration (mm:ss) | VRAMmin (MiB) | VRAMmax (MiB) | VRAM (MiB) |
|---|---|---|---|---|---|---|
| night-race | 2 | 46 | 0:47 | 1503 | 2354 | 851 |
| schoolgirls | 7 | 80 | 5:09 | 1536 | 2898 | 1362 |
| train | 4 | 80 | 4:09 | 1512 | 2711 | 1199 |
| tuk-tuk | 3 | 59 | 1:33 | 1509 | 2669 | 1160 |
| sheep | 5 | 68 | 1:35 | 1519 | 2711 | 1192 |
Table 4
Performance metrics and resource utilization during video annotation on videos from the LVOS dataset [17]. Duration encompasses label definition, annotation, and the final data export; ΔVRAM is the difference between the maximum and minimum VRAM usage.
| VIDEO NAME | INST. (#) | FRAMES (#) | DURATION (mm:ss) | VRAMmin (MiB) | VRAMmax (MiB) | ΔVRAM (MiB) |
|---|---|---|---|---|---|---|
| 3bvEjhOT | 2 | 461 | 7:42 | 1521 | 2713 | 1192 |
| 7K7WVzGG | 2 | 617 | 8:30 | 1523 | 2687 | 1164 |
| cUD1dwuP | 4 | 793 | 15:49 | 2112 | 2763 | 651 |
| EWCZAcdt | 2 | 1412 | 19:25 | 2150 | 2843 | 693 |
| HYSm91eM | 10 | 500 | 16:56 | 1530 | 2868 | 1338 |

Figure 8
Examples of ground-truth inconsistencies in LVOS (top) and the consistent annotation achieved with SAMannot (bottom): (a) merging the masks of two distinct players; (b) inclusion of objects that should remain unlabeled, which are moreover incorrectly assigned the same label as another instance; (c) temporal and structural inconsistency: the ball within the player's mask is labeled inconsistently across consecutive frames (alternating between allowing and preventing overlaps).
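A check of the kind that would flag case (c) can be sketched as follows (illustrative code, not part of SAMannot; masks are again sets of pixel coordinates):

```python
def overlapping_instance_pairs(frame_masks):
    """Given {instance_id: set_of_pixels} for one frame, return the
    instance pairs whose masks overlap. A consistent annotation policy
    should either always allow or always forbid such overlaps, so pairs
    that appear only on some frames of a sequence signal inconsistency."""
    ids = sorted(frame_masks)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if frame_masks[a] & frame_masks[b]:
                pairs.append((a, b))
    return pairs

# Frame where the ball's pixels lie inside the player's mask:
frame = {"player": {(5, 5), (5, 6), (6, 5)}, "ball": {(5, 6)}}
overlaps = overlapping_instance_pairs(frame)
```

Running the same check on every frame of a sequence and comparing the resulting pair lists reveals frames where the overlap policy flips.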

Figure 9
A system resource monitor, accessible via a pop-up from the main control window, provides real-time tracking of RAM usage, GPU utilization, and GPU VRAM occupancy.
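A monitor of this kind can obtain the GPU figures by polling `nvidia-smi`; the sketch below shows this for the first GPU (an illustration only, not SAMannot's implementation; the query flags assume an NVIDIA driver with `nvidia-smi` on the PATH):

```python
import subprocess

def parse_gpu_sample(csv_line):
    """Parse one line of `nvidia-smi --query-gpu=memory.used,utilization.gpu
    --format=csv,noheader,nounits`, e.g. "2354, 37" -> (2354, 37)."""
    used_mib, util_pct = (int(v.strip()) for v in csv_line.split(","))
    return used_mib, util_pct

def gpu_sample():
    """Query the first GPU; returns (VRAM used in MiB, utilization in %)."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=memory.used,utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_gpu_sample(out.splitlines()[0])

# The parser can be exercised without a GPU:
sample = parse_gpu_sample("2354, 37")
```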

Figure 10
Illustration of the user guide windows (A).

Figure 11
Illustration of the user guide windows (B).

Figure 12
Qualitative examples of segmentation results on images from the DAVIS 2017 dataset [16]. The columns display the original video frame (left), the ground truth (middle), and the masks predicted by SAMannot (right).
