BenchSeg: A Large-Scale Dataset and Benchmark for Multi-View Food Video Segmentation

Abstract

Food image segmentation is a critical task for dietary analysis, enabling accurate estimation of food volume and nutrients. However, current methods suffer from limited multi-view data and poor generalization to new viewpoints. We introduce BenchSeg, a novel multi-view food video segmentation dataset and benchmark. BenchSeg aggregates 55 dish scenes (from Nutrition5k, Vegetables & Fruits, MetaFood3D, and FoodKit) with 25,284 meticulously annotated frames, capturing each dish under free 360° camera motion. We evaluate a diverse set of 20 state-of-the-art segmentation models (e.g., SAM-based, transformer, CNN, and large multimodal) on the existing FoodSeg103 dataset, and then assess them, alone and combined with video-memory modules, on BenchSeg. Quantitative and qualitative results demonstrate that while standard image segmenters degrade sharply under novel viewpoints, memory-augmented methods maintain temporal consistency across frames. Our best model combinations outperform prior work on BenchSeg by ~2.5% mAP, offering new insights into food segmentation and tracking for dietary analysis.

Proposed Methodology

Overview of the proposed three-stage methodology: (1) keyframe segmentation generates initial masks, (2) temporal propagation transfers them across non-key frames using stored features, and (3) late fusion refines masks by combining propagated and fresh predictions, enabling a reproducible and temporally coherent food-segmentation process. Camera poses shown in green indicate cases where the matching accuracy threshold mAP ≥ 95% is satisfied; poses in red denote those falling below this threshold.
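
The caption above describes the pipeline at a high level; the sketch below shows how the three stages could be wired together. It is a minimal illustration in Python, not the released implementation: keyframe_segmenter, memory_tracker, and fuse are hypothetical placeholders for an image segmenter, a video-memory module, and a mask-fusion rule.

    def segment_video(frames, keyframe_segmenter, memory_tracker, fuse, keyframe_interval=10):
        # Illustrative three-stage pipeline: (1) keyframe segmentation,
        # (2) temporal propagation via a memory tracker, (3) late fusion.
        # All callables are hypothetical stand-ins, not the authors' code.
        masks = [None] * len(frames)
        for i, frame in enumerate(frames):
            if i % keyframe_interval == 0:
                # Stage 1: fresh segmentation on keyframes, stored as memory.
                masks[i] = keyframe_segmenter(frame)
                memory_tracker.add(frame, masks[i])
            else:
                # Stage 2: propagate stored masks/features to the non-key frame.
                propagated = memory_tracker.propagate(frame)
                # Stage 3: late fusion of propagated and fresh predictions.
                masks[i] = fuse(propagated, keyframe_segmenter(frame))
        return masks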

Experimental Results

BenchSeg

The figure shows the comparison of camera locations for different methods on a FoodKit object, with cameras colored by mAP: red for ≤50%, orange for 50–75%, yellow for 75–95%, and green for ≥95%, highlighting differences in coverage and reliability across methods.
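
For reference, the pose coloring can be reproduced by thresholding each view's mAP into the four bins named in the caption. The function below is a minimal sketch; the caption's bins touch at 50%, 75%, and 95%, and the tie-breaking at those boundaries is an assumption.

    def pose_color(map_percent):
        # Bin a per-view mAP score (in %) into the figure's color scale.
        if map_percent >= 95:
            return "green"   # reliable viewpoints
        elif map_percent >= 75:
            return "yellow"
        elif map_percent >= 50:
            return "orange"
        else:
            return "red"     # failure cases (mAP <= 50%)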

Segmentor ❤️ Memory Tracking

Comparison of camera locations for the base segmentor, the segmentor with XMem2, and the segmentor with SAM2, with cameras colored by mAP: red for ≤50%, orange for 50–75%, yellow for 75–95%, and green for ≥95%. The visualization highlights how each integration affects segmentation robustness across different viewpoints.
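
Conceptually, each "segmentor + memory" variant seeds a video-memory module with the segmentor's mask on an initial view and lets the module track that mask across the remaining viewpoints; per-view mAP against the annotated ground truth then drives the coloring above. Below is a minimal sketch of this integration, assuming generic segment/initialize/track interfaces rather than the actual XMem2 or SAM2 APIs.

    def segment_with_memory(frames, segmentor, memory_module):
        # Seed the memory module (e.g., an XMem2- or SAM2-style tracker) with the
        # base segmentor's mask on the first view, then propagate it to all others.
        # The interfaces used here are assumptions, not the libraries' real APIs.
        first_mask = segmentor.segment(frames[0])
        memory_module.initialize(frames[0], first_mask)
        masks = [first_mask]
        for frame in frames[1:]:
            masks.append(memory_module.track(frame))
        return masks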

Citation

If you want to cite our work, please use this:

    @article{almughrabi2026BenchSeg,
        title={BenchSeg: A Large-Scale Dataset and Benchmark for Multi-View Food Video Segmentation},
        author={Guillermo Rivo and Carlos Jiménez-Farfán and Umair Haroon and Farid Al-Areqi and Hyunjun Jung and Benjamin Busam and Ricardo Marques and Petia Radeva},
        journal={arXiv preprint arXiv:2601.07581},
        year={2026}
    }