Abstract: Remote sensing image retrieval with text feedback (RSIR-TF) is a challenging multimodal retrieval task that uses a reference image, a modification text, and a scene graph to retrieve ...
Vision foundation models (VFMs), such as the Segment Anything Model (SAM), enable zero-shot or interactive segmentation of visual content and have therefore been rapidly adopted across a variety of visual scenes.