Abstract: Compared with remote sensing image (RSI) captioning methods based on the traditional encoder–decoder model, two-stage RSI captioning methods include an auxiliary remote sensing task to ...
The main novelty seems to be an extra layer of indirection with the prior network (whether it is an autoregressive transformer or a diffusion network), which predicts an image embedding based on the ...