Abstract
Remote sensing cross-modal image-text retrieval (RSCIR) can flexibly and subjectively retrieve remote sensing images with query text, and it has received increasing attention from researchers in recent years. However, as the parameter volume of visual-language pre-training models keeps growing, direct transfer learning consumes a substantial amount of computational and storage resources. Moreover, recently proposed parameter-efficient transfer learning methods mainly focus on the reconstruction of channel features, ignoring the spatial features which are vital for modeling key entity relationships. To address these issues, we design an efficient transfer learning framework for RSCIR based on spatial feature efficient reconstruction (SPER). A concise and efficient spatial adapter is introduced to enhance the extraction of spatial relationships. The spatial adapter is able to spatially reconstruct the features in the backbone with few parameters while incorporating prior information from the channel dimension. We conduct quantitative and qualitative experiments on two commonly used RSCIR datasets. Compared with traditional methods, our approach achieves an improvement of 3%-11% in the sumR metric. Compared with methods fine-tuning all parameters, our proposed method trains less than 1% of the parameters while maintaining about 96% of the overall performance. The relevant code and files are released at https://github.com/AICyberTeam/SPER.
In recent years, the exponential growth of remote sensing (RS) data and advances in processing techniques have greatly expanded human perceptual capabilities and opened up prospects for many applications, such as ecological monitoring, land planning, and disaster prediction.
The current mainstream in RSCIR is the end-to-end retrieval method based on embedding vectors.
According to how multimodal features interact, there are two main representative categories: dual-stream and single-stream. Dual-stream methods encode multimodal features independently without interaction; representative methods include VSE++.
Meanwhile, the emergence of large-scale visual-language pre-training (VLP) models has provided new insights for RSCIR.
Although RSCIR has achieved some success, it still confronts several challenges. Firstly, in the RS domain, training a VLP model from scratch requires considerable computational resources and annotated data.
Moreover, recently proposed parameter-efficient transfer learning methods mainly reconstruct features in the channel dimension by up-sampling and down-sampling, ignoring the spatial features that are vital for modeling the relationships among key entities in RS scenes.
To address these issues, an efficient transfer learning framework for RSCIR is proposed, which is based on spatial feature efficient reconstruction (SPER). First, to enhance spatial relationship extraction and reduce computational consumption, we introduce a concise and efficient spatial adapter that reconstructs image-text features in the spatial dimension and integrates prior information from the channel dimension. By partitioning the cross-modal features in the channel dimension, we obtain features that contain both spatial and a priori channel information. Differing from traditional methods, SPER reduces the volume of additional parameters introduced by the down-sampling and up-sampling processes. The proposed spatial adapters are then inserted into the backbone of the VLP model. During training, SPER freezes the parameters of the backbone and only updates the parameters of the inserted spatial adapters. The main process of our method is illustrated in Fig.1.

Fig.1 Pipeline of the proposed SPER
The main contributions of this work are summarized as follows:
(1) Different from traditional methods based on fine-tuning all parameters, we propose an innovative and efficient transfer learning framework for RSCIR, which reduces the consumption of computational and storage resources.
(2) To bridge the gap between different domains, we design the spatial adapter to efficiently reconstruct multimodal features in the spatial dimension and achieve superior performance.
(3) We have conducted quantitative and qualitative experiments on different publicly available datasets, demonstrating the effectiveness of our approach.
Consistent with the idea of contrastive learning, SPER adopts a dual-stream architecture that maps images and query texts into a shared embedding space, where matched image-text pairs are pulled together and mismatched pairs are pushed apart.
For simplicity, residual connections are ignored in the following formulations.
The vision transformer (ViT) is adopted as the visual encoder, and its input sequence is constructed as

$$X_0 = \left[x_{\mathrm{cls}};\, x_1;\, x_2;\, \cdots;\, x_N\right] + E_{\mathrm{pos}} \tag{1}$$

where $X_0$ is the input to the ViT, $x_{\mathrm{cls}}$ the class token for the image, $x_n$ the $n$th image patch, and $E_{\mathrm{pos}}$ the position embedding added to each token. One fundamental ViT block is modeled as follows
$$H_l = \mathrm{MSA}\big(\mathrm{LN}(X_{l-1})\big) \tag{2}$$

$$X_l = \mathrm{MLP}\big(\mathrm{LN}(H_l)\big) \tag{3}$$

where $\mathrm{LN}(\cdot)$ represents the layer normalization, $\mathrm{MSA}(\cdot)$ the self-attention module in the ViT, and $\mathrm{MLP}(\cdot)$ the multi-layer perceptron; $H_l$ is the hidden feature obtained by $\mathrm{MSA}(\cdot)$, and $X_l$ the output feature of the $l$th ViT block.
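For concreteness, a minimal PyTorch sketch of one ViT block following Eqs. (2), (3) is given below; the dimensions, head count, and module names are illustrative assumptions rather than the exact backbone configuration, and residual connections are omitted to mirror the simplified formulation.

```python
import torch
import torch.nn as nn

class SimplifiedViTBlock(nn.Module):
    """One ViT block as in Eqs. (2), (3): LN -> MSA, then LN -> MLP.
    Residual connections are omitted to mirror the simplified formulation."""
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                       # x: (batch, tokens, dim)
        y = self.ln1(x)
        h, _ = self.attn(y, y, y)               # Eq. (2): H_l = MSA(LN(X_{l-1}))
        return self.mlp(self.ln2(h))            # Eq. (3): X_l = MLP(LN(H_l))

x0 = torch.randn(2, 197, 768)                   # class token + 196 patches of a 224x224 image
print(SimplifiedViTBlock()(x0).shape)           # torch.Size([2, 197, 768])
```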
Similar to the visual feature, the semantic feature $T$ is extracted with BERT. The input token sequence is constructed and encoded as

$$T_0 = \left[t_{\mathrm{bos}};\, W_e w_1;\, W_e w_2;\, \cdots;\, W_e w_M;\, t_{\mathrm{eos}}\right] + T_{\mathrm{pos}} \tag{4}$$

$$T = \mathcal{E}_t\left(T_0\right) \tag{5}$$

where $w_m$ represents the $m$th word in the caption, and $N_v$ the vocabulary size of BERT; $T_{\mathrm{pos}}$ is the positional embedding vector, $W_e \in \mathbb{R}^{d \times N_v}$ the word embedding matrix, $t_{\mathrm{bos}}$ the beginning of sentence token, and $t_{\mathrm{eos}}$ the end of sentence token; $M$ represents the length of tokens, and the semantic encoder is denoted by $\mathcal{E}_t$.
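The assembly of the token sequence in Eq. (4) can be sketched as follows; the vocabulary size, embedding dimension, token indices, and the zero-initialized special tokens are placeholders for illustration only.

```python
import torch
import torch.nn as nn

vocab_size, dim, max_len = 30522, 512, 32             # illustrative sizes, not BERT's real config
word_embedding = nn.Embedding(vocab_size, dim)         # word embedding matrix W_e
positional = nn.Parameter(torch.zeros(max_len, dim))   # positional embedding T_pos
bos = torch.zeros(1, dim)                              # hypothetical BOS token embedding t_bos
eos = torch.zeros(1, dim)                              # hypothetical EOS token embedding t_eos

word_ids = torch.tensor([101, 2003, 1037])             # hypothetical indices of w_1 ... w_M
tokens = torch.cat([bos, word_embedding(word_ids), eos], dim=0)   # Eq. (4): concatenation
tokens = tokens + positional[: tokens.size(0)]                     # add positional embeddings
# Eq. (5): `tokens` is then fed to the (frozen) BERT encoder to obtain the semantic feature T.
print(tokens.shape)                                    # torch.Size([5, 512])
```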
To address the challenges presented above, the innovative aspects of our proposed method are: (1) Compared with recent efficient transfer learning methods, our approach enhances the extraction of spatial relationships in RS images. It leverages fewer parameters for the efficient reconstruction of multimodal features in the spatial dimension. (2) Compared with traditional RSCIR methods, our SPER framework is more concise and efficient. We only specify a limited number of parameters to be involved in backpropagation and updates. Further details are provided below.
Compared with natural scenes, RS scenes are characterized by greater complexity and variability in scale. Traditional parameter-efficient transfer learning methods reconstruct features only in the channel dimension through down-sampling and up-sampling, which can be formulated as

$$\hat{F} = \sigma\left(F W_{\mathrm{down}}\right) W_{\mathrm{up}} \tag{6}$$

where $F$ represents the original visual feature, $\sigma(\cdot)$ the activation function, and $\hat{F}$ the reconstructed feature; $W_{\mathrm{down}}$ and $W_{\mathrm{up}}$ represent the down-sampling matrix and the up-sampling matrix, respectively.

Fig.2 Comparison of feature reconstruction between traditional methods and the spatial adapter
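As a point of reference for the comparison in Fig.2, the following is a minimal sketch of the traditional channel-wise bottleneck adapter described by Eq. (6); the feature and bottleneck dimensions are assumed values.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Traditional channel-wise adapter of Eq. (6): down-project, activate, up-project."""
    def __init__(self, dim=768, bottleneck=64):       # assumed dimensions
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)        # down-sampling matrix W_down
        self.act = nn.GELU()                          # activation sigma
        self.up = nn.Linear(bottleneck, dim)          # up-sampling matrix W_up

    def forward(self, f):                             # f: (batch, tokens, dim)
        return self.up(self.act(self.down(f)))        # reconstructed feature F_hat

adapter = BottleneckAdapter()
print(adapter(torch.randn(2, 197, 768)).shape)        # torch.Size([2, 197, 768])
print(sum(p.numel() for p in adapter.parameters()))   # roughly 0.1 million parameters per adapter
```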
To address the issue mentioned above, we propose the spatial adapter (SPA), which can better handle the complexity of RS scenes and the scale variability of valuable targets. The details of the SPA are shown in Fig.2. Specifically, the visual feature $F$ is first partitioned in the channel dimension into the sequence $\left[F_1, F_2, \cdots, F_N\right]$, and the process of spatial reconstruction can be expressed as
$$\hat{F} = \left[\phi\left(F_1\right);\, \phi\left(F_2\right);\, \cdots;\, \phi\left(F_N\right)\right] \tag{7}$$
Different from traditional methods, we do not simply employ down-sampling and up-sampling for feature reconstruction. Instead, we complete the reconstruction by applying cross-correlation between the features containing spatial information and the reconstruction matrix. By utilizing cross-correlation, our method efficiently captures spatial dependencies while avoiding excessive parameter overhead, as shown in

$$\hat{F}_m = \phi\left(F_n\right) = F_n \star W_m + b_m \tag{8}$$

where $F_n$ is the $n$th visual feature after partitioning, $\phi(\cdot)$ the spatial reconstruction function, $\star$ the valid cross-correlation operator, $W_m$ the reconstruction weight matrix in the spatial adapter, and $b_m$ the bias parameter; $\hat{F}_m$ is the visual feature obtained by the $m$th spatial reconstruction matrix.
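A plausible PyTorch realization of the spatial reconstruction in Eqs. (7), (8) is sketched below, expressing the channel partition as a grouped one-dimensional cross-correlation along the token axis; the kernel size, the per-channel grouping, and the "same" padding are our assumptions for illustration and may differ from the released implementation.

```python
import torch
import torch.nn as nn

class SpatialAdapter(nn.Module):
    """Sketch of the spatial reconstruction in Eqs. (7), (8): the feature is split into
    per-channel groups and each group is reconstructed by cross-correlating a small
    weight matrix W_m (plus bias b_m) along the token/spatial axis."""
    def __init__(self, dim=768, kernel_size=3):
        super().__init__()
        # groups=dim realizes the channel partition: every channel group owns its own
        # small reconstruction kernel.  Conv1d in PyTorch computes cross-correlation.
        # "Same" padding keeps the token count unchanged (the paper's valid-correlation
        # arrangement may handle boundaries differently).
        self.reconstruct = nn.Conv1d(dim, dim, kernel_size,
                                     padding=kernel_size // 2, groups=dim)

    def forward(self, f):                            # f: (batch, tokens, dim)
        f = f.transpose(1, 2)                        # -> (batch, dim, tokens)
        f = self.reconstruct(f)                      # F_n cross-correlated with W_m, plus b_m
        return f.transpose(1, 2)                     # -> (batch, tokens, dim)

adapter = SpatialAdapter()
print(adapter(torch.randn(2, 197, 768)).shape)       # torch.Size([2, 197, 768])
print(sum(p.numel() for p in adapter.parameters()))  # 3072, i.e. about 0.003 million per adapter
```

Such a module adds only a few thousand parameters per insertion point, in line with the goal of reconstructing spatial features with few parameters, and the same reconstruction can be reused for the partitioned semantic tokens discussed next.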
In order to decrease the gap between different domains and reduce the difficulty of cross-modal alignment, we perform a similar global reconstruction for semantic features. Similar to the visual feature $F$, the semantic feature $T$ is first partitioned in the channel dimension to obtain the global feature sequence $\left[T_1, T_2, \cdots, T_N\right]$. After efficient reconstruction, semantic features with global information are finally obtained, which can be expressed as

$$\hat{T}_m = \phi\left(T_n\right) = T_n \star W_m + b_m \tag{9}$$

where $T_n$ is the $n$th global semantic feature after partitioning, and $\hat{T}_m$ the $m$th token after reconstruction.
Finally, we employ the class token from the last visual encoder as the visual embedding vector $v$ and the beginning of sentence (BOS) token from the last semantic encoder as the semantic embedding vector $t$. They are mapped to the $d$-dimensional hypersphere space after L2 normalization.
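A small sketch of how the two embedding vectors can be extracted and mapped onto the hypersphere; the token positions and embedding dimension are illustrative.

```python
import torch
import torch.nn.functional as F

visual_tokens = torch.randn(2, 197, 512)      # outputs of the last visual encoder layer
text_tokens = torch.randn(2, 32, 512)         # outputs of the last semantic encoder layer

v = F.normalize(visual_tokens[:, 0], dim=-1)  # class token -> visual embedding v on the hypersphere
t = F.normalize(text_tokens[:, 0], dim=-1)    # BOS token -> semantic embedding t on the hypersphere
print(v.shape, t.shape)                       # torch.Size([2, 512]) torch.Size([2, 512])
```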
Moreover, to reduce the consumption of computational and storage resources, SPER freezes the parameters of the backbone and only updates the parameters of the proposed SPAs during transfer learning. Following the same procedure as previous parameter-efficient methods, the update rules can be written as

$$\theta_{n+1} = \theta_n \tag{10}$$

$$\omega_{n+1} = \omega_n - \eta \frac{\partial L}{\partial \omega_n} \tag{11}$$

where $\theta_n$ denotes the parameter of the backbone at iteration $n$, $\omega_n$ the parameter of the spatial adapter at iteration $n$, $\eta$ the step size of the parameter update, and $\partial L / \partial \omega_n$ the derivative of the loss function $L$ with respect to the parameter $\omega_n$.
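The freezing strategy of Eqs. (10), (11) can be sketched as follows, with a dummy two-module model standing in for the real backbone and adapters; the module names, optimizer, and learning rate are assumptions.

```python
import torch
import torch.nn as nn

# Dummy stand-in: one frozen "backbone" module plus one trainable spatial adapter.
model = nn.ModuleDict({
    "backbone": nn.Linear(512, 512),
    "spatial_adapter": nn.Conv1d(512, 512, 3, padding=1, groups=512),
})

for name, param in model.named_parameters():
    param.requires_grad = "adapter" in name            # Eq. (10): backbone stays frozen

optimizer = torch.optim.AdamW(                         # Eq. (11): only adapters are updated
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)
print(sum(p.numel() for p in model.parameters() if p.requires_grad))   # 2048 trainable parameters
```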
Our objective is to pull positive paired samples as close as possible and push negative paired samples as far away as possible. Instead of the traditional triplet loss, we employ the InfoNCE loss in contrastive learning, which is formulated as

$$L = -\frac{1}{2B}\sum_{n=1}^{B}\left[\log \frac{\exp\!\left(s\left(I_n, C_n\right)/\tau\right)}{\sum_{k=1}^{B}\exp\!\left(s\left(I_n, C_k\right)/\tau\right)} + \log \frac{\exp\!\left(s\left(I_n, C_n\right)/\tau\right)}{\sum_{k=1}^{B}\exp\!\left(s\left(I_k, C_n\right)/\tau\right)}\right] \tag{12}$$

where $s(I, C)$ is the cosine similarity between image $I$ and caption $C$, $\tau$ the temperature coefficient, and $B$ the batch size; $I_n$ and $C_n$ represent the $n$th image and caption in the current batch, respectively. During backpropagation, only the parameters of the proposed spatial adapters are updated.
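A compact sketch of a symmetric InfoNCE loss consistent with Eq. (12); the batch size, embedding dimension, and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(v, t, tau=0.07):
    """Symmetric InfoNCE over L2-normalized image (v) and text (t) embeddings, Eq. (12)."""
    logits = v @ t.T / tau                      # s(I_n, C_k) / tau for every pair in the batch
    labels = torch.arange(v.size(0))            # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +      # image-to-text direction
                  F.cross_entropy(logits.T, labels))     # text-to-image direction

v = F.normalize(torch.randn(8, 512), dim=-1)    # batch of visual embeddings
t = F.normalize(torch.randn(8, 512), dim=-1)    # batch of semantic embeddings
print(info_nce(v, t))
```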
RSICD and RSITMD are two commonly used RSCIR datasets. The RSICD dataset contains 10 921 RS images of various resolutions and 54 605 query texts, and the image size is 224 pixel × 224 pixel. The RSITMD dataset includes 4 743 images of different resolutions and 23 715 query texts, with an image size of 256 pixel × 256 pixel. The pixel resolution is about 0.5 m to 20 m.
The UCM Captions and Sydney datasets were not considered due to their small sample sizes and single resolutions, with only 2 100 and 613 samples, respectively, and resolutions of 0.3 m and 0.5 m.
We conducted qualitative and quantitative experiments on the RSICD and RSITMD datasets. To ensure the fairness and reproducibility of the experiments, we follow the same dataset partitioning as in previous works. The recall at top-$K$ (R@$K$, $K$ = 1, 5, 10) of both text retrieval and image retrieval is reported, and the overall metric sumR is defined as

$$\mathrm{sumR} = \sum_{K \in \{1, 5, 10\}} \left(\mathrm{R@}K_{\mathrm{text}} + \mathrm{R@}K_{\mathrm{image}}\right) \tag{13}$$
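A sketch of how R@K and sumR in Eq. (13) can be computed from a query-candidate similarity matrix; it assumes a single ground-truth match per query, whereas RSICD and RSITMD actually provide five captions per image, so the real evaluation code differs in that detail.

```python
import torch

def recall_at_k(sim, ks=(1, 5, 10)):
    """R@K from a query-by-candidate similarity matrix; query i is assumed to match
    candidate i (a simplification of the real five-captions-per-image protocol)."""
    order = sim.argsort(dim=1, descending=True)                 # ranked candidate indices
    gt = torch.arange(sim.size(0)).unsqueeze(1)
    rank_of_gt = (order == gt).float().argmax(dim=1)            # position of the true match
    return {k: (rank_of_gt < k).float().mean().item() * 100 for k in ks}

sim = torch.randn(100, 100)                  # dummy similarity scores between queries and candidates
text_retrieval = recall_at_k(sim)            # image queries retrieving captions
image_retrieval = recall_at_k(sim.T)         # caption queries retrieving images
sum_r = sum(text_retrieval.values()) + sum(image_retrieval.values())   # Eq. (13)
print(text_retrieval, image_retrieval, sum_r)
```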
We compare our approach with previous excellent methods, including traditional RSCIR methods and methods transferred from the VLP model (CLIP). The traditional methods include HVSA, LW-MCR, AMFMN, MCRN, SWAN, and GaLR; the CLIP-based methods include linear probing, Cross-Modal Adapter, Adapter, and full fine-tuning, among others. The comparison results on the RSICD dataset are listed in Table 1.
Table 1 Retrieval results on the RSICD dataset

| Type | Method | Training parameters/10⁶ | Text R@1 | Text R@5 | Text R@10 | Image R@1 | Image R@5 | Image R@10 | sumR |
|---|---|---|---|---|---|---|---|---|---|
| Traditional | LW-MCR | 1.65 | 4.39 | 13.35 | 20.29 | 4.30 | 18.85 | 32.34 | 93.52 |
| Traditional | AMFMN-sim | 35.94 | 5.21 | 14.72 | 21.57 | 4.08 | 17.00 | 30.60 | 93.18 |
| Traditional | MCRN | 52.35 | 6.59 | 19.40 | 30.28 | 5.03 | 19.38 | 32.99 | 113.67 |
| Traditional | SWAN | - | 7.41 | 20.13 | 30.86 | 5.56 | 22.26 | 37.41 | 123.63 |
| Traditional | GaLR with MR | 46.89 | 6.59 | 19.85 | 31.04 | 4.69 | 19.48 | 32.13 | 113.78 |
| Transferred | Single Language | 151.00 | 10.70 | 29.64 | 41.53 | 9.14 | 28.96 | 44.59 | 164.56 |
| Transferred | Linear probe | 0.53 | 8.46 | 24.41 | 37.72 | 7.81 | 25.89 | 42.47 | 146.76 |
| Transferred | RS-light | 9.20 | 6.67 | 18.92 | 28.42 | 8.94 | 26.45 | 41.06 | 130.46 |
| Transferred | TGKT | 4.70 | 8.69 | 24.52 | 37.15 | 6.61 | 24.74 | 39.71 | 141.42 |
| Transferred | Cross-Modal Adapter | 0.16 | 11.18 | 27.31 | 40.62 | 9.57 | 30.74 | 48.36 | 167.78 |
| Transferred | Full fine-tuning | 151.00 | 13.54 | 30.83 | 43.46 | 11.55 | 33.14 | 49.83 | 182.35 |
| Transferred | Adapter | 2.57 | 12.99 | 28.63 | 42.54 | 9.84 | 30.74 | 45.92 | 170.66 |
| Transferred | CLIP (ViT-B-16) | 0.00 | 6.67 | 17.65 | 26.44 | 7.33 | 22.15 | 33.57 | 113.81 |
| Transferred | Adapter (ViT-B-16) | 2.57 | 14.36 | 31.65 | 44.46 | 11.60 | 32.68 | 48.32 | 183.07 |
| Transferred | Ours | 0.18 | 14.36 | 30.19 | 43.73 | 10.57 | 30.52 | 46.03 | 175.40 |
| Transferred | Ours (ViT-B-16) | 0.60 | 16.01 | 33.57 | 46.11 | 11.82 | 31.94 | 47.77 | 187.22 |
First, compared with traditional methods on the RSICD dataset, we achieve a significant performance lead, which we attribute to the powerful visual-semantic extraction capability of the pre-trained model. Additionally, compared with CLIP-based methods, our approach requires fewer training parameters and exhibits superior overall performance. Compared with the baseline Adapter method, SPER only needs to train 0.18 million parameters, while the Adapter method requires training 2.57 million parameters. Importantly, the sumR metric of SPER leads the Adapter method by 4.74 points on the RSICD dataset. In our opinion, the advantage of SPER lies in its ability to model and extract spatial relationships, whereas the Adapter method mainly focuses on channel features. Furthermore, compared with the full fine-tuning method, SPER only needs to train less than 1% of the parameters to achieve 96% of its performance, demonstrating the efficiency of our proposed approach.
Our method also performs well on the RSITMD dataset, as shown in Table 2.
Table 2 Retrieval results on the RSITMD dataset

| Type | Method | Training parameters/10⁶ | Text R@1 | Text R@5 | Text R@10 | Image R@1 | Image R@5 | Image R@10 | sumR |
|---|---|---|---|---|---|---|---|---|---|
| Traditional | LW-MCR | 1.65 | 9.73 | 26.77 | 37.61 | 9.25 | 34.07 | 54.03 | 171.46 |
| Traditional | AMFMN-sim | 35.94 | 10.63 | 24.78 | 41.81 | 11.51 | 34.69 | 54.87 | 178.29 |
| Traditional | MCRN | 52.35 | 13.27 | 29.42 | 41.59 | 9.42 | 35.53 | 52.74 | 181.97 |
| Traditional | SWAN | - | 13.35 | 32.15 | 46.90 | 11.24 | 40.40 | 60.60 | 204.64 |
| Traditional | GaLR with MR | 46.89 | 14.82 | 31.64 | 42.48 | 11.15 | 36.68 | 51.68 | 188.45 |
| Transferred | Single Language | 151.00 | 19.69 | 40.26 | 54.42 | 17.61 | 49.73 | 66.59 | 248.30 |
| Transferred | Linear probe | 0.53 | 13.71 | 33.41 | 48.01 | 10.97 | 36.85 | 56.15 | 199.10 |
| Transferred | RS-light | 9.20 | 12.61 | 31.85 | 46.23 | 12.92 | 38.98 | 60.08 | 202.67 |
| Transferred | TGKT | 4.70 | 17.92 | 36.95 | 52.88 | 12.83 | 43.14 | 62.48 | 226.20 |
| Transferred | Cross-Modal Adapter | 0.16 | 18.16 | 36.08 | 48.72 | 16.31 | 44.33 | 64.75 | 228.35 |
| Transferred | Full fine-tuning | 151.00 | 24.16 | 47.12 | 61.28 | 20.40 | 50.53 | 68.54 | 272.03 |
| Transferred | Adapter | 2.57 | 21.01 | 41.59 | 53.76 | 16.94 | 46.19 | 64.02 | 243.51 |
| Transferred | CLIP (ViT-B-16) | 0.00 | 8.84 | 23.45 | 36.28 | 9.86 | 34.38 | 49.38 | 162.19 |
| Transferred | Adapter (ViT-B-16) | 2.57 | 23.67 | 40.92 | 52.65 | 15.35 | 46.72 | 65.35 | 244.66 |
| Transferred | Ours | 0.18 | 21.46 | 43.36 | 54.42 | 16.81 | 45.88 | 62.96 | 244.89 |
| Transferred | Ours (ViT-B-16) | 0.60 | 23.45 | 42.47 | 52.87 | 15.48 | 47.38 | 65.84 | 247.49 |
We explored the effect of the channel division step on the proposed SPER, and the experimental results are shown in Table 3.
Table 3 Effect of the channel division step on the RSITMD dataset

| Division step | Training parameters/10⁶ | Text R@1 | Text R@5 | Text R@10 | Image R@1 | Image R@5 | Image R@10 | sumR |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.18 | 21.46 | 43.36 | 54.42 | 16.81 | 45.88 | 62.96 | 244.89 |
| 3 | 0.43 | 20.79 | 42.03 | 53.54 | 18.23 | 45.93 | 63.32 | 243.84 |
| 5 | 0.67 | 20.57 | 42.69 | 53.53 | 17.96 | 46.01 | 63.67 | 244.43 |

Fig.3 Retrieval cases of SPER on the RSITMD test set
Table 4 Comparison of time consumption

| Type | Method | TT/s | ET/s | IT/ms |
|---|---|---|---|---|
| Traditional | AMFMN | 47.83 | 4.79 | 1.76 |
| Traditional | GaLR | 50.18 | 4.85 | 1.78 |
| Transferred | CLIP | 101.04 | 5.66 | 2.08 |
| Transferred | Adapter | 79.39 | 5.86 | 2.16 |
| Transferred | SPER | 76.40 | 5.71 | 2.10 |
Compared with the traditional GaLR method, SPER's training time (TT) increases by 26.22 s and its inference time (IT) increases by 0.32 ms. However, considering the significant improvement in SPER's retrieval performance, we believe the additional time consumption is acceptable. Compared with the CLIP method, SPER reduces the TT by 24.3%, while the ET and IT remain at the same level. Compared with the baseline Adapter method, SPER benefits from the efficient reconstruction of spatial features, leading to superior retrieval performance along with improved training and inference efficiency.
One potential limitation of SPER is its performance in aligning fine-grained information, particularly regarding quantities. As demonstrated in subsection 2.5, SPER is not always accurate when retrieving RS images based on the number of entities described in the queries. This suggests that while SPER performs well in general retrieval tasks, there is room for improvement in its ability to model and retrieve precise numerical or quantity-based details.
Another limitation is SPER's efficiency when processing high-resolution RS images (e.g., 10 000 pixel × 10 000 pixel), which must be sliced before retrieval. As discussed in subsection 2.6, compared with traditional CNN-based approaches such as AMFMN and GaLR, SPER may require more computational resources, potentially affecting inference speed. Thus, the trade-off between retrieval performance and efficiency is an important area for further exploration.
(1) We propose an efficient spatial feature reconstruction framework for RSCIR, which significantly reduces the consumption of computational and storage resources. Compared with the baseline of fine-tuning all parameters in the VLP model, our framework requires training only 0.18 million parameters (less than 1%) to achieve 96% of the baseline performance, while reducing the training time by 24.3%.
(2) To bridge the gap between different domains, our designed spatial adapter efficiently models and extracts spatial relationships from multimodal features. In terms of retrieval performance, SPER leads similar methods by at least 2.7%.
(3) As discussed in the limitations section, our future research will focus on addressing the challenges of improving fine-grained retrieval and reducing computational demands during inference.
Contributions Statement
Mr. ZHANG Weihang designed the study, developed the methodology, interpreted the results, and wrote the manuscript. Mr. CHEN Jialiang conducted validation and contributed to the writing, review, and editing of the manuscript. Prof. ZHANG Wenkai managed project administration, organized resources, and supported the study’s progress. Prof. LI Xinming conducted validation and contributed to the manuscript through review and editing. Prof. GAO Xin supervised the research process and reviewed the manuscript. Prof. SUN Xian provided funding support and critical insights into the methodology. All authors commented on the manuscript draft and approved the submission.
Acknowledgements
This work was supported by the National Key R&D Program of China (No.2022ZD0118402).
Conflict of Interest
The authors declare no competing interests.
References
ZHOU W, GUAN H, LI Z, et al. Remote sensing image retrieval in the past decade: Achievements, challenges, and future directions[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2023, 16: 1447-1473.
YAN J, YU L, XIA C, et al. Super-resolution inversion and reconstruction of remote sensing image of unknown infrared band of interest[J]. Transactions of Nanjing University of Aeronautics & Astronautics, 2023, 40(4): 472-486.
CAO M, LI S, LI J, et al. Image-text retrieval: A survey on recent research and development[EB/OL]. (2022-03-18). https://arxiv.org/abs/2203.14713.
TANG X, WANG Y, MA J, et al. Interacting-enhancing feature transformer for cross-modal remote-sensing image and text retrieval[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 1-15.
FAGHRI F, FLEET D J, KIROS J R, et al. VSE++: Improving visual-semantic embeddings with hard negatives[EB/OL]. (2018-07-18). https://arxiv.org/abs/1707.05612.
ZHANG W, LI J, LI S, et al. Hypersphere-based remote sensing cross-modal text-image retrieval via curriculum learning[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 1-15.
YUAN Z, ZHANG W, FU K, et al. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 4404119.
PAN J, MA Q, BAI C. Reducing semantic confusion: Scene-aware aggregation network for remote sensing cross-modal retrieval[C]//Proceedings of the 2023 ACM International Conference on Multimedia Retrieval. [S.l.]: ACM, 2023: 398-406.
WEN C, HU Y, LI X, et al. Vision-language models in remote sensing: Current progress and future trends[J]. IEEE Geoscience and Remote Sensing Magazine, 2024, 12(2): 32-66.
RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//Proceedings of International Conference on Machine Learning. [S.l.]: PMLR, 2021: 8748-8763.
LI J, SELVARAJU R, GOTMARE A, et al. Align before fuse: Vision and language representation learning with momentum distillation[J]. Advances in Neural Information Processing Systems, 2021, 34: 9694-9705.
LI J, LI D, SAVARESE S, et al. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models[C]//Proceedings of International Conference on Machine Learning. [S.l.]: PMLR, 2023: 19730-19742.
YUAN Y, ZHAN Y, XIONG Z. Parameter-efficient transfer learning for remote sensing image-text retrieval[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 1-14.
HOULSBY N, GIURGIU A, JASTRZEBSKI S, et al. Parameter-efficient transfer learning for NLP[C]//Proceedings of International Conference on Machine Learning. [S.l.]: PMLR, 2019: 2790-2799.
JIANG H, ZHANG J, HUANG R, et al. Cross-modal adapter for text-video retrieval[EB/OL]. (2022-11-17). https://arxiv.org/abs/2211.09623.
HE K, FAN H, WU Y, et al. Momentum contrast for unsupervised visual representation learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA: IEEE, 2020: 9729-9738.
DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale[EB/OL]. (2021-06-03). https://arxiv.org/abs/2010.11929.
DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[EB/OL]. (2018-10-24). https://arxiv.org/abs/1810.04805.
YUAN Z, ZHANG W, RONG X, et al. A lightweight multi-scale cross modal text-image retrieval method in remote sensing[J]. IEEE Transactions on Geoscience and Remote Sensing, 2021, 60: 1-19.
YUAN Z, ZHANG W, TIAN C, et al. MCRN: A multi-source cross-modal retrieval network for remote sensing[J]. International Journal of Applied Earth Observation and Geoinformation, 2022, 115: 103071.
YUAN Z, ZHANG W, TIAN C, et al. Remote sensing cross-modal text-image retrieval based on global and local information[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 1-16.
AL RAHHAL M M, BAZI Y, ALSHARIF N A, et al. Multilanguage transformer for improved text to remote sensing image retrieval[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2022, 15: 9115-9126.
LIAO Y, YANG R, XIE T, et al. A fast and accurate method for remote sensing image-text retrieval based on large model knowledge distillation[C]//Proceedings of IGARSS 2023 IEEE International Geoscience and Remote Sensing Symposium. Pasadena, CA, USA: IEEE, 2023: 5077-5080.
LIU A A, YANG B, LI W, et al. Text-guided knowledge transfer for remote sensing image-text retrieval[J]. IEEE Geoscience and Remote Sensing Letters, 2024, 21: 3504005.