Transactions of Nanjing University of Aeronautics & Astronautics

Efficient Reconstruction of Spatial Features for Remote Sensing Image-Text Retrieval

  • ZHANG Weihang 1,2,3
  • CHEN Jialiang 1,2
  • ZHANG Wenkai 1,2
  • LI Xinming 4
  • GAO Xin 1,3
  • SUN Xian 1,2,3
1. Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, P. R. China; 2. Key Laboratory of Target Cognition and Application Technology (TCAT), Chinese Academy of Sciences, Beijing 100190, P. R. China; 3. School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, P. R. China; 4. School of Computer Science and Artificial Intelligence, Aerospace Information Technology University, Jinan 250299, P. R. China

CLC: TP391.3

Updated:2025-03-14

DOI:http://dx.doi.org/10.16356/j.1005-1120.2025.01.008


Abstract

Remote sensing cross-modal image-text retrieval (RSCIR) can flexibly and subjectively retrieve remote sensing images utilizing query text, and has received increasing attention from researchers recently. However, with the increasing volume of visual-language pre-training model parameters, direct transfer learning consumes a substantial amount of computational and storage resources. Moreover, recently proposed parameter-efficient transfer learning methods mainly focus on the reconstruction of channel features, ignoring the spatial features which are vital for modeling key entity relationships. To address these issues, we design an efficient transfer learning framework for RSCIR, which is based on spatial feature efficient reconstruction (SPER). A concise and efficient spatial adapter is introduced to enhance the extraction of spatial relationships. The spatial adapter is able to spatially reconstruct the features in the backbone with few parameters while incorporating the prior information from the channel dimension. We conduct quantitative and qualitative experiments on two different commonly used RSCIR datasets. Compared with traditional methods, our approach achieves an improvement of 3%-11% in the sumR metric. Compared with methods fine-tuning all parameters, our proposed method only trains less than 1% of the parameters, while maintaining an overall performance of about 96%. The relevant code and files are released at https://github.com/AICyberTeam/SPER.

0 Introduction

In recent years, the exponential growth of remote sensing (RS) data and advances in processing techniques have greatly expanded human perceptual capabilities and opened up prospects for many applications, such as ecological monitoring, land planning, and disaster prediction[1-2]. However, it is still challenging to process and retrieve valuable RS data efficiently. Remote sensing cross-modal image-text retrieval (RSCIR) aims to retrieve RS images utilizing text that describes the content of the image. This content-based retrieval approach has gradually become a research hotspot in recent years[3].

The current mainstream in RSCIR is the end-to-end retrieval method based on embedding vectors[4]. Specifically, in order to directly measure similarity, the end-to-end retrieval approach utilizes the powerful representation capability of neural networks to map data from different modalities into a common hypersphere space. The cross-modal features are aligned through contrastive learning.

According to the different ways of interacting between multimodal features, there are two main representative categories: dual-stream and single-stream. Dual-stream methods encode multimodal features independently without interaction; representative methods include VSE++[5], HVSA[6], etc. Single-stream methods involve the fusion and guidance of cross-modal features during the encoding process; representative methods include AMFMN[7], SWAN[8], etc.

Meanwhile, the emergence of large-scale visual-language pre-training (VLP) models has provided new insights for RSCIR[9]. Recently, there has been rapid development in large-scale multimodal pre-training models, such as CLIP[10], ALBEF[11] and BLIP-2[12]. Instead of training all parameters in a VLP model, parameter-efficient transfer learning methods train only a fraction of the parameters, which significantly reduces computational consumption while maintaining reliable performance[13].

Although RSCIR has seen some achievements, it still confronts some challenges. Firstly, for the RS domain, training a VLP model from scratch requires considerable computational resources and annotated data[14]. The initial CLIP model, for example, was trained on 400 million image-text pairs collected from the internet, a scale that has since grown to 2 billion. Captions in the RS domain mostly rely on manual annotation by professionals, so it is quite challenging to train a VLP model from scratch for the RS domain. Therefore, how to efficiently transfer the prior knowledge of the natural domain to the complex RS domain is worth further exploration.

Moreover, recently proposed parameter-efficient transfer learning methods mainly reconstruct features in the channel dimension by up-sampling and down-sampling[14-15]. This is because most of them are designed to transfer to downstream tasks in the same domain as the VLP model[15]. However, there is an inherent domain gap between RS scenes and natural scenes: RS scenes are complex and targets can vary greatly in scale. Merely reconstructing channel features is insufficient to explore the spatial relationships of instances, making it suboptimal for image-text retrieval.

To address these issues, an efficient transfer learning framework for RSCIR is proposed, which is based on spatial feature efficient reconstruction (SPER). First, to enhance spatial relationship extraction and reduce computational consumption, we introduce a concise and efficient spatial adapter that reconstructs image-text features in the spatial dimension and integrates prior information from the channel dimension. By partitioning the cross-modal features in the channel dimension, we can obtain features that contain both spatial and a priori channel information. Differing from traditional methods, SPER reduces the volume of additional parameters introduced by the down-sampling and up-sampling processes. Then the proposed spatial adapters are inserted into the backbone of the VLP model. During training, SPER freezes the parameters of the backbone and only updates the parameters of the inserted spatial adapters. The main process of our method is illustrated in Fig.1, where SPA represents spatial adapter, LN the layer norm, MHA the multi-head attention, and FFN the feed-forward network. The contributions of this paper can be summarized as follows:

Fig.1  Pipeline of the proposed SPER

(1) Different from traditional methods based on fine-tuning all parameters, we propose an innovative and efficient transfer learning framework for RSCIR, which reduces the consumption of computational and storage resources.

(2) To bridge the gap between different domains, we design the spatial adapter to efficiently reconstruct multimodal features in the spatial dimension and achieve superior performance.

(3) We have conducted quantitative and qualitative experiments on different publicly available datasets, demonstrating the effectiveness of our approach.

1 Methods

1.1 Cross‑modal feature representation

Consistent with the idea of contrastive learning[16], SPER constrains positive pairs to be as close as possible and negative pairs to be as far away as possible. The overall process is illustrated in Fig.1.

For simplicity, residual connections are ignored in Fig.1. We denote the RS image and the query text as $I \in \mathbb{R}^{H \times W \times 3}$ and $C = \{w_m\}_{m=0}^{M}$, respectively, where $H \times W$ is the size of the RS image and $w_m$ is the $m$th word in the query text. For RSCIR, we first encode the RS image $I$ and the corresponding caption $C$ with a multimodal encoder to obtain the visual embedding vector $v$ and the semantic embedding vector $s$. Then, we map the visual embedding vector $v \in \mathbb{R}^{d_v}$ and the semantic embedding vector $s \in \mathbb{R}^{d_s}$ to the common hypersphere space and measure the similarity $\mathcal{S}(I,C)$ by the inner product.

A vision transformer (ViT)[17] is utilized to extract visual features initially. Specifically, we divide the RS image into $N \times N$ patches and add a class token as an aggregate representation of the image, which can be defined as

$$\hat{I} = [I_c, I_0, I_1, \cdots, I_{N^2-1}] + I_p \tag{1}$$

where $\hat{I} \in \mathbb{R}^{(N^2+1) \times d_v}$ is the input to the ViT, $I_c \in \mathbb{R}^{d_v}$ the class token for the image, $I_n \in \mathbb{R}^{(H/N \cdot W/N) \times d_v}$ the $n$th image patch, and $I_p \in \mathbb{R}^{(N^2+1) \times d_v}$ the position embedding added to each token. One fundamental ViT block is modeled as follows

$$v_h = \mathrm{SA}(\mathrm{Norm}(\hat{I})) + \hat{I} \tag{2}$$
$$v_o = \mathrm{MLP}(\mathrm{Norm}(v_h)) + v_h \tag{3}$$

where $\mathrm{Norm}(\cdot)$ represents the layer normalization, $\mathrm{SA}(\cdot)$ the self-attention module in the ViT, $\mathrm{MLP}(\cdot)$ the multi-layer perceptron, $v_h \in \mathbb{R}^{(N^2+1) \times d_v}$ the hidden feature obtained by $\mathrm{SA}(\cdot)$, and $v_o \in \mathbb{R}^{(N^2+1) \times d_v}$ the output feature of the ViT block.
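For illustration, a minimal PyTorch sketch of one such ViT block, following Eqs.(2, 3), is given below; the layer sizes and module names are assumptions, not the authors' released implementation.

```python
# A minimal sketch of one ViT encoder block as written in Eqs.(2, 3);
# dimensions and module names are illustrative placeholders.
import torch
import torch.nn as nn

class ViTBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                        # x: (B, N^2 + 1, d_v)
        n = self.norm1(x)
        h, _ = self.attn(n, n, n)                # SA(Norm(x))
        v_h = h + x                              # Eq.(2): residual connection
        v_o = self.mlp(self.norm2(v_h)) + v_h    # Eq.(3)
        return v_o
```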

Similar to the visual feature $v$, the semantic feature $s$ is extracted with BERT[18]. The query caption is first preprocessed, and a sequence of tokens $[c_{\mathrm{bos}}, c_0, c_1, \cdots, c_M, c_{\mathrm{eos}}]$ is then obtained as

$$c_m = w_m M_e + w_p \tag{4}$$
$$s_o = \mathrm{Trans}(c_{\mathrm{bos}}, c_0, c_1, \cdots, c_M, c_{\mathrm{eos}}) \in \mathbb{R}^{L \times d_s} \tag{5}$$

where $w_m \in \mathbb{R}^{\mathbb{V}}$ represents the $m$th word in the caption, and $\mathbb{V}$ the vocabulary size of BERT; $w_p \in \mathbb{R}^{d_s}$ is the positional embedding vector, $M_e \in \mathbb{R}^{\mathbb{V} \times d_s}$ the word embedding matrix, $c_{\mathrm{bos}} \in \mathbb{R}^{d_s}$ the beginning-of-sentence token, and $c_{\mathrm{eos}} \in \mathbb{R}^{d_s}$ the end-of-sentence token; $L$ represents the length of the token sequence, and the semantic encoder is denoted by $\mathrm{Trans}(\cdot)$.
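A hedged sketch of the text-side encoding in Eqs.(4, 5) is shown below; the vocabulary size, sequence length, and encoder depth are assumed CLIP-like placeholders rather than values taken from the paper.

```python
# Sketch of Eqs.(4, 5): word ids are looked up in the word-embedding matrix M_e,
# a positional embedding w_p is added, and the sequence is passed through a
# Transformer encoder. All sizes below are assumptions for illustration.
import torch
import torch.nn as nn

vocab_size, d_s, max_len = 49408, 512, 77
word_embed = nn.Embedding(vocab_size, d_s)            # M_e
pos_embed = nn.Parameter(torch.zeros(max_len, d_s))   # w_p
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_s, nhead=8, batch_first=True),
    num_layers=12,
)

token_ids = torch.randint(0, vocab_size, (1, max_len))  # [c_bos, c_0, ..., c_eos]
c = word_embed(token_ids) + pos_embed                   # Eq.(4)
s_o = encoder(c)                                        # Eq.(5): shape (1, L, d_s)
```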

To address the challenges presented above, the innovative aspects of our proposed method are: (1) Compared with recent efficient transfer learning methods, our approach enhances the extraction of spatial relationships in RS images. It leverages fewer parameters for the efficient reconstruction of multimodal features in the spatial dimension. (2) Compared with traditional RSCIR methods, our SPER framework is more concise and efficient. We only specify a limited number of parameters to be involved in backpropagation and updates. Further details are provided below.

1.2 Spatial adapter

Compared with natural scenes, RS scenes are characterized by greater complexity and variability in scale. Traditional parameter-efficient transfer learning methods[14-15] that rely solely on channel feature reconstruction are insufficient to capture the relationships between instances in RS images. These methods are designed to transfer prior knowledge to downstream tasks within the same domain as the pre-training task, ignoring transfer learning across domains, e.g., from the natural domain to the RS domain. Thus, they mostly focus on reconstructing channel features, as shown in Fig.2(a). They reconstruct features in the channel dimension by up-sampling and down-sampling, which can be expressed as

$$\tilde{v}_o = \phi(v_o W_{\mathrm{down}}) W_{\mathrm{up}} \tag{6}$$

Fig.2  Comparison of feature reconstruction between traditional methods and the spatial adapter

where $v_o$ represents the original visual feature, $\phi(\cdot)$ the activation function, and $\tilde{v}_o \in \mathbb{R}^{(N^2+1) \times d_v}$ the reconstructed feature; $W_{\mathrm{down}} \in \mathbb{R}^{d_v \times h}$ and $W_{\mathrm{up}} \in \mathbb{R}^{h \times d_v}$ represent the down-sampling matrix and the up-sampling matrix, respectively.
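The following minimal sketch illustrates this traditional channel adapter of Eq.(6); the hidden width and class name are illustrative assumptions.

```python
# A minimal sketch of the channel adapter in Eq.(6): features are down-projected,
# passed through an activation, and up-projected back to d_v.
import torch.nn as nn

class ChannelAdapter(nn.Module):
    def __init__(self, d_v=768, hidden=64):
        super().__init__()
        self.down = nn.Linear(d_v, hidden)   # W_down
        self.up = nn.Linear(hidden, d_v)     # W_up
        self.act = nn.GELU()                 # phi(.)

    def forward(self, v_o):                  # v_o: (B, N^2 + 1, d_v)
        return self.up(self.act(self.down(v_o)))
```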

To address the issue mentioned above, we propose the SPA, which can better handle the complexity of RS scenes and the scale variability of valuable targets. The details are shown as SPA in Fig.1.

Specifically, the process of spatial reconstruction is shown in Fig.2(b). In contrast to existing methods that primarily reconstruct channel features, our spatial adapter explicitly focuses on spatial feature reconstruction, a crucial aspect for handling RS images. It enhances the ability to model and extract spatial relationships while effectively incorporating prior channel information. We innovatively partition the visual feature $v_o$ in the channel dimension, obtaining a sequence of features containing spatial information as

$$v_o = [v_0, v_1, \cdots, v_N] \tag{7}$$

Different from traditional methods, we do not simply employ down-sampling and up-sampling for feature reconstruction. We complete the reconstruction by applying cross-correlation between the features containing spatial information and the reconstruction matrix. By utilizing cross-correlation, our method efficiently captures spatial dependencies while reducing the need for excessive parameter overhead as

$$i_m = r(v_o) = b_m + \sum_{n=0}^{N-1} W_{m,n} \star v_n \tag{8}$$

where $v_n \in \mathbb{R}^{H \times W \times (d_v/N)}$ is the $n$th visual feature after partitioning, $r(\cdot)$ the spatial reconstruction function, $\star$ the valid cross-correlation operator, $W \in \mathbb{R}^{H \times W \times (d_v/N)}$ the reconstruction weight matrix in the spatial adapter, and $b$ the bias parameter; $i_m \in \mathbb{R}^{d_v}$ is the visual feature obtained by the $m$th spatial reconstruction matrix.
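The sketch below illustrates one plausible reading of Eqs.(7, 8): the token features are partitioned along the channel axis and mixed by a valid cross-correlation over the token (spatial) axis, implemented here with a 1D convolution. The partition count, tensor layout, and class name are assumptions for illustration only, not the released SPER code.

```python
# Hedged sketch of the spatial adapter (Eqs.(7, 8)). The channel partitions act as
# convolution channels; the kernel spans all tokens, so the cross-correlation is
# "valid" and collapses the token axis. Shapes are assumptions.
import torch
import torch.nn as nn

class SpatialAdapter(nn.Module):
    def __init__(self, num_tokens=50, d_v=768, n_parts=4):
        super().__init__()
        assert d_v % n_parts == 0
        self.n_parts = n_parts
        # Eq.(8): bias plus a sum of cross-correlations over the input partitions.
        self.reconstruct = nn.Conv1d(n_parts, n_parts, kernel_size=num_tokens)

    def forward(self, v_o):                        # v_o: (B, num_tokens, d_v)
        B, T, D = v_o.shape
        # Eq.(7): partition the channel dimension into n_parts spatial features.
        x = v_o.view(B, T, self.n_parts, D // self.n_parts)
        x = x.permute(0, 3, 2, 1).reshape(B * (D // self.n_parts), self.n_parts, T)
        y = self.reconstruct(x)                    # (B*D/n_parts, n_parts, 1)
        y = y.reshape(B, D // self.n_parts, self.n_parts).transpose(1, 2)
        return y.reshape(B, D)                     # reconstructed feature, (B, d_v)
```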

In order to decrease the gap between different domains and reduce the difficulty of cross-modal alignment, we similarly perform global reconstruction for the semantic features. Similar to the visual feature $v_o$, the semantic feature $s_o$ is first partitioned in the channel dimension to obtain the global feature sequence $[s_0, s_1, \cdots, s_N]$. After efficient reconstruction, semantic features with global information are finally obtained, which can be expressed as

$$t_m = r(s_o) = b_m + \sum_{n=0}^{N-1} W_{m,n} \star s_n \tag{9}$$

where $s_n \in \mathbb{R}^{L \times (d_s/N)}$ is the $n$th global semantic feature after partitioning, and $t_m \in \mathbb{R}^{d_s}$ the $m$th token after reconstruction.

Finally, we employ the class token from the last visual encoder layer as the visual embedding vector $v$ and the beginning-of-sentence (BOS) token from the last semantic encoder layer as the semantic embedding vector $s$. They are mapped to the $d$-dimensional hypersphere space after L2 normalization.
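A brief sketch of this final step is given below: after L2 normalization, the inner product of the two embeddings equals the cosine similarity $\mathcal{S}(I,C)$. The batch size and dimension are placeholders.

```python
# Sketch of the final embedding step: the visual class token and the textual BOS
# token are L2-normalized onto the shared hypersphere, so their inner product is
# the cosine similarity S(I, C).
import torch
import torch.nn.functional as F

d = 512
v_cls = torch.randn(8, d)          # class tokens from the last visual encoder layer
s_bos = torch.randn(8, d)          # BOS tokens from the last semantic encoder layer

v = F.normalize(v_cls, dim=-1)     # L2 normalization
s = F.normalize(s_bos, dim=-1)
similarity = v @ s.T               # S(I, C) for every image-text pair in the batch
```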

1.3 Efficient transfer learning

Moreover, to reduce the consumption of computational and storage resources, SPER freezes the parameters of the backbone and only updates the parameters of the proposed SPAs during transfer learning. Following the same procedure as previous methods[12], the ViT and BERT in SPER are initialized with the pre-training weights of CLIP and serve as the backbone networks that encode images and query text, respectively, shown as

$$\theta_b^{(n+1)} = \theta_b^{(n)} \tag{10}$$
$$\theta_s^{(n+1)} = \theta_s^{(n)} - \eta \frac{\delta L}{\delta \theta_s^{(n)}} \tag{11}$$

where $\theta_b^{(n)}$ denotes the parameters of the backbone at iteration $n$, $\theta_s^{(n)}$ the parameters of the spatial adapters at iteration $n$, $\eta$ the step size of the parameter update, and $\frac{\delta L}{\delta \theta_s^{(n)}}$ the derivative of the loss function $L$ with respect to the parameter $\theta_s^{(n)}$.
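A minimal sketch of this training setup is shown below, assuming the adapters can be identified by a module-name substring such as "spatial_adapter" (a hypothetical identifier, not one from the released code).

```python
# Sketch of Eqs.(10, 11): backbone parameters are frozen and only the spatial
# adapters receive gradient updates.
import torch

def setup_efficient_training(model, lr=5e-4):
    for name, param in model.named_parameters():
        # Eq.(10): theta_b stays fixed; Eq.(11): theta_s is updated by the optimizer.
        param.requires_grad = "spatial_adapter" in name
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```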

Our objective is to draw positive paired samples as close as possible and push negative paired samples as far away as possible. Instead of the traditional triplet loss, we employ the InfoNCE loss from contrastive learning[13]. In a batch of training data containing $N$ paired samples, the alignment loss of one positive pair $(I,C)$ can be expressed as

$$L = -\frac{1}{2}\left[\log_2\frac{\exp\left(\mathcal{S}(I,C)/\tau\right)}{\sum_{n=1}^{N}\exp\left(\mathcal{S}(I,C_n)/\tau\right)} + \log_2\frac{\exp\left(\mathcal{S}(I,C)/\tau\right)}{\sum_{n=1}^{N}\exp\left(\mathcal{S}(I_n,C)/\tau\right)}\right] \tag{12}$$

where $\mathcal{S}(\cdot)$ is the cosine similarity between image $I$ and caption $C$, and $\tau$ the temperature coefficient; $I_n$ and $C_n$ represent the $n$th image and caption in the current batch, respectively. During backpropagation, only the parameters of the proposed spatial adapter are updated.
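Below is a hedged sketch of the symmetric loss in Eq.(12), written in the usual cross-entropy form; it uses the natural logarithm, which differs from the paper's $\log_2$ only by a constant factor, and the default temperature value is an assumption.

```python
# Sketch of the symmetric InfoNCE loss in Eq.(12) via cross-entropy.
import torch
import torch.nn.functional as F

def info_nce_loss(v, s, tau=0.07):
    """v, s: L2-normalized image/text embeddings of shape (N, d), paired by index."""
    logits = v @ s.T / tau                               # S(I_i, C_j) / tau
    targets = torch.arange(v.size(0), device=v.device)   # positives on the diagonal
    # Image-to-text and text-to-image terms of Eq.(12), averaged.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```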

2 Experimentation and Analysis

2.1 Experimental datasets

RSICD and RSITMD are two commonly used RSCIR datasets. The RSICD dataset contains 10 921 RS images of various resolutions and 54 605 query texts, with an image size of 224 pixel×224 pixel. The RSITMD dataset includes 4 743 images of different resolutions and 23 715 query texts, with an image size of 256 pixel×256 pixel. The pixel resolution ranges from about 0.5 m to 20 m.

The UCM Captions and Sydney datasets were not considered due to their small sample sizes and single resolutions, with only 2 100 and 613 samples, respectively, and resolutions of 0.3 m and 0.5 m.

2.2 Experimental implementation details

We conducted qualitative and quantitative experiments on the RSICD and RSITMD datasets. To ensure the fairness and reproducibility of the experiments, we follow the same dataset partitioning as in Ref.[13]. Two metrics, R@K (K = 1, 5, 10) and sumR, are used to evaluate retrieval performance quantitatively. The R@K metric denotes the percentage of queries whose ground truth appears in the first K recalled results. The sumR metric reflects the overall performance of the retrieval and can be calculated by Eq.(13). The optimization algorithm is AdamW. The initial learning rate is set to $5\times10^{-4}$, with a linear warm-up strategy for the first four epochs, and a total of 20 epochs for training. The dimension of the multimodal features is 512. All experiments are conducted on one NVIDIA Tesla V100 GPU.

$$\mathrm{sumR} = \sum_{K \in \{1,5,10\}} R@K \tag{13}$$
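A short sketch of the two metrics is given below; note that Eq.(13) is written for one retrieval direction, while the sumR values reported in Tables 1 and 2 correspond to adding the six R@K scores over both text and image retrieval. Variable names are illustrative.

```python
# R@K is the percentage of queries whose ground truth appears in the top-K
# results; sumR accumulates the R@K values.
import numpy as np

def recall_at_k(ranks, k):
    """ranks: 0-based rank of the ground-truth item for each query."""
    return 100.0 * np.mean(np.asarray(ranks) < k)

def sum_r(text_ranks, image_ranks):
    """sumR over both retrieval directions, as reported in Tables 1 and 2."""
    return sum(recall_at_k(r, k)
               for r in (text_ranks, image_ranks)
               for k in (1, 5, 10))
```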

2.3 Retrieval performance comparison

We compare our approach with previous excellent methods, including traditional RSCIR methods and methods transferred from VLP (CLIP). Traditional methods include HVSA[6], AMFMN[7], SWAN[8], etc. Transfer learning-based methods include Full fine-tuning[10], Adapter[14], Cross-Modal Adapter[15], etc. To demonstrate the effectiveness of the proposed framework, we choose the Full fine-tuning method and the Adapter algorithm as our baselines. We also report the zero-shot capability of the CLIP model on RSCIR, with 0.00 million training parameters.

Table 1 demonstrates the experimental results, where "R" denotes traditional methods and "T" CLIP-based methods. If not specified, the architecture of the visual encoder is always ViT-B-32. With the exception of the Single Language and Full fine-tuning methods, the best results are bolded to provide a better illustration of comparisons between similar methods. Concretely, the Full fine-tuning method for direct transfer learning is based on the CLIP model; all 151.00 million parameters are involved in backpropagation and gradient updates. Methods like the Adapter[14] and Cross-Modal Adapter[15] are employed to efficiently transfer the pre-trained CLIP's prior knowledge to RSCIR.

Table 1  Comparison of cross-modal retrieval performance on RSICD

| Type | Method | Training parameter/10⁶ | Text R@1 | Text R@5 | Text R@10 | Image R@1 | Image R@5 | Image R@10 | sumR |
|------|--------|------------------------|----------|----------|-----------|-----------|-----------|------------|------|
| R | LW-MCR-u[19] | 1.65 | 4.39 | 13.35 | 20.29 | 4.30 | 18.85 | 32.34 | 93.52 |
| R | AMFMN-sim[7] | 35.94 | 5.21 | 14.72 | 21.57 | 4.08 | 17.00 | 30.60 | 93.18 |
| R | MCRN[20] | 52.35 | 6.59 | 19.40 | 30.28 | 5.03 | 19.38 | 32.99 | 113.67 |
| R | SWAN[8] | - | 7.41 | 20.13 | 30.86 | 5.56 | 22.26 | 37.41 | 123.63 |
| R | GaLR with MR[21] | 46.89 | 6.59 | 19.85 | 31.04 | 4.69 | 19.48 | 32.13 | 113.78 |
| T | Single Language[22] | 151.00 | 10.70 | 29.64 | 41.53 | 9.14 | 28.96 | 44.59 | 164.56 |
| T | Linear probe[10] | 0.53 | 8.46 | 24.41 | 37.72 | 7.81 | 25.89 | 42.47 | 146.76 |
| T | RS-light[23] | 9.20 | 6.67 | 18.92 | 28.42 | 8.94 | 26.45 | 41.06 | 130.46 |
| T | TGKT[24] | 4.70 | 8.69 | 24.52 | 37.15 | 6.61 | 24.74 | 39.71 | 141.42 |
| T | Cross-Modal Adapter[15] | 0.16 | 11.18 | 27.31 | 40.62 | 9.57 | 30.74 | 48.36 | 167.78 |
| T | Full fine-tuning[10] | 151.00 | 13.54 | 30.83 | 43.46 | 11.55 | 33.14 | 49.83 | 182.35 |
| T | Adapter[14] | 2.57 | 12.99 | 28.63 | 42.54 | 9.84 | 30.74 | 45.92 | 170.66 |
| T | CLIP(ViT-B-16)[10] | 0.00 | 6.67 | 17.65 | 26.44 | 7.33 | 22.15 | 33.57 | 113.81 |
| T | Adapter (ViT-B-16) | 2.57 | 14.36 | 31.65 | 44.46 | 11.60 | 32.68 | 48.32 | 183.07 |
| T | Ours | 0.18 | 14.36 | 30.19 | 43.73 | 10.57 | 30.52 | 46.03 | 175.40 |
| T | Ours (ViT-B-16) | 0.60 | 16.01 | 33.57 | 46.11 | 11.82 | 31.94 | 47.77 | 187.22 |

Firstly, compared with traditional methods on the RSICD, we have achieved a significant performance lead, which we believe is due to the powerful visual-semantic extraction capability of the pre-trained model. Additionally, compared with CLIP methods, our approach requires fewer training parameters and exhibits superior overall performance. Compared with the baseline method Adapter, SPER only needs to train 0.18 million parameters, while the Adapter method requires training 2.57 million parameters. Importantly, the sumR metric of SPER leads the Adapter method by 4.74 points on the RSICD dataset. In our opinion, the advancement of SPER lies in its ability to model and extract spatial relationships, whereas the Adapter method mainly focuses on channel features. Furthermore, compared to the Full fine-tuning method, SPER only needs to train less than 1% of the parameters to achieve 96% of its performance, demonstrating the efficiency of our proposed approach.

Our method also performs well on the RSITMD, as shown in Table 2. Compared with similar methods, our approach makes a better trade-off between the volume of training parameters and retrieval performance. It is worth noting that the performance of SPER (ViT-B-32) is comparable to the Adapter (ViT-B-16), which demonstrates the validity of SPER for efficient reconstruction of spatial features. The rest of the experimental results are more or less the same as RSICD and will not be repeated.

Table 2  Comparison of cross-modal retrieval performance on RSITMD

| Type | Method | Training parameter/10⁶ | Text R@1 | Text R@5 | Text R@10 | Image R@1 | Image R@5 | Image R@10 | sumR |
|------|--------|------------------------|----------|----------|-----------|-----------|-----------|------------|------|
| R | LW-MCR-u[19] | 1.65 | 9.73 | 26.77 | 37.61 | 9.25 | 34.07 | 54.03 | 171.46 |
| R | AMFMN-sim[7] | 35.94 | 10.63 | 24.78 | 41.81 | 11.51 | 34.69 | 54.87 | 178.29 |
| R | MCRN[20] | 52.35 | 13.27 | 29.42 | 41.59 | 9.42 | 35.53 | 52.74 | 181.97 |
| R | SWAN[8] | - | 13.35 | 32.15 | 46.90 | 11.24 | 40.40 | 60.60 | 204.64 |
| R | GaLR with MR[21] | 46.89 | 14.82 | 31.64 | 42.48 | 11.15 | 36.68 | 51.68 | 188.45 |
| T | Single Language[22] | 151.00 | 19.69 | 40.26 | 54.42 | 17.61 | 49.73 | 66.59 | 248.30 |
| T | Linear probe[10] | 0.53 | 13.71 | 33.41 | 48.01 | 10.97 | 36.85 | 56.15 | 199.10 |
| T | RS-light[23] | 9.20 | 12.61 | 31.85 | 46.23 | 12.92 | 38.98 | 60.08 | 202.67 |
| T | TGKT[24] | 4.70 | 17.92 | 36.95 | 52.88 | 12.83 | 43.14 | 62.48 | 226.20 |
| T | Cross-Modal Adapter[15] | 0.16 | 18.16 | 36.08 | 48.72 | 16.31 | 44.33 | 64.75 | 228.35 |
| T | Full fine-tuning[10] | 151.00 | 24.16 | 47.12 | 61.28 | 20.40 | 50.53 | 68.54 | 272.03 |
| T | Adapter[14] | 2.57 | 21.01 | 41.59 | 53.76 | 16.94 | 46.19 | 64.02 | 243.51 |
| T | CLIP(ViT-B-16)[10] | 0.00 | 8.84 | 23.45 | 36.28 | 9.86 | 34.38 | 49.38 | 162.19 |
| T | Adapter (ViT-B-16) | 2.57 | 23.67 | 40.92 | 52.65 | 15.35 | 46.72 | 65.35 | 244.66 |
| T | Ours | 0.18 | 21.46 | 43.36 | 54.42 | 16.81 | 45.88 | 62.96 | 244.89 |
| T | Ours (ViT-B-16) | 0.60 | 23.45 | 42.47 | 52.87 | 15.48 | 47.38 | 65.84 | 247.49 |

2.4 Ablation study

We explored the effect of the channel division step on the proposed SPER, and the experimental results are shown in Table 3. When multimodal features are partitioned in the channel dimension, different division steps can be adopted; this is an important hyperparameter. The division step size affects the quantity of spatial information as well as the volume of parameters required for reconstruction, which in turn affects the retrieval performance of SPER. The best overall results are achieved when the division step is 1. A longer division step brings an improvement in image retrieval metrics but has little effect on the overall performance. Therefore, we believe that SPER can efficiently perform feature reconstruction when the division step is 1.

Table 3  Comparison of different division steps on RSITMD

| Division step | Training parameter/10⁶ | Text R@1 | Text R@5 | Text R@10 | Image R@1 | Image R@5 | Image R@10 | sumR |
|---------------|------------------------|----------|----------|-----------|-----------|-----------|------------|------|
| 1 | 0.18 | 21.46 | 43.36 | 54.42 | 16.81 | 45.88 | 62.96 | 244.89 |
| 3 | 0.43 | 20.79 | 42.03 | 53.54 | 18.23 | 45.93 | 63.32 | 243.84 |
| 5 | 0.67 | 20.57 | 42.69 | 53.53 | 17.96 | 46.01 | 63.67 | 244.43 |

2.5 Case study

Fig.3 shows some of the SPER retrieval results in different RS scenarios, with the retrieved images arranged in order of similarity from left to right. The ground truth is indicated by the green box. Benefiting from the efficient reconstruction of spatial features, SPER is able to better extract valuable information and model the spatial relationships in RS images, which are bolded in the query text. As shown in Fig.3(a), the performance of the proposed SPER remains reliable even in the presence of many entities and complex spatial relationships. Besides the ground truth, the retrieved images also contain the white building or boats in the river that are relevant to the query text. As shown in Fig.3(b), SPER can also align the image content and query semantics well when dealing with multi-scale targets. However, SPER is not accurate enough in retrieving RS images based on the number of entities described in the queries, as shown in Figs.3(c) and 3(d). SPER should be further optimized for the ability to extract the quantity of valuable targets.

Fig.3  Retrieval cases of SPER on the RSITMD test set

2.6 Analysis of time consumption

Table 4 presents a comparison of the retrieval time consumption between different methods, where TT denotes the training time for one pass through the training set, ET the evaluation time for the test set, and IT the inference time for a single cross-modal retrieval. The computing platform consists of a 2.50 GHz Intel Xeon Gold 6133 CPU and a single NVIDIA 32 GB V100 GPU. The experimental dataset is RSITMD. The recorded results are the average of three runs.

Table 4  Retrieval time consumption of different methods

| Type | Method | TT/s | ET/s | IT/ms |
|------|--------|------|------|-------|
| R | AMFMN | 47.83 | 4.79 | 1.76 |
| R | GaLR | 50.18 | 4.85 | 1.78 |
| T | CLIP | 101.04 | 5.66 | 2.08 |
| T | Adapter | 79.39 | 5.86 | 2.16 |
| T | SPER | 76.40 | 5.71 | 2.10 |

Compared with the traditional GaLR method, SPER’s TT increases by 26.22 s, and IT increases by 0.32 ms. However, considering the significant improvement in SPER’s retrieval performance, we believe the additional time consumption is acceptable. Compared with the CLIP method, SPER reduces the TT by 24.3%, while the ET and IT are at the same level. Compared with the baseline method Adapter, SPER benefits from the efficient reconstruction of spatial features, leading to superior retrieval performance along with improved training and inference efficiency.

2.7 Limitations of SPER

One potential limitation of SPER is its performance in aligning fine-grained information, particularly regarding quantities. As demonstrated in subsection 2.5, SPER is not always accurate when retrieving RS images based on the number of entities described in the queries. This suggests that while SPER performs well in general retrieval tasks, there is room for improvement in its ability to model and retrieve precise numerical or quantity-based details.

Another limitation is SPER's efficiency when processing high-resolution remote sensing images (e.g., 10 000 pixel × 10 000 pixel), which must be sliced to accommodate retrieval. As discussed in subsection 2.6, compared with traditional CNN-based approaches such as AMFMN and GaLR, SPER could require more computational resources, potentially affecting inference speed. Thus, the trade-off between retrieval performance and efficiency is an important area for further exploration.

3 Conclusions

(1) We propose an efficient spatial feature reconstruction framework for RSCIR, which significantly reduces the consumption of computational and storage resources. Compared with the baseline method of fine-tuning all parameters in the VLP model, our framework requires training only 0.18 million parameters (<1%) to achieve 96% of the baseline performance, reducing the training time by 24.3%.

(2) To bridge the gap between different domains, our designed spatial adapter efficiently models and extracts spatial relationships from multimodal features. In terms of retrieval performance, SPER leads similar methods by at least 2.7%.

(3) As discussed in the limitations section, our future research will focus on addressing the challenges of improving fine-grained retrieval and reducing computational demands during inference.

Contributions Statement

Mr. ZHANG Weihang designed the study, developed the methodology, interpreted the results, and wrote the manuscript. Mr. CHEN Jialiang conducted validation and contributed to the writing, review, and editing of the manuscript. Prof. ZHANG Wenkai managed project administration, organized resources, and supported the study’s progress. Prof. LI Xinming conducted validation and contributed to the manuscript through review and editing. Prof. GAO Xin supervised the research process and reviewed the manuscript. Prof. SUN Xian provided funding support and critical insights into the methodology. All authors commented on the manuscript draft and approved the submission.

Acknowledgements

This work was supported by the National Key R&D Program of China (No.2022ZD0118402).

Conflict of Interest

The authors declare no competing interests.

References

[1] ZHOU W, GUAN H, LI Z, et al. Remote sensing image retrieval in the past decade: Achievements, challenges, and future directions[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2023, 16: 1447-1473.

[2] YAN J, YU L, XIA C, et al. Super-resolution inversion and reconstruction of remote sensing image of unknown infrared band of interest[J]. Transactions of Nanjing University of Aeronautics & Astronautics, 2023, 40(4): 472-486.

[3] CAO M, LI S, LI J, et al. Image-text retrieval: A survey on recent research and development[EB/OL]. (2022-03-18). https://arxiv.org/abs/2203.14713.

[4] TANG X, WANG Y, MA J, et al. Interacting-enhancing feature transformer for cross-modal remote-sensing image and text retrieval[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 1-15.

[5] FAGHRI F, FLEET D J, KIROS J R, et al. VSE++: Improving visual-semantic embeddings with hard negatives[EB/OL]. (2018-07-18). https://arxiv.org/abs/1707.05612.

[6] ZHANG W, LI J, LI S, et al. Hypersphere-based remote sensing cross-modal text-image retrieval via curriculum learning[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 1-15.

[7] YUAN Z, ZHANG W, FU K, et al. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 4404119.

[8] PAN J, MA Q, BAI C. Reducing semantic confusion: Scene-aware aggregation network for remote sensing cross-modal retrieval[C]//Proceedings of the 2023 ACM International Conference on Multimedia Retrieval. [S.l.]: ACM, 2023: 398-406.

[9] WEN C, HU Y, LI X, et al. Vision-language models in remote sensing: Current progress and future trends[J]. IEEE Geoscience and Remote Sensing Magazine, 2024, 12(2): 32-66.

[10] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//Proceedings of International Conference on Machine Learning. [S.l.]: PMLR, 2021: 8748-8763.

[11] LI J, SELVARAJU R, GOTMARE A, et al. Align before fuse: Vision and language representation learning with momentum distillation[J]. Advances in Neural Information Processing Systems, 2021, 34: 9694-9705.

[12] LI J, LI D, SAVARESE S, et al. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models[C]//Proceedings of International Conference on Machine Learning. [S.l.]: PMLR, 2023: 19730-19742.

[13] YUAN Y, ZHAN Y, XIONG Z. Parameter-efficient transfer learning for remote sensing image-text retrieval[J]. IEEE Transactions on Geoscience and Remote Sensing, 2023, 61: 1-14.

[14] HOULSBY N, GIURGIU A, JASTRZEBSKI S, et al. Parameter-efficient transfer learning for NLP[C]//Proceedings of International Conference on Machine Learning. [S.l.]: PMLR, 2019: 2790-2799.

[15] JIANG H, ZHANG J, HUANG R, et al. Cross-modal adapter for text-video retrieval[EB/OL]. (2022-11-17). https://arxiv.org/abs/2211.09623.

[16] HE K, FAN H, WU Y, et al. Momentum contrast for unsupervised visual representation learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, WA, USA: IEEE, 2020: 9729-9738.

[17] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[EB/OL]. (2021-06-03). https://arxiv.org/abs/2010.11929.

[18] DEVLIN J. BERT: Pre-training of deep bidirectional transformers for language understanding[EB/OL]. (2018-10-24). https://arxiv.org/abs/1810.04805.

[19] YUAN Z, ZHANG W, RONG X, et al. A lightweight multi-scale cross modal text-image retrieval method in remote sensing[J]. IEEE Transactions on Geoscience and Remote Sensing, 2021, 60: 1-19.

[20] YUAN Z, ZHANG W, TIAN C, et al. MCRN: A multi-source cross-modal retrieval network for remote sensing[J]. International Journal of Applied Earth Observation and Geoinformation, 2022, 115: 103071.

[21] YUAN Z, ZHANG W, TIAN C, et al. Remote sensing cross-modal text-image retrieval based on global and local information[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 1-16.

[22] AL RAHHAL M M, BAZI Y, ALSHARIF N A, et al. Multilanguage transformer for improved text to remote sensing image retrieval[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2022, 15: 9115-9126.

[23] LIAO Y, YANG R, XIE T, et al. A fast and accurate method for remote sensing image-text retrieval based on large model knowledge distillation[C]//Proceedings of IGARSS 2023-2023 IEEE International Geoscience and Remote Sensing Symposium. Pasadena, CA, USA: IEEE, 2023: 5077-5080.

[24] LIU A A, YANG B, LI W, et al. Text-guided knowledge transfer for remote sensing image-text retrieval[J]. IEEE Geoscience and Remote Sensing Letters, 2024, 21: 3504005.
