DOAJ Open Access 2025

An LLM-based Data Augmentation Method for Constructing Science & Technology Topic Linkages: Taking the Energy Conservation Field as an Example

WANG Xiaoyu, HU Jingyuan, WU Ruoyu, WANG Shu, ZHAI Yujia

Abstract

[Purpose/Significance] Understanding the intrinsic linkages between scientific research and technological innovation is critical for guiding strategic decision-making, optimizing resource allocation, and promoting effective technology transfer. Scientific publications and patents are two complementary yet heterogeneous knowledge sources. Their distinct linguistic styles, terminologies, and documentation structures often create a significant semantic gap. Traditional methods of linking scientific and technological (S&T) knowledge rely primarily on lexical overlap, keyword co-occurrence, or citation analysis, and they are limited in their ability to capture deeper semantic relationships, particularly across non-homologous texts. To address this challenge, this study proposes an approach that uses large language models (LLMs) for data augmentation, aiming to uncover latent semantic associations between research paper topics and patent topics. The key innovation lies in using LLMs not merely for text generation but as a semantic bridge that enhances cross-domain knowledge alignment, thereby extending the methodological toolkit for science-technology linkage studies. The approach offers potential contributions to knowledge mapping, thematic analysis, and strategic innovation management, particularly where domain-specific terminology or conceptual divergence hampers conventional analyses.

[Method/Process] The proposed method employs ChatGPT-4 as a knowledge-enriched intermediary to generate semantically enhanced textual variants of existing S&T documents in the energy-saving domain. Specifically, the LLM performs synonym-based paraphrasing, expansion, and semantic inference on research paper abstracts and patent summaries, producing augmented texts that retain domain relevance while highlighting latent semantic connections. Features extracted from these augmented texts are then incorporated into a non-patent citation prediction task, which serves as a practical evaluation of the method's effectiveness. By comparing predicted associations against existing citation links, the study assesses how well LLM-derived features capture cross-domain topic relatedness beyond lexical similarity. The approach rests on the theoretical premise that LLMs can model high-level semantic patterns, enabling the inference of conceptual correspondence even when the explicit terminology of scientific and technological texts differs.

[Results/Conclusions] In experiments with four baseline models, features derived from the augmented texts consistently improved prediction performance. The area under the ROC curve (AUC) increased by 13.91%, 16.90%, 16.21%, and 15.69% across the four models, respectively, demonstrating the efficacy of LLM-based data augmentation in bridging the semantic gap between S&T knowledge. These results suggest that the method can uncover latent topic associations, facilitate cross-domain term alignment, and support knowledge discovery tasks that conventional lexical approaches may overlook. However, the study is limited to a single application domain, which leaves its generalizability to other S&T fields an open question. Future work should extend the methodology to diverse domains, investigate the robustness of the LLM-generated semantic bridges, and explore automated mechanisms for scaling cross-domain knowledge integration. Overall, this research provides a promising framework for enhancing the semantic connectivity of heterogeneous knowledge sources. It contributes to a broader understanding of the interactions between science and technology and informs data-driven strategies for managing research and innovation.
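The evaluation idea described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the texts, the "augmented" paraphrase, and all scores below are hypothetical toy values, and Jaccard word overlap merely stands in for a lexical similarity feature. The sketch shows (a) how a paraphrase that adds synonymous terms can raise lexical overlap between a paper topic and a patent topic, and (b) how AUC can be computed for a link-prediction score. The abstract does not state whether the reported AUC gains are relative or absolute, so both are computed.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set (toy tokenizer, an assumption)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard word overlap between two texts; 0.0 if both are empty."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# (a) Hypothetical paper topic, patent topic, and a hand-written paraphrase
# standing in for LLM output. None of these come from the study.
paper = "waste heat recovery"
patent = "thermal energy reclaiming device"
paper_augmented = "waste heat recovery thermal energy reclaiming"

overlap_original = jaccard(paper, patent)             # no shared words
overlap_augmented = jaccard(paper_augmented, patent)  # three shared words

# (b) Hypothetical link-prediction scores for four paper-patent pairs;
# label 1 means the patent actually cites the paper.
labels = [1, 1, 0, 0]
baseline_scores = [0.2, 0.6, 0.4, 0.5]
augmented_scores = [0.45, 0.7, 0.4, 0.5]

auc_baseline = roc_auc(labels, baseline_scores)
auc_augmented = roc_auc(labels, augmented_scores)

absolute_gain = auc_augmented - auc_baseline
relative_gain = absolute_gain / auc_baseline

print(f"overlap: {overlap_original:.3f} -> {overlap_augmented:.3f}")
print(f"AUC: {auc_baseline:.2f} -> {auc_augmented:.2f}")
print(f"absolute gain: {absolute_gain:.2f}, relative gain: {relative_gain:.0%}")
```

With real data, the scores would come from a trained citation-prediction model rather than hand-picked numbers, and the same comparison would be repeated for each baseline model.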

Citation Format

Wang, X., Hu, J., Wu, R., Wang, S., & Zhai, Y. (2025). An LLM-based Data Augmentation Method for Constructing Science & Technology Topic Linkages: Taking the Energy Conservation Field as an Example. https://doi.org/10.13998/j.cnki.issn1002-1248.25-0513

Journal Information
Publication Year
2025
Source Database
DOAJ
DOI
10.13998/j.cnki.issn1002-1248.25-0513
Access
Open Access ✓