BACKGROUND: Drug repositioning is a pivotal strategy in pharmaceutical research, offering accelerated and cost-effective therapeutic discovery. However, biomedical information relevant to drug repositioning is often complex, dispersed, and underutilized due to limitations in traditional extraction methods, such as reliance on annotated data and poor generalizability. Large language models (LLMs) show promise but face challenges such as hallucinations and interpretability issues.
OBJECTIVE: This study proposed long chain-of-thought for drug repositioning knowledge extraction (LCoDR-KE), a lightweight and domain-specific framework to enhance LLMs' accuracy and adaptability in extracting structured biomedical knowledge for drug repositioning.
METHODS: A domain-specific schema defined 11 entity types (eg, drug, disease) and 18 relationship types (eg, treats, is biomarker of). Guided by this schema, we automatically annotated 10,000 PubMed abstracts via chain-of-thought prompt engineering. Of these, 1000 expert-validated abstracts were curated into a high-quality drug repositioning corpus, while the remaining entries were allocated for model training. The proposed LCoDR-KE framework then combined supervised fine-tuning of the Qwen2.5-7B-Instruct model with reinforcement learning under a dual-reward mechanism. Performance was evaluated against state-of-the-art models (eg, conditional random fields, Bidirectional Encoder Representations From Transformers, BioBERT, Qwen2.5, DeepSeek-R1, OpenBioLLM-70B, and model variants) using precision, recall, and F1-score. In addition, training convergence was assessed by tracking performance across iteration steps.
RESULTS: LCoDR-KE achieved an entity F1 of 81.46% (eg, drug 95.83%, disease 90.52%) and a triplet F1 of 69.04%, outperforming traditional models and rivaling larger LLMs (DeepSeek-R1: entity F1=84.64%, triplet F1=69.02%). Ablation studies confirmed the contributions of supervised fine-tuning (entity and triplet F1 dropped by 8.61% and 20.70%, respectively, when removed) and reinforcement learning (drops of 6.09% and 14.09%). The training process demonstrated stable convergence, validated through iterative performance monitoring. Qualitative analysis of the model's chain-of-thought outputs showed that LCoDR-KE performed structured, schema-aware reasoning by validating entity types, rejecting incompatible relations, enforcing constraints, and generating compliant JSON. Error analysis revealed 4 main types of mistakes and identified challenges for further improvement.
CONCLUSIONS: LCoDR-KE enhances LLMs' domain-specific adaptability for drug repositioning by offering an open-source drug repositioning corpus and a long chain-of-thought framework built on a lightweight LLM. This framework supports drug discovery and knowledge reasoning while providing scalable, interpretable solutions applicable to broader biomedical knowledge extraction tasks.
2025 · Bioinformatics · Oxford University Press · added 2026-04-21
Motivation: Proteins play essential roles in living organisms. However, understanding their functions faces numerous challenges, such as insufficient integration of multimodal information, a large number of training parameters, limited flexibility of classification-based methods, and the lack of systematic evaluation metrics for protein question answering systems. To tackle these issues, we propose the Prot2Chat framework. Results: We modified ProteinMPNN to encode protein sequence and structural information in a unified way. We used a large language model …
Despite the vast number of enzymatic kinetic measurements reported across decades of biochemical literature, the majority of relational enzyme kinetic data—linking amino acid sequence, substrate identity, kinetic parameters, and assay conditions—remains uncollected and inaccessible in structured form. This constitutes a significant portion of the "dark matter" of enzymology. Unlocking these hidden data through automated extraction offers an opportunity to expand enzyme dataset diversity and size, critical …