
Token-Mol: A Large-Scale Language Model for Molecular Pre-training

2025/05/13 17:15

Drug discovery is a remarkably intricate process that has recently been transformed by rapid advances in artificial intelligence (AI), particularly deep learning (DL), which is increasingly influencing multiple facets of drug development and accelerating innovative drug research. However, the high cost of acquiring annotated datasets remains a significant impediment to progress in the field. Recently, the rapid evolution of unsupervised learning frameworks, epitomized by BERT [1] and GPT [2], has given rise to unsupervised pre-training models in chemistry [3–12] and biology [13–16]. These models undergo large-scale unsupervised training to learn representations of small molecules or proteins and are subsequently fine-tuned for specific applications. By leveraging unsupervised learning on large-scale datasets, these pre-training models effectively address the challenges of sparse labeling and suboptimal out-of-distribution generalization, leading to improved performance [17].

Large-scale molecular pre-training models can be broadly categorized into two main groups: models based on chemical language and models utilizing molecular graphs. First, chemical language models encode molecular structures using representations such as the simplified molecular input line entry system (SMILES) [18] or self-referencing embedded strings (SELFIES) [19]. They employ training methodologies akin to those of BERT or GPT, which are well established in natural language processing (NLP). Notable examples include SMILES-BERT [20], MolGPT [21], Chemformer [22], and Multitask Text and Chemistry T5 [23], which are architecturally similar to general-purpose NLP models such as LLaMA [24].
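To make the idea of a chemical language concrete, the following minimal sketch splits a SMILES string into tokens that a BERT- or GPT-style model could consume. It is an illustration only: the regular expression is a commonly used community pattern rather than the tokenizer of any model named above, and the names `SMILES_TOKEN_PATTERN` and `tokenize_smiles` are hypothetical.

```python
import re

# Assumed pattern (not taken from the article): bracket atoms, two-letter
# halogens, organic-subset atoms, aromatic atoms, two-digit ring closures,
# single-digit ring closures, then bond/branch/misc symbols.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[=#\-\+\(\)/\\\.:~@\?\*\$])"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens, one per atom, bond, or symbol."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Guard against silently dropping characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
print(tokenize_smiles(aspirin))
print(tokenize_smiles("C[C@H](N)C(=O)O"))  # the bracket atom stays one token
```

Note that nothing in such a token sequence carries atomic coordinates; the string encodes connectivity and, at most, stereochemistry labels.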

Second, graph-based molecular pre-trained models exhibit higher versatility. They represent molecules as graphs, with nodes for atoms and edges for chemical bonds. Pre-training techniques include random masking of atom types, contrastive learning, and context prediction [25,26,27]. Unlike language-based models, graph-based molecular pre-trained models inherently incorporate geometric information, as demonstrated by methods like GEM [28] and Uni-Mol [29].
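The graph representation and the atom-type masking objective can be sketched in a few lines. This is a simplified illustration, not the procedure of GEM, Uni-Mol, or any cited method: the `MolGraph` class, the `[MASK]` symbol, and the 15% default masking ratio are all assumptions made for the example.

```python
import random
from dataclasses import dataclass

MASK = "[MASK]"  # hypothetical placeholder for a hidden atom type


@dataclass
class MolGraph:
    atoms: list[str]                   # nodes: one element symbol per atom
    bonds: list[tuple[int, int, int]]  # edges: (atom index, atom index, bond order)


def mask_atom_types(graph, mask_ratio=0.15, seed=None):
    """Return a masked copy of the graph and the hidden labels to predict."""
    rng = random.Random(seed)
    n_atoms = len(graph.atoms)
    # Mask at least one atom when possible, never more than the graph holds.
    n_masked = min(n_atoms, max(1, round(mask_ratio * n_atoms)))
    masked_idx = sorted(rng.sample(range(n_atoms), n_masked))
    atoms = list(graph.atoms)  # copy so the input graph is left untouched
    targets = {}
    for i in masked_idx:
        targets[i] = atoms[i]
        atoms[i] = MASK
    return MolGraph(atoms, list(graph.bonds)), targets


# Heavy atoms of ethanol: C-C-O joined by single bonds
ethanol = MolGraph(atoms=["C", "C", "O"], bonds=[(0, 1, 1), (1, 2, 1)])
masked, targets = mask_atom_types(ethanol, mask_ratio=0.34, seed=0)
print(masked.atoms, targets)
```

During pre-training, a model would receive the masked graph and be trained to recover the entries of `targets` from the surrounding atoms and bonds.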

Despite their advancements, both classes of models exhibit distinct limitations. Large-scale molecular pre-training models based on chemical language are significantly constrained by their inability to inherently process 3D structural information, which is pivotal for determining the physical, chemical, and biological properties of molecules [28,29]. Consequently, these models are inadequate for downstream tasks that involve 3D structures, such as molecular conformation generation and 3D structure-based drug design. In contrast, graph-based molecular pre-trained models can effectively incorporate 3D information. However, existing approaches primarily focus on learning molecular representations for property prediction rather than molecular generation. Moreover, integrating these models with universal NLP models presents considerable challenges. As a result, a comprehensive model capable of addressing all drug design tasks remains elusive. There is therefore a pressing need for a pre-trained model that addresses the limitations of both model types, is suitable for all drug design scenarios, and can be easily integrated with existing general large language models.

The emergence of universal artificial intelligence models holds promise in this domain. By leveraging vast amounts of data, these models acquire expert knowledge across diverse fields, enabling them to provide valuable assistance to practitioners in various domains [2,24,30,31]. Recent studies have demonstrated that GPT-4 exhibits a deep understanding of key concepts in drug discovery, including therapeutic proteins and the fundamental principles governing the design of small molecule-based and other types of drugs. Its proficiency in specific drug design tasks, such as de novo molecule generation, molecular structure alteration, drug-target interaction prediction, molecular property estimation, and retrosynthetic pathway prediction, requires further improvement. Nevertheless, it has achieved promising results in tasks like molecular structure generation and drug-target interaction prediction [32]. Among these capabilities, the use of a token-based approach by these models to handle continuous spatial data is particularly noteworthy.

Building on this concept, Born et al. introduced the Regression Transformer, which integrates regression tasks by encoding numerical values as tokens. Nonetheless, this method does not fully address the structural complexities of molecules. Additionally, Flam-Shepherd and Aspuru-Guzik proposed directly tokenizing 3D atomic coordinates (XYZ) to represent molecular 3D structures. Concurrently, the BindGPT framework employs a similar approach to generate molecular structures and their corresponding 3D coordinates. While the performance of these models still needs enhancement, both approaches have exhibited promising outcomes in relevant drug design tasks. These results highlight the potential of large models to grasp the semantics of numerical values and affirm the feasibility of employing token-only models to handle continuous data. However, directly training language models on Cartesian coordinates of atoms presents unique challenges. For larger molecules, the extensive XYZ coordinates can result in excessively long sequences, posing difficulties for the model's learning process. Furthermore, achieving invariance through random translation and rotation does not necessarily confer equivari
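The two concerns raised about Cartesian coordinates, sequence length and sensitivity to rigid motions, can be illustrated with a toy tokenizer. This is a sketch under stated assumptions and not the scheme used by the Regression Transformer, Flam-Shepherd and Aspuru-Guzik, or BindGPT: the one-token-per-coordinate format, the rounding precision, and the function names `tokenize_xyz` and `translate` are hypothetical, and the water geometry is illustrative rather than measured.

```python
def tokenize_xyz(atoms, decimals=1):
    """Flatten (element, x, y, z) records into a token sequence.

    Each coordinate is rounded to `decimals` places and emitted as one token,
    so every atom costs four tokens: its element plus three numbers.
    """
    tokens = []
    for element, x, y, z in atoms:
        tokens.append(element)
        tokens.extend(f"{value:.{decimals}f}" for value in (x, y, z))
    return tokens


def translate(atoms, dx, dy, dz):
    """Rigidly shift every atom by the same offset."""
    return [(e, x + dx, y + dy, z + dz) for e, x, y, z in atoms]


# Illustrative geometry for water (values chosen for the example only)
water = [("O", 0.0, 0.0, 0.1), ("H", 0.0, 0.8, -0.5), ("H", 0.0, -0.8, -0.5)]

tokens = tokenize_xyz(water)
print(len(tokens), tokens)  # four tokens per atom

# The same molecule, shifted in space, yields a different token sequence.
shifted_tokens = tokenize_xyz(translate(water, 1.0, 0.0, 0.0))
print(tokens == shifted_tokens)
```

Under this scheme the sequence grows linearly at four tokens per atom, and a rigid translation changes the tokens even though the molecule itself is unchanged, which is why such models rely on random translation and rotation of the training data.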

Original source: Nature
