$108309.944805 USD

-1.81%

ethereum

$3861.653445 USD

-2.57%

tether

$1.000476 USD

0.02%

bnb

$1064.809647 USD

-3.07%

xrp

$2.422923 USD

-2.29%

solana

$186.552328 USD

-0.93%

usd-coin

$0.999917 USD

0.00%

tron

$0.322438 USD

-0.01%

dogecoin

$0.194315 USD

-2.57%

cardano

$0.642133 USD

-3.06%

chainlink

$17.657259 USD

-6.17%

hyperliquid

$35.120261 USD

-7.45%

ethena-usde

$0.999614 USD

0.03%

stellar

$0.312748 USD

-3.27%

bitcoin-cash

$480.377391 USD

0.23%

暗号通貨のニュース記事

Token-Mol: A Large-Scale Language Model for Molecular Pre-training

2025/05/13 17:15

Drug discovery is a remarkably intricate journey that has recently been revolutionized by rapid advances in artificial intelligence (AI) technologies, particularly deep learning (DL), which has been progressively impacting multiple facets of drug development. These technologies are accelerating in innovative drug research. However, the high cost associated with acquiring annotated data sets in drug discovery remains a significant impediment to the advancement in this field. Recently, the rapid evolution of unsupervised learning frameworks, epitomized by BERT1 and GPT2, has introduced unsupervised chemical and biological pre-training models across disciplines such as chemistry3,4,5,6,7,8,9,10,11,12, and biology13,14,15,16. These models undergo large-scale unsupervised training to learn representations of small molecules or proteins, subsequently fine-tuned for specific applications. By leveraging unsupervised learning on large-scale datasets, these pre-training models effectively address the challenges associated with sparse labeling and suboptimal out-of-distribution generalization, leading to improved performance17.

Large-scale molecular pre-training models can be broadly categorized into two main groups: models based on chemical language and models utilizing molecular graphs. First, chemical language models encode molecular structures using representations such as simplified molecular input line entry system (SMILES)18 or self-referencing embedded strings (SELFIES)19. They employ training methodologies akin to BERT or GPT, well-established in natural language processing (NLP). Notable examples include SMILES-BERT20, MolGPT21, Chemformer22, and Multitask Text and Chemistry T523, which exhibit architectural similarities to universal or general NLP models such as LLaMA24.

Second, graph-based molecular pre-trained models exhibit higher versatility. They represent molecules in a graphical format, with nodes for atoms and edges for chemical bonds. Pre-training methodologies include various techniques, such as random masking of atom types, contrastive learning, and context prediction25,26,27. Unlike language-based models, graph-based molecular pre-trained models inherently incorporate geometric information, as demonstrated by methods like GEM28 and Uni-Mol29.

Despite their advancements, both classes of models exhibit distinct limitations. Large-scale molecular pre-training models based on the chemical language face a significant constraint in their inability to inherently process 3D structural information, which is pivotal for determining the physical, chemical, and biological properties of molecules28,29. Consequently, these models are inadequate for downstream tasks that involve 3D structures, such as molecular conformation generation and 3D structure-based drug design. In contrast, graph-based molecular pre-trained models can effectively incorporate 3D information. However, existing approaches primarily focus on learning molecular representations for property prediction rather than molecular generation. Moreover, integrating these models with universal NLP models presents considerable challenges. As a result, a comprehensive model capable of addressing all drug design tasks remains elusive. To address the limitations of these two model types and develop a pre-trained model suitable for all drug design scenarios, and easily integrable with existing general large language models, is pressing.

The emergence of universal artificial intelligence models holds promise in this domain. By leveraging vast amounts of data, these models acquire expert knowledge across diverse fields, rendering them capable of providing valuable assistance to practitioners in various domains2,24,30,31. Recent studies have demonstrated that GPT-4 exhibits a deep understanding of key concepts in drug discovery, including therapeutic proteins and the fundamental principles governing the design of small molecule-based and other types of drugs. Although its proficiency in specific drug design tasks, such as de novo molecule generation, molecular structure alteration, drug-target interaction prediction, molecular property estimation, and retrosynthetic pathway prediction, requires further improvement, it has achieved promising results in tasks like molecular structure generation and drug-target interaction prediction32. Among these capabilities, the application of a token-based approach by the above models to handle continuous spatial data is particularly noteworthy.

Building on this concept, Born et al. introduced the Regression Transformer, which integrates regression tasks by encoding numerical values as tokens. Nonetheless, this method does not fully address the structural complexities of molecules. Additionally, Flam-Shepherd and Aspuru-Guzik proposed directly tokenizing 3D atomic coordinates (XYZ) to represent molecular 3D structures. Concurrently, the BindGPT framework employs a similar approach to generate molecular structures and their corresponding 3D coordinates. While the performance of these models still needs enhancement, both approaches have exhibited promising outcomes in relevant drug design tasks. These results highlight the potential of large models to grasp the semantics of numerical values and affirm the feasibility of employing token-only models to handle continuous data. However, directly training language models on Cartesian coordinates of atoms presents unique challenges. For larger molecules, the extensive XYZ coordinates can result in excessively long sequences, posing difficulties for the model's learning process. Furthermore, achieving invariance through random translation and rotation does not necessarily confer equivari

オリジナルソース：nature

免責事項:info@kdj.com

提供される情報は取引に関するアドバイスではありません。 kdj.com は、この記事で提供される情報に基づいて行われた投資に対して一切の責任を負いません。暗号通貨は変動性が高いため、十分な調査を行った上で慎重に投資することを強くお勧めします。

このウェブサイトで使用されているコンテンツが著作権を侵害していると思われる場合は、直ちに当社 (info@kdj.com) までご連絡ください。速やかに削除させていただきます。

2025年10月22日に掲載されたその他の記事

もっと