Cryptocurrency News Articles

Token-Mol: A Large Language Model for Molecular Pre-training

2025/05/13 17:15

In recent years, advances in artificial intelligence (AI) technology, particularly deep learning (DL), have progressively influenced multiple facets of drug development.

Drug discovery is a remarkably intricate journey that has recently been revolutionized by rapid advances in artificial intelligence (AI) technologies, particularly deep learning (DL), which have progressively impacted multiple facets of drug development. These technologies are accelerating innovation in drug research. However, the high cost of acquiring annotated datasets in drug discovery remains a significant impediment to progress in the field. Recently, the rapid evolution of unsupervised learning frameworks, epitomized by BERT [1] and GPT [2], has introduced unsupervised chemical and biological pre-training models across disciplines such as chemistry [3-12] and biology [13-16]. These models undergo large-scale unsupervised training to learn representations of small molecules or proteins and are subsequently fine-tuned for specific applications. By leveraging unsupervised learning on large-scale datasets, these pre-training models effectively address the challenges of sparse labeling and suboptimal out-of-distribution generalization, leading to improved performance [17].

Large-scale molecular pre-training models can be broadly categorized into two main groups: models based on chemical language and models utilizing molecular graphs. First, chemical language models encode molecular structures using representations such as the simplified molecular-input line-entry system (SMILES) [18] or self-referencing embedded strings (SELFIES) [19]. They employ training methodologies akin to BERT or GPT, which are well established in natural language processing (NLP). Notable examples include SMILES-BERT [20], MolGPT [21], Chemformer [22], and Multitask Text and Chemistry T5 [23], which exhibit architectural similarities to universal or general-purpose NLP models such as LLaMA [24].

Second, graph-based molecular pre-trained models exhibit higher versatility. They represent molecules in a graph format, with nodes for atoms and edges for chemical bonds. Pre-training methodologies include techniques such as random masking of atom types, contrastive learning, and context prediction [25-27]. Unlike language-based models, graph-based molecular pre-trained models inherently incorporate geometric information, as demonstrated by methods such as GEM [28] and Uni-Mol [29].

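A minimal sketch of the node-and-edge representation these graph-based models operate on; the class and the ethanol example are toy illustrations, not any published implementation:

```python
from dataclasses import dataclass, field

@dataclass
class MolGraph:
    """Minimal molecular graph: atoms as nodes, bonds as edges."""
    atoms: list = field(default_factory=list)   # element symbol per node
    bonds: list = field(default_factory=list)   # (i, j, bond_order) tuples

    def add_atom(self, symbol):
        self.atoms.append(symbol)
        return len(self.atoms) - 1              # index of the new node

    def add_bond(self, i, j, order=1):
        self.bonds.append((i, j, order))

    def neighbors(self, i):
        """Indices of atoms bonded to atom i."""
        out = []
        for a, b, _ in self.bonds:
            if a == i:
                out.append(b)
            elif b == i:
                out.append(a)
        return out

# Ethanol (CH3-CH2-OH), heavy atoms only; hydrogens left implicit.
g = MolGraph()
c1, c2, o = g.add_atom("C"), g.add_atom("C"), g.add_atom("O")
g.add_bond(c1, c2)
g.add_bond(c2, o)

# An atom-type masking objective would hide a node label, e.g. g.atoms[1],
# and train a network to predict it from the bonded neighborhood.
```

Geometric variants additionally attach 3D coordinates or interatomic distances to the nodes and edges, which is how methods in the GEM/Uni-Mol family incorporate structural information.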
Despite these advances, both classes of models exhibit distinct limitations. Large-scale molecular pre-training models based on chemical language face a significant constraint in their inability to inherently process 3D structural information, which is pivotal for determining the physical, chemical, and biological properties of molecules [28,29]. Consequently, these models are inadequate for downstream tasks that involve 3D structures, such as molecular conformation generation and 3D structure-based drug design. In contrast, graph-based molecular pre-trained models can effectively incorporate 3D information. However, existing approaches primarily focus on learning molecular representations for property prediction rather than on molecular generation. Moreover, integrating these models with universal NLP models presents considerable challenges. As a result, a comprehensive model capable of addressing all drug design tasks remains elusive. There is thus a pressing need to address the limitations of these two model types and to develop a pre-trained model that suits all drug design scenarios and integrates easily with existing general-purpose large language models.

The emergence of universal artificial intelligence models holds promise in this domain. By leveraging vast amounts of data, these models acquire expert knowledge across diverse fields, rendering them capable of providing valuable assistance to practitioners in various domains [2,24,30,31]. Recent studies have demonstrated that GPT-4 exhibits a deep understanding of key concepts in drug discovery, including therapeutic proteins and the fundamental principles governing the design of small molecule-based and other types of drugs. Although its proficiency in specific drug design tasks, such as de novo molecule generation, molecular structure alteration, drug-target interaction prediction, molecular property estimation, and retrosynthetic pathway prediction, requires further improvement, it has achieved promising results in tasks like molecular structure generation and drug-target interaction prediction [32]. Among these capabilities, the application of a token-based approach by the above models to handle continuous spatial data is particularly noteworthy.

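One way a token-only model can treat a continuous quantity as an ordinary token sequence is a digit-and-place encoding. The scheme below is purely illustrative (the token format is invented for this sketch, not taken from any published vocabulary):

```python
def encode_value(value: float, precision: int = 2) -> list[str]:
    """Encode a float as digit-and-place tokens so a language model can
    consume a continuous quantity as a short discrete sequence."""
    tokens = ["[NEG]"] if value < 0 else []
    int_part, frac_part = f"{abs(value):.{precision}f}".split(".")
    for k, d in enumerate(int_part):
        tokens.append(f"{d}_{len(int_part) - 1 - k}")   # digit at 10^k
    for k, d in enumerate(frac_part):
        tokens.append(f"{d}_-{k + 1}")                  # digit at 10^-k
    return tokens

# encode_value(12.5) -> ['1_1', '2_0', '5_-1', '0_-2']
```

Attaching the decimal place to each digit keeps the vocabulary small while letting the model recover the magnitude of the number from individual tokens, which is the intuition behind treating regression targets as text.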
Building on this concept, Born et al. introduced the Regression Transformer, which integrates regression tasks by encoding numerical values as tokens. Nonetheless, this method does not fully address the structural complexities of molecules. Additionally, Flam-Shepherd and Aspuru-Guzik proposed directly tokenizing 3D atomic coordinates (XYZ) to represent molecular 3D structures. Concurrently, the BindGPT framework employs a similar approach to generate molecular structures and their corresponding 3D coordinates. While the performance of these models still needs enhancement, both approaches have exhibited promising outcomes in relevant drug design tasks. These results highlight the potential of large models to grasp the semantics of numerical values and affirm the feasibility of employing token-only models to handle continuous data. However, directly training language models on Cartesian coordinates of atoms presents unique challenges. For larger molecules, the extensive XYZ coordinates can result in excessively long sequences, posing difficulties for the model's learning process. Furthermore, achieving invariance through random translation and rotation does not necessarily confer equivariance.

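The sequence-length problem described above is easy to see with a character-level serialization of XYZ coordinates. The format below is a simplified illustration, not the exact scheme of any of the cited models:

```python
def tokenize_xyz(molecule, precision=3):
    """Serialize (symbol, (x, y, z)) records as character-level tokens,
    the way XYZ-coordinate language models linearize 3D structures."""
    tokens = []
    for symbol, (x, y, z) in molecule:
        tokens.append(symbol)
        for v in (x, y, z):
            tokens.extend(f"{v:.{precision}f}")   # one token per character
    return tokens

# A toy water molecule (coordinates in angstroms).
water = [
    ("O", (0.000, 0.000, 0.117)),
    ("H", (0.000, 0.757, -0.470)),
    ("H", (0.000, -0.757, -0.470)),
]
toks = tokenize_xyz(water)
# Even three atoms expand to dozens of tokens; length grows linearly with
# atom count, which is what strains the model on larger molecules.
```

A drug-like molecule with dozens of heavy atoms multiplies this further, and because raw Cartesian values change under translation and rotation, the same conformer can map to many unrelated token sequences unless the coordinates are canonicalized or augmented.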
Disclaimer: info@kdj.com

The information provided is not trading advice. kdj.com assumes no responsibility for any investments made based on the information provided in this article. Cryptocurrencies are highly volatile; please research thoroughly and invest with caution.

If you believe that content used on this website infringes your copyright, please contact us immediately (info@kdj.com) and we will remove it promptly.

Other articles published on Jun 07, 2025