Reinforcement Learning from Human Feedback: Explained Simply for the Layman

2025/06/24 07:31

Demystifying Reinforcement Learning from Human Feedback (RLHF): discover how this technique powers ChatGPT and other advanced language models, all explained in simple terms.

ChatGPT's arrival in 2022 revolutionized our perception of AI. Its impressive capabilities spurred the creation of other powerful Large Language Models (LLMs). A key innovation behind ChatGPT's success is Reinforcement Learning from Human Feedback (RLHF). This article provides a simplified explanation of RLHF, avoiding complex reinforcement learning jargon.

NLP Development Before ChatGPT: The Bottleneck of Human Annotation

Traditionally, LLM development involved two main stages:

  1. Pre-training: Language modeling where the model predicts hidden words, learning language structure and meaning.
  2. Fine-tuning: Adapting the model for specific tasks like summarization or question answering, often requiring human-labeled data.

The fine-tuning stage faces a significant hurdle: the need for extensive human annotation. For example, creating a question-answering dataset requires humans to provide accurate answers for millions or even billions of questions. This process is time-consuming and doesn't scale well.
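
To make the two stages concrete, here is a minimal PyTorch sketch (the TinyLM model and its sizes are invented for illustration, not any production LLM). Both stages minimise the same next-word prediction loss; fine-tuning simply runs it on human-written (prompt, answer) pairs, which is exactly where the annotation cost appears.

```python
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    """A toy causal language model: embedding -> GRU -> vocabulary logits."""
    def __init__(self, vocab_size: int = 1000, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)                    # logits: (batch, seq_len, vocab)

def next_token_loss(model, tokens):
    """Stage 1 (pre-training): predict each next token from its prefix."""
    logits = model(tokens[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
    )

# Stage 2 (fine-tuning) reuses the same loss, but on curated (prompt, human answer)
# sequences -- the data that is expensive to collect at scale.
model = TinyLM()
batch = torch.randint(0, 1000, (4, 16))             # four fake sequences of 16 token ids
next_token_loss(model, batch).backward()
```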

RLHF: A Smarter Approach to Training LLMs

RLHF addresses this limitation by leveraging a clever approach. Instead of asking humans to provide direct answers, it asks them to choose the better answer from a pair of options. This simpler task allows for continuous improvement of models like ChatGPT.
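
Concretely, one unit of human feedback under this scheme can be pictured as a record like the sketch below. The field names are illustrative rather than any real dataset schema; the key point is that the annotator never writes an answer, they only pick the better of two.

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str       # the user question or instruction
    response_a: str   # first candidate answer sampled from the model
    response_b: str   # second candidate answer sampled from the model
    preferred: str    # "a" or "b" -- the only judgement the human supplies

example = PreferencePair(
    prompt="Explain RLHF in one sentence.",
    response_a="RLHF trains a model by learning from human preference comparisons.",
    response_b="RLHF is a kind of database index.",
    preferred="a",
)
```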

Response Generation: Creating Options for Human Feedback

LLMs generate responses by predicting the probability of the next word in a sequence. Techniques like nucleus sampling introduce randomness, producing diverse text sequences. RLHF uses these techniques to generate pairs of responses for human evaluation.
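
As an illustration of one such technique, here is a generic top-p (nucleus) sampling step in PyTorch. It is a sketch of the general idea under simple assumptions, not the exact decoder of any particular model.

```python
import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    """Sample a token id from the smallest set of tokens whose total probability >= p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep tokens up to (and including) the one that pushes the cumulative mass past p.
    cutoff = int((cumulative < p).sum().item()) + 1
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()   # renormalise the kept mass
    choice = torch.multinomial(kept, num_samples=1)
    return int(sorted_ids[choice].item())

# Sampling twice from the same logits usually yields different tokens -- this randomness
# is what lets RLHF draw two distinct responses to the same prompt for comparison.
logits = torch.randn(32_000)   # a fake 32k-token vocabulary
print(nucleus_sample(logits), nucleus_sample(logits))
```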

Reward Model: Quantifying the Quality of Responses

The human-labeled data is used to train a "reward model." This model learns to estimate how good or bad a given answer is for an initial prompt, assigning positive values to good responses and negative values to bad ones. The reward model shares the same architecture as the original LLM, but outputs a numerical score instead of text.
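
A common way to train such a reward model from "A is better than B" labels is a pairwise, Bradley-Terry-style loss. The sketch below assumes a hypothetical reward_model(prompt_ids, response_ids) callable that returns one scalar score per example; the important part is the objective, which pushes the preferred answer's score above the rejected one's.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, prompt_ids, chosen_ids, rejected_ids):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected)."""
    r_chosen = reward_model(prompt_ids, chosen_ids)      # scalar score per example
    r_rejected = reward_model(prompt_ids, rejected_ids)
    # Minimised when the human-preferred response clearly outranks the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```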

Training the Original LLM with the Reward Model

The trained reward model then guides the training of the original LLM. The LLM generates responses, which are evaluated by the reward model. These numerical estimates are used as feedback to update the LLM's weights, refining its ability to generate high-quality responses. This process often utilizes a reinforcement learning algorithm like Proximal Policy Optimization (PPO), which, in simplified terms, can be thought of as similar to backpropagation.
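
To keep the score-to-gradient flow visible without the full PPO machinery (ratio clipping, a KL penalty toward the original model, value baselines), here is a deliberately stripped-down, REINFORCE-style stand-in. The generate_with_log_probs helper and the call signatures are assumptions made for illustration, not a real API.

```python
import torch

def rlhf_update(llm, reward_model, optimizer, prompt_ids):
    """One simplified update: sample a response, score it, reinforce it in proportion."""
    # Hypothetical helper: returns the generated token ids and their log-probabilities.
    response_ids, log_probs = llm.generate_with_log_probs(prompt_ids)
    score = reward_model(prompt_ids, response_ids).detach()   # numerical feedback only
    loss = -(score * log_probs.sum())   # higher reward => raise the response's likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(score)
```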

Inference and Continuous Improvement

During inference (when you're using the model), only the original trained model is used. However, the model can continuously improve in the background by collecting user prompts and asking users to rate which of two responses is better, feeding this back into the reward model and retraining the LLM.

Why This Matters

RLHF's beauty lies in its efficiency and scalability. By simplifying the annotation task for humans, it enables the training of powerful LLMs like ChatGPT, Claude, Gemini, and Mistral. It's a game-changer because it allows us to overcome the limitations of traditional fine-tuning methods that rely on extensive, manually labeled datasets. Imagine trying to teach a puppy a trick. Instead of perfectly sculpting its every move, you simply reward it when it gets closer to the desired action. That's the essence of RLHF – guiding the AI with simple feedback.

The Future is Feedback

RLHF is a really elegant blend of LLMs with a reward model that allows us to greatly simplify the annotation task performed by humans. Who knew that the secret to smarter AI was simply asking for a little human help? Now, if only we could get our algorithms to do the dishes...

Disclaimer: info@kdj.com

The information provided is not trading advice. kdj.com assumes no responsibility for any investments made based on the information in this article. Cryptocurrencies are highly volatile, so please research thoroughly and invest with caution!

If you believe content used on this website infringes your copyright, please contact us immediately (info@kdj.com) and we will remove it promptly.
