Reinforcement Learning with Human Feedback: Explained Simply for the Layman

2025/06/24 07:31

Demystifying Reinforcement Learning from Human Feedback (RLHF): discover how this technique powers ChatGPT and other advanced language models, all explained in plain terms.

ChatGPT's arrival in 2022 revolutionized our perception of AI. Its impressive capabilities spurred the creation of other powerful Large Language Models (LLMs). A key innovation behind ChatGPT's success is Reinforcement Learning from Human Feedback (RLHF). This article provides a simplified explanation of RLHF, avoiding complex reinforcement learning jargon.

NLP Development Before ChatGPT: The Bottleneck of Human Annotation

Traditionally, LLM development involved two main stages:

  1. Pre-training: Language modeling where the model predicts hidden words, learning language structure and meaning.
  2. Fine-tuning: Adapting the model for specific tasks like summarization or question answering, often requiring human-labeled data.

The fine-tuning stage faces a significant hurdle: the need for extensive human annotation. For example, creating a question-answering dataset requires humans to provide accurate answers for millions or even billions of questions. This process is time-consuming and doesn't scale well.
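
To make the two stages concrete, here is a minimal toy sketch (in PyTorch, not any model's real training code) of the objective both stages optimize: next-token prediction with cross-entropy. The tiny vocabulary, random token ids, and two-layer "model" are stand-ins for illustration only.

    import torch
    import torch.nn as nn

    VOCAB_SIZE = 100  # toy vocabulary; real LLMs use tens of thousands of tokens

    class ToyLM(nn.Module):
        """Stand-in language model: an embedding plus a linear head."""
        def __init__(self, vocab_size: int = VOCAB_SIZE, dim: int = 32):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.head = nn.Linear(dim, vocab_size)

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            return self.head(self.embed(token_ids))  # next-token logits

    model = ToyLM()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # Pre-training step: predict each next token in a stretch of unlabeled text.
    tokens = torch.randint(0, VOCAB_SIZE, (1, 16))   # stand-in for tokenized text
    logits = model(tokens[:, :-1])                   # predict from the context
    loss = loss_fn(logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))

    # Fine-tuning reuses the same loss, but the tokens would be a prompt followed
    # by a human-written answer -- the annotation that becomes the bottleneck.
    loss.backward()
    optimizer.step()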

RLHF: A Smarter Approach to Training LLMs

RLHF addresses this limitation with a clever change to the annotation task. Instead of asking humans to write answers themselves, it asks them to choose the better answer from a pair of options. This far simpler task scales well and allows continuous improvement of models like ChatGPT.
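
As an illustration, one unit of this feedback could be stored as the hypothetical record below. The field names are invented for the sketch; the point is that the annotator only picks "a" or "b" instead of writing an answer.

    from dataclasses import dataclass

    @dataclass
    class PreferenceRecord:
        """One unit of human feedback: a prompt, two model answers, and a choice."""
        prompt: str
        response_a: str
        response_b: str
        preferred: str  # "a" or "b", chosen by the human annotator

    record = PreferenceRecord(
        prompt="Explain RLHF in one sentence.",
        response_a="RLHF fine-tunes a model using human preference comparisons.",
        response_b="RLHF is a database for storing chat logs.",
        preferred="a",  # comparing two answers is far cheaper than writing one
    )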

Response Generation: Creating Options for Human Feedback

LLMs generate responses by predicting the probability of the next word in a sequence. Techniques like nucleus sampling introduce randomness, producing diverse text sequences. RLHF uses these techniques to generate pairs of responses for human evaluation.
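
Nucleus (top-p) sampling keeps only the smallest set of high-probability tokens whose probabilities sum to at least p, then samples within that set. A minimal sketch, assuming the model exposes its next-token logits as a 1-D tensor:

    import torch

    def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
        """Sample one token id from next-token logits using top-p sampling."""
        probs = torch.softmax(logits, dim=-1)
        sorted_probs, sorted_ids = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)

        # Smallest prefix of tokens whose cumulative probability reaches p.
        cutoff = int((cumulative < p).sum().item()) + 1
        kept = sorted_probs[:cutoff]
        kept = kept / kept.sum()  # renormalize within the nucleus

        choice = torch.multinomial(kept, num_samples=1).item()
        return int(sorted_ids[choice].item())

    # Sampling the same prompt twice usually yields different continuations,
    # which is how RLHF obtains pairs of responses for humans to compare.
    logits = torch.randn(50_000)  # stand-in for a model's next-token logits
    print(nucleus_sample(logits), nucleus_sample(logits))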

Reward Model: Quantifying the Quality of Responses

The human-labeled data is used to train a "reward model." This model learns to estimate how good or bad a given answer is for an initial prompt, assigning positive values to good responses and negative values to bad ones. The reward model shares the same architecture as the original LLM, but outputs a numerical score instead of text.
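
The article does not spell out how the reward model is trained, but a commonly used objective (an assumption in this sketch) is a pairwise loss that pushes the score of the human-preferred answer above the rejected one:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyRewardModel(nn.Module):
        """Stand-in reward model: an encoder backbone plus a scalar value head."""
        def __init__(self, vocab_size: int = 100, dim: int = 32):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.value_head = nn.Linear(dim, 1)  # outputs one number, not text

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            hidden = self.embed(token_ids).mean(dim=1)   # crude pooling over tokens
            return self.value_head(hidden).squeeze(-1)   # scalar reward per sequence

    reward_model = ToyRewardModel()

    # Token ids for (prompt + preferred answer) and (prompt + rejected answer).
    chosen = torch.randint(0, 100, (4, 16))
    rejected = torch.randint(0, 100, (4, 16))

    # Pairwise loss: reward(chosen) should exceed reward(rejected).
    loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
    loss.backward()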

Training the Original LLM with the Reward Model

The trained reward model then guides the training of the original LLM. The LLM generates responses, the reward model scores them, and those numerical scores are used as feedback to update the LLM's weights, refining its ability to generate high-quality responses. This process typically uses a reinforcement learning algorithm such as Proximal Policy Optimization (PPO), which, in very simplified terms, is still a gradient-based weight update in the same spirit as ordinary backpropagation-based training.
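
Real PPO adds update clipping and a penalty for drifting too far from the original model; the toy sketch below strips that away to a bare policy-gradient-style update, only to show the data flow just described: generate a response, score it with the reward model, and nudge the weights toward higher-scoring outputs. Every component here is an invented stand-in, not the actual RLHF training code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB, DIM, LEN = 100, 32, 8

    policy = nn.Sequential(nn.Embedding(VOCAB, DIM), nn.Linear(DIM, VOCAB))  # toy LLM
    reward_model = nn.Sequential(nn.Embedding(VOCAB, DIM), nn.Flatten(),
                                 nn.Linear(DIM * LEN, 1))                    # toy scorer
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

    prompt = torch.randint(0, VOCAB, (1, LEN))

    # 1. The LLM "generates" a response (one sampled token per position here,
    #    rather than true autoregressive decoding, to keep the sketch short).
    probs = F.softmax(policy(prompt), dim=-1)                 # (1, LEN, VOCAB)
    response = torch.multinomial(probs.view(-1, VOCAB), 1).view(1, LEN)

    # 2. The reward model condenses the response into a single score.
    with torch.no_grad():
        reward = reward_model(response).squeeze()

    # 3. Scale the response's log-probability by the reward so that
    #    higher-scoring responses become more likely after the update.
    log_probs = F.log_softmax(policy(prompt), dim=-1)
    chosen_log_prob = log_probs.gather(-1, response.unsqueeze(-1)).sum()
    loss = -reward * chosen_log_prob
    loss.backward()
    optimizer.step()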

Inference and Continuous Improvement

During inference (when you're using the model), only the original trained model is used. However, the model can continuously improve in the background by collecting user prompts and asking users to rate which of two responses is better, feeding this back into the reward model and retraining the LLM.
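
A hypothetical sketch of that background loop (the function names are placeholders, not a real API): log the prompt, show the user two sampled answers, record the pick, and batch those records for the next round of reward-model and LLM training.

    from typing import Callable, Dict, List

    preference_log: List[Dict[str, str]] = []

    def collect_feedback(prompt: str,
                         generate: Callable[[str], str],
                         ask_user: Callable[[str, str, str], str]) -> None:
        """Show the user two sampled answers and log which one they prefer."""
        answer_a, answer_b = generate(prompt), generate(prompt)  # two random samples
        preferred = ask_user(prompt, answer_a, answer_b)         # returns "a" or "b"
        preference_log.append(
            {"prompt": prompt, "a": answer_a, "b": answer_b, "preferred": preferred}
        )

    # Periodically, the logged comparisons retrain the reward model, which then
    # fine-tunes the deployed LLM; the model used at inference stays fixed
    # until that next training cycle ships.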

Why This Matters

RLHF's beauty lies in its efficiency and scalability. By simplifying the annotation task for humans, it enables the training of powerful LLMs like ChatGPT, Claude, Gemini, and Mistral. It's a game-changer because it allows us to overcome the limitations of traditional fine-tuning methods that rely on extensive, manually labeled datasets. Imagine trying to teach a puppy a trick. Instead of perfectly sculpting its every move, you simply reward it when it gets closer to the desired action. That's the essence of RLHF – guiding the AI with simple feedback.

The Future is Feedback

RLHF is a really elegant blend of LLMs with a reward model that allows us to greatly simplify the annotation task performed by humans. Who knew that the secret to smarter AI was simply asking for a little human help? Now, if only we could get our algorithms to do the dishes...
