2025/11/11 23:05

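Explore the future of AI infrastructure, its key trends, and the evolving technology landscape, with a focus on distributed inference, multimodal data engineering, and resource management.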

AI Infrastructure: Navigating Future Trends and the Evolving Technology Landscape

AI infrastructure and the technology landscape around it are evolving rapidly. This article synthesizes key findings and trends, focusing on distributed inference, multimodal data engineering, and efficient resource management.

Distributed Inference: The New Standard

Serving large and mixture-of-experts models has become a distributed systems problem. "Distributed inference" involves intricate orchestration: splitting computation between prompt processing and token generation, routing requests to different expert models, and managing key-value cache transfers. This complexity is now the baseline for deploying frontier models in production.
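
The split between prompt processing and token generation can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how Ray or any serving engine implements it: `KVCache`, `PrefillWorker`, `DecodeWorker`, and `serve` are hypothetical names, and the "model" only emits placeholder tokens. It shows the shape of the handoff, where one stage builds the key-value cache and another consumes it.

```python
from dataclasses import dataclass, field


@dataclass
class KVCache:
    """Hypothetical stand-in for the per-request key-value cache."""
    request_id: str
    tokens: list = field(default_factory=list)


class PrefillWorker:
    """Processes the whole prompt once and produces the cache."""

    def run(self, request_id, prompt):
        return KVCache(request_id=request_id, tokens=prompt.split())


class DecodeWorker:
    """Generates tokens one at a time from a cache built elsewhere."""

    def run(self, cache, max_new_tokens):
        generated = []
        for step in range(max_new_tokens):
            # Toy "model": emit a placeholder token and extend the cache.
            token = f"<tok{step}>"
            cache.tokens.append(token)
            generated.append(token)
        return generated


def serve(request_id, prompt, max_new_tokens=3):
    """Run one request through both stages in sequence."""
    prefill, decode = PrefillWorker(), DecodeWorker()
    cache = prefill.run(request_id, prompt)    # stage 1: prompt processing
    # In a real deployment the cache would be transferred between machines here.
    return decode.run(cache, max_new_tokens)   # stage 2: token generation
```

In production the two workers would typically sit on separate hardware, so the cache handoff becomes a network transfer that the orchestration layer has to schedule.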

Ray Tie-in: Ray's actor model allows precise placement and communication between different model parts running on separate hardware, enabling advanced routing and parallelism.

Post-Training and Reinforcement Learning Take Center Stage

The most significant improvements now occur after pre-training, including alignment, fine-tuning, and reinforcement learning. AI teams focus on reward modeling, data curation from live traffic, and rapid iteration of small variants, rather than solely on pre-training compute.

Ray Tie-in: Ray manages complex compute patterns inherent in reinforcement learning, coordinating data generation, reward modeling, and model updates. Nearly every major open-source post-training framework is built on Ray.

Multimodal Data Engineering Becomes First-Class

AI data pipelines are evolving beyond text-only workloads to process diverse data types like images, video, audio, and sensor data. This transition complicates the initial data processing stage, requiring CPUs for general transformations and GPUs for specialized tasks like generating embeddings. Data processing is now a sophisticated, heterogeneous distributed computing problem.

Ray Tie-in: Ray orchestrates tasks across heterogeneous CPU and GPU clusters, which is essential for building efficient data pipelines. The Ray Data library has been enhanced to handle large tensors and diverse data formats.

Agentic Workflows and Continuous Loops

Applications are shifting to systems that plan, invoke tools/models, check results, and learn from feedback continuously. These loops span data collection, post-training, deployment, and evaluation. Infrastructure must support coordinating long-running workflows across these stages for faster product learning cycles.

Ray Tie-in: Ray’s actor model supports long-lived agents, coordinating tool use and evaluations. The same cluster runs data preparation, training, and serving, avoiding the need to integrate multiple platforms.

Global GPU Scheduling and Cost Control

Efficient GPU utilization is crucial. Policy-driven schedulers preempt low-priority jobs during traffic spikes, resuming them later, leading to higher utilization, lower costs, and faster developer startup times.
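
The preempt-and-resume policy described above can be illustrated with a small priority scheduler. This is a minimal sketch under stated assumptions, not Anyscale's or Ray's scheduler: `Job`, `PreemptiveScheduler`, and the "lower number means more important" convention are all hypothetical choices made for the example.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass(eq=False)  # identity comparison, so list.remove targets the exact job
class Job:
    name: str
    priority: int  # assumption: lower number = more important
    gpus: int


class PreemptiveScheduler:
    def __init__(self, total_gpus):
        self.free = total_gpus
        self.running = []
        self._paused = []              # heap of (priority, sequence, job)
        self._seq = itertools.count()  # tie-breaker keeps FIFO order

    def _pause(self, job):
        heapq.heappush(self._paused, (job.priority, next(self._seq), job))

    def submit(self, job):
        """Start the job, preempting less important running jobs if needed."""
        victims = sorted(
            (j for j in self.running if j.priority > job.priority),
            key=lambda j: j.priority,
            reverse=True,              # least important first
        )
        if job.gpus > self.free + sum(v.gpus for v in victims):
            self._pause(job)           # cannot fit even after preemption
            return False
        for victim in victims:
            if self.free >= job.gpus:
                break
            self.running.remove(victim)
            self.free += victim.gpus
            self._pause(victim)
        self.running.append(job)
        self.free -= job.gpus
        return True

    def finish(self, job):
        """Release the job's GPUs, then resume paused jobs while the next one fits."""
        self.running.remove(job)
        self.free += job.gpus
        while self._paused and self._paused[0][2].gpus <= self.free:
            _, _, resumed = heapq.heappop(self._paused)
            self.running.append(resumed)
            self.free -= resumed.gpus
```

For example, a 4-GPU batch job running on a 4-GPU cluster is paused when a higher-priority 2-GPU serving job arrives, and it resumes once the serving job finishes. A real scheduler would also checkpoint the preempted job so that resuming does not lose work.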

Ray Tie-in: Anyscale’s platform uses a global resource scheduler built on Ray, providing a centralized system for managing constrained resources across an organization.

Cloud-Native and Multi-Cloud Strategies

GPU scarcity drives enterprises to multi-cloud strategies, distributing workloads across AWS, Google Cloud, Azure, and specialized GPU clouds. This addresses availability and avoids vendor lock-in but introduces complexity.

Ray Tie-in: Ray/Anyscale provides a common runtime across multiple clouds, allowing teams to chase capacity without rebuilding systems.

Evaluation-Driven Operations for Non-Deterministic Systems

AI models are non-deterministic systems whose behavior can drift in production. Continuous evaluation, tied to product metrics and fed back into post-training, is essential. Iteration speed (collect, retrain, redeploy, re-measure) is critical.
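
One simple way to picture continuous evaluation is a rolling check of a quality score against a baseline. The sketch below is an illustration only: `DriftMonitor`, its thresholds, and the idea of a single scalar score per evaluation run are assumptions made for the example, not something the article or Ray prescribes.

```python
from collections import deque


class DriftMonitor:
    """Flags drift when the rolling mean score drops below baseline - tolerance."""

    def __init__(self, baseline, tolerance, window):
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores = deque(maxlen=window)  # keeps only the latest `window` scores

    def record(self, score):
        """Add one evaluation score; return True if drift is detected."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        mean = sum(self.scores) / len(self.scores)
        # Only judge once the window is full, to avoid noisy early alarms.
        return window_full and mean < self.baseline - self.tolerance
```

A `True` result would be the signal to start the loop the paragraph describes: collect fresh data, retrain, redeploy, and measure again.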

Ray Tie-in: Ray hosts the full loop on one substrate, reusing the same primitives for data collection, evaluation jobs, training runs, and rollouts. Ray actors maintain state across evaluation runs, enabling sophisticated monitoring patterns.

Reliability at Scale on Unreliable Hardware

Operating AI infrastructure at scale requires designing for failure. Production systems must incorporate robust fault tolerance, including automatic retries, job checkpointing, and graceful handling of worker failures.
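
Automatic retries and checkpointing work together: a retry is only cheap if the job can pick up where it left off. The following is a minimal sketch using only the Python standard library. The function names, the JSON checkpoint format, and the `fail_at` fault-injection parameter are hypothetical and exist only to make the example self-contained.

```python
import json
import os


def train_with_checkpoints(total_steps, ckpt_path, fail_at=None):
    """Toy training loop that resumes from ckpt_path if a checkpoint exists."""
    state = {"step": 0, "loss": 100.0}
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated worker failure")
        state["step"] += 1
        state["loss"] *= 0.9  # stand-in for a real optimization step
        # Write to a temp file, then rename, so a crash never leaves a torn file.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, ckpt_path)
    return state


def run_with_retries(total_steps, ckpt_path, max_retries=3, fail_at=None):
    """Retry the job on failure; returns (final_state, retries_used)."""
    retries = 0
    while True:
        try:
            return train_with_checkpoints(total_steps, ckpt_path, fail_at), retries
        except RuntimeError:
            retries += 1
            if retries > max_retries:
                raise
            fail_at = None  # assumption: the injected fault is transient
```

With `fail_at=3` and `total_steps=5`, the first attempt fails after three checkpointed steps and the retry completes only the remaining two, rather than starting from zero.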

Ray Tie-in: Ray has invested significantly in reliability and fault tolerance. Its internal state management system has been re-architected for high availability, and system processes are isolated from application resource pressure. Ray’s support for checkpointing is critical for long-running training jobs.

Heterogeneous Clusters: The Baseline

Pipelines blend CPUs (parsing, aggregation) with GPUs (embeddings, vision/audio transforms) across many nodes.

Ray Tie-in: Ray handles dynamic orchestration across heterogeneous hardware, allowing developers to specify resource requirements declaratively.
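
Declarative resource requirements mean the developer states what a task needs and the system decides where it runs. In Ray this is expressed on the task or actor itself, for example with `@ray.remote(num_gpus=1)`. The sketch below does not use Ray. It is a hypothetical first-fit placer in plain Python (`Node` and `place` are names invented for the example), meant only to show the matching step.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpus: int
    gpus: int


def place(tasks, nodes):
    """Greedy first-fit placement; returns {task_name: node_name}.

    `tasks` maps a task name to its declared needs, e.g. {"cpus": 4}.
    Node capacities are decremented in place as tasks are assigned.
    """
    placement = {}
    for name, req in tasks.items():
        need_cpu = req.get("cpus", 0)
        need_gpu = req.get("gpus", 0)
        for node in nodes:
            if node.cpus >= need_cpu and node.gpus >= need_gpu:
                node.cpus -= need_cpu
                node.gpus -= need_gpu
                placement[name] = node.name
                break
        else:
            raise RuntimeError(f"no node can satisfy {name}: {req}")
    return placement
```

Given a CPU-only node and a GPU node, a parsing task that declares `{"cpus": 4}` lands on the first node with enough CPUs, while an embedding task that declares `{"cpus": 1, "gpus": 1}` is pushed to the GPU node. Production schedulers weigh many more factors, such as data locality and fragmentation.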

Accelerators and Fast Interconnects Determine Throughput

Specialized AI data centers, built around purpose-built accelerators connected by high-speed networking, are becoming standard, marking a shift from general-purpose cloud computing to specialized infrastructure.

Ray Tie-in: Ray Direct Transport enables direct GPU-to-GPU transfers, improving utilization for RL, distributed inference, and multimodal training.

The PARK Stack

A stack is coalescing into clear layers, abbreviated PARK: PyTorch and similar high-level frameworks, AI foundation models, Ray for scaling applications, and Kubernetes for provisioning resources.

Ray Tie-in: Ray unifies data processing, training, and distributed inference into one operational substrate and plugs into model stacks and Kubernetes. Ray's joining the PyTorch Foundation signals tighter integration with the training and serving ecosystem.

Decentralized AI Infrastructure

Initiatives like Pi Network's proof-of-concept with OpenMind explore decentralized node architectures for AI training, potentially democratizing access to AI infrastructure.

Final Thoughts

The future of AI infrastructure is dynamic and exciting, with trends pointing toward more efficient, scalable, and accessible systems. Keep experimenting and pushing the boundaries – the possibilities are endless!

Original source: Substack
