  • Market Cap: $3.2264T 7.740%
  • 24h Volume: $162.8717B 32.210%
  • Fear & Greed Index:
  • bitcoin: $102645.326745 USD (3.86%)
  • ethereum: $2235.824185 USD (20.09%)
  • tether: $0.999978 USD (-0.04%)
  • xrp: $2.318227 USD (6.77%)
  • bnb: $626.285788 USD (2.98%)
  • solana: $162.866519 USD (8.45%)
  • usd-coin: $1.000142 USD (0.00%)
  • dogecoin: $0.196724 USD (10.69%)
  • cardano: $0.771249 USD (9.92%)
  • tron: $0.256040 USD (2.64%)
  • sui: $3.963536 USD (10.47%)
  • chainlink: $15.896137 USD (10.95%)
  • avalanche: $22.320543 USD (11.21%)
  • stellar: $0.296058 USD (10.87%)
  • shiba-inu: $0.000014 USD (9.85%)

Cryptocurrency News

Multimodal AI Evolves to Create Systems That Can Understand, Generate, and Respond Using Multiple Data Types

2025/05/09 14:26

Multimodal AI is rapidly evolving to create systems that can understand, generate, and respond using multiple data types within a single conversation or task. This capability, crucial for seamless human-AI communication, is being actively researched as users increasingly engage AI for tasks like image captioning, text-based photo editing, and style transfers.

A major obstacle in this area stems from the misalignment between language-based semantic understanding and the visual fidelity required in image synthesis or editing. When separate models handle different modalities, their outputs often become inconsistent, leading to poor coherence or inaccuracies. For instance, the visual model might render a compelling image but fail to follow nuanced instructions, while the language model might understand the prompt but be unable to shape it visually.

This approach also demands significant compute resources and retraining efforts for each domain. Thus, the inability to seamlessly link vision and language into a coherent and interactive experience remains one of the fundamental problems in advancing intelligent systems.

In recent attempts to bridge this gap, researchers have combined architectures with fixed visual encoders and separate decoders that function through diffusion-based techniques. Tools such as TokenFlow and Janus integrate token-based language models with image generation backends, typically emphasizing pixel accuracy over semantic depth. While these approaches can produce visually rich content, they often miss the contextual nuances of user input.

Others, like GPT-4o, have moved toward native image generation capabilities but still operate with limitations in deeply integrated understanding. The friction lies in translating abstract text prompts into meaningful and context-aware visuals in a fluid interaction without splitting the pipeline into disjointed parts.

Now, researchers from Inclusion AI, Ant Group have presented Ming-Lite-Uni, an open-source framework designed to unify text and vision through an autoregressive multimodal structure. The system features a native autoregressive model built on top of a fixed large language model and a fine-tuned diffusion image generator. This design is based on two core frameworks: MetaQueries and M2-omni.
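
To make that division of labor concrete, here is a minimal PyTorch-style sketch of the composition described above: a frozen language backbone produces hidden states, and a trainable diffusion image generator consumes them as conditioning. The class, module, and argument names are hypothetical placeholders for illustration, not the released Ming-Lite-Uni code.

```python
import torch
import torch.nn as nn

class MingLiteUniSketch(nn.Module):
    """Illustrative composition only: a frozen LLM backbone supplies hidden
    states, a small projection turns them into visual conditioning tokens,
    and a trainable diffusion decoder generates the image. All names are
    hypothetical stand-ins, not the official API."""

    def __init__(self, llm: nn.Module, diffusion_decoder: nn.Module, hidden_dim: int = 2048):
        super().__init__()
        self.llm = llm
        for p in self.llm.parameters():            # the language model stays fixed
            p.requires_grad = False
        self.visual_head = nn.Linear(hidden_dim, hidden_dim)   # autoregressive visual conditioning
        self.diffusion_decoder = diffusion_decoder             # the fine-tuned image generator

    def forward(self, token_ids, noisy_latents, timesteps):
        with torch.no_grad():                      # no gradients flow into the frozen LLM
            hidden = self.llm(token_ids)           # (batch, seq, hidden_dim) hidden states
        condition = self.visual_head(hidden)       # tokens that condition image synthesis
        return self.diffusion_decoder(noisy_latents, timesteps, condition)
```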

Ming-Lite-Uni introduces an innovative component of multi-scale learnable tokens, which act as interpretable visual units, and a corresponding multi-scale alignment strategy to maintain coherence between various image scales. The researchers have provided all the model weights and implementation openly to support community research, positioning Ming-Lite-Uni as a prototype moving toward general artificial intelligence.
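
As a rough illustration of what multi-scale learnable tokens can look like in code, the sketch below keeps one bank of trainable query vectors per image scale. The scale sizes, embedding dimension, and naming are assumptions for exposition, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiScaleLearnableTokens(nn.Module):
    """Hypothetical sketch: one bank of learnable query tokens per image scale.
    Downstream layers attend from these tokens into the fused text-image sequence."""

    def __init__(self, scales=(4, 8, 16), dim: int = 1024):
        super().__init__()
        # e.g. the 4x4 scale contributes 16 learnable tokens, 8x8 contributes 64, ...
        self.token_banks = nn.ParameterDict({
            f"scale_{s}": nn.Parameter(torch.randn(s * s, dim) * 0.02) for s in scales
        })

    def forward(self, batch_size: int):
        # Return per-scale query tokens expanded over the batch.
        return {name: bank.unsqueeze(0).expand(batch_size, -1, -1)
                for name, bank in self.token_banks.items()}
```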

The core mechanism behind the model involves compressing visual inputs into structured token sequences across multiple scales, such as 4×4, 8×8, and 16×16 image patches, each representing different levels of detail, from layout to textures. These tokens are processed alongside text tokens using a large autoregressive transformer. Each resolution level is marked with unique start and end tokens and assigned custom positional encodings.
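
A simplified way to picture that sequence layout is sketched below: each scale's patch tokens are wrapped in their own start and end markers and concatenated into a single stream for the transformer (the per-scale positional encodings are omitted for brevity). The boundary token ids and the `codebook_lookup` tokenizer are hypothetical placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical ids for the per-scale boundary markers; real vocabularies differ.
SCALE_TOKENS = {4: (50001, 50002), 8: (50003, 50004), 16: (50005, 50006)}

def multiscale_image_tokens(image_feat: torch.Tensor, codebook_lookup, scales=(4, 8, 16)):
    """Compress a feature map (C, H, W) into one flat token sequence:
    <start_4> ... <end_4> <start_8> ... <end_8> <start_16> ... <end_16>.
    `codebook_lookup` is a stand-in for whatever discrete tokenizer the real
    model uses; it is expected to return a 1-D LongTensor of token ids."""
    sequence = []
    for s in scales:
        start_id, end_id = SCALE_TOKENS[s]
        # Pool the feature map down to an s x s grid of patch features.
        pooled = F.adaptive_avg_pool2d(image_feat.unsqueeze(0), output_size=(s, s))
        patches = pooled.squeeze(0).flatten(1).T              # (s*s, C)
        ids = codebook_lookup(patches).long()                 # (s*s,) discrete ids
        sequence.append(torch.tensor([start_id]))
        sequence.append(ids)
        sequence.append(torch.tensor([end_id]))
    return torch.cat(sequence)    # processed alongside text tokens by the transformer
```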

The model employs a multi-scale representation alignment strategy that aligns intermediate and output features through a mean squared error loss, ensuring consistency across layers. This technique boosts image reconstruction quality by over 2 dB in PSNR and improves generation evaluation (GenEval) scores by 1.5%.
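
In code, that idea reduces to an auxiliary mean-squared-error term computed per scale. The sketch below is a generic version of such a loss; the dictionary keys, weighting, and the choice to detach the target are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def multiscale_alignment_loss(intermediate_feats: dict, output_feats: dict, weight: float = 1.0):
    """Penalise the MSE between intermediate transformer features and the
    corresponding output-side features at each scale so all scales stay
    consistent. Keys like 'scale_4' are illustrative."""
    loss = torch.tensor(0.0)
    for name, inter in intermediate_feats.items():
        target = output_feats[name]
        # Detaching the target pulls intermediate features toward the output
        # representation rather than the other way around (one common choice).
        loss = loss + F.mse_loss(inter, target.detach())
    return weight * loss / max(len(intermediate_feats), 1)
```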

Unlike other systems that retrain all components, Ming-Lite-Uni keeps the language model frozen and only fine-tunes the image generator, allowing faster updates and more efficient scaling. The system was tested on various multimodal tasks, including text-to-image generation, style transfer, and detailed image editing using instructions like “make the sheep wear tiny sunglasses” or “remove two of the flowers in the image.”
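
Practically, "freeze the language model, fine-tune only the image generator" comes down to which parameters the optimizer ever sees. A minimal sketch, assuming the placeholder module layout from the earlier snippet:

```python
import torch

def build_optimizer(model, lr: float = 1e-4):
    """Hand only the still-trainable parameters (the image-generator side) to
    the optimizer; LLM parameters are marked non-trainable and never updated.
    `model.llm` follows the hypothetical layout above, not the released code."""
    for p in model.llm.parameters():
        p.requires_grad = False                                # language model stays frozen
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)                 # updates touch only the generator
```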

The model handled these tasks with high fidelity and contextual fluency. It maintained strong visual quality even when given abstract or stylistic prompts such as “Hayao Miyazaki’s style” or “Adorable 3D.”

The training set spanned over 2.25 billion samples, combining LAION-5B (1.55B), COYO (62M), and Zero (151M), supplemented with filtered samples from Midjourney (5.4M), Wukong (35M), and other web sources (441M). Furthermore, it incorporated fine-grained datasets for aesthetic assessment, including AVA (255K samples), TAD66K (66K), AesMMIT (21.9K), and APDD (10K), which enhanced the model’s ability to generate visually appealing outputs according to human aesthetic standards.

The model combines semantic robustness with high-resolution image generation in a single pass. It achieves this by aligning image and text representations at the token level across scales, rather than depending on a fixed encoder-decoder split. The approach allows autoregressive models to carry out complex editing tasks with contextual guidance, which was previously hard to achieve. FlowMatching loss and scale-specific boundary markers support better interaction between the transformer and the diffusion layers.
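
For readers unfamiliar with the term, a generic (rectified) flow-matching training objective looks roughly like the sketch below; the exact variant and the `velocity_model` signature used by Ming-Lite-Uni are assumptions here, not details from the paper.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_model, clean_latents: torch.Tensor, condition: torch.Tensor):
    """Generic flow-matching objective: interpolate between noise and data
    along a straight line and regress the model onto the constant velocity
    of that path, conditioned on the multimodal tokens."""
    noise = torch.randn_like(clean_latents)                    # x_0 ~ N(0, I)
    t = torch.rand(clean_latents.size(0), device=clean_latents.device)
    t_ = t.view(-1, *([1] * (clean_latents.dim() - 1)))        # broadcast over feature dims
    x_t = (1.0 - t_) * noise + t_ * clean_latents              # straight-line interpolation
    target_velocity = clean_latents - noise                    # d x_t / d t along that path
    pred_velocity = velocity_model(x_t, t, condition)          # placeholder decoder signature
    return F.mse_loss(pred_velocity, target_velocity)
```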

Overall, the model strikes a rare balance between language comprehension and visual output, positioning it as a significant step toward practical multimodal AI systems.

Several Key Takeaways from the Research on Ming-Lite-Uni:

Disclaimer: info@kdj.com

The information provided is not trading advice. kdj.com assumes no liability for any investments made based on the information provided in this article. Cryptocurrencies are highly volatile, so it is strongly recommended that you invest cautiously after thorough research.

If you believe the content used on this website infringes your copyright, please contact us immediately (info@kdj.com) and we will remove it promptly.
