Multimodal AI is rapidly evolving to create systems that can understand, generate, and respond using multiple data types within a single conversation or task. This capability, crucial for seamless human-AI communication, is being actively researched as users increasingly engage AI for tasks like image captioning, text-based photo editing, and style transfers.
A major obstacle in this area stems from the misalignment between language-based semantic understanding and the visual fidelity required in image synthesis or editing. When separate models handle different modalities, the outputs often become inconsistent, leading to poor coherence or inaccuracies. For instance, a visual model might excel at rendering an image but fail to comprehend nuanced instructions, while a language model might understand the prompt but be unable to shape it visually.
This approach also demands significant compute resources and retraining efforts for each domain. Thus, the inability to seamlessly link vision and language into a coherent and interactive experience remains one of the fundamental problems in advancing intelligent systems.
In recent attempts to bridge this gap, researchers have combined architectures that pair fixed visual encoders with separate diffusion-based decoders. Tools such as TokenFlow and Janus integrate token-based language models with image generation backends, typically emphasizing pixel accuracy over semantic depth. While these approaches can produce visually rich content, they often miss the contextual nuances of user input.
Others, like GPT-4o, have moved toward native image generation capabilities but still operate with limitations in deeply integrated understanding. The friction lies in translating abstract text prompts into meaningful and context-aware visuals in a fluid interaction without splitting the pipeline into disjointed parts.
Now, researchers from Inclusion AI (Ant Group) have presented Ming-Lite-Uni, an open-source framework designed to unify text and vision through an autoregressive multimodal structure. The system features a native autoregressive model built on top of a fixed large language model and a fine-tuned diffusion image generator. This design is based on two core frameworks: MetaQueries and M2-omni.
Ming-Lite-Uni introduces an innovative component of multi-scale learnable tokens, which act as interpretable visual units, and a corresponding multi-scale alignment strategy to maintain coherence between various image scales. The researchers have provided all the model weights and implementation openly to support community research, positioning Ming-Lite-Uni as a prototype moving toward general artificial intelligence.
The core mechanism behind the model involves compressing visual inputs into structured token sequences across multiple scales, such as 4×4, 8×8, and 16×16 image patches, each representing different levels of detail, from layout to textures. These tokens are processed alongside text tokens using a large autoregressive transformer. Each resolution level is marked with unique start and end tokens and assigned custom positional encodings.
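To make this concrete, the sketch below shows one plausible way to build such a multi-scale visual token sequence, with scale-specific start/end tokens and positional encodings, before it is concatenated with text tokens. It is a minimal illustration under assumed names and dimensions, not the actual Ming-Lite-Uni implementation.

```python
# Minimal sketch (illustrative only): pool an image to 4x4, 8x8, and 16x16 grids,
# flatten each grid into visual tokens, wrap every scale in its own learnable
# start/end tokens, and add a scale-specific positional encoding.
import torch
import torch.nn as nn

class MultiScaleVisualTokenizer(nn.Module):
    def __init__(self, in_channels=3, dim=1024, scales=(4, 8, 16)):
        super().__init__()
        self.scales = scales
        self.proj = nn.Linear(in_channels, dim)  # project pooled pixels to token dimension
        self.start = nn.ParameterDict({str(s): nn.Parameter(torch.randn(1, 1, dim)) for s in scales})
        self.end = nn.ParameterDict({str(s): nn.Parameter(torch.randn(1, 1, dim)) for s in scales})
        self.pos = nn.ParameterDict({str(s): nn.Parameter(torch.randn(1, s * s, dim)) for s in scales})

    def forward(self, image):                                         # image: (B, 3, H, W)
        b = image.shape[0]
        seqs = []
        for s in self.scales:
            grid = nn.functional.adaptive_avg_pool2d(image, s)        # (B, 3, s, s)
            tokens = grid.flatten(2).transpose(1, 2)                  # (B, s*s, 3)
            tokens = self.proj(tokens) + self.pos[str(s)]             # scale-specific positions
            start = self.start[str(s)].expand(b, -1, -1)
            end = self.end[str(s)].expand(b, -1, -1)
            seqs.append(torch.cat([start, tokens, end], dim=1))       # <start_s> ... <end_s>
        return torch.cat(seqs, dim=1)                                 # coarse-to-fine sequence
```

The resulting coarse-to-fine sequence would then be interleaved with text tokens and fed to the autoregressive transformer, which is the interaction pattern the paragraph above describes.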
The model employs a multi-scale representation alignment strategy that aligns intermediate and output features through a mean squared error loss, ensuring consistency across layers. This technique boosts image reconstruction quality by over 2 dB in PSNR and improves generation evaluation (GenEval) scores by 1.5%.
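A plausible form of this alignment objective is sketched below, assuming selected intermediate hidden states are projected and matched to the final visual features with mean squared error; the function and projection head are hypothetical helpers, not the paper's exact code.

```python
# Illustrative multi-scale representation alignment loss (assumed form).
import torch
import torch.nn as nn
import torch.nn.functional as F

def alignment_loss(intermediate_feats, final_feats, proj: nn.Module):
    """intermediate_feats: list of (B, N, D_hidden) tensors from selected layers.
    final_feats: (B, N, D_out) reference features from the output layer.
    proj: projection head mapping D_hidden -> D_out (hypothetical helper)."""
    target = final_feats.detach()                       # keep the reference fixed
    losses = [F.mse_loss(proj(h), target) for h in intermediate_feats]
    return torch.stack(losses).mean()
```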
Unlike other systems that retrain all components, Ming-Lite-Uni keeps the language model frozen and only fine-tunes the image generator, allowing faster updates and more efficient scaling. The system was tested on various multimodal tasks, including text-to-image generation, style transfer, and detailed image editing using instructions like “make the sheep wear tiny sunglasses” or “remove two of the flowers in the image.”
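The frozen-LLM setup can be expressed in a few lines; the names below are hypothetical, but they capture the idea of excluding the language model from the optimizer while only the image generator receives gradient updates.

```python
# Sketch of the training setup described above (parameter names are hypothetical).
import torch

def build_optimizer(language_model, image_generator, lr=1e-5):
    for p in language_model.parameters():
        p.requires_grad = False                  # frozen LLM: no retraining cost
    trainable = [p for p in image_generator.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)   # fine-tune only the generator
```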
The model handled these tasks with high fidelity and contextual fluency. It maintained strong visual quality even when given abstract or stylistic prompts such as “Hayao Miyazaki’s style” or “Adorable 3D.”
The training set spanned over 2.25 billion samples, combining LAION-5B (1.55B), COYO (62M), and Zero (151M), supplemented with filtered samples from Midjourney (5.4M), Wukong (35M), and other web sources (441M). Furthermore, it incorporated fine-grained datasets for aesthetic assessment, including AVA (255K samples), TAD66K (66K), AesMMIT (21.9K), and APDD (10K), which enhanced the model’s ability to generate visually appealing outputs according to human aesthetic standards.
The model combines semantic robustness with high-resolution image generation in a single pass. It achieves this by aligning image and text representations at the token level across scales, rather than depending on a fixed encoder-decoder split. The approach allows autoregressive models to carry out complex editing tasks with contextual guidance, which was previously hard to achieve. FlowMatching loss and scale-specific boundary markers support better interaction between the transformer and the diffusion layers.
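As a rough illustration, the sketch below shows a standard rectified flow-matching objective of the kind referenced here, conditioned on transformer outputs; the exact loss used in Ming-Lite-Uni may differ, and the model signature is assumed.

```python
# Hedged sketch of a rectified flow-matching loss (standard formulation).
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    """x1: clean image latents (B, ...); cond: conditioning tokens from the transformer."""
    x0 = torch.randn_like(x1)                                          # noise sample
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                                       # linear interpolation path
    target_velocity = x1 - x0                                          # constant velocity along the path
    pred_velocity = model(xt, t.flatten(), cond)                       # model predicts the velocity field
    return F.mse_loss(pred_velocity, target_velocity)
```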
Overall, the model strikes a rare balance between language comprehension and visual output, positioning it as a significant step toward practical multimodal AI systems.
Several Key Takeaways from the Research on Ming-Lite-Uni: