Market Cap: $3.2264T 7.740%
Volume(24h): $162.8717B 32.210%
Fear & Greed Index:
  • bitcoin: $102645.326745 USD (3.86%)
  • ethereum: $2235.824185 USD (20.09%)
  • tether: $0.999978 USD (-0.04%)
  • xrp: $2.318227 USD (6.77%)
  • bnb: $626.285788 USD (2.98%)
  • solana: $162.866519 USD (8.45%)
  • usd-coin: $1.000142 USD (0.00%)
  • dogecoin: $0.196724 USD (10.69%)
  • cardano: $0.771249 USD (9.92%)
  • tron: $0.256040 USD (2.64%)
  • sui: $3.963536 USD (10.47%)
  • chainlink: $15.896137 USD (10.95%)
  • avalanche: $22.320543 USD (11.21%)
  • stellar: $0.296058 USD (10.87%)
  • shiba-inu: $0.000014 USD (9.85%)
Cryptocurrency News Articles

Multimodal AI Evolves to Create Systems That Can Understand, Generate and Respond Using Multiple Data Types

May 09, 2025 at 02:26 pm

Multimodal AI is rapidly evolving to create systems that can understand, generate, and respond using multiple data types within a single conversation or task.

Multimodal AI is rapidly evolving to create systems that can understand, generate, and respond using multiple data types within a single conversation or task. This capability, crucial for seamless human-AI communication, is being actively researched as users increasingly engage AI for tasks like image captioning, text-based photo editing, and style transfers.

A major obstacle in this area stems from the misalignment between language-based semantic understanding and the visual fidelity required in image synthesis or editing. When separate models handle different modalities, the outputs often become inconsistent, leading to poor coherence or inaccuracies. For instance, a visual model might render an image well yet fail to follow nuanced instructions, while a language model might understand the prompt but be unable to realize it visually.

Keeping modalities in separate models also demands significant compute resources and retraining effort for each domain. Thus, the inability to seamlessly link vision and language into a coherent, interactive experience remains one of the fundamental problems in advancing intelligent systems.

In recent attempts to bridge this gap, researchers have combined architectures with fixed visual encoders and separate decoders that function through diffusion-based techniques. Tools such as TokenFlow and Janus integrate token-based language models with image generation backends, typically emphasizing pixel accuracy over semantic depth. While these approaches can produce visually rich content, they often miss the contextual nuances of user input.

Others, like GPT-4o, have moved toward native image generation capabilities but still operate with limitations in deeply integrated understanding. The friction lies in translating abstract text prompts into meaningful and context-aware visuals in a fluid interaction without splitting the pipeline into disjointed parts.

Now, researchers from Inclusion AI and Ant Group have presented Ming-Lite-Uni, an open-source framework designed to unify text and vision through an autoregressive multimodal structure. The system features a native autoregressive model built on top of a fixed large language model and a fine-tuned diffusion image generator. This design is based on two core frameworks: MetaQueries and M2-omni.

Ming-Lite-Uni introduces multi-scale learnable tokens, which act as interpretable visual units, and a corresponding multi-scale alignment strategy to maintain coherence across image scales. The researchers have released all model weights and implementation code openly to support community research, positioning Ming-Lite-Uni as a prototype moving toward general artificial intelligence.

The core mechanism behind the model involves compressing visual inputs into structured token sequences across multiple scales, such as 4×4, 8×8, and 16×16 image patches, each representing different levels of detail, from layout to textures. These tokens are processed alongside text tokens using a large autoregressive transformer. Each resolution level is marked with unique start and end tokens and assigned custom positional encodings.
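
To make that multi-scale token layout concrete, here is a minimal, hypothetical sketch of how 4×4, 8×8, and 16×16 patch grids could be flattened into one autoregressive sequence with per-scale start/end markers alongside text tokens. The special-token IDs, vocabulary layout, and helper names below are illustrative assumptions, not Ming-Lite-Uni's actual implementation.

```python
# Hypothetical sketch: flatten multi-scale visual tokens into one autoregressive
# sequence. Scale sizes follow the article (4x4, 8x8, 16x16); the special-token
# IDs and layout are assumptions for illustration only.

TEXT_VOCAB = 32_000                 # assumed size of the text vocabulary
SCALES = [4, 8, 16]                 # token grids per scale: coarse layout -> fine texture

# Reserve one <scale_start>/<scale_end> pair per scale after the text vocabulary.
SPECIAL = {s: (TEXT_VOCAB + 2 * i, TEXT_VOCAB + 2 * i + 1) for i, s in enumerate(SCALES)}

def build_sequence(text_ids, visual_codes):
    """Interleave text tokens with multi-scale visual tokens.

    text_ids:      list[int], tokenized prompt
    visual_codes:  dict mapping scale -> list of s*s quantized visual token ids
    Returns a flat token list plus a per-token scale tag (None for text) that a
    custom positional encoding could key off, as the article describes.
    """
    seq, scale_tag = list(text_ids), [None] * len(text_ids)
    for s in SCALES:
        start, end = SPECIAL[s]
        codes = visual_codes[s]
        assert len(codes) == s * s, f"expected {s*s} tokens for scale {s}x{s}"
        seq += [start, *codes, end]
        scale_tag += [s] * (len(codes) + 2)
    return seq, scale_tag

# Usage: a 12-token prompt plus dummy visual codes at each scale.
demo_codes = {s: list(range(s * s)) for s in SCALES}
tokens, tags = build_sequence(list(range(12)), demo_codes)
print(len(tokens))  # 12 text + (16+2) + (64+2) + (256+2) = 354 tokens
```

Ordering the scales from coarse to fine lets the autoregressive transformer condition fine-grained texture tokens on the already-generated layout tokens.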

The model employs a multi-scale representation alignment strategy that aligns intermediate and output features through a mean squared error loss, ensuring consistency across layers. This technique boosts image reconstruction quality by over 2 dB in PSNR and improves generation evaluation (GenEval) scores by 1.5%.
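
A minimal sketch of what such an alignment term could look like, assuming intermediate hidden states are regressed onto final-layer features over each scale's token span with a mean squared error; the layer choices and token slices are assumptions for illustration, not the exact loss from the paper.

```python
import torch
import torch.nn.functional as F

def multiscale_alignment_loss(hidden_states, scale_slices, final_layer=-1, mid_layer=None):
    """Hypothetical multi-scale representation alignment.

    hidden_states: list of [B, T, D] tensors, one per transformer layer
    scale_slices:  dict scale -> slice covering that scale's visual tokens in T
    Aligns an intermediate layer's features with the final layer's features over
    each scale's token span using MSE, as the article describes at a high level.
    """
    mid_layer = mid_layer if mid_layer is not None else len(hidden_states) // 2
    target = hidden_states[final_layer].detach()   # treat final features as the target
    source = hidden_states[mid_layer]
    loss = 0.0
    for s, sl in scale_slices.items():
        loss = loss + F.mse_loss(source[:, sl, :], target[:, sl, :])
    return loss / max(len(scale_slices), 1)

# Usage with dummy activations: 8 layers, batch 2, 354 tokens (see the sketch above), dim 64.
hs = [torch.randn(2, 354, 64) for _ in range(8)]
slices = {4: slice(12, 30), 8: slice(30, 96), 16: slice(96, 354)}
print(multiscale_alignment_loss(hs, slices).item())
```

For context, since PSNR = 10·log10(MAX²/MSE), the reported gain of over 2 dB corresponds to roughly a 37% reduction in pixel-level reconstruction error.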

Unlike other systems that retrain all components, Ming-Lite-Uni keeps the language model frozen and only fine-tunes the image generator, allowing faster updates and more efficient scaling. The system was tested on various multimodal tasks, including text-to-image generation, style transfer, and detailed image editing using instructions like “make the sheep wear tiny sunglasses” or “remove two of the flowers in the image.”
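
The training setup described here, a frozen language model with only the diffusion image generator updated, can be sketched roughly as follows; the stand-in modules, optimizer settings, and placeholder loss are assumptions, not the project's actual code.

```python
import torch

# Hypothetical stand-ins for the real components: a pretrained autoregressive
# language model and a diffusion-based image generator.
language_model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)
image_generator = torch.nn.Sequential(
    torch.nn.Linear(64, 256), torch.nn.GELU(), torch.nn.Linear(256, 64)
)

# Freeze the language model: its weights receive no gradient updates.
for p in language_model.parameters():
    p.requires_grad_(False)
language_model.eval()

# Only the image generator's parameters go to the optimizer, so updates stay
# cheap and the language backbone keeps its pretrained behavior.
optimizer = torch.optim.AdamW(
    (p for p in image_generator.parameters() if p.requires_grad), lr=1e-4
)

tokens = torch.randn(2, 10, 64)                      # dummy embedded sequence
with torch.no_grad():                                # no graph needed for the frozen backbone
    context = language_model(tokens)
loss = image_generator(context).pow(2).mean()        # placeholder generation loss
loss.backward()
optimizer.step()
```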

The model handled these tasks with high fidelity and contextual fluency. It maintained strong visual quality even when given abstract or stylistic prompts such as “Hayao Miyazaki’s style” or “Adorable 3D.”

The training set spanned over 2.25 billion samples, combining LAION-5B (1.55B), COYO (62M), and Zero (151M), supplemented with filtered samples from Midjourney (5.4M), Wukong (35M), and other web sources (441M). Furthermore, it incorporated fine-grained datasets for aesthetic assessment, including AVA (255K samples), TAD66K (66K), AesMMIT (21.9K), and APDD (10K), which enhanced the model’s ability to generate visually appealing outputs according to human aesthetic standards.

The model combines semantic robustness with high-resolution image generation in a single pass. It achieves this by aligning image and text representations at the token level across scales, rather than depending on a fixed encoder-decoder split. The approach allows autoregressive models to carry out complex editing tasks with contextual guidance, which was previously hard to achieve. FlowMatching loss and scale-specific boundary markers support better interaction between the transformer and the diffusion layers.
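
The flow-matching objective mentioned here trains a velocity network to carry noise to images along straight interpolation paths; below is a generic, conditional flow-matching sketch meant only to illustrate the idea, where the velocity model, conditioning, and tensor shapes are illustrative assumptions rather than the paper's exact loss.

```python
import torch

def flow_matching_loss(velocity_model, x1, cond):
    """Generic (rectified-flow style) flow-matching loss for illustration.

    velocity_model: callable(x_t, t, cond) -> predicted velocity, same shape as x1
    x1:             clean image latents, [B, C, H, W]
    cond:           conditioning features from the transformer (e.g. token embeddings)
    """
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                          # noise sample
    t = torch.rand(b, device=x1.device).view(b, 1, 1, 1)
    x_t = (1.0 - t) * x0 + t * x1                      # straight-line interpolation
    target_velocity = x1 - x0                          # constant velocity along the path
    pred = velocity_model(x_t, t, cond)
    return torch.mean((pred - target_velocity) ** 2)

# Usage with a trivial stand-in velocity model.
net = lambda x_t, t, cond: x_t * 0.0                   # placeholder: predicts zero velocity
x1 = torch.randn(2, 4, 8, 8)
cond = torch.randn(2, 16, 64)
print(flow_matching_loss(net, x1, cond).item())
```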

Overall, the model strikes a rare balance between language comprehension and visual output, positioning it as a significant step toward practical multimodal AI systems.

Several Key Takeaways from the Research on Ming-Lite-Uni:

Disclaimer: info@kdj.com

The information provided is not trading advice. kdj.com does not assume any responsibility for any investments made based on the information provided in this article. Cryptocurrencies are highly volatile, so it is strongly recommended that you invest with caution after thorough research.

If you believe that the content used on this website infringes your copyright, please contact us immediately (info@kdj.com) and we will delete it promptly.

Other articles published on May 10, 2025