Both WEKA and VAST Data aim to solve the problem of AI inferencing context history overflowing GPU memory and slowing down large language model (LLM) responsiveness.

Feb 27, 2025 at 12:08 am

VAST Data co-founder Jeff Denworth writes: “As a chat or agentic AI session grows in length across multiple prompts and responses, the history that is created is known as context.”

As a chat or agentic AI session grows in length across multiple prompts and responses, the history that is created is known as context. Context is created and stored using self-attention mechanisms that store session history as a series of vectorized tokens (stored as keys and values) that consume considerable amounts of GPU and CPU memory, often leveraging key-value caches.
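For readers unfamiliar with the mechanism, a toy sketch of such a key-value cache is shown below. It is purely illustrative, assuming a simple per-layer structure, and is not either vendor's implementation.

```python
# Toy per-layer KV cache (illustrative only): each new token appends one key
# vector and one value vector per layer, so memory grows linearly with context.
import numpy as np

class KVCache:
    def __init__(self, num_layers: int, num_kv_heads: int, head_dim: int):
        self.num_layers = num_layers
        self.num_kv_heads = num_kv_heads
        self.head_dim = head_dim
        # keys[layer] / values[layer]: lists of (num_kv_heads, head_dim) arrays
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def append(self, layer: int, k: np.ndarray, v: np.ndarray) -> None:
        """Store the key/value vectors produced for the newest token."""
        self.keys[layer].append(k)
        self.values[layer].append(v)

    def size_bytes(self) -> int:
        """Total bytes held by the cache across all layers and tokens."""
        return sum(k.nbytes + v.nbytes
                   for layer in range(self.num_layers)
                   for k, v in zip(self.keys[layer], self.values[layer]))

# Simulate a growing session: every token adds K and V at every layer.
cache = KVCache(num_layers=4, num_kv_heads=8, head_dim=128)
for _ in range(1000):                      # 1,000 tokens of context
    for layer in range(cache.num_layers):
        cache.append(layer,
                     np.zeros((8, 128), dtype=np.float16),
                     np.zeros((8, 128), dtype=np.float16))
print(f"cache size: {cache.size_bytes() / 1e6:.1f} MB")  # grows with context
```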

The greater the length of the session and the richer the chat history, the larger the context that must be loaded into a GPU to serve an instance of the model. In addition, as techniques like reasoning tokens are introduced, models must process significantly longer sequences, putting additional strain on memory and compute resources.

A fundamental limitation in modern AI inference is the amount of memory available – GPUs process vast amounts of data in parallel, but the memory available per GPU is fixed. As models grow in complexity and require longer contexts, their memory footprint expands beyond what a single GPU can handle.

This results in inefficiencies where GPUs are memory-starved, causing significant bottlenecks in token generation. This is a particular challenge during the decode phase of Large Language Models (LLMs), which are memory-bound, requiring fast data retrieval to process input prompts efficiently.
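A back-of-envelope calculation illustrates the scale. Assuming figures typical of a Llama 3.1 70B-class model (80 layers, 8 grouped-query KV heads, a head dimension of 128, FP16 values), a 100,000-token context needs roughly 33 GB for the KV cache alone, before any model weights are counted:

```python
# Rough KV-cache sizing (illustrative figures, not vendor-published numbers).
num_layers   = 80       # transformer layers
num_kv_heads = 8        # grouped-query attention KV heads
head_dim     = 128      # dimension per head
bytes_per_el = 2        # FP16

# Each token stores one key and one value vector per KV head, per layer.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el
context_tokens  = 100_000

cache_bytes = bytes_per_token * context_tokens
print(f"per-token KV footprint: {bytes_per_token / 1024:.0f} KiB")   # ~320 KiB
print(f"100k-token cache:       {cache_bytes / 1e9:.1f} GB")         # ~32.8 GB
```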

One of the biggest challenges emerging in inference is the impact of expanding context lengths on compute requirements.

To meet this challenge, WEKA has focused on speeding up token load time and VAST on being picky about which tokens to load first.

Testing the Llama 3.1 70B model, WEKA found it took about 24 seconds to load a 100,000-token prompt into a key-value (KV) cache in a prefill phase to initialize the model before any output was generated. It set out to load and apply the cache at scale, demonstrating how extending GPU memory to ultra-fast storage can dramatically improve token processing efficiency.
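WEKA has not published the exact code path, but the general idea of extending GPU memory into fast storage can be sketched as persisting the prefilled key/value tensors and mapping them back in when the session resumes. The paths and helper names below are hypothetical.

```python
# Generic sketch of offloading a prefilled KV cache to fast storage and
# reloading it later (hypothetical paths; not WEKA's implementation).
import numpy as np
from pathlib import Path

CACHE_DIR = Path("/mnt/fast-storage/kv-cache")   # hypothetical mount point

def offload_kv(session_id: str, keys: np.ndarray, values: np.ndarray) -> None:
    """Persist the prefilled keys/values so GPU memory can be reclaimed."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    np.save(CACHE_DIR / f"{session_id}.k.npy", keys)
    np.save(CACHE_DIR / f"{session_id}.v.npy", values)

def reload_kv(session_id: str) -> tuple[np.ndarray, np.ndarray]:
    """Map the cached tensors back in instead of re-running prefill."""
    keys   = np.load(CACHE_DIR / f"{session_id}.k.npy", mmap_mode="r")
    values = np.load(CACHE_DIR / f"{session_id}.v.npy", mmap_mode="r")
    return keys, values
```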

The ultra-fast storage was an eight-node WEKApod with PCIe Gen 5 connectivity linked to an Nvidia DGX H100 server via Nvidia’s Quantum-2 QM9700 64-port 400 Gbps InfiniBand switches.

At the software level, WEKA’s software can already align reads and writes into GPU memory (via GDS) directly to the NIC closest to the GPU, extracting every last bit of performance by reducing unnecessary data movement and latency. The WEKApod is the icing on this cake.
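As a rough illustration of what an application-level GPUDirect Storage read looks like, the sketch below uses the open-source RAPIDS kvikio bindings as a stand-in; it is not WEKA's GDS integration.

```python
# Sketch of a GPUDirect Storage read via the RAPIDS kvikio bindings
# (a generic stand-in, not WEKA's GDS code path).
import cupy as cp
import kvikio

def gds_read(path: str, num_elements: int) -> cp.ndarray:
    """Read FP16 data from NVMe straight into GPU memory, bypassing the CPU."""
    buf = cp.empty(num_elements, dtype=cp.float16)   # destination on the GPU
    f = kvikio.CuFile(path, "r")
    try:
        f.read(buf)                                  # DMA directly into GPU memory
    finally:
        f.close()
    return buf
```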

As the context length grows, machine memory consumption scales linearly. Long-sequence chat or agentic sessions can put pressure on system resources and cause memory overflow.

Cache space is limited to what can be held in a GPU machine. AI services with multiple tenants (that periodically sign in and out of AI applications) need to constantly evict non-active session data from GPU and CPU cache to make room for whatever is happening at the moment.
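A minimal sketch of that eviction pressure, assuming a hypothetical per-session pool with a fixed GPU budget and least-recently-used eviction:

```python
# Toy multi-tenant KV-cache manager: evict the least-recently-used session's
# cache when a (hypothetical) GPU memory budget is exceeded.
from collections import OrderedDict

class SessionCachePool:
    def __init__(self, budget_bytes: int):
        self.budget_bytes = budget_bytes
        self.sessions: OrderedDict[str, int] = OrderedDict()  # session -> bytes

    def touch(self, session_id: str, cache_bytes: int) -> list[str]:
        """Mark a session active; return the sessions evicted to make room."""
        self.sessions[session_id] = cache_bytes
        self.sessions.move_to_end(session_id)              # most recently used
        evicted = []
        while (sum(self.sessions.values()) > self.budget_bytes
               and len(self.sessions) > 1):
            victim, _ = self.sessions.popitem(last=False)  # least recently used
            evicted.append(victim)
        return evicted

pool = SessionCachePool(budget_bytes=60 * 2**30)           # 60 GiB GPU budget
pool.touch("tenant-a", 30 * 2**30)
pool.touch("tenant-b", 25 * 2**30)
print(pool.touch("tenant-c", 20 * 2**30))                  # evicts tenant-a
```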

Reloading the cache from public cloud object storage takes so long that several leading AI-as-a-service shops choose to simply recalculate an entire prompt history rather than fetch all of the context and attention data from object storage.

VAST wants to make scalable, multi-tenant inference faster, more cost-efficient, and global. It has developed a Linux-based agent that runs in your GPU servers and provides a new data presentation layer to AI frameworks. This is the VUA agent (VAST Undivided Attention). Each GPU server’s VUA is hooked up to a shared VAST RDMA-attached NVMe storage system. When tokens are not found in a GPU server’s KV cache, they are reloaded from the VAST storage via the GPUDirect protocol, providing what Denworth calls an infinite memory space for context data.
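The flow Denworth describes amounts to a tiered lookup: check the local GPU-side cache first and, on a miss, pull the key/value data back from the shared store. The helper names below are hypothetical and the sketch is generic, not the VUA implementation.

```python
# Generic tiered KV lookup (hypothetical helpers, not the VUA implementation):
# local GPU-side cache first, shared RDMA-attached store on a miss.
from typing import Callable, Optional

class TieredKVStore:
    def __init__(self, fetch_remote: Callable[[str], Optional[bytes]]):
        self.local: dict[str, bytes] = {}      # stands in for GPU/CPU memory
        self.fetch_remote = fetch_remote       # stands in for the shared store

    def get(self, token_block_id: str) -> Optional[bytes]:
        blob = self.local.get(token_block_id)
        if blob is not None:
            return blob                        # hit in the GPU-side cache
        blob = self.fetch_remote(token_block_id)
        if blob is not None:
            self.local[token_block_id] = blob  # repopulate the local tier
        return blob
```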

VUA can intelligently store and serve prefixes, the initial token sequences needed to provide the model with context.

As each token in a sequence attends to all previous tokens via self-attention, it produces key and value vectors for every position. During tasks like text generation, the model processes one token at a time after an initial input (the prompt). The KV cache stores these vectors for all tokens processed so far, so the model only computes keys and values for the new token and retrieves the rest from the cache.
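In other words, a decode step only computes the query, key, and value for the newest token and attends over everything already cached. A minimal single-head sketch (illustrative only, not any vendor's code):

```python
# Minimal single-head decode step: attention for one new token against
# cached keys/values (illustrative only).
import numpy as np

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    """q_new/k_new/v_new: (head_dim,) vectors for the newest token.
    k_cache/v_cache: (seq_len, head_dim) arrays for all previous tokens."""
    k_all = np.vstack([k_cache, k_new])            # reuse cached keys
    v_all = np.vstack([v_cache, v_new])            # reuse cached values
    scores = k_all @ q_new / np.sqrt(q_new.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over all positions
    out = weights @ v_all                          # output for the new token
    return out, k_all, v_all                       # updated cache

d = 64
k_cache, v_cache = np.random.randn(10, d), np.random.randn(10, d)
q, k, v = np.random.randn(d), np.random.randn(d), np.random.randn(d)
out, k_cache, v_cache = decode_step(q, k, v, k_cache, v_cache)
print(out.shape, k_cache.shape)                    # (64,) (11, 64)
```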

VUA can load prefixes by priority and policy so that, for example, the longest prefixes associated with a sequence can be served first to a GPU machine, getting the session underway faster. Prefixes can also be stored to help multiple related prompts share similar context within a GPU machine, thus reducing the number of cache misses.
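Serving the longest matching prefix first can be pictured as a lookup keyed on hashes of token-ID prefixes; the scheme below is illustrative only and is not VUA's policy engine.

```python
# Illustrative longest-prefix lookup for cached context (not VUA's policies).
import hashlib

def prefix_key(token_ids: list[int]) -> str:
    return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

class PrefixCache:
    def __init__(self):
        self.store: dict[str, bytes] = {}   # prefix hash -> serialized KV blob

    def put(self, token_ids: list[int], kv_blob: bytes) -> None:
        self.store[prefix_key(token_ids)] = kv_blob

    def longest_match(self, token_ids: list[int]) -> tuple[int, bytes | None]:
        """Return (matched_length, blob) for the longest cached prefix."""
        for length in range(len(token_ids), 0, -1):   # longest prefix first
            blob = self.store.get(prefix_key(token_ids[:length]))
            if blob is not None:
                return length, blob
        return 0, None

cache = PrefixCache()
cache.put([1, 2, 3], b"kv-for-shared-system-prompt")
print(cache.longest_match([1, 2, 3, 4, 5]))   # (3, b'kv-for-shared-system-prompt')
```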

Original source: blocksandfiles

Disclaimer: info@kdj.com

The information provided is not trading advice. kdj.com does not assume any responsibility for any investments made based on the information provided in this article. Cryptocurrencies are highly volatile, and it is highly recommended that you invest with caution after thorough research!

If you believe that the content used on this website infringes your copyright, please contact us immediately (info@kdj.com) and we will delete it promptly.
