DeepSeek or DeepFake?

Maybe they should have called it DeepFake, or DeepState, or better still Deep Selloff.  Or maybe the other obvious deep thing that the indigenous AI vendors in the United States are standing up to their knees in right now.  Call it what you will, but the DeepSeek foundation model has in one short week turned the AI world on its head, proving once again that Chinese researchers can make inferior hardware run a superior algorithm and get results that are commensurate with the best that researchers in the US, either at the national labs running exascale HPC simulations or at hyperscalers running AI training and inference workloads, can deliver.  And for a whole lot less money, if the numbers behind the DeepSeek models are not hyperbole or even mere exaggeration.  Unfortunately, there may be a bit of that, which will be cold comfort for the investors in Nvidia and other publicly traded companies that are playing in the AI space right now.  These companies have lost hundreds of billions of dollars in market capitalization today as we write.[1]

Having seen the paper about the DeepSeek-V3 model come out a few days ago, analysts were already set to give it a looksee this morning to start the week, and Wall Street’s panic beat us to the punch.  Here is what is known.

DeepSeek-AI was founded by Liang Wenfeng in May 2023 and is effectively a spinout of High-Flyer AI, a hedge fund reportedly with $8 billion in assets under management that was created explicitly to employ AI algorithms to trade in various kinds of financial instruments.  It flew largely under the radar until August 2024, when DeepSeek published a paper describing a new kind of load balancer it had created to link the elements of its mixture of experts (MoE) foundation model to each other.  Over the holidays, the company published the architectural details of its DeepSeek-V3 foundation model, which spans 671 billion parameters (with only 37 billion parameters activated for any given token generated) and was trained on 14.8 trillion tokens.  And finally, and perhaps most importantly, on 20 January, DeepSeek rolled out its DeepSeek-R1 model, which adds two more reinforcement learning stages and two supervised fine-tuning stages to enhance the model’s reasoning capabilities.  DeepSeek is charging 6.5X more for the R1 model than for the base V3 model.  There is much chatter out there on the Intertubes as to why this might be the case.

Interestingly, the source code for both the V3 and R1 models and their V2 predecessor are all available on GitHub, which is more than you can say for the proprietary models from OpenAI, Google, Anthropic, xAI, and others.  But what we want to know – and what is roiling the tech titans today – is precisely how DeepSeek was able to take a few thousand crippled “Hopper” H800 GPU accelerators from Nvidia, which have some of their performance capped, and create an MoE foundation model that can stand toe-to-toe with the best that OpenAI, Google, and Anthropic can do with their largest models as they are trained on tens of thousands of uncrimped GPU accelerators.  If it takes one-tenth to one-twentieth the hardware to train a model, that would seem to imply that the value of the AI market can, in theory, contract by a factor of 10X to 20X.  It is no coincidence that Nvidia stock is down 17.2 percent as we write this sentence.

In the DeepSeek-V3 paper, DeepSeek says that it spent 2.66 million GPU-hours on H800 accelerators to do the pretraining, 119,000 GPU-hours on context extension, and a mere 5,000 GPU-hours for supervised fine-tuning and reinforcement learning on the base V3 model, for a total of 2.79 million GPU-hours.  At a cost of $2 per GPU-hour – we have no idea if that is actually the prevailing price in China – that works out to a mere $5.58 million.
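
As a sanity check, the arithmetic is simple enough to run yourself.  The $2 per GPU-hour rate below is the paper’s assumption, not a verified market price, and the inputs are the rounded figures quoted above.

    # Back-of-the-envelope check of DeepSeek's stated V3 training cost.
    # The $2/GPU-hour rate is the paper's assumption, not a verified market price.
    pretraining_hours = 2_660_000      # H800 GPU-hours for pretraining (rounded)
    context_ext_hours = 119_000        # GPU-hours for context extension
    post_training_hours = 5_000        # GPU-hours for SFT and RL on the base model

    total_hours = pretraining_hours + context_ext_hours + post_training_hours
    cost_per_gpu_hour = 2.00           # dollars, as assumed in the paper

    print(f"Total GPU-hours: {total_hours:,}")                           # 2,784,000 (~2.79 million)
    print(f"Implied cost:    ${total_hours * cost_per_gpu_hour:,.0f}")   # ~$5.57 million with these rounded inputs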

The cluster that DeepSeek says that it used to train the V3 model had a mere 256 server nodes with eight of the H800 GPU accelerators each, for a total of 2,048 GPUs.  We presume that they are the SXM5 version of the H800 card, which has its FP64 floating point performance capped at 1 teraflops and is otherwise the same as the 80 GB version of the H100 card that most of the companies in the world can buy.  (The PCI-Express version of the H800 card has some of its CUDA cores deactivated and has its memory bandwidth cut by 39 percent to 2 TB/sec from the 3.35 TB/sec on the base H100 card announced way back in 2022.)  The eight GPUs inside the node are interlinked with NVSwitches to create a shared memory domain across those GPU memories, and the nodes have multiple InfiniBand cards (probably one per GPU) to create high bandwidth links out to other nodes in the cluster.  Researchers strongly suspect DeepSeek only had access to 100 Gb/sec InfiniBand adapters and switches, but it could be running at 200 Gb/sec; the company does not say.

Analysts think this is a fairly modest cluster by any modern AI standard, especially given the size of the clusters that OpenAI/Microsoft, Anthropic, and Google have built to train their equivalent GPT-4 and o1, Claude 3.5, and Gemini 1.5 models.  Experts are very skeptical that the V3 model was trained from scratch on such a small cluster.  It is simply hard to accept until someone else repeats the task.  Luckily, science is repeatable: there are companies with trillions of curated tokens and tens of thousands of GPUs that can check whether what DeepSeek is claiming is true.  On 2,048 H100 GPUs, it would take under two months to train DeepSeek-V3 if the Chinese AI upstart’s numbers hold up.  That’s pocket change for the hyperscalers and cloud builders to prove out.
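
In wall-clock terms, the claim pencils out like this, assuming (optimistically) perfect utilization on the stated 2,048-GPU cluster:

    # Rough wall-clock time to burn ~2.79 million GPU-hours on a 2,048-GPU cluster,
    # assuming perfect utilization and no restarts, which is optimistic.
    total_gpu_hours = 2_790_000   # rounded total from the V3 paper
    gpu_count = 256 * 8           # 256 nodes x 8 H800s per node = 2,048 GPUs

    wall_clock_hours = total_gpu_hours / gpu_count
    print(f"{wall_clock_hours:,.0f} hours = {wall_clock_hours / 24:.1f} days")   # ~1,362 hours, or about 57 days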

Despite that skepticism, if you comb through the 53-page paper, you will find all kinds of clever optimizations and approaches that DeepSeek has taken to build the V3 model, and we do believe these cut down on inefficiencies and boost the training and inference performance on the iron DeepSeek has to play with.

The key innovation in the approach taken to train the V3 foundation model, we think, is the use of 32 of the 132 streaming multiprocessors (SMs) on the Hopper GPU to work, for lack of better words, as a communication accelerator and scheduler for data as it passes around the cluster while the training run chews through the tokens and generates the weights for the model at the parameter depths that have been set.  As far as we can surmise, this approach – “the overlap between computation and communication to hide the communication latency during computation,” as the V3 paper puts it – uses SMs to create what is in effect an L3 cache controller and a data aggregator between GPUs that are not in the same node.  According to the paper, this communication accelerator, which is called DualPipe, has the following tasks (a conceptual sketch follows the list):

  • Forwarding data between the InfiniBand and NVLink domain while aggregating InfiniBand traffic destined for multiple GPUs within the same node from a single GPU.
  • Transporting data between RDMA buffers (registered GPU memory regions) and input/output buffers.
  • Executing reduce operations for all-to-all combine.
  • Managing fine-grained memory layout during chunked data transferring to multiple experts across the InfiniBand and NVLink domain.
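
To make the computation-communication overlap idea concrete, here is a minimal PyTorch sketch of our own – not DeepSeek’s DualPipe code – that issues a bulk data transfer on one CUDA stream while a matrix multiply runs on another.  The tensor names and sizes are invented for the example, and a host copy stands in for the InfiniBand/NVLink traffic that the dedicated SMs shepherd in the real system.

    # Minimal sketch of hiding communication behind computation with CUDA streams.
    # This illustrates the overlap idea only; it is not DeepSeek's DualPipe code.
    import torch

    if torch.cuda.is_available():
        compute_stream = torch.cuda.Stream()
        comm_stream = torch.cuda.Stream()

        x = torch.randn(4096, 4096, device="cuda")
        w = torch.randn(4096, 4096, device="cuda")
        outbound = torch.randn(4096, 4096, device="cuda")   # stand-in for expert activations headed off-node
        staging = torch.empty(outbound.shape, dtype=outbound.dtype, device="cpu", pin_memory=True)

        with torch.cuda.stream(comm_stream):
            # Communication side: an async device-to-host copy standing in for the
            # InfiniBand/NVLink transfers handled by dedicated SMs in the real system.
            staging.copy_(outbound, non_blocking=True)

        with torch.cuda.stream(compute_stream):
            # Computation proceeds on a different stream while the copy is in flight.
            y = x @ w

        torch.cuda.synchronize()   # wait for both streams before using the results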

In another sense, then, DeepSeek has created its own on-GPU virtual DPU for doing all kinds of SHARP-like processing associated with all-to-all communication in the GPU cluster.  Here is an important paragraph about DualPipe:  “As for the training framework, we design the DualPipe algorithm for efficient pipeline parallelism, which has fewer pipeline bubbles and hides most of the communication during training through computation-communication overlap.  This overlap ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead.  In addition, we also develop efficient cross-node all-to-all communication kernels to fully utilize InfiniBand and NVLink bandwidths.  Furthermore, we meticulously optimize the memory footprint, making it possible to train DeepSeek-V3 without using costly tensor parallelism.  Combining these efforts, we achieve high training efficiency.”

The paper does not say how much of a boost this DualPipe feature offers, but if a GPU is waiting for data 75 percent of the time because of the inefficiency of communication – that is, doing useful work only a quarter of the time – and you can push that computational efficiency to near 100 percent by hiding latency with scheduling tricks, much as L3 caches do for CPU and GPU cores, then those 2,048 GPUs start acting like they are 8,192.  OpenAI’s GPT-4 foundation model was trained on 8,000 of Nvidia’s “Ampere” A100 GPUs, which is like 4,000 H100s (sort of). 
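
The implied multiplier is easy to check, under the purely illustrative assumption that utilization really does go from 25 percent to essentially 100 percent:

    # Effective GPU count if communication hiding lifts utilization from 25% to ~100%.
    physical_gpus = 2_048
    baseline_utilization = 0.25   # GPU doing useful work only a quarter of the time (illustrative)
    improved_utilization = 1.00   # idealized near-perfect overlap

    effective_gpus = physical_gpus * improved_utilization / baseline_utilization
    print(effective_gpus)         # 8192.0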

Here is another side effect: The V3 model uses pipeline parallelism and data parallelism, but because the memory is managed so tightly, and because forward and backward propagation are overlapped as the model is being built, V3 does not have to use tensor parallelism at all.  Weird, right?  Another key innovation for V3 is the auxiliary loss-free load balancing mentioned above.  When you train an MoE model, there has to be some sort of router that knows which experts to send which tokens to, just as you have to know which experts to listen to when you query the collection of models inherent in the MoE; a toy router sketch follows below.
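
For readers unfamiliar with MoE routing, here is a toy top-k router in Python with NumPy.  It is a generic illustration of the routing idea only – DeepSeek-V3 layers its auxiliary-loss-free load balancing on top of routing like this – and all of the names and sizes are invented for the example.

    # Toy MoE router: each token's hidden state is scored against every expert and
    # the token is dispatched to the top-k highest-scoring experts.
    import numpy as np

    rng = np.random.default_rng(0)
    num_tokens, hidden_dim, num_experts, top_k = 4, 16, 8, 2

    tokens = rng.standard_normal((num_tokens, hidden_dim))
    gate_weights = rng.standard_normal((hidden_dim, num_experts))

    scores = tokens @ gate_weights                                        # (tokens, experts) affinity scores
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)    # softmax over experts
    chosen = np.argsort(probs, axis=1)[:, -top_k:]                        # top-k expert indices per token

    for t in range(num_tokens):
        print(f"token {t} -> experts {sorted(chosen[t].tolist())}")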

Another performance boost comes from FP8 low precision processing, which increases effective bandwidth through the GPUs at the same time as making the most of the limited 80 GB of memory on the H800 GPU accelerators.  The majority of the V3 model kernels are implemented in FP8 format.  But certain operations still require 16-bit or 32-bit precision, and master weights, weight gradients, and optimizer states are stored in higher precisions than FP8.  DeepSeek has come up with its own way of microscaling the mantissas and exponents of data being processed such that the level of precision and numerical range necessary for any given calculation can be maintained without sacrificing the fidelity of the data in a way that hurts the reliability of the answers that come out of the model.
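
The division of labor between low-precision compute and higher-precision storage follows the generic mixed-precision pattern sketched below.  FP16 stands in for FP8 because NumPy has no 8-bit float type, and plain SGD stands in for the real optimizer; none of this is DeepSeek’s actual code.

    # Generic mixed-precision training pattern: keep master weights in FP32, do the
    # bulk of the math in a lower precision, and apply updates back in FP32.
    # FP16 stands in for FP8 here; plain SGD stands in for the real optimizer.
    import numpy as np

    rng = np.random.default_rng(1)
    master_weights = rng.standard_normal((64, 64)).astype(np.float32)    # FP32 master copy
    activations = rng.standard_normal((32, 64)).astype(np.float32)

    low_precision_w = master_weights.astype(np.float16)                  # cast-down copy used for compute
    low_precision_x = activations.astype(np.float16)
    outputs = low_precision_x @ low_precision_w                          # low-precision forward pass

    grad = rng.standard_normal(master_weights.shape).astype(np.float16)  # pretend gradient from backward pass
    learning_rate = np.float32(1e-3)
    master_weights -= learning_rate * grad.astype(np.float32)            # weight update applied at full precision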

One neat technique that DeepSeek came up with is to promote intermediate results from the matrix math operations in the tensor cores up to the vector units on the CUDA cores to preserve a semblance of higher precision.  (Enough of a semblance to get output that looked like 32-bit math was used for the whole dataset.)  Incidentally, DeepSeek uses the FP8 format with 4-bit exponents and 3-bit mantissas, known as E4M3, for all tensor calculations inside the tensor cores.  None of this funny bitness is happening inside the tensor cores; it is happening in the CUDA cores.  16-bit formats are used inside the optimizer, and master weights are kept in the FP32 format.
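
For reference, the headline numbers for E4M3 fall straight out of its 4-bit exponent and 3-bit mantissa.  The figures below assume the standard OCP FP8 definition of the format (the “FN” variant Hopper supports, where only one bit pattern per sign is reserved for NaN and there are no infinities).

    # Key properties of the E4M3 FP8 format, derived from its bit layout.
    # Assumes the standard OCP FN variant: no infinities, and the all-ones
    # exponent is still usable for finite values except the NaN pattern.
    exponent_bits, mantissa_bits = 4, 3
    bias = 2 ** (exponent_bits - 1) - 1                                  # 7

    max_exponent = (2 ** exponent_bits - 1) - bias                       # 8
    max_finite = (2 - 2 ** (1 - mantissa_bits)) * 2 ** max_exponent      # 1.75 * 256 = 448.0
    min_normal = 2.0 ** (1 - bias)                                       # 2**-6 = 0.015625
    min_subnormal = 2.0 ** (1 - bias - mantissa_bits)                    # 2**-9 = 0.001953125

    print(max_finite, min_normal, min_subnormal)                         # 448.0 0.015625 0.001953125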

There are lots of other neat tricks, such as recomputing all RMSNorm operations and all MLA up-projections during backpropagation, which means they do not take up valuable space in the HBM memory on the H800 card.  The exponential moving average (EMA) parameters, which are used to estimate the performance of the model and its learning rate decay, are stored in CPU host memory.  Memory consumption and communication overhead are cut further by caching model activations and optimizer states in lower precision formats.
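
The recomputation trick is the familiar activation-checkpointing pattern; here is a minimal PyTorch sketch of it using torch.utils.checkpoint, which re-runs the wrapped layers during the backward pass instead of caching their activations.  The toy block below is just an example and has nothing to do with DeepSeek’s actual RMSNorm or MLA kernels.

    # Minimal activation-checkpointing sketch: the wrapped block's intermediate
    # activations are not kept in memory; they are recomputed during backward.
    import torch
    from torch.utils.checkpoint import checkpoint

    block = torch.nn.Sequential(
        torch.nn.Linear(256, 256),
        torch.nn.GELU(),
        torch.nn.Linear(256, 256),
    )

    x = torch.randn(8, 256, requires_grad=True)
    y = checkpoint(block, x, use_reentrant=False)   # forward pass without caching intermediates
    loss = y.pow(2).mean()
    loss.backward()                                 # the block is re-run here to rebuild activations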

After perusing the paper, judge for yourself if all of the clever tweaks can add up to a 10X reduction in hardware.  Experts are skeptical until proof is revealed.  Interestingly, in the V3 model paper, DeepSeek researchers offered Nvidia – and other AI accelerator providers – a list of needed features, this one aimed at the limited accumulation precision of the Hopper tensor cores:  “Our experiments reveal that it only uses the highest 14 bits of each mantissa product after sign-fill right shifting, and truncates bits exceeding this range. However, for example, to achieve precise FP32 results from the accumulation of 32 FP8×FP8 multiplications, at least 34-bit precision is required.  Thus, we recommend that future chip designs increase accumulation precision in Tensor Cores to support full-precision accumulation, or select an appropriate accumulation bit-width according to the accuracy requirements of training and inference algorithms.  This approach ensures that errors remain within acceptable bounds while maintaining computational efficiency.”
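
The accumulation-precision complaint is easy to reproduce on a CPU: summing many low-precision products in a low-precision accumulator drifts, while a wide accumulator stays honest.  The demonstration below uses FP16 as a stand-in, since NumPy has no FP8 type, so the effect is the same in kind if not in degree.

    # Why accumulation precision matters: the same dot product accumulated in FP16
    # versus FP64. FP16 stands in for a narrow tensor-core accumulator.
    import numpy as np

    rng = np.random.default_rng(42)
    a = rng.standard_normal(100_000).astype(np.float16)
    b = rng.standard_normal(100_000).astype(np.float16)

    narrow = np.float16(0.0)
    for x, y in zip(a, b):
        narrow = np.float16(narrow + np.float16(x * y))   # every partial sum is rounded to FP16

    wide = np.dot(a.astype(np.float64), b.astype(np.float64))
    print(f"FP16 accumulation: {float(narrow):+.3f}")
    print(f"FP64 reference:    {wide:+.3f}")              # the two typically differ noticeably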

DeepSeek has developed a method of tile-wise and block-wise quantization, which lets it move the numerical range of numbers at a certain bitness around within a dataset.  Nvidia only supports per-tensor quantization, and DeepSeek wants Nvidia architects to read its paper and see the benefits of its approach.  (And even if Nvidia does add such a feature, it might be turned off by the US government.)
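
Here is what block-wise scaling looks like in miniature, with FP16 once again standing in for FP8.  Each tile of the matrix gets its own scale factor, so an outlier in one region does not squash the numerical range available everywhere else.  This is our generic illustration, not DeepSeek’s kernel, and the 128×128 tile size is just a conventional choice.

    # Toy block-wise quantization: each (block x block) tile gets its own scale so a
    # large outlier in one tile does not compress the dynamic range of the others.
    import numpy as np

    def quantize_blockwise(matrix, block=128, fmt_max=448.0):
        rows, cols = matrix.shape[0] // block, matrix.shape[1] // block
        scales = np.zeros((rows, cols), dtype=np.float32)
        quantized = np.zeros_like(matrix, dtype=np.float16)   # FP16 stands in for FP8
        for i in range(rows):
            for j in range(cols):
                tile = matrix[i*block:(i+1)*block, j*block:(j+1)*block]
                scale = np.abs(tile).max() / fmt_max          # map the tile's max onto the format's max value
                scales[i, j] = scale
                quantized[i*block:(i+1)*block, j*block:(j+1)*block] = (tile / scale).astype(np.float16)
        return quantized, scales

    weights = np.random.default_rng(7).standard_normal((256, 256)).astype(np.float32)
    weights[0, 0] = 1000.0                                    # a single outlier only inflates its own tile's scale
    q, s = quantize_blockwise(weights)
    print(s)                                                  # four per-tile scales; only one reflects the outlier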

DeepSeek also wants support for online quantization, which is also part of the V3 model.  To do online quantization, DeepSeek says it has to read 128 BF16 activation values, which is the output of a prior calculation, from HBM memory to quantize them, write them back as FP8 values to the HBM memory, and then read them again to perform MMA operations in the tensor cores.  DeepSeek says that future chips should have FP8 cast and tensor memory acceleration in a single, fused operation so the quantization can happen during the transfer of activations from global to shared memory, cutting down on reads and writes.  DeepSeek also wants GPU makers to fuse matrix transposition with GEMM operations, which will also cut down on memory operations and make the quantization workflow more streamlined.
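
The memory-traffic argument behind that request is straightforward to count up.  The sizes below are arbitrary; the point is the ratio of HBM bytes moved with and without the fused cast.

    # Memory traffic for online quantization today versus the fused cast DeepSeek asks for.
    # The activation count is an arbitrary example; only the ratio matters.
    activation_values = 1_000_000
    bf16_bytes, fp8_bytes = 2, 1

    today = activation_values * bf16_bytes    # read the BF16 activations from HBM
    today += activation_values * fp8_bytes    # write the FP8 values back to HBM
    today += activation_values * fp8_bytes    # read the FP8 values again for the tensor-core MMA

    fused = activation_values * bf16_bytes    # read BF16 once, casting to FP8 on the way into shared memory

    print(f"HBM traffic ratio (today / fused): {today / fused:.1f}x")   # 2.0x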

 

So DeepSeek trains the V3 model.  To create the R1 model, it takes the output of other AI models (according to rumor) and feeds it into reinforcement learning and supervised fine-tuning operations to improve the “reasoning patterns” of V3.  And then, here is the kicker, as outlined in the paper:  “We conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential.  During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models, and meanwhile carefully maintain the balance between model accuracy and generation length.”

DeepSeek says this: “We introduce an innovative methodology to distill reasoning capabilities from the long-Chain-of-Thought (CoT) model, specifically from one of the DeepSeek R1 series models, into standard LLMs, particularly DeepSeek-V3.  Our pipeline elegantly incorporates the verification and reflection patterns of R1 into DeepSeek-V3 and notably improves its reasoning performance.  Meanwhile, we also maintain control over the output style and length of DeepSeek-V3.”
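
Distillation in general means training the student on outputs generated by the teacher.  The sketch below is the textbook soft-label version on a toy classifier, purely to show the mechanics; DeepSeek’s actual pipeline distills R1’s chain-of-thought text into V3 through supervised fine-tuning rather than matching logits like this.

    # Textbook knowledge-distillation loop on toy data: the student is trained to
    # match the teacher's softened output distribution. DeepSeek's R1-to-V3 pipeline
    # instead distills chain-of-thought text through SFT, but the spirit is the same.
    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    teacher = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
    student = torch.nn.Linear(32, 10)                        # a smaller, cheaper model
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
    temperature = 2.0

    for step in range(100):
        x = torch.randn(64, 32)                              # stand-in for real training inputs
        with torch.no_grad():
            teacher_probs = F.softmax(teacher(x) / temperature, dim=-1)
        student_log_probs = F.log_softmax(student(x) / temperature, dim=-1)
        loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()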

So, how much does this snake eating its own tail boost the effectiveness of the V3 model and reduce the training burden?  Researchers would love to see that quantified and qualified.

This article is shared at no charge for educational and informational purposes only.

Red Sky Alliance is a Cyber Threat Analysis and Intelligence Service organization.  We provide indicators of compromise information via a notification service (RedXray) or an analysis service (CTAC).  For questions, comments or assistance, please contact the office directly at 1-844-492-7225, or feedback@redskyalliance.com    

Weekly Cyber Intelligence Briefings:

REDSHORTS - Weekly Cyber Intelligence Briefings

https://register.gotowebinar.com/register/5207428251321676122

[1] https://www.nextplatform.com/2025/01/27/how-did-deepseek-train-its-ai-model-on-a-lot-less-and-crippled-hardware/
