
Martin Vechev of INSAIT: “DeepSeek $6M Cost Of Training Is Misleading”

Image credit: Martin Vechev, INSAIT

Martin Vechev, Director of the Bulgarian INSAIT (Institute for Computer Science, Artificial Intelligence and Technology), comments on yesterday’s developments around DeepSeek, the Chinese AI startup that claimed its R1 LLM was trained for less than $6M, compared to the billions spent by others. The news sent Nvidia’s stock price tumbling.

“A quick comment about DeepSeek (DS for short below), because a lot of people are asking and much of the information in the media is inaccurate, likely because the authors have not read the DS papers themselves (not just the latest one) and are copying what other outlets write:

1. Who’s working on DS: The DS series of models from China has actually been public for years. The models are developed by strong researchers and engineers in the field who regularly publish their work at conferences, continuously improve their models, and make them publicly available. And that’s very good.

 

2. Cost of training (compute): the $5-6M cost of training is misleading. It comes from the claim that 2,048 H800 cards were used for *one* training run, which at market prices works out to roughly $5-6M. Developing such a model, however, requires running this training, or variations of it, many times, along with many other experiments (item 3 below). That puts the real cost many times higher, not to mention data collection and other steps, which can be very expensive (why? item 4 below). Also, buying 2,048 H800s outright costs between $50-100M. The company behind DS is owned by a large Chinese investment fund, which has many times more GPUs than 2,048 H800s.
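To see where a figure of that order comes from, here is a minimal back-of-the-envelope sketch, assuming 2,048 GPUs rented at roughly $2 per GPU-hour for about two months of continuous training; both the rate and the duration are illustrative assumptions, not numbers from the DS papers.

```python
# Back-of-the-envelope cost of ONE training run (illustrative assumptions only).
num_gpus = 2048            # H800 cards, as claimed for a single run
usd_per_gpu_hour = 2.0     # assumed rental rate, not a figure from the DS papers
training_days = 60         # assumed ~2 months of continuous training

gpu_hours = num_gpus * training_days * 24
single_run_cost = gpu_hours * usd_per_gpu_hour
print(f"{gpu_hours:,.0f} GPU-hours -> ${single_run_cost:,.0f}")   # ~2.9M GPU-hours -> ~$5.9M

# The point of item 2: development needs many runs and experiments,
# so the effective cost is a multiple of this single-run figure.
num_experiment_runs = 10   # purely hypothetical multiplier
print(f"~${single_run_cost * num_experiment_runs:,.0f} across {num_experiment_runs} runs")
```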

 

3. Technology: DeepSeek R1/V3 uses a standard architecture (mixture-of-experts: MoE), but with important improvements. MoE was used in Mixtral by the French company Mistral, but they couldn’t get it to work as well (INSAIT has had a BgGPT variant with MoE since March 2024 that has never been released publicly). Roughly speaking, with MoE only a small % of the model is activated at inference time, and for this reason MoE models may be faster than non-MoE models. Also, one of the basic techniques for training the DS models was published about a year ago (in DeepSeekMath), but the latest DS paper adds important improvements that are the result of a lot of experimentation and research to improve the results (i.e., compute = $$$).
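To make the “only a small % of the model is activated” point concrete, here is a minimal sketch of a generic top-k mixture-of-experts layer in PyTorch; the dimensions, number of experts, and top-k value are arbitrary illustrative choices, and this is not DeepSeek’s actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Generic top-k mixture-of-experts layer (illustrative, not DeepSeek's design)."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)          # picks experts per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                      # x: (tokens, d_model)
        scores = self.router(x)                                # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)         # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only top_k of num_experts experts run per token: that is why
        # inference cost is a fraction of the total parameter count.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(4, 512)
print(ToyMoELayer()(tokens).shape)                             # torch.Size([4, 512])
```

Because only top_k of the experts run per token, the compute per token scales with the active experts rather than the full parameter count, which is why an MoE model can be served faster than a dense model of the same total size.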


 

4. The training data: it is not known what the data is, how it was acquired, or how much of it there is. Regarding copying O1 from OpenAI: it cannot be copied entirely directly (so-called distillation), since OpenAI does not make the thinking tokens publicly available, or more generally the reasoning O1 uses to generate a solution. But that doesn’t mean it can’t be copied. One can, for example, run O1, look only at the final result, and then use a fairly standard RL algorithm to train a model to arrive at the same solution. Surprise: you still need a lot of compute for this. Of course, there is speculation that DS knows the architecture of O1, but that is just speculation.
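A minimal sketch of the “look only at the final result and use a fairly standard RL algorithm” idea: the policy samples an answer, gets a reward of 1 only if its final answer matches a reference answer (e.g., one observed from a stronger model), and is updated with a plain REINFORCE-style policy gradient. The candidate answers and reward below are toy placeholders, not DeepSeek’s training setup.

```python
import torch
import torch.nn.functional as F

# Toy outcome-only RL: the "policy" scores a fixed set of candidate answers,
# and the only supervision is whether the sampled answer matches a reference
# final answer (e.g., one observed from a stronger model).
candidates = ["42", "41", "43", "44"]
reference_answer = "42"                      # final result only; no reasoning tokens

logits = torch.zeros(len(candidates), requires_grad=True)    # toy "policy"
optimizer = torch.optim.Adam([logits], lr=0.1)

for step in range(200):
    probs = F.softmax(logits, dim=-1)
    idx = torch.multinomial(probs, 1).item()                 # sample an answer
    reward = 1.0 if candidates[idx] == reference_answer else 0.0
    loss = -reward * torch.log(probs[idx])                   # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print({c: round(p, 3) for c, p in zip(candidates, F.softmax(logits, dim=-1).tolist())})
```

Even in this toy form, learning from outcomes alone needs many sampled attempts per problem before the reward signal becomes useful, which at real LLM scale is exactly the “you still need a lot of compute” point.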

 

5. DS vs. O1: open models catching up with closed ones has happened before, with non-reasoning models in the GPT-4o class (e.g., Llama 3 400B). I guess having a US/China sub-story makes people more emotional.

 

6. Quality: DS R1 is the best open model for O1-type reasoning today, but it is relatively specialized for that purpose and not for everything. For example, I don’t expect it to be the optimal model for multilingual use. R1 and V3 (the non-reasoning version) are quite large, at over 600 billion parameters; the more practically applicable variants built with DS are those distilled (i.e., generated from R1/V3) into much smaller models (not 600+ billion parameters, but say 30 billion), and these are useful for various purposes (many people use them, including INSAIT).
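A minimal sketch of the distillation step mentioned here: a large “teacher” model generates answers (possibly with reasoning), and a much smaller “student” model is fine-tuned on that synthetic text with ordinary next-token cross-entropy. The tiny Hugging Face checkpoints below are stand-ins chosen only so the snippet runs; a real R1/V3-class teacher and 30B-class students are far larger, and this is not the actual DS recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in checkpoints so the sketch runs; real distillation would use an
# R1/V3-class teacher and a much smaller (e.g., ~30B) student.
teacher_tok = AutoTokenizer.from_pretrained("gpt2")
teacher = AutoModelForCausalLM.from_pretrained("gpt2")
student_tok = AutoTokenizer.from_pretrained("distilgpt2")
student = AutoModelForCausalLM.from_pretrained("distilgpt2")
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

prompts = ["Question: What is 2 + 2?\nAnswer:",
           "Question: Name a prime number greater than 10.\nAnswer:"]

student.train()
for prompt in prompts:
    # 1) Teacher generates the target text the student should imitate.
    inputs = teacher_tok(prompt, return_tensors="pt")
    with torch.no_grad():
        generated = teacher.generate(**inputs, max_new_tokens=32,
                                     pad_token_id=teacher_tok.eos_token_id)
    target_text = teacher_tok.decode(generated[0], skip_special_tokens=True)

    # 2) Student is fine-tuned on the teacher's output with cross-entropy.
    batch = student_tok(target_text, return_tensors="pt")
    loss = student(input_ids=batch["input_ids"], labels=batch["input_ids"]).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"student loss: {loss.item():.3f}")
```

In practice the costly part is not this loop but generating and curating the synthetic data at scale, which again requires substantial teacher-side compute (items 2 and 4 above).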

 

7. Expectations: what I expect to happen is the standard pattern. The closed-source AI companies (Google, OpenAI, etc.) will show new benchmarks on which R1 does not work well; this is not a problem, and some already exist for O1 as well. Then someone with more GPUs (DS or someone else) will again build an open version that is comparable to the closed ones. However, building such a version will require a lot of compute and a lot of experimentation: not $5M, but closer to $50-100M, even when the model is more specialized.”


 
