Large Models: an Exclusive Party for the Upper Crust
Credit: Visual China
By Zhang Shuai and Shaw Wan
BEIJING, July 11 (TiPost) — “As long as you have a graphics processing unit (GPU), I’ll buy the whole server,” said one buyer. Zhang Yang, head of a cloud computing service provider, has witnessed customers on a shopping spree since March. “At that time, customers were in such a rush that all they cared about was hoarding the devices. They had no requirements for the products, nor did they mention cloud networking or data storage. They didn’t even know how to put them to good use,” said Zhang.
It wasn’t until April that some of these buyers started to figure out what kind of devices they truly needed. They had gone down the wrong track by hoarding piles of GPUs, when training a large model actually demands massive distributed computing power that comes paired with a full set of services.
The computing power industry involves many parts, such as artificial intelligence (AI) chips, servers, optical modules, data centers and cloud computing platforms, which together form the driving force of the digital economy. Because of that scale and complexity, only a small number of enterprises can afford to join the race. Since training a large model is the starting point of the large model ecosystem, sufficient computing power is the admission ticket to the industry.
The birth of the AI chatbot ChatGPT last November proved the partnership between Microsoft and OpenAI a success, and showed that training models in the cloud works. Cloud service providers offer companies scalable computing resources, including hardware and software, which saves businesses from building their own infrastructure. They can also serve talent across the industry, such as R&D engineers, algorithm engineers and individual developers. Since they are often backed by tech giants, they are rich in funding, talent and data, enabling them to press ahead with large model training.
The frenzied ChatGPT wave
The companies qualified for the race are mostly established giants. For example, the supercomputer behind ChatGPT runs on Azure, Microsoft’s cloud computing platform, initially released 14 years ago. At the current stage, to catch up with ChatGPT or to get ahead in the race, companies are competing with strategies and technologies developed over the past years.
“Large model training is obviously hyped. The industry should be more rational and avoid exploiting the concept to lure investment. I think the companies that want to scramble for a piece of the pie shouldn’t start from scratch. There are opportunities for them, but also great challenges,” said a person in charge of large model products at a tech giant.
From the perspective of academia, OpenAI does not embody revolutionary innovation. It is more a case of “engineered innovation” in artificial general intelligence (AGI) products: an engineered production process that spans the different phases of large model training, from research and engineering to product and organization.
“It is also hard to pull the engineered production off. However, it at least proved that having more computing power and more data would work,” said Han Kai, principal engineering manager at Microsoft.
Although engineered production has proven a success, it was hard for many other companies to choose that path in the first place, because the huge investment did not promise a bright future. Chinese enterprises tend to follow in others’ footsteps, which is also why ChatGPT was not born in China.
Challenges in big model training
There are at least three major challenges to achieving the engineered production of large model training in the cloud.
The first is computing power. “For example, GPT-3, which has 175 billion parameters, required 314 zettaFLOPs of computing power to train. As one GPU delivers only 312 teraFLOPS of deep learning performance, it would take 32 years to train such a model on a single GPU. Therefore, it is necessary to introduce distributed training, where we use multiple devices and multiple GPUs to train large models,” said Chen Xi, an expert at Chinatelecom Cloud, a cloud computing platform backed by China Telecom.
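The arithmetic behind that 32-year estimate can be checked directly. A quick sketch, using only the two figures quoted above (the per-GPU throughput assumes perfect, sustained utilization, which real training never achieves):

```python
# Single-GPU training-time estimate for GPT-3, from the figures quoted above.
total_flops = 314e21        # 314 zettaFLOPs of compute needed to train GPT-3
gpu_flops_per_s = 312e12    # 312 teraFLOPS of deep learning performance per GPU

seconds = total_flops / gpu_flops_per_s
years = seconds / (365 * 24 * 3600)
print(f"{years:.1f} years on one GPU")          # roughly 32 years

# With N GPUs and ideal linear scaling, the wall-clock time divides by N:
n_gpus = 1024
print(f"{years / n_gpus * 365:.1f} days on {n_gpus} GPUs")
```

Even the 1,024-GPU figure is optimistic: communication overhead between nodes (discussed below) keeps real clusters well short of linear scaling.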
The second is data storage. The video memory of a single GPU can no longer hold a model with hundreds of billions of parameters: fully loading that many parameters takes a few terabytes, and the footprint grows even bigger when intermediate results generated during training, such as gradients and optimizer states, are taken into account. Thus, hundreds of GPUs are needed.
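A back-of-the-envelope estimate shows where those terabytes come from. The sketch below assumes a 175-billion-parameter model trained in mixed precision with an Adam-style optimizer, a common setup but an assumption on my part, not something the article specifies:

```python
import math

# Rough training-time memory footprint for a 175B-parameter model
# (mixed precision with an Adam-style optimizer).
params = 175e9

weights_fp16 = params * 2   # fp16 working weights: 2 bytes per parameter
grads_fp16 = params * 2     # fp16 gradients
master_fp32 = params * 4    # fp32 master copy of the weights
adam_m = params * 4         # Adam first-moment state (fp32)
adam_v = params * 4         # Adam second-moment state (fp32)

total_bytes = weights_fp16 + grads_fp16 + master_fp32 + adam_m + adam_v
print(f"{total_bytes / 1e12:.1f} TB before activations")

gpu_mem = 80e9              # e.g. an 80 GB accelerator
print(math.ceil(total_bytes / gpu_mem), "GPUs just to hold these states")
```

That works out to 16 bytes per parameter, or about 2.8 TB, before counting activations, which is why the model must be sharded across many devices rather than replicated.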
Companies therefore generally adopt pipeline parallelism, running different layers of the model on GPUs in different nodes. In this way, each group of nodes only needs to load a limited number of parameters, reducing the pressure on memory.
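The core of that layer-to-node assignment can be sketched in a few lines. This is a toy illustration of the partitioning idea only, not the API of any real framework:

```python
# Toy pipeline-parallel partitioning: split a model's layers into
# contiguous stages, one stage per node, so each node holds only
# its own slice of the parameters.
def partition_layers(n_layers, n_stages):
    """Return a list of (start, end) layer ranges, one per stage."""
    base, extra = divmod(n_layers, n_stages)
    ranges, start = [], 0
    for stage in range(n_stages):
        size = base + (1 if stage < extra else 0)  # spread any remainder
        ranges.append((start, start + size))
        start += size
    return ranges

# A 96-layer model split across 8 pipeline stages:
print(partition_layers(96, 8))  # eight (start, end) ranges of 12 layers each
```

Each stage then only keeps its own layers in memory and passes activations to the next stage, which is exactly what creates the inter-node traffic described next.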
The third is communication. As the training task is broken into a sequence of processing stages, a large amount of traffic flows between clusters, placing high demands on the bus and its bandwidth. The amount of data transferred can reach hundreds of gigabytes.
Beyond these three major challenges, the gap between the fast growth of large model parameters and the slower development of chip technology is also hindering the industry. In recent years, since the introduction of the transformer, the number of model parameters has increased 15-fold every two years. Chip technology, however, lags behind: the computing power of a single GPU grew by less than 4 times, even as the process node shrank from 7 nanometers to 4 nanometers.
Large model training requires not only computing power but also storage, security and a training framework, so a complete set of platforms or services is needed for support. “We feel there are not many service providers who can satisfy the needs of large model training, and the overall supply of high-performance computing power is relatively tight,” said Chen.
Opportunities for Chinese chip makers
As Chinese companies try to jump on the bandwagon of large model training, the demand for chips has soared. Although Chinese chip makers are trying to catch up with the top chip designers, they are not the first choice of many computing power platforms.
“At present, when everyone is working on large model training, time is of the essence. What the industry needs is high-end products, so that it can avoid stability and maturity problems. That’s why Chinese chips are left out,” said Zhang Yalin, chief operating officer of Enflame, a Shanghai-based AI start-up developing cloud-based deep learning chips for AI training platforms.
The American chip maker Nvidia is the dominant supplier for the inference and training of large models in China. Chinese tech giant Baidu once purchased tens of thousands of Nvidia A800s within just half a year.
According to Nvidia’s financial results for the first quarter of fiscal 2024, revenue from Nvidia’s data center business was 4.28 billion dollars, a record high and an increase of 14 percent over the same period last year. In May, shares of the company soared, taking its valuation above one trillion dollars.
In inference, however, there are still business opportunities for Chinese chips. “I think Chinese chip makers should take a different path, starting with inference and fine-tuning. They can later cooperate with research institutes at universities and national laboratories to move on to large model training,” Zhang said.
The development of AI chips has been faster than Moore’s Law, which could also lead to a decline in growth in some cases, according to Xie Guangjun, vice president of Baidu. The temporary shortage of computing power exists because supply cannot keep up with demand, and the overall imbalance in the supply chain is also part of the cause.
As of now, no Chinese chip can replace Nvidia’s high-end chips, such as the A100. Several Chinese chip makers plan to release comparable products later this year. As the shortage of Nvidia chips persists, Chinese chips are likely to grab a slice of the cake after next year, once they can meet the requirements.
“Although Internet companies care more about the price-performance ratio, they need to pay more attention to the total cost of ownership when it comes to computing power. For example, my company’s GPU cluster can support 1,000 GPUs, with performance similar to an Nvidia cluster of 600 GPUs. Our products can still be competitive, as long as we provide more cost-effective and customized services,” said Zhang.
For Enflame’s products to be favored by Internet clients, they would need 1.5 times the performance of Nvidia products and twice the price-performance ratio in the desired scenarios and businesses, Zhang added.
As early as June 2021, Baidu AI Cloud began planning the construction of a new high-performance GPU cluster. Together with Nvidia, it completed the design of an InfiniBand network architecture that can be equipped with over 10,000 GPUs to provide EFLOPS-level computing power. Thanks to this cluster, Baidu released its ChatGPT-style AI bot, Wenxin Yiyan, or ERNIE Bot, in March.