More than half a year has passed, and ChatGPT's ranking has fallen to the bottom.

Author: San Yan Technology

Today, I accidentally came across a picture.

According to the picture, OpenAI's GPT-4 is ranked last among the 11 large models shown (the list numbers ranks starting from 0, so first place is labeled 0). One internet user even captioned it "GPT4: How can I plead my innocence?"

This inevitably makes people curious. Earlier this year, after ChatGPT took off, other companies rushed to announce large models of their own.

And now, in just over half a year, has GPT already "hit rock bottom"?

So, I wanted to see how GPT is actually ranked.

Different Testing Times, Different Testing Teams: GPT-4 Ranks Eleventh

From the information shown in the previous picture, this ranking comes from the C-Eval ranking list.

The C-Eval ranking list, full name the C-Eval Global Large Model Comprehensive Examination Evaluation Ranking, is based on a comprehensive Chinese-language examination and evaluation suite built jointly by Tsinghua University, Shanghai Jiao Tong University, and the University of Edinburgh.

It is reported that the suite covers four major directions (humanities, social sciences, science and engineering, and other specialties), spanning 52 disciplines and knowledge areas such as calculus and linear algebra. In total it contains 13,948 Chinese knowledge and reasoning questions, with difficulty ranging from high school and undergraduate to graduate and professional exams.

So, I checked the latest C-Eval ranking list.

The latest ranking on the C-Eval list matches the ranking shown in the previous picture, with GPT-4 ranking last among the top eleven large models.

According to the C-Eval ranking list, these results represent zero-shot or few-shot tests, but few-shot does not necessarily perform better than zero-shot.
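Zero-shot means the model is given only the question; few-shot means a handful of worked examples are prepended to the prompt. As an illustration (this is a generic sketch, not C-Eval's actual template), the two settings differ only in whether demonstration items are included:

```python
def format_item(question, choices, answer=None):
    """Render one multiple-choice item; include the answer for demonstrations."""
    lines = [question]
    for label, choice in zip("ABCD", choices):
        lines.append(f"{label}. {choice}")
    lines.append("Answer:" + (f" {answer}" if answer else ""))
    return "\n".join(lines)

def build_prompt(question, choices, examples=()):
    """examples: iterable of (question, choices, answer) few-shot demonstrations."""
    demos = [format_item(q, c, a) for q, c, a in examples]
    return "\n\n".join(demos + [format_item(question, choices)])
```

In the zero-shot case the prompt ends right after the bare "Answer:", so the model must rely entirely on what it learned in training; in the few-shot case it can imitate the demonstrated answer format.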

C-Eval notes that many instruction-fine-tuned models actually perform better in the zero-shot setting. Many of the models it tested have both zero-shot and few-shot results, and the list shows whichever overall average score is better.
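In other words, a model's leaderboard entry is the better of its two overall averages. A minimal sketch of that selection rule (assuming, for illustration, that the overall score is the plain mean of per-subject accuracies; the actual C-Eval weighting may differ):

```python
def leaderboard_score(zero_shot, few_shot=None):
    """Pick the better overall average of the two evaluation settings.

    zero_shot / few_shot: dicts mapping subject name -> accuracy (0-100).
    """
    def overall(scores):
        return sum(scores.values()) / len(scores)

    if few_shot is None:
        return overall(zero_shot)
    return max(overall(zero_shot), overall(few_shot))
```

So an instruction-tuned model whose zero-shot average beats its few-shot average is still ranked by the zero-shot number, which is why few-shot is not automatically an advantage on this list.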

The C-Eval ranking list also notes that models with “*” in their names indicate that the results were obtained through testing by the C-Eval team, while other results were obtained through user submissions.

In addition, I also noticed that there is a significant difference in the time when these large models submitted their test results.

The test result submission time for GPT-4 was May 15th, while the top-ranked Yun Tian Shu submitted their results on August 31st; Galaxy, ranked second, submitted theirs on August 23rd; and YaYi, ranked third, submitted theirs on September 4th.

Furthermore, among the top 16 ranked models, only GPT-4’s name has an “*” indicating that it was tested by the C-Eval team.

So, I looked at the complete C-Eval ranking list again.

The latest C-Eval ranking list includes a total of 66 large models.

Among them, only 11 models have an “*” in their name, indicating that they were tested by the C-Eval team, and all of them submitted their test results on May 15th.

Among the large models tested by the C-Eval team, OpenAI's GPT-4 ranked eleventh overall, ChatGPT thirty-sixth, Zhipu AI's ChatGLM-6B sixtieth, and Fudan's MOSS sixty-fourth.

Although these rankings reflect the rapid development of large models in China, I believe that, since the tests were not conducted by the same team at the same time, they are not sufficient to prove the relative strength or weakness of these models.

It is like a class of students each taking an exam at a different time and answering a different paper: how can you compare their performance from their scores?

What Do the Developers of Large Models Say? Many Claim to Surpass ChatGPT in Chinese Proficiency and Other Abilities

Recently, the circle of large models has been quite lively.

First, eight companies including Baidu and ByteDance obtained filing approvals for their large model products under the "Interim Measures for the Management of Generative Artificial Intelligence Services" and can officially launch services for the public. Other companies have also successively released their own large model products.

So how do the developers of these large models introduce their products?

On July 7th, at the “Opportunities and Risks in the General Artificial Intelligence Industry in the Era of Large Models” forum of the 2023 World Artificial Intelligence Conference, Qiu Xipeng, a professor at the School of Computer Science and Technology of Fudan University and the person in charge of the MOSS system, stated that since the release of Fudan’s MOSS large-scale conversational language model in February this year, it has been continuously iterating, and “the latest MOSS has surpassed ChatGPT in Chinese proficiency.”

In late July, NetEase Youdao launched its translation large model. Youdao CEO Zhou Feng publicly stated that “in internal tests, in the direction of Chinese-English translation, it has surpassed ChatGPT in translation ability and also exceeded the level of Google Translate.”

In late August, at the 2023 Yabuli Forum Summer Summit, Liu Qingfeng, the founder and chairman of iFLYTEK, said in his speech, “The code generation and completion capabilities of iFLYTEK’s Xinghuo large model have surpassed ChatGPT, and other capabilities are catching up rapidly. The logic, algorithms, method system, and data preparation for code ability are all ready, all that is needed is time and computing power.”

In a recent press release, SenseTime stated that in August this year, its new model internlm-123b completed training, and the number of parameters increased to 123 billion. “In a total of 300,000 test questions from 51 well-known evaluation sets worldwide, its overall test scores ranked second globally, surpassing models such as gpt-3.5-turbo and Meta’s newly released llama2-70b.”

According to SenseTime, "internlm-123b ranked first in 12 of the major evaluations. Among them, its AGIEval score in the comprehensive evaluation set was 57.8, surpassing GPT-4 to take first place; its score on the CommonsenseQA knowledge-question evaluation was 88.5, ranking first; and internlm-123b achieved first place in all five reading-comprehension evaluations."

In addition, it ranked first in five reasoning evaluations.

Earlier this month, Zuoyebang officially released its self-developed Galaxy model.

Zuoyebang stated that the Galaxy model achieved outstanding results on C-Eval and CMMLU, two authoritative benchmarks for large language models. The data shows that the Zuoyebang Galaxy model ranked first on the C-Eval leaderboard with an average score of 73.7; it also ranked first on the CMMLU leaderboard in both the five-shot and zero-shot evaluations, with average scores of 74.03 and 73.85, respectively, becoming the first educational large model to rank first by average score on both of these authoritative lists.

Yesterday, Baichuan Intelligence announced the official open-sourcing of Baichuan 2-7B, Baichuan 2-13B, Baichuan 2-13B-Chat, and their 4-bit quantized versions.

Wang Xiaochuan, founder and CEO of Baichuan Intelligence, said that in the Chinese domain the fine-tuned Chat model has surpassed closed-source models such as GPT-3.5 in practical Q&A and summarization scenarios.

Today, at the 2023 Tencent Global Digital Ecology Conference, Tencent officially released its Hunyuan model. Jiang Jie, Vice President of Tencent Group, stated that the Chinese-language capability of Tencent Hunyuan has surpassed GPT-3.5.

In addition to these self-introductions by developers, there are also some media and teams that have evaluated large models.

In early August, a team led by Shen Yang, a professor and doctoral supervisor at the School of Journalism and Communication of Tsinghua University, released a "Comprehensive Performance Evaluation Report of Large Language Models." The report shows that Baidu's Wenxin Yiyan leads domestically in the comprehensive score across 20 indicators in three dimensions, outperforming ChatGPT, ranking strongly in Chinese semantic understanding and surpassing GPT-4 in some Chinese capabilities.

In mid-August, media reported that on August 11, Xiaomi’s large model MiLM-6B appeared on the C-Eval and CMMLU large model evaluation lists. Currently, MiLM-6B ranks 10th in the overall C-Eval leaderboard and 1st in the same parameter level. In CMMLU, it ranks 1st among Chinese-oriented large models.

On August 12, Tianjin University released its "Large Model Evaluation Report." The report shows that GPT-4 and Baidu's Wenxin Yiyan are significantly ahead of other models in comprehensive performance, with similar scores at the same level. Wenxin Yiyan has already surpassed ChatGPT in most Chinese tasks and is gradually narrowing the gap with GPT-4.

In late August, media reported that Kwai’s self-developed large language model “KwaiYii” has entered the internal testing phase. In the latest CMMLU Chinese-oriented rankings, KwaiYii-13B, the 13B version of KwaiYii, ranks first in both the five-shot and zero-shot categories, showing strength in humanities and specific Chinese topics, with an average score of over 61.

From the above content, it can be seen that although these large models claim to be at the top of certain rankings or surpass ChatGPT in certain aspects, they mostly excel in specific domains.

In addition, some comprehensive scores do exceed GPT-3.5's or GPT-4's, but GPT's testing was conducted back in May. Who can guarantee that GPT has made no progress in the months since?

OpenAI’s Situation

According to a report from UBS Group in February, just two months after the launch of ChatGPT, it had surpassed 100 million monthly active users by the end of January 2023, becoming the fastest-growing consumer application in history.

However, the development of ChatGPT has not been smooth sailing.

In July of this year, many GPT-4 users complained that GPT-4's reasoning performance had declined from its earlier level.

Some users pointed out issues on Twitter and the OpenAI developer forum, focusing on weaker logic, more incorrect answers, inability to track provided information, difficulty following instructions, forgetting to add parentheses in basic software code, and only remembering the most recent prompts, among other things.

In August, another report claimed that OpenAI may be facing a potential financial crisis and could go bankrupt by the end of 2024.

The report stated that OpenAI spends about $700,000 per day to operate its artificial intelligence service, ChatGPT. Currently, the company is trying to achieve profitability through GPT-3.5 and GPT-4, but it has not yet generated enough revenue to break even.

However, OpenAI may have a new turning point.

Recently, OpenAI announced that it will hold its first developer conference in November.

Although OpenAI stated that it will not release GPT-5, it claimed that hundreds of developers from around the world will have an early look at “new tools” and exchange ideas with the OpenAI team.

This may indicate that ChatGPT has made new progress.

According to a report from The Paper, on August 30th, an insider revealed that OpenAI expects to generate over $1 billion in revenue in the next 12 months by selling AI software and the computing power to run it.

Today, there are reports that later this month, Morgan Stanley will launch a generative artificial intelligence chatbot developed in collaboration with OpenAI.

Those who deal with Morgan Stanley’s bankers are either rich or influential. If this upcoming generative AI chatbot can bring a different experience to Morgan Stanley’s clients, it could be a huge gain for OpenAI.

The arrival of the artificial intelligence era is unstoppable. As for which model is better, that is not something a developer can decide for itself; users will be the judges. We also believe that China's large models will eventually surpass ChatGPT, both in specific capabilities and overall.
