Google Says Its New AI Model Uses Five Times More Text Data For Training

According to internal documentation viewed by CNBC, Google has unveiled a new large language model called PaLM 2. It is a general-use model trained on 3.6 trillion tokens, almost five times as much training data as its 2022 predecessor, and it can perform more advanced coding, math, and creative writing tasks.

Tokens are strings of words and are a crucial building block for training language models, because they teach the model to predict the word that comes next in a sequence.
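As a loose illustration (this is not Google's tokenizer, which has not been published, and production LLMs split text into sub-word pieces rather than whole words), the sketch below shows how a tokenized sentence yields the next-token prediction targets a model is trained on.

```python
# Illustrative only: a toy whitespace tokenizer standing in for a real
# sub-word tokenizer, plus the (context -> next token) pairs used in training.

def tokenize(text: str) -> list[str]:
    """Naive whitespace tokenizer; real models use sub-word schemes like BPE."""
    return text.lower().split()

def next_token_pairs(tokens: list[str]) -> list[tuple[list[str], str]]:
    """Build (context, next-token) training pairs from a token sequence."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = tokenize("Google trained PaLM 2 on trillions of tokens")
for context, target in next_token_pairs(tokens):
    print(context, "->", target)
```

Each pair asks the model to guess the next token given everything before it, which is why the total number of training tokens is a common yardstick for how much a model has seen.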

The previous version, PaLM (short for Pathways Language Model), was released in 2022 and trained on 780 billion tokens. While Google has been reluctant to publish details of its training data, the company has been eager to showcase the capabilities of its AI technology and embed it into search, email, word processing, and spreadsheets.

The lack of disclosure regarding training data by Google and OpenAI, the Microsoft-backed creator of ChatGPT, is due to the competitive nature of the business. Both companies are racing to attract users who prefer conversational chatbots to traditional search engines.

However, the research community is calling for greater transparency as the AI arms race intensifies.

Google claims that PaLM 2 is smaller than previous LLMs, which is significant because it means the company’s technology is becoming more efficient while handling more complex tasks. Parameter count, not token count, is the measure of a model’s size and complexity: the initial PaLM had 540 billion parameters, while the 3.6 trillion tokens describe how much data PaLM 2 was trained on.

Google has not yet provided any comment on the matter.

Google said in a blog post about PaLM 2 that the model uses a “new technique” called “compute-optimal scaling.” That makes the LLM “more efficient with overall better performance, including faster inference, fewer parameters to serve, and a lower serving cost.”
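Google has not published the details of that technique. As a rough sketch of the general idea behind compute-optimal scaling, the snippet below applies the widely cited DeepMind “Chinchilla” rule of thumb of roughly 20 training tokens per parameter; that ratio is an assumption here, not a figure from Google’s blog post or the documents CNBC reviewed.

```python
# Rough sketch of the "compute-optimal scaling" idea, assuming the commonly
# cited Chinchilla-style heuristic of ~20 training tokens per parameter.
# This is NOT Google's disclosed recipe for PaLM 2.

TOKENS_PER_PARAM = 20  # assumed ratio, for illustration only

def compute_optimal_tokens(num_params: float) -> float:
    """Estimate a compute-optimal training-token budget for a given model size."""
    return TOKENS_PER_PARAM * num_params

# Example: under this heuristic, a 540-billion-parameter model (the size of
# the original PaLM) would call for roughly 10.8 trillion training tokens,
# far more than the 780 billion it was actually trained on.
print(f"{compute_optimal_tokens(540e9):.2e} tokens")
```

Under that heuristic, training a smaller model on much more data can match or beat a larger model trained on less, which is consistent with Google’s claim that PaLM 2 is smaller than its predecessor yet performs better.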

Google has publicly announced PaLM 2, and the company says the model is trained in 100 languages and can perform a wide range of tasks. It already powers 25 features and products, including the experimental chatbot Bard, and comes in four sizes, from smallest to largest: Gecko, Otter, Bison, and Unicorn.

Based on public disclosures, PaLM 2 is more powerful than any existing model. Facebook’s LLM, LLaMA, which was announced in February, was trained on 1.4 trillion tokens.

The last time OpenAI disclosed ChatGPT’s training size was with GPT-3, which the company said at the time was trained on 300 billion tokens. OpenAI released GPT-4 in March, saying it demonstrates “human-level performance” on many professional tests.

LaMDA, a conversational language model that Google introduced two years ago and touted in February alongside Bard, was trained on 1.5 trillion tokens, according to the latest documents viewed by CNBC.

As AI applications rapidly gain popularity, the controversies surrounding the underlying technology are becoming more intense. In February, El Mahdi El Mhamdi, a senior Google Research scientist, resigned due to the company’s lack of transparency.

During a hearing of the Senate Judiciary subcommittee on privacy and technology, OpenAI CEO Sam Altman testified and agreed with lawmakers that a new system for managing AI is necessary.

“For a very new technology we need a new framework,” Altman said. “Certainly companies like ours bear a lot of responsibility for the tools that we put out in the world.”
