Meta has launched a new web crawler designed to amass extensive data from across the internet.
Last month, Meta introduced the Meta External Agent, an advanced web crawler engineered to gather vast amounts of data from publicly accessible websites. This automated tool, which essentially “scrapes” web content such as news articles and online forum discussions, is part of Meta’s broader strategy to refine its artificial intelligence models.
A representative from Dark Visitors, a company specializing in blocking web scrapers, compared Meta External Agent to OpenAI’s GPTBot, which also scrapes data for AI training purposes. Two other web-scraping trackers confirmed the existence and function of Meta’s new bot.
Meta did not publicly announce the launch, but a review of the company’s developer website through the Internet Archive shows that it quietly updated the site in late July to acknowledge the new crawler. A company spokesperson noted that Meta has long used a similar tool, previously known as Facebook External Hit, for purposes such as generating link previews.
The spokesperson elaborated, “Like other companies, we train our generative AI models on content that is publicly available online. We recently updated our guidance regarding the best way for publishers to exclude their domains from being crawled by Meta’s AI-related crawlers.”
The practice of scraping web data for AI training has sparked controversy, with many artists, writers, and content creators filing lawsuits against AI companies for using their work without permission. Recently, some companies like OpenAI and Perplexity have begun compensating content providers for access to their data. Data from Dark Visitors shows that while approximately 25% of the world’s top websites block GPTBot, only 2% have barred Meta’s new crawler.
To block a web scraper, website operators typically use a robots.txt file, which tells crawlers which parts of a site they may not visit. The file only works, however, if the crawler’s user-agent name is known and the crawler chooses to honor the instructions; bots are free to ignore robots.txt, which makes it an unreliable safeguard on its own.
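For illustration, a minimal robots.txt entry along these lines is what publishers typically add. The user-agent tokens shown here are assumptions for the sketch: “meta-externalagent” is the token Meta lists in its developer documentation for the new crawler, and “GPTBot” is OpenAI’s documented token; site owners should verify the exact names against each operator’s current documentation before relying on them.

```
# robots.txt — served from the site root (e.g. https://example.com/robots.txt)

# Ask Meta's AI-related crawler to avoid the entire site.
# ("meta-externalagent" is the token listed on Meta's developer site; confirm before use.)
User-agent: meta-externalagent
Disallow: /

# The same pattern applies to OpenAI's training crawler.
User-agent: GPTBot
Disallow: /

# All other crawlers remain free to visit everything.
User-agent: *
Disallow:
```

As the paragraph above notes, these directives are purely advisory: a crawler that chooses not to honor robots.txt will ignore them.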
Meta’s data scraping efforts are integral to training its generative AI models, such as the Llama series, which powers various AI applications on Meta’s platforms. Although Meta has not disclosed the exact training data for its latest model, Llama 3, the initial Llama model relied on extensive datasets from sources like Common Crawl.
This year, Mark Zuckerberg boasted that Meta’s AI training data exceeded the volume of Common Crawl, which scrapes roughly 3 billion web pages each month. Yet the debut of Meta External Agent suggests that even a dataset of that scale may not be enough, as the company pours as much as $40 billion into AI development this year.