ByteDance, the parent company of TikTok, is stepping up its efforts in the race to train generative AI models with the launch of a new web-scraping tool. Dubbed Bytespider, the bot was reportedly introduced in April and has already become one of the most aggressive web scrapers in operation.
Research from bot management company Kasada and bot monitoring firm Dark Visitors revealed that ByteDance’s Bytespider scrapes web data 25 times faster than GPTBot, OpenAI’s web scraper for its ChatGPT platform. It is also scraping at a rate 3,000 times faster than ClaudeBot, the scraper used by Anthropic for its Claude platform.
A scraping frenzy
Since its debut, Bytespider’s activity has only increased, with noticeable spikes in scraping over the past six weeks, according to a report by Fortune.
It appears ByteDance is trying to quickly gather as much data as possible to catch up with other tech giants like Google, Meta, and OpenAI, all of which use web scrapers to collect vast amounts of online data to train their large language and multimodal models (LLMs or LMMs).
However, ByteDance’s scraper, like those used by other AI companies, does not honour robots.txt, the file website owners publish to tell crawlers which parts of a site they should not access.
Though robots.txt isn’t legally enforceable, the disregard for it has stirred controversy as web scraping is often seen as infringing on copyright, particularly when used to train AI models.
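For readers unfamiliar with how robots.txt is supposed to work, here is a minimal sketch of the check a compliant crawler would run before fetching a page. It uses Python's standard urllib.robotparser module. The rules, the example.com URLs, and the is_allowed helper are hypothetical and for illustration only; they are not taken from any real site or from ByteDance's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site. These rules are made up
# for illustration: block Bytespider entirely, and block every other
# crawler from /private/ only.
ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a live crawler would normally
    # download them from https://<site>/robots.txt first.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Bytespider", "GPTBot", "ClaudeBot"):
        print(agent, is_allowed(agent, "https://example.com/articles/1"))
```

Nothing in this check is enforced by the website itself. A crawler that never runs it, or ignores the answer, can still request the page, which is why compliance is voluntary.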
As generative AI tools rely heavily on web data to function, scraping has become a contentious issue, with many individuals and organisations arguing that their work is being copied without compensation. The practice has been around for decades, primarily for search engines, but the rise of AI has introduced new legal and ethical concerns.
ByteDance’s AI push
ByteDance’s aggressive scraping efforts come at a time when the company is under scrutiny, particularly in the US. President Joe Biden has signed legislation requiring ByteDance to either sell TikTok or shut it down, citing national security concerns.
Despite this, ByteDance seems determined to advance its AI capabilities.
ByteDance’s scraping frenzy suggests the company is working on a new large language model. Reports from earlier this year indicate that ByteDance was behind in the generative AI race and even relied on OpenAI to help build its own model, a move that violated OpenAI’s terms of service.
In early 2023, ByteDance launched Doubao, a chat-based LLM, but the model’s development was completed before the more recent data collection efforts.
One potential application for ByteDance’s new LLM is improving TikTok’s search functionality. TikTok recently updated its search feature to focus on keywords for ads, allowing advertisers to target trending words in real-time. With a more robust AI model trained on up-to-date web data, TikTok could further enhance its search capabilities, creating a more competitive environment for advertisers currently relying on Google.
The rapid data collection and AI advancements suggest that ByteDance is eager to not only catch up but potentially reshape the landscape of search and AI, especially within the context of TikTok’s massive user base. If successful, these efforts could make TikTok’s search environment highly appealing to advertisers looking to reach larger audiences through precise, data-driven keywords and trends.