X, formerly known as Twitter, has just updated its terms of service (again) to explicitly forbid scraping and crawling of its platform without prior written consent.
The updated terms, set to take effect on September 29, 2023, introduce strict controls on unauthorized data collection methods. They come just eight days after X amended its Privacy Policy to state that the platform will begin collecting users’ biometric data, along with their education and employment history.
The previous version of the terms permitted crawling as long as it adhered to the guidelines in the robots.txt file, which tells “crawlers” (automated programs) which parts of a website they are allowed to visit. The revised terms eliminate this provision and require explicit written consent from X for any form of scraping or crawling.
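To illustrate how a crawler consults robots.txt before visiting a page, here is a minimal sketch using Python's standard `urllib.robotparser` module. The rules, the `ExampleBot` user agent, the `example.com` URLs, and the `is_allowed` helper are all hypothetical and invented for illustration; they are not taken from X's actual robots.txt file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, invented for illustration (not X's real file).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's contents as an iterable of lines,
    # so no network request is made in this sketch.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/private/data",
                "https://example.com/public/page"):
        print(url, "->", is_allowed(SAMPLE_ROBOTS_TXT, "ExampleBot", url))
```

A well-behaved crawler would run a check like this before each request. Note that robots.txt is only a convention that crawlers choose to honor, which is why terms of service can impose stricter requirements on top of it.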
Web Crawling vs. Web Scraping
While the two terms sound similar, they serve different purposes.
Web “crawling” grabs other web pages to create indices or collections of data, while web “scraping” downloads webpages to extract a specific set of data for analysis – e.g. product details, pricing information, SEO data, etc.
Essentially, “web scraping” extracts publicly available data from a website and saves it to a local file or folder on your computer, using a program that looks for the specific set of data the user wants. “Web crawling,” by contrast, discovers target URLs and other links in order to build an index or multiple indices of data. The two are often combined: a crawler finds the pages to visit, and a scraper pulls the data out of them.
Data scraping is one of the most effective ways to extract data from the web, and once the data has been saved locally, working with it doesn’t require an internet connection.
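The difference between the two can be sketched in a few lines of Python using only the standard library's `html.parser`. The sample page, the `price` class name, and the helper names `crawl_links` and `scrape_prices` are hypothetical, chosen purely for illustration: the crawling step collects links to visit next, while the scraping step extracts one specific field.

```python
from html.parser import HTMLParser

# Hypothetical page, invented for illustration.
SAMPLE_HTML = """
<html><body>
  <h1>Example Widget</h1>
  <span class="price">19.99</span>
  <a href="/products/2">Next product</a>
  <a href="/about">About</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Crawling step: gather the links a crawler would visit next."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class PriceExtractor(HTMLParser):
    """Scraping step: extract only the text inside <span class="price">."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def crawl_links(html: str) -> list:
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def scrape_prices(html: str) -> list:
    extractor = PriceExtractor()
    extractor.feed(html)
    return extractor.prices


if __name__ == "__main__":
    print("links to crawl:", crawl_links(SAMPLE_HTML))
    print("scraped prices:", scrape_prices(SAMPLE_HTML))
```

In practice a real tool would fetch each page over the network and repeat both steps for every discovered link; this sketch works on a fixed string so that the two roles stay easy to tell apart.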
Alongside the updated terms of service, X has recently changed its robots.txt file, which tells web crawlers, including Google’s, which sections of the site they are permitted to access. The changes curtail access to specific data types, including the likes and retweets associated with particular posts, as well as account-related information such as likes, media, and photos.
The decision to bolster restrictions on scraping and data access comes on the heels of X’s recent platform modifications. These adjustments included temporarily preventing logged-out users from viewing posts and subsequently eliminating the login requirement for accessing tweets.
X’s owner, Elon Musk, said the measures were needed in response to excessive data scraping, which was degrading the platform’s performance for regular users.
Musk has previously been vocal in opposing companies that scrape Twitter/X data to train AI models. He earlier threatened legal action against Microsoft, alleging that it had unlawfully used the platform’s data for AI training.
In July, Musk initiated a legal action against “John Doe” defendants involved in unauthorized data collection.
The impact of these stringent measures on data accessibility and X’s relationship with web crawlers, including those from tech giants like Google, remains to be seen.
Editor’s note: This article was written by an nft now staff member in collaboration with OpenAI’s GPT-3.