Perhaps the first and most important detail to understand about the Internet Archive’s web crawling activities is that it operates far more like a traditional library archive than a modern commercial search engine. Most large web crawling operations today run vast farms of standardized crawlers, all operating in unison and sharing a common set of rules and behaviors. They traditionally work in continuous crawling mode, in which the goal is to scour the web 24/7/365 and attempt to identify and ingest every available URL. In contrast, the Internet Archive comprises a myriad of independent datasets, feeds, and crawls, each with very different characteristics and rules governing its construction, some run by the Archive and others by its many partners and contributors. In place of a single standardized continuous crawl with stable criteria and algorithms, there is a vibrant collage of inputs that all feed into the Archive’s sum holdings.

At the time of this writing, the primary web holdings of the Archive total more than 4.1 million items across 7,357 distinct collections, while its Archive-It program has over 440 partner organizations overseeing specific targeted collections. As Mark Graham, Director of the Wayback Machine, put it in an email, the Internet Archive’s web materials are made up of “many different collections driven by many organizations that have different approaches to crawling.”
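The contrast between the two crawling models can be sketched in a few lines of Python. This is only an illustrative toy, not how the Archive or any search engine actually implements crawling: `fetch`, `extract_links`, and `in_scope` are hypothetical stand-ins for a real HTTP fetcher, a link extractor, and a collection’s scoping rules. The only difference between the two functions is whether discovered URLs are filtered by a collection-specific scope before being enqueued.

```python
from collections import deque

def continuous_crawl(seeds, fetch, extract_links, limit=1000):
    """Search-engine style: one shared frontier, every discovered
    URL is fair game, crawling runs until the frontier is empty
    (or a safety limit is hit)."""
    frontier, seen, archived = deque(seeds), set(seeds), []
    while frontier and len(archived) < limit:
        url = frontier.popleft()
        fetch(url)                      # hypothetical page fetch
        archived.append(url)
        for link in extract_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return archived

def collection_crawl(seeds, in_scope, fetch, extract_links, limit=1000):
    """Archive-It style: a scoped crawl in which only URLs matching
    the collection's own rules (e.g. a domain whitelist set by the
    curating partner) are ever enqueued."""
    frontier, seen, archived = deque(seeds), set(seeds), []
    while frontier and len(archived) < limit:
        url = frontier.popleft()
        fetch(url)
        archived.append(url)
        for link in extract_links(url):
            if link not in seen and in_scope(link):
                seen.add(link)
                frontier.append(link)
    return archived
```

Under this sketch, the Archive’s aggregate holdings behave like the union of many `collection_crawl` runs, each with its own `in_scope` rules, rather than one `continuous_crawl` over the whole web.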