Heritrix information

Heritrix
	Screenshot of Heritrix Admin Console.
Stable release	3.4.0-20220727^[1] / 28 July 2022
Repository	github.com/internetarchive/heritrix3
Written in	Java
Operating system	Linux/Unix-like/Windows (unsupported)
Type	Web crawler
License	Apache License
Website	github.com/internetarchive/heritrix3/wiki

Heritrix is a web crawler designed for web archiving. Written in Java by the Internet Archive, it is available under a free software license. Its main interface is accessed through a web browser, and an optional command-line tool can be used to initiate crawls.

Heritrix was developed jointly by the Internet Archive and the Nordic national libraries, based on specifications written in early 2003. The first official release was in January 2004, and it has been continually improved since then by employees of the Internet Archive and other interested parties.

For many years Heritrix was not the main crawler used to collect content for the Internet Archive's web collection.^[2] The largest contributor to the collection, as of 2011, was Alexa Internet.^[2] Alexa crawled the web for its own purposes,^[2] using a crawler named ia_archiver, and then donated the material to the Internet Archive.^[2] The Internet Archive itself did some of its own crawling using Heritrix, but only on a smaller scale.^[2]

Starting in 2008, the Internet Archive began making performance improvements so that it could do its own wide-scale crawling, and it now collects most of its content itself.^[3]^{[failed verification]}

[1] "Release 3.4.0-20220727". 28 July 2022. Retrieved 5 October 2022.
[2] Kris (September 6, 2011). "Re: Control over the Internet Archive besides just 'Disallow /'?". Pro Webmasters Stack Exchange. Stack Exchange, Inc. Retrieved January 7, 2013.
[3] "Wayback Machine: Now with 240,000,000,000 URLs - Internet Archive Blogs". blog.archive.org. Retrieved 11 September 2017.



