Other names: Norconex HTTP Collector
Developer(s): Norconex Inc.
Initial release: 2016
Stable release: 3.0.2 / 2022-01-05
Repository: GitHub Repository
Written in: Java
Operating system: Cross-platform
License: Apache License
Website: Norconex Web Crawler
Norconex Web Crawler is free and open-source web crawling and web scraping software written in Java and released under the Apache License. It can export data to repositories such as Apache Solr, Elasticsearch, Microsoft Azure Cognitive Search, and Amazon CloudSearch.[1][2][3]
The crawler can be run on its own or embedded in a Java application.[4][5]
Key features include:
Multi-threaded crawling
Text extraction from a variety of file formats (HTML, PDF, Word, etc.)
Extraction of metadata associated with documents
Support for pages rendered with JavaScript
Incremental crawls
Support for external commands that parse or manipulate documents
Delivery of extracted data to a variety of repositories
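The idea behind an incremental crawl can be illustrated with a minimal sketch: remember a checksum of each document from the previous run and reprocess only documents that are new or whose content has changed. The example below is a generic illustration using only the Java standard library. It is not Norconex's API or its actual implementation; the class name IncrementalCrawlSketch, the method shouldProcess, and the example URL are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of incremental crawling; not part of Norconex. */
public class IncrementalCrawlSketch {

    // URL -> checksum of the content seen the last time that URL was crawled.
    private final Map<String, String> checksums = new HashMap<>();

    /** Returns true when the document is new or changed since the last visit. */
    public boolean shouldProcess(String url, String content) {
        String current = sha256(content);
        // put() stores the new checksum and returns the old one (or null).
        String previous = checksums.put(url, current);
        return !current.equals(previous);
    }

    /** Computes a lowercase hexadecimal SHA-256 digest of the given text. */
    static String sha256(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        IncrementalCrawlSketch crawl = new IncrementalCrawlSketch();
        String url = "https://example.com/"; // placeholder URL
        System.out.println(crawl.shouldProcess(url, "version 1")); // true: first visit
        System.out.println(crawl.shouldProcess(url, "version 1")); // false: unchanged
        System.out.println(crawl.shouldProcess(url, "version 2")); // true: content changed
    }
}
```

A real crawler would persist the checksum store between runs and would typically also handle documents that have disappeared since the previous crawl; both are left out here for brevity.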
Organizations and products using Norconex Web Crawler include the Apache Solr ecosystem, the Department of National Defence, Universities Canada, and the U.S. Department of Education.[6][7]
"Committers". opensource.norconex.com.
Hoppa, Jocelyn (10 February 2020). "Importing Data from the Web with Norconex & Neo4j". Graph Database & Analytics.
"Deploy a Norconex HTTP Collector Indexer Plugin | Cloud Search". Google for Developers.
Valcheva, Silvia (11 February 2018). "10 Best Open Source Web Crawlers: Web Data Extraction Software". Blog For Data-Driven Business.
"Norconex HTTP Collector". Softpedia. Retrieved 25 September 2023.