Focused crawler


A focused crawler is a web crawler that collects Web pages that satisfy some specific property, by carefully prioritizing the crawl frontier and managing the hyperlink exploration process.[1] Some predicates are based on simple, deterministic surface properties. For example, a crawler's mission may be to crawl pages from only the .jp domain. Other predicates may be softer or comparative, e.g., "crawl pages about baseball", or "crawl pages with large PageRank". An important class of page properties concerns topics, which leads to 'topical crawlers'. For example, a topical crawler may be deployed to collect pages about solar power, swine flu, or even more abstract concepts like controversy,[2] while minimizing the resources spent fetching pages on other topics. Crawl frontier management is not necessarily the only device used by focused crawlers; they may also use a Web directory, a Web text index, backlinks, or any other Web artifact.
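As a rough illustration of frontier prioritization, the sketch below keeps the frontier in a priority queue and always fetches the highest-scoring URL next. It is a minimal sketch under stated assumptions, not a description of any particular crawler: `TOY_WEB`, `links_of`, `jp_score` and `focused_crawl` are hypothetical names, and the "Web" is an in-memory link table rather than real HTTP fetching.

```python
import heapq
from urllib.parse import urlparse

# Hypothetical in-memory "Web": URL -> outgoing links.
# A real crawler would download and parse pages instead.
TOY_WEB = {
    "http://a.example.jp/": ["http://b.example.jp/", "http://c.example.com/"],
    "http://b.example.jp/": ["http://d.example.jp/"],
    "http://c.example.com/": ["http://e.example.jp/"],
    "http://d.example.jp/": [],
    "http://e.example.jp/": [],
}


def links_of(url):
    """Stand-in for fetching a page and extracting its links."""
    return TOY_WEB.get(url, [])


def jp_score(url):
    """Surface predicate from the text: 1.0 for hosts under .jp, else 0.0."""
    host = urlparse(url).hostname or ""
    return 1.0 if host.endswith(".jp") else 0.0


def focused_crawl(seeds, score, fetch_links, budget):
    """Fetch at most `budget` URLs, always taking the best-scored one first."""
    frontier = []  # min-heap of (-score, insertion order, url)
    seen = set()
    counter = 0

    def push(url):
        nonlocal counter
        if url not in seen:
            seen.add(url)
            heapq.heappush(frontier, (-score(url), counter, url))
            counter += 1

    for seed in seeds:
        push(seed)

    visited = []
    while frontier and len(visited) < budget:
        _, _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            push(link)
    return visited


print(focused_crawl(["http://a.example.jp/"], jp_score, links_of, budget=3))
```

With a budget of three fetches, the .com page is never downloaded. The .jp page that is reachable only through it is also missed, which hints at why softer scoring can be preferable to a hard filter.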

A focused crawler must predict the probability that an unvisited page will be relevant before actually downloading the page.[3] A possible predictor is the anchor text of links; this was the approach taken by Pinkerton[4] in a crawler developed in the early days of the Web. Topical crawling was first introduced by Filippo Menczer.[5][6] Chakrabarti et al. coined the term 'focused crawler' and used a text classifier[7] to prioritize the crawl frontier. Andrew McCallum and co-authors used reinforcement learning[8][9] to focus crawlers. Diligenti et al. traced the context graph[10] leading up to relevant pages, and their text content, to train classifiers. A form of online reinforcement learning has been used, along with features extracted from the DOM tree and text of linking pages, to continually train[11] classifiers that guide the crawl. In a review of topical crawling algorithms, Menczer et al.[12] show that simple strategies are very effective for short crawls, while more sophisticated techniques such as reinforcement learning and evolutionary adaptation can give the best performance over longer crawls. Spatial information has also been shown to be important for classifying Web documents.[13]
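The idea of predicting relevance from anchor text can be sketched with the standard-library HTML parser. This is only an illustrative keyword-overlap heuristic, not the method of any of the cited systems; `AnchorExtractor`, `anchor_score` and `rank_links` are hypothetical names, and splitting on whitespace ignores punctuation and stemming.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def anchor_score(anchor_text, topic_terms):
    """Fraction of topic terms that occur in the anchor text."""
    if not topic_terms:
        return 0.0
    words = set(anchor_text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def rank_links(html, topic_terms):
    """Score each link by its anchor text, best first, without fetching it."""
    parser = AnchorExtractor()
    parser.feed(html)
    scored = [(anchor_score(text, topic_terms), href) for href, text in parser.links]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


page = '<p><a href="/solar">Solar power basics</a> <a href="/news">Sports news</a></p>'
print(rank_links(page, {"solar", "power"}))
```

The resulting scores could be used as priorities in a crawl frontier; a trained text classifier would replace `anchor_score` in the classifier-based approaches mentioned above.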

Another type of focused crawler is the semantic focused crawler, which makes use of domain ontologies to represent topical maps and to link Web pages with relevant ontological concepts for selection and categorization purposes.[14] In addition, ontologies can be automatically updated during the crawling process. Dong et al.[15] introduced such an ontology-learning-based crawler, which uses a support vector machine to update the content of ontological concepts when crawling Web pages.

Crawlers are also focused on page properties other than topics. Cho et al.[16] study a variety of crawl prioritization policies and their effects on the link popularity of fetched pages. Najork and Wiener[17] show that breadth-first crawling, starting from popular seed pages, leads to collecting large-PageRank pages early in the crawl. Refinements involving detection of stale (poorly maintained) pages have been reported by Eiron et al.[18] A kind of semantic focused crawler that makes use of the idea of reinforcement learning has been introduced by Meusel et al.,[19] using online-based classification algorithms in combination with a bandit-based selection strategy to efficiently crawl pages with markup languages like RDFa, Microformats, and Microdata.
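For comparison, breadth-first ordering needs no scoring at all: the frontier is a plain FIFO queue. The sketch below shows only the visiting order on a hypothetical toy link table (`TOY_LINKS` and `breadth_first_order` are made-up names); it does not compute PageRank or reproduce the cited experiments.

```python
from collections import deque

# Hypothetical link table: page -> outgoing links.
TOY_LINKS = {
    "hub": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": [],
    "d": [],
}


def breadth_first_order(seeds, fetch_links, limit):
    """Return up to `limit` pages in the order a breadth-first crawl visits them."""
    queue = deque()
    seen = set()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    order = []
    while queue and len(order) < limit:
        url = queue.popleft()  # oldest discovered page first
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


print(breadth_first_order(["hub"], lambda u: TOY_LINKS.get(u, []), limit=10))
```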

The performance of a focused crawler depends on the richness of links in the specific topic being searched, and focused crawling usually relies on a general web search engine to provide starting points. Davison[20] presented studies on Web links and text that explain why focused crawling succeeds on broad topics; similar studies were presented by Chakrabarti et al.[21] Seed selection can be important for focused crawlers and can significantly influence crawling efficiency.[22] A whitelist strategy starts the focused crawl from a list of high-quality seed URLs and limits the crawling scope to the domains of these URLs. These high-quality seeds should be selected from a list of URL candidates accumulated over a sufficiently long period of general web crawling. The whitelist should be updated periodically after it is created.
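The scope restriction in a whitelist strategy can be sketched as a host check applied to every discovered link. This is an illustrative sketch only: `build_whitelist` and `in_scope` are hypothetical helpers, the example URLs are invented, and treating subdomains of a whitelisted host as in scope is an assumption rather than something the cited work specifies.

```python
from urllib.parse import urlparse


def build_whitelist(seed_urls):
    """Collect the host names of the seed URLs."""
    hosts = set()
    for url in seed_urls:
        host = urlparse(url).hostname
        if host:
            hosts.add(host)
    return hosts


def in_scope(url, whitelist):
    """True if the URL's host is whitelisted or is a subdomain of a whitelisted host."""
    host = urlparse(url).hostname
    if host is None:
        return False
    return any(host == allowed or host.endswith("." + allowed) for allowed in whitelist)


seeds = ["https://www.example.edu/papers/", "http://cs.example.org/"]
whitelist = build_whitelist(seeds)
print(in_scope("https://www.example.edu/a.pdf", whitelist))
print(in_scope("http://other.example.com/", whitelist))
```

A crawler would call `in_scope` before adding a link to the frontier, and rebuild the whitelist whenever the seed list is refreshed.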

  1. Soumen Chakrabarti, Focused Web Crawling, in the Encyclopedia of Database Systems.
  2. Controversial topics
  3. Improving the Performance of Focused Web Crawlers, Sotiris Batsakis, Euripides G. M. Petrakis, Evangelos Milios, 2012-04-09
  4. Pinkerton, B. (1994). Finding what people want: Experiences with the WebCrawler. In Proceedings of the First World Wide Web Conference, Geneva, Switzerland.
  5. Menczer, F. (1997). ARACHNID: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery. In D. Fisher, ed., Proceedings of the 14th International Conference on Machine Learning (ICML97). Morgan Kaufmann.
  6. Menczer, F. and Belew, R.K. (1998). Adaptive Information Agents in Distributed Textual Environments. In K. Sycara and M. Wooldridge (eds.) Proceedings of the 2nd International Conference on Autonomous Agents (Agents '98). ACM Press.
  7. Focused crawling: a new approach to topic-specific Web resource discovery, Soumen Chakrabarti, Martin van den Berg and Byron Dom, WWW 1999.
  8. A machine learning approach to building domain-specific search engines, Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore, IJCAI 1999.
  9. Using Reinforcement Learning to Spider the Web Efficiently, Jason Rennie and Andrew McCallum, ICML 1999.
  10. Diligenti, M., Coetzee, F., Lawrence, S., Giles, C. L., and Gori, M. (2000). Focused crawling using context graphs. In Proceedings of the 26th International Conference on Very Large Databases (VLDB), pages 527-534, Cairo, Egypt.
  11. Accelerated focused crawling through online relevance feedback, Soumen Chakrabarti, Kunal Punera, and Mallela Subramanyam, WWW 2002.
  12. Menczer, F., Pant, G., and Srinivasan, P. (2004). Topical Web Crawlers: Evaluating Adaptive Algorithms. ACM Trans. on Internet Technology 4(4): 378–419.
  13. Recognition of common areas in a Web page using visual information: a possible application in a page classification, Milos Kovacevic, Michelangelo Diligenti, Marco Gori, Veljko Milutinovic, Data Mining, 2002. ICDM 2003.
  14. Dong, H., Hussain, F.K., Chang, E.: State of the art in semantic focused crawlers. Computational Science and Its Applications – ICCSA 2009. Springer-Verlag, Seoul, Korea (July 2009) pp. 910-924
  15. Dong, H., Hussain, F.K.: SOF: A semi-supervised ontology-learning-based focused crawler. Concurrency and Computation: Practice and Experience. 25(12) (August 2013) pp. 1623-1812
  16. Junghoo Cho, Hector Garcia-Molina, Lawrence Page: Efficient Crawling Through URL Ordering. Computer Networks 30(1-7): 161-172 (1998)
  17. Marc Najork, Janet L. Wiener: Breadth-first crawling yields high-quality pages. WWW 2001: 114-118
  18. Nadav Eiron, Kevin S. McCurley, John A. Tomlin: Ranking the web frontier. WWW 2004: 309-318.
  19. Meusel R., Mika P., Blanco R. (2014). Focused Crawling for Structured Data. ACM International Conference on Information and Knowledge Management, Pages 1039-1048.
  20. Brian D. Davison: Topical locality in the Web. SIGIR 2000: 272-279.
  21. Soumen Chakrabarti, Mukul Joshi, Kunal Punera, David M. Pennock: The structure of broad topics on the Web. WWW 2002: 251-262.
  22. Jian Wu, Pradeep Teregowda, Juan Pablo Fernández Ramírez, Prasenjit Mitra, Shuyi Zheng, C. Lee Giles, The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists, In Proceedings of the 3rd Annual ACM Web Science Conference, Pages 340-343, Evanston, IL, USA, June 2012.
