Next: Distributing the Gathering Up: Harvest User's Manual Previous: 1 Introduction to Harvest

2 Subsystem Overview

 

     

As illustrated in Figure 1, Harvest consists of several subsystems. The Gatherer subsystem collects indexing information (such as keywords, author names, and titles) from the resources available at Provider sites (such as FTP and HTTP servers). The Broker subsystem retrieves indexing information from one or more Gatherers, suppresses duplicate information, incrementally indexes the collected information, and provides a WWW query interface to it. The Replicator subsystem efficiently replicates Brokers around the Internet. The Cache subsystem lets users efficiently retrieve the information they have located. The Harvest Server Registry (HSR) is a distinguished Broker that holds information about each Harvest Gatherer, Broker, Cache, and Replicator on the Internet.

 
Figure 1: Harvest Software Components  
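
The Gatherer-to-Broker data flow described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Harvest's actual code or record format: the class names, the Summary fields, the rule of treating records with the same URL as duplicates, and the example URLs are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Indexing information for one resource (hypothetical record shape)."""
    url: str
    title: str
    author: str
    keywords: frozenset


class Gatherer:
    """Toy stand-in: reduces resources at a Provider site to summaries."""

    def __init__(self, resources):
        # resources: (url, title, author, text) tuples standing in for
        # documents that would be read from FTP or HTTP servers.
        self.resources = list(resources)

    def export(self):
        # Export only the indexing information, not the full resource.
        return [
            Summary(url, title, author, frozenset(text.lower().split()))
            for url, title, author, text in self.resources
        ]


class Broker:
    """Toy stand-in: collects summaries, drops duplicates, answers queries."""

    def __init__(self):
        self.by_url = {}  # url -> Summary, used for duplicate suppression
        self.index = {}   # keyword -> set of urls (inverted index)

    def collect(self, gatherer):
        added = 0
        for summary in gatherer.export():
            if summary.url in self.by_url:
                continue  # assumed duplicate rule: same URL seen before
            self.by_url[summary.url] = summary
            # Incremental indexing: only the new summary touches the index.
            for word in summary.keywords:
                self.index.setdefault(word, set()).add(summary.url)
            added += 1
        return added

    def query(self, word):
        return sorted(self.index.get(word.lower(), set()))


# Two Gatherers whose collections overlap on one resource.
g1 = Gatherer([("ftp://a.example/readme", "Readme", "Ann", "harvest gatherer notes")])
g2 = Gatherer([
    ("ftp://a.example/readme", "Readme", "Ann", "harvest gatherer notes"),
    ("http://b.example/paper", "Paper", "Bob", "harvest broker design"),
])

broker = Broker()
broker.collect(g1)
broker.collect(g2)  # the overlapping record is suppressed
print(broker.query("harvest"))
```

The point of the sketch is the division of labor: the Gatherer hands over compact summaries rather than whole resources, and the Broker can merge several Gatherers' output without indexing the same resource twice.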

The Harvest software distribution provides a large amount of functionality in approximately 160,000 lines of code. You don't need to install all of the software to have a useful system. Three common configurations are:

Running a Gatherer plus a Broker
will provide a structured content index of a set of information resources, accessible through the World Wide Web [3].

Running a Gatherer alone
will export content-based indexing information about a set of information resources that you gather. Other sites can then build indexes of those resources at much lower server and network cost than if each remote site built its index by retrieving every resource through conventional protocols such as FTP, Gopher, HTTP, and News.

 

Running a Broker alone
will provide a customizable index of information gathered by other sites. Other Harvest servers that provide indexing information can be found by searching the Harvest Server Registry.

We recommend that you start by running a Gatherer plus a Broker, which is the standard setup created by the binary software distribution. If your Broker becomes so popular that it creates bottlenecks, you can run a Replicator (see Section 7). You may also want to run an object cache (see Section 6), to reduce network traffic for popular data. Finally, you can distribute the gathering and brokering processes to optimize CPU and network use. We discuss this in the next subsection.








Darren Hardy
Mon Apr 3 15:22:37 MDT 1995