- Symptom
-
The Gatherer doesn't pick up all the objects pointed to by some of my
RootNodes.
- Solution
-
The Gatherer places various limits on enumeration to prevent a
misconfigured Gatherer from abusing servers or running wildly. See
Section 4.3 for details on how to override
these limits.
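For example, here is a sketch of raising the limits with enumeration
modifiers on a RootNode in the Gatherer configuration file (the
URL=/Depth= keywords and the values shown are assumptions; Section 4.3
documents the actual syntax):
<RootNodes>
# Allow up to 5000 URLs, enumerated at most 4 levels deep (values illustrative).
http://www.example.com/ URL=5000,Depth=4
</RootNodes>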
- Symptom
-
Local-Mapping did not work for me; it retrieved the objects via
the usual remote access protocols.
- Solution
-
A local mapping will fail if:
-
the local filename cannot be opened for reading;
-
the local filename is not a regular file; or
-
the local filename has execute bits set.
So for directories, symlinks, and CGI scripts, the HTTP server is
always contacted. We do not do URL translation for local mappings, so if
your URLs contain special characters that must be escaped, the local
mapping will also fail.
If you are using the Harvest source distribution, you can turn on debugging
in src/common/url.c and see how the local filenames are constructed.
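As a quick check, you can test the three conditions above from the shell
(the path is illustrative):
#!/bin/sh
# Sketch: check the local-mapping conditions for one file.
f=/home/www/docs/index.html
test -h "$f" && echo "$f: is a symbolic link"
test -f "$f" || echo "$f: not a regular file"
test -r "$f" || echo "$f: not readable"
test -x "$f" && echo "$f: execute bits set (CGI script?)"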
- Symptom
-
Using the --full-text option, I see a lot of raw data in the content
summaries, with few keywords I can search.
- Solution
-
At present, --full-text simply includes the full data content in the SOIF
summaries. The individual file type summarizing mechanism described
in Section 4.4.4 works better in this regard, but
requires you to specify how data are extracted for each individual
file type. In a future version of Harvest, we will change the Essence
--full-text option to perform content extraction before including the
full text of documents.
- Symptom
-
The ``Last-Modification-Time'' of the gathered data is always 0.
- Solution
-
At present we do not fill in this field for HTTP documents; we use
MD5 [21] checksums instead. In a future version of Harvest, we
will set this field according to HTTP's MIME response header.
- Symptom
-
Gathered data are not being updated.
- Solution
-
The Gatherer does not automatically do periodic updates. See
Section 4.8 for details.
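For example, a minimal sketch of nightly updates using cron, assuming
your Gatherer directory contains the RunGatherer script created at setup
time (the path and schedule are illustrative):
# crontab entry: re-run the Gatherer every night at 2 AM
0 2 * * * cd /usr/local/harvest/gatherers/example && ./RunGatherer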
- Symptom
-
When I run my Gatherer after changing one of the files it recently
gathered, it does not retrieve the changed file.
- Solution
-
The Gatherer maintains a local disk cache to reduce network load when
restarting after a machine crash. You can force a reload by running
the urlpurge program before running the Gatherer again (see
Appendix A).
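For example (we assume urlpurge takes no arguments and is run from the
Gatherer's directory, and that a RunGatherer script exists; see
Appendix A for the actual invocation):
% cd /usr/local/harvest/gatherers/example
% urlpurge
% ./RunGatherer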
- Symptom
-
The Gatherer puts slightly different URLs in the SOIF summaries than
I specified in the Gatherer configuration file.
- Solution
-
This happens because the Gatherer attempts to put URLs into a canonical
format. It does this by removing default port numbers, stripping HTTP
``#'' bookmarks, and making similar cosmetic changes. Also, by default,
Essence (the content extraction subsystem within the Gatherer) removes the
standard stoplist.cf types, which include HTTP-Query (the cgi-bin stuff).
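For example (the URL is illustrative), canonicalization would rewrite:
http://example.com:80/docs/index.html#intro --> http://example.com/docs/index.html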
- Symptom
-
There are no Last-Modification-Time or MD5 attributes in my
gathered SOIF data, so the Broker can't do duplicate elimination.
- Solution
-
If you gather remote, manually-created information (as in our
PC Software Broker), it is
pulled into Harvest using ``exploders'' that translate from the remote
format into SOIF. That means they don't have a direct way to fill in
the Last-Modification-Time or MD5 information per record. Note also
that this means one update to the remote records causes all
records to look updated, which results in more network load for
Brokers that collect this Gatherer's data. As a solution, you can
compute MD5s for all objects and store them as part of the record.
Then, when you run the exploder, you generate timestamps only for the
records whose MD5s changed, giving you real last-modification
times.
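A minimal sketch of this idea in shell (the file layout is illustrative,
and we assume GNU md5sum and date; a real exploder would work per SOIF
record):
#!/bin/sh
# Re-generate a timestamp for a record only when its MD5 changes.
for f in records/*.soif; do
    new=`md5sum "$f" | awk '{print $1}'`
    old=`cat "$f.md5" 2>/dev/null`
    if test "$new" != "$old"; then
        echo "$new" > "$f.md5"    # store the new MD5 as part of the record
        date +%s > "$f.time"      # a real last-modification time
    fi
done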
- Symptom
-
When I search using keywords I know are in a document I have indexed
with Harvest, the document isn't found.
- Solution
-
Harvest uses a content extraction subsystem called Essence
that by default does not extract every keyword in a document. Instead,
it uses heuristics to try to select promising keywords. You can change
what keywords are selected by customizing the summarizers for that type
of data, as discussed in Section 4.4.4. Or, you can
tell Essence to use full text summarizing if you feel the added
disk space costs are merited, as discussed in
Section 4.5.
- Symptom
-
I'm running Harvest on HP-UX, but the essence process in the
Gatherer takes too much memory.
- Solution
-
The regular expression library bundled with Harvest has memory leaks on
HP-UX, so you need to use the regular expression library supplied with HP-UX.
Change the Makefile in src/gatherer/essence to read:
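# Use the POSIX regex library shipped with HP-UX instead of the bundled one.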
REGEX_DEFINE = -DUSE_POSIX_REGEX
REGEX_INCLUDE =
REGEX_OBJ =
REGEX_TYPE = posix
- Symptom
-
I built the configuration files to customize how Essence types data and
extracts content, but it uses the standard typing and extracting mechanisms
anyway.
- Solution
-
Verify that you have the Lib-Directory set to the lib/ directory
in which you put your configuration files. Lib-Directory is defined in your
Gatherer configuration file.
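For example (the path is illustrative):
Lib-Directory: /usr/local/harvest/gatherers/example/lib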
- Symptom
-
Essence dumps core when run (from the Gatherer).
- Solution
-
Check whether you're running a non-stock version of the Domain Name
System (DNS) under SunOS. There is a version that fixes some security
holes, but is not compatible with the version of the DNS resolver
library with which we link essence for the binary Harvest
distribution. If this is indeed the problem, you can either run the
binary Harvest distribution on a stock SunOS machine, or rebuild
Harvest from source (more specifically, rebuild essence, linking with
the non-stock DNS resolver library).
- Symptom
-
I am having problems resolving host names on SunOS.
- Solution
-
In order to gather data from hosts outside of your organization, your system
must be able to resolve fully qualified domain names into IP addresses.
If your system cannot resolve hostnames, you will see error messages such
as ``Unknown Host.'' In this case, either:
-
the hostname you gave does not really exist; or
-
your system is not configured to use the DNS.
To verify that your system is configured for DNS, make sure that the
file /etc/resolv.conf exists and is readable. Read the
resolv.conf(5) manual page for information on this file. You can verify
that DNS is working with the nslookup command.
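For example (the domain and address are illustrative):
% cat /etc/resolv.conf
domain cs.example.edu
nameserver 192.0.2.1
% nslookup www.cs.example.edu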
The Harvest executables for SunOS (4.1.3_U1) are statically linked with
the stock resolver library from /usr/lib/libresolv.a. If you seem
to have problems with the statically linked executables, please try to
compile Harvest from the source code (see Section 3).
This will make use of your local libraries, which may have been modified
for your particular organization.
Some sites may use Sun Microsystems' Network Information Service (NIS)
instead of, or in addition to, DNS. We believe that Harvest works on
systems where NIS has been properly configured. The NIS servers (the
names of which you can determine from the ypwhich command) must be
configured to query DNS servers for hostnames they do not know about.
See the -b option of the ypxfr command.
We would welcome reports of Harvest successfully working with NIS.
Please email us at harvest-dvl@cs.colorado.edu.
- Symptom
-
I cannot get the Gatherer to work across our firewall gateway.
- Solution
-
Harvest currently will not operate across a strict Internet firewall.
The Gatherer, Broker, and Replicator cannot (yet) request objects
through a proxy server. You can either run these Harvest components
internally (behind the firewall) or else on the firewall host itself.
If you see the ``Host is unreachable'' message, these are the likely problems:
-
your connection to the Internet is temporarily down due to a circuit or
routing failure; or
-
you are behind a firewall.
If you see the ``Connection refused'' message, the likely problem is
that you are trying to connect with an unused port on the destination
machine. In other words, there is no program listening for connections
on that port.
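You can distinguish these cases with a simple connection test (the host
is illustrative):
% telnet www.example.com 80
If the connection is refused, no server is listening on port 80 of that
host; if the attempt reports ``Host is unreachable'' or times out,
suspect a routing failure or a firewall.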
The Harvest Gatherer is essentially a WWW client. You should expect it
to work the same as Mosaic, but without proxy support. We would be
interested to hear about cases where the Gatherer is unable to contact
a host, yet you are able to reach that host with other network programs
(Mosaic, telnet, ping) without going through a proxy.