
The SGML-based HTML summarizer

 

Starting with Version 1.2, Harvest summarizes HTML using the generic SGML summarizer described in Section 4.4.2. Below is the default SGML-to-SOIF table used by the HTML summarizer. The file is located at $HARVEST_HOME/lib/gatherer/sgmls-lib/HTML/HTML.sum.tbl. An individual Gatherer can customize its HTML summarizing by placing a modified copy of this file in the Gatherer lib directory.

HTML ELEMENT   SOIF ATTRIBUTES
------------   -----------------------
    <A>             keywords,parent
    <A:HREF>        url-references
    <ADDRESS>       address
    <B>             keywords,parent
    <BODY>          body
    <CITE>          references
    <CODE>          ignore
    <EM>            keywords,parent
    <H1>            headings
    <H2>            headings
    <H3>            headings
    <H4>            headings
    <H5>            headings
    <H6>            headings
    <HEAD>          head
    <I>             keywords,parent
    <IMG:SRC>       images
    <META:CONTENT>  $NAME
    <STRONG>        keywords,parent
    <TITLE>         title
    <TT>            keywords,parent
    <UL>            keywords,parent
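
For readers who want to inspect or modify the table programmatically, the two-column layout shown above can be parsed with a few lines of Python. This is only a sketch: it assumes the file on disk uses the same whitespace-separated layout as this listing, which the text does not confirm, and `parse_table` is a hypothetical helper, not part of Harvest.

```python
def parse_table(text):
    """Map each table entry such as 'A:HREF' to its list of SOIF targets."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the column header, and the dashed rule.
        if len(fields) != 2 or not fields[0].startswith("<"):
            continue
        element, targets = fields
        table[element.strip("<>")] = targets.split(",")
    return table


SAMPLE = """\
HTML ELEMENT   SOIF ATTRIBUTES
------------   -----------------------
    <A>             keywords,parent
    <A:HREF>        url-references
    <CODE>          ignore
    <META:CONTENT>  $NAME
"""

table = parse_table(SAMPLE)
```

Here `table["A"]` is `["keywords", "parent"]` and `table["A:HREF"]` is `["url-references"]`, which makes the difference between element entries and element:attribute entries explicit.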

In HTML, the document title is written as:

    <TITLE>My Home Page</TITLE>

The above translation table will place this in the SOIF summary as:

    title{13}:  My Home Page

Note that ``keywords,parent'' occurs frequently in the table. For any specially marked text (bold, emphasized, hypertext links, etc.), the words are copied into the keywords attribute and also left in the content of the parent element. Because no words are removed from the parent, the body of the text stays readable.

Any text that appears inside a pair of CODE tags will not show up in the summary because we specified ``ignore'' as the SOIF attribute.
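
The two behaviors just described can be sketched in Python. This is a minimal illustration, not the Harvest implementation: the `TABLE` subset, the `KeywordSketch` class, and the `summarize` helper are hypothetical names, and treating ``ignore'' as hiding everything nested inside the element is an assumption.

```python
from html.parser import HTMLParser

# Hypothetical subset of the translation table, keyed by lowercase tag name.
TABLE = {
    "b": "keywords,parent",
    "em": "keywords,parent",
    "code": "ignore",
    "body": "body",
}


class KeywordSketch(HTMLParser):
    """Illustrative only; not the Harvest SGML summarizer."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # (tag, table entry) for each open element
        self.fragments = {}      # SOIF attribute -> list of text fragments

    def handle_starttag(self, tag, attrs):
        if tag in TABLE:
            self.open_elements.append((tag, TABLE[tag]))

    def handle_endtag(self, tag):
        if self.open_elements and self.open_elements[-1][0] == tag:
            self.open_elements.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        targets = [target for _, target in self.open_elements]
        if "ignore" in targets:
            return  # assumption: ignore hides all text nested inside
        for target in reversed(targets):  # innermost element first
            if target == "keywords,parent":
                self.fragments.setdefault("keywords", []).append(text)
                continue  # the text also stays in the parent element
            self.fragments.setdefault(target, []).append(text)
            break


def summarize(html):
    parser = KeywordSketch()
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.fragments.items()}
```

For the input `<BODY>Harvest is a <B>distributed</B> indexing system. <CODE>ls -l</CODE></BODY>`, `summarize` returns a `keywords` value of `distributed` and a `body` value of `Harvest is a distributed indexing system.`; the text inside CODE appears in neither.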

URLs in HTML anchors are written as:

    <A HREF="http://harvest.cs.colorado.edu/">

The specification for <A:HREF> in the above translation table causes this to appear as:

    url-references{32}: http://harvest.cs.colorado.edu/

One of the most useful HTML tags is META. This allows the document writer to include arbitrary metadata in an HTML document. A typical use of the META element is:

    <META NAME="author" CONTENT="Joe T. Slacker">

By specifying ``<META:CONTENT> $NAME'' in the translation table, this comes out as:

    author{15}: Joe T. Slacker
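
The element:attribute entries (<A:HREF>, <IMG:SRC>, <META:CONTENT>) differ from the others in that the value comes from a tag attribute rather than from the element's text. The Python sketch below illustrates that lookup, including the $NAME substitution. It is an assumption-laden illustration rather than Harvest's code: `ATTR_TABLE`, `AttributeSketch`, and `extract` are hypothetical names, and the sketch returns plain (attribute, value) pairs without reproducing the SOIF size field.

```python
from html.parser import HTMLParser

# Hypothetical subset of the table: (element, attribute) -> SOIF attribute.
ATTR_TABLE = {
    ("a", "href"): "url-references",
    ("img", "src"): "images",
    ("meta", "content"): "$NAME",
}


class AttributeSketch(HTMLParser):
    """Illustrative only; not the Harvest SGML summarizer."""

    def __init__(self):
        super().__init__()
        self.pairs = []  # (SOIF attribute, value) in document order

    def handle_starttag(self, tag, attrs):
        given = dict(attrs)  # attribute names arrive lowercased
        for (element, attribute), target in ATTR_TABLE.items():
            value = given.get(attribute)
            if element != tag or value is None:
                continue
            if target == "$NAME":
                # $NAME is replaced by the value of the tag's NAME attribute.
                target = given.get("name")
                if not target:
                    continue
            self.pairs.append((target, value))


def extract(html):
    parser = AttributeSketch()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Calling `extract` on the META example above yields the single pair `("author", "Joe T. Slacker")`, and on the anchor example it yields `("url-references", "http://harvest.cs.colorado.edu/")`.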

HTML authors can easily add a list of keywords to their documents:

    <META NAME="keywords"  CONTENT="word1 word2">
    <META NAME="keywords"  CONTENT="word3 word4">






Darren Hardy
Mon Apr 3 15:22:37 MDT 1995