DLESE Tools
v1.6.0

org.dlese.dpc.util
Class HTMLParser

java.lang.Object
  extended by org.dlese.dpc.util.HTMLParser

public class HTMLParser
extends Object

The HTMLParser class provides methods for parsing an HTML document. These methods extract the text of the document as well as the contents of Meta tags, Header tags (h1 through h6), the Title tag, and all the links in the page.

The method descriptions in this API refer to the following example html document at http://www.abc.org, shown here as rendered. Its title is: ABC.ORG's MAIN PAGE

Welcome to ABC.ORG.

Hurricane season is here!

abc logo
Whether directly affected or not, students can benefit from the engaging learning experiences these dramatic events can provide. Keep abreast of the current storm at the Tropical Prediction Center, where you can view advisories, maps and forecast tracks.

Middle school students can learn about hurricane science and safety with the Hurricane Strike module, while more advanced students can utilize the multimedia technology of the online meteorology guide Hurricanes.

One of ABC's newest collections, the NASA Scientific Visualization Studio, offers data, images and animations from previous Atlantic storms.

Author:
Sonal Bhushan

Constructor Summary
HTMLParser(String resourcelocn)
          Constructor of an HTMLParser object
HTMLParser(String htmlcontent, String charset)
          Constructor of an HTMLParser object
 
Method Summary
 String[] getAllLinks()
          returns a String array of all the links in the html document.
 String getHeaderText()
          returns all the text in the html page which is contained within header tags (h1 through h6).
 String getImgAlts()
          returns a String containing all the text within the alt attribute of all the img tags in the html document
 String getLinkTitles()
          returns a String containing all the text within the title attribute of all the links in the html document
 String getMetaTagContentByName(String name)
          returns the content of the Meta tag whose name equals name.
 String getTitleText()
          returns the title of the HTML page, i.e. the text enclosed by the title tag.
 String getWholeText()
          returns the text of the whole html document, stripped of all the HTML tags.
 boolean hasMetaTagName(String name)
          returns true if the html document contains a Meta tag with a name equal to name, otherwise returns false.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HTMLParser

public HTMLParser(String resourcelocn)
           throws org.htmlparser.util.ParserException
Constructor of an HTMLParser object. For example:

    HTMLParser hp = new HTMLParser("http://www.dlese.org");
    HTMLParser hp2 = new HTMLParser("testthis.htm");

Parameters:
resourcelocn - either a URL or the name of an HTML file
Throws:
org.htmlparser.util.ParserException

HTMLParser

public HTMLParser(String htmlcontent,
                  String charset)
           throws org.htmlparser.util.ParserException
Constructor of an HTMLParser object

Parameters:
htmlcontent - String containing the HTML to be parsed
charset - name of the character encoding of htmlcontent; if null, the default encoding is used
Throws:
org.htmlparser.util.ParserException
Method Detail

getHeaderText

public String getHeaderText()
                     throws org.htmlparser.util.ParserException
returns all the text in the html page which is contained within header tags (h1 through h6). If none of these tags are present in the page, it returns an empty string. For example:

    HTMLParser hp = new HTMLParser("http://www.abc.org");
    System.out.println(hp.getHeaderText());

This prints out the following:

    Welcome to ABC.ORG Hurricane season is here!

Returns:
text in the header tags in the html document
Throws:
org.htmlparser.util.ParserException
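
To make the behaviour described above concrete, here is a minimal standalone sketch of header-text extraction. It is not the DLESE implementation (which relies on the org.htmlparser library); it uses only java.util.regex, and the class name HeaderTextSketch, the method name headerText, and the sample markup are hypothetical. Joining the header texts with a single space is an assumption based on the example output above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in only; NOT org.dlese.dpc.util.HTMLParser.
class HeaderTextSketch {
    // Matches <h1>...</h1> through <h6>...</h6>. The backreference \1 makes
    // the closing tag use the same level as the opening tag.
    private static final Pattern HEADER = Pattern.compile(
        "<h([1-6])\\b[^>]*>(.*?)</h\\1\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text of all header tags joined by single spaces,
    // or an empty string if the document has no header tags.
    static String headerText(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = HEADER.matcher(html);
        while (m.find()) {
            // Drop nested tags (e.g. <b>) and collapse runs of whitespace.
            String text = m.group(2)
                .replaceAll("<[^>]*>", "")
                .replaceAll("\\s+", " ")
                .trim();
            if (text.isEmpty()) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical markup loosely modelled on the example page.
        String html = "<html><body><h1>Welcome to ABC.ORG</h1>"
            + "<h2>Hurricane season is here!</h2><p>Body text</p></body></html>";
        System.out.println(headerText(html));
    }
}
```

A regular expression is enough for a sketch like this but does not cope with malformed markup, which is one reason to prefer a real parser in practice.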

getTitleText

public String getTitleText()
                    throws org.htmlparser.util.ParserException
returns the title of the HTML page, i.e. the text enclosed by the title tag. If this tag is not present in the page, it returns an empty string. For example:

    HTMLParser hp = new HTMLParser("http://www.abc.org");
    System.out.println(hp.getTitleText());

This prints out the following:

    ABC.ORG's MAIN PAGE

Returns:
text in the title tag(s) in the html doc.
Throws:
org.htmlparser.util.ParserException

hasMetaTagName

public boolean hasMetaTagName(String name)
                       throws org.htmlparser.util.ParserException
returns true if the html document contains a Meta tag with a name equal to name, otherwise returns false. For example:

    HTMLParser hp = new HTMLParser("http://www.abc.org");
    boolean containskeywords = hp.hasMetaTagName("keywords");
    boolean containsxyz = hp.hasMetaTagName("xyz");

In this code, containskeywords will be true, and containsxyz will be false.

Parameters:
name - name of the Meta Tag
Returns:
true if a Meta tag with this name is present, false otherwise
Throws:
org.htmlparser.util.ParserException

getMetaTagContentByName

public String getMetaTagContentByName(String name)
                               throws org.htmlparser.util.ParserException
returns the content of the Meta tag whose name equals name. If such a tag does not exist, returns an empty string. For example:

    HTMLParser hp = new HTMLParser("http://www.abc.org");
    if (hp.hasMetaTagName("organization")) {
        System.out.println(hp.getMetaTagContentByName("organization"));
    }

This prints out the following:

    ABC Program Center

Parameters:
name - name of the Meta Tag
Returns:
the content of this Meta tag, or an empty string if no such tag exists
Throws:
org.htmlparser.util.ParserException
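
The contract shared by hasMetaTagName and getMetaTagContentByName (a missing tag yields false and an empty string respectively) can be illustrated with the standalone sketch below. It is not the DLESE implementation; the class MetaTagSketch, its methods, and the sample markup are hypothetical. It assumes double-quoted attribute values and compares tag names exactly, since the documentation does not say whether the real class ignores case.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in only; NOT org.dlese.dpc.util.HTMLParser.
class MetaTagSketch {
    // Captures the attribute section of each <meta ...> tag.
    private static final Pattern META =
        Pattern.compile("<meta\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    // Captures attr="value" pairs (double-quoted values only: an assumption).
    private static final Pattern ATTR =
        Pattern.compile("([a-zA-Z-]+)\\s*=\\s*\"([^\"]*)\"");

    // Maps each meta tag name to its content; the first occurrence wins.
    static Map<String, String> metaTags(String html) {
        Map<String, String> tags = new LinkedHashMap<>();
        Matcher meta = META.matcher(html);
        while (meta.find()) {
            String name = null;
            String content = null;
            Matcher attr = ATTR.matcher(meta.group(1));
            while (attr.find()) {
                if (attr.group(1).equalsIgnoreCase("name")) {
                    name = attr.group(2);
                } else if (attr.group(1).equalsIgnoreCase("content")) {
                    content = attr.group(2);
                }
            }
            if (name != null) {
                tags.putIfAbsent(name, content == null ? "" : content);
            }
        }
        return tags;
    }

    static boolean hasMetaTagName(String html, String name) {
        return metaTags(html).containsKey(name);
    }

    // Returns an empty string when no meta tag has the given name.
    static String metaContent(String html, String name) {
        return metaTags(html).getOrDefault(name, "");
    }

    public static void main(String[] args) {
        // Hypothetical markup; the values echo the examples above.
        String html = "<head><meta name=\"keywords\" content=\"hurricanes, storms\">"
            + "<meta name=\"organization\" content=\"ABC Program Center\"></head>";
        System.out.println(hasMetaTagName(html, "keywords"));
        System.out.println(metaContent(html, "organization"));
        System.out.println(metaContent(html, "xyz").isEmpty());
    }
}
```

Because a missing tag and a tag with empty content both produce an empty string, checking hasMetaTagName first, as the example above does, is the only way to tell the two cases apart.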

getAllLinks

public String[] getAllLinks()
                     throws org.htmlparser.util.ParserException
returns a String array of all the links in the html document.

Returns:
a string array of all the links
Throws:
org.htmlparser.util.ParserException
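
No example is given for getAllLinks, so the following standalone sketch shows one plausible reading of "all the links": the href values of the anchor tags, in document order. This is an assumption, not the DLESE implementation. In particular, the documentation does not say whether relative links are resolved against the page URL, so the sketch returns them unchanged. The class AllLinksSketch and the sample URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in only; NOT org.dlese.dpc.util.HTMLParser.
class AllLinksSketch {
    // Captures the double-quoted href value of each <a ...> tag.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);

    // Returns the href values in document order; empty array if none.
    static String[] allLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Hypothetical markup and URLs.
        String html = "<p><a href=\"http://www.abc.org/tpc.html\">Tropical Prediction Center</a> "
            + "<a title=\"guide\" href=\"/guides/hurricanes.html\">Hurricanes</a></p>";
        for (String link : allLinks(html)) {
            System.out.println(link);
        }
    }
}
```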

getLinkTitles

public String getLinkTitles()
                     throws org.htmlparser.util.ParserException
returns a String containing all the text within the title attribute of all the links in the html document

Returns:
all the text within the title attribute of all the links in the doc.
Throws:
org.htmlparser.util.ParserException

getImgAlts

public String getImgAlts()
                  throws org.htmlparser.util.ParserException
returns a String containing all the text within the alt attribute of all the img tags in the html document

Returns:
all the text within the alt attribute of all the img tags in the html doc
Throws:
org.htmlparser.util.ParserException

getWholeText

public String getWholeText()
                    throws org.htmlparser.util.ParserException
returns the text of the whole html document, stripped of all the HTML tags. This text also includes the text within the alt attribute of all the img tags, as well as the text within the title attribute of all the link tags.

Returns:
the text of the whole html document, stripped of all HTML tags
Throws:
org.htmlparser.util.ParserException
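
The description of getWholeText says that img alt text and link title text are included alongside the visible text. The standalone sketch below illustrates that idea with java.util.regex; it is not the DLESE implementation. Where the alt and title text end up relative to the surrounding text is an assumption (here they replace their tags in place), and the class WholeTextSketch and the sample markup are hypothetical. Entities are not decoded, and script or style content is not removed.

```java
import java.util.regex.Pattern;

// Illustrative stand-in only; NOT org.dlese.dpc.util.HTMLParser.
class WholeTextSketch {
    // An <img> tag with a double-quoted alt attribute; group 1 is the alt text.
    private static final Pattern IMG_ALT = Pattern.compile(
        "<img\\b[^>]*?\\balt\\s*=\\s*\"([^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);
    // An <a> tag with a double-quoted title attribute; group 1 is the title.
    private static final Pattern LINK_TITLE = Pattern.compile(
        "<a\\b[^>]*?\\btitle\\s*=\\s*\"([^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);

    static String wholeText(String html) {
        // Swap each matching tag for its attribute text so the text
        // survives the tag stripping that follows.
        String s = IMG_ALT.matcher(html).replaceAll(" $1 ");
        s = LINK_TITLE.matcher(s).replaceAll(" $1 ");
        // Replace every remaining tag with a space, then tidy whitespace.
        s = s.replaceAll("<[^>]*>", " ");
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Hypothetical markup loosely modelled on the example page.
        String html = "<p>Visit the <a href=\"http://www.abc.org/tpc\" title=\"TPC home\">"
            + "Tropical Prediction Center</a></p><img src=\"logo.gif\" alt=\"abc logo\">";
        System.out.println(wholeText(html));
    }
}
```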
