DLESE Tools
v1.6.0

org.dlese.dpc.util
Class HTMLParser

java.lang.Object
  org.dlese.dpc.util.HTMLParser

public class HTMLParser
extends Object

The HTMLParser class contains methods for parsing an HTML document. These methods extract the text of the document, as well as the contents of Meta tags, Header (h1, h2, h3, ... h6) tags, the Title tag, all the links in the page, etc.

The method descriptions in this API refer to the following example html document at http://www.abc.org:

ABC.ORG's MAIN PAGE

Welcome to ABC.ORG.

Hurricane season is here!

Whether directly affected or not, students can benefit from the engaging learning experiences these dramatic events can provide. Keep abreast of the current storm at the Tropical Prediction Center, where you can view advisories, maps and forecast tracks.

Middle school students can learn about hurricane science and safety with the Hurricane Strike module, while more advanced students can utilize the multimedia technology of the online meteorology guide Hurricanes.

One of ABC's newest collections, the NASA Scientific Visualization Studio, offers data, images and animations from previous Atlantic storms.

Author:: Sonal Bhushan

Constructor Summary
`HTMLParser(String resourcelocn)` Constructs an HTMLParser object from a URL or the name of an HTML file
`HTMLParser(String htmlcontent, String charset)` Constructs an HTMLParser object from a String containing the HTML to be parsed

Method Summary
`String[]`	`getAllLinks()` returns a String array of all the links in the html document.
`String`	`getHeaderText()` returns all the text in the html page which is contained within header tags (h1 through h6).
`String`	`getImgAlts()` returns a String containing all the text within the alt attribute of all the img tags in the html document
`String`	`getLinkTitles()` returns a String containing all the text within the title attribute of all the links in the html document
`String`	`getMetaTagContentByName(String name)` returns the content of the Meta tag whose name equals name.
`String`	`getTitleText()` returns the title of the HTML page, i.e. the text enclosed by the title tag.
`String`	`getWholeText()` returns the text of the whole html document, stripped of all the HTML tags.
`boolean`	`hasMetaTagName(String name)` returns true if the html document contains a Meta tag with a name equal to name, otherwise returns false.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

HTMLParser

public HTMLParser(String resourcelocn)
           throws org.htmlparser.util.ParserException

Constructs an HTMLParser object from a URL or an HTML file. e.g.: HTMLParser hp = new HTMLParser("http://www.dlese.org"); HTMLParser hp2 = new HTMLParser("testthis.htm");

Parameters:: resourcelocn - either a URL or the name of an HTML file
Throws:: org.htmlparser.util.ParserException

HTMLParser

public HTMLParser(String htmlcontent,
                  String charset)
           throws org.htmlparser.util.ParserException

Constructs an HTMLParser object from a String containing the HTML to be parsed.

Parameters:: htmlcontent - String containing the HTML to be parsed; charset - if null, the default encoding is used
Throws:: org.htmlparser.util.ParserException

Method Detail

getHeaderText

public String getHeaderText()
                     throws org.htmlparser.util.ParserException

returns all the text in the html page which is contained within header tags (h1 through h6). If none of these tags are present in the page, it returns an empty string. e.g.: HTMLParser hp = new HTMLParser("http://www.abc.org"); System.out.println(hp.getHeaderText()); This prints out the following: Welcome to ABC.ORG Hurricane season is here!

Returns:: text in the header tags in the html document
Throws:: org.htmlparser.util.ParserException
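
The behaviour described for getHeaderText can be illustrated with a small standalone sketch. It is not the DLESE implementation: it uses only java.util.regex, the class name HeaderTextSketch is hypothetical, and the markup of the example page is assumed, since only the rendered text of that page is given here. Joining the header texts with a single space is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of org.dlese.dpc.util.
final class HeaderTextSketch {
    // Matches <h1>...</h1> through <h6>...</h6>; the backreference \1
    // requires the closing tag to have the same level as the opening tag.
    private static final Pattern HEADER = Pattern.compile(
        "<h([1-6])\\b[^>]*>(.*?)</h\\1\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text inside all header tags, or "" if there are none.
    static String headerText(String html) {
        List<String> parts = new ArrayList<>();
        Matcher m = HEADER.matcher(html);
        while (m.find()) {
            // Drop tags nested inside the header and collapse whitespace.
            String text = m.group(2)
                .replaceAll("<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim();
            if (!text.isEmpty()) {
                parts.add(text);
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        // Assumed markup for the example page.
        String html = "<html><body><h1>Welcome to ABC.ORG.</h1>"
            + "<h2>Hurricane season is here!</h2><p>Body text</p></body></html>";
        System.out.println(headerText(html));
    }
}
```

A regular expression is enough for a sketch like this, but it does not handle malformed or deeply nested markup the way a real HTML parser does.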

getTitleText

public String getTitleText()
                    throws org.htmlparser.util.ParserException

returns the title of the HTML page, i.e. the text enclosed by the title tag. If this tag is not present in the page, it returns an empty string. e.g.: HTMLParser hp = new HTMLParser("http://www.abc.org"); System.out.println(hp.getTitleText()); This prints out the following: ABC.ORG's MAIN PAGE

Returns:: text in the title tag(s) in the html doc.
Throws:: org.htmlparser.util.ParserException
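
As a rough illustration of the contract above (title text if present, empty string otherwise), here is a standalone sketch. The class name TitleTextSketch is hypothetical, the example markup is assumed, and the regular expression stands in for whatever parsing the real class performs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of org.dlese.dpc.util.
final class TitleTextSketch {
    private static final Pattern TITLE = Pattern.compile(
        "<title\\b[^>]*>(.*?)</title\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text of the first title tag, or "" if there is none.
    static String titleText(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).replaceAll("\\s+", " ").trim() : "";
    }

    public static void main(String[] args) {
        // Assumed markup for the example page.
        String html = "<html><head><title>ABC.ORG's MAIN PAGE</title></head></html>";
        System.out.println(titleText(html));
    }
}
```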

hasMetaTagName

public boolean hasMetaTagName(String name)
                       throws org.htmlparser.util.ParserException

returns true if the html document contains a Meta tag with a name equal to name, otherwise returns false. e.g.: HTMLParser hp = new HTMLParser("http://www.abc.org"); boolean containskeywords = hp.hasMetaTagName("keywords"); boolean containsxyz = hp.hasMetaTagName("xyz"); In this code, containskeywords will be true, and containsxyz will be false.

Parameters:: name - name of the Meta Tag
Returns:: true if a Meta tag with this name is present, false otherwise
Throws:: org.htmlparser.util.ParserException

getMetaTagContentByName

public String getMetaTagContentByName(String name)
                               throws org.htmlparser.util.ParserException

returns the content of the Meta tag whose name equals name. If such a tag does not exist, returns an empty string. e.g.: HTMLParser hp = new HTMLParser("http://www.abc.org"); if (hp.hasMetaTagName("organization")) { System.out.println(hp.getMetaTagContentByName("organization")); } This prints out the following: ABC Program Center

Parameters:: name - name of the Meta Tag
Returns:: the content of this Meta tag, or an empty string if no such tag exists
Throws:: org.htmlparser.util.ParserException
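
The relationship between hasMetaTagName and getMetaTagContentByName can be sketched as two views of a single lookup. The code below is a standalone approximation, not the DLESE source: the class MetaTagSketch and its helper find are hypothetical, the meta tags in the sample are assumed, and comparing names case-insensitively is an assumption that this page does not confirm.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of org.dlese.dpc.util.
final class MetaTagSketch {
    private static final Pattern META =
        Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Double-quoted attributes only; enough for a sketch.
    private static final Pattern ATTR =
        Pattern.compile("([a-zA-Z-]+)\\s*=\\s*\"([^\"]*)\"");

    // Looks up the content of the first meta tag with the given name.
    private static Optional<String> find(String html, String name) {
        Matcher tag = META.matcher(html);
        while (tag.find()) {
            String tagName = null;
            String content = "";
            Matcher attr = ATTR.matcher(tag.group());
            while (attr.find()) {
                if (attr.group(1).equalsIgnoreCase("name")) {
                    tagName = attr.group(2);
                } else if (attr.group(1).equalsIgnoreCase("content")) {
                    content = attr.group(2);
                }
            }
            // Assumption: names are compared without regard to case.
            if (name.equalsIgnoreCase(tagName)) {
                return Optional.of(content);
            }
        }
        return Optional.empty();
    }

    static boolean hasMetaTagName(String html, String name) {
        return find(html, name).isPresent();
    }

    // Returns "" when no such meta tag exists, as the text above describes.
    static String getMetaTagContentByName(String html, String name) {
        return find(html, name).orElse("");
    }

    public static void main(String[] args) {
        // Assumed markup for the example page.
        String html = "<head><meta name=\"keywords\" content=\"hurricanes\">"
            + "<meta name=\"organization\" content=\"ABC Program Center\"></head>";
        System.out.println(hasMetaTagName(html, "xyz"));
        System.out.println(getMetaTagContentByName(html, "organization"));
    }
}
```

Because a missing tag yields an empty string, calling hasMetaTagName first is only needed when the caller must distinguish a missing tag from a tag with empty content.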

getAllLinks

public String[] getAllLinks()
                     throws org.htmlparser.util.ParserException

returns a String array of all the links in the html document.

Returns:: a string array of all the links
Throws:: org.htmlparser.util.ParserException
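
For a sense of what a String array of links looks like, the following standalone sketch collects the href value of every anchor tag. It is an approximation under stated assumptions: the class AllLinksSketch is hypothetical, only double-quoted href attributes are recognised, and links are returned exactly as written in the markup. Whether the real method resolves relative links against the page URL is not stated on this page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of org.dlese.dpc.util.
final class AllLinksSketch {
    // "<a\b" keeps tags such as <abbr> or <area> from matching.
    private static final Pattern HREF = Pattern.compile(
        "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);

    // Returns the href values in document order; empty array if none.
    static String[] allLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Assumed markup; the link targets are made up for the example.
        String html = "<a href=\"http://www.abc.org/a.html\">A</a> "
            + "<a title=\"t\" href=\"b.html\">B</a>";
        for (String link : allLinks(html)) {
            System.out.println(link);
        }
    }
}
```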

getLinkTitles

public String getLinkTitles()
                     throws org.htmlparser.util.ParserException

returns a String containing all the text within the title attribute of all the links in the html document

Returns:: all the text within the title attribute of all the links in the doc.
Throws:: org.htmlparser.util.ParserException

getImgAlts

public String getImgAlts()
                  throws org.htmlparser.util.ParserException

returns a String containing all the text within the alt attribute of all the img tags in the html document

Returns:: all the text within the alt attribute of all the img tags in the html doc
Throws:: org.htmlparser.util.ParserException

getWholeText

public String getWholeText()
                    throws org.htmlparser.util.ParserException

returns the text of the whole html document, stripped of all the HTML tags. This text also includes the text within the alt attribute of all the img tags, as well as the text within the title attribute of all the link tags.

Returns:: the text of the whole html document, with all HTML tags removed
Throws:: org.htmlparser.util.ParserException
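
The description of getWholeText combines three things: the visible text, the alt text of img tags, and the title text of link tags. A minimal standalone sketch of that combination is shown below. The class WholeTextSketch is hypothetical, the sample markup is assumed, and placing each alt or title text at the position of its tag is an assumption; the ordering used by the real method is not documented here.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of org.dlese.dpc.util.
final class WholeTextSketch {
    private static final Pattern IMG_ALT = Pattern.compile(
        "<img\\b[^>]*?\\balt\\s*=\\s*\"([^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);
    private static final Pattern A_TITLE = Pattern.compile(
        "<a\\b[^>]*?\\btitle\\s*=\\s*\"([^\"]*)\"[^>]*>",
        Pattern.CASE_INSENSITIVE);

    static String wholeText(String html) {
        // Replace each img tag with its alt text so it survives stripping.
        String s = IMG_ALT.matcher(html).replaceAll(" $1 ");
        // Replace each opening link tag with its title text.
        s = A_TITLE.matcher(s).replaceAll(" $1 ");
        // Remove every remaining tag, then collapse whitespace.
        s = s.replaceAll("<[^>]*>", " ");
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Assumed markup; attribute values are made up for the example.
        String html = "<p>View maps at the <a href=\"t.html\" title=\"TPC home\">"
            + "Tropical Prediction Center</a></p>"
            + "<img src=\"m.gif\" alt=\"Storm map\">";
        System.out.println(wholeText(html));
    }
}
```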
