Stemmer (DLESE Tools API Documentation v1.6.0)

org.dlese.dpc.index
Class Stemmer

java.lang.Object
  org.dlese.dpc.index.Stemmer

public class Stemmer
extends Object

Stemmer implements the Porter stemming algorithm, which reduces a word or an array of words to morphological root form. For example, the algorithm converts the words 'ocean', 'oceans' and 'oceanic' to the single root 'ocean'.

The static methods getStem(String term) and getStems(String[] terms) can be used to quickly convert a word or words to their root form. Example code:

    import org.dlese.dpc.index.Stemmer;
    ...
    String word = "oceanic";
    String stem = Stemmer.getStem(word); // stem now equals 'ocean'

    String string = "A group of words that need to be stemmed";
    String[] words = string.split("\\s+"); // Split on white space
    String[] stems = Stemmer.getStems(words);
    for (int i = 0; i < stems.length; i++) {
        ... do something with the stems ...
    }

For more information about the Porter stemming algorithm, see http://www.tartarus.org/~martin/PorterStemmer .

Author:: Martin Porter, Unknown contributors, John Weatherley

Constructor Summary
`Stemmer()` Constructor for the Stemmer object

Method Summary
`void`	`add(char ch)` Add a character to the word being stemmed.
`void`	`add(char[] w, int wLen)` Adds wLen characters to the word being stemmed contained in a portion of a char[] array.
`char[]`	`getResultBuffer()` Returns a reference to a character buffer containing the results of the stemming process.
`int`	`getResultLength()` Returns the length of the word resulting from the stemming process.
`static String`	`getStem(String term)` Gets the stem of the given English word.
`static String[]`	`getStems(String[] terms)` Gets the stems of the given English words.
`static void`	`main(String[] args)` Test program for demonstrating the Stemmer.
`void`	`stem()` Stem the word placed into the Stemmer buffer through calls to add().
`static String`	`stemWordsInLuceneClause(String string)` Stems each of the words in a given Lucene clause String, returning the same String with the word parts in stemmed form.
`static String`	`stemWordsInString(String string)` Stems each of the words or tokens in a given String, returning a String of stemmed tokens with all other characters removed.
`String`	`toString()` After a word has been stemmed, it can be retrieved by toString(), or a reference to the internal buffer can be retrieved by getResultBuffer() and getResultLength(), which is generally more efficient.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait`

Constructor Detail

Stemmer

public Stemmer()

Constructor for the Stemmer object

Method Detail

getStem

public static final String getStem(String term)

Gets the stem of the given English word. For proper results, the input should contain letters only [a-zA-Z].

Parameters:: term - A term in English.
Returns:: The stem of the English term.

getStems

public static final String[] getStems(String[] terms)

Gets the stems of the given English words. For proper results, the input should contain letters only [a-zA-Z].

Parameters:: terms - A group of terms in English.
Returns:: The stem of each term.
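
The letters-only requirement means that raw text usually needs cleaning before it is stemmed. The following is a minimal sketch of one way to do that; it is not part of the DLESE API. The class name StemInput and its methods are hypothetical, and the stemmer is passed in as a UnaryOperator<String> so that the snippet compiles without the DLESE jar. With the library on the classpath, Stemmer::getStem could presumably be passed instead. Lowercasing is included because the description of main() states that the stemmer expects lower-case input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

public class StemInput {

    /** Returns true if the term consists only of the letters a-z and A-Z. */
    public static boolean isLettersOnly(String term) {
        return term != null && term.matches("[a-zA-Z]+");
    }

    /**
     * Lowercases each letters-only term and passes it to the supplied stemmer.
     * Terms containing any other character are skipped.
     */
    public static String[] stemLettersOnly(String[] terms, UnaryOperator<String> stemmer) {
        List<String> stems = new ArrayList<>();
        for (String term : terms) {
            if (isLettersOnly(term)) {
                stems.add(stemmer.apply(term.toLowerCase(Locale.ENGLISH)));
            }
        }
        return stems.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Assumption: the identity function stands in for Stemmer::getStem.
        UnaryOperator<String> standIn = s -> s;
        String[] raw = "Oceans, rain and 44rains".split("\\s+");
        for (String stem : stemLettersOnly(raw, standIn)) {
            System.out.println(stem);
        }
    }
}
```

Skipping non-conforming terms is only one possible policy; stripping the offending characters or passing such terms through unchanged may suit other applications better.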

stemWordsInString

public static final String stemWordsInString(String string)

Stems each of the words or tokens in a given String, returning a String of stemmed tokens with all other characters removed. Token characters include letters and numbers [a-zA-Z0-9], representing the class of tokens that are searchable by Lucene. Note: the tokens "AND" and "OR" (upper case) are left unchanged.

Example:

oceans and rain AND 44rains http://dlese.org/oceans
is transformed to
ocean and rain AND 44rain http dlese org ocean

Parameters:: string - A word, phrase, or any arbitrary String.
Returns:: A String containing each letter/number token in stemmed form.
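
The tokenizing behavior described above can be approximated in a few lines. This is a sketch of the documented behavior under stated assumptions, not the library's implementation: the class TokenStemSketch and both of its methods are hypothetical, and toyStem only strips one trailing 's' so that the snippet runs without the DLESE jar. It is not the Porter algorithm, and how the real method treats tokens that contain digits is inferred only from the example above.

```java
import java.util.function.UnaryOperator;

public class TokenStemSketch {

    /**
     * Splits the input on every run of characters outside [a-zA-Z0-9], stems
     * each token with the supplied function, and joins the results with single
     * spaces. The upper-case tokens AND and OR are passed through unchanged.
     */
    public static String stemTokens(String input, UnaryOperator<String> stemmer) {
        StringBuilder out = new StringBuilder();
        for (String token : input.split("[^a-zA-Z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token when the input starts with a separator
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            boolean isOperator = token.equals("AND") || token.equals("OR");
            out.append(isOperator ? token : stemmer.apply(token));
        }
        return out.toString();
    }

    /** Toy stand-in for a real stemmer: drops a single trailing 's'. */
    public static String toyStem(String token) {
        return token.length() > 1 && token.endsWith("s")
                ? token.substring(0, token.length() - 1)
                : token;
    }

    public static void main(String[] args) {
        String input = "oceans and rain AND 44rains http://dlese.org/oceans";
        // prints: ocean and rain AND 44rain http dlese org ocean
        System.out.println(stemTokens(input, TokenStemSketch::toyStem));
    }
}
```

Passing the stemmer as a function keeps the tokenizing logic separate from the stemming logic, so the toy function can be swapped for a real stemmer without touching stemTokens.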

stemWordsInLuceneClause

public static final String stemWordsInLuceneClause(String string)

Stems each of the words in a given Lucene clause String, returning the same String with the word parts in stemmed form. The method leaves unchanged all words that have special meaning in a Lucene clause, such as 'AND' and 'OR', as well as field specifiers such as 'titles:'.

Example:

titles:("oceans AND oceans44 OR 44oceans and oceanic")^20 or cooled
is transformed to
titles:("ocean AND oceans44 OR 44ocean and ocean")^20 or cool

Parameters:: string - A word, phrase, Lucene clause, or any arbitrary String.
Returns:: Each non-clause word is stemmed in place, leaving non-word characters and clause words unchanged.

add

public void add(char ch)

Add a character to the word being stemmed. When you are finished adding characters, you can call stem() to stem the word.

Parameters:: ch - The character to add.

add

public void add(char[] w,
                int wLen)

Adds wLen characters to the word being stemmed contained in a portion of a char[] array. This is like repeated calls of add(char ch), but faster.

Parameters:: w - The char[] array containing the characters to add; wLen - The number of characters from w to add.

toString

public String toString()

After a word has been stemmed, it can be retrieved by toString(), or a reference to the internal buffer can be retrieved by getResultBuffer() and getResultLength(), which is generally more efficient.

Overrides:: toString in class Object

Returns:: The stemmed word as a String.

getResultLength

public int getResultLength()

Returns the length of the word resulting from the stemming process.

Returns:: The length of the stemmed word, in characters.

getResultBuffer

public char[] getResultBuffer()

Returns a reference to a character buffer containing the results of the stemming process. You also need to consult getResultLength() to determine the length of the result.

Returns:: The character buffer that holds the stemmed word.

stem

public void stem()

Stem the word placed into the Stemmer buffer through calls to add(). You can retrieve the result with getResultLength()/getResultBuffer() or toString().
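
The instance methods are meant to be called in sequence: add(), then stem(), then getResultBuffer() together with getResultLength(). The sketch below illustrates that sequence and why the length has to be consulted. Everything in it is hypothetical: WordStemmer is an interface that merely mirrors the signatures documented here, and ToyStemmer is a stand-in that drops one trailing 's' so that the example runs without the DLESE jar. It is not the Porter algorithm, and wrapping the real Stemmer in such an interface is an assumption about how one might use it.

```java
public class BufferApiSketch {

    /** Hypothetical interface mirroring the instance methods documented above. */
    public interface WordStemmer {
        void add(char ch);
        void stem();
        char[] getResultBuffer();
        int getResultLength();
    }

    /** Feeds a word to the stemmer and copies out only the valid part of the buffer. */
    public static String stemWord(WordStemmer stemmer, String word) {
        for (int i = 0; i < word.length(); i++) {
            stemmer.add(word.charAt(i));
        }
        stemmer.stem();
        // The buffer may hold stale characters past the result, so honour getResultLength().
        return new String(stemmer.getResultBuffer(), 0, stemmer.getResultLength());
    }

    /** Toy stand-in with a fixed 64-character buffer; handles short words only. */
    public static class ToyStemmer implements WordStemmer {
        private final char[] buffer = new char[64];
        private int length = 0;
        private int resultLength = 0;

        public void add(char ch) {
            buffer[length++] = ch;
        }

        public void stem() {
            boolean endsWithS = length > 1 && buffer[length - 1] == 's';
            resultLength = endsWithS ? length - 1 : length;
            length = 0; // ready to accept the next word
        }

        public char[] getResultBuffer() {
            return buffer;
        }

        public int getResultLength() {
            return resultLength;
        }
    }

    public static void main(String[] args) {
        WordStemmer stemmer = new ToyStemmer();
        System.out.println(stemWord(stemmer, "oceans")); // prints: ocean
        System.out.println(stemWord(stemmer, "rain"));   // prints: rain
    }
}
```

In the second call the toy buffer still contains leftover characters from "oceans" beyond index 3, which is why reading the whole array instead of the first getResultLength() characters would give a wrong result.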

main

public static void main(String[] args)

Test program for demonstrating the Stemmer. It reads text from a list of files, stems each word, and writes the result to standard output. Note that the word to be stemmed is expected to be in lower case: forcing lower case must be done outside the Stemmer class. Usage: Stemmer file-name file-name ...

Parameters:: args - The command line arguments