DLESE Tools
v1.6.0

org.dlese.dpc.index.writer
Class XMLFileIndexingWriter

java.lang.Object
  extended by org.dlese.dpc.index.writer.FileIndexingServiceWriter
      extended by org.dlese.dpc.index.writer.XMLFileIndexingWriter
All Implemented Interfaces:
DocWriter
Direct Known Subclasses:
DleseAnnoFileIndexingServiceWriter, DleseCollectionFileIndexingWriter, ItemFileIndexingWriter, NCSCollectionFileIndexingWriter, NewsOppsFileIndexingWriter, SimpleXMLFileIndexingWriter

public abstract class XMLFileIndexingWriter
extends FileIndexingServiceWriter

Creates a Lucene Document from any XML file by stripping the XML tags to extract and index the content. The reader for this type of Document is XMLDocReader.

The Lucene Document fields that are created by this class are (in addition to the ones listed for FileIndexingServiceWriter):

collection - The collection associated with this resource.
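
The tag-stripping idea described above can be illustrated with a minimal sketch. It uses the JDK's standard DOM parser rather than this class's own code, and XmlTextSketch and stripTags are hypothetical names that are not part of DLESE Tools. It also shows why well-formed XML is required: the parser rejects anything else.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of DLESE Tools.
class XmlTextSketch {

    /** Returns the text content of the XML with tags removed and whitespace collapsed. */
    static String stripTags(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // getTextContent() concatenates the text of all descendant nodes.
            String text = doc.getDocumentElement().getTextContent();
            return text.trim().replaceAll("\\s+", " ");
        } catch (Exception e) {
            // Parsing fails when the XML is not well-formed.
            throw new IllegalArgumentException("Could not parse XML: " + e.getMessage(), e);
        }
    }
}
```

The resulting String is the kind of content that could then be added to a searchable field.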

Author:
John Weatherley
See Also:
FileIndexingService, XMLDocReader

Constructor Summary
XMLFileIndexingWriter()
          Constructor for the XMLFileIndexingWriter.
 
Method Summary
protected abstract  String[] _getIds()
          Return unique IDs for the item being indexed, one for each collection that catalogs the resource.
protected  void addCustomFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document existingDoc, File sourceFile)
          Adds the full content of the XML to the default search field.
protected abstract  void addFields(org.apache.lucene.document.Document newDoc, org.apache.lucene.document.Document existingDoc, File sourceFile)
          Adds additional fields that are unique to the document format being indexed.
protected  BoundingBox getBoundingBox()
          Return the geospatial BoundingBox footprint that represents the resource being indexed, or null if none applies.
protected  String[] getCollections()
          Returns unique collection keys for the item being indexed.
 org.apache.lucene.document.Document getDeletedDoc(org.apache.lucene.document.Document existingDoc)
          Creates a Lucene Document for the XML that is equal to the existing Document.
abstract  String getDescription()
          Return a description for the document being indexed, or null if none applies.
 String getDocGroup()
          Gets the collection specifier, for example 'dcc', 'comet'.
protected  Document getDom4jDoc()
          Gets the dom4j Document for use by sub-classes
protected  String getFieldContent(String[] values, String useVocabMapping, String metadataFormat)
          Gets the vocab encoded keys for the given values, separated by the '+' symbol.
protected  String getFieldContent(String value, String useVocabMapping, String metadataFormat)
          Gets the encoded vocab key for the given content.
protected  String getFieldName(String vocabFieldString, String metadataFormat)
          Gets the field ID, for example 'gr', for a given vocab, for example 'gradeRange'.
 String[] getIds()
          Returns the ids for the item being indexed.
protected  SimpleLuceneIndex getIndex()
          Gets the index used by this XML File Indexer
protected  ResultDocList getMyAnnoResultDocs()
          Gets the annotations for this record, null or zero length if none available.
protected  DleseCollectionDocReader getMyCollectionDoc()
          Gets the DLESECollectionDocReader for the collection in which this item is a part, or null if not available.
static String getOaiModtime(File sourceFile, org.apache.lucene.document.Document existingDoc)
          Gets the oaiModtime for the given File or Document, set to 3 minutes in the future to account for any delay in indexing updates.
 String getPrimaryId()
          Returns the unique primary record ID for the item being indexed.
protected  RecordDataService getRecordDataService()
          Gets the recordDataService used by this XML File Indexer
 List getRelatedIds()
          Gets the ids of related records.
 Map getRelatedIdsMap()
          Gets the ids of related records.
 List getRelatedUrls()
          Gets the urls of related records.
 Map getRelatedUrlsMap()
          Gets the urls of related records.
protected  String getTermStringFromStringArray(String[] vals)
          Gets the appropriate terms from a string array of metadata fields.
abstract  String getTitle()
          Return a title for the document being indexed, or null if none applies.
abstract  String[] getUrls()
          Return the URL(s) to the resource being indexed, or null if none apply.
protected abstract  Date getWhatsNewDate()
          Returns the date used to determine "What's new" in the library, or null if none is available.
protected abstract  String getWhatsNewType()
          Returns the type of category for "What's new" in the library, or null if none is available.
protected  XMLIndexer getXmlIndexer()
          Gets the XMLIndexer for use by sub-classes
protected  XMLIndexerFieldsConfig getXmlIndexerFieldsConfig()
          Gets the XMLIndexerFieldsConfig to use for XML indexing, or null if none available.
abstract  boolean indexFullContentInDefaultAndStems()
          Return true to have the full XML content indexed in the 'default' and 'stems' fields, false if handled by the sub-class.
abstract  void init(File source, org.apache.lucene.document.Document existingDoc)
          This method is called prior to processing and may be used for any necessary set-up.
 
Methods inherited from class org.dlese.dpc.index.writer.FileIndexingServiceWriter
abortIndexing, addDocToRemove, addToAdminDefaultField, addToDefaultField, create, destroy, getConfigAttributes, getDocsource, getDocType, getFileContent, getFileIndexingPlugin, getFileIndexingService, getLuceneDoc, getPreviousRecordDoc, getReaderClass, getSessionAttributes, getSourceDir, getSourceFile, getValidationReport, isMakingDeletedDoc, isValidationEnabled, prtln, prtlnErr, setConfigAttributes, setDebug, setFileIndexingPlugin, setFileIndexingService, setIsMakingDeletedDoc, setValidationEnabled
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

XMLFileIndexingWriter

public XMLFileIndexingWriter()
Constructor for the XMLFileIndexingWriter.

Method Detail

getIds

public String[] getIds()
                throws Exception
Returns the ids for the item being indexed. If more than one record catalogs the same item, this represents the primary ID.

Returns:
The id Strings
Throws:
Exception - If error
See Also:
getIds()

getPrimaryId

public String getPrimaryId()
                    throws Exception
Returns the unique primary record ID for the item being indexed. If more than one record catalogs the same item, this represents the primary ID.

Returns:
The id String
Throws:
Exception - If error
See Also:
getIds()

getRelatedIds

public List getRelatedIds()
                   throws IllegalStateException,
                          Exception
Gets the ids of related records.

Returns:
The related ids value, or null if none
Throws:
IllegalStateException - If called prior to calling method #indexFields
Exception - If error

getRelatedUrls

public List getRelatedUrls()
                    throws IllegalStateException,
                           Exception
Gets the urls of related records.

Returns:
The related urls value, or null if none
Throws:
IllegalStateException - If called prior to calling method #indexFields
Exception - If error

getRelatedIdsMap

public Map getRelatedIdsMap()
                     throws IllegalStateException,
                            Exception
Gets the ids of related records. The Map key contains the relationship (isAnnotatedBy, etc.) and the Map value contains a List of Strings that indicate the ids of the target records.

Returns:
The related ids value, or null if none
Throws:
IllegalStateException - If called prior to calling method #indexFields
Exception - If error
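
As a rough illustration of the Map shape described above (relationship name as key, List of target ids as value), the following sketch builds such a map and walks it. The class name, the generic type parameters and the ids "ANNO-001" and "ANNO-002" are assumptions made for the example; the flatten helper only shows how a caller might read the structure and is not a statement about how getRelatedIds() is computed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of DLESE Tools.
class RelatedIdsSketch {

    /** Builds a map shaped like the one described: relationship -> ids of the target records. */
    static Map<String, List<String>> exampleMap() {
        Map<String, List<String>> related = new LinkedHashMap<>();
        // Invented ids, used only to show the structure.
        related.put("isAnnotatedBy", Arrays.asList("ANNO-001", "ANNO-002"));
        return related;
    }

    /** Collects the ids from every relationship into one flat list. */
    static List<String> flatten(Map<String, List<String>> related) {
        List<String> all = new ArrayList<>();
        for (List<String> ids : related.values()) {
            all.addAll(ids);
        }
        return all;
    }
}
```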

getRelatedUrlsMap

public Map getRelatedUrlsMap()
                      throws IllegalStateException,
                             Exception
Gets the urls of related records. The Map key contains the relationship (isAnnotatedBy, etc.) and the Map value contains a List of Strings that indicate the urls of the target records.

Returns:
The related urls value, or null if none
Throws:
IllegalStateException - If called prior to calling method #indexFields
Exception - If error

getCollections

protected String[] getCollections()
                           throws Exception
Returns unique collection keys for the item being indexed. For example "dcc" (single collection) or "dcc dwel" (multiple collections). If more than one collection is provided, the first one must be the primary collection. May be overridden by sub-classes as appropriate (overridden by ADNFileIndexingWriter).

Returns:
The collection keys
Throws:
Exception - This method should throw an Exception with appropriate error message if an error occurs.

getDocGroup

public String getDocGroup()
                   throws Exception
Gets the collection specifier, for example 'dcc', 'comet'.

Specified by:
getDocGroup in class FileIndexingServiceWriter
Returns:
The collection specifier
Throws:
Exception - If error occurred

getBoundingBox

protected BoundingBox getBoundingBox()
                              throws Exception
Return the geospatial BoundingBox footprint that represents the resource being indexed, or null if none applies. Override if necessary.

Returns:
BoundingBox, or null
Throws:
Exception - This method should throw an Exception with appropriate error message if an error occurs.

init

public abstract void init(File source,
                          org.apache.lucene.document.Document existingDoc)
                   throws Exception
This method is called prior to processing and may be used for any necessary set-up. This method should throw an Exception with appropriate message if an error occurs.

Specified by:
init in class FileIndexingServiceWriter
Parameters:
source - The source file being indexed
existingDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
Throws:
Exception - If an error occurred during set-up.

_getIds

protected abstract String[] _getIds()
                             throws Exception
Return unique IDs for the item being indexed, one for each collection that catalogs the resource. For example "DLESE-000-000-000-001" (single ID) or "DLESE-000-000-000-036 COMET-60" (multiple IDs). If more than one ID is present, the first one is the primary.

Returns:
The id(s)
Throws:
Exception - This method should throw an Exception with appropriate error message if an error occurs.
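
The "first one is the primary" convention can be sketched as follows. PrimaryIdSketch and primaryOf are hypothetical names, and returning null for an empty or missing array is an assumption made for this example, not documented behavior of getPrimaryId().

```java
// Hypothetical helper for illustration only; not part of DLESE Tools.
class PrimaryIdSketch {

    /** Returns the first id in the array, which the convention treats as the primary one. */
    static String primaryOf(String[] ids) {
        // Assumption: no ids means there is no primary id to report.
        if (ids == null || ids.length == 0) {
            return null;
        }
        return ids[0];
    }
}
```

For the multiple-ID example given above, the primary ID would be "DLESE-000-000-000-036".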

getTitle

public abstract String getTitle()
                         throws Exception
Return a title for the document being indexed, or null if none applies. The String is tokenized, stored and indexed under the field key 'title' and is also indexed in the 'default' field.

Returns:
The title String
Throws:
Exception - This method should throw an Exception with appropriate error message if an error occurs.

getDescription

public abstract String getDescription()
                               throws Exception
Return a description for the document being indexed, or null if none applies. The String is tokenized, stored and indexed under the field key 'description' and is also indexed in the 'default' field.

Returns:
The description String
Throws:
Exception - This method should throw an Exception with appropriate error message if an error occurs.

getUrls

public abstract String[] getUrls()
                          throws Exception
Return the URL(s) to the resource being indexed, or null if none applies. If more than one URL references the resource, the first one is the primary. The URL Strings are tokenized and indexed under the field key 'uri' and are also indexed in the 'default' field. They are also stored in the index untokenized under the field key 'url'.

Returns:
The url String(s)
Throws:
Exception - This method should throw an Exception with appropriate error message if an error occurs.

indexFullContentInDefaultAndStems

public abstract boolean indexFullContentInDefaultAndStems()
Return true to have the full XML content indexed in the 'default' and 'stems' fields, false if handled by the sub-class. If true, the content is indexed using the #addToDefaultField method.

Returns:
True to have the full XML content indexed in the 'default' and 'stems' fields

getWhatsNewDate

protected abstract Date getWhatsNewDate()
                                 throws Exception
Returns the date used to determine "What's new" in the library, or null if none is available.

Returns:
The what's new date for the item or null if not available.
Throws:
Exception - This method should throw an Exception with appropriate error message if an error occurs.

getWhatsNewType

protected abstract String getWhatsNewType()
                                   throws Exception
Returns the type of category for "What's new" in the library, or null if none is available. Must be a simple lower case String with no spaces, for example 'itemnew', 'itemannocomplete', 'itemannoinprogress', 'annocomplete', 'annoinprogress' or 'collection'.

Returns:
The what's new type.
Throws:
Exception - This method should throw an Exception with appropriate error message if an error occurs.
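
An implementer might check a candidate value against the "lower case, no spaces" rule before returning it. The following is a minimal sketch of such a check; WhatsNewTypeSketch and isValidType are hypothetical names, and treating null as acceptable follows from the description allowing null when no type is available.

```java
// Hypothetical helper for illustration only; not part of DLESE Tools.
class WhatsNewTypeSketch {

    /** Returns true if the value is null or a non-empty String without whitespace or upper case letters. */
    static boolean isValidType(String type) {
        if (type == null) {
            return true; // null means no type is available, which the description permits
        }
        if (type.isEmpty()) {
            return false;
        }
        for (int i = 0; i < type.length(); i++) {
            char c = type.charAt(i);
            if (Character.isWhitespace(c) || Character.isUpperCase(c)) {
                return false;
            }
        }
        return true;
    }
}
```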

addFields

protected abstract void addFields(org.apache.lucene.document.Document newDoc,
                                  org.apache.lucene.document.Document existingDoc,
                                  File sourceFile)
                           throws Exception
Adds additional fields that are unique to the document format being indexed. When implementing this method, use the add method of the Document class to add a Field.

The following Lucene Field types are available for indexing with the Document:
Field.Text(String name, String value) -- tokenized, indexed, stored
Field.UnStored(String name, String value) -- tokenized, indexed, not stored
Field.Keyword(String name, String value) -- not tokenized, indexed, stored
Field.UnIndexed(String name, String value) -- not tokenized, not indexed, stored
Field(String name, String string, boolean store, boolean index, boolean tokenize) -- gives full control over whether the value is stored, indexed and tokenized

Example code:
protected void addFields(Document newDoc, Document existingDoc, File sourceFile) throws Exception {
  // Add a custom value as a tokenized, indexed and stored field
  String customContent = "Some content";
  newDoc.add(Field.Text("mycustomfield", customContent));
}

Parameters:
newDoc - The new Document that is being created for this resource
existingDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
sourceFile - The sourceFile that is being indexed
Throws:
Exception - This method should throw an Exception with appropriate error message if an error occurs.

addCustomFields

protected void addCustomFields(org.apache.lucene.document.Document newDoc,
                               org.apache.lucene.document.Document existingDoc,
                               File sourceFile)
                        throws Exception
Adds the full content of the XML to the default search field. Strips the XML tags to extract the content. Will not work properly if the XML is not well-formed.

Specified by:
addCustomFields in class FileIndexingServiceWriter
Parameters:
newDoc - The new Document that is being created for this resource
existingDoc - An existing Document that currently resides in the index for the given resource, or null if none was previously present
sourceFile - The source file that is being indexed
Throws:
Exception - This method should throw an Exception with appropriate error message if an error occurs.

getDeletedDoc

public org.apache.lucene.document.Document getDeletedDoc(org.apache.lucene.document.Document existingDoc)
                                                  throws Throwable
Creates a Lucene Document for the XML that is equal to the existing Document.

Overrides:
getDeletedDoc in class FileIndexingServiceWriter
Parameters:
existingDoc - An existing FileIndexingService Document that currently resides in the index for the given file
Returns:
A Lucene FileIndexingService Document
Throws:
Throwable - Thrown if error occurs

getMyAnnoResultDocs

protected ResultDocList getMyAnnoResultDocs()
                                     throws Exception
Gets the annotations for this record, null or zero length if none available.

Returns:
The myAnnoResultDocs value
Throws:
Exception - If error

getXmlIndexerFieldsConfig

protected XMLIndexerFieldsConfig getXmlIndexerFieldsConfig()
Gets the XMLIndexerFieldsConfig to use for XML indexing, or null if none available.

Returns:
The xmlIndexerFieldsConfig value

getFieldContent

protected String getFieldContent(String[] values,
                                 String useVocabMapping,
                                 String metadataFormat)
                          throws Exception
Gets the vocab encoded keys for the given values, separated by the '+' symbol.

Parameters:
values - The values to encode.
useVocabMapping - The mapping to use, for example "contentStandards".
metadataFormat - The metadata format, for example 'adn'
Returns:
The encoded vocab keys.
Throws:
Exception - If error.
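
The '+'-separated result can be pictured with a small sketch. It replaces the library's vocabulary lookup with a plain Map, so VocabKeySketch, encodeAll, and any values and keys placed in that Map are hypothetical. Leaving a value unchanged when no key is found is borrowed from the single-value overload's description and is an assumption for the array form.

```java
import java.util.Map;

// Hypothetical helper for illustration only; not part of DLESE Tools.
class VocabKeySketch {

    /** Looks up each value in the mapping and joins the results with '+'. */
    static String encodeAll(String[] values, Map<String, String> mapping) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append('+');
            }
            String key = mapping.get(values[i]);
            // Assumption: a value with no known key is passed through unchanged.
            sb.append(key != null ? key : values[i]);
        }
        return sb.toString();
    }
}
```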

getFieldContent

protected String getFieldContent(String value,
                                 String useVocabMapping,
                                 String metadataFormat)
                          throws Exception
Gets the encoded vocab key for the given content.

Parameters:
value - The value to encode
useVocabMapping - The vocab mapping to use, for example 'contentStandard'
metadataFormat - The metadata format, for example 'adn'
Returns:
The encoded value, or unchanged if unable to encode
Throws:
Exception - If error

getFieldName

protected String getFieldName(String vocabFieldString,
                              String metadataFormat)
                       throws Exception
Gets the field ID, for example 'gr', for a given vocab, for example 'gradeRange'. If unable to get the field ID, the vocab field String is returned unchanged.

Parameters:
vocabFieldString - The field, for example 'gradeRange'
metadataFormat - The metadata format, for example 'adn'
Returns:
The field key, for example 'gr', or unchanged if unable to determine
Throws:
Exception - If error

getTermStringFromStringArray

protected String getTermStringFromStringArray(String[] vals)
Gets the appropriate terms from a string array of metadata fields. Uses all terms found after the last colon ":" found in the string.

Parameters:
vals - Metadata fields that must be delimited by colons.
Returns:
The individual terms used for indexing.
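
The "after the last colon" rule can be sketched as below. TermStringSketch and termsAfterLastColon are hypothetical names, the sample values are invented, and joining the extracted terms with a single space is an assumption, since the separator used by the real method is not stated here.

```java
// Hypothetical helper for illustration only; not part of DLESE Tools.
class TermStringSketch {

    /** Keeps the text after the last ':' of each value and joins the results with spaces. */
    static String termsAfterLastColon(String[] vals) {
        StringBuilder sb = new StringBuilder();
        for (String val : vals) {
            // lastIndexOf returns -1 when there is no colon, so the whole value is kept.
            String term = val.substring(val.lastIndexOf(':') + 1).trim();
            if (term.isEmpty()) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(term);
        }
        return sb.toString();
    }
}
```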

getXmlIndexer

protected XMLIndexer getXmlIndexer()
                            throws Exception
Gets the XMLIndexer for use by sub-classes

Returns:
The XMLIndexer
Throws:
Exception - If error

getDom4jDoc

protected Document getDom4jDoc()
                        throws Exception
Gets the dom4j Document for use by sub-classes

Returns:
The Document
Throws:
Exception - If error

getMyCollectionDoc

protected DleseCollectionDocReader getMyCollectionDoc()
Gets the DLESECollectionDocReader for the collection of which this item is a part, or null if not available.

Returns:
The myCollectionDoc value

getOaiModtime

public static final String getOaiModtime(File sourceFile,
                                         org.apache.lucene.document.Document existingDoc)
Gets the oaiModtime for the given File or Document, set to 3 minutes in the future to account for any delay in indexing updates.

Parameters:
sourceFile - The source file
existingDoc - The existing Doc
Returns:
The oaiModtime value
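
The three-minute offset can be sketched with java.time. This only illustrates the arithmetic: OaiModtimeSketch and shiftedModtime are hypothetical names, and the String format that getOaiModtime actually returns is not specified here, so the sketch returns an Instant instead.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration only; not part of DLESE Tools.
class OaiModtimeSketch {

    /** Offset added to allow for a delay in indexing updates. */
    static final Duration INDEXING_DELAY = Duration.ofMinutes(3);

    /** Returns the given modification time (epoch milliseconds) moved three minutes into the future. */
    static Instant shiftedModtime(long lastModifiedMillis) {
        return Instant.ofEpochMilli(lastModifiedMillis).plus(INDEXING_DELAY);
    }
}
```

For a file on disk, a caller could pass sourceFile.lastModified() as the argument.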

getRecordDataService

protected RecordDataService getRecordDataService()
Gets the recordDataService used by this XML File Indexer

Returns:
The recordDataService, or null if not available.

getIndex

protected SimpleLuceneIndex getIndex()
Gets the index used by this XML File Indexer

Returns:
The index, or null if not available.
