This is a text indexer and a set of text extractors for popular binary file formats. When a document is created or updated, the indexer asks the ExtractorManager for the list of extractors that apply to the given NodeDescriptor. The indexer then extracts the text from the document and uses Lucene to index it for optimized searching. DASL searches that use the contains clause are handled by TextContainsExpression and TextContainsExpressionFactory.

Four extractors are included, one for each of the four most popular binary file formats: Word, Excel, PowerPoint, and PDF. With the exception of PowerPoint, I used available libraries (MIT/BSD) to handle the actual extraction. I used the textmining library, a POI wrapper, to extract text from Word (POI's own Word library doesn't strip the formatting tags). I used the PDFBox library to extract text from PDF files. I used the high-level Excel library in POI to extract text from Excel, and POI's low-level OLE library to extract text from PowerPoint.

I'm going to attach the JARs that are not already included with Slide. I'm also attaching log4j.jar, which is needed by PDFBox. I don't understand why the log4j JAR included with Slide doesn't work with it; I just put both in my WAR and it worked.
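For anyone reading this without opening the attachments, the extractor lookup described above can be sketched roughly as follows. This is only an illustration, not the attached code: `TextExtractor`, `ExtractorRegistry`, `extractText`, and `plainBytesExtractor` are hypothetical names, and a plain content-type string stands in for Slide's NodeDescriptor. The fake extractor just decodes bytes as UTF-8 where a real one would call textmining, POI, or PDFBox.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an extractor; NOT the interface used in the attachments.
interface TextExtractor {
    boolean supports(String contentType);
    String extract(byte[] content);
}

// Hypothetical stand-in for the ExtractorManager: returns every registered
// extractor that claims to handle the given content type.
class ExtractorRegistry {
    private final List<TextExtractor> extractors = new ArrayList<>();

    void register(TextExtractor extractor) {
        extractors.add(extractor);
    }

    List<TextExtractor> extractorsFor(String contentType) {
        List<TextExtractor> matches = new ArrayList<>();
        for (TextExtractor extractor : extractors) {
            if (extractor.supports(contentType)) {
                matches.add(extractor);
            }
        }
        return matches;
    }
}

class ExtractorSketch {
    // Runs every matching extractor and joins the results with a space.
    // The returned string is what an indexer would hand to its search index.
    static String extractText(ExtractorRegistry registry, String contentType, byte[] content) {
        StringBuilder text = new StringBuilder();
        for (TextExtractor extractor : registry.extractorsFor(contentType)) {
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(extractor.extract(content));
        }
        return text.toString();
    }

    // Fake extractor for the sketch: decodes the bytes as UTF-8 text.
    // A real extractor would parse the binary format instead.
    static TextExtractor plainBytesExtractor(final String handledType) {
        return new TextExtractor() {
            public boolean supports(String contentType) {
                return handledType.equals(contentType);
            }

            public String extract(byte[] content) {
                return new String(content, StandardCharsets.UTF_8);
            }
        };
    }

    public static void main(String[] args) {
        ExtractorRegistry registry = new ExtractorRegistry();
        registry.register(plainBytesExtractor("application/pdf"));

        byte[] document = "quarterly report".getBytes(StandardCharsets.UTF_8);
        System.out.println(extractText(registry, "application/pdf", document));
    }
}
```

Keeping the lookup separate from the extractors means a new format only needs one more `register` call; a content type with no registered extractor simply yields an empty string and nothing is indexed for it.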
Created attachment 11969 [details] TextContentIndexer
Created attachment 11970 [details] TextContainsExpression
Created attachment 11971 [details] TextContainsExpressionFactory
Created attachment 11972 [details] MSWordExtractor
Created attachment 11973 [details] MSExcelExtractor
Created attachment 11974 [details] MSPowerPointExtractor
Created attachment 11975 [details] PDFExtractor
Created attachment 11976 [details] Sample Domain.xml config
Created attachment 11977 [details] textmining library needed by MSWordExtractor
I think this can be closed. Oliver?
Sure. This has been checked in.