Bug 29842 - A lucene-based text indexer and extractors for popular binary formats
Status: CLOSED FIXED
Alias: None
Product: Slide
Classification: Unclassified
Component: Search
Version: Nightly
Hardware: Other other
Importance: P3 enhancement
Target Milestone: ---
Assignee: Slide Developer List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-06-28 13:58 UTC by Ryan Rhodes
Modified: 2004-11-16 19:05 UTC

Attachments
TextContentIndexer (9.83 KB, text/plain)
2004-06-28 14:27 UTC, Ryan Rhodes
TextContainsExpression (4.53 KB, text/plain)
2004-06-28 14:28 UTC, Ryan Rhodes
TextContainsExpressionFactory (3.85 KB, patch)
2004-06-28 14:28 UTC, Ryan Rhodes
MSWordExtractor (1.08 KB, text/plain)
2004-06-28 14:29 UTC, Ryan Rhodes
MSExcelExtractor (1.99 KB, text/plain)
2004-06-28 14:29 UTC, Ryan Rhodes
MSPowerPointExtractor (1.91 KB, text/plain)
2004-06-28 14:29 UTC, Ryan Rhodes
PDFExtractor (1.23 KB, text/plain)
2004-06-28 14:30 UTC, Ryan Rhodes
Sample Domain.xml config (19.60 KB, text/plain)
2004-06-28 14:30 UTC, Ryan Rhodes
textmining library need by MSWordExtractor (225.93 KB, application/octet-stream)
2004-06-28 14:31 UTC, Ryan Rhodes

Description Ryan Rhodes 2004-06-28 13:58:38 UTC
This is a text indexer and a set of text extractors for popular binary file 
formats.

When a document is created or updated, the indexer asks the ExtractorManager 
for the list of extractors registered for the given NodeDescriptor.  The 
indexer then extracts the text from the document and indexes it with Lucene 
for faster searching.
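
The flow described above can be sketched roughly as follows.  This is only an 
illustration under stated assumptions, not the attached TextContentIndexer: 
TextExtractor, ExtractorRegistry and ToyIndexer are hypothetical names, the 
lookup is keyed by content type rather than by a NodeDescriptor, and a small 
in-memory inverted index stands in for Lucene so the example runs without 
extra JARs.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class IndexerSketch {
    public static void main(String[] args) {
        ExtractorRegistry registry = new ExtractorRegistry();
        // A trivial extractor for plain text; binary formats would plug in here.
        registry.register("text/plain", content -> new String(content, StandardCharsets.UTF_8));

        ToyIndexer indexer = new ToyIndexer(registry);
        indexer.index("/files/notes.txt", "text/plain",
                "Lucene indexes extracted text".getBytes(StandardCharsets.UTF_8));
        System.out.println(indexer.search("lucene"));
    }
}

/** Hypothetical extractor contract: turn raw content into plain text. */
interface TextExtractor {
    String extract(byte[] content);
}

/** Hypothetical registry playing the role the description gives ExtractorManager. */
class ExtractorRegistry {
    private final Map<String, List<TextExtractor>> byContentType = new HashMap<>();

    void register(String contentType, TextExtractor extractor) {
        byContentType.computeIfAbsent(contentType, k -> new ArrayList<>()).add(extractor);
    }

    List<TextExtractor> extractorsFor(String contentType) {
        return byContentType.getOrDefault(contentType, Collections.emptyList());
    }
}

/** Toy inverted index standing in for Lucene: term -> URIs containing it. */
class ToyIndexer {
    private final ExtractorRegistry registry;
    private final Map<String, Set<String>> postings = new HashMap<>();

    ToyIndexer(ExtractorRegistry registry) {
        this.registry = registry;
    }

    /** Used for both create and update: drop old postings, then re-extract and re-index. */
    void index(String uri, String contentType, byte[] content) {
        for (Set<String> uris : postings.values()) {
            uris.remove(uri);
        }
        for (TextExtractor extractor : registry.extractorsFor(contentType)) {
            String text = extractor.extract(content).toLowerCase(Locale.ROOT);
            for (String term : text.split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, k -> new TreeSet<>()).add(uri);
                }
            }
        }
    }

    Set<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

Treating an update as "remove, then re-add" keeps stale terms out of the 
index; a Lucene-backed version would need the equivalent delete-by-URI step 
before adding the new document.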

DASL searches that use the contains clause are handled by 
TextContainsExpression and TextContainsExpressionFactory.
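
For readers unfamiliar with the contains clause, here is a small sketch that 
builds the body of a SEARCH request using it.  The XML follows the general 
DAV:basicsearch grammar; the class name DaslContainsRequest, the /files scope 
and the selected displayname property are assumptions for illustration, and 
nothing here is taken from the attached classes.

```java
class DaslContainsRequest {

    /** Escapes the characters that are special in XML text content. */
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds a basicsearch request whose where clause is a single contains. */
    static String build(String scopeHref, String searchText) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<D:searchrequest xmlns:D=\"DAV:\">\n"
            + "  <D:basicsearch>\n"
            + "    <D:select><D:prop><D:displayname/></D:prop></D:select>\n"
            + "    <D:from>\n"
            + "      <D:scope>\n"
            + "        <D:href>" + escapeXml(scopeHref) + "</D:href>\n"
            + "        <D:depth>infinity</D:depth>\n"
            + "      </D:scope>\n"
            + "    </D:from>\n"
            + "    <D:where>\n"
            + "      <D:contains>" + escapeXml(searchText) + "</D:contains>\n"
            + "    </D:where>\n"
            + "  </D:basicsearch>\n"
            + "</D:searchrequest>\n";
    }

    public static void main(String[] args) {
        // Hypothetical scope and search term.
        System.out.println(build("/files", "lucene"));
    }
}
```

A client would send this body with the SEARCH method; on the server side the 
contains element is the part that ends up being evaluated against the text 
index.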

Four extractors are included, covering the four most popular binary file 
formats.  With the exception of PowerPoint, I used available libraries 
(MIT/BSD) to handle the actual extraction.  I used the textmining library, a 
POI wrapper, to extract text from Word (POI's own Word library doesn't strip 
the formatting tags).  I used the PDFBox library to extract text from PDF 
files.  I used the high-level Excel library in POI to extract text from 
Excel, and I used POI's low-level OLE library to extract text from 
PowerPoint.

I'm going to attach the JARs that are not already included with Slide.  I'm 
also attaching log4j.jar, which PDFBox needs.  I don't understand why the 
log4j JAR included with Slide doesn't work; I just put both in my WAR and it 
worked.
Comment 1 Ryan Rhodes 2004-06-28 14:27:34 UTC
Created attachment 11969
TextContentIndexer
Comment 2 Ryan Rhodes 2004-06-28 14:28:11 UTC
Created attachment 11970
TextContainsExpression
Comment 3 Ryan Rhodes 2004-06-28 14:28:41 UTC
Created attachment 11971
TextContainsExpressionFactory
Comment 4 Ryan Rhodes 2004-06-28 14:29:07 UTC
Created attachment 11972
MSWordExtractor
Comment 5 Ryan Rhodes 2004-06-28 14:29:29 UTC
Created attachment 11973
MSExcelExtractor
Comment 6 Ryan Rhodes 2004-06-28 14:29:51 UTC
Created attachment 11974
MSPowerPointExtractor
Comment 7 Ryan Rhodes 2004-06-28 14:30:10 UTC
Created attachment 11975
PDFExtractor
Comment 8 Ryan Rhodes 2004-06-28 14:30:40 UTC
Created attachment 11976
Sample Domain.xml config
Comment 9 Ryan Rhodes 2004-06-28 14:31:17 UTC
Created attachment 11977
textmining library need by MSWordExtractor
Comment 10 Unico Hommes 2004-07-11 19:22:32 UTC
I think this can be closed. Oliver?
Comment 11 Oliver Zeigermann 2004-07-11 19:27:36 UTC
Sure. This has been checked in.