29842 – A lucene-based text indexer and extractors for popular binary formats

Summary: A lucene-based text indexer and extractors for popular binary formats

Status:	CLOSED FIXED

Alias:	None

Product:	Slide
Classification:	Unclassified
Component:	Search
Version:	Nightly
Hardware:	Other other

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Slide Developer List

URL:
Keywords:

Depends on:
Blocks:

Reported:	2004-06-28 13:58 UTC by Ryan Rhodes
Modified:	2004-11-16 19:05 UTC
CC List:	0 users

Attachments
TextContentIndexer (9.83 KB, text/plain) 2004-06-28 14:27 UTC, Ryan Rhodes
TextContainsExpression (4.53 KB, text/plain) 2004-06-28 14:28 UTC, Ryan Rhodes
TextContainsExpressionFactory (3.85 KB, patch) 2004-06-28 14:28 UTC, Ryan Rhodes
MSWordExtractor (1.08 KB, text/plain) 2004-06-28 14:29 UTC, Ryan Rhodes
MSExcelExtractor (1.99 KB, text/plain) 2004-06-28 14:29 UTC, Ryan Rhodes
MSPowerPointExtractor (1.91 KB, text/plain) 2004-06-28 14:29 UTC, Ryan Rhodes
PDFExtractor (1.23 KB, text/plain) 2004-06-28 14:30 UTC, Ryan Rhodes
Sample Domain.xml config (19.60 KB, text/plain) 2004-06-28 14:30 UTC, Ryan Rhodes
textmining library need by MSWordExtractor (225.93 KB, application/octet-stream) 2004-06-28 14:31 UTC, Ryan Rhodes

Description Ryan Rhodes 2004-06-28 13:58:38 UTC

This is a text indexer and a set of text extractors for popular binary file 
formats.

When a document is created or updated, the indexer asks the ExtractorManager
for the list of extractors that apply to the document's NodeDescriptor.  It
then runs those extractors over the document and indexes the extracted text
with Lucene so that searches do not have to scan the content itself.
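The flow described above can be sketched roughly as follows.  This is not the
attached TextContentIndexer: every name in it (SketchExtractor,
SketchExtractorManager, SketchIndexer, PlainTextSketchExtractor) is
hypothetical, a content type string stands in for the NodeDescriptor, and a
small in-memory inverted index stands in for Lucene so that the example runs
without any extra JARs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical extractor contract: turn a document's bytes into plain text.
interface SketchExtractor {
    boolean accepts(String contentType);
    String extract(InputStream content) throws IOException;
}

// Hypothetical registry. The real ExtractorManager is queried with a
// NodeDescriptor; this sketch approximates that with a content type string.
class SketchExtractorManager {
    private final List<SketchExtractor> extractors = new ArrayList<>();

    void register(SketchExtractor extractor) {
        extractors.add(extractor);
    }

    List<SketchExtractor> extractorsFor(String contentType) {
        List<SketchExtractor> matches = new ArrayList<>();
        for (SketchExtractor extractor : extractors) {
            if (extractor.accepts(contentType)) {
                matches.add(extractor);
            }
        }
        return matches;
    }
}

// In-memory inverted index standing in for Lucene: term -> document URIs.
class SketchIndexer {
    private final SketchExtractorManager manager;
    private final Map<String, Set<String>> postings = new HashMap<>();

    SketchIndexer(SketchExtractorManager manager) {
        this.manager = manager;
    }

    // Used for both create and update: drop the old entries, then re-index.
    void index(String uri, String contentType, byte[] content) {
        for (Set<String> uris : postings.values()) {
            uris.remove(uri);
        }
        for (SketchExtractor extractor : manager.extractorsFor(contentType)) {
            String text;
            try {
                text = extractor.extract(new ByteArrayInputStream(content));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(term, k -> new HashSet<>()).add(uri);
                }
            }
        }
    }

    // Returns a copy of the set of URIs whose extracted text contains the term.
    Set<String> search(String term) {
        Set<String> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new HashSet<String>() : new HashSet<String>(hits);
    }
}

// Trivial extractor for text/* content, only here to exercise the flow.
class PlainTextSketchExtractor implements SketchExtractor {
    public boolean accepts(String contentType) {
        return contentType.startsWith("text/");
    }

    public String extract(InputStream content) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int read;
        while ((read = content.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

A document with no matching extractor is simply left out of the index in this
sketch; whether the attached indexer behaves the same way is not stated here.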

DASL searches that use the contains clause are handled by
TextContainsExpression and TextContainsExpressionFactory.

Four extractors are included, one for each of the four most popular binary
file formats.  With the exception of PowerPoint, I used available libraries
(MIT/BSD) to handle the actual extraction.  For Word I used the textmining
library, a POI wrapper (POI's own Word library doesn't strip the formatting
tags).  For PDF I used the PDFBox library.  For Excel I used the high-level
Excel library in POI, and for PowerPoint I used POI's low-level OLE library.
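To illustrate what low-level extraction from PowerPoint can involve, here is a
rough sketch and not the attached MSPowerPointExtractor.  It assumes the
"PowerPoint Document" stream has already been read out of the OLE container
(for example with POI's POIFS classes, not shown) and that text lives in
records of type 4000 (UTF-16LE characters) and 4008 (one byte per character).
The class name PptTextScanSketch and the choice of ISO-8859-1 for the
single-byte records are assumptions of the sketch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: walks the record headers of a PowerPoint stream
// and collects the text records it recognises.
class PptTextScanSketch {
    static final int TEXT_CHARS_ATOM = 4000; // assumed: UTF-16LE text
    static final int TEXT_BYTES_ATOM = 4008; // assumed: one byte per character

    static String extractText(byte[] stream) {
        StringBuilder text = new StringBuilder();
        int pos = 0;
        // Each record starts with an 8-byte little-endian header:
        // 2 bytes version/instance, 2 bytes type, 4 bytes body length.
        while (pos + 8 <= stream.length) {
            int verAndInstance = u16(stream, pos);
            int type = u16(stream, pos + 2);
            long length = u32(stream, pos + 4);
            int body = pos + 8;

            // Version nibble 0xF marks a container: step into its children.
            if ((verAndInstance & 0x000F) == 0x000F) {
                pos = body;
                continue;
            }
            if (length > stream.length - body) {
                break; // truncated or corrupt record, stop scanning
            }
            int len = (int) length;
            if (type == TEXT_BYTES_ATOM) {
                text.append(new String(stream, body, len, StandardCharsets.ISO_8859_1));
                text.append('\n');
            } else if (type == TEXT_CHARS_ATOM) {
                text.append(new String(stream, body, len, StandardCharsets.UTF_16LE));
                text.append('\n');
            }
            pos = body + len; // skip to the next sibling record
        }
        return text.toString();
    }

    static int u16(byte[] b, int off) {
        return (b[off] & 0xFF) | ((b[off + 1] & 0xFF) << 8);
    }

    static long u32(byte[] b, int off) {
        return (u16(b, off) & 0xFFFFL) | ((long) u16(b, off + 2) << 16);
    }
}
```

The sketch treats every text record alike and makes no attempt to tell slide
text from notes or master text, which a production extractor might want to do.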

I'm going to attach the JARs that are not already included with Slide.  I'm
also attaching log4j.jar, which PDFBox needs.  I don't understand why the
log4j JAR included with Slide doesn't work; I just put both in my WAR and it
worked.

Comment 1 Ryan Rhodes 2004-06-28 14:27:34 UTC

Created attachment 11969
TextContentIndexer

Comment 2 Ryan Rhodes 2004-06-28 14:28:11 UTC

Created attachment 11970
TextContainsExpression

Comment 3 Ryan Rhodes 2004-06-28 14:28:41 UTC

Created attachment 11971
TextContainsExpressionFactory

Comment 4 Ryan Rhodes 2004-06-28 14:29:07 UTC

Created attachment 11972
MSWordExtractor

Comment 5 Ryan Rhodes 2004-06-28 14:29:29 UTC

Created attachment 11973
MSExcelExtractor

Comment 6 Ryan Rhodes 2004-06-28 14:29:51 UTC

Created attachment 11974
MSPowerPointExtractor

Comment 7 Ryan Rhodes 2004-06-28 14:30:10 UTC

Created attachment 11975
PDFExtractor

Comment 8 Ryan Rhodes 2004-06-28 14:30:40 UTC

Created attachment 11976
Sample Domain.xml config

Comment 9 Ryan Rhodes 2004-06-28 14:31:17 UTC

Created attachment 11977
textmining library need by MSWordExtractor

Comment 10 Unico Hommes 2004-07-11 19:22:32 UTC

I think this can be closed. Oliver?

Comment 11 Oliver Zeigermann 2004-07-11 19:27:36 UTC

Sure. This has been checked in.