Bug 35208

Summary: [PATCH] HSLF Update: new (quicker but greedy) text extractor
Product: POI
Reporter: Nick Burch <apache>
Component: POI Overall
Assignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: normal    
Priority: P2    
Version: unspecified   
Target Milestone: ---   
Hardware: Other   
OS: other   
Attachments: org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor

Description Nick Burch 2005-06-03 18:33:48 UTC
To quote from the javadoc of this single class:
 * This class will get all the text from a PowerPoint document, including
 *  all the bits you didn't want, and in a somewhat random order, but will
 *  do it very fast.
 * The class ignores most of the hslf classes, and doesn't use 
 *  HSLFSlideShow. Instead, it just does a very basic scan through the
 *  file, grabbing all the text records as it goes. It then returns the
 *  text, either as a single string, or as a vector of all the individual
 *  strings.
 * Because of how it works, it will return a lot of "crud" text that you 
 *  probably didn't want! It will return text from master slides. It will
 *  return duplicate text, and some mangled text (PowerPoint files often
 *  have duplicate copies of slide text in them). You don't get any idea
 *  what the text was associated with.
 * Almost everyone will want to use @see PowerPointExtractor instead. There
 *  are only a very small number of cases (e.g. some performance-sensitive
 *  Lucene indexers) that would ever want to use this!
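
As a rough illustration of the "very basic scan" the javadoc describes, here is a minimal, self-contained sketch. It is not the attached class: the name QuickScanSketch and its helper methods are hypothetical, and it decodes the text itself instead of going through Record.createRecordForType as the patch implies the attachment does. It assumes the usual 8-byte PowerPoint record header (2 bytes version/instance, 2 bytes type, 4 bytes length, all little-endian), that a low version nibble of 0xF marks a container, and that types 4000 (TextCharsAtom) and 4008 (TextBytesAtom) hold the text.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the attached QuickButCruddyTextExtractor.
final class QuickScanSketch {
    static final int TEXT_CHARS_ATOM = 4000; // text stored as UTF-16LE
    static final int TEXT_BYTES_ATOM = 4008; // text stored one byte per character

    // Read an unsigned little-endian 16-bit value.
    static int ushort(byte[] b, int off) {
        return (b[off] & 0xFF) | ((b[off + 1] & 0xFF) << 8);
    }

    // Read an unsigned little-endian 32-bit value.
    static long uint(byte[] b, int off) {
        return (b[off] & 0xFFL) | ((b[off + 1] & 0xFFL) << 8)
             | ((b[off + 2] & 0xFFL) << 16) | ((b[off + 3] & 0xFFL) << 24);
    }

    // Walk the raw stream bytes, collecting every text atom in file order.
    static List<String> scan(byte[] data) {
        List<String> found = new ArrayList<>();
        int pos = 0;
        while (pos + 8 <= data.length) {
            int verInst = ushort(data, pos);
            int type = ushort(data, pos + 2);
            long len = uint(data, pos + 4);

            if ((verInst & 0x000F) == 0x000F) {
                // Container record: skip only its header so that the
                // children are visited by the following iterations.
                pos += 8;
                continue;
            }
            if (len > data.length - pos - 8) {
                break; // length runs past the end of the data; stop scanning
            }
            int n = (int) len;
            if (type == TEXT_BYTES_ATOM) {
                found.add(new String(data, pos + 8, n, StandardCharsets.ISO_8859_1));
            } else if (type == TEXT_CHARS_ATOM) {
                found.add(new String(data, pos + 8, n, StandardCharsets.UTF_16LE));
            }
            pos += 8 + n; // jump over this atom's header and body
        }
        return found;
    }
}
```

Because the scan never builds the slide tree, it cannot tell master-slide text from slide text or drop duplicates, which is the "crud" the javadoc warns about.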


The file should go in org.apache.poi.hslf.extractor. It also needs a single-line
change in org.apache.poi.hslf.record.Record, making createRecordForType public
so that the extractor can call it from another package:


Index: Record.java
===================================================================
RCS file: /home/cvspublic/jakarta-poi/src/scratchpad/src/org/apache/poi/hslf/record/Record.java,v
retrieving revision 1.1
diff -u -r1.1 Record.java
--- Record.java 28 May 2005 05:36:00 -0000      1.1
+++ Record.java 3 Jun 2005 16:31:00 -0000
@@ -122,7 +122,7 @@
         *  (not including the size of the header), this code assumes you're
         *  passing in corrected lengths
         */
-       protected static Record createRecordForType(long type, byte[] b, int start, int len) {
+       public static Record createRecordForType(long type, byte[] b, int start, int len) {
                // Default is to use UnknownRecordPlaceholder
                // When you create classes for new Records, add them here
                switch((int)type) {

Comment 1 Nick Burch 2005-06-03 18:34:20 UTC
Created attachment 15292
org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor

Comment 2 Nick Burch 2005-06-09 17:15:52 UTC
Added to CVS.