Bug 35208 - [PATCH] HSLF Update: new (quicker but greedy) text extractor
Alias: None
Product: POI
Classification: Unclassified
Component: POI Overall
Version: unspecified
Hardware: Other other
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2005-06-03 18:33 UTC by Nick Burch
Modified: 2005-06-09 09:15 UTC

org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor (6.20 KB, text/x-java)
2005-06-03 18:34 UTC, Nick Burch

Description Nick Burch 2005-06-03 18:33:48 UTC
To quote from the javadoc of this single class:
 * This class will get all the text from a Powerpoint Document, including
 *  all the bits you didn't want, and in a somewhat random order, but will
 *  do it very fast.
 * The class ignores most of the hslf classes, and doesn't use 
 *  HSLFSlideShow. Instead, it just does a very basic scan through the
 *  file, grabbing all the text records as it goes. It then returns the
 *  text, either as a single string, or as a vector of all the individual
 *  strings.
 * Because of how it works, it will return a lot of "crud" text that you 
 *  probably didn't want! It will return text from master slides. It will
 *  return duplicate text, and some mangled text (powerpoint files often
 *  have duplicate copies of slide text in them). You don't get any idea
 *  what the text was associated with.
 * Almost everyone will want to use @see PowerPointExtractor instead. There
 *  are only a very small number of cases (eg some performance sensitive
 *  lucene indexers) that would ever want to use this!
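
For readers without the attachment, the kind of scan the javadoc describes can be sketched roughly as below. This is a hypothetical illustration, not the attached class: the name GreedyTextScan and its scan method are made up here, it assumes the caller has already read the PowerPoint document stream into a byte array, and it only picks up TextBytesAtom (type 4008) and TextCharsAtom (type 4000) records.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; NOT the attached QuickButCruddyTextExtractor.
final class GreedyTextScan {
    static final int TEXT_CHARS_ATOM = 4000; // text stored as UTF-16LE
    static final int TEXT_BYTES_ATOM = 4008; // text stored as one byte per character

    /** Walks the records in data and returns every text run found, in file order. */
    static List<String> scan(byte[] data) {
        List<String> found = new ArrayList<>();
        int pos = 0;
        // Each record starts with an 8 byte little-endian header:
        // 2 bytes version/instance, 2 bytes type, 4 bytes body length.
        while (pos + 8 <= data.length) {
            int version = data[pos] & 0x0F;
            int type = (data[pos + 2] & 0xFF) | ((data[pos + 3] & 0xFF) << 8);
            long len = (data[pos + 4] & 0xFFL)
                     | ((data[pos + 5] & 0xFFL) << 8)
                     | ((data[pos + 6] & 0xFFL) << 16)
                     | ((data[pos + 7] & 0xFFL) << 24);
            int body = pos + 8;

            if (version == 0x0F) {
                // Container record: step into it so its children are scanned too.
                pos = body;
                continue;
            }
            if (len > data.length - body) {
                break; // truncated or corrupt record, stop scanning
            }
            int n = (int) len;
            if (type == TEXT_BYTES_ATOM) {
                found.add(new String(data, body, n, StandardCharsets.ISO_8859_1));
            } else if (type == TEXT_CHARS_ATOM) {
                found.add(new String(data, body, n, StandardCharsets.UTF_16LE));
            }
            pos = body + n; // move on to the next record
        }
        return found;
    }
}
```

Stepping into every container instead of interpreting it is what makes such a scan both fast and greedy: master-slide text and duplicate copies come back along with everything else, with no indication of which slide they belong to.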

The file should go in org.apache.poi.hslf.extractor. It also needs a single-line
change in org.apache.poi.hslf.record.Record, making createRecordForType public
instead of protected:

Index: Record.java
RCS file:
retrieving revision 1.1
diff -u -r1.1 Record.java
--- Record.java 28 May 2005 05:36:00 -0000      1.1
+++ Record.java 3 Jun 2005 16:31:00 -0000
@@ -122,7 +122,7 @@
         *  (not including the size of the header), this code assumes you're
         *  passing in corrected lengths
-       protected static Record createRecordForType(long type, byte[] b, int start, int len) {
+       public static Record createRecordForType(long type, byte[] b, int start, int len) {
                // Default is to use UnknownRecordPlaceholder
                // When you create classes for new Records, add them here
                switch((int)type) {
Comment 1 Nick Burch 2005-06-03 18:34:20 UTC
Created attachment 15292
Comment 2 Nick Burch 2005-06-09 17:15:52 UTC
Added to CVS.