47731 – Word Extractor considers text copied from some website as an embedded object

Summary: Word Extractor considers text copied from some website as an embedded object

Status:	RESOLVED FIXED

Alias:	None

Product:	POI
Classification:	Unclassified
Component:	HWPF
Version:	3.2-FINAL
Hardware:	PC Windows Server 2003

Importance:	P2 enhancement
Target Milestone:	---
Assignee:	POI Developers List

Reported:	2009-08-24 22:50 UTC by Gitu
Modified:	2011-08-09 12:43 UTC
CC List:	2 users

Attachments
This attachment contains text copied from a web page (877.00 KB, application/msword) 2009-08-31 22:04 UTC, Gitu

Description Gitu 2009-08-24 22:50:21 UTC

Hi,

I copied some text from a web page and pasted it into a Word document.
When I use WordExtractor to extract the content of that document, the complete content is extracted, but the summary information appears multiple times.

After investigating, I found that each part of that document is treated as an embedded object, and the summary is extracted once for each embedded object, i.e. the same value appears that many times.
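
The symptom described above can be illustrated with a toy model. This is only a sketch and does not use POI: the `Part` class, `extractNaive`, and `extractOnce` are hypothetical names made up for illustration, and nothing here is taken from the actual WordExtractor implementation. It assumes an extractor that appends the summary after every part flagged as embedded, and contrasts it with one that emits the summary a single time.

```java
import java.util.List;

public class SummaryDuplicationSketch {

    // Hypothetical stand-in for one part of a document; not a POI class.
    static final class Part {
        final String text;
        final boolean embedded;

        Part(String text, boolean embedded) {
            this.text = text;
            this.embedded = embedded;
        }
    }

    // Assumed naive behaviour: the summary is appended after every embedded part.
    static String extractNaive(String summary, List<Part> parts) {
        StringBuilder out = new StringBuilder();
        for (Part p : parts) {
            out.append(p.text).append('\n');
            if (p.embedded) {
                out.append(summary).append('\n');
            }
        }
        return out.toString();
    }

    // One possible alternative: emit the summary once, after all parts.
    static String extractOnce(String summary, List<Part> parts) {
        StringBuilder out = new StringBuilder();
        for (Part p : parts) {
            out.append(p.text).append('\n');
        }
        out.append(summary).append('\n');
        return out.toString();
    }

    // Counts non-overlapping occurrences of needle in haystack.
    static int count(String haystack, String needle) {
        int n = 0;
        int i = haystack.indexOf(needle);
        while (i >= 0) {
            n++;
            i = haystack.indexOf(needle, i + needle.length());
        }
        return n;
    }

    public static void main(String[] args) {
        String summary = "Title: Sample";
        List<Part> parts = List.of(
                new Part("first pasted paragraph", true),
                new Part("second pasted paragraph", true),
                new Part("third pasted paragraph", true));
        System.out.println(count(extractNaive(summary, parts), summary)); // 3
        System.out.println(count(extractOnce(summary, parts), summary));  // 1
    }
}
```

With three parts flagged as embedded, the naive variant repeats the summary three times, which matches the kind of output reported here.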

I would also like to know whether treating HTML content as an embedded object is valid behaviour.

I have attached a document which can reproduce the scenario.

Many thanks in advance,
Gitu

Comment 1 Yegor Kozlov 2009-08-31 10:05:14 UTC

You seem to have forgotten to attach the file. Please re-attach.

Yegor

Comment 2 Gitu 2009-08-31 22:04:02 UTC

Created attachment 24197
This attachment contains text copied from a web page

Attached the document!!

Thanks,
Gitu

Comment 3 Sergey Vladimirov 2011-07-24 18:55:07 UTC

The text extractor does extract all text from the document, but not from included OLE objects. Those objects can actually be other Word documents, Excel spreadsheets, and/or vector images.

There could be an enhancement to TextExtractor to allow extracting text from OLE objects, but the current behaviour is surely not a bug. Changing importance to "enhancement".
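
As a rough sketch of what such an enhancement could look like in principle, the toy model below makes extraction from embedded objects opt-in. It does not use POI: the `Doc` class and the `includeEmbedded` flag are hypothetical names invented for this illustration and are not claimed to match what was later committed.

```java
import java.util.ArrayList;
import java.util.List;

public class EmbeddedTextSketch {

    // Hypothetical document node: its own body text plus embedded child documents.
    static final class Doc {
        final String body;
        final List<Doc> embedded = new ArrayList<>();

        Doc(String body) {
            this.body = body;
        }

        Doc embed(Doc child) {
            embedded.add(child);
            return this;
        }
    }

    // Returns the body text; descends into embedded objects only when asked to.
    static String extract(Doc doc, boolean includeEmbedded) {
        StringBuilder out = new StringBuilder(doc.body);
        if (includeEmbedded) {
            for (Doc child : doc.embedded) {
                out.append('\n').append(extract(child, true));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Doc sheet = new Doc("sheet cell text").embed(new Doc("nested note"));
        Doc main = new Doc("main text").embed(sheet);
        System.out.println(extract(main, false)); // body text only
        System.out.println(extract(main, true));  // body plus embedded text
    }
}
```

Keeping the default at "body text only" mirrors the current behaviour described in this comment, so existing callers would see no change unless they opt in.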

Comment 4 Sergey Vladimirov 2011-08-09 12:43:12 UTC

Fixed/improved in r1155337; this will be part of the 3.8-beta4 release.