Bug 65246 - Style is not correct when converting Word to HTML if the document contains complex tables
Status: RESOLVED CLOSED
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: unspecified
Hardware: PC All
Importance: P2 major
Target Milestone: ---
Assignee: POI Developers List
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-04-16 09:15 UTC by 陈起
Modified: 2021-04-17 15:05 UTC
0 users



Attachments
fixed AbstractWordConverter.java (46.67 KB, text/x-csrc)
2021-04-16 09:15 UTC, 陈起
demo-error (30.38 KB, application/x-zip-compressed)
2021-04-16 09:20 UTC, 陈起
(.doc) structure (109.62 KB, image/png)
2021-04-17 13:07 UTC, 陈起
(.html) structure (31.30 KB, image/png)
2021-04-17 13:08 UTC, 陈起

Description 陈起 2021-04-16 09:15:13 UTC
Created attachment 37811 [details]
fixed AbstractWordConverter.java

Converting a Word document that contains complex tables to HTML produces an incorrect result.
I traced the bug to the following places:
org.apache.poi.hwpf.converter.WordToHtmlConverter#processTable():678
org.apache.poi.hwpf.converter.AbstractWordConverter#getNumberRowsSpanned()

I have uploaded an example Word (.doc) file as an attachment, along with a demo and part of the fixed .java source.
Comment 1 陈起 2021-04-16 09:20:11 UTC
Created attachment 37812 [details]
demo-error
Comment 2 Andreas Beeker 2021-04-16 12:16:49 UTC
@Devs: Should we move the converter classes from scratchpad to examples?

Pro: they seem to be quite scratchy and I still remember the last time we had a CVE because of some XXE - although that was already in the examples

Scratchpad has its name for a reason, so moving the classes shouldn't need much discussion.

Con: users might be slow to adapt, and we could end up spending more time on the usual Stack Overflow questions and answers than on new feature requests.
Comment 3 PJ Fanning 2021-04-17 11:16:17 UTC
Makes sense to move to poi-examples.
Comment 4 陈起 2021-04-17 13:07:34 UTC
Created attachment 37813 [details]
(.doc) structure

This is the document structure in the (.doc) file.
Comment 5 陈起 2021-04-17 13:08:30 UTC
Created attachment 37814 [details]
(.html) structure

This is the structure of the (.html) document after conversion.
Comment 6 陈起 2021-04-17 13:10:17 UTC
Thanks for the developers' replies, but I don't think I fully understand them.
This error occurred in the production environment of our application. A user has a (.doc) document, which our back-end service converts to (.html) and then sends to the front-end page for display. The code we use is as follows:

WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
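
For readers following along, here is a minimal sketch of the DOM side of such a pipeline, using only the JDK. It does not call POI: the class HtmlDomSketch and its sample table are hypothetical stand-ins for the document that the converter would fill in, and the sample only illustrates how a vertically merged cell is expected to appear as a rowspan attribute once the DOM is serialized.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper, not part of POI: builds and serializes a small HTML DOM. */
final class HtmlDomSketch {

    /** Creates the same kind of empty DOM document that is handed to the converter. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException("Could not create DOM document", e);
        }
    }

    /** Stand-in for converter output: cell "A" spans two rows. */
    static Document sampleTable() {
        Document doc = newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element table = doc.createElement("table");
        doc.appendChild(html);
        html.appendChild(body);
        body.appendChild(table);

        Element row1 = doc.createElement("tr");
        Element merged = doc.createElement("td");
        merged.setAttribute("rowspan", "2"); // vertically merged cell
        merged.setTextContent("A");
        Element b = doc.createElement("td");
        b.setTextContent("B");
        row1.appendChild(merged);
        row1.appendChild(b);

        Element row2 = doc.createElement("tr");
        Element c = doc.createElement("td");
        c.setTextContent("C");
        row2.appendChild(c); // no cell under "A": the rowspan covers it

        table.appendChild(row1);
        table.appendChild(row2);
        return doc;
    }

    /** Serializes a DOM document to an HTML string with the JDK transformer. */
    static String toHtml(Document document) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "html");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(document), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize DOM document", e);
        }
    }
}
```

In a real conversion the table would come from the converter rather than from sampleTable(), so a wrong row-span count would show up directly as a wrong rowspan value in the serialized markup.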

However, the table structure in the displayed (.html) does not match the one in the (.doc) document. After checking that the (.doc) document itself is correct, I debugged the apache-poi source code and traced the problem to the following code:

org.apache.poi.hwpf.converter.WordToHtmlConverter#processTable()
org.apache.poi.hwpf.converter.AbstractWordConverter#getNumberRowsSpanned()

I found that something goes wrong when determining how many rows a cell spans. I uploaded sample code and screenshots showing the wrong results: attachment 2 [details] is my demo reproducing the error, attachment 3 [details] is the structure in the (.doc), and attachment 4 is the structure after conversion to (.html).
After that, I tried to fix the AbstractWordConverter#getNumberRowsSpanned() method and released the fix to our production environment, where it appears to work so far. Attachment 1 [details] above contains my attempted fix, but I am not sure whether it is correct or whether part of the bug remains.
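
To make the row-span question concrete, here is a minimal sketch of one way such a count could be computed. It is not the attached fix and not POI's implementation: SketchCell and RowSpanSketch are hypothetical names, the table model is a simplified stand-in for the HWPF classes, and matching cells by left edge rather than by cell index is an assumption about what helps when rows in a complex table have different numbers of cells.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified stand-in for a table cell; NOT the HWPF TableCell class. */
final class SketchCell {
    final int leftEdge;                  // horizontal offset of the cell's left border
    final boolean verticallyMerged;      // cell takes part in a vertical merge
    final boolean firstVerticallyMerged; // cell is the top cell of a vertical merge

    SketchCell(int leftEdge, boolean verticallyMerged, boolean firstVerticallyMerged) {
        this.leftEdge = leftEdge;
        this.verticallyMerged = verticallyMerged;
        this.firstVerticallyMerged = firstVerticallyMerged;
    }
}

/** Hypothetical row-span counter over the simplified model above. */
final class RowSpanSketch {

    /** Returns the cell in the row whose left edge equals the offset, or null. */
    static SketchCell cellAtEdge(List<SketchCell> row, int leftEdge) {
        for (SketchCell cell : row) {
            if (cell.leftEdge == leftEdge) {
                return cell;
            }
        }
        return null;
    }

    /**
     * Counts the rows covered by the cell that starts at leftEdge in row rowIndex.
     * Cells below are matched by left edge, so rows with extra or missing cells
     * to the right or left of the merged column do not shift the lookup.
     */
    static int rowsSpanned(List<List<SketchCell>> table, int rowIndex, int leftEdge) {
        SketchCell start = cellAtEdge(table.get(rowIndex), leftEdge);
        if (start == null || !start.firstVerticallyMerged) {
            return 1; // not the top of a vertical merge
        }
        int count = 1;
        for (int r = rowIndex + 1; r < table.size(); r++) {
            SketchCell below = cellAtEdge(table.get(r), leftEdge);
            // Stop at a missing cell, an unmerged cell, or the start of a new merge.
            if (below == null || !below.verticallyMerged || below.firstVerticallyMerged) {
                break;
            }
            count++;
        }
        return count;
    }

    /** Sample table: the first column is merged over rows 0 to 2; row sizes differ. */
    static List<List<SketchCell>> sampleTable() {
        return Arrays.asList(
            Arrays.asList(new SketchCell(0, true, true),
                          new SketchCell(100, false, false),
                          new SketchCell(200, false, false)),
            Arrays.asList(new SketchCell(0, true, false),
                          new SketchCell(100, false, false),
                          new SketchCell(150, false, false),
                          new SketchCell(200, false, false)),
            Arrays.asList(new SketchCell(0, true, false),
                          new SketchCell(200, false, false)),
            Arrays.asList(new SketchCell(0, false, false),
                          new SketchCell(100, false, false)));
    }
}
```

With the sample table, the merged cell at left edge 0 in row 0 is counted across three rows even though each row has a different number of cells, which is the kind of layout a purely index-based lookup can misread.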
Comment 7 Andreas Beeker 2021-04-17 14:32:09 UTC
Moving something to the examples means that we, i.e. the ones who are +1 for that move, think it is one of the added features that shouldn't be in the library in the first place.

A few of the examples are:
* converting from biff to ooxml formats or vice versa
* comparing documents
* converting from office to another format (PDF)

There are exceptions to this rule, like the rendering of slideshows to images, because I take care of that code and am interested in its further development.

On the other hand, we have a lot of dead code that we usually just drag from release to release, and (converting from) HWPF, which no one actively develops, is especially a problem.

Therefore I've asked whether the other few PMC members have the same intention.

Once it's in the examples, we might extend it, but basically the examples are just a template for your own development.
Comment 8 陈起 2021-04-17 15:05:29 UTC
Thank you for your explanation; I think I understand now. You mean that this fix should not go into the release version right away, because you don't have much capacity to develop HWPF. If this is only an isolated case, then at best it should go into the examples library. So what should we do now? I haven't worked out whether this use case belongs in the examples library. If it were put there, it doesn't seem it would be of much use, so I think it's better not to do that.