Bug 57847

Summary: doc to html conversion does not create bullet points
Product: POI
Reporter: Madhava Kulkarni <madhavabk>
Component: HWPF
Assignee: POI Developers List <dev>
Status: NEEDINFO ---    
Severity: major    
Priority: P2    
Version: 3.10-FINAL   
Target Milestone: ---   
Hardware: PC   
OS: Mac OS X 10.4   
Attachments: Document has some style and bullets to check

Description Madhava Kulkarni 2015-04-22 08:43:59 UTC
Created attachment 32676
Document has some style and bullets to check

When a document is converted to HTML using WordToHtmlConverter, the bullet points are not emitted as UL/LI elements in the HTML output.
As a result, when the HTML is pasted into an editor such as TinyMCE, the items do not look like bullet points. The bullet marker chosen in the document is also changed in the HTML version.
Comment 1 Madhava Kulkarni 2015-04-22 10:43:26 UTC
Basically, the following code converts the unordered list into paragraph elements.
Also, it somehow picks up the wrong bullet character.

AbstractWordConverter.java:1094

String label = AbstractWordUtils.getBulletText(
        numberingState, hwpfList,
        (char) paragraph.getIlvl() );
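
To make the expected output concrete, here is a minimal sketch of grouping list paragraphs into nested UL/LI elements instead of emitting one paragraph per item with a text label. It is not POI code: `ListHtmlSketch`, `Para` and `toHtml` are hypothetical names, and the `level` field only assumes that a converter has a per-paragraph list level available (something like the value returned by `paragraph.getIlvl()`), with -1 standing for "not a list item".

```java
import java.util.Arrays;
import java.util.List;

public class ListHtmlSketch {

    /** Hypothetical stand-in for a Word paragraph; not a POI class. */
    public static final class Para {
        final int level;    // -1 = plain paragraph, otherwise 0-based list level
        final String text;

        public Para(int level, String text) {
            this.level = level;
            this.text = text;
        }
    }

    /** Minimal HTML escaping for text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Emits <p> for plain paragraphs and nested <ul>/<li> for list items. */
    public static String toHtml(List<Para> paras) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // number of <ul> elements currently open
        for (Para p : paras) {
            int wanted = Math.max(0, p.level + 1); // <ul> depth this paragraph needs
            // Close any lists deeper than this paragraph needs.
            while (depth > wanted) {
                out.append("</li></ul>");
                depth--;
            }
            if (wanted == 0) {
                out.append("<p>").append(escape(p.text)).append("</p>");
                continue;
            }
            if (depth == wanted) {
                // Sibling item at the same level: close the previous <li>.
                out.append("</li>");
            } else {
                // Deeper level: open nested lists inside the still-open <li>.
                while (depth < wanted) {
                    out.append("<ul>");
                    depth++;
                    if (depth < wanted) {
                        out.append("<li>"); // wrapper item for a skipped level
                    }
                }
            }
            out.append("<li>").append(escape(p.text));
        }
        // Close whatever is still open at the end of the document.
        while (depth > 0) {
            out.append("</li></ul>");
            depth--;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Para> paras = Arrays.asList(
                new Para(-1, "Intro"),
                new Para(0, "a"),
                new Para(1, "b"),
                new Para(0, "c"),
                new Para(-1, "End"));
        System.out.println(toHtml(paras));
    }
}
```

For the sample input in `main`, this prints `<p>Intro</p><ul><li>a<ul><li>b</li></ul></li><li>c</li></ul><p>End</p>`, which is the kind of markup an editor such as TinyMCE renders as real bullets. Ordered lists and the choice of bullet marker are left out of the sketch.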
Comment 2 Nick Burch 2015-04-24 01:19:04 UTC
For slightly complicated reasons, we have two different .doc -> .html converters: one in the POI codebase (WordToHtmlConverter) and one in the Tika codebase (org.apache.tika.parser.microsoft.WordExtractor).

It'd be great if you could try the same file with Apache Tika and see whether it manages to get the lists out. (Grab the tika-app jar and run it with --html for a quick way to check.)

If Apache Tika does it right, we can hopefully bring over the logic to the AbstractWordConverter family of converters. If not, we can look to fix it in both at the same time!
Comment 3 Madhava Kulkarni 2015-04-24 05:36:25 UTC
Tika did not work here either. It exposed a separate Tika bug: the bullets are removed entirely.
I ran it like this:
java -jar tika-app-1.8.jar --html ~/Documents/sample.doc  > test.html
The resulting test.html had no bullets at all.