Bug 57847 - doc to html conversion does not create bullet points
Summary: doc to html conversion does not create bullet points
Alias: None
Product: POI
Classification: Unclassified
Component: HWPF
Version: 3.10-FINAL
Hardware: PC Mac OS X 10.4
Importance: P2 major
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2015-04-22 08:43 UTC by Madhava Kulkarni
Modified: 2015-04-24 05:36 UTC

Document has some style and bullets to check (32.00 KB, application/msword)
2015-04-22 08:43 UTC, Madhava Kulkarni

Description Madhava Kulkarni 2015-04-22 08:43:59 UTC
Created attachment 32676
Document has some style and bullets to check

When a document is converted to HTML using WordToHtmlConverter, the bullet points are not emitted as UL/LI elements in the HTML output.
As a result, when the HTML is pasted into an editor such as TinyMCE, the items do not look like bullet points. The bullet marker chosen in the document also changes in the HTML version.
Comment 1 Madhava Kulkarni 2015-04-22 10:43:26 UTC
Basically, the following code converts the unordered list into paragraph elements.
It also somehow picks up the wrong bullet character:


String label = AbstractWordUtils.getBulletText(
        numberingState, hwpfList,
        (char) paragraph.getIlvl() );
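
For comparison, here is a minimal sketch of the output shape I would expect: consecutive list paragraphs wrapped in one UL, each as an LI, instead of plain paragraphs with a text label. This is not POI code. The Para class and toHtml method are hypothetical stand-ins, and the sketch assumes the caller already knows which paragraphs belong to a list. It also ignores nesting levels and numbered lists.

```java
import java.util.ArrayList;
import java.util.List;

public class BulletListSketch {

    /** Hypothetical stand-in for a converted paragraph; not a POI type. */
    static final class Para {
        final String text;
        final boolean inList; // assumption: supplied by the caller

        Para(String text, boolean inList) {
            this.text = text;
            this.inList = inList;
        }
    }

    /** Emits <p> for normal paragraphs and groups list paragraphs into <ul><li>. */
    static String toHtml(List<Para> paras) {
        StringBuilder html = new StringBuilder();
        boolean listOpen = false;
        for (Para p : paras) {
            if (p.inList && !listOpen) {
                html.append("<ul>");   // first item of a new run of list paragraphs
                listOpen = true;
            } else if (!p.inList && listOpen) {
                html.append("</ul>");  // the run of list paragraphs has ended
                listOpen = false;
            }
            html.append(p.inList ? "<li>" : "<p>")
                .append(escape(p.text))
                .append(p.inList ? "</li>" : "</p>");
        }
        if (listOpen) {
            html.append("</ul>");      // document ended while still inside a list
        }
        return html.toString();
    }

    /** Escapes the characters that would otherwise break the markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        List<Para> paras = new ArrayList<>();
        paras.add(new Para("Intro", false));
        paras.add(new Para("First", true));
        paras.add(new Para("Second", true));
        paras.add(new Para("Outro", false));
        // prints <p>Intro</p><ul><li>First</li><li>Second</li></ul><p>Outro</p>
        System.out.println(toHtml(paras));
    }
}
```

With markup like this, editors such as TinyMCE would render their own bullets, so the label text from getBulletText would not be needed for unordered lists.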
Comment 2 Nick Burch 2015-04-24 01:19:04 UTC
For slightly complicated reasons, we have two different .doc -> .html converters, one in the POI codebase (WordToHtmlConverter) and one in the Tika codebase (org.apache.tika.parser.microsoft.WordExtractor)

It'd be great if you could try the same file with Apache Tika and see whether that manages to get the lists out. (Grab the tika-app jar and run it with --html for a quick way to check.)

If Apache Tika does it right, we can hopefully bring over the logic to the AbstractWordConverter family of converters. If not, we can look to fix it in both at the same time!
Comment 3 Madhava Kulkarni 2015-04-24 05:36:25 UTC
Tika did not work here either. It exposed another bug, this time in Tika, where the bullets are removed altogether.
I ran it like this:
java -jar tika-app-1.8.jar --html ~/Documents/sample.doc  > test.html
The resulting HTML has no bullets at all.
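
To compare the output of the two converters without reading the HTML by eye, a small check along these lines can count list elements in a generated file. It is only a rough sketch: ListMarkupCheck and countListTags are hypothetical names, and the regex simply looks for opening ul, ol, and li tags rather than parsing the HTML.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ListMarkupCheck {

    // Matches the start of an opening <ul>, <ol> or <li> tag, in any letter case.
    // The word boundary keeps tags such as <link> from being counted.
    private static final Pattern LIST_TAG =
            Pattern.compile("<(ul|ol|li)\\b", Pattern.CASE_INSENSITIVE);

    /** Returns the number of opening list tags found in the given HTML text. */
    static int countListTags(String html) {
        Matcher m = LIST_TAG.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java ListMarkupCheck test.html
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        String html = new String(bytes, StandardCharsets.UTF_8);
        System.out.println("list tags found: " + countListTags(html));
    }
}
```

A count of zero for a document that contains bullets would confirm that the converter did not produce any list markup.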