Created attachment 32676
Document has some style and bullets to check

When a document is converted to HTML using WordToHtmlConverter, the bullet points are not emitted as UL/LI elements in the HTML output. Because of this, when the HTML is pasted into an editor such as TinyMCE, the items do not look like bullet points. The bullet marker chosen in the document is also changed in the HTML version.
Basically, the following code converts unordered lists into paragraph elements. It also somehow picks up the wrong bullet character.

AbstractWordConverter.java:1094
String label = AbstractWordUtils.getBulletText( numberingState, hwpfList, (char) paragraph.getIlvl() );
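For illustration, here is a minimal sketch of the grouping step that appears to be missing: consecutive list paragraphs are wrapped in nested UL/LI elements instead of being emitted as plain paragraphs with a text label. It uses no POI classes. Para, ListGroupingSketch and toHtml are hypothetical names, and the sketch assumes each paragraph's list membership and level (ilvl) are already known.

```java
import java.util.Arrays;
import java.util.List;

class ListGroupingSketch {

    /** Hypothetical stand-in for a converted paragraph; not a POI type. */
    static final class Para {
        final String text;
        final boolean listItem;
        final int level; // zero-based list level, as ilvl is; ignored for plain paragraphs

        Para(String text, boolean listItem, int level) {
            this.text = text;
            this.listItem = listItem;
            this.level = level;
        }
    }

    /** Minimal escaping for text content; '&' must be replaced first. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Emits plain paragraphs as P and runs of list paragraphs as nested UL/LI. */
    static String toHtml(List<Para> paras) {
        StringBuilder out = new StringBuilder();
        int depth = 0; // number of UL elements open; each has one open LI
        for (Para p : paras) {
            int target = p.listItem ? p.level + 1 : 0;
            if (target == 0) {
                // A plain paragraph ends every open list.
                while (depth > 0) {
                    out.append("</li></ul>");
                    depth--;
                }
                out.append("<p>").append(escape(p.text)).append("</p>");
                continue;
            }
            // Close lists that are deeper than this item.
            while (depth > target) {
                out.append("</li></ul>");
                depth--;
            }
            if (depth == target) {
                // Sibling item at the same level.
                out.append("</li><li>");
            } else {
                // Open nested lists inside the current item until the level matches.
                while (depth < target) {
                    out.append("<ul><li>");
                    depth++;
                }
            }
            out.append(escape(p.text));
        }
        // Close whatever is still open at the end of the document.
        while (depth > 0) {
            out.append("</li></ul>");
            depth--;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Para> paras = Arrays.asList(
                new Para("Intro", false, 0),
                new Para("A", true, 0),
                new Para("B", true, 1),
                new Para("C", true, 0),
                new Para("End", false, 0));
        System.out.println(toHtml(paras));
    }
}
```

In the converter itself, the equivalent decision would presumably be made where the paragraph label is currently produced, but that placement is an assumption and has not been checked against the POI sources.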
For slightly complicated reasons, we have two different .doc -> .html converters: one in the POI codebase (WordToHtmlConverter) and one in the Tika codebase (org.apache.tika.parser.microsoft.WordExtractor).

If you could, it'd be great to try the same file with Apache Tika and see if that manages to get the lists out. (Grab the tika-app jar and run it with --html for a quick way to check.)

If Apache Tika does it right, we can hopefully bring the logic over to the AbstractWordConverter family of converters. If not, we can look to fix it in both at the same time!
Tika did not work here either, and it exposed another Tika bug: the bullets are removed altogether. I ran it like this:

java -jar tika-app-1.8.jar --html ~/Documents/sample.doc > test.html