Bug 49541 - Mapping of symbol characters to unicode equivalent
Summary: Mapping of symbol characters to unicode equivalent
Alias: None
Product: POI
Classification: Unclassified
Component: HSLF
Version: 3.6-FINAL
Hardware: PC Windows XP
Importance: P2 normal
Target Milestone: ---
Assignee: POI Developers List
Depends on:
Reported: 2010-07-02 06:55 UTC by Piotr Lipski
Modified: 2014-12-29 19:48 UTC

test case (306.50 KB, application/vnd.ms-powerpoint)
2010-07-02 06:55 UTC, Piotr Lipski

Description Piotr Lipski 2010-07-02 06:55:39 UTC
Created attachment 25685
test case


I tried to extract the text from the attached PPT file. I get '75 years' instead of '≥75 years'.

Piotr Lipski
Comment 1 Nick Burch 2010-07-02 08:20:23 UTC
This could well be a case of Microsoft making up their own code points for stuff.

Could you please confirm which character number is used in the file (use org.apache.poi.poifs.dev.POIFSViewer or similar to track it down), and then confirm which Unicode code point your character should actually be?
Comment 2 Piotr Lipski 2010-07-02 08:26:57 UTC
It should be \u2265; instead I get \uf0b3.
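
For anyone wanting to check this kind of thing on their own extracted text, here is a minimal sketch that lists the code points of a string in hex. The class and method names (CodePointDump, dumpCodePoints) are made up for illustration and are not part of POI.

```java
import java.util.stream.Collectors;

// Hypothetical helper, not part of POI: lists the code points of a string
// so that private-use characters such as U+F0B3 become visible.
class CodePointDump {

    // Returns the code points of the text as "U+XXXX" tokens separated by spaces.
    static String dumpCodePoints(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The first character is the private-use code point reported in this bug.
        System.out.println(dumpCodePoints("\uf0b375 years"));
    }
}
```

A code point in the U+F000..U+F0FF range usually means the character comes from a symbol font: U+F0B3 is 0xF000 + 0xB3, and 0xB3 is the greater-than-or-equal sign in the Symbol font encoding.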
Comment 3 Andreas Beeker 2014-12-29 19:48:07 UTC
I've added the method StringUtil.mapMsCodepointString(), which converts the symbol characters to their Unicode equivalents.
To keep the strings in sync with the binary representation, I've decided not to make this the default in TextBox.getText() & Co.
Applied in r1648415.
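
To illustrate the general idea of such a mapping, here is a self-contained sketch. It is not the code from r1648415: the class name, the method name, and the one-entry table are assumptions for illustration only, and a real table would cover many more symbol characters.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only, not POI's implementation.
class SymbolCodepointSketch {

    // Hypothetical, deliberately tiny table: private-use code point -> Unicode equivalent.
    private static final Map<Integer, Integer> SYMBOL_TO_UNICODE = new HashMap<>();
    static {
        // U+F0B3 (Symbol font 0xB3) -> U+2265 GREATER-THAN OR EQUAL TO, the case from this bug
        SYMBOL_TO_UNICODE.put(0xF0B3, 0x2265);
    }

    // Replaces every code point found in the table; all other characters pass through unchanged.
    static String mapSymbolChars(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints()
            .forEach(cp -> sb.appendCodePoint(SYMBOL_TO_UNICODE.getOrDefault(cp, cp)));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Raw extracted text as reported in this bug, with the private-use character in front.
        System.out.println(mapSymbolChars("\uf0b375 years"));
    }
}
```

With POI itself, the equivalent step would be to pass the extracted text through StringUtil.mapMsCodepointString() explicitly, since the text getters do not apply the mapping by default.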