Bug 56563

Summary: Multithreading bug when reading 2 similar files
Product: POI Reporter: Uri Sherman <urishe>
Component: HSSFAssignee: POI Developers List <dev>
Status: RESOLVED FIXED    
Severity: major CC: seebass
Priority: P2    
Version: 3.10-FINAL   
Target Milestone: ---   
Hardware: All   
OS: All   
Attachments: A file for which the problem reproduces

Description Uri Sherman 2014-05-26 07:54:00 UTC
Created attachment 31660 [details]
A file for which the problem reproduces

When reading two copies of the same file (doesn't necessarily need to be the exact same file, but do need to contain the same styles or something along those lines),
from two threads simultaneously (each thread processes its own file), the format string of cell styles get mixed up.

The following code demonstrates this, printing to System.out every time a non date cell is mistakenly recognized as date. Note that starting only one of the threads yields no System.out messages.

public class CellFormatBugExample implements Runnable {

	public static void main(String[] args) {
		new Thread(new CellFormatBugExample("C:/temp/file_1.xls")).start();
		new Thread(new CellFormatBugExample("C:/temp/file_2.xls")).start();
	}

	String filePath;

	public CellFormatBugExample(String filePath) {
		this.filePath = filePath;
	}

	@Override
	public void run() {

		File inputFile = new File(filePath);
		try (FileInputStream stream = new FileInputStream(inputFile)) {

			Workbook wb = WorkbookFactory.create(stream);
			Sheet sheet = wb.getSheetAt(0);

			for (Row row : sheet) {
				for (Integer idxCell = 0; idxCell < row.getLastCellNum(); idxCell++) {

					Cell cell = row.getCell(idxCell);
					cell.getCellStyle().getDataFormatString();
					if (cell.getCellType() == HSSFCell.CELL_TYPE_NUMERIC) {
						boolean isDate = HSSFDateUtil.isCellDateFormatted(cell);
						if (idxCell > 0 && isDate) {
							System.out.println("cell " + idxCell + " is not a date!");
						}
					}
				}
			}
		}
		catch (Exception e) {
			e.printStackTrace();
		}
	}
}

Make another copy of the attached file - "file_2.xls" and run the code to reproduce.
Digging around a bit, seems the cause for this is a caching bug in the HSSFCellStyle.getDataFormatString() method.
As far as I could understand a simple resolution would be to synchronize access to that method.
Comment 1 Dominik Stadler 2014-05-26 19:15:21 UTC
This was introduced in Bug 55612, unfortunately only a lock on a static object will suffice unless we replace it with something a bit more sophisticated, e.g. a thread-local.
Comment 2 Dominik Stadler 2014-05-26 20:06:33 UTC
Fixed via r1597637, we now use ThreadLocals to keep the cache and thus avoid multi-threading issues.

There should not be a huge overhead per Thread as we keep a string, a short and a list of (int+boolean+string), which is replaced for every cache-entry, so it cannot grow unbounded.
Comment 3 Dominik Stadler 2014-05-26 20:17:27 UTC
*** Bug 56453 has been marked as a duplicate of this bug. ***
Comment 4 Uri Sherman 2014-05-27 06:57:14 UTC
A short suggestion - I saw you are trying to avoid synchronization on a static variable here.

Note that every call to DateUtil.isCellDateFormatted(a-numeric-cell) ends up in a call to DateUtil.isADateFormat() which itself also maintains a static cache of the last result and synchronizes on DateUtil.class.

In my case for example, the issue with the formatString arises by a call to DateUtil.isCellDateFormatted(cell), so I still get monitor locks upon each call even though you use thread locals for the style formatString cache. 

Seems like it would be a good idea to use the same strategy and switch the DateUtil.isADateFormat() cache to be ThreadLocal based rather than synchronization based.
Comment 5 Dominik Stadler 2014-06-05 20:15:57 UTC
Thanks for the note, I have moved this into a separate Bug 56595 as this one is resolved and the note concerns a different code-location.