|Summary:||ValueIndexer for Multi-Byte Numeric Types Broken|
|Product:||Xindice||Reporter:||Terry Rosenbaum <Terry.Rosenbaum>|
|Component:||DB Engine||Assignee:||Xindice Developers <xindice-dev>|
|Version:||cvs head (1.1)|
Indexer patch for numerical values
Rebuild collection/index utility
Description Terry Rosenbaum 2003-04-21 21:49:58 UTC
I've discovered that defining a ValueIndexer with multi-byte numeric types (e.g. long, int, short, etc.) does not work correctly when resolving relational comparisons other than equals (e.g. can produce incorrect results for gretater than, less than, etc.). The reason is that all data is handled as anonymous byte arrays using the Value class and the Value class has no type associated with the data. Thus, the Value.compareTo method is not able to properly perform comparisons for such data. It performs byte-by-byte comparisons of the data. As an example, ValueIndexer encodes the value 1049903940000 as the bytes 0, 0, 0, -12, 115, 38, -63, -96 and encodes the value 1050687000291 as the bytes 0, 0, 0, -12, -95, -45, 78, -29. Thus, when comparing a Value representing 1049903940000 to a Value representing 1050687000291 (in pseudocode: Value(1049903940000).compareTo(1050687000291)), the incorrect result of 5 is returned indicating that 1049903940000 is greater than 1050687000291 (which is incorrect). The problem can be encountered during XPath comparisons if a ValueIndexer is defined as a numeric indexer for an attribute or element. e.g. if you define an index to index the Time attribute as type long: <Bug Time="1050687000291" /> <index name="BugIndex" class="org.apache.xindice.core.indexer.ValueIndexer" pattern="Bug@Time" type="long" /> And use an XPath like /Bug[(1049903940000 < @Time) and (1050687000291 > @Time)] you may see incorrect query results. Part of the problem is in the code of Value.compareTo(Value): short s1 = (short)(b1 >>> 0); short s2 = (short)(b2 >>> 0); return s1 > s2 ? (i+1) : -(i+1); This code attempts to shed the sign bits via ">>> 0" which has no effect. The result is that s1 and s2 are sign-extended versions of b1 and b2 and the comparison becomes a signed comparison rather than the desired unsigned comparison. This can be fixed by using: int i1 = (int)b1 & 0xff; int i2 = (int)b2 & 0xff; return i1 > i2 ? (i+1) : -(i+1); But, this only fixes the problem for 0 and positive values being compared. It does not work for the entire domain of numeric values. One possible way to fix this would be to add a type field to the Value object and perform conversion based on type prior to performing comparisons. A drawback of this approach would be that exisiting persisted Value objects (e.g. in existing databases) would be rendered incompatible.
Comment 1 Terry Rosenbaum 2003-06-05 07:39:08 UTC
Another possible way to fix this might be through the use of a comparator function. One might alter IndexQuery to allow the setting of a comparator for Value objects. The ValueIndexer could set the comparator to the proper comparator for the index type (string, integer, long, boolean, etc.) in its queryMatches method. This might be a feasible solution allowing backwards compatibility with existing persisted databases.
Comment 2 Kevin Ross 2003-06-17 04:35:35 UTC
As was discussed in a recent thread on the mail list, backwards compatibility means an export and import, so we can consider your solution, or that may untie your hands as far as the 'best' solution goes.
Comment 3 Natalia Shilenkova 2006-11-19 11:44:52 UTC
Created attachment 19143 [details] Patch Here is the possible solution - to encode values depending on type of indexer in the way they can be compared as byte arrays. Please note that it does not fix existing indexers, so indexes have to be recreated.
Comment 4 Natalia Shilenkova 2006-11-19 11:53:52 UTC
One more thing, in the patch I changed the way float/double are handled, instead of storing only 4 digits after the dot, it will store full range of values allowed for float/double. Indexer types short, integer, and long will always use 8 bytes to store the value, and both float and double use 8 bytes. Does it make a sense to change it so value takes exact amount of space - 2 byte for short, 4 bytes for integer and so on?
Comment 6 Vadim Gritsenko 2006-11-26 14:54:44 UTC
(In reply to comment #3) > Here is the possible solution - to encode values depending on type of indexer > in the way they can be compared as byte arrays. Please note that it does not > fix existing indexers, so indexes have to be recreated. This patch changes the way how Values are compared, and as a result all collections using BTree filer (not only indexes) have to be re-created. Can existing import/export be a solution for an upgrade, or do we need an upgrade utility?
Comment 7 Natalia Shilenkova 2006-12-27 18:03:16 UTC
Created attachment 19309 [details] Indexer patch for numerical values
Comment 9 Natalia Shilenkova 2006-12-27 18:10:29 UTC
Created attachment 19311 [details] Rebuild collection/index utility
Comment 10 Vadim Gritsenko 2007-01-30 17:09:12 UTC
Looks good, patch applied (with minor changes). However, some more work is still needed on (how it is now called) DatabaseRebuild utility. Need some shell script so it is more convenient to use. Would be nice to have single step process, with copy & re-indexing done in one step. Better usage documentation. Looking forward to new patches! :)
Comment 11 Vadim Gritsenko 2007-01-30 17:09:57 UTC
(closing this bug as indexing is working now. pls create new issue for any follow up patches).