Class CharsToNameCanonicalizer
For optimal performance, usage pattern should be one where matches
should be very common (especially after "warm-up"), and as with most hash-based
maps/sets, that hash codes are uniformly distributed. Also, collisions
are slightly more expensive than with HashMap or HashSet, since hash codes
are not used in resolving collisions; that is, equals() comparison is
done with all symbols in same bucket index.
Finally, rehashing is also more expensive, as hash codes are not
stored; rehashing requires all entries' hash codes to be recalculated.
Reason for not storing hash codes is reduced memory usage, hoping
for better memory locality.
Usual usage pattern is to create a single "master" instance, and either use that instance in sequential fashion, or to create derived "child" instances, which after use, are asked to return possible symbol additions to master instance. In either case benefit is that symbol table gets initialized so that further uses are more efficient, as eventually all symbols needed will already be in symbol table. At that point no more Symbol String allocations are needed, nor changes to symbol table itself.
Note that while individual SymbolTable instances are NOT thread-safe (much like generic collection classes), concurrently used "child" instances can be freely used without synchronization. However, using master table concurrently with child instances can only be done if access to master instance is read-only (i.e. no modifications done).
-
Nested Class Summary
Nested ClassesModifier and TypeClassDescription(package private) static final classThis class is a symbol table entry.private static final classImmutable value class used for sharing information as efficiently as possible, by only require synchronization of reference manipulation but not access to contents. -
Field Summary
FieldsModifier and TypeFieldDescriptionprotected CharsToNameCanonicalizer.Bucket[]Overflow buckets; if primary doesn't match, lookup is done from here.protected booleanWhether any canonicalization should be attempted (whether using intern or not.protected final intFeature flags ofTokenStreamFactorythat uses this canonicalizer.protected booleanFlag that indicates whether underlying data structures for the main hash area are shared or not.protected intMask used to get index from hash values; equal to_buckets.length - 1, when _buckets.length is a power of two.protected intWe need to keep track of the longest collision list; this is needed both to indicate problems with attacks and to allow flushing for other cases.protected BitSetLazily constructed structure that is used to keep track of collision buckets that have overflowed once: this is used to detect likely attempts at denial-of-service attacks that uses hash collisions.protected final CharsToNameCanonicalizerSharing of learnt symbols is done by optional linking of symbol table instances with their parents.protected final intSeed value we use as the base to make hash codes non-static between different runs, but still stable for lifetime of a single symbol table instance.protected intCurrent size (number of entries); needed to know if and when rehash.protected intLimit that indicates maximum size this instance can hold before it needs to be expanded and rehashed.protected final StreamReadConstraintsConstraints used byTokenStreamFactorythat uses this canonicalizer.protected String[]Primary matching symbols; it's expected most match occur from here.protected final AtomicReference<CharsToNameCanonicalizer.TableInfo> Member that is only used by the root table instance: root passes immutable state info child instances, and children may return new state if they add entries to the table.private static final intDefault initial table size.static final int(package private) static final intAlso: to thwart attacks based on hash collisions (which may or may not be cheap to calculate), we will need to detect "too long" collision chains.static final intLet's only share reasonably sized symbol tables.private static final intLet's not expand symbol tables past some maximum size; this should protected against OOMEs caused by large documents with unique (~= random) names. -
Constructor Summary
ConstructorsModifierConstructorDescriptionprivateCharsToNameCanonicalizer(StreamReadConstraints src, int factoryFeatures, int seed) Main method for constructing a root symbol table instance.privateCharsToNameCanonicalizer(CharsToNameCanonicalizer parent, StreamReadConstraints src, int factoryFeatures, int seed, CharsToNameCanonicalizer.TableInfo parentState) Internal constructor used when creating child instances. -
Method Summary
Modifier and TypeMethodDescriptionprivate String_addSymbol(char[] buffer, int start, int len, int h, int index) private String_findSymbol2(char[] buffer, int start, int len, CharsToNameCanonicalizer.Bucket b) private void_handleSpillOverflow(int bucketIndex, CharsToNameCanonicalizer.Bucket newBucket, int mainIndex) Method called when an overflow bucket has hit the maximum expected length: this may be a case of DoS attack.int_hashToIndex(int rawHash) Helper method that takes in a "raw" hash value, shuffles it as necessary, and truncates to be used as the index.protected void_reportTooManyCollisions(int maxLen) private static int_thresholdSize(int hashAreaSize) intMethod for checking number of primary hash buckets this symbol table uses.intcalcHash(char[] buffer, int start, int len) Implementation of a hashing method for variable length Strings.intintMethod mostly needed by unit tests; calculates number of entries that are in collision list.private voidMethod called when copy-on-write is needed; generally when first change is made to a derived symbol table.static CharsToNameCanonicalizercreateRoot(TokenStreamFactory owner) instance.static CharsToNameCanonicalizercreateRoot(TokenStreamFactory owner, int seed) findSymbol(char[] buffer, int start, int len, int h) inthashSeed()"Factory" method; will create a new child instance of this symbol table.intMethod mostly needed by unit tests; calculates length of the longest collision chain.booleanprivate voidmergeChild(CharsToNameCanonicalizer.TableInfo childState) Method that allows contents of child table to potentially be "merged in" with contents of this symbol table.private voidrehash()Method called when size (number of entries) of symbol table grows so big that load factor is exceeded.voidrelease()Method called by the using code to indicate it is done with this instance.intsize()void
-
Field Details
-
HASH_MULT
public static final int HASH_MULT- See Also:
-
DEFAULT_T_SIZE
private static final int DEFAULT_T_SIZEDefault initial table size. Shouldn't be miniscule (as there's cost to both array realloc and rehashing), but let's keep it reasonably small. For systems that properly reuse factories it doesn't matter either way; but when recreating factories often, initial overhead may dominate.- See Also:
-
MAX_T_SIZE
private static final int MAX_T_SIZELet's not expand symbol tables past some maximum size; this should protected against OOMEs caused by large documents with unique (~= random) names.- See Also:
-
MAX_ENTRIES_FOR_REUSE
public static final int MAX_ENTRIES_FOR_REUSELet's only share reasonably sized symbol tables. Max size set to 3/4 of 16k; this corresponds to 64k main hash index. This should allow for enough distinct names for almost any case.- See Also:
-
MAX_COLL_CHAIN_LENGTH
static final int MAX_COLL_CHAIN_LENGTHAlso: to thwart attacks based on hash collisions (which may or may not be cheap to calculate), we will need to detect "too long" collision chains.Note: longest chain we have been able to produce without malicious intent has been 38 (with "tools.jackson.core.main.TestWithTonsaSymbols"); our setting should be reasonable here.
- See Also:
-
_parent
Sharing of learnt symbols is done by optional linking of symbol table instances with their parents. When parent linkage is defined, and child instance is released (call torelease), parent's shared tables may be updated from the child instance. -
_tableInfo
Member that is only used by the root table instance: root passes immutable state info child instances, and children may return new state if they add entries to the table. Child tables do NOT use the reference. -
_streamReadConstraints
Constraints used byTokenStreamFactorythat uses this canonicalizer.- Since:
- 2.16
-
_seed
protected final int _seedSeed value we use as the base to make hash codes non-static between different runs, but still stable for lifetime of a single symbol table instance. This is done for security reasons, to avoid potential DoS attack via hash collisions. -
_factoryFeatures
protected final int _factoryFeaturesFeature flags ofTokenStreamFactorythat uses this canonicalizer. -
_canonicalize
protected boolean _canonicalizeWhether any canonicalization should be attempted (whether using intern or not.NOTE: non-final since we may need to disable this with overflow.
-
_symbols
Primary matching symbols; it's expected most match occur from here. -
_buckets
Overflow buckets; if primary doesn't match, lookup is done from here.Note: Number of buckets is half of number of symbol entries, on assumption there's less need for buckets.
-
_size
protected int _sizeCurrent size (number of entries); needed to know if and when rehash. -
_sizeThreshold
protected int _sizeThresholdLimit that indicates maximum size this instance can hold before it needs to be expanded and rehashed. Calculated using fill factor passed in to constructor. -
_indexMask
protected int _indexMaskMask used to get index from hash values; equal to_buckets.length - 1, when _buckets.length is a power of two. -
_longestCollisionList
protected int _longestCollisionListWe need to keep track of the longest collision list; this is needed both to indicate problems with attacks and to allow flushing for other cases. -
_overflows
Lazily constructed structure that is used to keep track of collision buckets that have overflowed once: this is used to detect likely attempts at denial-of-service attacks that uses hash collisions.
-
-
Constructor Details
-
CharsToNameCanonicalizer
Main method for constructing a root symbol table instance. -
CharsToNameCanonicalizer
private CharsToNameCanonicalizer(CharsToNameCanonicalizer parent, StreamReadConstraints src, int factoryFeatures, int seed, CharsToNameCanonicalizer.TableInfo parentState) Internal constructor used when creating child instances.
-
-
Method Details
-
_thresholdSize
private static int _thresholdSize(int hashAreaSize) -
createRoot
instance. Root instance is never used directly; its main use is for storing and sharing underlying symbol arrays as needed.- Parameters:
owner- Factory that will use the root instance; used for accessing configuration- Returns:
- Root instance to use for constructing new child instances
-
createRoot
-
makeChild
"Factory" method; will create a new child instance of this symbol table. It will be a copy-on-write instance, ie. it will only use read-only copy of parent's data, but when changes are needed, a copy will be created.Note: It is generally not safe to both use makeChild/mergeChild, AND to use instance actively. Instead, a separate 'root' instance should be used on which only makeChild/mergeChild are called, but instance itself is not used as a symbol table.
- Returns:
- Actual canonicalizer instance that can be used by a parser
-
release
public void release()Method called by the using code to indicate it is done with this instance. This lets instance merge accumulated changes into parent (if need be), safely and efficiently, and without calling code having to know about parent information. -
mergeChild
Method that allows contents of child table to potentially be "merged in" with contents of this symbol table.Note that caller has to make sure symbol table passed in is really a child or sibling of this symbol table.
-
size
public int size()- Returns:
- Number of symbol entries contained by this canonicalizer instance
-
bucketCount
public int bucketCount()Method for checking number of primary hash buckets this symbol table uses.- Returns:
- number of primary slots table has currently
-
maybeDirty
public boolean maybeDirty() -
hashSeed
public int hashSeed() -
collisionCount
public int collisionCount()Method mostly needed by unit tests; calculates number of entries that are in collision list. Value can be at most (size()- 1), but should usually be much lower, ideally 0.- Returns:
- Number of collisions in the primary hash area
- Since:
- 2.1
-
maxCollisionLength
public int maxCollisionLength()Method mostly needed by unit tests; calculates length of the longest collision chain. This should typically be a low number, but may be up tosize()- 1 in the pathological case- Returns:
- Length of the collision chain
- Since:
- 2.1
-
findSymbol
-
_findSymbol2
-
_addSymbol
-
_handleSpillOverflow
private void _handleSpillOverflow(int bucketIndex, CharsToNameCanonicalizer.Bucket newBucket, int mainIndex) Method called when an overflow bucket has hit the maximum expected length: this may be a case of DoS attack. Deal with it based on settings by either clearing up bucket (to avoid indefinite expansion) or throwing exception. Currently, the first overflow for any single bucket DOES NOT throw an exception, only second time (per symbol table instance) -
_hashToIndex
public int _hashToIndex(int rawHash) Helper method that takes in a "raw" hash value, shuffles it as necessary, and truncates to be used as the index.- Parameters:
rawHash- Raw hash value to use for calculating index- Returns:
- Index value calculated
-
calcHash
public int calcHash(char[] buffer, int start, int len) Implementation of a hashing method for variable length Strings. Most of the time intention is that this calculation is done by caller during parsing, not here; however, sometimes it needs to be done for parsed "String" too.- Parameters:
buffer- Input buffer that contains name to decodestart- Pointer to the first character of the namelen- Length of String; has to be at least 1 (caller guarantees)- Returns:
- Hash code calculated
-
calcHash
-
copyArrays
private void copyArrays()Method called when copy-on-write is needed; generally when first change is made to a derived symbol table. -
rehash
private void rehash()Method called when size (number of entries) of symbol table grows so big that load factor is exceeded. Since size has to remain power of two, arrays will then always be doubled. Main work is really redistributing old entries into new String/Bucket entries. -
_reportTooManyCollisions
protected void _reportTooManyCollisions(int maxLen) -
verifyInternalConsistency
public void verifyInternalConsistency()
-