Class ByteQuadsCanonicalizer

java.lang.Object
tools.jackson.core.sym.ByteQuadsCanonicalizer

public class ByteQuadsCanonicalizer extends Object
Replacement for BytesToNameCanonicalizer which aims at more localized memory access due to flattening of name quad data. Performance improvement modest for simple JSON document data binding (maybe 3%), but should help more for larger symbol tables, or for binary formats like Smile.

Hash area is divided into 4 sections:

  1. Primary area (1/2 of total size), direct match from hash (LSB)
  2. Secondary area (1/4 of total size), match from hash (LSB) >> 1
  3. Tertiary area (1/8 of total size), match from hash (LSB) >> 2
  4. Spill-over area (remaining 1/8) with linear scan, insertion order
and within every area, entries are 4 ints, where 1 - 3 ints contain 1 - 12 UTF-8 encoded bytes of name (null-padded), and last int is offset in _names that contains actual name Strings.
  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    private static final class 
    Immutable value class used for sharing information as efficiently as possible, by only require synchronization of reference manipulation but not access to contents.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    protected int
    Total number of Strings in the symbol table; only used for child tables.
    protected final boolean
    Flag that indicates whether we should throw an exception if enough hash collisions are detected (true); or just worked around (false).
    protected int[]
    Primary hash information area: consists of 2 * _hashSize entries of 16 bytes (4 ints), arranged in a cascading lookup structure (details of which may be tweaked depending on expected rates of collisions).
    protected boolean
    Flag that indicates whether underlying data structures for the main hash area are shared or not.
    protected int
    Number of slots for primary entries within _hashArea; which is at most 1/8 of actual size of the underlying array (4-int slots, primary covers only half of the area; plus, additional area for longer symbols after hash area).
    protected final InternCache
    Entity that knows how to intern Strings, if needed, or null if no interning is wanted.
    protected int
    Offset within _hashArea that follows main slots and contains quads for longer names (13 bytes or longer), and points to the first available int that may be used for appending quads of the next long name.
    protected String[]
    Array that contains String instances matching entries in _hashArea.
    protected final ByteQuadsCanonicalizer
    Reference to the root symbol table, for child tables, so that they can merge table information back as necessary.
    protected int
    Offset within _hashArea where secondary entries start
    protected final int
    Seed value we use as the base to make hash codes non-static between different runs, but still stable for lifetime of a single symbol table instance.
    protected int
    Pointer to the offset within spill-over area where there is room for more spilled over entries (if any).
    Member that is only used by the root table instance: root passes immutable state info child instances, and children may return new state if they add entries to the table.
    protected int
    Constant that determines size of buckets for tertiary entries: 1 << _tertiaryShift is the size, and shift value is also used for translating from primary offset into tertiary bucket (shift right by 4 + _tertiaryShift).
    protected int
    Offset within _hashArea where tertiary entries start
    protected static final int
    Initial size of the primary hash area.
    protected static final int
    Let's only share reasonably sized symbol tables.
    private static final int
    Let's not expand symbol tables past some maximum size; with unique (~= random) names.
    private static final int
    No point in trying to construct tiny tables, just need to resize soon.
    private static final int
     
    private static final int
     
    private static final int
     
  • Constructor Summary

    Constructors
    Modifier
    Constructor
    Description
    protected
    ByteQuadsCanonicalizer(int sz, int seed)
    Constructor used for creating per-TokenStreamFactory "root" symbol tables: ones used for merging and sharing common symbols
    private
    Alternate constructor used in cases where a "placeholder" child instance is needed when symbol table is not really used, but caller needs a non-null placeholder to keep code functioning with minimal awareness of distinction (all lookups fail to match any name without error; add methods should NOT be called).
    private
    ByteQuadsCanonicalizer(ByteQuadsCanonicalizer parent, int seed, ByteQuadsCanonicalizer.TableInfo state, boolean intern, boolean failOnDoS)
    Constructor used when creating a child instance
  • Method Summary

    Modifier and Type
    Method
    Description
    private int
    _appendLongName(int[] quads, int qlen)
     
    private final int
    _calcOffset(int hash)
     
    (package private) static int
    _calcTertiaryShift(int primarySlots)
     
    private boolean
     
    private int
    Method called to find the location within hash table to add a new symbol in.
    private String
    _findSecondary(int origOffset, int q1)
     
    private String
    _findSecondary(int origOffset, int q1, int q2)
     
    private String
    _findSecondary(int origOffset, int hash, int[] q, int qlen)
     
    private String
    _findSecondary(int origOffset, int q1, int q2, int q3)
     
    protected void
     
    private int
     
    private final int
    Helper method that calculates start of the spillover area
    private <T> T
     
    private boolean
    _verifyLongName(int[] q, int qlen, int spillOffset)
     
    private boolean
    _verifyLongName2(int[] q, int qlen, int spillOffset)
     
    private void
     
    addName(String name, int q1)
     
    addName(String name, int[] q, int qlen)
     
    addName(String name, int q1, int q2)
     
    addName(String name, int q1, int q2, int q3)
     
    int
     
    int
    calcHash(int q1)
     
    int
    calcHash(int[] q, int qlen)
     
    int
    calcHash(int q1, int q2)
     
    int
    calcHash(int q1, int q2, int q3)
     
    Factory method to call to create a symbol table instance with a randomized seed value.
    protected static ByteQuadsCanonicalizer
    createRoot(int seed)
     
    findName(int q1)
     
    findName(int[] q, int qlen)
     
    findName(int q1, int q2)
     
    findName(int q1, int q2, int q3)
     
    int
     
    boolean
     
    makeChild(int flags)
    Factory method used to create actual symbol table instance to use for parsing.
    Method similar to makeChild(int) but one that only creates real instance of TokenStreamFactory.Feature.CANONICALIZE_PROPERTY_NAMES is enabled: otherwise a "bogus" instance is created.
    boolean
    Method called to check to quickly see if a child symbol table may have gotten additional entries.
    private void
     
    static int
     
    private void
    nukeSymbols(boolean fill)
    Helper method called to empty all shared symbols, but to leave arrays allocated
    int
    Method mostly needed by unit tests; calculates number of entries that are in the primary slot set.
    private void
     
    void
    Method called by the using code to indicate it is done with this instance.
    int
    Method mostly needed by unit tests; calculates number of entries in secondary buckets
    int
     
    int
    Method mostly needed by unit tests; calculates number of entries in shared spill-over area
    int
    Method mostly needed by unit tests; calculates number of entries in tertiary buckets
     
    int
     

    Methods inherited from class Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Field Details

    • DEFAULT_T_SIZE

      protected static final int DEFAULT_T_SIZE
      Initial size of the primary hash area. Each entry consumes 4 ints (16 bytes), and secondary area is same as primary; so default size will use 2kB of memory (plus 64x4 or 64x8 (256/512 bytes) for references to Strings, and Strings themselves).
      See Also:
    • MAX_T_SIZE

      private static final int MAX_T_SIZE
      Let's not expand symbol tables past some maximum size; with unique (~= random) names. Size is in
      See Also:
    • MIN_HASH_SIZE

      private static final int MIN_HASH_SIZE
      No point in trying to construct tiny tables, just need to resize soon.
      See Also:
    • MAX_ENTRIES_FOR_REUSE

      protected static final int MAX_ENTRIES_FOR_REUSE
      Let's only share reasonably sized symbol tables. Max size set to 3/4 of 8k; this corresponds to 256k main hash index. This should allow for enough distinct names for almost any case, while preventing ballooning for cases where names are unique (or close thereof).
      See Also:
    • _parent

      protected final ByteQuadsCanonicalizer _parent
      Reference to the root symbol table, for child tables, so that they can merge table information back as necessary.
    • _tableInfo

      Member that is only used by the root table instance: root passes immutable state info child instances, and children may return new state if they add entries to the table. Child tables do NOT use the reference.
    • _seed

      protected final int _seed
      Seed value we use as the base to make hash codes non-static between different runs, but still stable for lifetime of a single symbol table instance. This is done for security reasons, to avoid potential DoS attack via hash collisions.
    • _interner

      protected final InternCache _interner
      Entity that knows how to intern Strings, if needed, or null if no interning is wanted.
      Since:
      2.16
    • _failOnDoS

      protected final boolean _failOnDoS
      Flag that indicates whether we should throw an exception if enough hash collisions are detected (true); or just worked around (false).
    • _hashArea

      protected int[] _hashArea
      Primary hash information area: consists of 2 * _hashSize entries of 16 bytes (4 ints), arranged in a cascading lookup structure (details of which may be tweaked depending on expected rates of collisions).
    • _hashSize

      protected int _hashSize
      Number of slots for primary entries within _hashArea; which is at most 1/8 of actual size of the underlying array (4-int slots, primary covers only half of the area; plus, additional area for longer symbols after hash area).
    • _secondaryStart

      protected int _secondaryStart
      Offset within _hashArea where secondary entries start
    • _tertiaryStart

      protected int _tertiaryStart
      Offset within _hashArea where tertiary entries start
    • _tertiaryShift

      protected int _tertiaryShift
      Constant that determines size of buckets for tertiary entries: 1 << _tertiaryShift is the size, and shift value is also used for translating from primary offset into tertiary bucket (shift right by 4 + _tertiaryShift).

      Default value is 2, for buckets of 4 slots; grows bigger with bigger table sizes.

    • _count

      protected int _count
      Total number of Strings in the symbol table; only used for child tables.
    • _names

      protected String[] _names
      Array that contains String instances matching entries in _hashArea. Contains nulls for unused entries. Note that this size is twice that of _hashArea
    • _spilloverEnd

      protected int _spilloverEnd
      Pointer to the offset within spill-over area where there is room for more spilled over entries (if any). Spill over area is within fixed-size portion of _hashArea.
    • _longNameOffset

      protected int _longNameOffset
      Offset within _hashArea that follows main slots and contains quads for longer names (13 bytes or longer), and points to the first available int that may be used for appending quads of the next long name. Note that long name area follows immediately after the fixed-size main hash area (_hashArea).
    • _hashShared

      protected boolean _hashShared
      Flag that indicates whether underlying data structures for the main hash area are shared or not. If they are, then they need to be handled in copy-on-write way, i.e. if they need to be modified, a copy needs to be made first; at this point it will not be shared any more, and can be modified.

      This flag needs to be checked both when adding new main entries, and when adding new collision list queues (i.e. creating a new collision list head entry)

    • MULT

      private static final int MULT
      See Also:
    • MULT2

      private static final int MULT2
      See Also:
    • MULT3

      private static final int MULT3
      See Also:
  • Constructor Details

    • ByteQuadsCanonicalizer

      protected ByteQuadsCanonicalizer(int sz, int seed)
      Constructor used for creating per-TokenStreamFactory "root" symbol tables: ones used for merging and sharing common symbols
      Parameters:
      sz - Initial primary hash area size
      seed - Random seed valued used to make it more difficult to cause collisions (used for collision-based DoS attacks).
    • ByteQuadsCanonicalizer

      private ByteQuadsCanonicalizer(ByteQuadsCanonicalizer parent, int seed, ByteQuadsCanonicalizer.TableInfo state, boolean intern, boolean failOnDoS)
      Constructor used when creating a child instance
    • ByteQuadsCanonicalizer

      private ByteQuadsCanonicalizer(ByteQuadsCanonicalizer.TableInfo state)
      Alternate constructor used in cases where a "placeholder" child instance is needed when symbol table is not really used, but caller needs a non-null placeholder to keep code functioning with minimal awareness of distinction (all lookups fail to match any name without error; add methods should NOT be called).
      Since:
      2.13
  • Method Details

    • createRoot

      public static ByteQuadsCanonicalizer createRoot()
      Factory method to call to create a symbol table instance with a randomized seed value.
      Returns:
      Root instance to use for constructing new child instances
    • createRoot

      protected static ByteQuadsCanonicalizer createRoot(int seed)
    • makeChild

      public ByteQuadsCanonicalizer makeChild(int flags)
      Factory method used to create actual symbol table instance to use for parsing.
      Parameters:
      flags - Bit flags of active TokenStreamFactory.Features enabled.
      Returns:
      Actual canonicalizer instance that can be used by a parser
    • makeChildOrPlaceholder

      public ByteQuadsCanonicalizer makeChildOrPlaceholder(int flags)
      Method similar to makeChild(int) but one that only creates real instance of TokenStreamFactory.Feature.CANONICALIZE_PROPERTY_NAMES is enabled: otherwise a "bogus" instance is created.
      Parameters:
      flags - Bit flags of active TokenStreamFactory.Features enabled.
      Returns:
      Actual canonicalizer instance that can be used by a parser if (and only if) canonicalization is enabled; otherwise a non-null "placeholder" instance.
    • release

      public void release()
      Method called by the using code to indicate it is done with this instance. This lets instance merge accumulated changes into parent (if need be), safely and efficiently, and without calling code having to know about parent information.
    • mergeChild

      private void mergeChild(ByteQuadsCanonicalizer.TableInfo childState)
    • size

      public int size()
      Returns:
      Number of symbol entries contained by this canonicalizer instance
    • bucketCount

      public int bucketCount()
      Returns:
      number of primary slots table has currently
    • maybeDirty

      public boolean maybeDirty()
      Method called to check to quickly see if a child symbol table may have gotten additional entries. Used for checking to see if a child table should be merged into shared table.
      Returns:
      Whether main hash area has been modified
    • hashSeed

      public int hashSeed()
    • isCanonicalizing

      public boolean isCanonicalizing()
      Returns:
      True for "real", canonicalizing child tables; false for root table as well as placeholder "child" tables.
      Since:
      2.13
    • primaryCount

      public int primaryCount()
      Method mostly needed by unit tests; calculates number of entries that are in the primary slot set. These are "perfect" entries, accessible with a single lookup
      Returns:
      Number of entries in the primary hash area
    • secondaryCount

      public int secondaryCount()
      Method mostly needed by unit tests; calculates number of entries in secondary buckets
      Returns:
      Number of entries in the secondary hash area
    • tertiaryCount

      public int tertiaryCount()
      Method mostly needed by unit tests; calculates number of entries in tertiary buckets
      Returns:
      Number of entries in the tertiary hash area
    • spilloverCount

      public int spilloverCount()
      Method mostly needed by unit tests; calculates number of entries in shared spill-over area
      Returns:
      Number of entries in the linear spill-over areay
    • totalCount

      public int totalCount()
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • findName

      public String findName(int q1)
    • findName

      public String findName(int q1, int q2)
    • findName

      public String findName(int q1, int q2, int q3)
    • findName

      public String findName(int[] q, int qlen)
    • _calcOffset

      private final int _calcOffset(int hash)
    • _findSecondary

      private String _findSecondary(int origOffset, int q1)
    • _findSecondary

      private String _findSecondary(int origOffset, int q1, int q2)
    • _findSecondary

      private String _findSecondary(int origOffset, int q1, int q2, int q3)
    • _findSecondary

      private String _findSecondary(int origOffset, int hash, int[] q, int qlen)
    • _verifyLongName

      private boolean _verifyLongName(int[] q, int qlen, int spillOffset)
    • _verifyLongName2

      private boolean _verifyLongName2(int[] q, int qlen, int spillOffset)
    • addName

      public String addName(String name, int q1) throws StreamConstraintsException
      Parameters:
      name - Name to add
      q1 - Quad representation of the name
      Returns:
      name (possibly interned)
      Throws:
      StreamConstraintsException - if the constraint exceptions
    • addName

      public String addName(String name, int q1, int q2) throws StreamConstraintsException
      Parameters:
      name - Name to add
      q1 - First quad of name representation
      q2 - Second quad of name representation
      Returns:
      name (possibly interned)
      Throws:
      StreamConstraintsException - if the constraint exceptions
    • addName

      public String addName(String name, int q1, int q2, int q3) throws StreamConstraintsException
      Parameters:
      name - Name to add
      q1 - First quad of name representation
      q2 - Second quad of name representation
      q3 - Third quad of name representation
      Returns:
      name (possibly interned)
      Throws:
      StreamConstraintsException - if the constraint exceptions
    • addName

      public String addName(String name, int[] q, int qlen) throws StreamConstraintsException
      Parameters:
      name - Name to add
      q - Quads of name representation
      qlen - Number of quads in q
      Returns:
      name (possibly interned)
      Throws:
      StreamConstraintsException - if the constraint exceptions
    • _verifySharing

      private void _verifySharing()
    • _findOffsetForAdd

      private int _findOffsetForAdd(int hash) throws StreamConstraintsException
      Method called to find the location within hash table to add a new symbol in.
      Parameters:
      hash - Hash of name for which to find location
      Throws:
      StreamConstraintsException - If name length exceeds maximum allowed.
    • _resizeAndFindOffsetForAdd

      private int _resizeAndFindOffsetForAdd(int hash) throws StreamConstraintsException
      Throws:
      StreamConstraintsException
    • multiplyByFourFifths

      public static int multiplyByFourFifths(int number)
    • _checkNeedForRehash

      private boolean _checkNeedForRehash()
    • _appendLongName

      private int _appendLongName(int[] quads, int qlen)
    • calcHash

      public int calcHash(int q1)
    • calcHash

      public int calcHash(int q1, int q2)
    • calcHash

      public int calcHash(int q1, int q2, int q3)
    • calcHash

      public int calcHash(int[] q, int qlen)
      Parameters:
      q - int array
      qlen - length
      Returns:
      hash
      Throws:
      IllegalArgumentException - if qlen is less than 4
    • rehash

      private void rehash() throws StreamConstraintsException
      Throws:
      StreamConstraintsException
    • nukeSymbols

      private void nukeSymbols(boolean fill)
      Helper method called to empty all shared symbols, but to leave arrays allocated
    • _spilloverStart

      private final int _spilloverStart()
      Helper method that calculates start of the spillover area
    • _reportTooManyCollisions

      protected void _reportTooManyCollisions() throws StreamConstraintsException
      Throws:
      StreamConstraintsException
    • _calcTertiaryShift

      static int _calcTertiaryShift(int primarySlots)
    • _throwInternalError

      private <T> T _throwInternalError(String msg)