[leafnode-list] better hash for message.id/ ?

Tue Nov 4 23:41:39 CET 2008

hi,

I just ran a test (on my freebsd box) to test the "even-ness" of the
distribution of Message-IDs in var/spool/news/message.id/* :

# find /news/message.id/* -type d -print |
  xargs -I \% -R 1 -n 1 sh -c "ls % |wc -l" |
  sed 1d |
  ministat -s -w74

result:

x <stdin>
+--------------------------------------------------------------------------+
|                              x                                           |
|                       x      x           x                               |
|                       x      x     x     x                               |
|                       x      x     x     x                               |
|                  x    x      x     x     x                               |
|                  x    x      x     x     x                               |
|                  x    xx     x     x     x                               |
|                  x x  xx     x     x     x                               |
|                  x x  xx     x     x     x                               |
|                  x x  xx     x     x     x                               |
|               x  x x  xx     x     xx    x                               |
|               x  x xx xx     x  x  xx    x                               |
|               x  x xx xx     x  x  xx    x                               |
|               x  x xx xx     x  x  xx    x                               |
|               x xx xx xx  xx xx xx xx    xx                              |
|            x  x xx xx xx  xx xx xx xx xx xx                              |
|            x  x xx xx xx  xx xx xx xx xx xx                              |
|            x  x xx xx xx xxx xx xx xx xx xx                              |
|            x  x xx xx xx xxx xx xx xx xx xx                              |
|            x  x xx xx xx xxx xx xx xx xx xx                              |
|            x  x xx xx xx xxx xx xx xx xx xx                              |
|            x  x xx xxxxx xxx xx xx xx xx xx                              |
|            x  x xx xxxxx xxx xx xx xx xx xx  x                           |
|            x  x xx xxxxx xxx xx xx xx xx xx  x                           |
|            x  xxxxxxxxxx xxx xx xx xx xx xx  x  x                        |
|            x xxxxxxxxxxx xxxxxx xxxxx xx xx  x  x                        |
|           xx xxxxxxxxxxxxxxxxxx xxxxx xx xx xxx x                        |
|           xx xxxxxxxxxxxxxxxxxx xxxxx xx xx xxx x                        |
|           xx xxxxxxxxxxxxxxxxxx xxxxx xx xxxxxx xx                       |
|           xx xxxxxxxxxxxxxxxxxx xxxxxxxx xxxxxx xx                       |
|           xx xxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxx xxxx                     |
|           xx xxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxx                    |
|           xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx x                  |
|       x x xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xx  x    x         |
|       xxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xx  x    x         |
|     x xxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  x    x         |
|     x xxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  x    xx        |
|     x xxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  x  x xx        |
|     xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xx x xxx       |
|x xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxx xxx   x  x|
|                  |___________MA____________|                             |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x 999            47           170            97     99.132132     22.308067

The result shows that the hash calculating the location of a particular
message-id might be considered good enough, but propably far from
optimal.  The code is this:

unsigned int
msgid_hash(const char *name)
{
    int i = 0;
    unsigned int r = 0;
    int stop = 0;

    do {
        if (name[i] == '>')
            ++stop;
        r += (int)((unsigned char)name[i]);
        r += ++i;
    } while (!stop && name[i]);

    r = r % 999 + 1;
    return r;
}

Freebsd has this simple BSD-licensed function for hashing strings in
/usr/include/sys/hash.h:

/* Convenience */
#ifndef HASHINIT
#define HASHINIT        5381
#define HASHSTEP(x,c)   (((x << 5) + x) + (c))
#endif

...

/*
 * Return a 32-bit hash of the given string.
 */
static __inline uint32_t
hash32_str(const void *buf, uint32_t hash)
{
        const unsigned char *p = buf;

        while (*p)
                hash = HASHSTEP(hash, *p++);

        return hash;
}

There's a simple hash function by Weinberger, see Aho, Sethi, Ullman
book on compilers from 1986 or search for hashpjw.  It basically goes
like this:

unsigned long int
string_hash(char *s)
{
  register char *p;
  register unsigned long int h = 0, g;

  ASSERT((s != NULL) && (strlen(s) > 0));

  p = s;
  while (*p)
  {
    h = (h<<4) + *p++;
    g = h & 0xf0000000;
    if (g != 0)
    {
      h ^= (g>>24);
      h ^= g;
    }
  }
  return h;
}

I should note that "ministat" is in freebsd base, but it may not be in
eg. linux distributions.

Nothing much to do today, I'm just waiting for the outcome of the US
elections ...

-c