Short, friendly base32 slugs from timestamps (brandur.org)
142 points by craigkerstiens on Jan 19, 2023 | 42 comments


For my current project I decided to try out 120-bit/15-byte identifiers. 120 is evenly divisible by 4 (hex), 5 (base32), 6 (base64), and 8. Neither base32 nor base64 encodings need padding. You can usually leave padding out regardless, but with 120 bits it's not even a choice you have to worry about, and there are no wasted bits.
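A minimal Go sketch of that arithmetic using standard-library encoders (the function is just illustrative):

    // 15 random bytes = 120 bits; every encoding comes out even
    func demo() {
        id := make([]byte, 15)
        rand.Read(id) // crypto/rand

        fmt.Println(hex.EncodeToString(id)) // 30 chars (120/4)
        fmt.Println(base32.StdEncoding.WithPadding(base32.NoPadding).
            EncodeToString(id)) // 24 chars (120/5), no padding needed
        fmt.Println(base64.RawStdEncoding.EncodeToString(id)) // 20 chars (120/6), no padding needed
    }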


Having it be evenly divisible by all those bases is neat!

But I wouldn't say "no wasted bits" unless 120 bits is also exactly the right length for a particular use case. For example, if my use case would be served better by 64 bits, I'd prefer losing some bits to padding than extending it to 120 bits.


If it’s served by 64 bits it’s probably served by 60 right? 64 bits is too few to generate randomly.


For URLs you can also do something else. If you know that your primary format is UUID v4, you can encode your UUIDs into nice URLs in base62. They come out as 22 alphanumeric characters, and you can reconstruct the 128 bits of the UUID perfectly from them.
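A sketch of the round trip in Go, where math/big's Text and SetString support base 62 directly (padding to a fixed 22 characters is my own choice, so every UUID encodes to the same length):

    // UUID (16 bytes) -> 22-char base62 string and back
    func uuidToBase62(u [16]byte) string {
        s := new(big.Int).SetBytes(u[:]).Text(62) // digits 0-9, a-z, A-Z
        return strings.Repeat("0", 22-len(s)) + s // left-pad to a constant width
    }

    func base62ToUUID(s string) (u [16]byte, ok bool) {
        n, ok := new(big.Int).SetString(s, 62)
        if !ok || n.BitLen() > 128 {
            return u, false
        }
        n.FillBytes(u[:]) // left-pads with zero bytes
        return u, true
    }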


If you're going to change the order defined in RFC 4648, you might as well change the character set, too. Since one of the goals is to minimize the use of characters that can be confused for another, I would add back 8 and 9 and remove two of i, l and o.


This is what crockford32 does by the way. It's my preferred base32 alternative when there's a chance humans have to transmit something.


I like to also remove 0 and 1 — because the human entering the code probably doesn't know that 0, o and O will be considered equivalent — and A and E — because accidental profanity isn't actually prevented by excluding only U and I.

This gives base 28, of course, which is admittedly slightly less neat, but still perfectly usable. For example, in TypeScript I use this:

    const targetChars = '23456789bcdfghjkmnpqrstvwxyz';  // must be <=36 chars
    const sourceChars = '0123456789abcdefghijklmnopqr';  // what we get from toString(radix)
    const radix = targetChars.length;

    export function tokenFromId(id: number) {
      return [...id.toString(radix)]
        .map(origChar => targetChars.charAt(sourceChars.indexOf(origChar)))
        .join('');
    }

    export function idFromToken(token: string) {
      return parseInt([...token.toLowerCase()]
        .map(pathChar => sourceChars.charAt(targetChars.indexOf(pathChar)))
        .join('')
        , radix);
    }


To each their own, but the point is that the human doesn't need to know that 0, o and O are considered the same.


The point is, if they don't know it, then they may worry whether or not they're getting the code right.


Why does that matter when it doesn't matter if they get it right?


Worrying is not good UX.


This removes '1' but not 'I'. But if you want to avoid ambiguity, you need to remove _both_ characters, not just one. Users don't know your IDs never contain a '1', so when they see 'I', they'll still think, "hmm, is that a 1?"


Crockford's version of base32 [0] (sometimes called crockford32) treats 0, o, and O the same when decoding, likewise for 1, i, I, l and L. It also doesn't use the letter U, to reduce the chances of including English obscenities.

So, you don't need to remove ambiguities if you can collapse the ambiguities into equivalence classes.

If you're going to use some flavor of base32, crockford32 is hard to argue against. I could bikeshed some alternatives (I would use the letter U and include Q in the 0 equivalence class, and not worry too much about obscenities), but of the somewhat common base32 standards, crockford32 is the best, IMNSHO.

[0] https://www.crockford.com/base32.html
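For illustration, the decode-side folding can be a small mapping pass before an ordinary base32 decode (a Go sketch of the equivalence classes, not code from any particular library):

    // fold Crockford's aliases onto their canonical digits
    func normalizeCrockford(s string) string {
        return strings.Map(func(r rune) rune {
            switch r {
            case 'o', 'O':
                return '0'
            case 'i', 'I', 'l', 'L':
                return '1'
            case '-':
                return -1 // hyphens are cosmetic separators; Map drops negatives
            default:
                return unicode.ToUpper(r)
            }
        }, s)
    }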


On that note, it's probably worth shamelessly plugging Base32H, which is my own take on base32, inspired heavily by Crockford's: https://base32h.github.io/

Base32H has some advantages and disadvantages compared to Crockford's (see: https://base32h.github.io/comparisons#crockfords-base32 ); long story short: L/l is its own digit, 5/S/s are merged, and U/u/V/v are merged. In my totally-not-biased-at-all opinion, the advantages outweigh the disadvantages enough to have warranted creating Base32H instead of just using Crockford's.

If I was willing to break from duotrigesimal, I'd probably merge 0/O/o/Q/q, 1/I/i/L/l, 2/Z/z, 3/E/e, 6/G/g, 7/T/t, 8/B/b, and 9/P/p. The resulting Base24H would hinder readability of encoded words, and would break the alignment to whole bits, but it's probably close to the optimal intersection of information density, unambiguity, and convenience.


> you don't need to remove ambiguities if you can collapse the ambiguities into equivalence classes.

It still leaves users who don’t know or can’t assume that normalization is applied worrying about which character exactly is being displayed.


The users don't need to know or assume anything if the decoder accepts both characters as aliases for one another. The issue is further addressable by being explicit about the normalization being in effect and/or by being explicit about which one the encoder is emitting ("if it looks like a zero, then it's always a zero").


My point is that most developers implementing this won’t consider that. This is just one example of a more general phenomenon of making some UI behavior foolproof, but not considering that the user doesn’t know that it’s foolproof, so the user isn’t saved from wondering about what will happen and what is the correct input.

In the present case, instead of having to inform the user about the normalization, it would be simpler to refrain from using potentially ambiguous characters in the first place.


If you are taking base32 as user input, you can translate '1' to 'I' so that the input is correct regardless of which the user types.

But, needing users to read / type a base32 ID is always bad UX anyway.


Yeah, best to avoid having to type in a big nonsense string in the first place. But if you do need to do it, base32 is at least better than the obvious alternatives (decimal, hex, base64)


It’s much clearer with lower-case “i” (which this article uses). Same situation with “o”.


Another option is to generate a ULID and just take the timestamp part (6 bytes, I forget how many characters). Uses Crockford Base32 already so no need for the custom alphabet.


That is definitely taking the long way around.... Why not just grab a timestamp and base32 it yourself? ULIDs also use millisecond timestamps, and use implementation magic to sort IDs within a millisecond. If you are just using the timestamp portion, you should use micro or nanoseconds.
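Something like this, say (a Go sketch; base32hex is my pick here because it sorts the same as the raw bytes):

    // base32 a nanosecond timestamp directly; no ULID machinery needed
    func nanoSlug(t time.Time) string {
        var b [8]byte
        binary.BigEndian.PutUint64(b[:], uint64(t.UnixNano()))
        return base32.HexEncoding.WithPadding(base32.NoPadding).EncodeToString(b[:])
    }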


Well yeah that's kind of what I mean. On some libraries you can generate and encode just the timestamp part.


It's literally one line of code:

    let slug = base32Encode(Date.now())  // ms since the Unix epoch
Maybe 15 LoC if you don't want to import a b32 encoder. Hardly seems worth a dependency.


I always go for Hashids (https://hashids.org/)


I've never been sure of the use case of hashids. They're not really a security measure since they can be easily defeated; you might as well just expose the sequential ID at that point. If you do need something with better properties, you should probably be using UUIDv6.


> they can be easily defeated

How? I mean they're of course no cryptographic measure but with the salt you have some secrecy.


If an attacker can generate sequential hashids, they can recover the shuffled alphabet that hashids uses without needing to know the salt, and then use that alphabet to invert other hashids.

I wrote a library that generates short IDs with the goal of making the similarity between two codes have nothing to do with sequence order.

https://docs.rs/block-id/latest/block_id/


Fwiw, hashids do not solve OP's use case:

> Hashids is a small open-source library that generates short, unique, non-sequential ids from numbers.

vs the summary in the article:

> So now when I list atoms from my filesystem or in S3, they come back in the same order that I wrote them.


Hashids are nice


I use shortuuid[0] for that stuff, which also omits the capital letter I, and has some other niceties (I wrote the library). It works really well, and I like how small the IDs are.

[0] https://github.com/skorokithakis/shortuuid


Careful to not show these to end users as some may generate swearwords and such.


Nice post!

To me, "math/big" feels like a big hammer for working on an int64; a nice alternative is "encoding/binary":

  func atomSlug(publishedAt time.Time) string {
      b := make([]byte, 0, 8)
      b = binary.BigEndian.AppendUint64(b, uint64(publishedAt.Unix()))
      b = bytes.TrimLeft(b, "\x00")
      return lexicographicBase32Encoding.EncodeToString(b)
  }


Now that you made me think about it, both this approach and the original math/big one will add more base32 digits earlier than needed, which breaks sorting earlier than needed: slugs with different numbers of characters won't sort properly (think 1, 10, 11, ..., 2, 20, ...).

Right now the time is converted to 4 bytes which then uses 7 base32 characters to encode. When the time rolls over to 33 bits in 2106, it will increase to 5 bytes encoded as 8 base32 chars, even though 7 base32 chars have enough bits (35) to represent the number.

So perhaps a better way would be something like:

    func atomSlug(publishedAt time.Time) string {
        b := make([]byte, 0, 8)
        b = binary.BigEndian.AppendUint64(b, uint64(publishedAt.Unix()))
        s := lexicographicBase32Encoding.EncodeToString(b)
        return strings.TrimLeft(s, "2") // "2" is all zero bits
    }
This will stick with 7 base32 chars until you exceed 35 bits in 3058.


This is clever and thoughtful, but it encodes a 32-bit timestamp in seven base32 characters. It would have been simpler and just one more character to use eight hex digits. Alternatively, maybe go to 4 second resolution and use 30 bit timestamps, so they fit in six base32 chars instead of seven.
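A hypothetical sketch of that 30-bit variant in Go (the alphabet is assumed; it's the lexicographically sorted one that the sibling comment's "'2' is all zero bits" implies):

    const sortedAlphabet = "234567abcdefghijklmnopqrstuvwxyz" // assumed article alphabet

    // 4-second resolution: 30 bits, exactly six base32 chars, good until 2106
    func coarseSlug(t time.Time) string {
        v := uint64(t.Unix()) >> 2 // divide by 4 to get 4-second buckets
        out := make([]byte, 6)
        for i := 5; i >= 0; i-- {
            out[i] = sortedAlphabet[v&31]
            v >>= 5
        }
        return string(out)
    }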


Those approaches would both roll over in 2106, breaking sorting, and you'd potentially start getting collisions around 2566. With his approach he is starting with a 63-bit timestamp (Go uses signed int64s), and math/big's Bytes method drops leading zero bytes, meaning it will add more characters as needed.


That's a separate thing, the 15-byte UUIDs. I mean the timestamp slugs. Those are 32-bit timestamps that run out at the same time other Unix timestamps do.


Now I wish the standard for unix timestamps was in hex, thanks!


One simple workaround could be to use something like UUID version 7[0], using just the first part that encodes the time, and dropping the rest.

[0]: https://github.com/ietf-wg-uuidrev/rfc4122bis
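A sketch of that in Go: the draft puts a big-endian 48-bit Unix-millisecond timestamp in the first six bytes of a UUIDv7, so the slug is just those bytes encoded (base32hex is my choice here because it preserves sort order):

    // keep only the leading 48-bit timestamp of a UUIDv7
    func v7TimestampSlug(u [16]byte) string {
        return base32.HexEncoding.WithPadding(base32.NoPadding).EncodeToString(u[:6]) // 10 chars
    }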


Given that UUID's purpose is in the name (_unique_), squeezing a timestamp in there seems to be a hack with unexpected consequences at best.

Linking to a github repo without any context does not prove your point.


It’s true: if the only goal of a UUID were minimizing collisions within a given number of bits, storing the timestamp would be a waste of entropy (assuming that the generation rate is non-uniform over time).

The reason they encode a timestamp (which I learned from the link upthread) is interesting, though. They knew that people use UUIDs as keys in data structures like BTrees, and using a timestamp improves memory/disk locality.

> Non-time-ordered UUID versions such as UUIDv4 have poor database index locality. Meaning new values created in succession are not close to each other in the index and thus require inserts to be performed at random locations. The negative performance effects of which on common structures used for this (B-tree and its variants) can be dramatic.


> Linking to a github repo without any context does not prove your point.

The GitHub link in question is that of an IETF working group.

https://www.ietf.org/about/introduction/

Granted, linking to the draft of the standard may have been more clear. The draft is in the repo they linked. Direct link to the draft here:

https://ietf-wg-uuidrev.github.io/rfc4122bis/draft-00/draft-...

Alternatively, the currently published draft:

https://www.ietf.org/archive/id/draft-ietf-uuidrev-rfc4122bi...

> squeezing a timestamp in there seems to be a hack with unexpected consequences at best

It’s not a hack. It’s a future standard, and it is going to be widely used.

IETF is an organisation that publishes many of the standards that the internet and computer software are built on. Including publishing a standard for the existing UUID variants in common use today:

> UUIDs are standardized by the Open Software Foundation (OSF) as part of the Distributed Computing Environment (DCE).

> UUIDs are documented as part of ISO/IEC 11578:1996 "Information technology – Open Systems Interconnection – Remote Procedure Call (RPC)" and more recently in ITU-T Rec. X.667 | ISO/IEC 9834-8:2005.

> The Internet Engineering Task Force (IETF) published the Standards-Track RFC 4122, technically equivalent to ITU-T Rec. X.667 | ISO/IEC 9834-8.

https://en.wikipedia.org/wiki/Universally_unique_identifier

UUID was specifically created so that it could support different versions. Some of the bits in UUID identify the version.

Hence, a standard that says how to encode timestamp into UUID, and assigns a version for this format, UUID v7, is in fact the exact opposite of a hack.

And the reason they made UUID v7 is to have chronologically sortable UUIDs. They are inspired/based on ULID format, but made to conform to the UUID format.



