Format for ID of YouTube video

YouTube videoId and channelId identifiers are single integer values represented in a slightly modified version of Base64 encoding. One difference versus the IETF RFC4648 recommendations is the substitution of two characters in the encoding alphabet:

 Payload  ASCII/Unicode      Base64     YouTube
 -------  -------------     ---------  ---------
  0...25  \x41 ... \x5A     'A'...'Z'  'A'...'Z'
 26...51  \x61 ... \x7A     'a'...'z'  'a'...'z'
 52...61  \x30 ... \x39     '0'...'9'  '0'...'9'
    62    \x2B vs. \x2D  →   '+' (2B)   '-' (2D)
    63    \x2F vs. \x5F  →   '/' (2F)   '_' (5F)

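The substitution is likely due to the fact that, for some reason, the standard RFC 4648 alphabet uses two characters that already had prominent and well-established functions in URLs.[note 1.] For the usage under discussion here, that particular complication was best avoided. The YouTube alphabet in fact coincides with the “URL and Filename Safe” variant (base64url) that RFC 4648 itself defines in its section 5.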

Another difference from the official specification is that YouTube identifiers do not use the = padding character; it’s not necessary because the encoded lengths expected per respective decoded integer size are fixed and known (11 and 22 encoded ‘digits’ for 64 and 128 bits, respectively).

With one minor exception (discussed below), the full details of the Base64 mapping can be inferred from publicly accessible data. With a minimum of guesswork, it’s likely that the Base64 scheme used in the videoId and channelId strings is as follows:

    ——₀————₁————₂————₃————₄————₅————₆————₇————₈————₉———₁₀———₁₁———₁₂———₁₃———₁₄———₁₅—
     00ᴴ  01ᴴ  02ᴴ  03ᴴ  04ᴴ  05ᴴ  06ᴴ  07ᴴ  08ᴴ  09ᴴ  0Aᴴ  0Bᴴ  0Cᴴ  0Dᴴ  0Eᴴ  0Fᴴ
00→ 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
      A    B    C    D    E    F    G    H    I    J    K    L    M    N    O    P

    —₁₆———₁₇———₁₈———₁₉———₂₀———₂₁———₂₂———₂₃———₂₄———₂₅———₂₆———₂₇———₂₈———₂₉———₃₀———₃₁—
     10ᴴ  11ᴴ  12ᴴ  13ᴴ  14ᴴ  15ᴴ  16ᴴ  17ᴴ  18ᴴ  19ᴴ  1Aᴴ  1Bᴴ  1Cᴴ  1Dᴴ  1Eᴴ  1Fᴴ
01→ 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
      Q    R    S    T    U    V    W    X    Y    Z    a    b    c    d    e    f

    —₃₂———₃₃———₃₄———₃₅———₃₆———₃₇———₃₈———₃₉———₄₀———₄₁———₄₂———₄₃———₄₄———₄₅———₄₆———₄₇—
     20ᴴ  21ᴴ  22ᴴ  23ᴴ  24ᴴ  25ᴴ  26ᴴ  27ᴴ  28ᴴ  29ᴴ  2Aᴴ  2Bᴴ  2Cᴴ  2Dᴴ  2Eᴴ  2Fᴴ
10→ 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
      g    h    i    j    k    l    m    n    o    p    q    r    s    t    u    v

    —₄₈———₄₉———₅₀———₅₁———₅₂———₅₃———₅₄———₅₅———₅₆———₅₇———₅₈———₅₉———₆₀———₆₁———₆₂———₆₃—
     30ᴴ  31ᴴ  32ᴴ  33ᴴ  34ᴴ  35ᴴ  36ᴴ  37ᴴ  38ᴴ  39ᴴ  3Aᴴ  3Bᴴ  3Cᴴ  3Dᴴ  3Eᴴ  3Fᴴ
11→ 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
      w    x    y    z    0    1    2    3    4    5    6    7    8    9    -    _
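As a minimal sketch of the table above (the class and member names are my own, purely for illustration), the whole mapping fits in one 64-character string whose index positions are the 6-bit payload values. The placement of - and _ at 62 and 63 is the same assumption the table makes.

```csharp
using System;

static class YtDigits
{
    // The 64 symbols in table order: index in this string == 6-bit payload value.
    public const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ" +
        "abcdefghijklmnopqrstuvwxyz" +
        "0123456789" +
        "-_";   // values 62 and 63 (assumed assignment, as in the table)

    // Returns the 6-bit value (0..63) of one ID character, or -1 if it is not in the alphabet.
    public static int ValueOf(char c) => Alphabet.IndexOf(c);

    // Formats a 6-bit value the way the table does: 2 high bits, a space, then 4 low bits.
    public static string Bits(int value) =>
        Convert.ToString(value >> 4, 2).PadLeft(2, '0') + " " +
        Convert.ToString(value & 0xF, 2).PadLeft(4, '0');
}
```

For example, YtDigits.ValueOf('w') gives 48, which YtDigits.Bits renders as "11 0000", matching the first cell of the last table row.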

The reason to believe that Base64 is being used is that, when we assume standard integer sizes of 64 and 128 bits for the encoder input, Base64 predicts the unusual character lengths (11 and 22 characters) of the YouTube videoId and channelId identifiers exactly. Furthermore, remainders calculated as per Base64 perfectly explain the observed distributional variation found in the l̲a̲s̲t̲ c̲h̲a̲r̲a̲c̲t̲e̲r̲ of each type of identifier string. Discussion of these points follows.
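The length arithmetic can be checked in a couple of lines. This is only an illustrative sketch with hypothetical helper names: each Base64 digit carries 6 bits, so the unpadded length is the bit count divided by 6 and rounded up, and whatever capacity is left over in the last digit is the zero-filled surplus.

```csharp
using System;

static class YtIdLength
{
    // Unpadded Base64 length: one character per started group of 6 bits.
    public static int EncodedChars(int payloadBits) => (payloadBits + 5) / 6;

    // Capacity of the encoded string minus the payload size, in bits.
    public static int SurplusBits(int payloadBits) =>
        EncodedChars(payloadBits) * 6 - payloadBits;
}
```

With these definitions, 64 bits gives 11 characters with 2 surplus bits, and 128 bits gives 22 characters with 4 surplus bits.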

In both cases, the binary “data” that gets Base64-encoded is a single integer, either 64 or 128 bits, for (respectively) videoId vs. channelId. Accordingly, by using a Base64 decoder, a single integer can be recovered from the string identifier. It can be quite useful to do this: each integer id contains exactly the same information as the Base64 string (and also allows the string to be recreated at any time), yet when compared to Base64 strings stored as UTF-16 Unicode, the binary representation is about 64% smaller (8 bytes instead of 22 for a videoId), has the maximal bit-density of 100%, aligns in memory better, sorts and hashes faster, and, perhaps most importantly, eliminates false collisions between identifiers that differ only in orthographic case. This last problem, though extremely improbable numerically, nevertheless cannot be ruled out when Base64 IDs are treated as case-insensitive, as some filesystems do (e.g. Windows, dating back to DOS).

That’s kinda important: if you’re using a videoId / channelId string as part of a Windows/NTFS filename, there’s a vanishingly minuscule (but nevertheless non-zero) chance of filename collisions due to those filesystems deploying case-insensitive path and file naming.

If you’re worried about this remotely possible problem, one way to mathematically eliminate it would be to re-encode the decoded integers (still obtained as described in this article) into either a base-10 (decimal) or (uniform-cased) hexadecimal representation, for use in path or file names on such filesystems.[note 2.] In this approach, the 64-bit videoId would need at most 20 decimal digits [0-9] or 16 hex digits [0-9, A-F] (vs. 11 Base64 digits). The 128-bit channelId would require a maximum of 39 decimal digits or 32 hex digits (vs. 22 Base64 digits).
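A rough sketch of that re-encoding, assuming the integer has already been decoded. The helper names are hypothetical, and the 128-bit case assumes the value is carried as two 64-bit halves. Fixed-width, zero-padded output keeps every name the same length.

```csharp
using System;

static class YtFileNames
{
    // 64-bit videoId value as 16 uppercase hex digits, zero-padded.
    public static string ToHex(ulong videoId) => videoId.ToString("X16");

    // 64-bit videoId value as decimal, zero-padded to the 20-digit maximum.
    public static string ToDecimal(ulong videoId) => videoId.ToString("D20");

    // 128-bit channelId value, given as two 64-bit halves, as 32 hex digits.
    public static string ToHex(ulong high, ulong low) =>
        high.ToString("X16") + low.ToString("X16");
}
```

Because the output uses only digits and upper-case A-F, two distinct values can never differ by letter case alone.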

Decoding to binary is trivial for the 64-bit case because you can use a UInt64 (ulong in C#) to hold the native binary value that comes back.

/// <summary> Recover the unique 64-bit value from an 11-character videoID </summary>
/// <remarks>
/// The method of padding shown here (i.e. 'b64pad') is provided to demonstrate the
/// general padding rule for Base64, which depends on the unpadded length modulo 4:
///
///    videoId    →  11 chars  →  b64pad[11 % 4]  →  b64pad[3]  →  "="
///    channelId  →  22 chars  →  b64pad[22 % 4]  →  b64pad[2]  →  "=="
///
/// Note however that, because it returns 'ulong', this function only works for videoId 
/// values, and the padding will always end up being "=". This is assumed in the revised
/// version of this code given further below, by just hard-coding the value "=".
/// </remarks>
static ulong YtEnc_to_videoId(String ytId)
{
    String b64 = ytId.Replace('-', '+').Replace('_', '/') + b64pad[ytId.Length % 4];

    // Interprets the 8 decoded bytes in the machine's native byte order.
    return BitConverter.ToUInt64(Convert.FromBase64String(b64), 0);
}

// Indexed by (length % 4); a remainder of 1 never occurs in valid Base64.
static String[] b64pad = { "", "", "==", "=" };

For the case of the 128-bit values, it’s slightly trickier because, unless your compiler has an __int128 representation, you’ll have to figure out a way to store the whole thing and keep it combobulated as you pass it around. In .NET, a simple value type will do the trick, as will System.UInt128 (available from .NET 7) or System.Runtime.Intrinsics.Vector128<T>, which manifests as a 128-bit SIMD hardware register when available.
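One possible shape for such a value type is sketched here. The struct and method names are hypothetical, and reading the first decoded byte as the most significant one is my assumption, not something known about YouTube’s own representation.

```csharp
using System;

// Hypothetical value type holding the 128 bits as two 64-bit halves.
readonly struct ChannelId128
{
    public readonly ulong High, Low;
    public ChannelId128(ulong high, ulong low) { High = high; Low = low; }
}

static class YtChannel
{
    // Decodes a bare 22-character channelId (no "UC" or "UU" prefix).
    public static ChannelId128 Decode(string ytId)
    {
        if (ytId.Length != 22)
            throw new ArgumentException("expected 22 characters", nameof(ytId));

        // 22 Base64 digits need "==" to complete the final 4-character group.
        byte[] a = Convert.FromBase64String(
            ytId.Replace('-', '+').Replace('_', '/') + "==");

        // Assumption: first byte is most significant (a big-endian reading).
        return new ChannelId128(ReadBigEndian(a, 0), ReadBigEndian(a, 8));
    }

    // Assembles 8 bytes, most significant first, independent of machine byte order.
    static ulong ReadBigEndian(byte[] a, int offset)
    {
        ulong v = 0;
        for (int i = 0; i < 8; i++) v = (v << 8) | a[offset + i];
        return v;
    }
}
```

Two plain ulong fields keep the type cheap to copy and compare. A string of 22 ‘A’ characters decodes to all zero bits under this scheme.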

[edit:]
After further thought, a portion of my original post was not maximally complete. For fairness, the original excerpt is retained (you can skip it if desired); immediately below I explain the missing insight:


[original text:]
You might have noticed above that I wrote that you can recover “an” integer. Wouldn’t this be the value that was originally encoded? Not necessarily. And I’m not alluding to the signed/unsigned distinction which, it’s true, cannot be ascertained here (because it doesn’t change any facts about the binary image). It’s the numeric values themselves: Without some “Rosetta Stone” that would let us cross-check with absolute values known to be “correct”, the numeric alphabet mapping and also the endian-ness can’t be positively known, which means that there’s no guarantee that you’re recovering the same value that the computers at YouTube encoded. Fortunately, as long as YouTube never publicly exposes the so-called correct values in a less-opaque format somewhere else, this can’t possibly matter.

That’s because the decoded 64- or 128-bit values have no use except as an identifying token anyway, so our only requirements for the transform are distinct encoding (no two unique tokens collide) and reversibility (decoding recovers the original token identity). 

In other words, all we really care about is lossless round-tripping of the original Base64 string. Since Base64 is lossless and reversible (as long as you always stick to the same alphabet mapping and endianness assumption for both encoding and decoding) it satisfies our purposes. Your numeric values may not match up with those recorded in YouTube’s master vault, but you won’t be able to tell any difference.
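To make the round-tripping concrete, here is a small sketch that pairs a decoder with its inverse. The names are hypothetical, and both directions deliberately share the same alphabet swap and the machine’s native byte order, which is all the argument above requires.

```csharp
using System;

static class YtRoundTrip
{
    // 11-character ID -> 64-bit token (native byte order).
    public static ulong Decode(string ytId) =>
        BitConverter.ToUInt64(
            Convert.FromBase64String(
                ytId.Replace('-', '+').Replace('_', '/') + "="), 0);

    // Inverse of Decode: same alphabet swap, same byte order, padding removed.
    public static string Encode(ulong value) =>
        Convert.ToBase64String(BitConverter.GetBytes(value))
               .TrimEnd('=').Replace('+', '-').Replace('/', '_');
}
```

Whatever number Decode happens to produce on a given machine, Encode turns it back into the same 11 characters, provided the ID’s final character is one that a real encoder could emit.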


[new analysis:]
It turns out that there are a few clues that can tell us about the “true” Base64 mapping. Only certain mappings predict the final-position characters that we observe, meaning the binary value for only those characters must have a certain number of LSB zeros. Heh.

Taken together with the overwhelmingly likely assumption that the alphabet and digit characters are mapped in ascending order, we can basically confirm the mapping to be what is shown in the tables above. The only remaining uncertainty, on which the LSB analysis is inconclusive, is which of the - and _ characters encodes Base64 digit 62 and which encodes 63. Neither value ends in two zero bits, so neither character can ever appear in the final position, and the analysis has nothing to say about them.


The original text did discuss this LSB issue (see further below), but what I didn’t fully realize at the time was how LSB information acts to restrict the possible Base64 mappings.

[end edit.]

A last comment on the subject concerns endianness. You might in fact want to intentionally choose big-endian for the binary interpretation your app works with internally (even though it’s less common than little-endian nowadays and thus might not be the way YouTube ‘officially’ does it). The reason is that this is a case of dual views on the same value, such that the actual byte order is visibly exposed in the Base64 rendition. It’s helpful and less confusing to keep the sort order consistent between the binary value and the (somewhat more) human-readable Base64 string, but the sort of the little-endian binary values is a non-trivial scramble of the string order. (Strictly speaking, the order that big-endian values reproduce is that of the Base64 alphabet, i.e. A-Z, a-z, 0-9, -, _, which differs from a raw ASCII sort, where digits precede letters.)

Because of the irregular overlap between the 6-bit pattern and 8-bit bytes, there’s no simple fix for the human-readable sorting problem once you encode to little-endian ID values (i.e. simply reversing their sort won’t work). Instead, you have to plan ahead and reverse the bytes of each binary value at the time of decoding, that is, after the Base64 step has produced the raw bytes but before they are interpreted as an integer; in other words, apply the endianness transform. So if you care about the alphabetical display matching the sorting of the binary values, you might want to alter the function shown above so that it decodes into big-endian ulong values instead. Here’s that code:

// Recover the unique 64-bit value (big-endian) from an 11-character videoID
static ulong YtEnc_to_videoId(String ytId)
{
    var a = Convert.FromBase64String(ytId.Replace('-', '+').Replace('_', '/') + "=");
    if (BitConverter.IsLittleEndian)   // true for most computers nowadays
        Array.Reverse(a); 
    return BitConverter.ToUInt64(a, 0);
}
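The effect on sorting can be checked with a short sketch. The names are hypothetical; the digit-by-digit comparer uses the Base64 alphabet order assumed in the tables above, which is not the same thing as a plain ordinal (ASCII) string comparison.

```csharp
using System;

static class YtSortDemo
{
    const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

    // Same logic as the big-endian decoder above.
    public static ulong DecodeBigEndian(string ytId)
    {
        var a = Convert.FromBase64String(
            ytId.Replace('-', '+').Replace('_', '/') + "=");
        if (BitConverter.IsLittleEndian)
            Array.Reverse(a);
        return BitConverter.ToUInt64(a, 0);
    }

    // Compares two equal-length IDs digit by digit, in Base64-alphabet order.
    public static int CompareByAlphabet(string x, string y)
    {
        for (int i = 0; i < x.Length; i++)
        {
            int d = Alphabet.IndexOf(x[i]) - Alphabet.IndexOf(y[i]);
            if (d != 0) return Math.Sign(d);
        }
        return 0;
    }
}
```

For the made-up pair "aAAAAAAAAAA" and "0AAAAAAAAAA", both CompareByAlphabet and the decoded big-endian values place the first before the second, whereas String.CompareOrdinal places it after, since the character '0' precedes 'a' in ASCII.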

YouTube IDs


Video Id

The videoId is an 8-byte (64-bit) integer. Applying Base64-encoding to 8 bytes of data requires 11 characters. However, since each Base64 character conveys exactly 6 bits (viz., 2⁶ equals 64), this allocation could actually hold up to 11 × 6 = 66 bits, a surplus of 2 bits over the 64 bits our payload needs. The excess bits are set to zero, which has the effect of excluding certain characters from ever appearing in the last position of the encoded string. In particular, the videoId is guaranteed to always end with one of the following characters:

{ A, E, I, M, Q, U, Y, c, g, k, o, s, w, 0, 4, 8 }

Thus, the maximally-constrained regular expression (RegEx) for the videoId would be as follows:

[0-9A-Za-z_-]{10}[048AEIMQUYcgkosw]
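The final-character set does not have to be taken on faith; it follows mechanically from the alphabet. In this sketch (hypothetical names, and the alphabet order assumed throughout this post), a symbol is eligible for the last position exactly when the low surplus bits of its 6-bit value are all zero.

```csharp
using System;
using System.Linq;

static class YtLastChar
{
    const string Alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

    // Symbols whose 6-bit value has its lowest 'surplusBits' bits clear.
    public static string Allowed(int surplusBits)
    {
        int mask = (1 << surplusBits) - 1;
        return new string(Alphabet.Where((c, i) => (i & mask) == 0).ToArray());
    }
}
```

YtLastChar.Allowed(2) yields the sixteen characters listed above, in alphabet order. With a 4-bit surplus, as for a 22-character encoding of 128 bits, the same rule leaves only A, Q, g and w.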


Channel or Playlist Id

The channelId and playlistId strings are produced by Base64-encoding a 128-bit (16-byte) binary integer. This gives a 22-character string which can be prefixed with either UC to identify the channel itself, or with UU to identify a full playlist of the videos it contains. These 24-character prefixed strings are used in URLs. For example, the following shows two ways to refer to the same channel. Notice that the playlist version shows the total number of videos in the channel,[see note 3.] a useful piece of information which the channel pages don’t expose.

Channel URL

https://www.youtube.com/channel/UCK8sQmJBp8GCxrOtXWBpyEA

Playlist URL

https://www.youtube.com/playlist?list=UUK8sQmJBp8GCxrOtXWBpyEA

As was the case with the 11-character videoId, calculation per Base64 correctly predicts the observed string length of 22 characters. In this case, the output is capable of encoding 22 × 6 = 132 bits, a surplus of 4 bits; those zeros end up restricting m̲o̲s̲t̲ of the 64 alphabet symbols from appearing in the last position, with only 4 remaining eligible. We therefore know that the last character in a YouTube channelId string must be one of the following:

{ A, Q, g, w }

This gives us the maximally-constrained regular expression for a channelId:

[0-9A-Za-z_-]{21}[AQgw]

As a final note, the regular expressions shown above describe the bare ID values only, without the prefixes, slashes, separators, etc., that must be present in URLs and the other various uses. The RegEx patterns I presented are as mathematically minimal as possible given the properties of the identifier strings, but if used as-is without additional context they will probably generate a lot of false positives, that is, incorrectly match spurious text. To avoid this problem in actual use, surround them with as much of the expected adjacent context as possible.
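As one illustration of adding context, the sketch below wraps the bare channelId pattern in the channel URL form shown earlier. The class and method names are hypothetical, and anchoring on the exact https://www.youtube.com/channel/ prefix is my own choice for the example; real-world URLs may well vary.

```csharp
using System;
using System.Text.RegularExpressions;

static class YtUrlMatch
{
    // Bare-ID pattern from the text, surrounded by the channel URL context.
    static readonly Regex ChannelUrl = new Regex(
        @"^https://www\.youtube\.com/channel/UC([0-9A-Za-z_-]{21}[AQgw])$");

    // On success, 'id' receives the bare 22-character channelId.
    public static bool TryGetChannelId(string url, out string id)
    {
        Match m = ChannelUrl.Match(url);
        id = m.Success ? m.Groups[1].Value : "";
        return m.Success;
    }
}
```

Applied to the channel URL given above, this extracts K8sQmJBp8GCxrOtXWBpyEA, while the same URL with its final character changed to B is rejected.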


Notes

[1.]
As promised above, here is an excerpt from the Base64 specification which discusses their considerations in selecting the alphabet symbols. Individuals seeking to understand how the process concluded in the selection of characters with URL semantics may find the explanations somewhat unedifying.

3.4. Choosing the Alphabet

Different applications have different requirements on the characters in the alphabet. Here are a few requirements that determine which alphabet should be used:

  • Handled by humans. The characters “0” and “O” are easily confused, as are “1”, “l”, and “I”. In the base32 alphabet below, where 0 (zero) and 1 (one) are not present, a decoder may interpret 0 as O, and 1 as I or L depending on case. (However, by default it should not; see previous section.)
  • Encoded into structures that mandate other requirements. For base 16 and base 32, this determines the use of upper- or lowercase alphabets. For base 64, the non-alphanumeric characters (in particular, “/”) may be problematic in file names and URLs.
  • Used as identifiers. Certain characters, notably “+” and “/” in the base 64 alphabet, are treated as word-breaks by legacy text search/index tools.

There is no universally accepted alphabet that fulfills all the requirements. For an example of a highly specialized variant, see IMAP [8]. In this document, we document and name some currently used alphabets.

[2.]
Alternatively, to solve the problem of using Base64-encoded ID strings as “as-is” components of file or path names on the NTFS filesystem, which is case-insensitive by default (and thus technically risks conflating one or more unrelated ID values), it so happens that NTFS can be configured with case-sensitive path/file naming on a per-volume basis. Enabling the non-default behavior may fix the problem described here, but is rarely recommended since it alters expectations for any/all the disparate applications that inspect or access the volume. If you’re even considering this option, read and understand this first, and you’ll probably change your mind.

[3.]
I believe the total number of videos shown on the channel playlist page takes into account an exclusion for videos which are restricted according to the geographical region of the HTTP client. This accounts for any discrepancy between the number of videos listed for the playlist vs. channel.

https://webapps.stackexchange.com/a/101153