Random TinyURL Browser (Updated)
Posted 1/26/2005 01:27:00 AM |

Click here to visit a random TinyURL page (Note - pops up a new browser. Keep clicking this link to refresh that page.)
This clicking began to fascinate me, as I realized I was in essence snooping on information transactions conducted by hundreds or thousands (depending on how many times you click, I guess) nameless people around the globe. I discovered maps, company documents, pictures, links to forums, technical documents, blog postings, ebay pages, and a whole host of other strange content. It was really quite interesting. This got me thinking about how TinyURL works, which then led me to start questioning why my "random" browser was actually giving me so many good results.
So, this is a simplistic implementation of the TinyURL hashing algorithm. According to my initial calculations (based on their published stats of 4.2 M established TinyURL's), only 7.2 percent of clicks (Total dimension space is &Sigma k=0...5 (36^k)) will result in valid (non-TinyURL notfound) pages. I've been absent-mindedly clicking the link and I'm definitely not seeing 9.3 out of every 10 clicks returning the TinyURL notfound page.
The question is, why does my random implementation perform well against TinyURL's data set? And if borne out over longer intervals, how well would it perform? I've managed to dig up an answer, which has two parts. The first part of the answer deals with how TinyURL assigns names for compressed URL's, and the second involves how my algorithm assigns random values.
As we have established, there are 62,193,780 possible values for TinyURL's. TinyURL's are generated by a Base 36 hash (36 indicating the number of characters a-z and 0-9, the array of possible values out of which a TinyURL can be constructed), autoincremented by MySQL with an initial value count of zero. What is a hash function? Imagine five buckets in a row. Now, starting with bucket one, we'll drop one of 36 characters into the bucket, and check to make sure that for all other bucket sets out there, there isn't a bucket that also has that particular character in the first bucket. If there is, we'd insert a character in the second bucket and run that comparison again, and so on and so forth . A hash is extremely efficient computationally, because instead of doing a tree comparison against all 62 million values, we're incrementally comparing a set of 5 characters that has only 36 possible values.
Imagine the scenario where you'd filled 4 buckets with characters, but you'd still found matches and had to extend to the fifth bucket. For each of the 36 values you might drop into the bucket, there exists a probability of 1 in 62 million that you'd have a match and have to start the computation over again. All things considered, those are good odds!
As I've attempted to demonstrate, the hashing algorithm naturally prefers to use less characters to represent the URL value. By that, I mean, if the algorithm decides that a tinyurl extention of "7ds" is not already in existence, it will use this value and not append any more characters. This means that for our two-dimensional variable space, we see a heavy front-loading effect for the starting characters of the hash. What I mean by this is that for shorter URL's, there is a greater chance that a TinyURL has been assigned to that value.
So, how does my algorithm work? It first gets a random number 1-5, and assigns a random character for each value of the random number. For example, if 2 is generated, we might get back a result of "x2", "ce", "0s" and so on. Since we've established that the hash tends to populate smaller values first, an interesting situation arises. Here's the power of exponents:
36^1 = 36 (eg, http://tinyurl.com/h)
36^2 = 1296 (eg, http://tinyurl.com/h3)
36^3 = 46656 (eg, http://tinyurl.com/h3j)
So, for all possible hash combinations, I'd say its fairly likely that our greedy algorithm has assigned 99.999 percent of combinations for all 3 character tinyURL's. Expanding this to include a fourth character
36^4 = 1,679,620 (eg, http://tinyurl.com/h3jz)
Even that dimension space would be a good deal utilized, possibly 90 percent.
Seeing as my algorithm arbitrarily assigns a total string length, having something appear in our strong confidence space (4 char TinyURL or less) is 80 percent likely. In fact, getting a 5 character TinyURL that works is the real sort of anomaly in this system. I don't have the numbers I'd need to bear out a more precise calculation, but I'd be willing to wager that 8 out of every ten clicks on that randomizer will give you a TinyURL that works (or worked at one time).
Update
Now that I basically understand how TinyURL works, it was fairly easy to implement a random TinyURL generator that displays zero preference, and uses a base 36 hash. I don't think I've got TinyURL's algorithm right, though, because I'm getting lots of 404 erors.
Almost too random TinyURL page (Note - pops up a new browser. Keep clicking this link to refresh the page.)
Permalink |
|
to this post
View blog reactions | Post to
2 Comments: (Post a Comment)
- At January 26, 2005 1:21 PM, said...
-
(This is Simon).
There's no reason to assume that a hashing function is involved; indeed it would be odd if one were being used.
A much more plausible assumption would be that all n character urls are generated before any n+1 urls are.
Alternatively, URLS might be generated by sequential probes, but that need not involve any hashing; plain (possibly pseudo) random numbers are all you need.
You could get more useful info from your chosen ciphertext approach by randomly probing 4-space and finding how dense it is.
You could also use a chosen plaintext approach by generating tiny-urls for a variety of inputs and looking for patterns in the output - e.g. frequency of 6-char or 4-char urls. - At December 19, 2007 1:51 AM, said...
-
The risk is clicking on a link that leads you to a malicious website and have virus/spyware installed on your PC.
Btw, I don't use TinyURL.com but rather http://www.tiny9.com because it has richer features.


