Saturday, October 30, 2010

The math behind bit.ly and other url-shortening services and other numbers on the web

My son, who is 12 years old, was recently assigned homework dealing with outcomes.  These are the calculations that feed into probability calculations. We use outcomes all the time to calculate our chances of an event occurring given a predetermined number of possibilities surrounding prior events.

If Janie has a red blouse, a blue blouse and a green blouse and she also has blue jeans, khaki shorts and black capri pants, how many outfits can she create?
Aside from the  possible Glamour donts that Janie could create, we can calculate Janie's number of outfits by multiplying the number of blouses (3) times the number of leg-wear (3) and get 9 total outfits.  This can be generalized by saying that the number of outcomes is the product of the total possibilities of each subsequent event.  This can be represented by a tree diagram as well.


This math is actually the basis of our numbering system. Every number we use every day from our age to our shoe size can be represented as the outcome of a tree diagram.  Each digit in any number is represented by one of 10 possible characters 0, 1, 2, 3, 4, 5, 6, 7, 8 or 9. (These are called Arabic numerals.)  The second digit of a two-digit number can also have 1 of these 10 different characters.  Each digit in a number has 10 possibilities. If we draw a tree diagram for a 2 digit number, it has 100 possible branches.  Below is a tree diagram for a 2 digit number (I left out 2,3,4,5,6,7,8 for space reasons).

Each additional digit increases the number of possible outcomes by a factor of 10.  So a 3 digit number has 10^3 possibilities or 1,000.

You may see numbers mixed with letters on the, such as FF6.  This is a number, but instead of using 10 Arabic numerals for each digit, the character set has been expanded by 6 to include A, B, C, D, E, and F.  This yields a total of 16 characters for each digit.  This is called "base 16" or "hexadecimal". Our out numbering system is "base 10" or "decimal". Each additional digit in a hexadecimal number increases the number of outcomes by a factor of 16.

Anyone who has some knowledge of Cascading Style Sheets or HTML has come across hexadecimal in color codes.  All colors on the web can be represented by numbers and those numbers are often represented in hexadecimal instead of decimal. You can see breakdowns of the colors in both numbering systems on many websites. (for example, http://html-color-codes.com/ and http://www.december.com/html/spec/colorhex.html)

The reason behind this has to do with how computers think and how early computers stored numbers in memory.  Computers at their simplest operate in base 2 meaning they have two options, off and on. This is typically represented by the numbers 0 and 1.   Each on/off digit was called a "bit" and in fact on/off fields database tables are still called bit field.  Early computers, notably the 6502 microprocessor (which ran the Apple ][+), used 8 bits together to process calculations.  Half a byte was 4 bits. And with a 4 digit number in base 2, you have a total of 2x2x2x2 or 2^4 combinations or 16 total.  And instead of writing out 4 digit numbers, computer memory was often represented in base 16. (The name of this blog, FF69g, is the computer instruction that the original Apple ][+ used to exit from machine language mode and it refers to memory location FF69, which is a hexadecimal number)

Let's take a moment and talk about what Bit.ly and other url shortening services do.  They are very simple services in concept.  With an url shortening service you can take a long web address and shrink it down, like putting a wool sweater in a dryer, only when you put the sweater back on it magically returns to its correct size. Say you have a very long web address, for example if you search for "what is the capital city with the longest name?" on Google the actual results are in this web address:

http://www.google.com/search?q=what+is+the+capital+city+with+the+longest+name%3F&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a

But wit an url shortening service, that search becomes

http://bit.ly/db3e3Q

The longer url can be hidden on a web page using the tag.  That search is linked here, for example. But with non-HTML services, such as text emails and text messages and Twitter, the longer url is simply inconvenient or impossible.  So url-shortening services evolved and now there are hundreds of them. They generally have short domain names such ".ly" or ".co".  (See recent controversy surrounding ".ly")

So bit.ly and other search services save the long url in a database table and assign a code to it. And when your browser requests a bit.ly url, the code that looks random (db3e3Q in this case) is used to look up the correct url in the table and redirect your browser to the destination, in this case a Google search.

Each url that bit.ly stores get it's own unique lookup code.  And sinceI assume bit.ly stores billions of urls, the lookup codes could get very long (and I'm assuming there are billions since I don't really know).

So a 10-digit lookup code using Arabic numerals would have about 10 billion possible outcomes  from 0000000000000 to 9999999999999.  In the computer world and the search world, that's not that much.  That means if everyone in China searched 10 times the service would run out of lookup codes if they wanted to use only 10 digits.

In order to increase the number of possible outcomes without increasing the number of digits, bit.ly added upper and lower case letters to each digit.  So instead of 10 possible characters each digit has 62, because there are 10 Arabic numerals, plus 26 lower case letters and 26 upper case letters. The number of possible outcomes for a 1 digit code with 62 characters is 62.  The number possible outcomes for a 10 digit code with 62 possible characters in each digit is 62 to the 10th power or 839,299,365,868,340,000 or, in words,  eight hundred thirty nine quadrillion two hundred ninety nine trillion three hundred sixty five billion eight hundred sixty eight million three hundred and forty thousand.




Now, math purists (and bit.ly mathematicians) may say I've actually undercounted the possibilities because not all 10 digit numbers are used, so, in fact, the remaining digits in any combination could be 63.  For example, 0000000000 and 0 could be considered different lookup codes by the bit.ly servers (and they probably are).

This means that the actual number of outcomes is the sum of the various-length codes from 1 digit (62 codes) to 10 digits (the aforementioned long number).

I've never seen a bit.ly url that has only one digit in the lookup code, so perhaps they've limited the number of digits in their lookup codes. And in fact I have no idea if they even use 10 digits.

In fact, I don't know anything about how bit.ly works at all. I'm just assuming this since I had to create a similar system for my QR code tracking service that I built for fun (http://www.bloty.com) which might one day become a way for people to tell you that your lights are on in your car or that you are illegally parked while sitting at a restaurant (http://www.notesonawindshield.com, but it's in beta, so apologies for errors).

My code to generate the look-up using 62 possible digits is written in PHP and I've cut and pasted it below. No apologies if it doesn't work for you, I only recently learned PHP and it may or may not work exactly as planned. "UUID" stands for universal unique identifier.

  function uuid()
  {
//62 ITEMS IN THE ARRAY
  $arr = array(0=>0,1,2,3,4,5,6,7,8,9,0,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z);
  $uuidtemp='';
 
  $i=1;
    while($i<=12)
          {
          $uuidtemp = $uuidtemp . $arr[mt_rand(0,61)];
          $i++;
          }
  return $uuidtemp;
  }