Assignment #3: Six Degrees Of Kevin Bacon Game

Find the people in the world whose Bacon Number is 7 or higher. (This number changes as the actor database is updated).

What Is The (Six Degrees of) Kevin Bacon Game?

For a description of the Kevin Bacon game, follow this occasionally unreliable link or look at Exercise 15.23 in the text. Try the game a few times and see if you can find someone with a Bacon Number higher than 3. To make the Hall of Fame, you must find someone with a Bacon Number of 7. There are several such actors or actresses, and between 4 and 9 (depending on whose database you believe) with numbers of 8. In this assignment, you will find all of them.

The Input

There are two data files; both have identical formats. These files are: actors file and actresses file. These files combined are 89.1 Mbytes (roughly 93,400,000 bytes) and were last updated Sept 23, 2000. These links require that you go through the www.cs.fiu.edu web-domain; they do not work from www.fiu.edu. These files are available in the AUL in the folder \\cougar\cop3530, and also /home/aul1/cop3530 on the Unix boxes. YOU MAY NOT COPY THEM INTO ANY OF THE LOCAL FIU MACHINES BECAUSE YOU WILL EXCEED YOUR DISK QUOTA.

These datafiles contain approximately 422,000 actors/actresses in a total of 144,000 movies, with 1,552,000 roles. These files also list TV roles, but you must not include TV roles in your analysis.

This is a large amount of data. My implementation, running on a 300Mhz Pentium II with 96 Mbytes of RAM and a slow hard drive using Windows 98 took 3 minutes to load, 20 seconds to compute the actors, and 20 seconds to run the destructors. It appears that on the AUL Windows NT systems you will get similar performance. From an account on ocelot I was able to process everything in 90 seconds.You will be hard pressed to get an answer on a machine with 64 Mbytes of RAM (my laptop thrashed to an answer after several hours). Your algorithms should run in these general parameters. You can get extra credit if you can significantly improve performance for the AUL machines.

Before you run on the large data sets, use the small file sample.list to debug the basic algorithms. In this data file, there are six actors, named a, b, c, d, e, and f, who have been in movies such as X, Y, and Z.

Input File Hints

Since it is not my input file, I cannot answer questions about it. Here are some observations that I used in my program, that should suffice (it differs in its answers from the oracle for some actors, but still produces the extreme numbers).

There are over 200 lines of preamble that can be skipped. This varies from file to file. However, you can figure it out by skipping all lines until the first occurrence of a line that begins with "Name", and then skipping one more.
There are many postamble lines, too, starting with a line that has at least nine dashes (i.e. ---------).
A name is listed once; all roles are together; the name starts in the first column.
A movie title follows the optional name and a few tab stops ('\t'). There are some messed up entries that have spaces in addition to tab stops.
The year should be part of the movie title.
Movies made in 2001 or later should be skipped.
A TV movie, indicated by (TV) following the year, is to be skipped.
Archive material, indicated by (archive footage), is to be skipped. (Otherwise JFK has a Bacon number of 1).
A TV series, indicated by a leading " in the title is to be skipped.
A video-only movie, indicated by (V) following the year is allowed.
Blank lines separate actors/actresses, and should be skipped.

Strategy

This is basically a shortest path problem. I'll discuss more details in class, but here's a broad outline. YOU MUST ATTEND CLASS, OR YOU WILL PROBABLY BE VERY CONFUSED. You can do everything using basic STL components vector, list, and map. You do not need to use any pointers at all, nor do you need any primitive C-style strings or arrays. However, you might find that if you use pointers, you might be able to speed up your algorithm, or at least reduce memory consumption.

Ideally, you would like to define a class that looks somewhat like this:

class Database
{
public:
    void loadFile( const string & fileName );       // Read file into Database
    void computeNumber( const string & start );     // Compute Bacon #s; start actor is 0
    void printEntry( const string & actor ) const; // Print Bacon path for any actor after computeNumber done

private:
    map<string,vector<string> > actorsToMovies;     // movies for each actor
    map<string,vector<string> > moviesToActors;     // actors for each movie

    map<string,int> baconNumber;                    // Filled in by computeNumber
    map<string,Entry> link;                         // Filled in by computeNumber

     // represents a single role; these are constructed by shortest path alg
   struct Entry
   {
     Entry( const string & a = "", const string & m = "" )
       : actor( a ), movie( m )
       { }

     string actor;
     string movie;
   };
};

When I used this strategy, I ran out of memory. So did the AUL machines. Thus, instead of storing a vector<string> as the values for the first two maps, you need to use a vector<int>. In other words, you would number each actor starting at 0 or 1 (your choice), and also number each movie, starting at 0 or 1. You would then have to add additional private data members. One would be an array listing all the actors (this would allow you to determine what the numbers in your map mean). A second would be a map that efficiently gives you the number for each actor. And you would then have two more data members for the movies. YOU NEED TO DO THIS EVEN IF YOU HAVE A MONSTER FAST HOME PC WITH MEMORY LIKE THE AMAZING KRESKIN.

Then run a single-source shortest path algorithm to find all Bacon numbers. Here is an algorithm sketch, minus error handling and C++ details, and the fact that you probably will not be able to use a map with vector<string> as the value.

void Database::computeNumber( const string & start )
{
mark all movies as unprocessed (you will need a local array of booleans)
mark all actors as unprocessed (you will need a local array of booleans)

initialize all bacon numbers to UNREACHABLE

    Create an empty Queue q
    baconNumber[ start ] = 0;
    q.enqueue( start );
    mark start as processed

    while( !q.empty( ) )
    {
      v = q.dequeue( );   // v is an actor

// Find each movie
const vector & movies = actorsToMovies[ v ];

      for each movie, thisMovie, in the movies array
      {
        if thisMovie is already processed
          continue;

mark thisMovie as processed

          // Find actors in this movie!
        const vector & actors = moviesToActors[ thisMovie ];
        for each actor, thisActor, in thisMovie
        {
          if thisActor is already processed
            continue;

          mark thisActor as processed
            // At this point we know that thisActor and v are in the same movie
            // So if v's Bacon number is k, thisActor's Bacon number is k+1

          baconNumber[ thisActor ] = baconNumber[ v ] + 1;
          link[ thisActor ] = Entry( v, thisMovie );
          q.enqueue( thisActor );
        }
      }
    }
}

After that is done, find the large Bacon numbers by scanning the baconNumber array, and print the entries by consulting the link array.

What To Submit

Submit complete source code and the actors/actresses with Bacon Numbers of 7 or higher. Include the complete paths for each of the actors/actresses (with shared movie titles). Also indicate how long your algorithm takes in the AUL. If it's somewhat fast, or provably space-efficient, you can get extra credit. You cannot receive credit for a working program if your program fails to produce a large set of actors and actresses with Bacon Numbers of 7 or higher. Note: the data you will work on and the data the Oracle uses (and the IMDB data) are all slightly out of synch. So you will not be able to exactly reproduce the results. You should be able to get the Oracle to agree that several of your answers have Bacon numbers of 8, and there is only one set of correct answers for the data that we have locally.

Due Date

This program is due on Thursday, October 19.

WARNING

Start early. Also, you will almost certainly want to use the optimizer to speed things up. In Visual C++, the essential options are -GX /GR -O2 (that's the letter "O" as in "Optimizer"). You may have to turn off other options to get the optimizer to work. You may need to set up command-line compiling so that Visual C++'s IDE does not use up the memory that your program needs. I will not answer questions about how to use the optimizer or command line compiling.

Additional details

click here for more details about managing space and time.