These datafiles contain approximately 422,000 actors/actresses in a total of 144,000 movies, with 1,552,000 roles. These files also list TV roles, but you must not include TV roles in your analysis.
This is a large amount of data. My implementation, running on a 300 MHz Pentium II with 96 Mbytes of RAM and a slow hard drive under Windows 98, took 3 minutes to load, 20 seconds to compute the actors, and 20 seconds to run the destructors. It appears that on the AUL Windows NT systems you will get similar performance. From an account on ocelot I was able to process everything in 90 seconds. You will be hard pressed to get an answer on a machine with 64 Mbytes of RAM (my laptop thrashed to an answer after several hours). Your algorithms should run within these general parameters. You can get extra credit if you can significantly improve performance on the AUL machines.
Before you run on the large data sets, use the small file sample.list to debug the basic algorithms. In this data file, there are six actors, named a, b, c, d, e, and f, who have been in movies such as X, Y, and Z.
Ideally, you would like to define a class that looks somewhat like this:
class Database
{
  public:
    void loadFile( const string & fileName );      // Read file into Database
    void computeNumber( const string & start );    // Compute Bacon #s; start actor is 0
    void printEntry( const string & actor ) const; // Print Bacon path for any actor after computeNumber done

  private:
    // represents a single role; these are constructed by shortest path alg
    // (declared before link, which uses it as a value type)
    struct Entry
    {
        Entry( const string & a = "", const string & m = "" )
          : actor( a ), movie( m )
        { }

        string actor;
        string movie;
    };

    map<string,vector<string> > actorsToMovies;    // movies for each actor
    map<string,vector<string> > moviesToActors;    // actors for each movie
    map<string,int> baconNumber;                   // Filled in by computeNumber
    map<string,Entry> link;                        // Filled in by computeNumber
};
When I used this strategy, I ran out of memory. So did the AUL machines. Thus, instead of storing a vector<string> as the values for the first two maps, you need to use a vector<int>. In other words, you would number each actor starting at 0 or 1 (your choice), and also number each movie, starting at 0 or 1. You would then have to add additional private data members. One would be an array listing all the actors (this would allow you to determine what the numbers in your map mean). A second would be a map that efficiently gives you the number for each actor. And you would then have two more data members for the movies. YOU NEED TO DO THIS EVEN IF YOU HAVE A MONSTER FAST HOME PC WITH MEMORY LIKE THE AMAZING KRESKIN.
Then run a single-source shortest-path algorithm to find all Bacon numbers. Since every edge has the same cost (appearing in one movie together), this is simply a breadth-first search.
Here is an algorithm sketch, minus error handling and C++ details, and ignoring the fact that, as explained above, you will have to use vector<int> rather than vector<string> as the map values.
void Database::computeNumber( const string & start )
{
    mark all movies as unprocessed (you will need a local array of booleans)
    mark all actors as unprocessed (you will need a local array of booleans)
    initialize all bacon numbers to UNREACHABLE

    Create an empty Queue q
    baconNumber[ start ] = 0;
    q.enqueue( start );
    mark start as processed

    while( !q.empty( ) )
    {
        v = q.dequeue( );    // v is an actor

        // Find each movie for v
        const vector & movies = actorsToMovies[ v ];
        for each movie, thisMovie, in the movies array
        {
            if thisMovie is already processed
                continue;
            mark thisMovie as processed

            // Find actors in this movie!
            const vector & actors = moviesToActors[ thisMovie ];
            for each actor, thisActor, in thisMovie
            {
                if thisActor is already processed
                    continue;
                mark thisActor as processed

                // At this point we know that thisActor and v are in the
                // same movie, so if v's Bacon number is k, thisActor's
                // Bacon number is k+1
                baconNumber[ thisActor ] = baconNumber[ v ] + 1;
                link[ thisActor ] = Entry( v, thisMovie );
                q.enqueue( thisActor );
            }
        }
    }
}
After that is done, find the largest Bacon numbers by scanning the baconNumber array, and print the entries by consulting the link array.