Running simulations and analysing data

My first post in a long time. This is more of a journal entry for me to look back at when I need to.

My PhD project mostly involves running simulations of hundreds of people evacuating from a building and then analysing those simulations in various ways. While the MASON framework in Java helps a lot in the implementation of the model itself, something just as interesting, and in the end a lot cooler, is running all those simulations, collecting the data, and analysing it.

Step 1: Running multiple simulations

MASON allows you to run simulations in two major ways: with the GUI, in which you get to watch the simulation as it runs, or in console mode. The GUI is very useful, even essential, when creating and debugging the model; however, when it comes to actually running simulations and gathering data for analysis, it is quite obviously impractical. This is where console mode comes in handy: it lets you run several replications of the required simulation with the required seed, and initially I used MASON's handy built-in function for this.

I also needed to store the simulation-specific settings somewhere. At first I used constants scattered across various classes, then moved all the constants into one class, which was a lot more convenient to change, and finally settled on the much more practical option of an XML file, which can easily be read using JAXB in Java (though I think I might change to an SQL-based implementation soon). Anyway, the point is, I am able to run my simulation using its jar file and an XML file with all the parameters that are used for the simulation.
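
Just to illustrate the idea, here is roughly what that JAXB binding can look like. This is only a sketch: the SimulationSettings class and its parameter names are invented for the example, not taken from my actual model.

import javax.xml.bind.JAXB;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;
import java.io.File;

// Hypothetical settings class: JAXB maps each XML element to a field, so a
// file like <settings><agentCount>200</agentCount>...</settings> binds directly.
@XmlRootElement(name = "settings")
public class SimulationSettings {
    @XmlElement public int agentCount;
    @XmlElement public double preferredSpeed;
    @XmlElement public long seed;

    public static SimulationSettings load(File xmlFile) {
        // One call reads and binds the whole settings file.
        return JAXB.unmarshal(xmlFile, SimulationSettings.class);
    }
}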

Step 2: Storing data

The next step in this process was collecting data from these simulations. To get started, I stored the data as simple text files in CSV format, which I analysed in Excel. Pretty soon this became extremely impractical because of the amount of data I had to store. So I changed to storing it in binary format and created a parser which would convert the generated binary files to text files. I could have used some of the analysis tools available for Java, like those provided by the Apache framework, but I was quite lazy, and I was working with someone who wanted the text files so that he could analyse them in Matlab, so I settled on a binary file with a parser to convert it to text.
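
The scheme, roughly, looks like the sketch below. The record layout (step, agent id, position) is just an assumption for the example, not my actual format.

import java.io.*;

// Sketch of a binary log and the parser that turns it back into CSV.
public class BinaryLog {

    // Write one fixed-size binary record per agent per step.
    static void writeRecord(DataOutputStream out, int step, int agentId,
                            double x, double y) throws IOException {
        out.writeInt(step);
        out.writeInt(agentId);
        out.writeDouble(x);
        out.writeDouble(y);
    }

    // Read the binary file back and emit one CSV line per record.
    // available() is an acceptable end-of-file check here because the
    // stream is backed by a local file.
    static void toCsv(File binFile, PrintWriter csv) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(binFile)))) {
            while (in.available() > 0) {
                csv.printf("%d,%d,%f,%f%n",
                        in.readInt(), in.readInt(),
                        in.readDouble(), in.readDouble());
            }
        }
    }
}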

However, despite the organised file hierarchy and names, this was still very difficult to analyse and keep organised, and it was still very big. There were also a lot of complications when I was writing from multiple runs, experiments, etc. So I switched to what I should have done from the start: a relational database. I set up a MySQL server instance on my lab computer and wrote all the required data to the database at the end of each run of the simulation.
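
Writing the results at the end of a run then takes only a few lines of JDBC. Again, this is a sketch: the runs table and its columns are hypothetical, and the MySQL Connector/J driver is assumed to be on the classpath.

import java.sql.*;

// Sketch: store one summary row per simulation run in a hypothetical "runs" table.
public class RunRecorder {
    static void storeRun(long seed, double evacuationTime) throws SQLException {
        String url = "jdbc:mysql://localhost:3306/simulations";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement insert = conn.prepareStatement(
                     "INSERT INTO runs (seed, evac_time) VALUES (?, ?)")) {
            insert.setLong(1, seed);
            insert.setDouble(2, evacuationTime);
            insert.executeUpdate();
        }
    }
}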

Step 3: Analysing the data

Excel being boring, I shall not go into the details of how I did the analysis initially. Once I had the data in MySQL, I needed some tool to analyse it. That's when my prof recommended using matplotlib in Python. I had used Python before to write a simple script that cleaned up references in a text file, but I had hardly used it for anything else, even though I liked the language a lot. So I decided to give it a try. Interestingly enough, I had a lot of trouble finding a free library for MySQL. But once I finally found MySQLdb, querying the data, analysing it, and getting some neat graphs took hardly a few lines of code. So now, once I have the data, I can simply run the Python script and get all the charts I need.
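
Those few lines look something like this. The query and the column names are placeholders for the example, not my actual schema.

# Sketch: pull run results out of MySQL and plot them.
import MySQLdb
import matplotlib.pyplot as plt

conn = MySQLdb.connect(host="localhost", user="user",
                       passwd="password", db="simulations")
cursor = conn.cursor()
cursor.execute("SELECT seed, evac_time FROM runs ORDER BY seed")
rows = cursor.fetchall()
conn.close()

seeds = [row[0] for row in rows]
times = [row[1] for row in rows]

plt.plot(seeds, times, 'o-')
plt.xlabel("Seed")
plt.ylabel("Evacuation time")
plt.savefig("evac_times.png")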

Step 4: The power of the cloud

A single run of my simulation can take up to 5 minutes. For 100 replications under each of 6 different settings (which is what I needed for that particular run), that works out to about 3000 minutes, or 50 hours, or just over 2 days. While that is not terrible, I needed my computer for other things, and I work at the parallel and distributed computing centre, so it would have been a waste not to make use of all that computing power at our disposal. So I got myself an account on the cluster and created a simple shell script that would run the simulation with fixed settings. Eventually I extended it so that it reads parameters from a separate text file, modifies the XML file appropriately, runs the simulation the required number of times, and finally sends me an email at the end of the run. Here is the code for this first script:

#!/bin/bash
# runSimulations: runs the simulation for every combination of the parameter
# values listed in the settings file (file 1), editing the xml parameter
# file (file 2) before each run

opath=$PATH
PATH=/bin:/usr/bin

case $# in
  0|1) echo 'Usage: runSimulations settingsFile xmlFile' 1>&2; exit 1
esac

awk -v xmlFile="$2" '
# Each line of the settings file holds a parameter name followed by all
# the values to try for it.
BEGIN {
  totalCount = 1
  startingPoint[1] = 1
}
{
  # Remember the parameter name and append its values to one flat list;
  # startingPoint[j] marks where parameter j's values begin in that list.
  model[NR] = $1
  startingPoint[NR+1] = startingPoint[NR]+NF-1
  for(i=2;i<=NF;i++){
    completeValuesList[totalCount] = $i
    totalCount++
  }
}
END {
  startingPoint[NR+1] = totalCount-1
  for (j=0; j<=NR; j++){
    indices[j] = 0
  }

  # Iterate over every combination of parameter values. indices[] works
  # like an odometer over the value lists; indices[0] turning 1 means
  # every combination has been run.
  while(indices[0]!=1){
    for(j=1;j<=NR;j++){
      value[j] = completeValuesList[startingPoint[j]+indices[j]]
      command = "overwrite " xmlFile " xmlParser " model[j] " " value[j] " " xmlFile
      # print command
      system(command)
    }
    testCommand = "grep FilePath " xmlFile;
    testCommand |getline filePathLine
    close(testCommand)
    seed = 1
    javaCommand = "java -cp dist/CrowdSimulation.jar app.RVOModel -repeat 100 -time 100 -seed " seed
    # print javaCommand
    system(javaCommand)
    # Advance indices[] like an odometer: when a parameter has used up
    # all its values, reset it and carry over to the previous parameter.
    for(j=NR;j>=1;j--){
      if(startingPoint[j]+indices[j]==startingPoint[j+1]){
        indices[j]=0
        indices[j-1]++
      }else if(j==NR){
        indices[j]++
      }
    }
  }
}' "$1"
echo "$1 $2 run complete" | mail -s "Run Complete" vaisaghvt@gmail.com

For anyone with a little experience in shell scripting, this might seem like crap, so if you are bored enough to go through it and you know some shell scripting, please do give me any suggestions you have. That was the code for my first project. In my second project, I have changed my approach to having a separate class for each experiment. Initially I also did the work of connecting to each cluster and initializing the job manually; now I have automated this too. I specify the experiment and settings to be run, the script dispatches the jobs to the specified set of clusters, and, as above, I get emailed at the end when the data is available.

#!/bin/bash
opath=$PATH
PATH=/bin:/usr/bin

case $# in
  0) echo 'Usage: runExperiment classToBeRun' 1>&2; exit 1
esac
program=$1

# Each entry pairs a cluster node with the parameter value it should run.
for cluster in "c0-0 0" "c0-1 20" "c0-2 40" "c0-3 60" "c0-4 80" "c0-5 100"
   do
      # Split the pair: $1 becomes the node name, $2 the parameter.
      set -- $cluster
      ssh $1 "nohup ./runCommunication.sh $program $2 2> 2_$2.log 1> 2_$2_1.log < /dev/null &"
      echo "assigned to $1"
   done

SSHing to a remote machine and running the command with nohup were the two most difficult parts of this. nohup lets the process keep running even after you disconnect from the machine. The & at the end makes the process run in the background, so that you can disconnect and move on to the next machine or do other things. The output is redirected to log files so that I can keep track of what is happening. And finally, something that took me a long time to figure out: you should set input to be received from /dev/null, otherwise you will not be able to disconnect from that particular remote machine.

#!/bin/bash
opath=$PATH
PATH=/bin:/usr/bin

case $# in
  0|1) echo 'Usage: runSimulation classFile parameter' 1>&2; exit 1
esac
# Run the given experiment class with its parameter.
java -cp IBEVAC.jar "$1" "$2"

echo "$1 $2 run complete"|mail -s "Run Complete" vaisaghvt@gmail.com

There’s still a lot more automation I can and plan to do. But as of now, I’m in a state where I can run simulations quite easily, and I won’t be changing things much for some time. Next stop: getting a proper gitflow happening with NetBeans or Eclipse.