Mutual Information in R

In my endeavors as a graduate student in image processing, I have found R to be an incredibly useful tool for testing hypotheses and rapid prototyping. Ultimately everything needs to be implemented in a compiled language like C++ to make a useful tool, but R has saved me a lot of time by letting me rule out bad ideas before implementing them. I am going to share my experience of using R to do feature selection.

I am trying to predict the position on a surface using shape descriptors. I have a training set of about 50 point clouds representing the surface of the corpus callosum. For each point in the training set, I compute geometric moments. These moments boil all the information in a spherical region around the point down to a single number. They are more sophisticated than simple descriptors like curvature and should capture the relationship between position on the surface and the shape of the corpus callosum.

The problem with using geometric moments is that there are infinitely many of them to choose from, and many of them are redundant or not very useful. Therefore I want a way to try a lot of them and quickly rule out the useless ones. I will be using mutual information (MI) for that. MI uses entropy to measure how much two random variables depend on one another. An MI of 0 indicates no dependence (and therefore no relationship); higher values indicate a stronger relationship.

For each point, I write its medial axis coordinates and the shape descriptors to a CSV file in the following format:

AxialDistance, Angle, d0, d1, d2, d3, d4, d5, d6, d7, d8
0.00388039, 1.66916, 0.137261, 1, 2.4811e-07, 0.0446063, 0.0102502, 1, -1.07544e-06, 0.071294, 0.00606162
0.00466314, 1.03462, 0.288215, 1, -1.34452e-06, 0.0483304, 0.0120159, 1, -2.80076e-06, 0.072637, 0.00579168
~378,000 more rows like this

I’m trying to figure out which shape descriptors will best predict the medial axis coordinates. One way to eyeball this is to make an x-y plot with the descriptor on one axis and the variable you’re trying to predict on the other. For example:

data <- read.csv("fs_lowres_2mm_o4_i4.csv")
plot(data$d5, data$AxialDistance)

produces a plot similar to the following:

This particular descriptor seems to have a relationship (although a nonlinear one) with axial distance. That makes it a pretty good descriptor in comparison to one whose plot looks nearly uniformly random:


In my case, I am trying to pick a set of useful feature descriptors in a relatively automatic way, so looking at plots is far too time-consuming. Instead, I can calculate the mutual information for each descriptor and only look at the ones significantly above 0.

# install.packages("entropy") if you don't have the entropy package
library(entropy)
disc <- discretize2d(data$AxialDistance, data$d0, numBins1=16, numBins2=16)
mi.plugin(disc)

So here we see what we expect: the descriptor whose plot looks nearly uniformly random has an MI of 0.043, while the one with a clearly patterned relationship has an MI of 0.905!
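To automate the ranking rather than checking one descriptor at a time, a short sketch (assuming the column names from the CSV above and that the entropy package is installed) computes the MI for every descriptor at once:

```r
library(entropy)

data <- read.csv("fs_lowres_2mm_o4_i4.csv")
descriptors <- paste0("d", 0:7)

mi <- sapply(descriptors, function(col) {
  # Bin the joint distribution of the target and the descriptor, then
  # estimate MI from the 2D histogram
  disc <- discretize2d(data$AxialDistance, data[[col]], numBins1=16, numBins2=16)
  mi.plugin(disc)
})

# Descriptors sorted from most to least informative
sort(mi, decreasing=TRUE)
```

From the sorted vector it is easy to keep only the descriptors whose MI is well above 0.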

I hope this helps!

j j j

Use Webmail as the Default “mailto” Handler in Windows

Note: A guide for the impatient is at the bottom.

I was recently trying to configure Internet Explorer to use Comcast webmail as the default handler for mailto links. As I’m not really a Windows user, I was appalled at the dreary selection of add-ons for IE and the inconvenience of Windows’ “Default Programs” manager. From “Default Programs” you would expect to be able to pick any application you want as the “mailto” handler, but this is not the case.

I found an article describing how to add handlers for any type of URL, and went from there to construct a hack that allows you to use any website as a mailto handler.

It appears that Windows/IE has the following pipeline for handling URLs when they are clicked:

  1. The text of the URL is analyzed to infer the protocol (http, ftp, etc.)
  2. The registry is searched to see if a handler exists for that protocol
  3. The handler is invoked in the way specified by the “open” key in the registry

To jump right in, let’s set up our own mailto handler.

Start regedit and navigate to the following key: HKEY_CLASSES_ROOT\mailto\Shell\open\command

This key should have a value “(default)” that specifies a command to run when a mailto URL is clicked. If you have Outlook installed, it should look something like this:

"C:\PROGRA~2\MICROS~1\Office14\OUTLOOK.EXE" -c IPM.Note /m "%1"

This simply specifies the location of an executable and the arguments to pass to it. In this case, %1 refers to the entire URL that was clicked. So for example, if I were to click a link with the URL “mailto:friend@example.com”, the %1 would be replaced with the text “mailto:friend@example.com”.

Our hack will be to write a batch file that will take this URL, remove the first seven characters (“mailto:”), and give the email address to our favorite webmail service in our favorite browser.

Supposing that C:\Users\Kris\handle-email.bat is the location of our custom script, changing the “(default)” value to the following will run the script when a mailto URL is clicked.

C:\Users\Kris\handle-email.bat %1
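If you prefer not to edit the value by hand in regedit, the same change can be sketched as a .reg file you import (the path here is the example from above; adjust it to wherever your script actually lives):

```
Windows Registry Editor Version 5.00

[HKEY_CLASSES_ROOT\mailto\Shell\open\command]
@="C:\\Users\\Kris\\handle-email.bat %1"
```

Double-clicking the file in Explorer (or running it through “reg import”) merges it into the registry.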

This just leaves making the actual script. Create a file called handle-email.bat and place it somewhere out of the way. Open it in your favorite text editor and paste the following:

set address=%1
set address=%address:~7%
start iexplore ""

The first line assigns the argument provided to the script to a variable called “address.” In our setup, this will be something like “mailto:friend@example.com”. The second line removes the first seven characters to strip the “mailto:” prefix. The last line starts IE and points it at the resulting URL. In this example, it goes to Comcast’s compose-message page and populates the “To:” field. Similar “magic” URLs exist for Gmail and Windows Live.


For the impatient:

Edit “(default)” in HKEY_CLASSES_ROOT\mailto\Shell\open\command to contain the following:

C:\path\to\your\script.bat %1

Make the script:

set address=%1
set address=%address:~7%
start iexplore ""

j j j

Broken Download Links Fixed

I had an issue where all of the download links on the site were broken. I have no idea how long this had been a problem, because everything worked fine when I first set the website up.

First, I discovered that WordPress had been putting the files I “recently” uploaded in the wrong place. By default, WordPress looks in INSTALL_DIR/wp-content/uploads for uploaded content. However, looking at the source code, I found that this behavior can be overridden by a setting in the database. Somehow, that had happened to me. I deleted the record from the database and tried again.
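As a sketch of that cleanup: the override lives in the wp_options table, and upload_path / upload_url_path are the usual option names for it — but verify against your own install before deleting anything:

```sql
-- Look for an override of the default upload location
SELECT option_name, option_value
FROM wp_options
WHERE option_name IN ('upload_path', 'upload_url_path');

-- If a stale path is set, deleting the rows restores the default
-- INSTALL_DIR/wp-content/uploads behavior
DELETE FROM wp_options
WHERE option_name IN ('upload_path', 'upload_url_path');
```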

Even after I moved the files to the right location, I was still getting 404 errors! What was strange was that these were WordPress 404 pages, not Apache’s. I deleted my “.htaccess” file, cleared my browser cache, and refreshed the page. Direct links to the files now worked, but obviously the rest of the site was broken.

To fix the .htaccess file, I restored what I had, ran “chmod 666 .htaccess”, then went into WP admin -> Settings -> Permalink Settings, changed the permalink setting, hit apply, then changed it back and hit apply again. This (presumably) forced WP to rebuild my .htaccess file.

All is now well again.

j j j

Timing Parallel Algorithms in C++

It seems to have become common knowledge that if you want to time something in C++, you use clock(). Typically you construct something that looks like this:


#include <ctime>
#include <iostream>

using namespace std;

int main()
{
      clock_t startTime, endTime;

      startTime = clock();
      // Do some computationally expensive thing
      endTime = clock();

      cout << "It took " << (endTime - startTime) / (CLOCKS_PER_SEC / 1000) << "ms." << endl;
      return 0;
}
I’m sure most seasoned C++ developers have written something like this at one point or another. Unfortunately, this breaks down when timing a multithreaded algorithm. The clock() function is defined so that it returns the amount of CPU time consumed, not the real time elapsed — and with multiple threads, CPU time accumulates across all threads, so the measurement overstates the actual runtime. Perhaps this is common knowledge to some, but the reference documentation on popular sites simply states: “Returns the number of clock ticks elapsed since the program was launched.” That is incredibly ambiguous to me, because “clock ticks” could mean either the 2.6 billion ticks per second my CPU receives or the number of ticks of the built-in clock that have elapsed.

Regardless, that was tangential. To get a reasonably accurate measure of the actual time elapsed, you can use gettimeofday(), which populates a struct with the current time in seconds since the epoch (plus a microsecond component).


#include <sys/time.h>
#include <iostream>

using namespace std;

int main()
{
      struct timeval startTime, endTime;
      long totalTime;

      // The second parameter is the timezone, but it's ignored in newer Linux kernels
      gettimeofday(&startTime, NULL);
      // Do some computationally expensive thing
      gettimeofday(&endTime, NULL);

      // Combine the two components of timeval into a single microsecond value
      totalTime =  (endTime.tv_sec - startTime.tv_sec) * 1000000L;
      totalTime += (endTime.tv_usec - startTime.tv_usec);

      cout << "It took " << (totalTime / 1000L) << "ms." << endl;
      return 0;
}
And there you have it! I’d be interested to see whether this method is any more accurate on a real-time Linux kernel.

j j j

Tracking Memory Leaks in C++

This may be common knowledge for a lot of you, but Valgrind is a dynamic analysis tool that discovers memory leaks for you. It can tell you exactly where leaked memory was allocated; from there you can use your intuition to decide where the memory should be freed.

The greatest thing is how easy it is to use! I just removed half a dozen memory leaks from my undergraduate thesis project in about 35 minutes with no prior experience. The only “trick” to using Valgrind is ensuring that you compile your project with debugging flags turned on (“-g” for gcc and g++). That embeds line number information in your executable so that Valgrind can generate useful output.

The Valgrind Quick Start page is by far the best introduction to Valgrind.

j j j