All of my diligent research seems to have finally paid off. I now have some results worthy of a Master's Thesis. Dr. Menzies and I had been staring at these results for a long time, thinking they were bad, because we did not know how to interpret what we were looking at. A little before Christmas, Dr. Menzies suggested what he calls the Koru Diagram. Below is an example of the Koru Diagram, followed by an explanation of it.

The X-axis of a Koru Diagram is the percentage of lines of code explored so far. The Y-axis is the percentage of the total defects found in the file. The red line, labeled Oracle in the key, shows the perfect rule; that is, it finds only defects. Its output is sorted by module size (modules are literally C functions or C++ methods, depending on the codebase used) and then plotted in this diagram. The Oracle represents the best possible detector: it illustrates the least amount of effort required to find all of the defects. Since the items in the data file that each detector says are defective are sorted in ascending order, no line will ever exceed the Oracle line. The X-axis, Y-axis, sort order, and Oracle make up the Koru Diagram. The other lines on the graph show how well the detectors generated by the different learners perform. The fact that Launam performs better than Manual in most cases suggests that there is a disproportionate number of defects in small modules, which is the reason we decided to graph the results using a Koru Diagram in the first place.
A key thing to notice here is the pair of lines labeled "manual" and "launam." These are really manual methods of searching; there is no detector associated with them. Manual simply sorts ALL modules in a file in ascending order and inspects them, whereas Launam sorts them in descending order and inspects them that way.
Another key thing to notice is the yellow line, which2. which2 is a machine learner: it is which with 2 equal-frequency bins. That is, we split the numbers into the bottom 50% and the top 50% and convert them to bin1 and bin2. In almost all of the cases, which2 is the best version of which. I am not entirely sure what to say about this; I do not know what it means that 2 bins works best. Also notice how nearly perfectly which2 performs.
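To make the construction concrete, here is a minimal sketch of how the points on a Koru Diagram could be computed. It is not the script I actually used; the `(lines_of_code, defect_count)` record, the `koru_curve` helper, and the four-module data set are all made up for illustration.

```python
from typing import List, Tuple

# Hypothetical module record: (lines_of_code, defect_count)
Module = Tuple[int, int]


def koru_curve(inspected: List[Module],
               all_modules: List[Module]) -> List[Tuple[float, float]]:
    """Return (% LOC explored, % defects found) points, in inspection order."""
    total_loc = sum(loc for loc, _ in all_modules)
    total_defects = sum(defects for _, defects in all_modules)
    points = [(0.0, 0.0)]
    loc_seen = 0
    defects_seen = 0
    for loc, defects in inspected:
        loc_seen += loc
        defects_seen += defects
        points.append((100.0 * loc_seen / total_loc,
                       100.0 * defects_seen / total_defects))
    return points


def by_size(module: Module) -> int:
    return module[0]


# Made-up data: four modules, two of them defective.
modules = [(10, 1), (20, 0), (30, 1), (40, 0)]

# Oracle: inspect only the truly defective modules, smallest first.
oracle = koru_curve(sorted((m for m in modules if m[1] > 0), key=by_size),
                    modules)

# Manual searches: inspect every module, ascending or descending by size.
ascending = koru_curve(sorted(modules, key=by_size), modules)
descending = koru_curve(sorted(modules, key=by_size, reverse=True), modules)

print(oracle)      # [(0.0, 0.0), (10.0, 50.0), (40.0, 100.0)]
print(ascending)
print(descending)
```

On this toy data the Oracle reaches 100% of the defects after 40% of the code, while both manual orderings have to read all of the code to be sure they have found everything.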
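For readers who have not seen equal-frequency binning before, the 2-bin case can be sketched as below. This is only an illustration of the general technique, not the discretizer inside which; the `equal_frequency_bins` name and the sample values are assumptions of mine.

```python
from typing import List


def equal_frequency_bins(values: List[float], n_bins: int = 2) -> List[int]:
    """Give each value a bin number in 1..n_bins so bins hold about equal counts."""
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # Rank 0..n-1 is mapped evenly onto bins 1..n_bins.
        bins[i] = rank * n_bins // len(values) + 1
    return bins


# Made-up numeric attribute, e.g. lines of code per module.
loc = [12, 250, 40, 7, 95, 310]
print(equal_frequency_bins(loc))  # [1, 2, 1, 1, 2, 2]
```

With 2 bins this is just a split at the median: the three smallest values land in bin1 and the three largest in bin2. Note that this simple version breaks ties between equal values arbitrarily.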
As a final note, a plateau in a curve means the detector said things were defective that were not. A line that has a lot of horizontal segments equates to a detector that has a very high probability of false alarm.
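One rough way to put a number on plateaus is to measure how much of the X-axis a curve spends on flat segments. The `plateau_share` helper below is a hypothetical measure I am using for illustration, not one of the statistics reported in this post, and the curve is made up.

```python
from typing import List, Tuple


def plateau_share(points: List[Tuple[float, float]]) -> float:
    """Fraction of the explored X range where no new defects were found."""
    flat = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y1 == y0:  # code was read, but the defect count did not move
            flat += x1 - x0
    span = points[-1][0] - points[0][0]
    return flat / span if span else 0.0


# Made-up curve of (% LOC explored, % defects found) points.
curve = [(0.0, 0.0), (10.0, 50.0), (30.0, 50.0), (60.0, 100.0), (100.0, 100.0)]
print(plateau_share(curve))  # 0.6
```

A share near 0 means almost every module the detector flagged really contained a defect; a share near 1 means most of the inspection effort went to false alarms.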
These results were completely typical across all 10 data sets I tested on. I have the results of a 10x3 cross-validation: each file is divided into 3 parts, a detector is learned from two of the parts (about 66% of the data) and tested on the remaining 33%, and this is repeated 10 times. So 10 data sets times 10 repeats times 3 folds gives a total of 300 numbers for each detector. This gives us a very large amount of empirical data to show our results are valid. The results can be found here. The file is a csv file. Opening it in your browser might hurt. Mann-Whitney tests and Quartile Charts were computed from that 3000+ line csv file. Here are the Mann-Whitney results for KC2, the data set graphed above.
#key, ties, win, loss, win-loss @ 99%
which2, 0, 10, 0, 10
manualUp, 0, 9, 1, 8
which4, 1, 7, 2, 5
nBayes, 1, 7, 2, 5
manualDown, 1, 5, 4, 1
jRip, 3, 3, 4, -1
which8, 2, 3, 5, -2
j48, 2, 3, 5, -2
which8loc, 2, 0, 8, -8
which4loc, 2, 0, 8, -8
which2loc, 2, 0, 8, -8
This shows that which2 won 10 times, lost 0 times, and tied 0 times out of 10 repeats. In other words, it was always the winner. All of the Mann-Whitney and Quartile Chart results can be found here.
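As a sanity check on the KC2 table above, the short sketch below re-reads the rows, confirms that the last column really is wins minus losses, and picks out the top-ranked detector. The rows are copied from the table; the variable names are my own.

```python
import csv
import io

# Rows copied from the KC2 table: key, ties, win, loss, win-loss
TABLE = """\
which2, 0, 10, 0, 10
manualUp, 0, 9, 1, 8
which4, 1, 7, 2, 5
nBayes, 1, 7, 2, 5
manualDown, 1, 5, 4, 1
jRip, 3, 3, 4, -1
which8, 2, 3, 5, -2
j48, 2, 3, 5, -2
which8loc, 2, 0, 8, -8
which4loc, 2, 0, 8, -8
which2loc, 2, 0, 8, -8
"""

rows = []
for key, ties, win, loss, diff in csv.reader(io.StringIO(TABLE),
                                             skipinitialspace=True):
    ties, win, loss, diff = int(ties), int(win), int(loss), int(diff)
    # The last column should be exactly wins minus losses.
    assert win - loss == diff, key
    rows.append((key, ties, win, loss, diff))

# Ties + wins + losses for each row, collected as a set of distinct totals.
totals = {ties + win + loss for _, ties, win, loss, _ in rows}
best = max(rows, key=lambda row: row[4])

print(totals)   # {10}
print(best[0])  # which2
```

Every row adds up to 10 decisions, and which2 is the only detector with no ties and no losses.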
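The 10x3 cross-validation described earlier can be sketched as follows. This is a minimal sketch, assuming each data set is a simple list of rows; `repeated_three_fold` is a hypothetical helper, not the script that produced the csv file, and the learner itself is left out.

```python
import random
from typing import List, Sequence, Tuple


def repeated_three_fold(rows: Sequence, repeats: int = 10, folds: int = 3,
                        seed: int = 1) -> List[Tuple[list, list]]:
    """Return (train, test) splits for repeats x folds cross-validation."""
    rng = random.Random(seed)
    splits = []
    for _ in range(repeats):
        shuffled = list(rows)
        rng.shuffle(shuffled)  # new random order for every repeat
        for k in range(folds):
            # Every folds-th row starting at k is held out for testing.
            test = shuffled[k::folds]
            train = [row for j, row in enumerate(shuffled) if j % folds != k]
            splits.append((train, test))
    return splits


# Made-up data set of 30 rows.
splits = repeated_three_fold(list(range(30)))
print(len(splits))  # 30 splits per data set, so 10 data sets give 300
train, test = splits[0]
print(len(train), len(test))  # 20 10
```

Each split trains on two thirds of the rows and tests on the remaining third, and every row is used for testing exactly once per repeat.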