Empirical Analysis of a Brute Force Algorithm for Median Search

This paper presents a brief analysis of computational experiments examining the running time and complexity of a brute-force median selection algorithm.  C++ was used to implement and test the algorithm, while Python’s matplotlib.pyplot library was used to visualise the results graphically. The main focus of the report is to test whether the theoretical prediction for the algorithm’s average-case time complexity is consistent with observations.

1. Description of the Algorithm

The algorithm, called BruteForceMedian (see Appendix A), is a median search algorithm that takes an array of ordered or unordered integers as input and returns the median value of that array. 
Unlike other median-finding algorithms, BruteForceMedian does not reshuffle or sort the input array; instead, it iteratively compares every element of the array against every other element, including itself, until it determines which value would occupy the kth position if the list were sorted. It is important to note that the array itself does not actually need to be sorted in order to find the median: by keeping track of a few extra variables, the position of the median can be deduced.

Some definitions are necessary for an in-depth description of the algorithm:
A = Input array of integers
n = Length of A
k = n/2 (the kth position in the list is the median, if the list is ordered)
A[i] = Item from list A
A[j] = List item to be compared with A[i]
numsmaller = the number of instances in which A[j] is smaller than A[i]
numequal = the number of instances in which A[j] is equal to A[i]

Before the algorithm’s basic operations run, an input array is specified so that the important variables n and k can be assigned values. The array items do not need to be integers, as long as each item can be ranked against the other items in the array in some manner and all items are of the same type.

For each element of the array, numsmaller and numequal are reset to zero before the inner loop compares that element against every element of the array.

If A[j] is less than A[i], numsmaller is incremented by one, while numequal is incremented by one on the condition that A[i] and A[j] share an identical value.

After all items of the list have been checked and the inner loop has closed, the median is identified as the item with as many values of the array falling below it as above it. If the list has an even number of items, so that the centre point k = n/2 is not in ℤ and there are two candidate median values, the lower of the two is selected as the median value.
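The procedure above can be sketched as follows. This is a minimal reconstruction from the description, not the report’s own listing (which appears in Appendix A); the identifier names follow the definitions in section 1, but the exact rank test is an assumption.

```cpp
#include <vector>

// Sketch of the BruteForceMedian routine described above. Returns the
// (lower) median without sorting the input array: an element is the
// median when fewer than k elements are smaller than it, and the
// elements equal to it are enough to cover rank k.
int bruteForceMedian(const std::vector<int>& A) {
    int n = static_cast<int>(A.size());
    int k = (n + 1) / 2;  // rank of the median; the lower median when n is even
    for (int i = 0; i < n; ++i) {
        int numsmaller = 0;  // elements smaller than A[i]
        int numequal   = 0;  // elements equal to A[i] (A[i] itself included)
        for (int j = 0; j < n; ++j) {
            if (A[j] < A[i])       ++numsmaller;
            else if (A[j] == A[i]) ++numequal;
        }
        if (numsmaller < k && numsmaller + numequal >= k)
            return A[i];  // A[i] would occupy rank k in the sorted list
    }
    return 0;  // unreachable for non-empty input
}
```

For example, on {3, 1, 4, 1, 5} the routine returns 3, the third-smallest of five values, without ever reordering the array.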

2. Theoretical Analysis of Algorithm

2.1. Identifying the Algorithm’s Basic Operation

The basic operation of an algorithm can be defined as the line or lines of code fundamental to solving the problem the algorithm is intended to solve; it tends to be the most frequently executed portion of code within the algorithm.

This particular algorithm is short and comprises only a few lines of code spread across a nested loop, which makes it fairly straightforward to follow. The inner loop is repeated n times per iteration of the outer loop, and is consequently executed far more often than the outer loop.  Therefore, each iteration of the inner loop, that is, a comparison between two array elements, can be considered the basic operation.


2.2. Order of growth

The outer loop iterates only as many times as it needs to until it finds the median value, but no more than n times. In other words, the time complexity of this algorithm depends heavily on the position of the median in the array.

If the median value falls in the 0th position of the array, the inner loop only needs to cycle through the array once, comparing each item against the median before it is returned; the big-O notation for this scenario is O(n).  On the other hand, if the median lies in the last position of the array, the inner loop needs to cycle through the array n times, which is written as O(n²).
Since the array uses a randomised set of integers, there is theoretically a 1/n chance that the median is located at any given position of the array. Larger arrays should take much longer, as the outer loop iterates through more candidates before finding the median.
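The best- and worst-case counts above can be checked directly by instrumenting the search with a comparison counter. The sketch below is a hypothetical companion to the routine in Appendix A, not the report’s own code; it returns the number of inner-loop iterations performed rather than the median itself.

```cpp
#include <vector>

// Instrumented variant of the brute-force search: counts the basic
// operations (inner-loop comparisons) executed before the median is found.
long countBasicOps(const std::vector<int>& A) {
    int n = static_cast<int>(A.size());
    int k = (n + 1) / 2;            // rank of the (lower) median
    long ops = 0;
    for (int i = 0; i < n; ++i) {
        int numsmaller = 0, numequal = 0;
        for (int j = 0; j < n; ++j) {
            ++ops;                  // one basic operation per comparison
            if (A[j] < A[i])       ++numsmaller;
            else if (A[j] == A[i]) ++numequal;
        }
        if (numsmaller < k && numsmaller + numequal >= k)
            return ops;             // median found at outer index i: (i + 1) * n ops
    }
    return ops;
}
```

With the median in the first position the count is exactly n (the O(n) best case); with the median in the last position it is n² (the O(n²) worst case).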

3. Methodology, Tools, and Techniques


To test the algorithm, a 64-bit Windows machine with an Intel i5 CPU @ 3.50 GHz was used. The general-purpose programming language C++ was used to measure the running time and count the basic operations of BruteForceMedian.  With the multiplatform IDE Code::Blocks, several C++ console applications were created for different purposes during this analysis; these are detailed below in section 4.1, Functional Testing.
Microsoft Excel was originally planned for graphical visualisation of the data, but large datasets were causing memory problems, so Python’s pyplot library was chosen instead for its simplicity and flexibility with large datasets.

4. Experimental Results

In the following sections, each array size was tested 10 times because some timed values were negative, seemingly due to the fairly low resolution of the function used to time the algorithm. The data was then cleaned to remove any rows that contained these negative values.

4.1. Functional Testing

First, the algorithm itself was implemented, and the .cpp file was duplicated to create a second file called test.cpp, which was edited to output several lines of useful diagnostic information for initial testing. The first .cpp file deals with only one example of the algorithm in action, on a list of length 5, and its console output (Appendix B) is intended to show a human-readable, line-by-line representation of the steps taken.
Once the basic operation of the algorithm had been identified by counting the inner-loop iterations from the above output, a new file called time.cpp was created, which included a function that used a pseudorandom number generator to fill an array with random integers between 0 and 1000. Array lengths of 1 to 10,000, with random values, were each tested 10 times, for 100,000 tests in total. The algorithm was timed with <sys/time.h>’s gettimeofday() function, which reports times in second and microsecond fields, from which millisecond figures were derived. Although the timer may be interrupted by background processes, it is still useful for showing the general trend.

Output from the above time.cpp was redirected to a .txt file in CSV format; each of the 100,000 rows recorded five fields: the test-run number, the length of the array being tested, the runtime in milliseconds, the median value, and the number of basic operations.

4.2. Average-Case Number of Basic Operations

Since the maximum and minimum numbers of basic operations are dictated by the position of the median in the array, the average case is taken to be the case in which the median is located at the centre of the array. The simplest way to ensure this is to sort the array in ascending order first, so that the median is always located at the centre position k = n/2. The graph in Figure C represents tests in which the arrays were sorted and the median was in the n/2 position; Figure D, discussed below, repeats the test with unsorted arrays for comparison.
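Fixing the median at the centre, as described above, amounts to sorting each test array before timing it; a minimal sketch, assuming std::sort (the helper name is hypothetical, not the report’s code):

```cpp
#include <algorithm>
#include <vector>

// Sorting guarantees the (lower) median sits at index (n - 1) / 2,
// so every run of the brute-force search does the same amount of work.
std::vector<int> fixMedianAtCentre(std::vector<int> A) {
    std::sort(A.begin(), A.end());
    return A;
}
```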

4.3. Average-Case Execution time

Figure C plots the number of operations and the execution time in ms when the array is sorted, in other words at the average-case time complexity, and a linear relationship between the two variables can be seen, since the position of the median is constant. Figure D shows the results of the same test performed with unsorted arrays instead of sorted ones.  As can be seen, the runtime in Figure D was generally much lower overall than when the arrays were fixed with the median in the same position every time. This is because, in the trial that produced Figure D, the median could fall at any position with probability 1/n, including early positions that drastically reduce the runtime, especially for the larger arrays.

5. Conclusion

The algorithm behaved as expected, but there are still many more aspects to explore, such as using a system capable of timing at a higher resolution for more accurate results, or finding a way to reduce the size of each element in the array. 

This median selection method is not very practical if the intention is to reduce time complexity as much as possible. There are many ways to reduce the time complexity of an algorithm such as this one. For example, if the median is known to be in the first position of the array, the time complexity drops to O(n), because only one pass over the array is needed to reach a conclusion. When this is not possible, a divide-and-conquer technique can be applied in which several smaller arrays are created and the median of each is found before the median of those medians is assessed; this technique is called median of medians and works much better on large datasets with high variability between data points.
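For comparison, C++’s standard library already provides a selection routine with linear average-case time, std::nth_element (typically implemented as introselect rather than the median-of-medians rule described above). A sketch, with the wrapper name as an assumption:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Selects the lower median in O(n) average time: nth_element partially
// orders the array so that the element at index (n - 1) / 2 is exactly
// the one a full sort would place there. The argument is taken by value
// because nth_element rearranges its input.
int selectMedian(std::vector<int> A) {
    std::size_t k = (A.size() - 1) / 2;  // lower median, as in section 1
    std::nth_element(A.begin(), A.begin() + k, A.end());
    return A[k];
}
```

Unlike the brute-force search, this does modify (a copy of) the array, but it avoids the O(n²) worst case of scanning every element against every other.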

Tables and Figures





Appendix

Appendix A: The BruteForceMedian algorithm, analysed in the Codeblocks IDE:

Appendix B: A visualisation of the  BruteForceMedian algorithm:
