You can reduce that n^2 to n log n by the following:
Suppose your input is this:
12343211
11112111
11112223
31111111
71236111
12341111
11211111
11111111
1. Compute a quadtree of maximums.
i.e. level 1 is this:
12343211
11112111
11112223
31111111
71236111
12341111
11211111
11111111
[code]
level 2 is this:
[code]
2431
3123
7461
1211
level 3 is this:
43
76
level 4 is this:
7
This would take O(m log m) time - but you don't need to compute levels above log n, so it's really O(m log n) time. (Specifically, you don't need to compute any level that cannot theoretically fully contain some full pixel + radius.)
2. You can compute the maximum within n distance of a cell by the following recursive "algorithm":
max_total = max(max(all cells at current level that are fully contained within distance of n of the current cell), recurse over all cells at the next level down that are partially contained within distance n of the current cell)
(There's actually a shortcut here - you do not need to recurse on any cells that have a max <= the current max.)
For instance, suppose we want to find the max within distance 3 of cell (4, 4) (1-based) (I'm using Chebyshev distance here as it's simpler for an example)
At level 4 (1,1). The current cell is not entirely within distance 3 of (1) (4,4), so recurse:
At level 3 (1,1) = 4. The current cell is entirely within distance 3 of (1) (4,4), so max >= 4.
At level 3 (2,1) = 3. The current cell is <= 3, but max >= 4, so don't need to recurse.
At level 3 (1,2) = 7. The current cell is not entirely within distance 3 of (1) (4,4), so recurse.
At level 2 (1,3) = 7. The current cell is entirely within distance 3 of (1) (4,4), so max >= 7.
At level 2 (1,4) = 4. The current cell is <= 4, but max >= 7, so don't need to recurse.
At level 2 (2,3) = 1. The current cell is <= 1, but max >= 7, so don't need to recurse.
At level 2 (2,4) = 2. The current cell is <= 2, but max >= 7, so don't need to recurse.
At level 3 (2,2) = 6. The current cell is <= 6, but max >= 7, so don't need to recurse.
max = 7.
Total time per pixel is log(n) - you never recurse on all 4 quadrants of any quad that can fit into the radius. In the case of the Chebyshev metric, you never recurse on all 4 quadrants of a quad of size < 2r, but for euclidean distance it's a little smaller.
However, in practice I suspect this may be substantially slower except for large values of n.
So your overall time is O(m log n), and you can reduce this via parallelism all the way down to O(log n) if you've got enough threads. (Seriously: a GPU would be great for this sort of thing.)
You can actually do something similar for a circular box blur. Though I've never seen anyone implement it - it's not worth it except for very large radiuses, and for very large radiuses box blurs don't look very good.