This is my approach for the 7th problem of AdventOfCode 2021, including some related variations that I came up with. I’ll be using R
to implement the solutions.
Essentially, you are given $N$ data points $X = \langle x_i \rangle$ (could contain duplicates). You have to find an integer point $y$ minimizing the sum of distances - $\sum_i d(y, x_i)$. Each part has a different distance function:
Let’s optimize the expression with the distance function $d_1$. This is a very standard problem - the optimal point is the median. When there are two medians, any value between those two is optimal.
Proof (informal): Say you pick a $y$ which is not between the medians. This means that there are more points on one side of it. WLOG say that it’s the right side. Now if we consider $y’ = y + 1$. For each point on the right, the total cost would decrease by $1$ each. And for each on the left it would increase by $1$. So the total cost would decrease, contradicting that $y$ was optimal.
sum(abs(X - median(X)))
First a simpler distance function: $d_s(a, b) = (a - b)^2$. So the total cost would be $C(y) = \sum_i d_s(y, x_i) = \sum_i (y - x_i)^2$
Now notice that $C(y)$ is a quadratic equation in $y$. It has only one minimum, at the point where the derivative is zero. $C’(y) = \sum_i 2(y - x_i) = 0 \implies y = \frac{\sum_i x_i}{N} = \overline{X}$
For optimal integer points, we just round $\overline{X}$. As it is a quadratic, the closer integer will have lower cost.
sum((X - round(mean(X)))^2)
Now we’ll use the actual distance function for part 2 - $d_c(a,b) = g(g + 1) = 2 d_2(a, b)$ where $g = |a - b|$. They both will have the same optimal $y$.
\[\begin{eqnarray} C(y) &=& \sum_i d_c(y, x_i) \\ &=& \sum_i \bigg((y - x_i)^2 + |y - x_i|\bigg) \\ &=& \sum_i (y - x_i)^2 + \sum_i |y - x_i| \end{eqnarray}\]As this function is not differentiable, let us consider separate ranges for $y$. For each $1 \le j < N$, let $C_j(y)$ be the cost function when $x_j \le y < x_{j+1}$ (WLOG assume $X$ is sorted in ascending order).
\[\begin{eqnarray} C_j(y) &=& \sum_i (y - x_i)^2 + \sum_i |y - x_i| \\ &=& \sum_i (y - x_i)^2 + \sum_{i = 1}^{j} (y - x_j) + \sum_{i = j + 1}^{N} (x_j - y) \\ &=& \sum_i (y - x_i)^2 + jy - \sum_{i = 1}^{j} x_j - (N - j)y + \sum_{i = j + 1}^{N} x_j \\ &=& \sum_i (y - x_i)^2 + (2j - N)y - \sum_{i = 1}^{j} x_j + \sum_{i = j + 1}^{N} x_j \end{eqnarray}\]And it’s derivative would be:
\[\begin{eqnarray} C_j'(y) &=& \sum_i 2(y - x_i) + (2j - N) \\ &=& 2Ny - 2N\overline{X} + (2j - N) \\ &=& 2N\bigg(y - \overline{X} + \frac{j}{N} - \frac{1}{2} \bigg) \\ &=& 0 \\ \implies y_j &=& \overline{X} + \frac{1}{2} - \frac{j}{N} \end{eqnarray}\]This gives us $N$ candidates $y_j$ for the optimal positon. Now we can compute the cost for each of these values, and pick the best one.
min(sapply(round(mean(X) + 0.5 - seq(1, N)/N),
function(y) { sum((X - y)^2 + abs(X - y))}))
This has complexity $O(N^2)$, which is quite slow. Let us optimize that. Instead of taking every value, we’ll only consider the unique values. As we are rounding off $y_j$, and $\overline{X} - \frac{1}{2} \le y_j < \overline{X} + \frac{1}{2}$, we will have at most two interesting integer values.
min(sapply(unique(round(mean(X) + 0.5 - seq(1, N)/N)),
function(y) { sum((X - y)^2 + abs(X - y))}))
This would run in $O(N)$, which is as fast as it gets.