浮点数加法精度误差

Published: June 16, 2020 by Daltao

Categories:
algorithm 101

众所周知，计算机在执行浮点数加法的时候，由于double类型只有52个有效位，因此精度是有限的（不到16位十进制有效数字），在执行加法运算的时候，较低的位会被省略掉，比如 $1e10+(1+1e-10)$ ，其结果为 $1e10+1$ ，而 $1e-10$ 就被省略了。虽然看起来的省略项微不足道，但是如果有许多这样的项合在一起就变成了明显的精度误差了。

昨天晚上在做一道题目的时候，我的二分做法始终在第71个样例报错。看了一下java提交，好像也都是这样的。但是对于其它语言好像基本都通过了。

后面发现一个人也用二分通过了，但是他用DoubleStream#sum来计算浮点数和，而不是普通的for循环。看一下源码：

    @Override
    public final double sum() {
        /*
         * In the arrays allocated for the collect operation, index 0
         * holds the high-order bits of the running sum, index 1 holds
         * the low-order bits of the sum computed via compensated
         * summation, and index 2 holds the simple sum used to compute
         * the proper result if the stream contains infinite values of
         * the same sign.
         */
        double[] summation = collect(() -> new double[3],
                               (ll, d) -> {
                                   Collectors.sumWithCompensation(ll, d);
                                   ll[2] += d;
                               },
                               (ll, rr) -> {
                                   Collectors.sumWithCompensation(ll, rr[0]);
                                   Collectors.sumWithCompensation(ll, rr[1]);
                                   ll[2] += rr[2];
                               });

        return Collectors.computeFinalSum(summation);
    }

    /**
     * Incorporate a new double value using Kahan summation /
     * compensation summation.
     *
     * High-order bits of the sum are in intermediateSum[0], low-order
     * bits of the sum are in intermediateSum[1], any additional
     * elements are application-specific.
     *
     * @param intermediateSum the high-order and low-order words of the intermediate sum
     * @param value the name value to be included in the running sum
     */
    static double[] sumWithCompensation(double[] intermediateSum, double value) {
        double tmp = value - intermediateSum[1];
        double sum = intermediateSum[0];
        double velvel = sum + tmp; // Little wolf of rounding error
        intermediateSum[1] = (velvel - sum) - tmp;
        intermediateSum[0] = velvel;
        return intermediateSum;
    }

可以发现它用了一种名字叫做kahan summation的算法。去wiki了一下，发现这个算法可以用于计算浮点数的加总，同时其造成的误差是 $O(1)$ ，即与加总的次数无关。

其具体代码如下：

    public static double sum(double[] arr, int l, int r) {
        double sum = 0;
        double err = 0;
        for (int i = l; i <= r; i++) {
            double x = arr[i] - err;
            double t = sum + x;
            err = (t - sum) - x;
            sum = t;
        }
        return sum;
    }

我们来分析一下这段代码。err变量中存储上一个被加数被丢弃的精度（但是符号与它相反）。因此取下一个变量的时候double x = arr[i] - err;，我们会先把上一次被加数丢弃的精度在加回到新的被加数中。之后在计算新的和的时候double t = sum + x;，x的部分精度会丢失。之后在计算新的误差的时候err = (t - sum) - x;，x丢失的那部分精度就以相反数的形式存储在了err变量中。所以我们可以认为整个加总过程所有丢失的精度都累积在err中，当err足够大的时候，其就会出现在sum中。

上面的代码速度会比普通的直接加法慢不少，但是对应的好处就是精度丢失非常少。

又学到了奇怪的知识。

参考资料

Kahan summation algorithm