An efficient algorithm for the large-scale smoothing of scattered data retrieved from remote sounding experiments

Abstract. We present a new algorithm for the smoothing/interpolation of two-dimensional fields applicable to noisy data observed at scattered sites. The technique is based on a special statistic that allows one to simultaneously minimize the fit residual and the correlation between residuals of adjacent points. The principle of the method is first explained in the 1-D case and then extended to the 2-D case by the adjunction of a regularization operator. The method is compared with different algorithms (Loess-Renka, biharmonic spline and kriging) in three test cases related to remote sounding of the Earth's atmosphere by space-borne experiments.

Key words. Atmospheric composition and structure (evolution of the atmosphere; instruments and techniques; general or miscellaneous)


Introduction
Data retrieved from remote sounding space experiments generally provide atmospheric information with good coverage in space and time. These data can be validated by comparing results obtained by independent instruments and/or by ground-based observations. However, satellite data from several orbiting platforms correspond to different sounding geometries, such as solar or stellar occultation, nadir or limb viewing, etc. Geolocation of sounded regions is also clearly influenced by the orbital state vector and possible hardware constraints. From these asynoptic and composite data sets, it is, therefore, highly desirable to use efficient numerical tools capable of producing regular (gridded) fields on which validation and statistical analysis can be performed (Lait, 2000).
Sophisticated interpolation techniques, such as 4-D variational assimilation (Fisher and Lary, 1995), are presently being developed and will be routinely used for the future processing of stratospheric data. However, these procedures are computationally very expensive and their success strongly depends upon the quality of the underlying atmospheric model. An extensive review of atmospheric data analysis can be found in Daley (1991), and a considerable number of methods for approximating two-dimensional ("2-D") fields from scattered data have been published (see Foley and Hagen, 1994).
This work presents a new algorithm for 2-D smoothing based on an expansion in orthogonal polynomials, combined with a regularization driven by a statistic different from the usual chi-squared. After disappointing experiences with existing schemes, the method has been designed to be robust (it should always converge), automatic (not requiring interactive inspection of the smoothing level), reasonably fast and not very sensitive to the inaccuracy of experimental error bars. As the main idea of the algorithm is to Minimize the Correlation between Residuals, it will be referred to as the MCR method hereafter.
For the sake of clarity, we will first explain the principles of the algorithm for one-dimensional ("1-D") problems without addressing a rigorous mathematical framework. In the second part, we will extend it to 2-D cases and we will investigate its performance (with respect to alternative methods) for realistic data sets coming from atmospheric remote sensing experiments.
It is well known that generalized least-squares techniques require the use of a full covariance matrix to avoid biases induced by possible correlations between data point errors (notice that the data points themselves are highly correlated by the underlying physics). It is also clear that the processing of instrumental data through inversion algorithms may introduce correlations between values retrieved at adjacent altitudes or between different species retrieved simultaneously. However, it is reasonable to assume that the estimated random experimental errors at different geolocations and sampling times are uncorrelated, because they correspond to independent realisations of the instrumental/inversion noise. Clearly, no smoothing algorithm is capable of removing a bias in a given measurement; if such a bias existed, it could only be corrected at the level of the combined "measurement-data processing" chain. Removing the bias would also require the availability of many measurements at close geolocations and times, or cross-validation with respect to several independent experiments. Moreover, experimental data from space-borne instruments are mostly available as independent measurements with estimated variances. If a bias is suspected in the data and if it can be estimated but not exactly computed, it can be quadratically added to the random error variance before applying the smoothing algorithm. The MCR method explained hereafter presupposes that the experimental errors associated with the measurement points have a zero mean and are uncorrelated in space and time.

1-D MCR
We deal with the problem of approximating m bivariate data (x_i, f_i), i = 1...m by a smooth curve f(x) that should approximate the unknown reality g(x). The measured data are supposed to contain a noise component δ_i, so that

f_i = g(x_i) + \delta_i \quad (1)

where δ_i has a standard deviation σ_i. The experimental data f_i may have been measured by different instruments with specific random error distributions and possible moderate bias. The classical fitting strategy (Press et al., 1992) consists of minimizing the merit function χ² defined by

\chi^2 = \sum_{i=1}^{m} d_i^2 \quad (2)

where the normalized residual d_i is

d_i = \frac{f_i - f(x_i; c_0, c_1, \ldots, c_n)}{\sigma_i} \quad (3)

and the approximating function f(x; c_0, c_1, ..., c_n) depends on (n + 1) parameters. Ideally, the value of χ² should decrease when the number of fitting parameters c_0, c_1, ..., c_n is increased, until a plateau (whose value is about m ± √(2m) for large m) is reached. An insufficient number of parameters causes underfitting, and some information content of the data set is lost, while too many parameters mean overfitting, and the fit starts to represent the noise. A simple example is given in the upper part of Fig. 1. The optimal fit clearly lies at the border between the two extreme regimes. On the other hand, the values of σ_i are often poorly known or even unknown, and the optimal fit has to rely on the appearance of such a plateau. Furthermore, it is not always easy to recognize the existence of the plateau, which can be obscured by multiple extrema.
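To make the plateau behaviour concrete, the short Python sketch below (our illustration; the original work provides no code) fits Chebyshev series of increasing order n to synthetic noisy data and prints χ² against the expected plateau level m ± √(2m):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

rng = np.random.default_rng(0)
m, sigma = 200, 0.05
x = np.linspace(-1.0, 1.0, m)
f = np.exp(-4.0 * x**2) + rng.normal(0.0, sigma, m)   # synthetic noisy data

# chi^2 (Eq. 2) versus model order: it falls steeply while the model is
# underfitting, then settles on a plateau near m +/- sqrt(2m).
for n in range(1, 13):
    d = (f - C.chebval(x, C.chebfit(x, f, n))) / sigma   # normalized residuals, Eq. (3)
    print(f"n = {n:2d}   chi2 = {np.sum(d**2):7.1f}   (plateau ~ {m} +/- {np.sqrt(2*m):.0f})")
```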
Quite recently, the Durbin-Watson statistic (Durbin and Watson, 1950) was proposed as an alternative statistic for the least-squares spline approximation of noisy data (Thijsse et al., 1998). Basically, the merit function of the fit is given by

Q = \frac{\sum_{i=1}^{m-1} (d_{i+1} - d_i)^2}{\sum_{i=1}^{m} d_i^2} \quad (4)

i.e. the ratio of two variances, where the denominator is the usual aggregate magnitude of residuals, while the numerator measures the serial correlation between them. For reasonably smooth underlying functions, the evolution of Q from underfitting to overfitting regimes is easy to understand (see Fig. 1). In underfitting cases, residuals are strongly correlated because two adjacent points tend to have the same residual with respect to the fit (d_{i+1} ≈ d_i, and Q is close to 0). In the case of overfitting, residuals tend to be anticorrelated (d_{i+1} ≈ -d_i, and Q is close to 4), as the fit rapidly oscillates between neighbouring points. The optimal fit, for which Q ≈ 2 ± 2/√m, corresponds to a minimal correlation between residuals (\sum_i d_{i+1} d_i ≈ 0), because they then contain only noise. A considerable advantage of Q arises from its relative insensitivity to the knowledge of the σ_i, since these appear in both the numerator and the denominator.
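The statistic of Eq. (4) is straightforward to evaluate; a minimal NumPy version (our illustration, assuming residuals ordered by abscissa) is:

```python
import numpy as np

def durbin_watson(d):
    """Durbin-Watson statistic Q of Eq. (4) for normalized residuals d:
    Q ~ 0 signals underfitting, Q ~ 4 overfitting, and the optimal fit
    has Q ~ 2 +/- 2/sqrt(m)."""
    d = np.asarray(d, dtype=float)
    return np.sum(np.diff(d) ** 2) / np.sum(d ** 2)
```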
Keeping in mind that our goal is to globally approximate smooth geophysical fields retrieved from space experiments, we decided to expand the unknown function over a basis of orthogonal functions. Excellent (but not unique) candidates are Chebyshev polynomials T_k(u) = cos(k arccos(u)) (Press et al., 1992) and, within a domain [x_min, x_max], f(x) may be expanded as

f(x) = \sum_{k=0}^{n} c_k T_k(u) \quad (5)

where

u = \frac{2x - x_{min} - x_{max}}{x_{max} - x_{min}} \quad (6)

In Fig. 2, we simulated a synthetic atmospheric transmittance measured by an occultation experiment, to which detector shot noise (varying as the square root of the signal) has been added. The least-squares linear problem associated with Eq. (5) can be solved for increasing values of n until the value Q = 2 (nominal) or Q_+ = 2 + 2/√m (conservative) is reached (here for n ≈ 7). The resulting fitted curve has the desired smoothness and is able to capture the slope change around 25 km. Considering larger n values would only result in overfitting (wavy structures) and possible ill-conditioning of the linear system.
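The complete 1-D procedure then amounts to a short loop. The sketch below is a minimal illustration (function and variable names are ours, and the data are assumed sorted by abscissa), not the authors' implementation:

```python
import numpy as np
from numpy.polynomial import chebyshev as C

def mcr_fit_1d(x, f, sigma, n_max=50):
    """1-D MCR sketch: raise the Chebyshev order n until the Durbin-Watson Q
    of the normalized residuals reaches the conservative threshold
    Q+ = 2 + 2/sqrt(m). Assumes x is sorted in increasing order."""
    x, f = np.asarray(x, float), np.asarray(f, float)
    sigma = np.broadcast_to(np.asarray(sigma, float), x.shape)
    u = (2.0 * x - x.min() - x.max()) / (x.max() - x.min())   # mapping of Eq. (6)
    m = len(x)
    q_plus = 2.0 + 2.0 / np.sqrt(m)
    for n in range(1, n_max + 1):
        coef = C.chebfit(u, f, n, w=1.0 / sigma)      # weighted LSQ fit of Eq. (5)
        d = (f - C.chebval(u, coef)) / sigma          # normalized residuals, Eq. (3)
        q = np.sum(np.diff(d) ** 2) / np.sum(d ** 2)  # Q statistic, Eq. (4)
        if q >= q_plus:
            return coef, n, q
    return coef, n_max, q
```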

2-D MCR
It is possible to generalize the orthogonal function expansion to the 2-D case by

f(x, y) = \sum_{k=0}^{n} \sum_{l=0}^{n-k} c_{kl} T_k(u) T_l(v) \quad (7)

(Chebyshev polynomials could be replaced by spherical harmonics if full coverage of the globe is desired). A natural way to take into account the correlation between adjacent measurement points is to perform a Delaunay triangulation of the considered domain. For a given data point i, we consider the list {1, 2, ..., p(i)} of all neighbours, i.e. all points belonging to a Delaunay triangle of which point i is a vertex. For the MCR statistic, we propose

Q = \frac{\sum_{i=1}^{m} \frac{1}{p(i)} \sum_{j=1}^{p(i)} (d_{ij} - d_i)^2}{\sum_{i=1}^{m} d_i^2} \quad (8)

where d_ij refers to the residual of the j-th neighbour of point i. This statistic has the same asymptotic properties as Eq. (4), thanks to the normalization with respect to the number p(i) of neighbours surrounding point i. In the case of underfitting, d_ij ≈ d_i and Q almost vanishes, while overfitting corresponds to anticorrelated residuals (d_ij ≈ -d_i) and makes Q approach 4 if the number of data points is large enough. The optimal fit is achieved when the sum over all points of the products of neighbour residuals with the point residual approaches zero, i.e. Q ≈ 2 ± 2/√m.

An additional problem of the 2-D case is the large increase in the number of coefficients c_kl to compute, which may cause severe ill-conditioning (Dierckx, 1993): the solution becomes extremely sensitive to the noise amplitude even if the statistical criterion is fulfilled. It is, therefore, necessary to introduce a regularization technique (Hansen, 1992). If A represents the design matrix of the problem (hereafter, we will consider σ_i = 1 in order to simplify the notation), the constrained linear inversion consists of minimizing a Lagrangian merit function made of two quadratic forms

M(c) = (Ac - b)^T (Ac - b) + \lambda\, c^T U c \quad (10)

where c and b stand, respectively, for the vector of unknown coefficients c_kl and the vector of m data points. The regularization operator U is aimed at measuring the total surface smoothness by using direct partial differentiation of Eq. (7); it is constructed from the smoothness measure η = c^T U c given by Dierckx (1993). The minimization problem defined by expression (10) can be solved algebraically:

c = (A^T A + \lambda U)^{-1} A^T b \quad (12)

Before investigating examples related to non-uniform sampling in remote sensing, it is worth illustrating the full algorithm with a simple test case (test case 1, "TC1"), represented in Fig. 3. A synthetic 2-D field f_0(x, y) consisting of two Gaussian functions of amplitude 1 has been perturbed by random Gaussian noise with a standard deviation of s_0 = 0.2. The noisy field was sampled at 400 random locations in the considered domain, providing the data values f_i. The goal of the algorithm is twofold: to reconstruct the unperturbed field at the sampling points and to construct the best possible gridded field everywhere in the domain. If we call f_1 and f_g the MCR estimated values at the m sampling locations and at the m_g grid nodes, respectively, the performance can be estimated by means of the following quantities:

s = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (f_i - f_{1,i})^2} \quad (13)

s_1 = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (f_0(x_i, y_i) - f_{1,i})^2} \quad (14)

s_g = \sqrt{\frac{1}{m_g} \sum_{g=1}^{m_g} (f_0(x_g, y_g) - f_g)^2} \quad (15)

where s is the estimate of the noise RMS (to be compared with the true value s_0), s_1 is the RMS error with respect to the true field at the sampling locations (referred to by the summation index i) and s_g is the RMS error with respect to the true field at the grid nodes (index g).
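The building blocks of the 2-D scheme can be sketched as follows (an illustration in Python with NumPy/SciPy, not the authors' code; the smoothness operator U of Dierckx (1993) is not reproduced here, and an identity matrix can stand in for quick experiments):

```python
import numpy as np
from numpy.polynomial import chebyshev as C
from scipy.spatial import Delaunay

def delaunay_neighbours(points):
    """Neighbour lists: j is a neighbour of i if both are vertices of a
    common Delaunay triangle."""
    tri = Delaunay(points)
    nbr = [set() for _ in range(len(points))]
    for simplex in tri.simplices:
        for a in simplex:
            for b in simplex:
                if a != b:
                    nbr[a].add(b)
    return [np.array(sorted(s)) for s in nbr]

def q_2d(d, nbr):
    """2-D MCR statistic of Eq. (8): neighbour residual differences,
    normalized by the neighbour count p(i), over the aggregate residual."""
    num = sum(np.sum((d[js] - d[i]) ** 2) / len(js) for i, js in enumerate(nbr))
    return num / np.sum(d ** 2)

def design_matrix(u, v, n):
    """Design matrix A of the expansion in Eq. (7), with u and v mapped to
    [-1, 1]; the triangular truncation k + l <= n gives (n+1)(n+2)/2 columns."""
    Tu, Tv = C.chebvander(u, n), C.chebvander(v, n)
    cols = [Tu[:, k] * Tv[:, l] for k in range(n + 1) for l in range(n + 1 - k)]
    return np.stack(cols, axis=1)

def solve_regularized(A, b, U, lam):
    """Constrained linear inversion of Eq. (12): c = (A^T A + lam U)^-1 A^T b."""
    return np.linalg.solve(A.T @ A + lam * U, A.T @ b)
```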
As in the 1-D case, the algorithm proceeds by increasing the value of n until the conservative value Q_+ is reached (Fig. 4). At this stage, n may be safely incremented by one or two, in order to ensure a maximal sensitivity to the data, because the regularization in Eq. (12) is not yet active (λ = 0). The isopleths associated with Q(c(n, λ)) exhibit a characteristic corner-shaped form that expresses the progressive transition from a regime of high sensitivity to data and low smoothness, to a regime of low sensitivity and high smoothness. The optimal estimation of f_0, if it exists, should lie somewhere in the corner region. Therefore, keeping n constant, the algorithm iteratively increases the value of λ by means of a standard zero-finding numerical scheme and computes Q(c(n, λ)) at each step until the Q_+ isopleth is reached again, which is always possible due to the corner-shaped form of the curve. It should be noticed that all (n, λ) solutions lying in the corner region can be considered equivalent, and the present algorithm is a simple way to discover a cheap solution (with a low n value). Indeed, augmenting n along the Q_+ isopleth would increase the computational cost without reducing s_1 and, eventually, the ill-conditioning would become prohibitive. The technique used to discover a quasi-optimal (n, λ) doublet is also not unique. For instance, it would be possible to detect the point of maximal curvature along the Q_+ isopleth when both n and its associated λ value are varied, as done in many regularization problems (Hansen, 1992). However, this procedure turns out to be more expensive.
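A possible realization of this λ search (our sketch, assuming the bracket [0, lam_max] straddles the crossing of the Q_+ isopleth and σ_i = 1 as above) uses Brent's bracketing zero finder:

```python
import numpy as np
from scipy.optimize import brentq

def find_lambda(A, b, U, nbr, q_plus, lam_max=1e6):
    """With n held fixed, Q(c(n, lambda)) decreases as the regularization
    strengthens; Brent's method locates the crossing of the Q+ isopleth."""
    def excess(lam):
        c = np.linalg.solve(A.T @ A + lam * U, A.T @ b)   # Eq. (12)
        d = b - A @ c                                     # residuals, sigma_i = 1
        num = sum(np.sum((d[js] - d[i]) ** 2) / len(js) for i, js in enumerate(nbr))
        return num / np.sum(d ** 2) - q_plus              # root where Q = Q+
    return brentq(excess, 0.0, lam_max)
```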
In Fig. 3, it can be seen that the proposed MCR algorithm is quite effective (s_1 ≈ s_0/4) in recovering the unperturbed field in the presence of substantial noise.

Three competing methods
In order to evaluate the efficiency of the MCR algorithm, it is worth comparing its performance with three other published algorithms applied to the same test cases.
The first one is the Loess method (Cleveland and Devlin, 1988), referred to as LR, in which the smoothing is locally performed by fitting a bivariate cubic polynomial to an adjustable number of nearest neighbours. In the commercial surface-fitting software (TableCurve 3D™) that we have used, interpolation on the regular grid is performed by locally-weighted fitting with the Renka triangulation-based procedure (Renka, 1996) applied to the smoothed values. When the number of neighbours is interactively increased by the user, the degree of smoothing varies from overfitting to underfitting. In the comparisons hereafter, we have selected the optimal number that produces an exact value for the estimation of s_0. This gives a lower bound on the smoothing error of the LR method.
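For orientation, the local fitting idea underlying LR can be sketched in a few lines (an illustrative Python version with tricube weighting, not TableCurve's implementation; the neighbour count k plays the role of the user-adjusted smoothing level):

```python
import numpy as np
from scipy.spatial import cKDTree

def loess2d(pts, vals, query, k=30):
    """Loess-style local smoother: at each query point, fit a bivariate
    cubic polynomial to the k nearest data points, weighted by a tricube
    function of distance, and evaluate the fit at the query point."""
    tree, out = cKDTree(pts), np.empty(len(query))
    for iq, q in enumerate(query):
        dist, idx = tree.query(q, k=k)
        w = np.sqrt((1.0 - (dist / dist[-1]) ** 3) ** 3)   # tricube weights
        dx, dy = (pts[idx] - q).T                          # local coordinates
        B = np.stack([np.ones(k), dx, dy, dx * dy, dx**2, dy**2,
                      dx**3, dx**2 * dy, dx * dy**2, dy**3], axis=1)
        coef, *_ = np.linalg.lstsq(B * w[:, None], vals[idx] * w, rcond=None)
        out[iq] = coef[0]                                  # value at dx = dy = 0
    return out
```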
The second method is the biharmonic spline scheme (BS) proposed by Sandwell (1987), in which Green functions of the biharmonic operator are used for the minimum curvature interpolation of irregularly spaced data points. The interpolating surface is a linear combination of Green functions centered at each data point. By reducing the number of model parameters, noisy data can be fitted in a least-squares sense, owing to the minimal curvature property of any solution of the biharmonic equation. The interpolation scheme reproduces the input data at the sampling points, so the smoothing error has only been computed at the regular grid nodes. For this comparison test, we have used the routine "griddata" as implemented in the numerical software package MATLAB™ (release 12).
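A close relative of the BS scheme is available in open-source form: in 2-D, the thin-plate-spline kernel r² ln r is a Green function of the biharmonic operator, so SciPy's RBF interpolator (version 1.7 or later) behaves as a minimum-curvature surface. A minimal sketch with synthetic data:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(1)
pts = rng.uniform(0.0, 1.0, (400, 2))                               # scattered sites
vals = np.exp(-((pts[:, 0] - 0.5)**2 + (pts[:, 1] - 0.5)**2) / 0.02)  # test field

# Thin-plate-spline RBF: exact at the data points (smoothing=0 by default),
# mirroring the interpolating behaviour of the BS scheme described above.
interp = RBFInterpolator(pts, vals, kernel='thin_plate_spline')

gx, gy = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
grid_vals = interp(np.column_stack([gx.ravel(), gy.ravel()])).reshape(gx.shape)
```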
The third competing method is "kriging", which has been extensively used in geophysics (see, e.g. Tranchant and Vincent, 2000). Briefly, kriging is a minimum variance method closely connected with statistical interpolation techniques (Daley, 1991). The analyzed field is considered as being reducible to the sum of a mean value, the non-stationary expectation of the field, and a random fluctuation with a zero expectation. A predictor is then constructed from a linear combination of data values whose coefficients minimize the estimation variance, with a constraint ensuring that the estimator is unbiased. The kriging system of equations relates the optimal coefficient set to the covariance between the sample points, which is assumed to depend only on their separation. The covariance is computed by means of a "semi-variogram" and fitted by a predefined theoretical model. In the comparison exercise, we have used the EasyKrig code (version 2.1) developed by Chu (2000). We also made the assumption that the sampled geophysical field was quasi-stationary, which allows one to use ordinary kriging. In case of suspected non-stationarity, universal kriging should be preferred. Kriging is a powerful statistical method, and robust techniques have been proposed to fit the semi-variograms (Journel and Huijbregts, 1989), including the possible use of anisotropic forms if privileged directions are expected (see Tranchant and Vincent, 2000, for an extended discussion). Here, however, we assume that the use of a Gaussian-linear isotropic semi-variogram is accurate enough.

In Table 1, we present the comparison of s, s_1 and s_g (Eqs. 13-15) obtained by MCR and by the competing methods for test cases TC1, TC2 and TC3 (the latter two are described hereafter). For s_1 and s_g, the reported values have been normalized to the best score for each test case, whereas for s the normalization was performed with respect to the exact value. For TC1, MCR is clearly superior to the LR and BS methods. It is also slightly superior to kriging, except for the estimation of the noise standard deviation.
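For readers wishing to reproduce an ordinary-kriging baseline without EasyKrig, the third-party pykrige package offers a comparable setup; the sketch below (our illustration, with synthetic data and an isotropic Gaussian semi-variogram model) mirrors the quasi-stationarity assumption made above:

```python
import numpy as np
from pykrige.ok import OrdinaryKriging

rng = np.random.default_rng(2)
x, y = rng.uniform(0.0, 1.0, (2, 400))
z = np.sin(3 * x) * np.cos(3 * y) + rng.normal(0.0, 0.05, 400)  # noisy samples

# Ordinary kriging with an isotropic Gaussian semi-variogram model.
ok = OrdinaryKriging(x, y, z, variogram_model='gaussian')
gridx = np.linspace(0.0, 1.0, 50)
gridy = np.linspace(0.0, 1.0, 50)
z_grid, var_grid = ok.execute('grid', gridx, gridy)   # field and kriging variance
```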

Two geophysical examples in the Earth's remote sounding by space experiments
The second test case ("TC2") concerns the reconstruction of the ozone field at 30 mbar in a time-latitude domain. The example has been conceived from geolocations inspired by the well-known SAGE II experiment (Mauldin III et al., 1985), which has furnished an invaluable series of atmospheric altitude profiles of ozone, nitrogen dioxide, aerosols and water vapour. SAGE II is a typical solar occultation experiment that was launched in October 1984 on the Earth Radiation Budget Satellite into a 56-degree inclination orbit. It makes measurements primarily at middle and low latitudes, but reaches the polar regions in both hemispheres several times a year due to the combination of the orbital plane precession and the seasonal effect (see Fig. 5). About 15 sunset and 15 sunrise measurements are performed per day at a quasi-constant latitude. Over a period of about 1 month, this latitude slowly varies along sinusoidal tracks and extends over a seasonally dependent range of approximately 70° S to 70° N. In test case TC2, we arbitrarily selected the SAGE II geolocations of the year 1992.
The assumed geophysical field of interest is the ozone volume mixing ratio (VMR) and, for the sake of realism, we have constructed the "exact" field from the ozone climatology published by Fortuin and Kelder (1998). Daily ozone VMR values have been averaged and assigned a random measurement error of 2%, based on the published values of Cunnold et al. (1989). From a data set of 638 measurements, the test consisted of reconstructing the ozone field on a regular grid with a 0.2 month × 2 degree resolution. It can be seen in Fig. 5 that the MCR method performs well (the optimal order was found to be n = 14, with a total of 120 coefficients) and that fine structures are correctly extracted within domains of missing data (e.g. around months 5 and 7 in the tropical regions). Also, at high latitudes, partial information about the well-known ozone minima is extracted by the algorithm, although data are scarce or even nonexistent. Even when using a basis of spherical harmonics (or geodesic distances for kriging), it is very important to keep in mind that an interpolating/smoothing algorithm cannot produce information not measured by the experiment. An important feature of the ozone field is located near the edge of the south polar vortex (see, for instance, Wauben et al., 1997; McIntyre, 1989; Vincent and Tranchant, 1999), which lies outside the SAGE II sampling domain. Nevertheless, MCR gives valuable, although incomplete, information over this region by smoothly extrapolating the data points. In Table 1, MCR turns out to be the best estimator of s, s_1 and s_g.
The third test case ("TC3") deals with NO2 columns retrieved from the Global Ozone Monitoring Experiment (GOME) on board the ESA ERS-2 satellite (Burrows et al., 1999), which was launched into a polar sun-synchronous orbit. The instrument is a nadir-viewing UV-visible spectrometer and provides global coverage in about 3 days. The test case has been built by assuming that the true geophysical field is the monthly averaged NO2 column field for September 1999 above Europe and North Africa (longitude 30° W-60° E, latitude 0°-70° N) (see Fig. 6). Following Lambert et al. (1999), a random retrieval error of 15% has been added to produce the measured data for one observation day (15 September 1999).
Test case TC3 consists of recovering the true field from data (820 values) representing about one-third of the full coverage, with clustered pixels along the satellite tracks. This was achieved by using MCR with an optimal n = 10 (66 coefficients), and the reconstruction was performed on a grid with a 2.5 deg (in longitude) × 1.25 deg (in latitude) resolution. By inspecting Fig. 6, it is clear that the reconstruction is not as accurate as in the TC2 case. For instance, the pollution peak above western Europe is well identified but not separated into local maxima above Germany and northern Italy. Fine structures above the Black Sea have also been smoothed out. Actually, the lower spatial resolution obtained by the reconstruction algorithm is a consequence of the quite high noise level and not of the sampling geometry. In comparison with the alternative methods (Table 1), it is interesting to observe that MCR performs slightly worse than KR at the sampling locations, due to the locally very high correlation between adjacent GOME pixels, whereas it is better for interpolating/extrapolating on the regular grid. For test case TC3, LR gives the best results on the grid, bearing in mind that it had to be fed with the exact noise estimate, which in practice could be delivered by MCR.

Conclusions
Surface fitting of noisy data from scattered locations can be performed by various methods. The final accuracy of the reconstruction obviously depends on the algorithm intrinsic error, the geometrical distribution of samples and the noise level of the observations.
In this work, we have developed a new algorithm capable of producing an optimal fit, in the sense that a trade-off is achieved between underfitting, where information is lost, and overfitting, where spurious structures are induced by experimental noise. The 2-D MCR method is based on minimizing the correlation between residuals, combined with a standard regularization technique. The method is fast, quite robust and easy to implement.
MCR performs well with respect to existing methods for geophysical applications. In particular, we have tested its efficiency for three kinds of sampling geometries: a randomly dispersed data set (TC1), a line-type data set (TC2) and a clustered data set (TC3), for different levels of noise. The method is, therefore, well suited for interpolation in validation campaigns or for the construction of climatological fields.
In future work, we will investigate the generalization of the algorithm to higher dimensions. Clear geophysical objectives will include interpolation between vertical profiles observed at different locations and temporal evolution of 2-D fields.