A singular evolutive extended Kalman filter to assimilate real in situ data in a 1-D marine ecosystem model

Abstract. A singular evolutive extended Kalman (SEEK) filter is used to assimilate real in situ data in a water column marine ecosystem model. The biogeochemistry of the ecosystem is described by the European Regional Sea Ecosystem Model (ERSEM), while the physical forcing is described by the Princeton Ocean Model (POM). In the SEEK filter, the error statistics are parameterized by means of a suitable basis of empirical orthogonal functions (EOFs). The purpose of this contribution is to track the possibility of using data assimilation techniques for state estimation in marine ecosystem models. In the experiments, real oxygen and nitrate data are used and the results evaluated against independent chlorophyll data. These data were collected from an offshore station at three different depths for the needs of the MFSPP project. The assimilation results show a continuous decrease in the estimation error and a clear improvement in the model behavior. Key words. Oceanography: general (ocean prediction; numerical modelling) – Oceanography: biological and chemical (ecosystems and ecology)


Introduction
One of the major challenges for the coming years in oceanography is the design of forecasting systems capable of supporting coastal zone management issues. Data assimilation systems of coupled physical/biogeochemical models provide a new perspective regarding the forecasting of ecosystems. Ecosystem modelling and assimilation are closely connected with the use of compatible and complementary observing systems and data assimilation techniques. The development of adaptive ecosystem models that can respond to changes in physical forcing will require a complex interplay between improving model realism and acquisition of measurements suitable for assimilation. Such systems are very important since they enable a reliable comprehension of physical, Correspondence to: G. Triantafyllou (gt@imbc.gr) chemical and biological interactions, to predict and monitor their behavior and to assess extreme ecosystem phenomena.
Data assimilation techniques have been used in several studies (Matear, 1995;Prunet et al., 1996;Harmon and Challenor, 1996) for the estimation of poorly known parameters in marine ecosystem models by essentially trying to minimize the misfit between model simulations and observed data. Based on the assumption that a model with a fixed parameter set can efficiently reproduce the observations, data assimilation can be further used for the model's state estimation. These methods, which can be classified into variational (based on the optimal control theory) and sequential (based on the statistical estimation theory), have been widely used in meteorology and are becoming popular in oceanography (Ghil and Malanotte-Rizzoli, 1991). Recently, focus has turned toward marine ecosystem models. For the purpose of this study, a sequential method and in particular, the implementation of a (simplified) Kalman filter has been used.
The Kalman filter is a statistical data assimilation scheme which provides the best estimate, in the sense of leastsquares, of the state of a linear system, using only observations available up to the analysis time. Although the implementation of this filter is not difficult, its application in realistic oceanic models encounters two major difficulties: nonlinearity and computational cost. The first can be partially resolved by linearizing the model around the state estimate, which leads to the so-called extended Kalman (EK) filter (Ghil and Malanotte-Rizzoli, 1991). The second arises from the large dimension of the system. Several variants of the EK filter, which basically reduce the dimension of the system n through some kind of projection onto a low dimensional subspace, have been proposed (see De Mey, 1997, for a review). One of these variants is the singular evolutive extended Kalman (SEEK) filter developed by Pham et al. (1997). The idea behind this filter is to view the error covariance matrix as singular with a low rank r n. This leads to a filter in which the correction of the errors is made only along certain directions parallel to a linear subspace of dimension r. In its general form, the "directions of correction" evolve in time according to the model dynamics (Pham et al., 1997). However, following Brasseur et al. (1999) and Hoteit et al. (2001), these directions were not updated in the numerical experiments, in order to avoid expensive calculations. The covariance matrix is then represented by a limited number of three-dimensional multivariate empirical orthogonal functions (EOFs) as an approximation of the system error covariance matrix, describing the dominant modes of the system's variability and defining in this way the structure of the directions of correction.
A major aim of the Mediterranean Forecasting System Pilot Project (MFSPP) was to develop methodologies for the efficient integration of observations with hydrodynamical and biogeochemical models. It is in this framework that the SEEK filter was used to assimilate real in situ data in a 1-D marine ecosystem model for the Cretan Sea (Triantafyllou et al., 2003b). The principal aim was to customize, implement and verify this sequential data assimilation method in the context of future operational forecasting applications. The paper is organized in the following way. The model and the available data are briefly described in Sects. 2 and 3, and the SEEK filter is presented in Sect. 4. The validation of the assimilation scheme with twin experiments and experiments with real observations is demonstrated in Sect. 5. Finally, concluding remarks are discussed in Sect. 6.

Model description and configuration
The model used in this study is a generic 1-D ecosystem model originally designed for the North Sea (Allen, et al., 1998a(Allen, et al., , 1998b and later adapted to the Cretan Sea ecosystem by Triantafyllou et al. (2001Triantafyllou et al. ( , 2002b. It consists of modules describing the biological and chemical processes in the water column, which may be stratified or mixed. It describes the significant biogeochemical processes affecting the flow of carbon, nitrogen, phosphorus, and silicate. Each module consists of a coupled set of ordinary differential equations, which may be solved by a straightforward explicit method (Euler integration) or by an implicit higher-order Runge-Kutta method. The ecosystem is subdivided into three functional types: producers (phytoplankton), decomposers (bacteria) and consumers (zooplankton). State variables have been chosen in order to keep the model relatively simple without omitting any component that could exert a significant influence on the energy balance of the system. The organisms which constitute the food web (Triantafyllou et al., 2003b) are organized into functional groups, classified according to trophic levels and subdivided on the basis of trophic links and/or size. Physiological processes and population dynamics are described by fluxes of carbon and nutrients between functional groups. Due to the oligotrophic nature of the system under study and the significance of the microbial loop, a modified bacteria module was used .
The vertical diffusion sub-model of the Princeton Ocean Model (POM) (Blumberg and Mellor, 1987) is used to provide the physical forcing to the ecological model where the Mellor Yamada 2.5 turbulence closure model (Mellor and Yamada, 1982) calculates the vertical turbulent kinetic energy, temperature and salinity diffusion coefficients. Other mixing processes (e.g. internal waves) have been parameterized by a background viscosity.
The combination of food web with coupled nutrient dynamics allows for the model to be adjusted to spatial and temporal variations of carbon and nutrient availability, and to reproduce the different types of ecosystem behavior. Versions of the model have been implemented in a wide variety of regimes from the coastal eutrophic (Allen, 1997;Allen et al., 1998a,b;Baretta and Baretta-Bekker, 1998;Vichi et al., 1998;Zavatarelli et al., 2000) to offshore oligotrophic (Triantafyllou et al., 2002b) and closed systems .
The application area is located north of Heraklion at the deployment site of the M3A buoy with an approximate depth of 1050 m, as shown in Fig. 1. The discretization of the model is 40 boxes in the vertical with a finer resolution at the euphotic zone, in order to simulate fine scale phenomena. The physical model is forced with 3-hourly realtime wind speed and air temperature data extracted from the weather forecast model of the POSEIDON system (Nittis et al., 2001) at the particular area and relaxed to mean monthly sea-surface temperature and salinity obtained from the POSEIDON buoy (Soukissian et al., 1999) located at the Cretan Sea during the period of the M3A deployment (January 2000 -April 2001). The incident surface solar radiation is calculated from the latitude and modified by the cloud cover data using the methods of Patsch (1994).
The state vector consists of the model prognostic variables that have to be initialized independently. The prognostic biogeochemical variables of the ecosystem model are 115, and the state vector dimension is 115 × 40 = 4600.

The data
In this study high frequency in situ data collected during the MFSPP project at north Crete during 30 January 2000 to 20 April 2001 are used (Fig. 1). The parameters measured on a 3-hourly basis were temperature, salinity, oxygen, nitrate and chlorophyll-a concentrations at 40, 65, 90 and 115 m, as described in Nittis et al. (2003). For the experiments, high frequency in situ measurements of oxygen and nitrate were chosen as observation data in the assimilation scheme, while chlorophyll data were used as independent measurements to validate the assimilation system.
Dissolved oxygen sensors gave reliable data for the first six months, but an attempt for in situ repair was not successful and, thus, data are missing after August 2000 (Nittis et al., 2003). The nitrate data are also fragmented since there were some difficulties with the instrument after 7 July 2000.
Chlorophyll-a, dissolved oxygen and nutrients data were post-calibrated using in situ bottle measurements obtained during maintenance trips. As described in Nittis et al. (2003) for the calibration procedure, water samples were collected at the respective depths during each maintenance cruise, once during the recovery of the instruments and once during their deployment 2-3 days later. CTD measurements to a depth of 1000 m were also carried out during each cruise. The calibration coefficients for the M3A sensors were calculated separately for each period between maintenance cruises, using the reference data collected during each re-deployment when all sensors had been cleaned. Possible sources of errors in the measurements, which could have affected the assimilation system, might be the vertical movements of the mooring caused by the increased water currents (110 m on 15 December 2000). However, the estimation of these errors (error covariance matrix) will be avoided in this preliminary study, making use of a simple property of the SEEK filter (see Sect. 4).

The assimilation scheme: the SEEK filter
The method used to assimilate real in situ data in the 1-D ecological model is the singular evolutive extended Kalman filter, called the SEEK filter, which is a sequential data assimilation scheme derived from the extended Kalman filter. This filter has been already implemented successfully in several ocean general circulation models (Pham et al., 1997;Verron et al., 1998;Brasseur et al., 1999;Hoteit et al., 2001Hoteit et al., , 2002. To present the algorithm of the SEEK filter, the notation proposed by Ide et al. (1997) has been adopted. Considering a physical system described by: where X t (t) denotes the vector representing the true state at time t, M(s, t) is an operator describing the system transition from time s to time t, and η(t) is the system noise vector. At each time t k , one observes where H k is the observational operator and ε k is the observational noise. The noises η(t k ) and ε k are assumed to be independent random vectors with mean zero and covariance matrices Q k and R k , respectively.
The SEEK filter proceeds in two stages in the same way as the EK filter, excluding the initialization stage. For the initialization the objective analysis was chosen, based on the first observation Y o 0 : and the initial analysis state vector was taken as: where H 0 is the gradient of H 0 evaluated at X(t 0 ) and X(t 0 ) is an a priori estimation of the model state at the time of the first available observation, L is a n × r matrix containing the r retained EOFs on its columns, The initial analysis error covariance matrix may be taken as: Note that for the initialization the first observation was used; thus, the algorithm actually starts with the next observation.
1. Forecast stage: At time t k−1 , the estimate of the system state is X a (t k−1 ) and its corresponding error covariance matrix P a (t k−1 ), in the factorized form is expressed as: where the matrix U k−1 are of dimension r × r. The model (1) is used to forecast the state as: The corresponding forecast error covariance matrix can then be approximated by: 2. Correction stage: The new observation Y k at time t k is used to correct the forecast according to: where G k is the gain matrix given by: with U k computed from: The corresponding filter error covariance matrix is then equal to: Since Eqs. (8) and (12) are not needed in the algorithm, the SEEK filter (with fixed directions of correction) drastically reduces the computational cost with respect to the EK filter. Basically, it requires only one integration of the numerical model to compute the forecast state.
To deal with the model and observation errors, the approach proposed by Pham et al. (1997) was adopted. Thus, a so-called compensation technique that replaces the contribution of the imperfect model by amplifying the already existing modes of the background error is used to parameterize the model error. The term (L T L) −1 L T Q k L(L T L) −1 expressing this error in Eq. (11) is then taken into account by means of a forgotten factor ρ. This equation is rewritten as: Concerning the observation errors, nitrate and oxygen are assumed to be independent Gaussian variables of mean zero with variances σ 2 N and σ 2 O , respectively. Thus, the observation errors covariance matrix is expressed by 1 To avoid the estimation of σ 2 N and σ 2 O , a simple property of the SEEK filter was used. Considering that the order of magnitude of oxygen is about 3-4 times the order of magnitude of nitrate and the nominal instrument error about 10%, it can be assumed that σ 2 O = 10σ 2 N . Thus, the matrix R k can be rewritten: It can be easily seen that onlyR is necessary for the algorithm of the filter, since U 0 is, in general, very large compared to σ 2 N , so that only U k /σ 2 N enters into the computation.

Model initialization -EOF analysis
Following Pham et al. (1997), the computation of the EOFs is made through a simulation of the model itself. Thus, the model has been spun up for 30 years initialized with field historic January biogeochemical data, relaxed to climatological temperature and salinity obtained from MODB (Brasseur et al., 1996) and forced with climatological atmospheric data obtained from ECMWF, with the aim of reaching a statistically quasi-steady state. Next, another integration of five years is carried out to generate a historical sequence H S of model realizations. A sequence of 600 state vectors was retained by storing one state vector every three days, to reduce the calculations, because successive states are quite similar.
Since the variables of the state vector of the model are not of the same nature, a multivariate EOF analysis was applied.
Each state variable has then been normalized by the inverse of the spatial average (over all the grid points) of its components variance. The "directions of correction" were then obtained via a multivariate EOF analysis on the sample H S . Figure 2 plots the number of EOFs and the percentage of variability (or inertia) contained in the sample H S they explain.

Assimilation results
In an attempt to explore the assimilation performance of the ecosystem model, two experiments were conducted. First, a twin experiment approach, a technique for validating data assimilation schemes, was used. In these experiments, instead of using real data, pseudo-data were simulated by the model. With this methodology, the performance of the assimilation scheme can be investigated on unobserved model variables, indicating how the assimilation results converge toward the "truth" and how sensitive the results are to the number of retained EOFs. In a second application, real in situ data were used and the filter results evaluated against independent chlorophyll-a data available at 40, 65 and 115 m.

Twin experiments
In the first series of experiments, the "truth" is assumed to be provided by the model itself. A reference experiment is then performed and the reference states are retained to be compared later with the fields produced by the filter. Following the five years of simulation for the generation of the H S sequence, the model was run for a further year to provide the initialization field (January data) for the reference  experiment. The assimilation experiments are performed using pseudo-measurements on a daily basis for nitrate in the 45 m layer and for oxygen in the 65 and 115 m layers (coinciding with the available in situ measurements), which are extracted from the reference states. In these experiments, the filter starts from the mean state of the sample H S .
The reference solution compared with (a) the assimilated (from the filter with 15 EOFs and a forgotten factor equal to 0.9) and (b) the simulated (from a model-free run without assimilation, starting from the initial state of the filter) fields (nitrate, phosphate, chl-a concentrations and bacteria biomass) at 70 m is shown in Fig. 3. In all cases the performance of the assimilation scheme was satisfactory, leading to a clear improvement in the model behavior. Also, it satisfactorily reproduces a chl-a maximum at approximately day 100 and a subsequent decrease in phosphate and nitrate. The model produces a distinct chl-a maximum during spring and autumn with rather significant concentrations within this period, while minimum values occur during winter. Bacteria follow the chl-a distribution with a short time lag, indicating a rather tight coupling as expected in such an oligotrophic system. Figure 4 shows the RMS (relative mean square) misfit between the reference and the assimilated state by the filter with four different values of retained EOFs: 5, 10, 15 and 20, which explain 87.9%, 97.1%, 98.9% and 98.9% of the inertia of the sample H S , respectively. In all cases, the forgotten factor was chosen equal to 0.9. The chosen rank of the initial error covariance matrix has a direct effect on the performance of the algorithm, and adding more than 10 EOFs does not really improve the assimilation performance. Such an observation was also made by Cane et al. (1996) and Verron et al. (1998).

Experiments with real observations
Finally, available daily real observations of nitrate at 45 m layer and of oxygen at 65 and 115 m layers over a period of 69 days (5 March 2000 to 12 May 2000) were assimilated in the 1-D ecosystem model. Chlorophyll data at 45, 65 and 115 m layers were used as independent observations to validate the filter. In this experiment, the filter has been initialized by the model state at the time of the first available observation and the forgotten factor has been taken equal to 0.9. Figure 5 plots the fields of Chlorophyll at 45, 65 and 115 m obtained from the data, estimated by the filter and simulated by the model (without assimilation). At 40 and 65 m assimilation results of the SEEK filter are very satisfactory, leading to a clear improvement in the model behavior. Moreover, the filter solution seems to converge toward the data at the end of the assimilation period. Deeper, at 115 m, both reference and assimilated results fail to reproduce the field data. This can also be seen from Fig. 6 which shows the relative misfit of the filter for chl-a at 40, 65 and 115 m. In all cases, the assimilation system improves the model analysis, especially at the top water column layers. However, the relative error of chl-a at 115 m, although decreasing slowly, may not be con- sidered as satisfactory compared to the other two. A weak variability is observed in the simulation results that could be attributed to a number of factors. The particular station is an offshore deep station and, therefore, it is not affected by the increased variability of nutrients in coastal areas. Due to significant depth (1050 m), nutrients released from the benthos can reach the euphotic zone only in rare extreme events. These extreme events are absent from the small simulation period in the particular assimilation experiments. In addition, the area is characterized by mesoscale spatial and temporal variability with complex hydrological structure (Theocharis et al., 1999;Triantafyllou et al., 2003a) and by the presence of water masses rich in nutrients which cannot be reproduced by the 1-D model. The importance of the hydrology and circulation variability in the ecosystem functioning of the Cretan Sea is presented and discussed in Petihakis et al. (2002).
In an attempt to improve the performance of the filter in the deep layers, smaller values of the forgotten factor have been tested to amplify the forecast error modes (assuming that the model error is large), in order to make the correction of the filter more efficient. This has led to relatively better assimilation results of chl-a at 115 m. However, the results were degraded in the upper layers, where the model error is small (Fig. 5), suggesting that the use of a "matrix of forgetting factors" would improve the performances of the SEEK filter. Chlorophyll is a derived variable and is calculated from the biomass of the phytoplanktonic groups multiplied by a constant C:Chl ratio at all depths. Chlorophyll and other light-harvesting pigments increase as irradiance available for growth decreases (Estrada, 1985;Estrada et al., 1993;Gasol et al., 1997). A fairly wide range of C:Chl values is suggested by researchers (Carlson and Ducklow, 1996;Gasol et al., 1997) for different environments which makes parameterization rather difficult, thereby affecting the behavior of the model in simulating the biological processes at the deep layers (Petihakis et al., 2002). Thus, the EOF's computed from the model simulation were inconsistent with the in situ observations from the deep layers, which eventually may degrade the performance of the filter.

Conclusions
The relative sparseness of biological and chemical data of the water column necessitates the use of data assimilation for realistic estimates of ecosystem parameters. Thus, an ecosystem forecasting system should include a tuned dynamical model, an observational network and a data assimilation scheme. An integrated system combining physical and ecological oceanic models, in situ data and an advanced dynamic assimilation scheme, was developed for the first time in the Cretan Sea.
The singular evolutive extended Kalman (SEEK) filter was applied to a 1-D marine ecosystem model of the Cretan Sea for the assimilation of real in situ data from the M3A buoy. In this filter, the concept of order reduction via empirical orthogonal functions (EOFs) analysis has been adopted to simplify the time integration of the forecast error. Two major numerical experiments were performed: first, twin experiments with simulated data and second, with real in situ data. In the simpler context of twin experiments, the effectiveness of the filter is quite clear with regard to all variables of the model. Moreover, it was found that by using such an approach, oxygen and nitrate could improve the performance of the model. Applications with real in situ data over the period 5 March 2000 to 12 May 2000 revealed a significant improvement of the model behavior in the upper part of the water column with respect to the reference simulation run of the derived chlorophyll variable. The improvement of the Chlorophyll at 115 m is less evident and several problems arise compared with the simplistic twin experiment approach: it is believed that part of this problem comes from the inconsistency between the data and EOFs, since the latter was computed from the model which was not very successful in simulating chlorophyll concentrations at the lower part of the euphotic zone.
In an attempt to enhance the performance of the assimilation system, a future study will focus on the improvement of the model behavior of the chlorophyll at the deeper layers by using a variable C:CHL ratio, as well as a more realistic (adaptive) model error covariance matrix. In this preliminary study, the "directions of correction" of the filter were kept fixed in order to save computational cost. Nevertheless, recent studies have shown that the evolution of the "directions of correction" and, therefore, the statistics of the estimation error according to the model dynamics, would improve the representativeness of these directions and consequently, the performance of the filter.
The results obtained so far are quite encouraging, suggesting the implementation of more realistic applications. Although the assimilation of biochemical observations with a three-dimensional complex marine ecosystem model will be considered in the near future, these preliminary experiments were necessary to establish a basic assimilation system before more demanding applications take place.