Chaotic time series Part II: System identification and prediction

This paper is the second in a series of two, and describes the current state of the art in modelling and prediction of chaotic time series. Sampled data from deterministic non-linear systems may look stochastic when analysed with linear methods. However, the deterministic structure may be uncovered and non-linear models constructed that allow improved prediction. We give the background for such methods from a geometrical point of view, and briefly describe the following types of methods: global polynomials, local polynomials, multi layer perceptrons and semi-local methods including radial basis functions. Some illustrative examples from known chaotic systems are presented, emphasising the increase in prediction error with time. We compare some of the algorithms with respect to prediction accuracy and storage requirements, and list applications of these methods to real data from widely different areas.


Introduction
The first paper in this series [30] discussed the phenomenon of chaotic behaviour, i.e. the fact that seemingly stochastic time series can be generated by low dimensional deterministic systems. Chaotic systems are characterised by features such as strange attractors and positive Lyapunov exponents, which, when estimated from real data, are used to identify chaos. From this starting point, the focus of the present paper is system identification and prediction; identification is also called the inverse problem in dynamical systems theory. Under ideal conditions, a chaotic, and therefore seemingly randomly behaving, system may be identified using techniques from non-linear deterministic system identification. In some sense, a seemingly stochastic problem has a deterministic solution. However, in practice this is only partially true, since the chaotic signal will often be corrupted by noise. But even with perfect reconstruction of the dynamical equations in the noise free case, only short term predictions are possible, due to the extreme sensitivity of chaotic systems to uncertainties in initial conditions. This is because a chaotic system, while globally confined to a finite region of state space, is locally unstable everywhere.
Prediction of chaotic time series is still a relatively new research topic, dating back to 1987. So far, we have identified more than 50 published articles in the field, and the number is growing steadily. There may well be work that we are not aware of, and we apologise for any such omissions. Another review article, shorter than this paper, is [75]. Non-linear system identification and prediction at large is a diverse field with a plethora of potentially useful methods originating from different scientific disciplines. A broad set of these methods has been applied to chaotic systems, and a survey of this area, which is the topic of the present paper, provides a useful comparison of different techniques. Hopefully, this will both spur the application of some of these methods to other areas of non-linear identification and prediction, and provide useful feedback to the study of chaotic systems.
As noted in our first paper, one should also realise that even if chaos per se is not of interest, some knowledge of this phenomenon is highly useful in many non-linear studies. This is because chaotic behaviour is a pervasive non-linear phenomenon, and many systems may therefore turn chaotic in parts of the parameter space.
To limit the length of this article, we have chosen to omit system identification in the field of fractals. Work like the Collage theorem by Barnsley [5] can be viewed as identification of system dynamics in the complex plane, with applications mainly to data compression, whereas we focus on more conventional time series prediction.
In addition, noise in chaotic systems is not covered in this article, mainly because little theory has yet evolved in this area. For some initial work, see [24] and [34].
In the background section below, the notation as well as some basic mathematical concepts are given, together with simple examples. The fundamental ideas behind each major class of approximation methods are then treated in sections covering global polynomials, multi layer perceptron neural networks, local polynomials, and semi-local methods, including radial basis functions.
Here we also give brief references to some applications. In section 6, we discuss and compare the advantages of each method through tables and figures compiled from various sources.

Notation and mathematical background
For chaotic systems, delay coordinates are commonly used to generate state variables. We consider coordinates derived from scalar time series, but there are no principal difficulties in applying the same theory to multivariate observations. A general treatment of the so called embedding technique is given in [30]. For simplicity, in this article we let the delay time $\tau$ denote the fixed interval between observations (the sampling interval in the case of a continuous system), and consider the discrete scalar time series $x_k = x(k\tau)$, where $k$ is an integer. The delay state vector at time $t = k\tau$ is defined as
$\mathbf{x}_k = [x_k, x_{k-1}, \ldots, x_{k-(m-1)}]^T, \qquad (1)$
where $m$ is the embedding dimension and $T$ denotes the transpose. Note that the first element of the vector $\mathbf{x}_k$ is the sample value at time $k\tau$. The first complete state vector can be constructed at time $(m-1)\tau$ after the first sample arrives.
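As an illustration of (1), delay vectors can be constructed from a scalar series along the following lines (a minimal sketch; the function name delay_embed and the use of NumPy are our own choices, not from the original work):

```python
import numpy as np

def delay_embed(x, m):
    """Build delay state vectors [x_k, x_{k-1}, ..., x_{k-(m-1)}]^T
    from a scalar time series, one row per complete vector."""
    x = np.asarray(x, dtype=float)
    # The first complete state vector exists at index m-1.
    return np.array([x[k - np.arange(m)] for k in range(m - 1, len(x))])
```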
There are two equivalent ways of expressing the map from time $k\tau$ to $(k+1)\tau$:
$\mathbf{x}_{k+1} = \mathbf{f}(\mathbf{x}_k), \quad \mathbf{f}: \mathbb{R}^m \mapsto \mathbb{R}^m, \qquad (2)$
$x_{k+1} = f(\mathbf{x}_k), \quad f: \mathbb{R}^m \mapsto \mathbb{R}, \qquad (3)$
where the vector field $\mathbf{f}$ is related to the scalar field $f$ as $\mathbf{f}(\mathbf{x}_k) = [f(\mathbf{x}_k), x_k, \ldots, x_{k-(m-2)}]^T$. For a chaotic system, one basically only knows that $f$ is non-linear. However, we will assume that $f$ is at least continuous, and also continuously differentiable if needed. It is useful to note that geometrically, equation (3) defines an $m$ dimensional surface (manifold) in $\mathbb{R}^{m+1}$. By embedding space we mean $\mathbb{R}^m$, and we denote the space $\mathbb{R}^{m+1}$ where the surface exists as the graph space.
If the time series stems from a chaotic system in its asymptotic state on a strange attractor of dimension $d$, Takens' theorem [68], and its extension [61], states that $m \geq 2d+1$ components in the delay coordinate vector are sufficient to reconstruct the attractor for almost all dynamical systems. This implies that the vectors $\mathbf{x}_k$ all lie on the finite attractor in $\mathbb{R}^m$, and that the observation pairs $(x_k, \mathbf{x}_{k-1}), (x_{k-1}, \mathbf{x}_{k-2}), \ldots, (x_2, \mathbf{x}_1)$ lie on the manifold generated by $f$ in $\mathbb{R}^{m+1}$. The identification problem amounts to constructing an approximation $\hat{f}$ to $f$ given the observation pairs. This is a well known approach to system identification, as outlined for instance in [36]. The problem of approximating a manifold in $\mathbb{R}^{m+1}$, given points on, or near to, its surface in the case of noise, is a central problem in numerical approximation theory as well as in statistical non-linear regression. Through Takens' theorem, identification of chaotic systems is put on a firm mathematical footing and shown to be a non-linear identification problem. In practice, the embedding dimension $m$ may be estimated by different methods, and values less than $2d+1$ may be feasible [30]. Choosing $m$ too large is not a problem in principle, but will certainly lead to a higher computational burden than necessary, and probably a less accurate predictor.
As a simple example, consider the well known logistic map $x_{k+1} = 4x_k(1 - x_k)$. A series of observations from this map looks like noise, and the autocorrelation of the data is as for white noise. Linear techniques will therefore be of no use in predicting such time series, but it is clear from the map itself that $m = 1$ will suffice to embed the attractor. In this case it is rather simple to build an approximation $\hat{f}$ to $f$ from the observation pairs $(x_k, x_{k-1}), (x_{k-1}, x_{k-2}), \ldots, (x_2, x_1)$, as illustrated in Fig. 1a). Here 20 pairs are plotted, and the underlying shape of the one dimensional graph generated by $f$ in $\mathbb{R}^2$ is clearly seen. If $m = 2$ were chosen, the result would be points on a one dimensional curve in $\mathbb{R}^3$. As a second example, 200 observation pairs from the Hénon map, which may be expressed as $x_{k+1} = 1 - 1.4x_k^2 + 0.3x_{k-1}$ [26], are shown for $m = 2$ in Fig. 1b). The domain of variation in embedding space is the attractor, which is recognised in the figure as lying on the surface in $\mathbb{R}^3$ generated by $f(x, y) = 1 - 1.4x^2 + 0.3y$.
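Observation pairs like those of Fig. 1a) can be regenerated along the following lines (our own sketch, not the original experiment):

```python
import numpy as np

def logistic_series(n, x0=0.3):
    """Generate n samples from the logistic map x_{k+1} = 4 x_k (1 - x_k)."""
    x = np.empty(n)
    x[0] = x0
    for k in range(n - 1):
        x[k + 1] = 4.0 * x[k] * (1.0 - x[k])
    return x

x = logistic_series(21)
pairs = list(zip(x[1:], x[:-1]))  # 20 observation pairs (x_k, x_{k-1}), m = 1
```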
To assess $\hat{f}$, the normalised prediction error $e$ over a set of samples with $N$ elements is used:
$e = \sigma / \sigma_x, \qquad (4)$
where $\sigma$ is the root mean square prediction error given by $\sigma^2 = \frac{1}{N}\sum_i (x_i - \hat{f}(\mathbf{x}_{i-1}))^2$, and $\sigma_x$ is the sample standard deviation, $\sigma_x^2 = \frac{1}{N}\sum_i (x_i - \bar{x})^2$, where $\bar{x}$ denotes the average of the $x$ values. If $e \approx 0$, the prediction is almost perfect, whereas an $e$ value equal to 1 is equivalent to using the average as the predictor.
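In code, (4) amounts to the following (a minimal sketch; note that NumPy's std uses the same $1/N$ normalisation as above):

```python
import numpy as np

def normalised_error(x_true, x_pred):
    """Normalised prediction error e = sigma / sigma_x of equation (4)."""
    sigma = np.sqrt(np.mean((np.asarray(x_true) - np.asarray(x_pred)) ** 2))
    sigma_x = np.std(x_true)  # sample standard deviation
    return sigma / sigma_x
```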
We will mainly think of identification in batch mode, where the model is built to minimise the root mean square error over all samples in a training set. Good statistical practice dictates that the prediction error $e$ of the estimated model should not be computed from samples used to construct $\hat{f}$, but over a separate test set. When few observation pairs are available, the standard technique is cross validation [66].
On-line applications are feasible when the methods described here are used to continuously update $\hat{f}$; think of this as an analogue of a Kalman filter doing on-line model parameter estimation. A non-linear model estimated off-line can also be used in the prediction step of an on-line, extended Kalman filter.
Maps $\hat{f}$ approximating $f$ in (3) are one-step predictors. If it is desirable to predict more steps ahead in spite of the escalating uncertainty, say $r > 1$ steps, one can apply the one-step predictor $\hat{f}$ $r$ times. Alternatively, one may estimate the $r$-th iterate $x_{k+r} = f^{(r)}(\mathbf{x}_k) = f(f(\cdots f(\mathbf{x}_k)))$ directly. In [21] and [9] it is argued that iterated predictors outperform direct ones. Intuitively, when the prediction horizon $r$ increases, the function $f^{(r)}$ gets very complex and hard to approximate, as illustrated in Fig. 2, which shows the logistic map together with its 2nd, 4th, and 6th iterates. Direct approximation of $f^{(r)}$ over only a few time steps quickly becomes intractable because of the wild oscillations that occur.
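An iterated one-step predictor simply feeds each prediction back into the front of the delay vector, as in this sketch (f_hat denotes any fitted one-step map; the name is ours):

```python
import numpy as np

def iterate_predictor(f_hat, x0, r):
    """Repeat a one-step predictor r times: each predicted sample is
    shifted into the delay vector before the next step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(r):
        x = np.concatenate(([f_hat(x)], x[:-1]))
    return x[0]  # the value predicted r steps ahead
```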
In [1] and [2] it is shown that minimising the prediction error often leads to an $\hat{f}$ which does not reproduce dynamic invariants of the original data, like Lyapunov exponents and the density of points on the attractor. They suggest that the performance criterion should also include a fit to these invariants, and achieve this by predicting $x_{k+1}$ not only as $\hat{f}(\mathbf{x}_k)$, but as a linear combination of the first $L$ iterates $\hat{f}(\mathbf{x}_k), \hat{f}(\hat{f}(\mathbf{x}_{k-1})), \ldots, \hat{f}^{(L)}(\mathbf{x}_{k-L+1})$. Since high iterates defy approximation, $L$ should be low. A system identified this way shows slightly larger prediction errors, but will reproduce the general behaviour of the underlying system better.
An important property of chaotic systems facilitates the approximation task. A sufficiently long time series produces a sequence of vectors $\mathbf{x}_k$ that is approximately dense on the attractor, meaning that no new $\mathbf{x}_k$ will be far from those already observed. Thus, borrowing a term from the identification field, a chaotic system is in some sense persistently excited. Further, a low dimensional attractor occupies only a fraction of the higher dimensional embedding space, which can reduce storage requirements in approximations significantly.

Global polynomials
An obvious approximation $\hat{f}$ to $f$ (or $\mathbf{f}$) is a polynomial in the $m$ delay coordinate variables of degree $p$, set by the user. Polynomials can be written
$\hat{f}(\mathbf{x}_k) = \sum_i w_i\, \phi_i(\mathbf{x}_k),$
where the $w_i$ are parameters and the basis functions $\phi_i$ are powers and cross products of the components of $\mathbf{x}_k$. Small adjustments to the weights $w_i$ will cause the map to change almost everywhere, and this method is therefore classified as global.
Since the parameters enter linearly, they can be fitted to the data using standard least squares techniques involving the normal equations, in practice often via singular value decomposition [54]. A potential advantage is that the parameters may be estimated recursively (and therefore in real time as measurements accumulate), using a Kalman filter like algorithm [63]. [23] showed an alternative, efficient way of estimating the parameters using orthonormal polynomials and assuming ergodicity. In that case the multivariable parameter estimation problem can be reduced to simple computations involving sums of powers of the variables in the delay coordinate vector.
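To make the linear-in-the-parameters structure concrete, a minimal sketch (our own illustration; poly_design is a hypothetical helper) that builds the monomial basis and fits the weights by least squares:

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_design(X, p):
    """Design matrix whose columns are powers and cross products of the
    delay coordinates up to degree p, constant term included; the number
    of columns equals binomial(m+p, m)."""
    cols = [np.ones(len(X))]
    for d in range(1, p + 1):
        for idx in combinations_with_replacement(range(X.shape[1]), d):
            cols.append(np.prod(X[:, idx], axis=1))
    return np.column_stack(cols)

# Weights via least squares; NumPy's lstsq uses an SVD internally:
# w, *_ = np.linalg.lstsq(poly_design(X, p), y, rcond=None)
```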
The simplest first order polynomial approximator is the well known Auto Regressive (AR) model, which geometrically amounts to fitting an $m$ dimensional hyperplane to the data in $\mathbb{R}^{m+1}$, and is thus a global linear model. This is an unsuitable model for predicting chaotic time series, but some authors have used AR models as benchmarks with which to compare non-linear techniques. Among the work trying global polynomials for predicting chaotic time series, we mention [31], [21], [9], [10], [50], and [23]. We refer to section 6 for a discussion of the quality of such global polynomial approximators. A disadvantage of using polynomials is that the number of independent parameters equals $\binom{m+p}{m}$, which gets intractably large as $p$ increases. Many independent parameters also increase the risk of overfitting noisy time series, and higher order polynomials may show strong oscillations between samples. In addition, some scalar fields are not well approximated by polynomials, and in such cases rational functions may be used. The reason is basically that rational functions may have poles. If the underlying function has poles, even in the complex extension, these poles may ruin real valued approximations by plain polynomials. Rational functions were tried in [10].

Multi layer perceptron neural networks
Another class of global methods which has been applied to chaotic time series is multi layer perceptron (MLP) neural networks. These have an elaborate structure with sigmoid shaped basis functions, for example $\phi(x) = \tanh(x)$ or $\phi(x) = 1/(1 + e^{-x})$, and are probably the most commonly used neural networks. The building blocks in a neural network are the nodes, each of which is just one basis function with some preprocessing of the input, typically an inner product. As a simple example, an MLP net with two input variables, three hidden nodes and one output node defines a function from $\mathbb{R}^2$ to $\mathbb{R}$, as illustrated in Fig. 4. Such a net can be written
$\hat{f}(\mathbf{x}) = \sum_{i=1}^{3} w_i\, \phi\Big(\sum_{j=1}^{2} w_{j,i}\, x_j\Big).$
The $w_i$ and $w_{j,i}$ are real valued parameters, denoted weights in neural net terminology. A full net may have any number of layers and any number of nodes in each layer. In contrast to global polynomials, the weights in MLPs do not enter linearly, so iterative parameter estimation is required. Deriving the values of the weights in the net is in most cases done by back propagation, which is a steepest descent search [58]. As with polynomials, such MLP nets may approximate any smooth function $f: \mathbb{R}^m \mapsto \mathbb{R}$ to any degree of accuracy, given enough sigmoid functions with accompanying weights. A standard proof is found in [14].
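As a concrete illustration of such a net (a sketch under our own naming, with bias terms omitted for brevity), the forward pass of the two-input, three-hidden-node net of Fig. 4 can be written:

```python
import numpy as np

def mlp_predict(x, W_in, w_out):
    """One hidden tanh layer: f(x) = sum_i w_i * tanh(sum_j w_{j,i} x_j)."""
    return np.tanh(x @ W_in) @ w_out

# Example shapes: two inputs, three hidden nodes, one output.
W_in = np.random.randn(2, 3)   # weights w_{j,i}
w_out = np.random.randn(3)     # weights w_i
y = mlp_predict(np.array([0.1, -0.2]), W_in, w_out)
```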
In [31], MLP nets were applied for the first time to predict chaotic time series, namely data obtained from the Mackey-Glass equation [40]. Among other work, we mention [19], who applied a standard MLP to the Lorenz attractor [37], and extended this work in [20] to cover data from a controlled fluid dynamics experiment as well as estimates of sea surface temperatures. [7] predicted corporate bankruptcy, and [39] did prediction on van der Pol's oscillator and the thalamic neuron. We would also like to mention the work of [72], [73] and [74].
Typical disadvantages of standard MLP nets are the long parameter estimation time and potential local minima. To improve this, [76] tested a variation of the back propagation parameter estimation algorithm with a momentum term to speed up convergence, and simulated annealing [29] to avoid local minima. They analysed various data sets, including the logistic map and real data from two biological systems. A similar approach is found in [56]. Another successful approach to fast estimation algorithms is [17], which devised a hierarchical way of structuring and estimating the weights in a sigmoid MLP. They tested the system behaviour on the logistic map and the Rössler system [57], and reported an improvement in parameter optimisation time of approximately three orders of magnitude.

Local methods
One of the main disadvantages of global methods is that a new sample pair $(x_k, \mathbf{x}_{k-1})$ may change $\hat{f}$ everywhere. Local interpolation overcomes this drawback by utilising only a limited number, say $s$, of neighbouring samples. There are two major classes of local methods: those applying neighbour samples directly in the prediction, and those fitting a function locally to the neighbours and basing the prediction on the estimated function.
The simplest way to predict $x_{k+1}$ from neighbour samples is to identify the nearest neighbour to $\mathbf{x}_k$ in the embedding space $\mathbb{R}^m$. We denote the nearest neighbour to $\mathbf{x}_k$ by $\mathbf{x}_{k(1)}$; the next sample $x_{k(1)+1} = f(\mathbf{x}_{k(1)})$ is then known from the time series, and can be used as the predictor. This was suggested by [38], and is equivalent to building a look-up table of previous state mappings. In terms of the original time series, one finds the segment of length $m$ that is most similar to $x_k, x_{k-1}, \ldots, x_{k-(m-1)}$ and then uses the sample following that segment to predict $x_{k+1}$; in other words, $\hat{f}(\mathbf{x}_k) = x_{k(1)+1}$. This is also termed the analog method. In [28], the method is used on a number of simulated data sets to distinguish chaos from coloured noise.
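A minimal sketch of this look-up scheme (the function name analog_predict and the brute force neighbour search are our own simplifications):

```python
import numpy as np

def analog_predict(x_series, m, x_query):
    """Method of analogues: locate the stored delay vector nearest to
    x_query and return the sample that followed it in the series."""
    x_series = np.asarray(x_series, dtype=float)
    # Delay vectors for k = m-1, ..., len-2, so each has a known successor.
    X = np.array([x_series[k - np.arange(m)]
                  for k in range(m - 1, len(x_series) - 1)])
    k1 = np.argmin(np.sum((X - x_query) ** 2, axis=1))  # nearest neighbour
    return x_series[k1 + m]  # the sample following that segment
```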
An improvement is to take the $s$ nearest neighbours and use the average of their state mappings as the predictor. Another variant was suggested in [67]. They selected $s = m+1$ (not necessarily closest) neighbours to form the smallest $m$-dimensional simplex circumscribing $\mathbf{x}_k$ in $\mathbb{R}^m$; $\hat{f}(\mathbf{x}_k)$ is then computed as a weighted sum of the mapped simplex corners. Besides synthetic data, they experimented with time series from measles, chicken pox and diatom populations (plankton). In [13] and [12], the method of Sugihara and May was used to predict vertical ground movements of an active caldera in Italy. In [35], an alternative formulation of the same algorithm was tested on driven semiconductors and the Lorenz attractor. Yet another variation is found in [44], which applies Voronoi tessellation methods from computational geometry [53] to build linear patches (tiles) in the $m+1$ dimensional graph space.
A common mathematical formulation for all these methods is
$\hat{f}(\mathbf{x}_k) = \sum_{i=1}^{s} \phi(\|\mathbf{x}_k - \mathbf{x}_{k(i)}\|)\, x_{k(i)+1}, \qquad (8)$
where $\mathbf{x}_{k(i)}$ denotes the $i$-th closest vector to $\mathbf{x}_k$ in $\mathbb{R}^m$, $s$ is the number of neighbours, and $\|\cdot\|$ denotes a norm. Usually $\phi$ is a weight function increasing from zero to one as $\mathbf{x}_k$ approaches $\mathbf{x}_{k(i)}$. Note that there is no parameter estimation involved here; $\phi$ is a fixed function. Thus, this method is efficient in terms of computation time. However, the approximating maps are generally not continuously differentiable, and the search for neighbours becomes more time consuming as more vectors are stored.

The other class of methods fits a surface in graph space $\mathbb{R}^{m+1}$, as described in section 2, to the measurement points $(x_{k(i)+1}, \mathbf{x}_{k(i)})$, $i = 1, 2, \ldots, s$. This may be a plane, but polynomials of higher degree may also be used to interpolate between neighbours. Taking $s > m+1$ and fitting a plane, one obtains a local AR model, also called a local linear model. For chaotic time series this was, as far as we know, first done in [21]. They experimented with such local AR models, as well as with higher order polynomials, but did not observe significant improvements moving to higher order. For comparison, they also applied a global AR model as a standard forecasting technique. [10] continued to explore the relation between global and local AR models. Other work applying local methods is found in [9].

Semi-local methods and radial basis functions
RBF approximation can be thought of as a combination of the fitting and weighting approaches described in the previous section on local methods: weights are assigned according to the distance from the basis function centres, but these weights adjust parameters, not the next sample value as in (8). Applying $s$ basis functions, such approximators take the form
$\hat{f}(\mathbf{x}_k) = \sum_{i=1}^{s} w_i\, \phi_i(\|\mathbf{x}_k - \mathbf{c}_i\|), \qquad (9)$
where the function $\phi_i(r)$ is radially symmetric around a centre $\mathbf{c}_i$ in $\mathbb{R}^m$, and the $w_i$ are weights to be chosen. If the $w_i$ are the only parameters to estimate, the normal equations can be utilised. If, however, there are parameters inside $\phi_i$ which enter the problem non-linearly, time consuming iterative optimisation methods must be applied. As an intuitive visualisation of how radial basis functions work, consider Fig. 4 and imagine the smooth step functions replaced by, for example, Gaussian hats (and a linear combination at the last level). There are two main areas for modifying the basic scheme: experiments with various types of basis functions $\phi_i$ (section 5.1), and experiments with various algorithms for determining the parameters, especially the centres and the number $s$ of basis functions (section 5.2).
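Since the weights in (9) enter linearly once the centres and widths are fixed, fitting reduces to a least squares problem. A minimal sketch with Gaussian basis functions (function names and the single shared width are our own simplifications):

```python
import numpy as np

def rbf_fit(X, y, centres, width):
    """Fit the weights w_i of equation (9) for fixed Gaussian basis
    functions; linear in the parameters, solved by least squares."""
    Phi = np.exp(-np.sum((X[:, None, :] - centres[None, :, :]) ** 2, axis=2)
                 / (2.0 * width ** 2))
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

def rbf_predict(x, centres, width, w):
    """Evaluate the fitted RBF sum at a single state vector x."""
    phi = np.exp(-np.sum((x - centres) ** 2, axis=1) / (2.0 * width ** 2))
    return phi @ w
```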

Choosing the basis function type
Various results prove that different types of RBF functions $\phi_i$ are universal approximators, that is, any smooth function can be approximated to any degree of accuracy given enough basis functions. This, in turn, requires an infinite amount of noise free measurement data. We refer to [49] for a neural network approach, [62] for a statistical approach, and [45] for a result from approximation theory.
The most popular type is the rotation symmetric Gaussian hat of the form $\phi_i(r) = \exp(-r^2/2\sigma_i^2)$, where the hat widths $\sigma_i$ are fixed constants. Even though Gaussian hats are global in theory, they decrease fast enough to have finite support for all practical purposes. The basis function can also be non-local, like the multiquadric $\phi_i(r) = (r^2 + c^2)^{1/2}$, with $c$ a constant and $r$ real. For a more extensive list of basis functions, see [8], and cf. [22] for a number of related methods viewed from numerical mathematics.
In higher dimensions, the locality advantage of Gaussian hats turns into a disadvantage. The basis functions are local, and since data is almost always scarce in higher dimensions, most points in state space have no basis function cover. Another source of difficulty is that some input variables $x_{k-j}$ may be almost uncorrelated with the output variable $x_{k+1}$, especially if the embedding dimension $m$ is estimated too large. There are two typical ways to improve this situation: either by letting the hats smear out in some directions and become Gaussian ridges, or by normalising the basis functions. The normalised Gaussian hats going into (9) are written
$\phi_i(\|\mathbf{x}_k - \mathbf{c}_i\|) = \dfrac{\exp(-\|\mathbf{x}_k - \mathbf{c}_i\|^2/2\sigma_i^2)}{\sum_{j=1}^{s} \exp(-\|\mathbf{x}_k - \mathbf{c}_j\|^2/2\sigma_j^2)}, \qquad (10)$
and they are named weight constant predictors (WCP) in [65] and [64]. They also suggested a method called weighted linear predictors (WLP), where the simple weights $w_i$ in (9) are replaced by the linear term $v_i + (\mathbf{x}_k - \mathbf{c}_i)^T \mathbf{d}_i$, where $v_i$ is a scalar parameter and $\mathbf{d}_i$ is a parameter vector.
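A sketch of the normalised weights (10), with names of our own choosing and the widths taken as given:

```python
import numpy as np

def normalised_gaussian_weights(x, centres, widths):
    """Normalised Gaussian hats of equation (10), as used in the
    weight constant predictor (WCP); an illustrative sketch only."""
    g = np.exp(-np.sum((x - centres) ** 2, axis=1) / (2.0 * widths ** 2))
    return g / g.sum()  # weights sum to one over the s basis functions
```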
In [42] and [43], the same method was tested on the Mackey-Glass equation, and in [1] and [2] a slightly modified weighting function was used. The WLP method was applied in [50] with $e^{-K_i}$ as the weight function, and neighbour samples taken from identified unstable periodic orbits (UPO) close to $\mathbf{x}_k$. Generic strange attractors can be approximated by unstable periodic orbits of a given length, and methods to identify such orbits are given in the same article. $K_i$ was the sum of the positive Lyapunov exponents of each neighbour UPO $i$.
In [25], ridge functions were applied to the Mackey-Glass equation. An alternative view was proposed by [15], observing that a multi-dimensional Gaussian equals a product of scalar Gaussians with adjustable centres and widths. During learning, hat parameters are adjusted when a suitable Gaussian covers the input; otherwise a new hat is generated. In the product they use as few terms as possible, and new variables are introduced only when required, thus converting ridges into hats one variable at a time. In this way, the system identification method itself decides which previous variables are sufficiently correlated to warrant inclusion in the model; estimation of the best embedding dimension $m$ thus becomes an integral part of the model building algorithm. This algorithm is named CC-RAN. Another related tree-building algorithm was devised in [60] and [59].

The number of basis functions and their centres
In the most basic RBF method, there is one basis function at each sample, so that $\mathbf{c}_i = \mathbf{x}_i$. To reduce the computational burden, often only the $s$ nearest samples are taken into account; then $\mathbf{c}_i = \mathbf{x}_{k(i)}$ and the number of terms in (9) is reduced. This approach was taken in [9], using the non-local basis function $r^3$ and the $s = 50$ nearest samples. The standard RBF method interpolates the data, and is thus sensitive to noise. To reduce this problem, Broomhead and Lowe [6] put a fixed number of basis functions on a regular grid. By letting the number of basis functions be less than the number of samples, the data was smoothed. They applied this method to predict the logistic map.
In higher dimensions, regular grids become infeasible, because the number of basis functions grows exponentially with the dimension of the grid. In addition, a chaotic attractor occupies only a small subset of the entire space, so most of the basis functions are superfluous. The solution is to represent only those parts of the embedding space $\mathbb{R}^m$ where data exists, that is, the manifold occupied by the attractor of the system. Thus, the memory requirements can be made proportional to the attractor's size, which also improves model estimation time and noise robustness.
One such algorithm, here denoted hashing RBF, is described in [46]. Basis functions are maintained on a regular grid with spacing $\delta$, but are only constructed at those grid points where data exists. A grid coordinate is related to the corresponding function parameters through a hash table, and once the neighbours are found, we are back to a summation model $\hat{f}_\delta(\mathbf{x}_k)$ of the form (9). This hash table scheme was originally invented by Albus [4], [3] for real time motion control in robotics, which is possible since a hash table makes the look-up extremely fast. Without detailed a priori information about the distribution of data in the input space, it is difficult to choose an optimal lattice spacing: a coarse grid will smooth data well, whereas a fine grid will capture details. A hierarchy of hash models spaced on grids with increasing resolution, written $\hat{f}(\mathbf{x}_k) = \sum_\delta \hat{f}_\delta(\mathbf{x}_k)$, will thus represent the major function structure on the coarsest grids, and add model details at finer grid resolutions.
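The essence of the hashing idea is that an integer grid cell index, computed from the state vector, serves as a hash key, so only occupied cells carry basis functions. A rough sketch of the bookkeeping (our own illustration; X_train is a stand-in for a given set of delay vectors):

```python
import numpy as np

def grid_key(x, delta):
    """Integer grid cell containing x, used as a hash key; basis functions
    are created only for cells that actually receive data."""
    return tuple(np.floor(np.asarray(x) / delta).astype(int))

X_train = np.random.rand(500, 3)  # stand-in for delay vectors on an attractor
grid = {}                         # sparse grid: occupied cell -> local samples
for xk in X_train:
    grid.setdefault(grid_key(xk, 0.1), []).append(xk)
```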
Another idea for reducing the number of basis functions is to cluster neighbouring sample vectors and represent them all with one basis function at the cluster centre $\mathbf{c}_i$. When all training samples are collected, the clusters may be formed with, for example, the k-means clustering algorithm [41]. In [47] and [48], a variation of this k-means clustering algorithm was applied to centre the clusters; the average distance to the neighbouring hats was used as the hat width, and the weights $w_i$ were estimated.
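A plain k-means pass for placing the centres might look as follows (a generic sketch, not the exact variant used in [47] and [48]):

```python
import numpy as np

def kmeans_centres(X, s, iters=20, seed=0):
    """Place s basis function centres on the data by plain k-means."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), s, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each sample to its nearest centre, then move the centres.
        labels = np.argmin(((X[:, None, :] - c[None, :, :]) ** 2).sum(axis=2),
                           axis=1)
        for j in range(s):
            if np.any(labels == j):
                c[j] = X[labels == j].mean(axis=0)
    return c
```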
The cluster centres and widths may also be built up as an integral part of the parameter estimation, and one of the first descriptions of such algorithms was [32]. These algorithms add a new basis function only when no existing function covers the new sample, or if the existing functions cannot easily be changed to approximate the new sample. As more samples arrive, the hat widths are decreased gradually, leading to increased estimation accuracy. Usually such methods call for iterative estimation algorithms. In [51], a version of this algorithm was implemented, named RAN as an acronym for resource allocating nets.
The automatic addition of basis functions during model identification represents one solution to the problem of so called model structure selection, or model realisation. In statistics, there is an emerging theory of how to select the number of basis functions, with cross validation and bootstrap methods as some of the themes [18]. Another challenge lies in selecting the form of the basis functions themselves, as well as identifying which variables should go into the model. This is different from estimating the parameters in a fixed structure, which is usually what identification amounts to. Realisation theory as described in [11] is well developed for deterministic linear systems, but a fundamental theory is lacking for non-linear systems (cf. [16]).

Discussions and comparisons
At this stage, a set of rules recommending certain methods for certain classes of problems would be desirable. Unfortunately, such general advice is as yet unavailable; too little experience has been gained from actual use. Instead, we have settled for the more modest goal of comparing experimental results given in the literature by compiling tables and figures. These will hopefully give a feeling for the performance one can expect from the various methods.
The quality of the constructed approximators depends on many factors, including: the underlying system, the number of samples available, the embedding dimension, the noise in process and measurements, the kind of approximator, the number of parameters in the approximator, the amount of computer and human resources invested in constructing the approximator, and the prediction interval. Based on results reported in the literature, we have picked interesting experimental results where only one, or just a few, of these factors vary, while the remaining factors stay relatively fixed. The following tables and figures have been collated: Table 1 hints at how well different methods approximate the Mackey-Glass delay differential equation. Table 2 compares the approximation error of local AR models (local linear polynomials) with global AR models for different systems. Figure 3 shows how fast the prediction error increases with time for different methods and different test sets.
Figure 5 shows how prediction accuracy is connected with the number of parameters and the number of samples for some methods applied to the Mackey-Glass equation. Even though we have attempted to extract experimental results which are as comparable as possible, comparisons like these can never be completely fair, and only show the applicability of each method to the type of data chosen.
In all tables and figures, prediction error means the normalised root mean square prediction error as given in (4).
To give an impression of how well the various methods approximate a fixed system, we have in Table 1 collected prediction errors from experiments on the standard noise free Mackey-Glass equation with delay parameter $\Delta = 17$, 500 training samples, embedding dimension $m = 4$, and sampling interval $\tau = 6$ time units. The underlying Mackey-Glass delay differential equation is
$\dot{x}(t) = \dfrac{0.2\, x(t-\Delta)}{1 + x(t-\Delta)^{10}} - 0.1\, x(t).$
Most of the results are reported by the originators of the methods, presumably assuring maximum performance. Different numbers of samples were used in the test sets, typically either 500 or 1000. The number of model parameters differed between the tests. This is reasonable, since each method should be allowed to apply an optimal number of parameters. Unfortunately, some reported the prediction error 6 time units ahead, others 84, so the table reports results for both prediction intervals. In spite of this, the table should give an indication of how well the different methods approximate this Mackey-Glass system.
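For readers who wish to generate comparable data, a crude Euler integration of the delay differential equation can be sketched as below (our own illustration; a higher order integrator and a smaller step are preferable for accurate benchmarks):

```python
import numpy as np

def mackey_glass(n, delta=17, dt=1.0, x0=1.2):
    """Euler integration of dx/dt = 0.2 x(t-delta)/(1 + x(t-delta)^10) - 0.1 x(t),
    started from a constant history of value x0."""
    hist = int(delta / dt)
    x = np.full(n + hist, x0)
    for k in range(hist, n + hist - 1):
        xd = x[k - hist]  # the delayed value x(t - delta)
        x[k + 1] = x[k] + dt * (0.2 * xd / (1.0 + xd ** 10) - 0.1 * x[k])
    return x[hist:]

series = mackey_glass(3000)[::6]  # sample every 6 time units, as in Table 1
```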
As can be seen from Table 1, no single method excels for one-step prediction 6 time units ahead, but global rationals and the weighted constant map (WCP) give the worst predictions. For 14 iterated one-step predictions (84 time units), multi layer perceptrons, local AR models, hashing RBF, weighted linear predictors (WLP), and CC-RAN all give prediction errors of approximately the same size. The only bad approximators here were the method of analogy and standard RBF.
Note that this table indicates that the prediction error increases with increasing prediction interval, which is quite natural. To illustrate this point further, we have in Fig. 3 collected the prediction error as a function of prediction interval for three different systems and different system identification methods. The three systems are the Mackey-Glass equation, Rayleigh-Bénard convection, and the Rössler equation. Again, the experimental conditions are similar enough to compare the results.
In the third part, we reproduce results for predicting the Rössler equation. The embedding dimension was 4, the sampling interval 0.87 time units, and 100 samples were used to train the hierarchical MLP net. Curve A) is from a global linear model, B) from a local linear model, C) from global unstable periodic orbits (UPO), D) from local UPO, and E) from a global MLP model. Curves A) to D) are taken from [50], and E) from [17]. For this system, both global AR and local AR models give predictions which soon become totally unreliable; the unstable periodic orbits give better predictions, but the MLP net is the most accurate approximator in this case.
Instead of varying the approximation method while keeping most of the other factors constant, it is interesting to apply a small number of methods to a variety of processes: simulated systems, laboratory experiments, and real life data sets. Such a table can be compiled from [10]. In that article, local AR models are computed for all degrees of locality, varying from the method of analogy to a global AR model. We have in Table 2 chosen the optimal local AR model and compared it with the estimated global AR model. The results can briefly be summarised as follows: with much noise and/or high dimensional systems, global AR models seem to outperform the local AR models; otherwise the local linear models are more accurate.
On noise free data, most of the approximation methods are, at least in principle, able to approximate any reasonably well behaved system to any degree of accuracy, given enough data. The approximation error will therefore depend on the number of samples used in the approximation process. For most methods, there is a strong relation between the number of samples used in training and the number of parameters/weights in the model. We have therefore collated two graphs showing these relations in Fig. 5. All the curves are based on experiments with the standard Mackey-Glass equation as described above, with delay parameter $\Delta = 17$. The first graph shows the prediction error as a function of the number of training samples, the second the prediction error as a function of the number of parameters. In both graphs, A) is the CC-RAN method [15], B) is adaptive clustering RBF [51], C) is hashing RBF [46], D) is k-means RBF [48], and E) is standard RBF [47]. As can be seen, there are no major differences between most of the methods, except k-means RBF, which requires far more training samples. Concerning the number of parameters against prediction error, it is clear that standard and k-means RBF require many parameters compared to the other methods.
Since much of the discussion in this section has centred on deterministic, simulated systems, we conclude with Table 3, which summarises some of the real data sets analysed. We have not been able to find a reasonable way of comparing these data sets, so the reader is referred to the original work.

Conclusion
Prediction of chaotic time series is a fairly new research topic, dating back to 1987. The underlying philosophy is geometrical: fitting non-linear functions to samples in the graph space $\mathbb{R}^{m+1}$. The area of chaotic prediction is still relatively small within the larger domain of non-linear system identification and prediction, which offers a multitude of approaches still not tested on chaotic systems. Within non-linear studies at large, some knowledge of chaos is desirable, since such behaviour is a pervasive non-linear phenomenon that may manifest itself in various models in certain regions of the parameter space.
The fact that chaotic systems are persistently excited, in the sense that accumulating points become dense on the attractor, is an advantage when modelling chaotic time series. In particular, this holds promise both for the local methods and for the adaptive semi-local methods, where the data determine the location and shape of the basis functions.
The paper has focused on methods that up to now have been applied to chaotic time series, including global polynomials, local polynomials, multi layer perceptrons and semi-local methods. Chaotic time series frequently resemble white noise, having broadband Fourier spectra, and the non-linear predictors clearly outperform standard linear methods like global AR models if the noise level is limited and the dimension of the attractor is low. The local and semi-local methods generally seem to be the best, but no non-linear method ranks top in all situations.
Most of the schemes can approximate any well behaved function to any desired accuracy, provided enough samples and basis functions/parameters are available. The crucial question is therefore how many samples and parameters are required to achieve a certain accuracy. From this point of view, the adaptive semi-local methods generally seem to be the best.
A crucial aspect is the robustness of the approximation schemes to noise. With noisy data we have, in the language of numerical mathematicians, a fitting problem and not an approximation problem. If the model is too small, it will underfit the data and ignore important characteristics. If, on the other hand, the model is too large, it will overfit the data, reproducing the noise as well as the underlying behaviour. The number of parameters cannot be used to measure the risk of overfitting, since in most methods the parameters are internally correlated. Currently, cross validation is the standard tool for selecting an appropriate model that avoids overfitting. However, the recent appearance of non-linear filtering schemes for time series allows for pretreatment of the data before constructing the model. This is certainly an interesting future line of development since, up to now, too little work has been done on noisy chaotic time series.