
Estimating confidence intervals for average results from LES



Old   February 2, 2017, 12:32
Default Estimating confidence intervals for average results from LES
  #1
Super Moderator
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Say I performed an LES and extracted some transient quantity from it, e.g. the data in this figure:
[Attached figure: example.png — the sampled time series]

This might be an average mass flow rate through a plane or the force on a specific boundary; it doesn't matter which.
Now calculating the average value of this quantity is straightforward. But how large is the uncertainty of this average value? Let's leave aside issues like initial transients or any uncertainties in the simulation modeling and focus on statistics. If I had the computational resources to perform an infinite number of time steps, the average value \mu_{\infty} of this time series would have zero statistical error. But since in this case I only have 5000 time steps in total, my estimate of the mean value \mu_{5000} is obviously not infinitely accurate.

Now what I want is to estimate a confidence interval (95%, 99%, whatever) such that I can say the infinite mean value \mu_{\infty} lies within this distance of my estimated mean value \mu_{5000}.
Or to put it differently: I want to report my simulation result as \mu = 0.234 \pm 0.056 with 95% certainty.

Edit:
Let me put my question differently:
When performing the same simulation N times with slightly different initial conditions and measuring a time series after the initial transient: I get N different time series with different mean values. Assuming that my sampling time is long enough (>> the largest time scale in the flow) these mean values will be normally distributed.
Now what I want is the standard deviation of this normal distribution. Estimating this would be straightforward if I had all N simulations, but I can only afford one of them. There must be a clever trick to estimate the standard deviation from only one sample.


I am not quite sure which is the correct approach here.
From what I recall from my "statistics for engineers" lecture, I came up with the following approach:
1) Divide the time-series into sub-series of smaller length, e.g. 500 time steps each.
2) Calculate the mean values of these sub-series.
3) Omit every second sub-series to make sure the mean values are uncorrelated.
4) Calculate the standard deviation of the remaining sub-series mean values: s_{sub}.
5) Estimate the standard deviation of the time-series mean value as: s_{5000}=\frac{s_{sub}}{\sqrt{n}}.
Here n is the number of remaining sub-series.
6) Multiply by the appropriate value of Student's t-distribution to obtain the confidence interval.
AFAIK, this procedure is based on the assumption that the mean values of the sub-series are uncorrelated (see 3)) and normally distributed. Both properties could be checked additionally.
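In code, the procedure above would look roughly like the following minimal Python sketch (the function name, the AR(1)-type test signal and the 95% level are just illustrative choices, not part of my actual data):

Code:
import numpy as np
from scipy import stats

def batch_means_ci(x, block_len=500, confidence=0.95):
    # Steps 1)-2): split the series into sub-series and take their means
    n_blocks = len(x) // block_len
    block_means = x[:n_blocks * block_len].reshape(n_blocks, block_len).mean(axis=1)
    # Step 3): keep only every second sub-series mean
    block_means = block_means[::2]
    n = len(block_means)
    # Steps 4)-5): s_sub and s_5000 = s_sub / sqrt(n)
    s_sub = block_means.std(ddof=1)
    s_mean = s_sub / np.sqrt(n)
    # Step 6): half-width of the confidence interval from Student's t
    t = stats.t.ppf(0.5 * (1.0 + confidence), df=n - 1)
    return x.mean(), t * s_mean

# Illustration on a synthetic correlated signal fluctuating around 0.234:
rng = np.random.default_rng(0)
x = np.empty(5000)
x[0] = 0.234
for i in range(1, 5000):
    x[i] = 0.234 + 0.95 * (x[i - 1] - 0.234) + 0.01 * rng.standard_normal()
mu, half = batch_means_ci(x)
print(f"mu = {mu:.4f} +- {half:.4f} (95% confidence)")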

Is this a valid approach or is there a better one?

Last edited by flotus1; February 3, 2017 at 04:19. Reason: better title

Old   February 2, 2017, 12:54
Default
  #2
Senior Member
 
Filippo Maria Denaro
Join Date: Jul 2010
Posts: 6,768
Well, what you are asking for has nothing to do with LES specifically... the same issue would arise for URANS as well as for DNS.

Usually, we do a statistical ensemble average using several fields over a certain period of time, for example no fewer than 30 samples in a time T that must be judged from the characteristic turnover time. That makes the statistics meaningful. Obviously, it does not mean that such a statistically averaged field is also constant in time.
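As a purely illustrative back-of-the-envelope version of that rule of thumb (the length and velocity scales below are assumed numbers, not taken from any specific case):

Code:
# Assumed integral scales of the resolved flow
L_int = 0.1      # integral length scale [m]
u_rms = 0.5      # velocity scale [m/s]
t_turnover = L_int / u_rms        # eddy turnover time = 0.2 s

n_samples = 30                    # "no less than 30 samples"
dt_sample = 2.0 * t_turnover      # assumed spacing so samples are roughly independent
T = n_samples * dt_sample         # required averaging window = 12 s
print(T)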

Old   February 2, 2017, 13:03
Default
  #3
Super Moderator
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
I am aware that this issue is not unique to LES and I have some knowledge about how to post-process DNS, LES and URANS in general.
My question is specifically about the quantitative statistical uncertainty for the average flow properties obtained from this kind of simulation.

Old   February 2, 2017, 13:10
Default
  #4
Senior Member
 
Filippo Maria Denaro
Join Date: Jul 2010
Posts: 6,768
Quote:
Originally Posted by flotus1
I am aware that this issue is not unique to LES and I have some knowledge about how to post-process DNS, LES and URANS in general.
My question is specifically about the quantitative statistical uncertainty for the average flow properties obtained from this kind of simulation.

OK, I wrote that because the title of your post mentioned LES while it is a more general question... Concerning LES/DNS, we focus on spatial correlations and spectra. Usually, we perform spatial averaging along the homogeneous directions, and a supplementary time (ensemble) averaging is performed to make the statistics more meaningful.
To tell the truth, I am not aware of published papers that show a quantitative analysis of the error between finite-period and asymptotic (T -> infinity) averaging... that should be more in the field of signal analysis.

Old   February 2, 2017, 14:05
Default
  #5
Senior Member
 
Filippo Maria Denaro
Join Date: Jul 2010
Posts: 6,768
I remembered some comments reported in this report, sec. 3.1.3
http://torroja.dmt.upm.es/turbdata/a...ARD-AR-345.pdf

Old   February 3, 2017, 03:58
Default
  #6
Senior Member
 
Paolo Lampitella
Join Date: Mar 2009
Location: Italy
Posts: 2,151
Blog Entries: 29
Dear Alex,

I can't add any specific information on the matter. Just my 2 cents on how I do it myself in general, without actually considering any quantitative aspect.

Consider any turbulent case eventually producing a statistically steady state, like yours. Now, you mention taking 1 subset of 500 contiguous samples out of every 2. However, those 500 samples will not be independent. Actually, if you advance in time with an accurate scheme, each sample in your grid will be strongly correlated with the one at the previous time step.

What I do instead is pick just 1 value every n, where n is a function of the flow and the selected time step. You can choose n by first reaching the steady state, then collecting some contiguous samples as you did, and finally computing an autocorrelation in time (in your case at least). That will give you the minimum n needed to achieve independence between samples. Then just restart the run, but now take only 1 sample every n, with n just determined.
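A minimal sketch of how such an n could be estimated from an already collected, statistically steady stretch of the signal (the function name and the 1/e threshold are arbitrary choices of mine; the first zero crossing of the autocorrelation is an equally common criterion):

Code:
import numpy as np

def decorrelation_lag(x, threshold=np.exp(-1.0)):
    # Smallest lag at which the normalized autocorrelation drops below `threshold`;
    # samples taken this many steps apart are treated as approximately independent.
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.dot(x, x) / len(x)
    for lag in range(1, len(x)):
        rho = np.dot(x[:-lag], x[lag:]) / ((len(x) - lag) * var)
        if rho < threshold:
            return lag
    return len(x)  # the record is too short: it never decorrelates

# Usage: n = decorrelation_lag(preliminary_samples); then keep only every n-th sample.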

For what concerns when to stop, once you have the previous procedure in place, you can also monitor the running average over the samples taken as described above (here n just counts the samples in the running average and has nothing to do with the n above):

\bar{x}_n = \frac{n-1}{n}\,\bar{x}_{n-1} + \frac{x_n}{n}

Thus, by monitoring \bar{x}_n, you can see when it settles within your confidence interval, say within +-y% of a certain value. You will not have a quantitative measure of the certainty that the final average value will lie in that interval, but typically such a visual inspection is enough that you don't need one anymore.

This, obviously, does not necessarily require fewer samples.
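A tiny sketch of such a stopping check (the tolerance and window length are placeholders, to be chosen per case):

Code:
def converged(running_means, tol=0.01, window=50):
    # running_means: list of the running average after each retained sample.
    # Stop when the running mean has stayed within +-tol (relative) of its
    # current value for the last `window` retained samples.
    if len(running_means) < window:
        return False
    current = running_means[-1]
    return all(abs(m - current) <= tol * abs(current) for m in running_means[-window:])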

In this case, however, you are also accounting for any LES-dependent correlation between contiguous samples. That is, for your LES, the time over which samples decorrelate is a function of several modeling/numerical aspects and, in principle, differs from the DNS value for the same flow. With this approach you take such specificities into account to some extent (in contrast to just taking 500 contiguous samples, no matter what).

Old   February 3, 2017, 04:17
Default
  #7
Super Moderator
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
Thanks for the link to the report, it is a good read.
Making use of a spatially homogeneous direction is not possible since there is none in the 3D geometries I am currently investigating.

Let me put my question differently:
When performing the same simulation N times with different initial conditions and measuring a time series after the initial transient: I get N different time series with different mean values. Assuming that my sampling time is long enough (>> the largest time scale in the flow) these mean values will be normally distributed.
Now what I want is the standard deviation of this normal distribution. Estimating this would be straightforward if I had all N simulations, but I can only afford one of them. There must be a clever trick to estimate the standard deviation from only one sample.
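For what it is worth, the standard single-record estimate for exactly this quantity from stationary time-series theory (an assumption here, since it is not quoted anywhere in this thread) is \sigma^2_{\bar{x}} \approx \frac{2\,T_{int}}{T}\,\sigma^2_x, where \sigma^2_x is the variance of the fluctuations, T the averaging time and T_{int}=\int_0^{\infty}\rho(\tau)\,d\tau the integral time scale of the autocorrelation \rho(\tau). It assumes T \gg T_{int}; equivalently, the record contains roughly N_{eff}=\frac{T}{2\,T_{int}} independent samples.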

Old   February 3, 2017, 05:50
Default
  #8
Senior Member
 
Paolo Lampitella
Join Date: Mar 2009
Location: Italy
Posts: 2,151
Blog Entries: 29
Quote:
Originally Posted by flotus1
When performing the same simulation N times with different initial conditions and measuring a time series after the initial transient: I get N different time series with different mean values. Assuming that my sampling time is long enough (>> the largest time scale in the flow) these mean values will be normally distributed.
I'm just speculating here (not an expert), but will they be, when the samples all come from a single realization?

I feel like this might be another LES-related point. Turbulence statistics are clearly not Gaussian. But the lack of Gaussianity, to the best of my knowledge, is mostly related to the smallest scales and the flow type. Imagine sampling near the wall of an LES-simulated flow. Do you expect those samples to follow any Gaussian behaviour? Or any other PDF not dependent on the numerics/modeling?

That's why ensuring independence among all the single samples seems the minimum requirement to me (still, I repeat, not an expert here, just for the sake of discussion).

Consider also that the whole matter has also to do with the ergodic hypothesis, e.g. http://www3.imperial.ac.uk/portal/pl.../1/9607696.PDF

P.S. I understand your original question; you are just looking for a formula which, probably, is in any statistics textbook (I don't have one at hand at the moment, otherwise I would have looked it up for you). But I also want to open the discussion to LES-related aspects which might be relevant.

Old   February 3, 2017, 06:11
Default
  #9
Super Moderator
 
Alex
Join Date: Jun 2012
Location: Germany
Posts: 3,399
I completely agree with your point that turbulence statistics are not Gaussian.
But what I took from my statistics lecture is that if you sum a sufficient number of samples from an arbitrary distribution, these sums will be Gaussian -> central limit theorem. And since calculating the mean involves a summation over all sampled values, I expect the mean values to be Gaussian.
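A quick numerical check of that argument (purely illustrative, with an exponential distribution standing in for the non-Gaussian signal):

Code:
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
raw = rng.exponential(scale=1.0, size=(2000, 500))  # strongly skewed samples
block_means = raw.mean(axis=1)                      # means of blocks of 500
print(stats.skew(raw.ravel()), stats.kurtosis(raw.ravel()))   # roughly 2 and 6
print(stats.skew(block_means), stats.kurtosis(block_means))   # both close to 0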

Last edited by flotus1; February 3, 2017 at 07:23.

Old   February 3, 2017, 08:21
Default
  #10
Senior Member
 
Filippo Maria Denaro
Join Date: Jul 2010
Posts: 6,768
Just from a very practical point of view, considering that your problem has no homogeneous directions, I think you can use a single simulation that, after the numerical transient has ended and an energy equilibrium is reached, allows you to sample the fields. In other words, you use your LES to obtain a RANS-like solution by performing an ensemble average of the fields that approximates the time average. You keep sampling until a steady averaged field is obtained. Obviously, no higher-order statistics can be obtained from such an averaged field, only zeroth-order statistics.
However, using such a steady field, you can compute the fluctuations (in the sense of the LES residual with respect to the RANS solution) for each instantaneous field simply by subtraction. Statistics at each time can then be obtained from these. The time autocorrelation can be used to compute the separation time, which gives an idea of how many independent periods you have that could mimic a series of experiments.

Old   February 3, 2017, 08:24
Default
  #11
Senior Member
 
Paolo Lampitella
Join Date: Mar 2009
Location: Italy
Posts: 2,151
Blog Entries: 29
Have you checked these pages?

https://en.wikipedia.org/wiki/Confidence_interval
https://en.wikipedia.org/wiki/Normal..._of_parameters
