Pairs of Random Variables

Scott L. Miller , Donald Childers , in Probability and Random Processes (Second Edition), 2012

5.6 Independent Random Variables

The concept of independent events was introduced in Chapter 2. In this section, we extend this concept to the realm of random variables. To make that extension, consider the events A = {X ≤ x} and B = {Y ≤ y} related to the random variables X and Y. The two events A and B are statistically independent if Pr(A, B) = Pr(A)Pr(B). Restated in terms of the random variables, this condition becomes

(5.37) $\Pr(X \le x, Y \le y) = \Pr(X \le x)\Pr(Y \le y) \;\Longleftrightarrow\; F_{X,Y}(x,y) = F_X(x)\,F_Y(y).$

Hence, two random variables are statistically independent if their joint CDF factors into a product of the marginal CDFs. Differentiating both sides of this equation with respect to both x and y reveals that the same statement applies to the PDF as well. That is, for statistically independent random variables, the joint PDF factors into a product of the marginal PDFs:

(5.38) $f_{X,Y}(x,y) = f_X(x)\,f_Y(y).$

It is not difficult to show that the same statement applies to PMFs as well. The preceding condition can also be restated in terms of conditional PDFs. Dividing both sides of Equation (5.38) by $f_X(x)$ results in

(5.39) $f_{Y|X}(y|x) = f_Y(y).$

A similar result involving the conditional PDF of X given Y could have been obtained by dividing both sides by the PDF of Y. In other words, if X and Y are independent, knowing the value of the random variable X should not change the distribution of Y and vice versa.
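Both directions can be written out in one line (assuming $f_X(x) > 0$ and $f_Y(y) > 0$ so that the divisions are valid):

$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)} = \frac{f_X(x)\,f_Y(y)}{f_X(x)} = f_Y(y), \qquad f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} = f_X(x).$$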

Example 5.13

Returning once again to the joint PDF of Example 5.10, we saw in that example that the marginal PDF of X is

$f_X(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right),$

while the conditional PDF of X given Y is

$f_{X|Y}(x|y) = \sqrt{\frac{2}{3\pi}}\,\exp\left(-\frac{2}{3}\left(x - \frac{y}{2}\right)^2\right).$

Since the conditional PDF of X given Y depends on y and therefore differs from the marginal PDF of X, these two random variables are clearly not independent.

Example 5.14

Suppose the random variables X and Y are uniformly distributed on the square defined by 0 ≤ x, y ≤ 1. That is

$f_{X,Y}(x,y) = \begin{cases} 1, & 0 \le x \le 1,\; 0 \le y \le 1, \\ 0, & \text{otherwise}. \end{cases}$

The marginal PDFs of X and Y work out to be

$f_X(x) = \begin{cases} 1, & 0 \le x \le 1, \\ 0, & \text{otherwise}, \end{cases} \qquad f_Y(y) = \begin{cases} 1, & 0 \le y \le 1, \\ 0, & \text{otherwise}. \end{cases}$

These random variables are statistically independent since fX , Y (x, y) = fX (x)fY (y).

Theorem 5.5: Let X and Y be two independent random variables and consider forming two new random variables U = g 1 (X) and V = g 2(Y). These new random variables U and V are also independent.

Proof: To show that U and V are independent, consider the events A = {U ≤ u} and B = {V ≤ v}. Next, define the region $R_u$ to be the set of all points x such that $g_1(x) \le u$. Similarly, define $R_v$ to be the set of all points y such that $g_2(y) \le v$. Then

$\Pr(U \le u, V \le v) = \Pr(X \in R_u, Y \in R_v) = \int_{R_v}\int_{R_u} f_{X,Y}(x,y)\,dx\,dy.$

Since X and Y are independent, their joint PDF can be factored into a product of marginal PDFs resulting in

$\Pr(U \le u, V \le v) = \int_{R_u} f_X(x)\,dx \int_{R_v} f_Y(y)\,dy = \Pr(X \in R_u)\Pr(Y \in R_v) = \Pr(U \le u)\Pr(V \le v).$

Since we have shown that F U, V (u, v) = FU (u)FV(v), the random variables U and V must be independent.

Another important result deals with the correlation, covariance, and correlation coefficients of independent random variables.

Theorem 5.6: If X and Y are independent random variables, then $E[XY] = \mu_X\mu_Y$, $\mathrm{Cov}(X, Y) = 0$, and $\rho_{X,Y} = 0$.

Proof: $E[XY] = \int\!\!\int xy\,f_{X,Y}(x,y)\,dx\,dy = \int x\,f_X(x)\,dx \int y\,f_Y(y)\,dy = \mu_X\mu_Y.$

The conditions involving covariance and correlation coefficient follow directly from this result.

Therefore, independent random variables are necessarily uncorrelated, but the converse is not true: uncorrelated random variables need not be independent, as the next example demonstrates.

Example 5.15

Consider a pair of random variables X and Y that are uniformly distributed over the unit circle so that

$f_{X,Y}(x,y) = \begin{cases} \dfrac{1}{\pi}, & x^2 + y^2 \le 1, \\ 0, & \text{otherwise}. \end{cases}$

The marginal PDF of X can be found as follows:

$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy = \int_{-\sqrt{1-x^2}}^{\sqrt{1-x^2}} \frac{1}{\pi}\,dy = \frac{2}{\pi}\sqrt{1-x^2}, \quad -1 \le x \le 1.$

By symmetry, the marginal PDF of Y must take on the same functional form. Hence, the product of the marginal PDFs is

$f_X(x)\,f_Y(y) = \frac{4}{\pi^2}\sqrt{(1-x^2)(1-y^2)}, \quad -1 \le x, y \le 1.$

Clearly, this is not equal to the joint PDF, and therefore, the two random variables are dependent. This conclusion could have been determined in a simpler manner. Note that if we are told that X = 1, then necessarily Y = 0, whereas if we know that X = 0, then Y can range anywhere from –1 to 1. Therefore, conditioning on different values of X leads to different distributions for Y.

Next, the correlation between X and Y is calculated.

$E[XY] = \iint_{x^2+y^2\le 1} \frac{xy}{\pi}\,dx\,dy = \frac{1}{\pi}\int_{-1}^{1} x\left[\int_{-\sqrt{1-x^2}}^{\sqrt{1-x^2}} y\,dy\right]dx.$

Since the inner integrand is an odd function (of y) and the limits of integration are symmetric about zero, the integral is zero. Hence, E[XY] = 0. Note from the marginal PDFs just found that both X and Y are zero-mean. So, it is seen for this example that while the two random variables are uncorrelated, they are not independent.

Example 5.16

Suppose we wish to use MATLAB to generate samples of a pair of random variables (X, Y) that are uniformly distributed over the unit circle. That is, the joint PDF is

$f_{X,Y}(x,y) = \begin{cases} \dfrac{1}{\pi}, & x^2 + y^2 < 1, \\ 0, & \text{otherwise}. \end{cases}$

If we generated two random variables independently according to the MATLAB code: X=rand(1); Y=rand(1); this would produce a pair of random variables uniformly distributed over the square 0 < x < 1, 0 < y < 1. One way to achieve the desired result is to generate random variables uniformly over some region which includes the unit circle and then only keep those pairs of samples which fall inside the unit circle. In this case, it is straightforward to generate random variables which are uniformly distributed over the square, –1 < x < 1,–1 < y < 1, which circumscribes the unit circle. Then we keep only those samples drawn from within this square that also fall within the unit circle. The code that follows illustrates this technique. We also show how to generate a three-dimensional plot of an estimate of the joint PDF from the random data generated. To get a decent estimate of the joint PDF, we need to generate a rather large number of samples (we found that 100,000 worked pretty well). This requires that we create and perform several operations on some very large vectors. Doing so tends to make the program run slowly. In order to speed up the operation of the program, we choose to create shorter vectors of random variables (1000 in this case) and then repeat the procedure several times (100 in this case). Although this makes the code a little longer and probably a little harder to follow, by avoiding the creation of very long vectors, it substantially speeds up the program. The results of this program are shown in Figure 5.3.
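The original listing is not reproduced in this excerpt, so the following is a minimal MATLAB sketch of the technique just described (rejection sampling over the circumscribing square followed by a histogram-based estimate of the joint PDF); the block sizes, bin edges, and the use of histcounts2 and mesh are our own choices rather than the authors' code.

% Rejection sampling: draw points uniformly on the square (-1,1)x(-1,1) and
% keep only those that fall inside the unit circle, then estimate the joint PDF.
Nblocks = 100;  Nper = 1000;             % 100 blocks of 1000 samples each
edges = -1:0.1:1;                        % common bin edges for x and y
counts = zeros(length(edges)-1);         % 20-by-20 count accumulator
kept = 0;
for k = 1:Nblocks
    x = 2*rand(1,Nper) - 1;
    y = 2*rand(1,Nper) - 1;
    inside = (x.^2 + y.^2) < 1;          % accept only points inside the unit circle
    counts = counts + histcounts2(x(inside), y(inside), edges, edges);
    kept = kept + sum(inside);
end
binarea = 0.1^2;
pdf_est = counts/(kept*binarea);         % normalize so the estimate integrates to 1
[Xc, Yc] = meshgrid(edges(1:end-1) + 0.05);
mesh(Xc, Yc, pdf_est');                  % three-dimensional plot of the PDF estimate
xlabel('x'); ylabel('y'); zlabel('estimated f_{X,Y}(x,y)');

Inside the circle the estimate hovers near 1/π ≈ 0.318 and drops to zero outside, as in Figure 5.3.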

Figure 5.3. Estimate of the joint PDF of a pair of random variables uniformly distributed over the unit circle from the data generated in Example 5.16.


Source: https://www.sciencedirect.com/science/article/pii/B9780123869814500084

Advances in Geophysics

Wojciech De¸bski , in Advances in Geophysics, 2010

4.8.5 PDF in the D space

Although in solving inverse problems we are most commonly interested in the model parameters $\mathbf{m}$, occasionally we may also need to explore the marginal PDF in the $D$ space. The general formula for this marginal PDF is given in Eq. (67). Let us now consider two special cases of $\sigma^{d}_{\mathrm{pos}}(\mathbf{d})$.

First, let us consider the case when no theoretical information is available:

(114) $\sigma_{\mathrm{th}}(\mathbf{m}, \mathbf{d}) = \mu(\mathbf{m}, \mathbf{d}),$

so the integral in Eq. (67) evaluates to a constant and $\sigma^{d}_{\mathrm{pos}}(\mathbf{d})$ reads

(115) $\sigma^{d}_{\mathrm{pos}}(\mathbf{d}) = \frac{\sigma^{d}_{\exp}(\mathbf{d};\mathbf{d}_{\mathrm{obs}})\,\sigma^{d}_{\mathrm{apr}}(\mathbf{d};\mathbf{d}_{\mathrm{apr}})}{\mu_{d}(\mathbf{d})}.$

The a posteriori PDF is given by the product of the observational and a priori PDFs, as expected.

Next, let us assume that there are no observational data, so

(116) $\sigma^{d}_{\exp}(\mathbf{d};\mathbf{d}_{\mathrm{obs}}) = \mu_{d}(\mathbf{d}).$

In this case we arrive at the following formula for $\sigma^{d}_{\mathrm{pos}}(\mathbf{d})$:

(117) $\sigma^{d}_{\mathrm{pos}}(\mathbf{d}) = \sigma^{d}_{\mathrm{apr}}(\mathbf{d};\mathbf{d}_{\mathrm{apr}}) \int_{\mathcal{M}} \frac{\sigma^{m}_{\mathrm{apr}}(\mathbf{m})\,\sigma_{\mathrm{th}}(\mathbf{m},\mathbf{d})}{\mu(\mathbf{m},\mathbf{d})}\,d\mathbf{m}.$

This formula is the counterpart of Eq. (63) and shows that in the case of missing observational data the a posteriori PDF in the data space is the product of the a priori PDF in D space and the function which describes how the a priori information about m is projected into the D space.

Source: https://www.sciencedirect.com/science/article/pii/S0065268710520016

Multiple Random Variables

Scott L. Miller , Donald Childers , in Probability and Random Processes (Second Edition), 2012

Section 6.1 Joint and Conditional PMFs, CDFs, and PDFs

6.1

Suppose we flip a coin three times, thereby forming a sequence of heads and tails. Form a random vector by mapping each outcome in the sequence to 0 if a head occurs or to 1 if a tail occurs.

(a)

How many realizations of the vector may be generated? List them.

(b)

Are the realizations independent of one another?

6.2

Let $\mathbf{X} = [X_1, X_2, X_3]^T$ represent a three-dimensional vector of random variables that is uniformly distributed over a cubical region

$f_{\mathbf{X}}(\mathbf{x}) = \begin{cases} c, & |x_1| \le 1,\; |x_2| \le 1,\; |x_3| \le 1, \\ 0, & \text{otherwise}. \end{cases}$

(a)

Find the constant c.

(b)

Find the marginal PDF for a subset of two of the three random variables. For example, find $f_{X_1,X_2}(x_1,x_2)$.

(c)

Find the marginal PDF for one of the three random variables. That is, find $f_{X_1}(x_1)$.

(d)

Find the conditional PDFs $f_{X_1|X_2,X_3}(x_1|x_2,x_3)$ and $f_{X_1,X_2|X_3}(x_1,x_2|x_3)$.

(e)

Are the Xi independent?

6.3

Suppose a point in two-dimensional Cartesian space, (X, Y), is equally likely to fall anywhere on the semicircle defined by $X^2 + Y^2 = 1$ and $Y \ge 0$. Find the PDF of Y, $f_Y(y)$.

6.4

Suppose a point in three-dimensional Cartesian space, (X, Y, Z), is equally likely to fall anywhere on the surface of the hemisphere defined by $X^2 + Y^2 + Z^2 = 1$ and $Z \ge 0$.

(a)

Find the PDF of Z, $f_Z(z)$.

(b)

Find the joint PDF of X and Y, $f_{X,Y}(x,y)$.

6.5

Suppose $N_1$ is a discrete random variable equally likely to take on any integer in the set {1, 2, 3}. Given that $N_1 = n_1$, the random variable $N_2$ is equally likely to take on any integer in the set $\{1, 2, \ldots, n_1\}$. Finally, given that $N_2 = n_2$, the random variable $N_3$ is equally likely to take on any integer in the set $\{1, 2, \ldots, n_2\}$.

(a)

Find the two-dimensional joint PMF, $P_{N_1,N_2}(n_1, n_2)$.

(b)

Find the three-dimensional joint PMF, $P_{N_1,N_2,N_3}(n_1, n_2, n_3)$.

(c)

Find the marginal PMFs, $P_{N_2}(n_2)$ and $P_{N_3}(n_3)$.

(d)

What are the chances that none of the three random variables are equal to 1?

6.6

Let $\mathbf{X} = [X_1, X_2, X_3]^T$ represent a three-dimensional vector of random variables that is uniformly distributed over the unit sphere. That is,

$f_{\mathbf{X}}(\mathbf{x}) = \begin{cases} c, & \|\mathbf{x}\| \le 1, \\ 0, & \|\mathbf{x}\| > 1. \end{cases}$

(a)

Find the constant c.

(b)

Find the marginal PDF for a subset of two of the three random variables. For example, find $f_{X_1,X_2}(x_1,x_2)$.

(c)

Find the marginal PDF for one of the three random variables. That is, find $f_{X_1}(x_1)$.

(d)

Find the conditional PDFs $f_{X_1|X_2,X_3}(x_1|x_2,x_3)$ and $f_{X_1,X_2|X_3}(x_1,x_2|x_3)$.

Extra: Can you extend this problem to N-dimensions?

6.7

Let $\mathbf{X} = [X_1, X_2, \ldots, X_N]^T$ represent an N-dimensional vector of random variables that is uniformly distributed over the region $x_1 + x_2 + \cdots + x_N \le 1$, $x_i \ge 0$, $i = 1, 2, \ldots, N$. That is,

$f_{\mathbf{X}}(\mathbf{x}) = \begin{cases} c, & \sum_{i=1}^{N} x_i \le 1,\; x_i \ge 0, \\ 0, & \text{otherwise}. \end{cases}$

(a)

Find the constant c.

(b)

Find the marginal PDF for a subset of M of the N random variables.

(c)

Are the Xi independent? Are the Xi identically distributed?

Source: https://www.sciencedirect.com/science/article/pii/B9780123869814500096

Additional topics in probability

Kandethody M. Ramachandran , Chris P. Tsokos , in Mathematical Statistics with Applications in R (Third Edition), 2021

3.6 Chapter summary

In this chapter we looked at some special probability distribution functions that arise in practice. It should be noted that we discussed only a few of the important probability distributions. There are many other discrete and continuous distributions that are useful and appropriate in particular applications. Some of them are given in Appendix A3. A larger list of probability distributions can be found at http://www.causascientia.org/math_stat/Dists/Compendium.pdf, among many other places. For more than one random variable, we studied the behavior of joint probability distributions. We also saw how to find the probability (mass) density function and cumulative distribution function for functions of a random variable. Limit theorems are a crucial part of probability theory. We introduced the Chebyshev inequality, the law of large numbers, and the central limit theorem (CLT).

We now list some of the key definitions introduced in this chapter:

Bernoulli probability distribution

Binomial experiment

Poisson probability distribution

Probability distribution

Normal (or Gaussian) probability distribution

Standard normal random variable

Gamma probability distribution

Exponential probability distribution

Chi-square (χ2) distribution

Joint pdf

Bivariate probability distributions

Marginal pdf

Conditional probability distribution

Independence of two random variables

Expected value of a function of bivariate random variables

Conditional expectation

Covariance

Correlation coefficient

In this chapter, we have also learned the following important concepts and procedures:

Mean, variance, and mgf of a binomial random variable

Mean, variance, and mgf of a Poisson random variable

Poisson approximation to the binomial probability distribution

Mean, variance, and mgf of a uniform random variable

Mean, variance, and mgf of a normal random variable

Mean, variance, and mgf of a gamma random variable

Mean, variance, and mgf of an exponential random variable

Mean, variance, and mgf of a chi-square random variable

Properties of expected value

Properties of the covariance and correlation coefficient

Procedure to find the cdf of a function of random variable using the method of distribution functions

The pdf of Y  = g(X), where g is differentiable and monotone increasing or decreasing

The pdf of Y  = g(X), using the probability integral transformation

The transformation method to find the pdf of Y  = g(X 1, …, X n )

Chebyshev's theorem

Law of large numbers

CLT

Source: https://www.sciencedirect.com/science/article/pii/B9780128178157000038

Beyond Wavelets

Bertrand Bénichou , Naoki Saito , in Studies in Computational Mathematics, 2003

9.3.2 Statistical Independence

The statistical independence of the coordinates of $\mathbf{Y} \in \mathbb{R}^n$ means

$f_{\mathbf{Y}}(\mathbf{y}) = f_{Y_1}(y_1)\,f_{Y_2}(y_2)\cdots f_{Y_n}(y_n),$

where $f_{Y_k}(y_k)$ is a one-dimensional marginal pdf of $f_{\mathbf{Y}}$. Statistical independence is a key property for compressing and modeling a stochastic process because: 1) an n-dimensional stochastic process of interest can be modeled as a set of one-dimensional processes; and 2) damage to one coordinate does not propagate to the others. Of course, in general, it is difficult to find a truly statistically independent coordinate system for a given stochastic process. Such a coordinate system may not even exist. Therefore, the next best thing we can do is to find the least statistically dependent coordinate system within a basis dictionary. Naturally, then, we need to measure the "closeness" of a coordinate system (or random variables) $Y_1, \ldots, Y_n$ to statistical independence. This can be measured by the mutual information or relative entropy between the true pdf $f_{\mathbf{Y}}$ and the product of its marginal pdfs:

$I(\mathbf{Y}) \triangleq \int f_{\mathbf{Y}}(\mathbf{y})\,\log\frac{f_{\mathbf{Y}}(\mathbf{y})}{\prod_{i=1}^{n} f_{Y_i}(y_i)}\,d\mathbf{y} = -H(\mathbf{Y}) + \sum_{i=1}^{n} H(Y_i),$

where $H(\mathbf{Y})$ and $H(Y_i)$ are the differential entropies of $\mathbf{Y}$ and $Y_i$, respectively:

$H(\mathbf{Y}) = -\int f_{\mathbf{Y}}(\mathbf{y})\log f_{\mathbf{Y}}(\mathbf{y})\,d\mathbf{y}, \qquad H(Y_i) = -\int f_{Y_i}(y_i)\log f_{Y_i}(y_i)\,dy_i.$

We note that $I(\mathbf{Y}) \ge 0$, and $I(\mathbf{Y}) = 0$ if and only if the components of $\mathbf{Y}$ are mutually independent. See [5] for more details on mutual information.

Suppose $\mathbf{Y} = B^{-1}\mathbf{X}$ and $B \in \mathrm{GL}(n, \mathbb{R})$ with $\det(B) = \pm 1$. We denote this set of matrices by $\mathrm{SL}^{\pm}(n, \mathbb{R})$. Note that the usual $\mathrm{SL}(n, \mathbb{R})$ is a subset of $\mathrm{SL}^{\pm}(n, \mathbb{R})$. Then, we have

$I(\mathbf{Y}) = -H(\mathbf{Y}) + \sum_{i=1}^{n} H(Y_i) = -H(\mathbf{X}) + \sum_{i=1}^{n} H(Y_i),$

since the differential entropy is invariant under such an invertible volume-preserving linear transformation, i.e.,

$H(B^{-1}\mathbf{X}) = H(\mathbf{X}) + \log\left|\det(B^{-1})\right| = H(\mathbf{X}),$

because $\left|\det(B^{-1})\right| = 1$. Based on this fact, we proposed the minimization of the following cost function as the criterion to select the so-called least statistically-dependent basis (LSDB) in [21]:

(3.3) $C_H(B\,|\,\mathbf{X}) = \sum_{i=1}^{n} H\big((B^{-1}\mathbf{X})_i\big) = \sum_{i=1}^{n} H(Y_i).$

The sample estimate of this cost given the training dataset T is

$C_H(B\,|\,T) = -\frac{1}{N}\sum_{k=1}^{N}\sum_{i=1}^{n}\log \hat{f}_{Y_i}(y_{i,k}),$

where $\hat{f}_{Y_i}(y_{i,k})$ is an empirical pdf of the coordinate $Y_i$, which must be estimated by an algorithm such as the histogram-based estimator with optimal bin-width search of Hall and Morton [11]. Now, we can define the LSDB as

(3.4) $B_{\mathrm{LSDB}} = B_{\mathrm{LSDB}}(T, D) = \arg\min_{B \in D} C_H(B\,|\,T).$

We note that the differences between this strategy and the standard independent component analysis (ICA) algorithms are: 1) restriction of the search to the basis dictionary D; and 2) approximation of the coordinate-wise entropy. For more details, we refer the reader to [21] for the former and [3] for the latter.
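As a concrete illustration of the sample cost in Eqs. (3.3)-(3.4), the following MATLAB sketch scores one candidate orthonormal matrix B on a training set using plain fixed-bin histogram entropy estimates (a crude stand-in for the Hall and Morton estimator cited above); the data, bin count, and candidate basis below are all hypothetical.

% Score a candidate basis B by the sum of coordinate-wise entropy estimates;
% the LSDB search would repeat this for every B in the dictionary and keep the minimizer.
n = 8;  N = 2000;
X = randn(n, N);                  % hypothetical training data, one column per sample
B = orth(randn(n));               % hypothetical orthonormal candidate basis (det = +/-1)
Y = B \ X;                        % expansion coefficients Y = B^{-1} X
nbins = 32;  cost = 0;
for i = 1:n
    [cnt, edges] = histcounts(Y(i,:), nbins);   % empirical pdf of coordinate Y_i
    p = cnt / N;                                % bin probabilities
    w = diff(edges);                            % bin widths
    nz = p > 0;
    cost = cost - sum( p(nz) .* log( p(nz) ./ w(nz) ) );   % differential entropy estimate
end
fprintf('estimated C_H(B|T) = %.3f\n', cost);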

We now demonstrate the fact that the sparsity and the statistical independence are two intrinsically different concepts using a simple example.

Source: https://www.sciencedirect.com/science/article/pii/S1570579X0380037X

Probability Theory

P.K. Bhattacharya , Prabir Burman , in Theory and Methods of Statistics, 2016

1.10 Independent Random Variables and Conditioning When There Is Dependence

Random variables X 1,   …, X k are said to be mutually independent if

(8a) $F_{X_1,\ldots,X_k}(x_1,\ldots,x_k) = \prod_{i=1}^{k} F_{X_i}(x_i) \quad \text{for all } x_1,\ldots,x_k.$

Equivalently, for mutually independent rv's,

(8b) $f_{X_1,\ldots,X_k}(x_1,\ldots,x_k) = \prod_{i=1}^{k} f_{X_i}(x_i) \quad \text{for all } x_1,\ldots,x_k,$

holds for their joint pdf or joint pmf. The conditional pmf or pdf of $X_{r+1},\ldots,X_k$ given $(X_1,\ldots,X_r) = (x_1,\ldots,x_r)$, when $f_{X_1,\ldots,X_r}(x_1,\ldots,x_r) > 0$, is

$f_{(X_{r+1},\ldots,X_k)|(X_1,\ldots,X_r)}(x_{r+1},\ldots,x_k \,|\, x_1,\ldots,x_r) = \frac{f_{X_1,\ldots,X_k}(x_1,\ldots,x_k)}{f_{X_1,\ldots,X_r}(x_1,\ldots,x_r)}.$

In particular, for (X, Y ) having joint pmf/pdf f XY (x, y),

$f_{Y|X}(y|x) = \frac{f_{XY}(x,y)}{f_X(x)} = \frac{f_{XY}(x,y)}{\int f_{XY}(x,y)\,dy}, \quad \text{when } f_X(x) > 0.$

On the other hand, from the marginal pdf $f_X$ of X and the conditional pdf $f_{Y|X}$ of Y given X, the joint pdf of (X, Y) is obtained as

(9) $f_{XY}(x,y) = f_X(x)\,f_{Y|X}(y|x).$

If X 1,   …, X k are mutually independent rv's, then

$E\big[g_1(X_1)\cdots g_k(X_k)\big] = \prod_{i=1}^{k} E\big[g_i(X_i)\big],$

provided that $E[g_i(X_i)]$, $i = 1,\ldots,k$, exist. This follows immediately from the definition of independence (8a, b). It follows that if X, Y are independent, then $\sigma_{XY} = \mathrm{Cov}[X, Y] = 0$ and $\rho_{XY} = 0$, and therefore,

$\sigma^2_{aX+bY} = a^2\sigma_X^2 + b^2\sigma_Y^2.$

However, $\rho_{XY} = 0$ does not imply that X, Y are independent.

Suppose that $X_1,\ldots,X_n$ are mutually independent and each $X_i$ is distributed as X. We then say that $X_1,\ldots,X_n$ are independent and identically distributed (iid) as X, and if $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, then

$\mu_{\bar{X}} = \mu_X \quad \text{and} \quad \sigma^2_{\bar{X}} = \frac{\sigma_X^2}{n}.$

If $X_1,\ldots,X_k$ are independent rv's and if the mgf $M_{X_i}(t)$ exists for each $X_i$, then

$M_{X_1+\cdots+X_k}(t) = E\Big[e^{t\sum_{i=1}^{k} X_i}\Big] = \prod_{i=1}^{k} E\big[e^{tX_i}\big] = \prod_{i=1}^{k} M_{X_i}(t).$

Going back to the general case of (X, Y ) with joint pdf f XY (x, y),   let

$m(x) = E\big[g(Y)\,\big|\,X = x\big] = \int g(y)\,f_{Y|X}(y|x)\,dy.$

We now denote by $m(X) = E[g(Y)|X]$ the conditional expectation of g(Y) given X. This conditional expectation is a function of the rv X and therefore is itself an rv, which takes the value m(x) when X = x. Hence

$E\big[E[g(Y)|X]\big] = E[m(X)] = \int m(x)\,f_X(x)\,dx = \int\left[\int g(y)\,f_{Y|X}(y|x)\,dy\right] f_X(x)\,dx = \int g(y)\,f_Y(y)\,dy = E[g(Y)],$

because $\int f_{Y|X}(y|x)\,f_X(x)\,dx = \int f_{XY}(x,y)\,dx = f_Y(y)$, using Eq. (9). Next consider

$E\big[h(X)g(Y)\,\big|\,X = x\big] = \int h(x)\,g(y)\,f_{Y|X}(y|x)\,dy = h(x)\int g(y)\,f_{Y|X}(y|x)\,dy = h(x)\,E\big[g(Y)\,\big|\,X = x\big] \quad \text{for each } x.$

Hence $E[h(X)g(Y)|X] = h(X)\,E[g(Y)|X]$. We thus have the following important results:

$E\big[E[g(Y)|X]\big] = E[g(Y)] \quad \text{and} \quad E\big[h(X)g(Y)\,\big|\,X\big] = h(X)\,E\big[g(Y)\,\big|\,X\big].$

We next consider how the variance decomposes under conditioning.

$\mathrm{Var}[Y] = E\big[(Y - E[Y])^2\big] = E\Big[E\big[\big((Y - E[Y|X]) + (E[Y|X] - E[Y])\big)^2 \,\big|\, X\big]\Big] = E\Big[E\big[(Y - E[Y|X])^2 \,\big|\, X\big]\Big] + E\Big[E\big[(E[Y|X] - E[Y])^2 \,\big|\, X\big]\Big] + 2\,E\Big[E\big[(Y - E[Y|X])(E[Y|X] - E[Y]) \,\big|\, X\big]\Big].$

The three terms in the last expression are

$E\Big[E\big[(Y - E[Y|X])^2 \,\big|\, X\big]\Big] = E\big[\mathrm{Var}[Y|X]\big],$

since $E\big[(Y - E[Y|X])^2 \,\big|\, X\big] = \mathrm{Var}[Y|X]$,

$E\Big[E\big[(E[Y|X] - E[Y])^2 \,\big|\, X\big]\Big] = E\big[(E[Y|X] - E[Y])^2\big] = \mathrm{Var}\big[E[Y|X]\big],$

and the third term is 0, using $E[h(X)g(Y)|X] = h(X)\,E[g(Y)|X]$.

Summary

Besides all the properties analogous to the properties of expectation, we have proved the following important properties of conditional expectation.

Proposition 1.10.1

(i)

$E\big[E[g(Y)|X]\big] = E[g(Y)]$.

(ii)

$E\big[h(X)g(Y)\,\big|\,X\big] = h(X)\,E\big[g(Y)\,\big|\,X\big]$.

(iii)

$\mathrm{Var}[Y] = E\big[\mathrm{Var}[Y|X]\big] + \mathrm{Var}\big[E[Y|X]\big]$.

Definition 1.10.1

The function m ( x ) = E Y | X = x is called the regression function of Y on X. In particular, if m(x) is a linear function of x,   then we can represent Y as

Y = α + β X + ε with E ε | X = 0 ,

and if ε is independent of X,   then this is called the linear regression model and Var ε , if it exists, is the residual variance. More generally, if the dependence of Y on a k-dim rv X = (X 1,   …, X k ) is such that m ( x 1 , , x k ) = E Y | X = x is a linear function of (x 1,   …, x k ),   then

Y = α + β 1 X 1 + + β k X k + ε with E ε | X = 0 ,

and ( X , Y ) is said to follow a multiple linear regression model if ε is independent of X .

Example 1.10.1

The joint pdf of (X, Y ) is

$f_{XY}(x,y) = \begin{cases} C(x^2 + 2y^2), & 0 < x, y < 1, \\ 0, & \text{otherwise}. \end{cases}$

Find the constant C, the marginal pdf's of X and Y, and the conditional pdf of Y given X = x, and then find the means $\mu_X, \mu_Y$, the variances $\sigma_X^2, \sigma_Y^2$, the correlation coefficient $\rho_{XY}$, and the conditional expectation $E[Y|X = x]$. Also find $P[X > Y]$.

Solution

$1 = C\int_0^1\!\!\int_0^1 (x^2 + 2y^2)\,dx\,dy = C\left[\int_0^1 x^2\,dx + 2\int_0^1 y^2\,dy\right] = C\left[\frac{1}{3} + \frac{2}{3}\right].$

Thus C = 1 and $f_{XY}(x,y) = x^2 + 2y^2$, 0 < x < 1, 0 < y < 1. Now

$f_X(x) = \int_0^1 (x^2 + 2y^2)\,dy = x^2 + \frac{2}{3}, \quad 0 < x < 1, \qquad f_Y(y) = \int_0^1 (x^2 + 2y^2)\,dx = 2y^2 + \frac{1}{3}, \quad 0 < y < 1.$

The conditional pdf of Y given X = x is

$f_{Y|X}(y|x) = \frac{f_{XY}(x,y)}{f_X(x)} = \frac{x^2 + 2y^2}{x^2 + \frac{2}{3}}, \quad 0 < y < 1, \text{ for } 0 < x < 1.$

Next, we evaluate the means, the variances, and the correlation coefficient:

$$\begin{aligned}
\mu_X &= \int_0^1 x\,f_X(x)\,dx = \int_0^1 x\left(x^2 + \tfrac{2}{3}\right)dx = \tfrac{1}{4} + \tfrac{2}{3}\cdot\tfrac{1}{2} = \tfrac{7}{12},\\
\mu_Y &= \int_0^1 y\,f_Y(y)\,dy = \int_0^1 y\left(2y^2 + \tfrac{1}{3}\right)dy = 2\cdot\tfrac{1}{4} + \tfrac{1}{3}\cdot\tfrac{1}{2} = \tfrac{2}{3},\\
\sigma_X^2 &= E[X^2] - \mu_X^2 = \int_0^1 x^2\left(x^2 + \tfrac{2}{3}\right)dx - \left(\tfrac{7}{12}\right)^2 = \tfrac{1}{5} + \tfrac{2}{3}\cdot\tfrac{1}{3} - \left(\tfrac{7}{12}\right)^2 = \tfrac{19}{45} - \tfrac{49}{144} = \tfrac{59}{720},\\
\sigma_Y^2 &= E[Y^2] - \mu_Y^2 = \int_0^1 y^2\left(2y^2 + \tfrac{1}{3}\right)dy - \left(\tfrac{2}{3}\right)^2 = \tfrac{2}{5} + \tfrac{1}{3}\cdot\tfrac{1}{3} - \left(\tfrac{2}{3}\right)^2 = \tfrac{23}{45} - \tfrac{4}{9} = \tfrac{1}{15},\\
\sigma_{XY} &= E[XY] - \mu_X\mu_Y = \int_0^1\!\!\int_0^1 xy\,(x^2 + 2y^2)\,dx\,dy - \tfrac{7}{12}\cdot\tfrac{2}{3} = \tfrac{1}{4}\cdot\tfrac{1}{2} + 2\cdot\tfrac{1}{2}\cdot\tfrac{1}{4} - \tfrac{7}{12}\cdot\tfrac{2}{3} = \tfrac{3}{8} - \tfrac{7}{18} = -\tfrac{1}{72},\\
\rho_{XY} &= \frac{\sigma_{XY}}{\sigma_X\sigma_Y} = \frac{-\tfrac{1}{72}}{\sqrt{\tfrac{59}{720}}\,\sqrt{\tfrac{1}{15}}} = -\tfrac{1}{72}\sqrt{\tfrac{10{,}800}{59}} = -0.1879.
\end{aligned}$$

The conditional expectation of Y given X = x is

$E[Y|X = x] = \int_0^1 y\,f_{Y|X}(y|x)\,dy = \int_0^1 y\,\frac{x^2 + 2y^2}{x^2 + \frac{2}{3}}\,dy = \frac{1}{x^2 + \frac{2}{3}}\int_0^1 (x^2 y + 2y^3)\,dy = \frac{\frac{x^2}{2} + \frac{2}{4}}{x^2 + \frac{2}{3}} = \frac{x^2 + 1}{2x^2 + \frac{4}{3}}.$

Note that

$E\big[E[Y|X]\big] = \int_0^1 E[Y|X = x]\,f_X(x)\,dx = \int_0^1 \frac{x^2 + 1}{2x^2 + \frac{4}{3}}\left(x^2 + \frac{2}{3}\right)dx = \int_0^1 \frac{1}{2}(x^2 + 1)\,dx = \frac{1}{2}\left(\frac{1}{3} + 1\right) = \frac{2}{3} = \mu_Y, \text{ as it should be}.$

Finally,

$P[X > Y] = \int_0^1\!\!\int_0^x (x^2 + 2y^2)\,dy\,dx = \int_0^1 \left(x^3 + \frac{2x^3}{3}\right)dx = \frac{5}{3}\cdot\frac{1}{4} = \frac{5}{12}.$
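As a sanity check on these hand calculations, here is a short MATLAB sketch using numerical integration; the function handles and variable names are our own choices.

% Numerical check of Example 1.10.1 with integral2 (f(x,y) = x^2 + 2y^2 on the unit square).
f   = @(x,y) x.^2 + 2*y.^2;
muX = integral2(@(x,y) x.*f(x,y), 0, 1, 0, 1);              % 7/12
muY = integral2(@(x,y) y.*f(x,y), 0, 1, 0, 1);              % 2/3
vX  = integral2(@(x,y) x.^2.*f(x,y), 0, 1, 0, 1) - muX^2;   % 59/720
vY  = integral2(@(x,y) y.^2.*f(x,y), 0, 1, 0, 1) - muY^2;   % 1/15
cXY = integral2(@(x,y) x.*y.*f(x,y), 0, 1, 0, 1) - muX*muY; % -1/72
rho = cXY/sqrt(vX*vY);                                      % about -0.188
PXgtY = integral2(f, 0, 1, 0, @(x) x);                      % 5/12 (region y < x)
fprintf('muX=%.4f muY=%.4f rho=%.4f P[X>Y]=%.4f\n', muX, muY, rho, PXgtY);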

Source: https://www.sciencedirect.com/science/article/pii/B9780128024409000011

Feature Selection

Robert Nisbet Ph.D. , ... Ken Yale D.D.S., J.D. , in Handbook of Statistical Analysis and Data Mining Applications (Second Edition), 2018

Feature Ranking Methods

Simple feature ranking methods include the use of statistical metrics, like the correlation coefficient (described in Chapter 4). A more complex feature ranking method is the Gini Index (introduced in Chapter 4).

Gini Index

The Gini Index can be used to quantify the unevenness in variable distributions and income distributions among countries. The theory behind the Gini Index relies on the difference between a theoretical equality of some quantity and its actual value over the range of a related variable. This concept was introduced by Max O. Lorenz in 1905 to represent the unequal distribution of income among countries.

The empirical formula for the Gini score is

(5.1) $G = \frac{n+1}{n} - \frac{2\sum_{i=1}^{n}(n + 1 - i)\,x_i}{n\sum_{i=1}^{n} x_i}$

where $x_i$ is the ith value, sorted from least to greatest. For example, suppose $12 is distributed among five people as follows: Two people receive $3, and three people receive $2. In this scenario,

the bottom 20% own $2 or 16.7% of the wealth,

the bottom 40% own $4 or 33.3% of the wealth,

the bottom 60% own $6 or 50% of the wealth,

the bottom 80% own $9 or 75% of the wealth,

the bottom 100% own $12 or 100% of the wealth.

The Lorenz curve is shown in Fig. 5.1.

Fig. 5.1. The Lorenz curve for personal income example.

The Gini coefficient for this data set is

(5.2) $G = \frac{6}{5} - \frac{2 \times 33}{5 \times 12} = 1.2 - 1.1 = 0.1$

Source: http://www.had2know.com/academics/gini-coefficient-calculator.html.
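A minimal MATLAB sketch of Eq. (5.1), checked against the five-person example above (the variable names are ours):

% Gini score from Eq. (5.1): x holds the values, sorted from least to greatest.
x = sort([3 3 2 2 2]);                 % five people sharing $12
n = numel(x);
G = (n+1)/n - 2*sum((n+1-(1:n)).*x) / (n*sum(x));
fprintf('Gini score = %.3f\n', G);     % prints 0.100, matching Eq. (5.2)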

The theoretical even distribution of income among the people in this data set is symbolized by the straight line through the center of the figure. The inequality in incomes at any percent level of the people is plotted as the curved line below the line of perfect equality. The total inequality among all persons is represented by the area between the diagonal line and the curved line (colored yellow). If the curved line remained near the bottom of the figure until the 80th percentile, for example, it would represent a population with a few very rich people and a lot of very poor people.

Corrado Gini incorporated the Lorenz concept in 1912 to quantify the change in relative frequency of income values along the range of a population of the countries. For example, if you divide up the total number of households into deciles (every 10%), you can count the number of households in each decile and express the quantity as a relative frequency. This "binning" approach allows you to use a frequency-based calculation method instead of an integration method to find the area under the Lorenz curve at each point along the percent of household axis (analogous to the X-axis in Fig. 5.1). Many analytic tools provide facilities to calculate Gini scores for only part of the Lorenz curve for each variable, which represent the relative inequality among bins of the range of each variable. One use of this information is to determine the cut-point in the range of a variable in building decision trees.

You can program the Gini score in Perl, Python, C++, or SQL. A Perl program to calculate the Gini score can be found on the book website (GINI.plx). You can use this method as a guide in selecting a short list of variables to submit to the modeling algorithm. For example, you might select all variables with a Gini score greater than 0.6 for entry into the model. The disadvantage of using this method is that it combines effects of data in a given range of one variable that may not reflect the combined effects of all variables interacting with it. But that is the problem with most feature ranking methods.

A slightly more integrative approach is to use bivariate methods like the scatterplots and web diagrams described in Chapter 4.

Bivariate Methods

Other bivariate methods like mutual information calculate the distance between the actual joint distribution of features X and Y and what the joint distribution would be if X and Y were independent. The joint distribution is the probability distribution of cases in which both events X and Y occur together. Formally, the mutual information of two discrete random variables X and Y can be defined as

(5.3) $I(X;Y) = \sum_{y \in Y}\sum_{x \in X} p(x,y)\,\log\frac{p(x,y)}{p_1(x)\,p_2(y)}$

where p(x,y) is the joint probability distribution function and p 1(x) and p 2(y) are the independent probability (or marginal probability) density functions of X and Y, respectively. If you are a statistician, this all makes sense to you, and you can derive this metric easily. Otherwise, we suggest that you look for some approach that makes more sense to you intuitively. If this is the case, you might be more comfortable with one of the multivariate methods implemented in many statistical packages. Two of those methods are stepwise regression and partial least squares regression.
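A minimal MATLAB sketch of Eq. (5.3) for a small joint probability table (the table itself is hypothetical):

% Mutual information of two discrete variables from their joint probability table.
% Rows index values of X, columns index values of Y; entries must sum to 1.
pxy = [0.30 0.10;
       0.10 0.50];                    % hypothetical joint distribution
px  = sum(pxy, 2);                    % marginal of X (column vector)
py  = sum(pxy, 1);                    % marginal of Y (row vector)
ind = px * py;                        % joint table if X and Y were independent
nz  = pxy > 0;                        % skip zero cells to avoid log(0)
I   = sum( pxy(nz) .* log( pxy(nz) ./ ind(nz) ) );
fprintf('I(X;Y) = %.4f nats\n', I);   % 0 only when pxy equals px*py everywhere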

Multivariate Methods

Stepwise Linear Regression

A slightly more sophisticated method is the one used in stepwise regression. This is a classical statistics method that calculates the F-value for the incremental inclusion of each variable in the regression. The F-value is equivalent to the square of the Student's t-value, expressing how different two data samples are, where one sample includes the variable and the other sample does not. The t-value is calculated by

t = difference in the sample means / standard deviation of differences

and so

$F = (t\text{-value})^2$

The F-value is sensitive to the number of variables used to calculate the numerator of this ratio and to the number of variables used to calculate the denominator. Stepwise regression calculates the F-value both with and without using a particular variable and compares it with a critical F-value either to include the variable (forward stepwise selection) or to eliminate the variable from the regression (backward stepwise selection). In this way, the algorithm can select the set of variables that meets the F-value criterion. It is assumed that these variables account for a sufficient amount of the total variance in the target variable in order to predict it at a given level of confidence specified for the F-value (usually 95%).

If your variables are numeric (or can be converted to numbers), you can use stepwise regression to select the variables you use for other data mining algorithms. But there is a "fly" in this ointment. Stepwise regression is a parametric procedure and is based on the same assumptions characterizing other classical statistical methods. Even so, stepwise regression can be used to give you one perspective on the short list of variables. You should use other methods and compare lists. Don't necessarily trust the list of variables included in the regression solution, because their inclusion assumes linear relationships of variables with the target variable, which in reality may be quite nonlinear in nature.
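A minimal sketch of forward/backward stepwise selection in MATLAB, assuming the Statistics and Machine Learning Toolbox is available; the data here are synthetic and stepwiselm's default entry/removal criteria are used.

% Stepwise linear regression as a feature-ranking aid (synthetic data).
rng(1);
N = 500;
X = randn(N, 6);                               % six candidate predictors
y = 3*X(:,1) - 2*X(:,3) + 0.5*randn(N,1);      % only variables 1 and 3 matter
mdl = stepwiselm(X, y, 'constant', 'Upper', 'linear', 'Verbose', 1);
disp(mdl.Formula)                              % shows the variables retained by the F-type criterion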

Partial Least Squares Regression

A slightly more complex variant of multiple stepwise regression keeps track of the partial sums of squares in the regression calculation. These partial values can be related to the contribution of each variable to the regression model. Statistica provides an output report from partial least squares regression, which can give another perspective on which to base feature selection. Table 5.1 shows an example of this output report for an analysis of manufacturing failures.

Table 5.1. Marginal Contributions of Six Predictor Variables to the Target Variable (Total Defects)

Summary of PLS (fail_tsf.STA) Responses: TOT_DEFS Options—NO-INTERCEPT AUTOSCALE
Increase—R 2 of Y
Variable 1 0.799304
Variable 2 0.094925
Variable 3 0.014726
Variable 4 0.000161
Variable 5 0.000011
Variable 6 0.000000

It is obvious that variables 1 and 2 (and marginally, variable 3) provide significant contributions to the predictive power of the model (total R 2  =   0.934). On the basis of this analysis, we might consider eliminating variables 4 through 6 from our variable short list.
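A comparable report can be produced in MATLAB with plsregress (Statistics and Machine Learning Toolbox). The sketch below uses synthetic data and prints the fraction of response variance explained per PLS component, which is analogous to, though not identical to, the per-variable increments in Table 5.1.

% Partial least squares: fraction of response variance explained per component.
rng(2);
N = 500;
X = randn(N, 6);
y = 3*X(:,1) - 2*X(:,2) + 0.2*X(:,3) + 0.5*randn(N,1);
ncomp = 6;
[~, ~, ~, ~, ~, PCTVAR] = plsregress(X, y, ncomp);
disp('Increase in R^2 of Y per PLS component:')
disp(PCTVAR(2,:)')                    % row 2 of PCTVAR refers to the response
disp('Cumulative R^2 of Y:')
disp(cumsum(PCTVAR(2,:))')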

Sensitivity Analysis

Some machine-learning algorithms (like neural nets) provide an output report that evaluates the final weights assigned to each variable to calculate how sensitive the solution is to the inclusion of that variable. These sensitivity values are analogous to the F-values calculated for the inclusion of each variable in stepwise regression. Both IBM SPSS Modeler and STATISTICA Data Miner provide sensitivity reports for their automated neural nets. These sensitivity values can be used as another way to determine the best set of variables to include in a model. One strategy that can be followed is to train a neural net with default characteristics and include in your short list all variables with greater than a threshold level of sensitivity. Granted, this approach is less precise than the linear stepwise regression, but the neural net set of variables may be much more generalizable, by virtue of their ability to capture nonlinear relationships effectively.

Complex Methods

A piecewise linear network uses a distance measure to assign incoming cases to an appropriate cluster. The clusters can be defined by any appropriate clustering method. A separate function called a basis function is defined for each cluster of cases. A pruning algorithm can be applied to eliminate the least important clusters, one at a time, leading to a more compact network. This approach can be viewed as a nonlinear form of stepwise linear regression.

Multivariate Adaptive Regression Splines (MARS)

The MARS algorithm was popularized by Friedman (1991) to solve regression and classification problems with multiple outcomes (target variables). This approach can be viewed as a form of piecewise linear regression, which adapts a solution to local data regions of similar linear response. Each of the local regions is expressed by a different basis function. MARS algorithms can also be viewed as a form of regression trees, in which the "hard" splits into separate branches of the tree are replaced by the smooth basis functions. The MARS algorithm is implemented in STATISTICA Data Miner by the MARSplines algorithm, which includes a pruning routine—a very powerful tool for feature selection. The MARSplines algorithm will pick up only those basis functions (and those predictor variables) that provide a "sizeable" contribution to the prediction. The output of the MARSplines module will retain only those variables associated with basis functions that were retained for the final solution of the model and rank them according to the number of times they are used in different parts of the model.

You can run your data through a procedure like the STATISTICA MARSplines module to gain some insights for building your variable short list. Refer to Hastie et al. (2001) for additional details.

Source: https://www.sciencedirect.com/science/article/pii/B9780124166325000050

The Basis of Monte Carlo

William L. Dunn , J. Kenneth Shultis , in Exploring Monte Carlo Methods, 2012

2.3 Multiple Random Variables

In most Monte Carlo applications multiple random variables are involved that often depend on each other. In this section some of the important properties of such probability functions are summarized.

2.3.1 Two Random Variables

The concept of a probability density function of a single random variable can be extended to probability density functions of more than one random variable. For two random variables, x and y, f (x, y) is called the joint probability density function if it is defined and non-negative on the interval x ∈ [a, b], y ∈ [c, d] and if

(2.35) $\int_a^b\!\!\int_c^d f(x,y)\,dy\,dx = 1.$

The functions

(2.36) $f_x(x) = \int_c^d f(x,y)\,dy \qquad \text{and} \qquad f_y(y) = \int_a^b f(x,y)\,dx$

are called the marginal PDFs of x and y, respectively. The joint PDF can be written in terms of these marginal PDFs as

(2.37) $f(x,y) = f(x|y)\,f_y(y) = f(y|x)\,f_x(x),$

where f(x|y) is called the conditional PDF of x given y and f(y|x) is called the conditional PDF of y given x. Equation (2.37) can also be used to define the conditional PDFs, namely,

(2.38) $f(x|y) = \frac{f(x,y)}{f_y(y)} \qquad \text{and} \qquad f(y|x) = \frac{f(x,y)}{f_x(x)}.$

The mean and variance of x are given by

(2.39) $\langle x\rangle \equiv \int_a^b\!\!\int_c^d x\,f(x,y)\,dy\,dx = \int_a^b x\,f_x(x)\,dx$

and

(2.40) $\sigma^2(x) \equiv \int_a^b\!\!\int_c^d (x - \langle x\rangle)^2 f(x,y)\,dy\,dx = \int_a^b (x - \langle x\rangle)^2 f_x(x)\,dx.$

Similar expressions hold for $\langle y\rangle$ and $\sigma^2(y)$. A measure of how x and y are dependent on each other is given by the covariance, defined as

(2.41) $\mathrm{covar}(x, y) \equiv \int_a^b\!\!\int_c^d (x - \langle x\rangle)(y - \langle y\rangle)\,f(x,y)\,dy\,dx.$

Finally, if f(x, y) = g(x)h(y), where g(x) and h(y) are also PDFs, then x and y are independent random variables, and it is easily shown that covar(x, y) = 0.

Example 2.2 Isotropic Scattering

In simulating particle transport, one must keep track of a particle's direction. For instance, as neutrons migrate through matter, they scatter from nuclei. The probability of scattering from one direction into another direction is often expressed as a joint PDF of two angles, a polar angle θ (measured from the initial direction) and an azimuthal angle ψ. Generally, neutron scattering is rotationally invariant, meaning that the angle ψ is independent of the scatter angle θ and all azimuthal angles are equally likely. Thus, θ and ψ are independent and the joint PDF for neutron scattering can be expressed as

$f(\theta, \psi) = g(\theta)\,h(\psi),$

where

$h(\psi) = \frac{1}{2\pi}, \quad \psi \in [0, 2\pi].$

The form of the PDF g(θ) depends on such things as the type and energy of the radiation and the frame of reference in which the scatter angle is defined. These matters are more fully developed later, in Chapters 9 and 10, and in references such as Shultis and Faw [2000].
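As a small illustration of how such a factored joint PDF is used in a simulation, the MATLAB sketch below samples the azimuthal angle from h(ψ) and, purely as a placeholder for g(θ), assumes isotropic scattering (μ = cos θ uniform on [-1, 1]); the actual g(θ) depends on the physics, as noted above.

% Sample scattering angles from a factored joint PDF f(theta,psi) = g(theta) h(psi).
% psi is uniform on [0, 2*pi); for g(theta) we assume isotropic scattering here,
% i.e., mu = cos(theta) uniform on [-1, 1] (an assumption, not the general case).
N     = 1e5;
psi   = 2*pi*rand(N,1);            % azimuthal angle from h(psi) = 1/(2*pi)
mu    = 2*rand(N,1) - 1;           % cos(theta) uniform on [-1,1] (isotropic assumption)
theta = acos(mu);                  % polar scattering angle
% Direction cosines of the scattered direction relative to the incident direction:
u = sin(theta).*cos(psi);  v = sin(theta).*sin(psi);  w = mu;
fprintf('mean direction cosines: %.3f %.3f %.3f (all near 0 for isotropic)\n', ...
        mean(u), mean(v), mean(w));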

Bayes theorem

Bayes theorem, which follows from the axioms of probability, relates the conditional probabilities of two events, say x and y, with the joint probability density function f(x, y) just discussed. For two random variables, this theorem states

(2.42) $f(x|y) = \frac{f_x(x)\,f(y|x)}{f_y(y)}.$

This result is easily verified by using the definitions of Eq. (2.38). This theorem and its application to nonclassical statistical analysis are discussed at greater length in Chapter 6.

2.3.2 More Than Two Random Variables

The concept of a PDF for two random variables can be generalized to a collection of n random variables $\mathbf{x} = \{x_1, x_2, \ldots, x_n\}$. Note that each component random variable $x_j$, $j = 1, \ldots, n$, can be either discrete or continuous. The function $f(\mathbf{x})$ is the joint PDF of $\mathbf{x}$ if it obeys both $f(\mathbf{x}) \ge 0$ for all values of $\mathbf{x} \in V$ and $\int_V f(\mathbf{x})\,d\mathbf{x} = 1$, where V defines the "volume" over which $\mathbf{x}$ is defined. If $\mathbf{x}$ is decomposed such that $\mathbf{x} = \{\mathbf{x}_j, \mathbf{x}_k\}$, where $\mathbf{x}_k \in V_k$ and where $j + k = n$, then

(2.43) $f_{\mathbf{x}_j}(\mathbf{x}_j) = \int_{V_k} f(\mathbf{x})\,d\mathbf{x}_k$

is called the marginal PDF of $\mathbf{x}_j$ and

(2.44) $f(\mathbf{x}_k \,|\, \mathbf{x}_j) = \frac{f(\mathbf{x})}{f_{\mathbf{x}_j}(\mathbf{x}_j)}$

is called the conditional PDF of $\mathbf{x}_k$, given $\mathbf{x}_j$. Also, in analogy to Eq. (2.40), each $x_i$ has a variance and, in analogy to Eq. (2.41), each pair of random variables has a covariance. Finally, if

(2.45) $f(\mathbf{x}) = f_{x_1}(x_1)\,f_{x_2}(x_2)\cdots f_{x_n}(x_n) = \prod_{i=1}^{n} f_{x_i}(x_i),$

then all n random variables are independent and all covariances between any two pairs are zero.

Again, suppose z represents a stochastic process that is a function of the random variable $\mathbf{x}$, where $\mathbf{x}$ is governed by the joint PDF $f(\mathbf{x})$. Then $z(\mathbf{x})$ is also a random variable, and one can define its expected value, by analogy with Eq. (2.16), as

(2.46) $\langle z\rangle \equiv \int_V z(\mathbf{x})\,f(\mathbf{x})\,d\mathbf{x},$

and its variance, by analogy with Eq. (2.17), as

(2.47) $\sigma^2(z) \equiv \big\langle [z(\mathbf{x}) - \langle z\rangle]^2\big\rangle = \int_V [z(\mathbf{x}) - \langle z\rangle]^2 f(\mathbf{x})\,d\mathbf{x}.$

This last result can be reduced to

(2.48) $\sigma^2(z) = \langle z^2\rangle - \langle z\rangle^2.$

The quantity 〈z〉 is properly called the population mean, while σ2(z) is the population variance and σ ( z ) = σ 2 ( z ) is the population standard deviation. However, the formality of this nomenclature is often ignored and the terms mean, variance, and standard deviation are commonly used, respectively, for these quantities.

2.3.3 Sums of Random Variables

The purpose of a Monte Carlo calculation is to estimate some expected value $\langle z\rangle$ by the sample mean or average given by Eq. (2.19). But to know how good this estimate is, the variance of $\bar{z}$, which is also a random variable, is also needed.

Begin by considering the random variable $\xi = \sum_{i=1}^{N} z_i$ where, in general, the $z_i$ are distributed by a joint PDF $f(z_1, z_2, \ldots, z_N)$. The expected value of a sum can be shown (see problem 6) to be the sum of expected values, i.e.,

(2.49) $\langle \xi\rangle = \Big\langle \sum_{i=1}^{N} z_i\Big\rangle = \sum_{i=1}^{N} \langle z_i\rangle = \sum_{i=1}^{N} \mu_i.$

The variance of ξ is calculated as

(2.50) $$\begin{aligned}
\sigma^2(\xi) = \sigma^2\Big(\sum_{i=1}^{N} z_i\Big) &= \Big\langle\Big(\sum_{i=1}^{N} z_i - \Big\langle\sum_{i=1}^{N} z_i\Big\rangle\Big)^2\Big\rangle = \Big\langle\Big(\sum_{i=1}^{N} (z_i - \langle z_i\rangle)\Big)^2\Big\rangle \\
&= \sum_{i=1}^{N}\sum_{j=1}^{N} \big\langle (z_i - \langle z_i\rangle)(z_j - \langle z_j\rangle)\big\rangle \\
&= \sum_{i=1}^{N} \big\langle (z_i - \langle z_i\rangle)^2\big\rangle + \sum_{i=1}^{N}\sum_{\substack{j=1\\ j\ne i}}^{N} \big\langle (z_i - \langle z_i\rangle)(z_j - \langle z_j\rangle)\big\rangle \\
&= \sum_{i=1}^{N} \sigma^2(z_i) + 2\sum_{i=1}^{N}\sum_{\substack{j=1\\ j<i}}^{N} \mathrm{covar}(z_i, z_j).
\end{aligned}$$

Now if the $z_i$ are all independent, then $f(z_1, z_2, \ldots, z_N) = \prod_{i=1}^{N} f_i(z_i)$, and it is easy to show that $\mathrm{covar}(z_i, z_j) = 0$ (see problem 7) so that

(2.51) $\sigma^2(\xi) = \sum_{i=1}^{N} \sigma^2(z_i).$

One final property of the variance of a random variable is needed, namely, for a constant α,

(2.52) $\sigma^2(\alpha\xi) = \big\langle (\alpha\xi - \alpha\langle\xi\rangle)^2\big\rangle = \alpha^2\big\langle (\xi - \langle\xi\rangle)^2\big\rangle = \alpha^2\sigma^2(\xi).$

Now apply these results to the Monte Carlo estimate of the sample mean $\bar{z} = (1/N)\sum_{i=1}^{N} z(\mathbf{x}_i)$. Here, the $z(\mathbf{x}_i) \equiv z_i$ are all identically distributed because the $z_i$ are independently sampled from the same PDF f(z). Thus, all the $z_i$ have the same variance $\sigma^2(z)$ given by Eq. (2.18). Then the variance of the sample mean is, from Eqs. (2.51) and (2.52),

(2.53) $\sigma^2(\bar{z}) = \sigma^2\Big(\frac{1}{N}(z_1 + z_2 + \cdots + z_N)\Big) = \frac{1}{N^2}\big[\sigma^2(z) + \sigma^2(z) + \cdots + \sigma^2(z)\big] = \frac{1}{N}\sigma^2(z).$

Although $\sigma^2(z)$ is not known, it can be approximated by the sample variance $s^2(z)$, as given by Eq. (2.23), so that

(2.54) $\sigma(\bar{z}) \simeq \frac{s(z)}{\sqrt{N}} = \frac{1}{\sqrt{N}}\sqrt{\frac{N}{N-1}\big(\overline{z^2} - \bar{z}^2\big)} \simeq \sqrt{\frac{\overline{z^2} - \bar{z}^2}{N}} \quad \text{for large } N.$

Hence, the Monte Carlo estimate of $\langle z\rangle \pm \sigma(\bar{z})$ is calculated as

(2.55) $\langle z\rangle \pm \sigma(\bar{z}) \equiv \int_V z(\mathbf{x})\,f(\mathbf{x})\,d\mathbf{x} \pm \sqrt{\frac{1}{N}\int_V \big(z(\mathbf{x}) - \langle z\rangle\big)^2 f(\mathbf{x})\,d\mathbf{x}} \simeq \bar{z} \pm \sqrt{\frac{\overline{z^2} - \bar{z}^2}{N}}.$

Here, $\bar{z} = (1/N)\sum_{i=1}^{N} z(\mathbf{x}_i)$ and $\overline{z^2} = (1/N)\sum_{i=1}^{N} z(\mathbf{x}_i)^2$, where the $\mathbf{x}_i$ are distributed according to $f(\mathbf{x})$ over the "volume" V.
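A minimal MATLAB sketch of Eq. (2.55) for a concrete (hypothetical) choice of f and z: x uniform on [0, 1] (so f(x) = 1) and z(x) = x², whose exact expectation is 1/3.

% Monte Carlo estimate of <z> with its one-sigma uncertainty, Eq. (2.55).
N    = 1e5;
x    = rand(N,1);                 % samples of x drawn from f(x) = 1 on [0,1]
z    = x.^2;                      % the scoring function z(x); exact <z> = 1/3
zbar  = mean(z);                  % sample mean (estimate of <z>)
z2bar = mean(z.^2);               % sample mean of z^2
sigma_zbar = sqrt((z2bar - zbar^2)/N);   % standard deviation of the sample mean
fprintf('<z> estimate = %.5f +/- %.5f (exact 1/3)\n', zbar, sigma_zbar);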

Source: https://www.sciencedirect.com/science/article/pii/B9780444515759000026

Feature Selection

Robert Nisbet , ... Gary Miner , in Handbook of Statistical Analysis and Data Mining Applications, 2009

Feature Ranking Methods

Simple feature ranking methods include the use of statistical metrics, like the correlation coefficient (described in Chapter 4). A more complex feature ranking method is the Gini Index (introduced in Chapter 4).

Gini Index

The Gini Index can be used to quantify the unevenness in variable distributions, as well as income distributions among countries. The theory behind the Gini Index relies on the difference between a theoretical equality of some quantity and its actual value over the range of a related variable. This concept was introduced by Max O. Lorenz in 1905 to represent the unequal distribution of income among countries, and can be illustrated by Figure 5.1.

Figure 5.1. The Lorenz curve relating the distribution of income among households in a population.

The theoretical even distribution of income among households is symbolized by the straight line through the center of the figure. The inequality in incomes among households is shown by the red line below the line of perfect equality. If the red line remained near the bottom of the figure until the 80th percentile, for example, it would represent a population with a few very rich people and a lot of very poor people.

Corrado Gini incorporated the Lorenz concept in 1912 to quantify the change in relative frequency of income values along the range of a population of countries. For example, if you divide the % Households into deciles (every 10%), you can count the number of households in each decile and express the quantity as a relative frequency. This binning approach allows you to use a frequency-based calculation method instead of an integration method to find the area under the Lorenz curve at each point along the % Households axis (the x-axis). With this approach, you can calculate the mean difference (MD) of the binned values, divide it by the mean frequency value to obtain the relative mean difference (RMD), and divide the RMD by 2 to obtain the Gini coefficient. Expressed for a population of size n with a sequence of values $y_i$, $i = 1, \ldots, n$:

$\mathrm{RMD} = \frac{\mathrm{MD}}{\text{arithmetic mean}},$

where

$\mathrm{MD} = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \left|y_i - y_j\right|.$

You can use this method as a guide in selecting a short-list of variables to submit to the modeling algorithm. For example, you might select all variables with a Gini score greater than 0.6 for entry into the model. The disadvantage of using this method is that it combines effects of data in a given range of one variable that may not reflect the combined effects of all variables interacting with it. But that is the problem with most feature ranking methods.
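A minimal MATLAB sketch of the MD/RMD route to the Gini coefficient; the five-value data set is hypothetical, chosen so the answer can be checked by hand.

% Gini coefficient via the mean difference: G = RMD / 2 = MD / (2 * mean).
y   = [2 2 2 3 3];                        % hypothetical values (e.g., incomes)
n   = numel(y);
[Yi, Yj] = meshgrid(y, y);                % all ordered pairs (y_i, y_j)
MD  = mean(abs(Yi(:) - Yj(:)));           % mean absolute difference, 0.48 here
RMD = MD / mean(y);                       % relative mean difference, 0.20 here
G   = RMD / 2;                            % Gini coefficient, 0.10 here
fprintf('MD = %.2f, RMD = %.2f, Gini = %.2f\n', MD, RMD, G);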

A slightly more integrative approach is to use bi-variate methods like the scatterplots and web diagrams described in Chapter 4.

Bi-variate Methods

Other bi-variate methods like mutual information calculate the distance between the actual joint distribution of features X and Y and what the joint distribution would be if X and Y were independent. The joint distribution is the probability distribution of cases where both events X and Y occur together. Formally, the mutual information of two discrete random variables X and Y can be defined as

$I(X;Y) = \sum_{y \in Y}\sum_{x \in X} p(x,y)\,\log\!\left(\frac{p(x,y)}{p_1(x)\,p_2(y)}\right),$

where p(x,y) is the joint probability distribution function, and p 1(x) and p 2(y) are the independent probability (or marginal probability) density functions of X and Y, respectively. If you are a statistician, this likely all makes sense to you, and you can derive this metric easily. Otherwise, we suggest that you look for some approach that makes more sense to you intuitively. If this is the case, you might be more comfortable with one of the multivariate methods implemented in many statistical packages. Two of those methods are stepwise regression and partial least squares regression.

Multivariate Methods

Stepwise Linear Regression

A slightly more sophisticated method is the one used in stepwise regression. This classical statistics method calculates the F-value for the incremental inclusion of each variable in the regression. The F-value is equivalent to the square of the Student's t-value, expressing how different two samples are from each other, where one sample includes the variable and the other sample does not. The t-value is calculated by

t = difference in the sample means / standard deviation of differences

and so

$F = (t\text{-value})^2$

The F-value is sensitive to the number of variables used to calculate the numerator of this ratio and to the number used for the denominator. Stepwise regression calculates the F-value both with and without using a particular variable and compares it with a critical F-value either to include the variable (forward stepwise selection) or to eliminate the variable from the regression (backward stepwise selection). In this way, the algorithm can select the set of variables that meets the F-value criterion. It is assumed that these variables account for a sufficient amount of the total variance in the target variable to predict it at a given level of confidence specified for the F-value (usually 95%).

If your variables are numeric (or can be converted to numbers), you can use stepwise regression to select the variables you use for other data mining algorithms. But there is a fly in this ointment. Stepwise regression is a parametric procedure and is based on the same assumptions characterizing other classical statistical methods. Even so, stepwise regression can be used to give you one perspective on the short-list of variables. You should use other methods and compare lists. Don't necessarily trust the list of variables included in the regression solution because their inclusion assumes linear relationships of variables with the target variable, which in reality may be quite nonlinear in nature.

Partial Least Squares Regression

A slightly more complex variant of multiple stepwise regression keeps track of the partial sums of squares in the regression calculation. These partial values can be related to the contribution of each variable to the regression model. STATISTICA provides an output report from partial least squares regression, which can give another perspective on which to base feature selection. Table 5.1 shows an example of this output report for an analysis of manufacturing failures.

TABLE 5.1. Marginal Contributions of Six Predictor Variables to the Target Variable (Total Defects)

Summary of PLS (fail_tsf.STA) Responses: TOT_DEFS Options: NO-INTERCEPT AUTOSCALE
Increase - R2 of Y
Variable 1 0.799304
Variable 2 0.094925
Variable 3 0.014726
Variable 4 0.000161
Variable 5 0.000011
Variable 6 0.000000

It is obvious that variables 1 and 2 (and marginally, variable 3) provide significant contributions to the predictive power of the model (total R2 = 0.934). On the basis of this analysis, we might consider eliminating variables 4 through 6 from our variable short-list.

Sensitivity Analysis

Some machine learning algorithms (like neural nets) provide an output report that evaluates the final weights assigned to each variable to calculate how sensitive the solution is to the inclusion of that variable. These sensitivity values are analogous to the F-values calculated for inclusion of each variable in stepwise regression. Both SPSS Clementine and STATISTICA Data Miner provide sensitivity reports for their automated neural nets. These sensitivity values can be used as another reflection of the best set of variables to include in a model. One strategy that can be followed is to train a neural net with default characteristics and include in your short-list all variables with greater than a threshold level of sensitivity. Granted, this approach is less precise than the linear stepwise regression, but the neural net set of variables may be much more generalizable, by virtue of their ability to capture nonlinear relationships effectively.

Complex Methods

A piecewise linear network uses a distance measure to assign incoming cases to an appropriate cluster. The clusters can be defined by any appropriate clustering method. A separate function called a basis function is defined for each cluster of cases. A pruning algorithm can be applied to eliminate the least important clusters, one at a time, leading to a more compact network. This approach can be viewed as a nonlinear form of stepwise linear regression.

Multivariate Adaptive Regression Splines (MARS)

The MARS algorithm was popularized by Friedman (1991) to solve regression and classification problems with multiple outcomes (target variables). This approach can be viewed as a form of piecewise linear regression, which adapts a solution to local data regions of similar linear response. Each of the local regions is expressed by a different basis function. MARS algorithms can also be viewed as a form of regression tree, in which the hard splits into separate branches of the tree are replaced by the smooth basis functions. In STATISTICA Data Miner (for example), the MARSplines algorithm includes a pruning routine, which provides a very powerful tool for feature selection. The MARSplines algorithm will pick up only those basis functions (and those predictor variables) that provide a sizeable contribution to the prediction. The output of the MARSplines module will retain only those variables associated with basis functions that were retained for the final solution of the model and rank them according to the number of times they are used in different parts of the model.

You can run your data through a procedure like the STATISTICA MARSplines module to gain some insights for building your variable short-list. Refer to Hastie et al. (2001) for additional details.

Source: https://www.sciencedirect.com/science/article/pii/B978012374765500005X

Conditional Probability and Conditional Expectation

Mark A. Pinsky , Samuel Karlin , in An Introduction to Stochastic Modeling (Fourth Edition), 2011

Examples

(a)

Queueing Let N be the number of customers arriving at a service facility in a specified period of time, and let ξ i be the service time required by the i th customer. Then, X = ξ1 + · ·· + ξ N is the total demand for service time.

(b)

Risk Theory Suppose that a total of N claims arrives at an insurance company in a given week. Let ξ i be the amount of the i th claim. Then, the total liability of the insurance company is X = ξ1 + · ·· + ξ N .

(c)

Population Models Let N be the number of plants of a given species in a specified area, and let ξ i be the number of seeds produced by the ith plant. Then, X = ξ1 + · ·· + ξ N gives the total number of seeds produced in the area.

(d)

Biometrics A wildlife sampling scheme traps a random number N of a given species. Let ξ i be the weight of the ith specimen. Then, X = ξ1 + · ·· + ξ N is the total weight captured.

The necessary background in conditional probability was covered in Section 2.1 for when ξ1, ξ2, … are discrete random variables. In order to study the random sum X = ξ1 + ··· + ξN when ξ1, ξ2, … are continuous random variables, we need to extend our knowledge of conditional distributions.

2.3.1 Conditional Distributions: The Mixed Case

Let X and N be jointly distributed random variables and suppose that the possible values for N are the discrete set n = 0, 1, 2, …. Then, the elementary definition of conditional probability (2.1) applies to define the conditional distribution function $F_{X|N}(x|n)$ of the random variable X, given that N = n, to be

(2.23) $F_{X|N}(x|n) = \frac{\Pr\{X \le x \text{ and } N = n\}}{\Pr\{N = n\}} \quad \text{if } \Pr\{N = n\} > 0,$

and the conditional distribution function is not defined at values of n for which Pr{N = n} = 0. It is elementary to verify that $F_{X|N}(x|n)$ is a probability distribution function in x at each value of n for which it is defined.

The case in which X is a discrete random variable was covered in Section 2.1. Now let us suppose that X is continuous and that $F_{X|N}(x|n)$ is differentiable in x at each value of n for which Pr{N = n} > 0. We define the conditional probability density function $f_{X|N}(x|n)$ for the random variable X given that N = n by setting

(2.24) $f_{X|N}(x|n) = \frac{d}{dx} F_{X|N}(x|n) \quad \text{if } \Pr\{N = n\} > 0.$

Again, $f_{X|N}(x|n)$ is a probability density function in x at each value of n for which it is defined. Moreover, the conditional density as defined in (2.24) has the appropriate properties, e.g.,

(2.25) $\Pr\{a \le X < b,\, N = n\} = \int_a^b f_{X|N}(x|n)\,p_N(n)\,dx$

for a < b and where $p_N(n) = \Pr\{N = n\}$. The law of total probability leads to the marginal probability density function for X via

(2.26) $f_X(x) = \sum_{n=0}^{\infty} f_{X|N}(x|n)\,p_N(n).$

Suppose that g is a function for which E[|g(X)|] < ∞. The conditional expectation of g(X) given that N = n is defined by

(2.27) $E[g(X)\,|\,N = n] = \int_{-\infty}^{\infty} g(x)\,f_{X|N}(x|n)\,dx.$

Stipulated thus, $E[g(X)\,|\,N = n]$ satisfies the properties listed in (2.7) to (2.15) for the joint discrete case. For example, the law of total probability is

(2.28) $E[g(X)] = \sum_{n=0}^{\infty} E[g(X)\,|\,N = n]\,p_N(n) = E\big\{E[g(X)\,|\,N]\big\}.$

2.3.2 The Moments of a Random Sum

Let us assume that ξ k and N have the finite moments

(2.29) $E[\xi_k] = \mu, \quad \mathrm{Var}[\xi_k] = \sigma^2, \quad E[N] = \nu, \quad \mathrm{Var}[N] = \tau^2,$

and determine the mean and variance for X = ξ1 + ··· + ξN as defined in (2.22). The derivation provides practice in manipulating conditional expectations, and the results,

(2.30) $E[X] = \mu\nu, \qquad \mathrm{Var}[X] = \nu\sigma^2 + \mu^2\tau^2,$

are useful and important. The properties of conditional expectation listed in (2.7) to (2.15) justify the steps in the determination.

If we begin with the mean E[X], then

$$\begin{aligned}
E[X] &= \sum_{n=0}^{\infty} E[X\,|\,N = n]\,p_N(n) &&\text{(by 2.15)}\\
&= \sum_{n=1}^{\infty} E[\xi_1 + \cdots + \xi_N\,|\,N = n]\,p_N(n) &&\text{(definition of } X)\\
&= \sum_{n=1}^{\infty} E[\xi_1 + \cdots + \xi_n\,|\,N = n]\,p_N(n) &&\text{(by 2.9)}\\
&= \sum_{n=1}^{\infty} E[\xi_1 + \cdots + \xi_n]\,p_N(n) &&\text{(by 2.10)}\\
&= \mu\sum_{n=1}^{\infty} n\,p_N(n) = \mu\nu.
\end{aligned}$$

To determine the variance, we begin with the elementary step

(2.31) $\mathrm{Var}[X] = E\big[(X - \mu\nu)^2\big] = E\big[(X - N\mu + N\mu - \nu\mu)^2\big] = E\big[(X - N\mu)^2\big] + E\big[\mu^2(N - \nu)^2\big] + 2\,E\big[\mu(X - N\mu)(N - \nu)\big].$

Then,

$E\big[(X - N\mu)^2\big] = \sum_{n=0}^{\infty} E\big[(X - N\mu)^2\,\big|\,N = n\big]\,p_N(n) = \sum_{n=1}^{\infty} E\big[(\xi_1 + \cdots + \xi_n - n\mu)^2\,\big|\,N = n\big]\,p_N(n) = \sigma^2\sum_{n=1}^{\infty} n\,p_N(n) = \nu\sigma^2,$

and

$E\big[\mu^2(N - \nu)^2\big] = \mu^2 E\big[(N - \nu)^2\big] = \mu^2\tau^2,$

while

$E\big[\mu(X - N\mu)(N - \nu)\big] = \mu\sum_{n=0}^{\infty} E\big[(X - n\mu)(n - \nu)\,\big|\,N = n\big]\,p_N(n) = \mu\sum_{n=0}^{\infty} (n - \nu)\,E\big[(X - n\mu)\,\big|\,N = n\big]\,p_N(n) = 0$

(because $E[(X - n\mu)\,|\,N = n] = E[\xi_1 + \cdots + \xi_n - n\mu] = 0$). Then, (2.31) together with the subsequent three calculations validates the variance of X as stated in (2.30).

Example The number of offspring of a given species is a random variable having probability mass function p(k) for k = 0, 1, …. A population begins with a single parent who produces a random number N of progeny, each of which independently produces offspring according to p(k) to form a second generation. Then, the total number of descendants in the second generation may be written X = ξ1 + ··· + ξN, where ξk is the number of progeny of the kth offspring of the original parent. Let E[N] = E[ξk] = μ and Var[N] = Var[ξk] = σ². Then,

$E[X] = \mu^2 \qquad \text{and} \qquad \mathrm{Var}[X] = \mu\sigma^2(1 + \mu).$

2.3.3 The Distribution of a Random Sum

Suppose that the summands ξ1, ξ2, … are continuous random variables having a probability density function f (z). For n ≥ 1, the probability density function for the fixed sum ξ1 + · ·· + ξ n is the n-fold convolution of the density f (z), denoted by f (n)(z) and recursively defined by

$f^{(1)}(z) = f(z)$

and

(2.32) $f^{(n)}(z) = \int f^{(n-1)}(z - u)\,f(u)\,du \quad \text{for } n > 1.$

(See Chapter 1, Section 1.2.5 for a discussion of convolutions.) Because N and ξ1, ξ2, … are independent, then $f^{(n)}(z)$ is also the conditional density function for X = ξ1 + ··· + ξN given that N = n ≥ 1. Let us suppose that Pr{N = 0} = 0. Then, by the law of total probability as expressed in (2.26), X is continuous and has the marginal density function

(2.33) $f_X(x) = \sum_{n=1}^{\infty} f^{(n)}(x)\,p_N(n).$

Remark When N = 0 can occur with positive probability, then X = ξ1 + · ·· + ξ N is a random variable having both continuous and discrete components to its distribution. Assuming that ξ1, ξ2, … are continuous with probability density function f (z), then

$\Pr\{X = 0\} = \Pr\{N = 0\} = p_N(0),$

while for 0 < a < b or a < b < 0,

(2.34) $\Pr\{a < X < b\} = \int_a^b \left\{\sum_{n=1}^{\infty} f^{(n)}(z)\,p_N(n)\right\} dz.$

Example A Geometric Sum of Exponential Random Variables In the following computational example, suppose

$f(z) = \begin{cases} \lambda e^{-\lambda z} & \text{for } z \ge 0, \\ 0 & \text{for } z < 0, \end{cases}$

and

$p_N(n) = \beta(1 - \beta)^{n-1} \quad \text{for } n = 1, 2, \ldots.$

For n ≥ 1, the n-fold convolution of f(z) is the gamma density

$f^{(n)}(z) = \begin{cases} \dfrac{\lambda^n}{(n-1)!}\,z^{n-1} e^{-\lambda z} & \text{for } z \ge 0, \\ 0 & \text{for } z < 0. \end{cases}$

(See Chapter 1, Section 1.4.4 for discussion.)

The density for X = ξ1 + ··· + ξN is given, according to (2.26), by

$f_X(z) = \sum_{n=1}^{\infty} f^{(n)}(z)\,p_N(n) = \sum_{n=1}^{\infty} \frac{\lambda^n}{(n-1)!}\,z^{n-1} e^{-\lambda z}\,\beta(1-\beta)^{n-1} = \lambda\beta e^{-\lambda z}\sum_{n=1}^{\infty} \frac{[\lambda(1-\beta)z]^{n-1}}{(n-1)!} = \lambda\beta e^{-\lambda z}\,e^{\lambda(1-\beta)z} = \lambda\beta e^{-\lambda\beta z}, \quad z \ge 0.$

Surprise! X has an exponential distribution with parameter λβ.
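A quick MATLAB simulation makes the result tangible (the values of λ and β below are arbitrary choices):

% Simulate a geometric number N of exponential summands and compare the sample
% mean of X = xi_1 + ... + xi_N with the exponential(lambda*beta) prediction.
lambda = 2; beta = 0.25; M = 1e5;
X = zeros(M,1);
for k = 1:M
    n = ceil(log(rand)/log(1-beta));        % geometric on {1,2,...} with success prob beta
    X(k) = sum(-log(rand(n,1))/lambda);     % sum of n exponential(lambda) variables
end
fprintf('sample mean of X = %.3f, predicted 1/(lambda*beta) = %.3f\n', ...
        mean(X), 1/(lambda*beta));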

Example Stock Price Changes Stochastic models for price fluctuations of publicly traded assets were developed as early as 1900.

Let Z denote the difference in price of a single share of a certain stock between the close of one trading day and the close of the next day. For an actively traded stock, a large number of transactions take place in a single day, and the total daily price change is the sum of the changes over these individual transactions. If we assume that price changes over successive transactions are independent random variables having a common finite variance, * then the central limit theorem applies. The price change over a large number of transactions should follow a normal, or Gaussian, distribution.

A variety of empirical studies have supported this conclusion. For the most part, these studies involved price changes over a fixed number of transactions. Other studies found discrepancies in that both very small and very large price changes occurred more frequently in the data than suggested by normal theory. At the same time, intermediate-size price changes were underrepresented in the data. For the most part, these studies examined price changes over fixed durations containing a random number of transactions.

A natural question arises: Does the random number of transactions in a given day provide a possible explanation for the departures from normality that are observed in data of daily price changes? Let us model the daily price change in the form

(2.35) $Z = \xi_0 + \xi_1 + \cdots + \xi_N = \xi_0 + X,$

where ξ0, ξ1, … are independent normally distributed random variables with common mean zero and variance σ², and N has a Poisson distribution with mean λ.

We interpret N as the number of transactions during the day, ξi for i ≥ 1 as the price change during the ith transaction, and ξ0 as an initial price change arising between the close of the market on one day and the opening of the market on the next day. (An obvious generalization would allow the distribution of ξ0 to differ from that of ξ1, ξ2, ….)

Conditioned on N = n, the random variable Z = ξ0 + ξ1 + ··· + ξN is normally distributed with mean zero and variance (n + 1)σ². The conditional density function is

$\varphi_n(z) = \frac{1}{\sqrt{2\pi(n+1)}\,\sigma}\exp\left\{-\frac{1}{2}\,\frac{z^2}{(n+1)\sigma^2}\right\}.$

Since the probability mass function for N is

$p_N(n) = \frac{\lambda^n e^{-\lambda}}{n!}, \quad n = 0, 1, \ldots,$

using (2.33) we determine the probability density function for the daily price change to be

$f_Z(z) = \sum_{n=0}^{\infty} \varphi_n(z)\,\frac{\lambda^n e^{-\lambda}}{n!}.$

The formula for the density fZ(z) does not simplify. Nevertheless, numerical calculations are possible. When λ = 1 and σ² = ½, then (2.30) shows that the variance of the daily price change Z in the model (2.35) is Var[Z] = (1 + λ)σ² = 1. Thus, comparing the density fZ(z) when λ = 1 and σ² = ½ to a normal density with mean zero and unit variance sheds some light on the question at hand.

The calculations were carried out and are shown in Figure 2.2.

Figure 2.2. A standard normal density (solid line) as compared with a density for a random sum (dashed line). Both densities have zero mean and unit variance.

The departure from normality that is exhibited by the random sum in Figure 2.2 is consistent with the departure from normality shown by stock price changes over fixed time intervals. Of course, our calculations do not prove that the observed departure from normality is caused by the random number of transactions in a fixed time interval. Rather, the calculations show only that such an explanation is consistent with the data and is, therefore, a possible cause.
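The comparison behind Figure 2.2 can be reproduced with a few lines of MATLAB (truncating the Poisson sum at a finite number of terms is our own choice):

% Density of the random sum Z (lambda = 1, sigma^2 = 1/2) versus the standard normal.
lambda = 1; sig2 = 0.5;
z = linspace(-4, 4, 401);
fZ = zeros(size(z));
for n = 0:50                                   % truncate the Poisson sum; terms decay fast
    v  = (n+1)*sig2;                           % conditional variance given N = n
    fZ = fZ + exp(-z.^2/(2*v))./sqrt(2*pi*v) * (lambda^n*exp(-lambda)/factorial(n));
end
fstd = exp(-z.^2/2)/sqrt(2*pi);                % standard normal density
plot(z, fstd, '-', z, fZ, '--');
legend('standard normal', 'random-sum density f_Z');
xlabel('z'); ylabel('density');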

Source: https://www.sciencedirect.com/science/article/pii/B9780123814166000022