RANDOM NUMBER GENERATORS

What are random numbers?

Imagine yourself picking a card from a well-shuffled full deck of cards. What could the card you picked be? The jack of spades, the king of diamonds, or the queen of hearts? It could just as well be the ace of clubs, or any of the other cards in the deck.

In terms of probability theory, picking a card from a well-shuffled full deck of cards is a ‘random experiment’. It is an experiment because the act of picking a card produces an outcome, a result of the effort. And it is random because, although you know all the possible cards that may be picked, you cannot predict in advance which one will turn up.

Also, in this particular random experiment of picking a card from a well-shuffled full deck of cards, no card is preferred over any other. In simple words, the probability of picking any card is the same; in the statistical sense, the events of picking any two particular cards are equally likely. This is what we call a random phenomenon. So, picking a card from a well-shuffled full deck of cards can be interpreted as picking a card at random from the deck.

Now imagine yourself selecting a number at random from the set {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}. This means that selecting 0, selecting 1, selecting 2, ..., and selecting 9 are all equally likely events. We are then selecting a random number from the set.

How to generate random numbers?

A random number from a given set of numbers can be generated using the following methods:

  • Lottery Method: Suppose there are n numbers in the set. Take n similar balls, give each ball a unique number from the set, and put them in an urn. Shuffle the balls, then pick balls one by one with replacement, noting down the number on each ball picked. The numbers noted will be the random numbers. (A small computer analogue of this method is sketched after this list.)
  • Roulette Wheel: One can also take a roulette wheel, divide it into n equal sectors, write the numbers (uniquely from the set) on the sectors, spin the wheel, and note down the number where it stops. Here too the numbers noted will be random numbers.
  • Random Number Table: The above two methods are physical, and it always takes a considerable amount of time to draw random numbers that way. Instead, one can use a random number table, a table in which random numbers are stacked up. However, since the random numbers drawn by one person can quite easily be duplicated by another, there has always been some doubt about the randomness of this method, and the numbers drawn this way are often called pseudo-random numbers.
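As a quick illustration, the lottery method is easy to mimic on a computer (jumping ahead to the algorithmic setting of the next section). This is only a sketch using Python's standard library; the set {0, ..., 9} and the five draws are arbitrary illustrative choices.

import random

balls = list(range(10))                           # one "ball" per number in the set
draws = [random.choice(balls) for _ in range(5)]  # pick 5 balls with replacement
print(draws)                                      # e.g. [3, 7, 3, 0, 9]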

Drawing Random Numbers using algorithms (for computers)

Since a computer is a deterministic device, it might seem impossible to use it to generate random numbers. The numbers it generates are algorithmically computed and hence entirely deterministic. However, they appear to be random, and they must pass stringent statistical tests designed to ensure that they behave as truly random numbers (such as those produced by the first two methods above) would.

Requirements of a Random Number Generator:

  1. It should be fast.
  2. It should be repeatable.
  3. It should be amenable to analysis.
  4. It should have a long period.
  5. It should be apparently random.

Some Random Number Generator (RNG) Methods
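One classical method of this kind, not detailed in these notes, is the linear congruential generator, which produces each number from the previous one by a fixed modular formula. Below is a minimal Python sketch; the constants m, a, and c are the ones popularized by Numerical Recipes and are an illustrative choice rather than the only valid one.

def lcg(seed, m=2**32, a=1664525, c=1013904223):
    """Linear congruential generator: x_{k+1} = (a*x_k + c) mod m."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x / m                 # scale to a pseudo-random number in [0, 1)

gen = lcg(seed=42)
print([round(next(gen), 4) for _ in range(5)])

Note how this squares with the requirements above: it is fast, repeatable (the same seed reproduces the same stream), amenable to analysis, and has a period of up to m, yet its output is entirely deterministic.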

SIMULATION

Simulation is a numerical technique for conducting experiments on a digital computer. It involves certain types of mathematical and logical models that describe the behavior of a business or economic system, or some component thereof, over an extended period of time.

Simulation deals with both abstract and physical models, and some simulations with physical or abstract models may involve real people. Two types of simulation involving real people are:

  1. Operational gaming
  2. Man-machine simulation

Merits of Simulation

  1. In cases where obtaining data is either impossible or very expensive, simulation can be used.
  2. The observed system may be so complex that it cannot be described by a mathematical model.
  3. Even when a mathematical model can be formed to describe the system, it may not always be possible to find a straightforward analytical solution.
  4. It may be either impossible or very costly to perform the experiments needed to validate a mathematical model of the system.

Demerits of Simulation

  1. Simulation is an invaluable and very versatile tool in problems where analytical techniques are inadequate. Impressive as the technique is, however, it provides only statistical estimates rather than exact results, and it can only compare alternatives rather than generate the optimal one.
  2. Simulation can also be a slow and costly way to study a problem. It usually requires a large amount of time and considerable expense for analysis and programming.
  3. Finally, simulation yields only numerical data about the performance of the system, and sensitivity analysis of the model parameters is very expensive.

Monte Carlo Methods

A Monte Carlo method is a computational/numerical method that uses random numbers to compute or estimate a quantity of interest. The quantity of interest may be the mean of a random variable, a function of several means, the distribution of a random variable, or a high-dimensional integral.

Basically, Monte Carlo methods may be grouped into two types:

Direct/simple/classical Monte Carlo methods involve generating sequences of independent and identically distributed random samples. The other type, Markov Chain Monte Carlo methods, involves generating a sequence of random samples that are not independent.

Monte Carlo Integration

Let f(x) be a function of x and suppose we are interested in computation of the integral:

I= \int_0^1 f(x) dx

We can write the integral as,

I=\int_0^1 f(x) p(x) dx =E(f(X))\\
\textit{; where p(x) is the pdf of a r.v. X $\sim$ Unif(0,1)} 

Now suppose that x_1,x_2,\dots,x_n are independent random samples drawn from Unif(0,1); then by the law of large numbers we have,

\frac{1}{n} \sum_{i=1}^n f(x_i) \rightarrow E(f(X)) 

Thus an estimator of I may be:

\hat{I}=\frac{1}{n} \sum_{i=1}^n f(x_i)
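As a concrete check, here is a minimal Python sketch of this estimator. The integrand f(x) = x^2 is an arbitrary illustrative choice (its exact integral over [0,1] is 1/3), and n = 100000 is just a convenient sample size.

import random

def mc_integrate_01(f, n=100_000):
    """Estimate the integral of f over [0,1] by (1/n) * sum of f(x_i), x_i ~ Unif(0,1)."""
    return sum(f(random.random()) for _ in range(n)) / n

print(mc_integrate_01(lambda x: x ** 2))   # should be close to 1/3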

On a more general note, if a < b < \infty, then

I= \int_a^b f(x) dx \\
\textit{Taking $y=\frac{x-a}{b-a}$,}\\
I= \int_0^1 (b-a) f\left( a+(b-a)y \right) dy \\
= \int_0^1 h(y) dy \quad \text{; where } h(y)= (b-a) f\left( a+(b-a)y \right) \\
= E\left[ h(Y) \right], \quad Y \sim Unif(0,1)

And when b=\infty,

I= \int_a^\infty f(x) dx \\
\text{Taking $y=\frac{1}{x-a+1}$,} \\
I= - \int_1^0 f\left( a + \frac{1}{y} - 1 \right) \frac{dy}{y^2} \\
= \int_0^1 h(y) dy \\
\text{; where } h(y) = f\left( a + \frac{1}{y} - 1 \right) / y^2 \\
= E\left[ h(Y) \right], \quad Y \sim Unif(0,1)
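The same substitution is easy to code. The sketch below estimates the integral of e^{-x} over [0, \infty) (exact value 1) using the transformation above; the integrand, a = 0, and the sample size are illustrative choices.

import math
import random

def mc_integrate_a_inf(f, a, n=100_000):
    """Estimate the integral of f over [a, infinity) via y = 1/(x - a + 1)."""
    total = 0.0
    for _ in range(n):
        y = random.random()
        if y == 0.0:                       # random() can return 0.0; skip that draw
            continue
        total += f(a + 1.0 / y - 1.0) / y ** 2
    return total / n

print(mc_integrate_a_inf(lambda x: math.exp(-x), a=0.0))   # close to 1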

Survival Analysis: Competing Risk Theory

INTRODUCTION

A distinctive feature of survival data is the concept of censoring. An implicit assumption in the definition of censoring is that, had the study been prolonged (or had subjects not dropped out), the outcome of interest would eventually have been observed for every subject. Conventional statistical methods for the analysis of survival data make the important assumption of independent or non-informative censoring. This means that, at a given point in time, subjects who remain under follow-up have the same future risk for the occurrence of the event as those subjects who are no longer being followed (either because of censoring or study dropout), as if losses to follow-up were random and thus non-informative.

A competing risk is an event whose occurrence precludes the occurrence of the primary event of interest. For instance, in a study in which the primary outcome is time to death due to a cardiovascular cause, a death due to a non-cardiovascular cause serves as a competing risk.

Conventional statistical methods for the analysis of survival data assume that competing risks are absent. Two competing risks are said to be independent if information about a subject’s risk of experiencing one type of event provides no information about the subject’s risk of experiencing the other type of event. The methods described later cover both competing risks that are independent of one another and competing risks that are not.

In biomedical applications, the biology often suggests at least some dependence between competing risks, which in many cases may be quite strong. Accordingly, independent competing risks may be relatively rare in biomedical applications.

When analyzing survival data in which competing risks are present, analysts frequently censor subjects when a competing event occurs. Thus, when the outcome is time to death attributable to cardiovascular causes, an analyst may consider a subject as censored once that subject dies of noncardiovascular causes. However, censoring subjects at the time of death attributable to noncardiovascular causes may be problematic.

First, it may violate the assumption of noninformative censoring: it may be unreasonable to assume that subjects who died of noncardiovascular causes (and were thus treated as censored) can be represented by those subjects who remained alive and had not yet died of any cause.

Second, even when the competing events are independent, censoring subjects at the time of the occurrence of a competing event may lead to incorrect conclusions, because the event probability being estimated is interpreted as occurring in a setting where the censoring (e.g., the competing events) does not occur.

In the cardiovascular example described above, this corresponds to a setting where death from noncardiovascular causes is not a possibility. Although such probabilities may be of theoretical interest, they are of questionable relevance in many practical applications, and generally lead to overestimation of the cumulative incidence of an event in the presence of the competing events.
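A small simulation makes this overestimation concrete. In the sketch below, cardiovascular and non-cardiovascular death times are drawn from independent exponential distributions (an illustrative assumption, as are the hazard rates and the time horizon). The naive approach censors subjects at the competing death; with exponential times and no other censoring, 1 minus the Kaplan-Meier estimate converges to the marginal probability P(T_cv <= t0), which exceeds the true cumulative incidence.

import numpy as np

rng = np.random.default_rng(0)
n, lam_cv, lam_other = 100_000, 0.05, 0.10    # illustrative hazard rates

t_cv = rng.exponential(1 / lam_cv, n)         # time to cardiovascular death
t_other = rng.exponential(1 / lam_other, n)   # time to non-cardiovascular death
time = np.minimum(t_cv, t_other)              # only the first event is observed
cv_first = t_cv < t_other

t0 = 10.0
# True cumulative incidence: P(CV death occurs first and by time t0)
cif = np.mean((time <= t0) & cv_first)
# What the naive "censor the competing event" analysis estimates:
naive = 1 - np.exp(-lam_cv * t0)
print(f"true CIF: {cif:.3f}   naive 1-KM: {naive:.3f}")   # naive is larger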

Ratio Estimation

INTRODUCTION:

In survey sampling, we often use information on some auxiliary variable to improve our estimator of a finite population parameter, by giving the estimator protection against the selection of a bad sample. One such estimator is the ratio estimator, introduced as follows:

In practice, we are often interested in estimating a ratio of the type:

R= \frac{\bar{Y}}{\bar{X}} = \frac{\sum Y_i}{\sum X_i} 

For example, in various socio-economic surveys, we may be interested in per-capita expenditure on food items, the infant mortality rate, the literacy rate, etc. So the estimation of R is itself of interest, and besides that, we can obtain an improved estimate of \bar{Y}, as follows:

NOTATIONS:

Population size: N
Population: U=(U_1,U_2,\dots,U_N)
Study variable: Y=(Y_1,Y_2,\dots,Y_N)
Auxiliary variable: X=(X_1,X_2,\dots,X_N)
Population mean of Y (study variable): \bar{Y} = \frac{1}{N} \sum Y_\alpha
Population mean of X (auxiliary variable): \bar{X} = \frac{1}{N} \sum X_\alpha
Sample size: n
Sample (drawn by SRSWOR from U): s=(i_1,i_2,\dots,i_n)
Sample mean of Y (study variable): \bar{y} = \frac{1}{n} \sum y_i
Sample mean of X (auxiliary variable): \bar{x} = \frac{1}{n} \sum x_i

Since \bar{Y} = R \bar{X}, where R is unknown but \bar{X} is known, we can take

\hat{\bar{Y}}_R = \hat{R} \bar{X} = \frac{\bar{y}}{\bar{x}} \bar{X}

as an estimate of  \bar{Y}  and this estimator is called the ratio estimator of  \bar{Y}  .
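A small numerical sketch may help. Below, a synthetic population in which Y is roughly proportional to X is generated (an illustrative assumption, as are all the constants), an SRSWOR sample is drawn, and the ratio estimate is compared with the plain sample mean.

import numpy as np

rng = np.random.default_rng(1)
N, n = 10_000, 100

X = rng.uniform(10, 50, N)                 # auxiliary variable, known for all units
Y = 2.0 * X + rng.normal(0, 5, N)          # study variable, roughly proportional to X
Ybar, Xbar = Y.mean(), X.mean()

s = rng.choice(N, size=n, replace=False)   # SRSWOR: indices without replacement
ybar, xbar = Y[s].mean(), X[s].mean()

ratio_est = (ybar / xbar) * Xbar           # the ratio estimator of Ybar
print(f"true Ybar: {Ybar:.2f}  sample mean: {ybar:.2f}  ratio estimate: {ratio_est:.2f}")

Because \bar{x} reveals whether the particular sample over- or under-represents the large units, dividing by \bar{x} and multiplying by the known \bar{X} corrects \bar{y} in exactly that direction.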

RESULT 1:

\hat{\bar{Y}}_R is not an unbiased estimator of \bar{Y}, and its approximate bias is given by:

B(\hat{\bar{Y}}_R) = \frac{1}{\bar{X}} \frac{1-f}{n} \left( RS_X^2 - S_{XY} \right)

;where \quad f=\frac{n}{N}, \quad S_X^2= \frac{1}{N-1} \sum_{i=1}^{N} \left( X_i - \bar{X} \right)^2 , \\
\quad \quad  S_{XY}= \frac{1}{N-1} \sum_{i=1}^{N} \left( X_i - \bar{X} \right) \left(Y_i - \bar{Y} \right)

RESULT 2:

\frac{\left| B( \hat{\bar{Y}}_R) \right|}{ \sigma_{\hat{\bar{Y}}_R}} \leq \left| C.V.(\bar{x}) \right|

RESULT 3:

The mean square error of \hat{\bar{Y}}_R is given by:

MSE(\hat{\bar{Y}}_R) = \frac{1-f}{n} \left[ S_Y^2 + R^2 S_X^2 - 2 R S_{XY}   \right]
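A sketch of the reasoning behind this formula, using a first-order (large-n) approximation with \bar{x} \approx \bar{X}:

\hat{\bar{Y}}_R - \bar{Y} = \frac{\bar{X}}{\bar{x}} \left( \bar{y} - R\bar{x} \right) \approx \bar{y} - R\bar{x} \\
\Rightarrow MSE(\hat{\bar{Y}}_R) \approx V(\bar{y} - R\bar{x}) = V(\bar{y}) + R^2 V(\bar{x}) - 2R \, Cov(\bar{y}, \bar{x})

which gives the stated expression on substituting the standard SRSWOR results V(\bar{y}) = \frac{1-f}{n} S_Y^2, V(\bar{x}) = \frac{1-f}{n} S_X^2, and Cov(\bar{y}, \bar{x}) = \frac{1-f}{n} S_{XY}.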

RESULT 4:

In SRSWOR, for large n, an approximation to the variance of \hat{\bar{Y}}_R is given by:

V(\hat{\bar{Y}}_R) = \frac{1-f}{n}  \frac{1}{N-1} \sum_{i=1}^{N} U_i^2 \\
\textit{; where} \quad U_i= Y_i-RX_i, \forall i=1(1)N
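Result 4 agrees with Result 3: since \bar{U} = \bar{Y} - R\bar{X} = 0, the U_i have population mean zero, and

\frac{1}{N-1} \sum_{i=1}^{N} U_i^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left[ (Y_i - \bar{Y}) - R(X_i - \bar{X}) \right]^2 = S_Y^2 + R^2 S_X^2 - 2RS_{XY}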