Information is the source of a communication system, whether it is analog or digital. Information theory is a mathematical approach to the study of coding of information along with the quantification, storage, and communication of information.
If we consider an event, there are three conditions of occurrence.
If the event has not occurred, there is a condition of uncertainty.
If the event has just occurred, there is a condition of surprise.
If the event occurred some time ago, there is a condition of having some information.
These three conditions occur at different times. The differences between them help us understand the probabilities of events occurring.
When we consider how likely an event is, and therefore how surprising or uncertain its occurrence would be, we are in effect estimating the average information content of the source that produces the event.
Entropy can be defined as a measure of the average information content per source symbol. Claude Shannon, the “father of information theory”, provided a formula for it as −
$$H = - \sum_{i} p_i \log_{b}p_i$$
Where $p_i$ is the probability of occurrence of character number $i$ in a given stream of characters and $b$ is the base of the logarithm used (with $b = 2$, entropy is measured in bits). Hence, this is also called Shannon’s entropy.
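The entropy formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `shannon_entropy` and the example distributions are assumptions chosen for this sketch, and terms with zero probability are skipped because they contribute nothing to the sum.

```python
import math

def shannon_entropy(probs, base=2):
    """Return H = -sum(p_i * log_b(p_i)) for a list of symbol probabilities."""
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Zero-probability symbols are skipped (they add nothing to the sum).
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Hypothetical example sources, chosen only for illustration.
fair_coin = shannon_entropy([0.5, 0.5])      # two equally likely symbols
biased_coin = shannon_entropy([0.9, 0.1])    # one symbol dominates
four_symbols = shannon_entropy([0.25] * 4)   # four equally likely symbols

print(fair_coin, biased_coin, four_symbols)
```

The biased source has lower entropy than the fair one, which matches the intuition that a predictable source carries less information per symbol.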
The amount of uncertainty remaining about the channel input after observing the channel output is called conditional entropy. It is denoted by $H(x \mid y)$.
Let us consider a channel whose input is X and output is Y.
Let the prior uncertainty about the channel input be H(x).
(This is the uncertainty before the channel output is observed.)
To find the uncertainty that remains about the input after the output is observed, let us consider the conditional entropy given that $Y = y_k$
$$H\left ( X\mid y = y_k \right ) = \sum_{j = 0}^{J - 1}p\left ( x_j \mid y_k \right )\log_{2}\left [ \frac{1}{p(x_j \mid y_k)} \right ]$$
This quantity is a random variable that takes the values $H(X \mid y = y_0), \ldots, H(X \mid y = y_{K-1})$ with probabilities $p(y_0), \ldots, p(y_{K-1})$ respectively, where $J$ and $K$ are the number of symbols in the input and output alphabets.
The mean value of $H(X \mid y = y_k)$ over the output alphabet is −
$H\left ( X\mid Y \right ) = \displaystyle\sum\limits_{k = 0}^{K - 1}H\left ( X \mid y=y_k \right )p\left ( y_k \right )$
$= \displaystyle\sum\limits_{k = 0}^{K - 1} \displaystyle\sum\limits_{j = 0}^{J - 1}p\left (x_j \mid y_k \right )p\left ( y_k \right )\log_{2}\left [ \frac{1}{p\left ( x_j \mid y_k \right )} \right ]$
$= \displaystyle\sum\limits_{k = 0}^{K - 1} \displaystyle\sum\limits_{j = 0}^{J - 1}p\left (x_j ,y_k \right )\log_{2}\left [ \frac{1}{p\left ( x_j \mid y_k \right )} \right ]$
The last step uses $p(x_j, y_k) = p(x_j \mid y_k)\,p(y_k)$.
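The final double sum for the conditional entropy can be evaluated directly from a joint probability table. The sketch below assumes the table is stored as a nested list with `joint[j][k]` holding $p(x_j, y_k)$; the function name and the example channel (equiprobable binary input, each symbol flipped with probability 0.1) are hypothetical and serve only as an illustration.

```python
import math

def conditional_entropy(joint):
    """Return H(X|Y) in bits, where joint[j][k] = p(x_j, y_k)."""
    num_x = len(joint)
    num_y = len(joint[0])
    # Marginal p(y_k), obtained by summing the joint table over the inputs.
    p_y = [sum(joint[j][k] for j in range(num_x)) for k in range(num_y)]
    total = 0.0
    for j in range(num_x):
        for k in range(num_y):
            p_xy = joint[j][k]
            if p_xy > 0:  # zero-probability pairs contribute nothing
                p_x_given_y = p_xy / p_y[k]
                total += p_xy * math.log2(1.0 / p_x_given_y)
    return total

# Hypothetical joint distribution: binary input, 10% chance of a flip.
noisy_joint = [[0.45, 0.05],
               [0.05, 0.45]]
# Hypothetical joint distribution where output is independent of input.
independent_joint = [[0.25, 0.25],
                     [0.25, 0.25]]

h_noisy = conditional_entropy(noisy_joint)
h_independent = conditional_entropy(independent_joint)
print(h_noisy, h_independent)
```

When the output is independent of the input, observing it resolves nothing, so the conditional entropy stays at the full 1 bit of the input.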
Now, comparing the uncertainty before and after observing the channel output, the difference $H(x) - H(x \mid y)$ represents the uncertainty about the channel input that is resolved by observing the channel output.
This is called the Mutual Information of the channel.
Denoting the Mutual Information as $I(x;y)$, we can write this as an equation, as follows
$$I(x;y) = H(x) - H(x \mid y)$$
This equation is the defining expression for Mutual Information.
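As a rough numerical illustration of $I(x;y) = H(x) - H(x \mid y)$, the following sketch computes both terms from a joint probability table. The helper names and the example table (the same hypothetical binary channel with a 10% flip probability) are assumptions made for this example.

```python
import math

def entropy(probs):
    """Entropy in bits of a list of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(X|Y) in bits, where joint[j][k] = p(x_j, y_k)."""
    p_y = [sum(col) for col in zip(*joint)]  # marginal of the output
    total = 0.0
    for row in joint:
        for p_xy, p_yk in zip(row, p_y):
            if p_xy > 0:
                total += p_xy * math.log2(p_yk / p_xy)  # log2(1 / p(x|y))
    return total

def mutual_information(joint):
    """I(X;Y) = H(X) - H(X|Y) in bits."""
    p_x = [sum(row) for row in joint]  # marginal of the input
    return entropy(p_x) - conditional_entropy(joint)

# Hypothetical joint distribution: equiprobable binary input, 10% flips.
noisy_joint = [[0.45, 0.05],
               [0.05, 0.45]]
mi_noisy = mutual_information(noisy_joint)

# Hypothetical joint distribution with independent input and output.
mi_independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])
print(mi_noisy, mi_independent)
```

For the noisy channel, about 0.53 of the 1 bit of input uncertainty is resolved by observing the output; for the independent case, nothing is resolved.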
These are the properties of Mutual information.
Mutual information of a channel is symmetric.
$$I(x;y) = I(y;x)$$
Mutual information is non-negative.
$$I(x;y) \geq 0$$
Mutual information can be expressed in terms of entropy of the channel output.
$$I(x;y) = H(y) - H(y \mid x)$$
Where $H(y \mid x)$ is the conditional entropy of the channel output given the channel input.
Mutual information of a channel is related to the joint entropy of the channel input and the channel output.
$$I(x;y) = H(x)+H(y) - H(x,y)$$
Where the joint entropy $H(x,y)$ is defined by
$$H(x,y) = \displaystyle\sum\limits_{j=0}^{J-1} \displaystyle\sum\limits_{k=0}^{K-1}p(x_j,y_k)\log_{2} \left ( \frac{1}{p\left ( x_j,y_k \right )} \right )$$
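The properties listed above can be checked numerically on a small example. The sketch below computes the mutual information in three ways (from the input side, from the output side, and from the joint entropy) and compares the results. The joint table is a hypothetical, deliberately asymmetric example, and the helper names are assumptions of this sketch; a single numerical check illustrates the identities but does not prove them.

```python
import math

def entropy(probs):
    """Entropy in bits of a list of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(row variable | column variable) in bits for a joint table."""
    col_marginal = [sum(col) for col in zip(*joint)]
    total = 0.0
    for row in joint:
        for p_rc, p_c in zip(row, col_marginal):
            if p_rc > 0:
                total += p_rc * math.log2(p_c / p_rc)
    return total

def transpose(table):
    """Swap the roles of the row and column variables."""
    return [list(col) for col in zip(*table)]

# Hypothetical joint distribution, joint[j][k] = p(x_j, y_k).
joint = [[0.3, 0.1],
         [0.2, 0.4]]

p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]
h_x, h_y = entropy(p_x), entropy(p_y)
h_xy = entropy([p for row in joint for p in row])  # joint entropy H(x,y)

i_from_input = h_x - conditional_entropy(joint)              # H(x) - H(x|y)
i_from_output = h_y - conditional_entropy(transpose(joint))  # H(y) - H(y|x)
i_from_joint = h_x + h_y - h_xy                              # H(x)+H(y)-H(x,y)

print(i_from_input, i_from_output, i_from_joint)
```

All three values agree up to floating-point rounding, and the result is non-negative, as the properties state.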
We have so far discussed mutual information. The channel capacity of a discrete memoryless channel is the maximum average mutual information that can be conveyed in a single signaling interval (one use of the channel), where the maximum is taken over all possible probability distributions of the channel input. It can be understood as the maximum rate at which data can be transmitted reliably over the channel.
It is denoted by C and is measured in bits per channel use.
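To make the idea of maximizing mutual information over input distributions concrete, here is a small sketch for one specific channel, the binary symmetric channel, which is an assumption of this example and is not discussed in the text above. For that channel, $H(y \mid x)$ equals the binary entropy of the crossover probability, so $I(x;y) = H(y) - H(y \mid x)$ depends on the input distribution only through $H(y)$. The code searches a grid of input probabilities and compares the result with the well-known closed form $C = 1 - H_b(\varepsilon)$. The function names, the grid size, and the crossover value 0.1 are all illustrative choices.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary variable with probabilities p and 1 - p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mutual_information(q, eps):
    """I(X;Y) for a binary symmetric channel.

    q is P(X = 0) and eps is the crossover (bit-flip) probability.
    Uses I = H(Y) - H(Y|X), with H(Y|X) = binary_entropy(eps).
    """
    p_y0 = q * (1 - eps) + (1 - q) * eps  # probability that the output is 0
    return binary_entropy(p_y0) - binary_entropy(eps)

def bsc_capacity_by_search(eps, steps=1000):
    """Approximate the capacity by a grid search over input distributions."""
    return max(bsc_mutual_information(i / steps, eps) for i in range(steps + 1))

eps = 0.1  # assumed crossover probability
capacity = bsc_capacity_by_search(eps)
closed_form = 1.0 - binary_entropy(eps)
print(capacity, closed_form)
```

The search finds its maximum at equiprobable inputs, and the value matches the closed form at roughly 0.53 bits per channel use.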
A source that emits data at successive intervals, where each emitted symbol is independent of the previous ones, is termed a discrete memoryless source.
The source is discrete because it is considered at discrete time intervals rather than over a continuous time interval. It is memoryless because the symbol emitted at each instant does not depend on previous values.
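A discrete memoryless source can be simulated by drawing each symbol independently from a fixed probability distribution. The sketch below does this with Python's standard `random` module and compares the entropy estimated from the emitted stream with the entropy of the true distribution. The alphabet, the probabilities, the seed, and the stream length are hypothetical values chosen for illustration.

```python
import math
import random
from collections import Counter

def entropy(probs):
    """Entropy in bits of a list of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def emit(symbols, probs, count, rng):
    """Emit `count` symbols, each drawn independently of all earlier ones."""
    return rng.choices(symbols, weights=probs, k=count)

# Hypothetical source alphabet and symbol probabilities.
symbols = ["a", "b", "c"]
probs = [0.5, 0.25, 0.25]

rng = random.Random(0)  # fixed seed so the run is repeatable
stream = emit(symbols, probs, 100_000, rng)

# Estimate symbol probabilities from the stream and compute their entropy.
counts = Counter(stream)
empirical_entropy = entropy([counts[s] / len(stream) for s in symbols])
true_entropy = entropy(probs)
print(empirical_entropy, true_entropy)
```

Because the source has no memory, the relative frequencies in a long stream approach the symbol probabilities, so the empirical entropy comes out close to the true value of 1.5 bits per symbol.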