Entropy, (mis)concepts.

1.   Why discuss the concept or word Entropy [loss of energy]?

In its physics context the development of a theory of heat loss in Carnot-type cycles is easy to trace and verify. However, the coining of the word (information) entropy in the context of sending signals over telephone lines has led to conceptual confusion, mainly in the humanities and electrical engineering, but also reflecting back on philosophical discussions about the nature of the universe.

This note investigates the application of the concept of information_entropy to image analysis. Image Analysis and Pattern Recognition are relevant to the field of Remote Sensing, where most data are represented in the form of images.

A method for information extraction based on falsely applied concepts will propagate through the education system and lead to an increasing number of pseudo-scientific publications. As a consequence the field will become even more of an art, rather than a science, than it already is.

2.  How did the concept of Entropy develop ?

2.1 History

2.1.1 Heat engines. Ref. http://en.wikipedia.org/wiki/History_of_entropy shows how the building of heat engines stimulated the development of models for understanding why initially a lot of 'heat' was lost in the conversion to mechanical energy. Carnot put the theory of heat engines on a firm basis by defining the closed Carnot cycle, which sets a limit on the maximum amount of useful mechanical energy that can be extracted by letting a quantity of heat Q work through a 'universal' Carnot engine between heat reservoirs at temperatures T1 and T2. The measure for mechanical-equivalent heat, Q/T, gives over a Carnot cycle [Clausius, 1854]:

 \Delta S = Q\left(\frac {1}{T_2} - \frac {1}{T_1}\right)
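As a quick numeric illustration of the Clausius expression (the example values are my own, not from the reference), a minimal sketch in Python:

```python
# Entropy change when heat Q flows from a hot reservoir at T1
# to a cold reservoir at T2, per Clausius: dS = Q*(1/T2 - 1/T1).
def delta_S(Q, T1, T2):
    return Q * (1.0 / T2 - 1.0 / T1)

# Illustrative values only: 1000 J flowing from 500 K to 300 K.
print(delta_S(1000.0, 500.0, 300.0))  # positive, ~1.33 J/K
```

The result is positive, as it must be for spontaneous heat flow from hot to cold.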

The term entropy was coined by Clausius from the Greek tropē (transformation), deliberately similar in form to 'energy'; in the heat-engine context it measures the loss of usable energy.


2.1.2 Statistics of states.

In 1877, Ludwig Boltzmann formulated an alternative, statistical definition of the entropy S:

S = k_{\rm B} \ln \Omega \!


where kB is Boltzmann's constant and
Ω is the number of microstates consistent with the given macrostate.

Boltzmann saw entropy as a measure of statistical “mixedupness” or disorder. This concept was soon refined by J. Willard Gibbs, and is now regarded as one of the cornerstones of the theory of statistical mechanics.

For more details, quotes from http://en.wikipedia.org/wiki/Boltzmann%27s_entropy_formula follow:

In statistical thermodynamics, Boltzmann's equation is a probability equation relating the entropy S of an ideal gas to the quantity W, which is the number of microstates corresponding to a given macrostate:

S = k \log W            (1)

where k is Boltzmann's constant, equal to 1.38062 × 10^−23 joule/kelvin, and W is the number of microstates consistent with the given macrostate.

The equation was originally formulated by Ludwig Boltzmann between 1872 and 1875, but later put into its current form by Max Planck in about 1900.[2][3] To quote Planck, “the logarithmic connection between entropy and probability was first stated by L. Boltzmann in his kinetic theory of gases.”

For thermodynamic systems where microstates of the system may not have equal probabilities, the appropriate generalization, called the Gibbs entropy, is:

 S = - k \sum p_i \log p_i            (3)

This reduces to equation (1) if the probabilities pi are all equal.
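The reduction of (3) to (1) is easy to check numerically; a small sketch in Python (with k set to 1 for simplicity):

```python
import math

def gibbs_entropy(probs, k=1.0):
    # S = -k * sum(p_i * log(p_i)); states with p_i == 0 contribute nothing.
    return -k * sum(p * math.log(p) for p in probs if p > 0)

# With W equally probable microstates (p_i = 1/W) the Gibbs entropy
# reduces to the Boltzmann form S = k * log(W).
W = 16
uniform = [1.0 / W] * W
print(gibbs_entropy(uniform), math.log(W))  # the two values coincide
```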

Boltzmann used a \rho\log\rho formula as early as 1866.[4] He interpreted \rho as a density in phase space—without mentioning probability—but since this satisfies the axiomatic definition of a probability measure we can retrospectively interpret it as a probability anyway. Gibbs gave an explicitly probabilistic interpretation in 1878.

Back to quotes from reference : http://en.wikipedia.org/wiki/History_of_entropy

2.1.3 Statistics of signal transmission over telephone lines.

Information theory

[Should be called theory of transmission of signals, NJM]

An analog to thermodynamic entropy is information entropy. In 1948, while working at Bell Telephone Laboratories, electrical engineer Claude Shannon set out to mathematically quantify the statistical nature of “lost information” in phone-line signals. To do this, Shannon developed the very general concept of information entropy, a fundamental cornerstone of information theory. Although the story varies, initially it seems that Shannon was not particularly aware of the close similarity between his new quantity and earlier work in thermodynamics. In 1949, however, when Shannon had been working on his equations for some time, he happened to visit the mathematician John von Neumann. During their discussions, regarding what Shannon should call the “measure of uncertainty” or attenuation in phone-line signals with reference to his new information theory, according to one source:[10]

My greatest concern was what to call it. I thought of calling it ‘information’, but the word was overly used, so I decided to call it ‘uncertainty’. When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, ‘You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.’

According to another source, when von Neumann asked him how he was getting on with his information theory, Shannon replied:[11]

The theory was in excellent shape, except that he needed a good name for “missing information”. “Why don’t you call it entropy”, von Neumann suggested. “In the first place, a mathematical development very much like yours already exists in Boltzmann’s statistical mechanics, and in the second place, no one understands entropy very well, so in any discussion you will be in a position of advantage.”

In 1948 Shannon published his famous paper A Mathematical Theory of Communication, in which he devoted a section to what he calls Choice, Uncertainty, and Entropy.[12] In this section, Shannon introduces an H function of the following form:

H = -K\sum_{i=1}^k p(i) \log p(i),

where K is a positive constant. Shannon then states that “any quantity of this form, where K merely amounts to a choice of a unit of measurement, plays a central role in information theory as measures of information, choice, and uncertainty.” Then, as an example of how this expression applies in a number of different fields, he references R.C. Tolman’s 1938 Principles of Statistical Mechanics, stating that “the form of H will be recognized as that of entropy as defined in certain formulations of statistical mechanics where pi is the probability of a system being in cell i of its phase space… H is then, for example, the H in Boltzmann’s famous H theorem.” As such, over the last fifty years, ever since this statement was made, people have been overlapping the two concepts or even stating that they are exactly the same.

Shannon’s information entropy is a much more general concept than statistical thermodynamic entropy. Information entropy is present whenever there are unknown quantities that can be described only by a probability distribution. In a series of papers by E. T. Jaynes starting in 1957,[13][14] the statistical thermodynamic entropy can be seen as just a particular application of Shannon’s information entropy to the probabilities of particular microstates of a system occurring in order to produce a particular macrostate.

End of quote


3.   What can go wrong when we apply Boltzmann’s theorem in the spirit of Shannon: “give a similar formula a name borrowed from a physics model”?

3.1  Image Analysis. Goal: invert image-generation models in terms of sensors, illumination, platform, object geometry and object physical properties such as spectral absorption, or estimate a macrostate variable such as temperature.

The entropy concept is used in physics on a large collection of stochastically changing microstates. Shannon defines: pi is the probability of a system being in cell i of its phase space.

3.2.1 As a thought experiment, take a picture of a traffic light and isolate the 3 pixels representing image samples of the red, amber and green lights. What are the microstates of the traffic image? 3 pixels with on/off states allow 8 possible states, or 3 bits of entropy information (Shannon). An essential assumption in thermodynamical entropy models is that the microstates are statistically independent. The three lights, however, are related by an exclusive-or function, reducing the number of microstates to 3 (4 if the traffic-light-off state is included). According to Shannon we now have an entropy of 2 bits or less?
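Under the (questionable) assumption that the states follow an equiprobable distribution, the bit counts in the thought experiment can be checked directly; a minimal sketch:

```python
import math

def entropy_bits(probs):
    # Shannon entropy H = -sum(p * log2(p)) over states with p > 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# 3 independent on/off pixels: 8 equally likely states -> 3 bits.
print(entropy_bits([1/8] * 8))   # 3.0
# Exclusive-or coupling leaves 3 valid states -> log2(3) bits.
print(entropy_bits([1/3] * 3))   # ~1.585
# 4 states, including the all-off state -> 2 bits.
print(entropy_bits([1/4] * 4))   # 2.0
```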

Not correct: one picture of a traffic light does not have a probability distribution, it does not have any uncertainty. If there is no transmission problem then the information_entropy is zero!

3.2.2 What if we include the time domain, like watching a video? Traffic lights are artefacts; the states are programmed and distinguishable. Is the statistical model applicable? Not a bit of Shannon entropy. As a generalisation we can say that, for instance, a modern camera with 10000 pixels and 24 bits/pixel has 0 bits of Shannon entropy, because the image is clear and has fixed, non-stochastic microstates.

The above should be generalised: living or constructed systems with stable parts (microstates) with non-stochastic behaviour have 0 Shannon entropy. What about constructed heat engines, plants, animals? In a thermodynamic sense they can be compared to the universal Carnot heat engine and, under careful considerations, be modelled by Boltzmann's statistical model. But the system as a whole has no relevant thermodynamic, random state changes, so it would have a thermodynamic equivalent temperature T = 0 and random kinetic energy kT = 0. I leave it to the reader to formulate the consequences for discussions on evolution, creation, or applications to psychology.

3.3  What if we add thermodynamic 'noise' to the traffic image? Now the state of the traffic light is related to its image in a stochastic way.

3.3.1 The maximum benefit / cost strategy for image analysis or image classification is to build an expected benefit or expected utility matrix :

Real states: 1, 2, 3 (Red, Amber, Green); estimated states: A, B, C.

Let there be a few 1000 experiments in order to get the statistics of the 'noise'. The table shows the frequency of occurrence:

Real state =        Red   Amber   Green
Estimated = A      8990     330     450
Estimated = B       200     880     150
Estimated = C       500     600    9010


From frequency_of_occurrence(Real, Estimated) we derive conditional probabilities by normalising over the columns, so that each column sums to one. Note that this yields prob(Estimated | Real); prob(Real | Estimated) would require normalising the rows.

ProbEstGivenReal =
0.9278   0.1823   0.0468
0.0206   0.4862   0.0156
0.0516   0.3315   0.9376

If the utility of a correct decision is 1 and -1 for a wrong decision then the expected utility matrix for traffic light ‘interpretation’ is

UtilExpect =
0.9278    -0.1823   -0.0468
-0.0206     0.4862   -0.0156
-0.0516    -0.3315    0.9376

The rule of maximum expected utility provides the following mapping

if A then Red, if B then Amber, if C then Green.

It is hard to see how one could improve on a common-sense rule like maximum_expected_Utility.
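The whole chain, from frequency table to decision rule, fits in a few lines of Python (a sketch using the table from 3.3.1; the variable names are mine):

```python
# Frequency table: rows = estimated A, B, C; columns = real Red, Amber, Green.
FreqOcc = [
    [8990, 330, 450],
    [200,  880, 150],
    [500,  600, 9010],
]

# Normalise each column so that it sums to one.
col_sums = [sum(row[j] for row in FreqOcc) for j in range(3)]
P = [[FreqOcc[i][j] / col_sums[j] for j in range(3)] for i in range(3)]

# Utility +1 for a correct decision, -1 for a wrong one.
Util = [[P[i][j] if i == j else -P[i][j] for j in range(3)] for i in range(3)]

# For each estimated state pick the real state with maximum expected utility.
reals = ["Red", "Amber", "Green"]
for i, est in enumerate("ABC"):
    best = max(range(3), key=lambda j: Util[i][j])
    print(f"if {est} then {reals[best]}")
```

Running this reproduces the mapping: if A then Red, if B then Amber, if C then Green.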

3.3.2 But Shannon's play with words has to be evaluated as to its consequences: what are the microstates? One could defend a choice of uncertain state in the mapping of image states A, B, C to object states Red, Amber, Green. Shannon's recipe requires p log p, applied elementwise to the conditional probability matrix above:

PlogP =
-0.0696 -0.3103 -0.1433
-0.0801 -0.3506 -0.0649
-0.1530 -0.3660 -0.0604

Normalising each column of PlogP to unit sum gives:

PlogPColumnNorm =
0.2299   0.3022   0.5334
0.2647   0.3414   0.2416
0.5054   0.3564   0.2249

which I cannot interpret.

Another play with concepts is called Mutual Information:

 \sum_{i,j } p_{ij} \log \frac{p_{ij}}{p_i p_j }

To evaluate it we start from the joint probability matrix:

PRowCol = FreqOcc/sum(sum(FreqOcc)) =
0.4259   0.0156   0.0213
0.0095   0.0417   0.0071
0.0237   0.0284   0.4268

Taking the logarithm elementwise:

log(PRowCol) =
-0.8536  -4.1584  -3.8483
-4.6592  -3.1776  -4.9469
-3.7429  -3.5606  -0.8514

The elementwise product PRowCol .* log(PRowCol):

Prod =
-0.3635 -0.0650 -0.0820
-0.0441 -0.1325 -0.0352
-0.0887 -0.1012 -0.3634

This shows minima on the diagonal, but otherwise does not help with the estimation of the overlap in sample sets.

The Mutual "Information" is then claimed to be:

Hij = sum(sum(Prod))
Hij =
-1.2756  ?!

But note that this sum is minus the joint entropy, not the mutual information: in the formula above p_ij is divided by the product of the marginals p_i p_j inside the logarithm, which gives a positive value of about 0.53 nat. Either way, a single number compresses away exactly the per-state structure that the mapping from image states to object states needs.
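The puzzling value can be checked in a short sketch: summing p·log p over the joint table gives minus the joint entropy, while the mutual-information formula, with the marginals in the denominator, gives a positive number (Python, natural logarithms; names are mine):

```python
import math

# Frequency table from 3.3.1: rows = estimated, columns = real.
FreqOcc = [
    [8990, 330, 450],
    [200,  880, 150],
    [500,  600, 9010],
]
total = sum(map(sum, FreqOcc))
P = [[f / total for f in row] for row in FreqOcc]            # joint p_ij
prow = [sum(row) for row in P]                               # marginal p_i
pcol = [sum(P[i][j] for i in range(3)) for j in range(3)]    # marginal p_j

# Sum of p*log(p): minus the joint entropy.
Hij = sum(p * math.log(p) for row in P for p in row)

# Mutual information proper, with the marginals in the denominator.
MI = sum(P[i][j] * math.log(P[i][j] / (prow[i] * pcol[j]))
         for i in range(3) for j in range(3))

print(round(Hij, 4))  # ≈ -1.2756 (minus the joint entropy)
print(round(MI, 4))   # ≈ 0.5255 nat, positive as mutual information must be
```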

3.3.3 If the problem statement had been: find the overlap in the given set of samples (Real, Estimated), then from

PRowCol = FreqOcc/sum(sum(FreqOcc)) =
0.4259 0.0156 0.0213
0.0095 0.0417 0.0071
0.0237 0.0284 0.4268

Then the overlap is found from the fraction on the diagonal:

ans = (8990 + 880 + 9010)/21110 = 0.8944,

i.e. about 89% of the samples are mapped to the correct state.
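The diagonal fraction can be verified with a couple of lines of Python:

```python
# Frequency table from 3.3.1: rows = estimated, columns = real.
FreqOcc = [
    [8990, 330, 450],
    [200,  880, 150],
    [500,  600, 9010],
]
total = sum(map(sum, FreqOcc))
# Fraction of samples on the diagonal = fraction classified correctly.
overlap = sum(FreqOcc[i][i] for i in range(3)) / total
print(round(overlap, 4))  # ≈ 0.8944
```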

4  Conclusions:

4.1 In the domain of image analysis, a physical and statistical model is available for estimating, with maximum expected utility, the class or other parameters of object (models) as found in Remote Sensing data.

4.2 The introduction of the term Entropy, mixed up with Information, in the context of data transmission over (analog) telephone lines has led to widespread confusion, highly valued PhD theses and pseudo-scientific publications.

4.3 The core of the confusion is modelling with wordplay instead of defining causal models and model inversion on the basis of sensible utility functions.


PM1. I define information as a statistical relation on specific domains of questions and answers. This contrasts with Shannon, who is concerned with data transmission and data storage. Without questions there is no information [NJM]

PM2. In thermal Remote Sensing, the material properties heat capacity and heat conduction strongly relate to the degrees of freedom in the microstates of the atoms and molecules. In the solid or fluid state the microstates are correlated in space and time, so simple kinetic gas theory is not applicable.

PM3. Photons in free space are a form of directed, non-stochastic energy. Models have to respect the distinction between what is called a photon gas and photons as directed, discrete packages of energy.

Nanno Mulder,

Rotterdam, Saturday morning blog