Biophysical Reports

Estimating the number of available states for normal and tumor tissues in gene expression space

Open Access. Published: April 03, 2022. DOI: https://doi.org/10.1016/j.bpr.2022.100053

      Abstract

      The topology of gene expression space for a set of 12 cancer types is studied by means of an entropy-like magnitude, which measures the volumes of the regions occupied by tumor and normal samples, i.e., the number of available states (genotypes) that can be classified as tumor-like or normal-like, respectively. Computations show that the number of available states is much greater for tumors than for normal tissues, suggesting the irreversibility of the progression to the tumor phase. The entropy is nearly constant for tumors, whereas it exhibits a higher variability in normal tissues, probably due to tissue differentiation. In addition, we show an interesting correlation between the fraction (tumor/normal) of available states and the overlap between the tumor and normal sample clouds, interpreted as a way of reducing the decay rate to the tumor phase in more ordered or structured tissues.

      Why it matters

      Cancer is commonly regarded as a complex disease that is very difficult to reverse. Theoretically, its genesis can be seen as a competition between two dominant genotypes (Kauffman attractors): the normal, homeostatic state and the tumor state. In this work, we support and further elaborate the theoretical framework with present-day gene expression data and a technique for dimensional reduction. We present a method to estimate the number of available states corresponding to possible realizations of the main genotypes. In addition, we introduce a magnitude to describe the intermediate low-fitness barrier between the attractors. We argue that there is an intrinsic relation between the properties of the gene expression space and the cancer risk in the tissue.

      Introduction

      The extreme difficulties in treating cancer (
      • da Silva-Diz V.
      • Lorenzo-Sanz L.
      • Bernat-Peguera A.
      Cancer cell plasticity: Impact on tumor progression and therapy response.
      ) reveal that the survival capabilities of cancer cells are much stronger than those of the somatic cells in our body, restricted by the conditions of homeostasis. The reason for such “advantages” is explained in the atavistic theory of cancer (
      • Davies P.C.W.
      • Lineweaver C.H.
      Cancer tumors as Metazoa 1.0: tapping genes of ancient ancestors.
      ,
      • Domazet-Lošo T.
      • Tautz D.
      Phylostratigraphic tracking of cancer genes suggests a link to the emergence of multicellularity in metazoa.
      ,
      • Lineweaver C.H.
      • Davies P.C.W.
      • Vincent M.D.
      Targeting cancer's weaknesses (not its strengths): Therapeutic strategies suggested by the atavistic model.
      ,
      • Cisneros L.
      • Bussey K.J.
      • Davies P.
      • et al.
      Ancient genes establish stress-induced mutation as a hallmark of cancer.
      ,
      • Trigos A.S.
      • Pearson R.B.
      • Goode D.L.
      • et al.
      Somatic mutations in early metazoan genes disrupt regulatory links between unicellular and multicellular genes in cancer.
      ) as the result of a core genetic program, which helped primitive multicellular organisms to overcome the extreme conditions posed by the ancient earth.
      One aspect of these enhanced capabilities is related to tissue fitness. Cancer cells are known to turn off the mechanism of fitness control in homeostasis and exhibit higher replication rates than stem cells in healthy tissues (
      • Alberts B.
      • Bray D.
      • Raff M.
      • et al.
      Essential cell biology.
      ).
      In vivo measurements of fitness in normal and tumor tissues could be a difficult task. However, there is a way of looking at fitness that is related to the number of available states for a system in phase space and may be the subject of numerical computations. Indeed, for a tissue (or a small portion of it), there should be a fitness landscape in gene expression space (GES) (
      • Wright S.
      The roles of mutation, inbreeding, crossbreeding, and selection in evolution.
      ). Regions of high fitness are characterized by their volumes, which should be proportional to the number of available states for the system.
      In the present paper, we aim at estimating the number of available states for normal and tumor tissues or, more precisely, the ratio of numbers for the tumor and the corresponding normal tissue. To this end, we process gene expression data for 12 types of cancer, coming from The Cancer Genome Atlas (TCGA) portal (
      • Tomczak K.
      • Czerwińska P.
      • Wiznerowicz M.
      The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge.
      ). Notice that, in the TCGA data, gene expression levels are measured in small tissue samples, obtained from biopsies. Although modern techniques allow measuring the expressions in individual cells (
      • Blainey P.C.
      • Quake S.R.
      Dissecting genomic diversity, one cell at a time.
      ), we stress that a micro-sample contains the information coming from many different cells and their interactions. With the purpose of estimating the relative fitness, the comparison between pathologically cancerous and normal samples is meaningful and realistic.
      The idea to measure the number of available states is to use an analogy with statistical physics (
      • Moore C.C.
      Ergodic theorem, ergodic theory, and statistical mechanics.
      ) or semiclassical mechanics (
      • Berry M.V.
      Evolution of semiclassical quantum states in phase space.
      ) in which this number is proportional to the volume spanned by the system in phase space. In our biology problem, we understand that GES is a kind of configuration space, and the fitness landscape plays the role of external potential in physics (
      • Wang J.
      Landscape and flux theory of non-equilibrium dynamical systems with application to biology.
      ).
      The normal and tumor regions in GES define attractors (
      • Wang J.
      Landscape and flux theory of non-equilibrium dynamical systems with application to biology.
      ,
      • Kauffman S.A.
      Metabolic stability and epigenesis in randomly constructed genetic nets.
      ,
      • Huang S.
      • Ernberg I.
      • Kauffman S.
      Cancer attractors: A systems view of tumors from a gene network dynamics and developmental perspective.
      ), which are the local maxima of fitness. They are separated by a low-fitness barrier. As will become apparent below, normal and tumor samples from the TCGA data are distributed around their respective attractors. These samples come from different individuals, each one with a particular history of tumor progression. By using an analogy with the ergodic principle (
      • Moore C.C.
      Ergodic theorem, ergodic theory, and statistical mechanics.
      ), we assume that the actual distribution of samples is a phase portrait of the trajectories of an ensemble of microstates that start in the normal region. Some of these microstates progress to the tumor zone. Thus, we fit the observed distributions with gaussians (the functions with lowest bias from the point of view of information theory) and “compute” the volumes (hypervolumes) of their basins of attraction by means of an entropy-like magnitude that is roughly the logarithm of the volume.
      The computation has subtle details, in particular the dimensionality. Recall that the number of genes in the data is around 60,000. The PCA (
      • Wold S.
      • Kim E.
      • Paul G.
      Principal component analysis.
      ,
      • Gonzalez A.
      • Nieves J.
      • Sosa P.V.
      • et al.
      Gene expression rearrangements denoting changes in the biological state.
      ) of the data is truncated to the first 20 components by using the variance distribution and a criterion from information theory (
      • Cavanaugh J.E.
      • Neath A.A.
      The Akaike information criterion: Background, derivation, properties, application, interpretation, and refinements.
      ). The result is that we may not only compare the volumes of the normal and tumor basins of attraction in a tissue, but also among different tissues.
      In addition, the computed density distributions allow us to estimate the overlap between the normal and tumor clouds of samples. This magnitude shows an interesting correlation with the ratio of basin volumes.
      The main results of our paper are the following. First, the number of available states is much higher for tumors than for normal tissues. This may be expected since a homeostatic tissue has fewer possibilities of realization or a more constrained order than the primitive multicellular tumor. Second, the entropy of tumors takes a nearly constant value, a fact consistent with their common evolutionary origin in the atavistic theory. Normal tissues, on the contrary, exhibit a higher variability of their entropy, probably a manifestation of tissue differentiation. And third, there is a correlation between the ratio of basin volumes and the overlap between the normal and tumor clouds of samples, indicating a nontrivial topology of GES aimed at reducing the decay rate (the cancer risk) of more ordered or structured tissues. These facts are discussed next.

      Methods

      Principal component analysis

      The TCGA data for the tissues described in Table 1 is analyzed by means of the principal component analysis (PCA) technique. The details of the PCA are described in paper (
      • Gonzalez A.
      • Perera Y.
      • Perez R.
      On the gene expression landscape of cancer.
      ). We briefly sketch them in the present section.
      Table 1. The set of studied cancer types and the main results of the paper. Note that the definition of the abbreviations used in this table and the details of how error bars are estimated are provided in the Supporting Material.
      Tissue   Normal samples   Tumor samples   ΔS             S_tumor   −ln I
      BRCA     112              1096            16.43 ± 1.12   87.67     13.25 ± 1.51
      COAD     41               473             22.11 ± 2.33   83.05     28.16 ± 6.19
      HNSC     44               502             9.93 ± 0.99    86.75     10.33 ± 1.05
      KIRC     72               539             19.24 ± 1.53   86.32     17.51 ± 1.43
      KIRP     32               289             23.32 ± 1.97   83.87     25.89 ± 6.16
      LIHC     50               374             26.06 ± 0.93   87.22     16.12 ± 3.94
      LUAD     59               535             23.40 ± 1.18   89.34     16.21 ± 2.40
      LUSC     49               502             25.36 ± 0.99   87.34     21.84 ± 4.95
      PRAD     52               499             5.87 ± 1.41    82.04     6.60 ± 0.67
      STAD     32               375             14.38 ± 1.44   87.66     14.12 ± 4.04
      THCA     58               510             13.38 ± 0.87   84.00     9.77 ± 1.59
      UCEC     23               552             15.75 ± 2.43   82.95     22.06 ± 5.14
      Gene expressions are given in fragments per kilobase of gene length per million of reads (FPKM) format. The number of genes is 60,483. This is the dimension of matrices in the PCA processing.
      We take the geometric mean over normal samples to define the reference expression for each gene, e_ref. The normalized or differential expression is then defined as e_diff = e/e_ref, and the fold variation is defined in terms of the logarithm, ê = log2(e_diff). Besides reducing the variance, the logarithm allows treating over- and underexpression in a symmetrical way.
      Deviations and variances are measured with respect to ê = 0, that is, with respect to the average over normal samples. This choice is quite natural because normal samples are the majority in a population.
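      The normalization just described can be sketched in a few lines of Python. This is only an illustration, not the authors' script: the FPKM values are made up, and the `floor` argument is a hypothetical safeguard (not mentioned in the text) that keeps zero readings from producing an infinite logarithm.

```python
import math
from statistics import geometric_mean


def fold_variation(expr, normal_expr, floor=0.1):
    """Return log2(e / e_ref) for one gene.

    e_ref is the geometric mean of the gene over the normal samples.
    `floor` is an assumed pseudo-value that keeps zero readings finite.
    """
    e_ref = geometric_mean([max(v, floor) for v in normal_expr])
    return [math.log2(max(v, floor) / e_ref) for v in expr]


# Made-up FPKM values for a single gene
normal = [4.0, 8.0, 16.0]   # geometric mean is 8
tumor = [32.0, 2.0]         # 4x above and 4x below the reference

print(fold_variation(tumor, normal))
```

      A fourfold overexpression and a fourfold underexpression map to fold variations of equal size and opposite sign, which is the symmetry mentioned above.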
      With these assumptions, the covariance matrix is written:
      $$\sigma_{ij} = \frac{1}{N_{\text{samples}}-1}\sum_{s}\hat{e}_i(s)\,\hat{e}_j(s), \tag{1}$$

      where the sum runs over the samples, s, and N_samples is the total number of samples (normal plus tumor). ê_i(s) is the fold variation of gene i in sample s.
      As mentioned, the dimension of matrix σ is 60,483. By diagonalizing it, we get the axes of maximal variance: the principal components (PCs). They are sorted in descending order of their contribution to the variance.
      In lung squamous cell cancer (LUSC), for example, PC1 accounts for 67% of the variance. This large number is partly due to our choice of the reference, ê = 0, and the fact that most of the samples are tumors. The reward is that PC1 may be defined as the cancer axis. The projection over PC1 defines whether a sample is classified as normal or tumor.
      The next PCs account for smaller fractions of the variance: PC2 accounts for 4%, PC3 for 3%, etc. Around 20 PCs are enough for an approximate description of the region of the GES occupied by the set of samples.
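      The procedure of this section can be illustrated with a toy example. The sketch below uses NumPy on synthetic data (the sample counts, gene count, noise level, and tumor shift are all invented for illustration) and builds the covariance about ê = 0, without subtracting the overall mean, as in Eq. (1). With the real 60,483-gene matrix one would more likely take a singular value decomposition of the sample matrix, which yields the same principal components without forming the full covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic fold variations: rows are samples, columns are genes (toy sizes)
n_normal, n_tumor, n_genes = 10, 40, 200
E = rng.normal(0.0, 0.3, size=(n_normal + n_tumor, n_genes))
E[n_normal:, :20] += 3.0          # tumors are shifted in 20 "deregulated" genes

# Covariance measured about e_hat = 0 (the normal reference), as in Eq. (1)
sigma = E.T @ E / (E.shape[0] - 1)

# Diagonalize and sort the axes by decreasing contribution to the variance
eigval, eigvec = np.linalg.eigh(sigma)
order = np.argsort(eigval)[::-1]
eigval, eigvec = eigval[order], eigvec[:, order]
explained = eigval / eigval.sum()

# New coordinates: projections of each sample over the first 20 PCs
coords = E @ eigvec[:, :20]
pc1 = coords[:, 0]

print("fraction of variance in PC1:", round(float(explained[0]), 2))
print("mean |PC1|, normal vs tumor:",
      np.abs(pc1[:n_normal]).mean(), np.abs(pc1[n_normal:]).mean())
```

      Because the reference is the normal state and most samples are tumors, the first axis is dominated by the normal-to-tumor displacement, which mirrors the role of PC1 as the cancer axis. The sign of an eigenvector is arbitrary, so the comparison uses absolute values.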

      Entropy and overlapping integral

      For a sample, the projections over the PC vectors define the new coordinates. These are the starting data for the computation of the configurational entropy. We organize it as 24 matrices M, each one corresponding to a tissue in a stage, for example M(LIHC, tumor). The number of columns in any case is 20 (number of PCs), and the number of rows is the number of samples, as reported in Table 1.
      From M the sample covariance matrix, Σ, is defined as
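      $$\Sigma_{jk} = \frac{1}{N-1}\sum_{i=1}^{N}\left(M_{ij}-\mu_j\right)\left(M_{ik}-\mu_k\right), \tag{2}$$

      where $\mu_j = \frac{1}{N}\sum_{i=1}^{N} M_{ij}$ is the mean value of coordinate j in the set of samples.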
      In order to find probability distributions for the sets of normal and tumor samples, we maximize the entropy taking the covariance matrices as constraints. These are quadratic constraints, so the result is a multivariate gaussian (
      • Caticha A.
      Entropic inference and the foundations of physics.
      ):
      $$\rho(\bar{x}) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}\exp\left[-\frac{1}{2}(\bar{x}-\bar{\mu})^{T}\,\Sigma^{-1}\,(\bar{x}-\bar{\mu})\right]. \tag{3}$$

      Notice our convention for vectors, $\bar{x}$. There are advantages in using this procedure. First, with normal distributions we may analytically compute the quantities of interest. Second, this distribution is, in accordance with the central limit theorem, an estimation of the actual distribution for much larger data sets. And third, this distribution is, from the point of view of information theory, the most unbiased one with regard to data covariance, i.e., no heuristic criteria have been used for choosing it.
      The general definitions of our target quantities, the entropy and the overlap integral, are given in Results and discussion. In the particular case of a probability distribution given by Eq. (3), we get the following:
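      $$S = \frac{1}{2}\ln|\Sigma| + \frac{D}{2}\left(1+\ln 2\pi\right), \tag{4}$$

      $$I = 2^{D/2}\,\frac{|\Lambda_n|^{1/4}\,|\Lambda_t|^{1/4}}{|\Lambda_c|^{1/2}}\exp\left[\frac{1}{4}\left(\eta_c^{T}\Lambda_c^{-1}\eta_c - \mu_n^{T}\Lambda_n\mu_n - \mu_t^{T}\Lambda_t\mu_t\right)\right], \tag{5}$$

      where $\Lambda_j = \Sigma_j^{-1}$ for $j = n, t$; $\eta_c = \Lambda_n\mu_n + \Lambda_t\mu_t$; and $\Lambda_c = \Lambda_n + \Lambda_t$.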
      Details on the dependence of S and I on the effective dimension and the number of samples used in their computation are provided in the Supporting Material.
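      Eqs. (4) and (5) translate directly into a few lines of NumPy. The following is a minimal sketch rather than the script in the authors' repository; the function names are ours, and the overlap is returned as ln I and evaluated through log-determinants, a choice made here to avoid numerical underflow when I is very small.

```python
import numpy as np


def gaussian_entropy(cov):
    """Entropy of a multivariate Gaussian, Eq. (4)."""
    D = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * logdet + 0.5 * D * (1.0 + np.log(2.0 * np.pi))


def log_overlap(mu_n, cov_n, mu_t, cov_t):
    """Logarithm of the overlap integral I between two Gaussians, Eq. (5)."""
    D = len(mu_n)
    lam_n = np.linalg.inv(cov_n)
    lam_t = np.linalg.inv(cov_t)
    lam_c = lam_n + lam_t
    eta_c = lam_n @ mu_n + lam_t @ mu_t
    log_prefactor = (0.5 * D * np.log(2.0)
                     + 0.25 * np.linalg.slogdet(lam_n)[1]
                     + 0.25 * np.linalg.slogdet(lam_t)[1]
                     - 0.5 * np.linalg.slogdet(lam_c)[1])
    exponent = 0.25 * (eta_c @ np.linalg.solve(lam_c, eta_c)
                       - mu_n @ lam_n @ mu_n
                       - mu_t @ lam_t @ mu_t)
    return log_prefactor + exponent


# Toy check in D = 2: unit covariances, centers two units apart
mu_n, mu_t = np.zeros(2), np.array([2.0, 0.0])
cov = np.eye(2)
print(gaussian_entropy(cov))                # equals 1 + ln(2*pi) for D = 2
print(log_overlap(mu_n, cov, mu_t, cov))    # equal covariances: -|dmu|^2 / 8
```

      For a matrix M of sample coordinates (rows are samples, columns are PCs), `M.mean(axis=0)` and `np.cov(M, rowvar=False)` give the mean vector and the covariance of Eq. (2), since `np.cov` uses the N − 1 normalization by default.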

      Results and discussion

      Entropy in gene expression space

      As mentioned, our starting point is the TCGA expression data for 12 tumors and the corresponding normal tissues. The selected types of cancer are characterized by more than 20 normal and more than 300 tumor samples, as shown in Table 1.
      We perform a PCA (
      • Wold S.
      • Kim E.
      • Paul G.
      Principal component analysis.
      ,
      • Lever J.
      • Krzywinski M.
      • Altman N.
      Principal component analysis.
      ,
      • Ringnér M.
      What is principal component analysis?.
      ) of the expression data. Methodological aspects are detailed in paper (
      • Gonzalez A.
      • Perera Y.
      • Perez R.
      On the gene expression landscape of cancer.
      ), where we study the topology of GES for normal and tumor tissues. For completeness, we sketch the main results of that paper that shall be used in our computations. Details can be found in the Methods section and the Supporting Material, in particular:
      • 1
        Although there are around 60,000 genes, normal tissues and tumors span a region with reduced effective dimension. We therefore use the first 20 principal components to describe the state of a sample in GES. These 20 components capture no less than 85% of the total variance in the dispersion of experimental samples in GES, and practically saturate the Akaike information criterion (
        • Cavanaugh J.E.
        • Neath A.A.
        The Akaike information criterion: Background, derivation, properties, application, interpretation, and refinements.
        ).
      • 2
        For a given tissue, normal samples are well separated from tumor samples in GES. Both regions seem to be basins of attraction of two singular points: the normal homeostatic and the cancer attractors.
      Fig. 1, upper panel, shows, as an example, the (PC1, PC2) plane for LUSC (in TCGA notation). Points in the figure represent samples from different patients. The clouds of points are grouped in well-defined regions defining the attractors. We shall estimate the volume of each region, which gives an indication of the number of accessible states.
      Figure 1. Upper panel: PCA of gene expression data for squamous cell lung cancer (LUSC). The position along the first axis (PC1) discriminates between a normal sample and a tumor. Lower panel: Schematics of the fitness landscape. The x axis is again PC1, but the y axis represents the fitness with a minus sign. H and C label the normal (homeostatic) and cancer states, respectively. The maximum fitness in the H state is normalized to unity.
      More precisely, for both normal tissues and tumors, we shall introduce the entropy-like magnitude (
      • Cover T.M.
      • Thomas J.A.
      Elements of Information Theory.
      ):
      $$S = -\int d^{D}\bar{x}\;\rho(\bar{x})\,\ln\rho(\bar{x}), \tag{6}$$


      where D=20 is the number of principal components to be used in the description of the system in GES, and ρ is the probability density, normalized to unity, coming from a fit to the observed sample data. The analytic expression for S used in the calculations is given in the Methods section.
      The relation between the S magnitude and the volume, V, of the basin of attraction is roughly S ≈ ln(V), so S measures the logarithm of the number of available states in the region.
      We fit the observed distribution of sample points to a multivariate gaussian density, ρ. This procedure guarantees a maximal entropy, as compared with any other possible ansatz for ρ, and a minimal bias from the point of view of information theory.
      We show in Table 1 the magnitudes S_tumor and ΔS = S_tumor − S_normal for the set of tissues under study. The way we estimate entropies and error bars is described in the Supporting Material.
      The number of states in GES seems to be much larger for tumors than for normal tissues, leading to ΔS ≫ 1.
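      The link between entropy and volume can be made explicit with a short worked derivation. The symbol V_eff is introduced here only for illustration and does not appear in the original text. For a uniform density over a region of volume V the entropy is exactly ln V, and for the Gaussian of Eq. (3) the expression in Eq. (4) can be rewritten as the logarithm of an effective volume:

```latex
\begin{align}
\rho = \frac{1}{V}\ \text{inside the region}
  &\;\Longrightarrow\; S = -\int d^{D}\bar{x}\,\frac{1}{V}\ln\frac{1}{V} = \ln V, \\
\text{Gaussian, Eq. (4):}\quad
  S &= \ln\!\left[(2\pi e)^{D/2}\,|\Sigma|^{1/2}\right] \equiv \ln V_{\mathrm{eff}}, \\
\Delta S = S_{\mathrm{tumor}} - S_{\mathrm{normal}}
  &= \frac{1}{2}\ln\frac{|\Sigma_t|}{|\Sigma_n|}
   = \ln\frac{V_{\mathrm{eff},t}}{V_{\mathrm{eff},n}} .
\end{align}
```

      The dimension-dependent constant cancels in the difference because the same D is used for both clouds, so exp(ΔS) can be read as the ratio of tumor to normal basin volumes.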
      On the other hand, the number of accessible states appears to be nearly constant for all tumors. Normal tissues exhibit larger variations, which could be perhaps related to tissue differentiation. In other words, the process of de-differentiation of tumors (
      • Friedmann-Morvinski D.
      • Verma I.M.
      Dedifferentiation and reprogramming: origins of cancer stem cells.
      ) seems to involve the increase of the accessible volume in GES to a nearly constant value.
      We cannot provide a rationale for the computed entropies of normal tissues (for example, the lower values in epithelial tissues), nor relate the entropies to their developmental origin. Nor can we relate the normal-state entropy to the risk of cancer in the tissue. The prostate (PRAD), a tissue in which cancer is very common, seems to exhibit the highest disorder (entropy), but lung (LUAD, LUSC) and colon (COAD), with much lower entropies, are also high-risk tissues. Naively, one would expect the entropy difference, not the entropy in the normal state, to correlate with the cancer risk. Indeed, more available tumor states should indicate a higher probability of transition to the tumor region. The question is, however, much more subtle, as shown in the next sections.

      Cloud overlapping

      We may introduce an additional magnitude characterizing the transition region between the two attractors, namely the overlap between the clouds of normal and tumor samples.
      Let us define the density overlap:
      $$I = \int d^{D}\bar{x}\;\sqrt{\rho_{\text{tumor}}(\bar{x})\,\rho_{\text{normal}}(\bar{x})}. \tag{7}$$


      The square root is introduced for normalization purposes. The analytic expression for I, when the ρ are gaussian distributions, is provided in the Methods section.
      The results of computations are shown in Table 1 and Fig. 2 for the set of 12 tissues studied in the present paper. Fig. 2, in which we plot cloud overlapping versus entropy, can be understood as a complexity map (
      • Feldman D.P.
      • McTague C.S.
      • Crutchfield J.P.
      The organization of intrinsic computation: Complexity-entropy diagrams and the diversity of natural information processing.
      ) for different normal tissue-tumor pairs.
      Figure 2. The entropy-overlapping map. Notice that tumors exhibit a nearly constant entropy, and that there is an exponential relationship between the overlap I and the entropy variation ΔS. The details of how error bars are estimated are explained in the Supporting Material.
      The observed overlap could intuitively be related to the distance between the centers of the clouds. The distances along the first PC axis, PC1, are computed in paper (
      • Gonzalez A.
      • Nieves J.
      • Sosa P.V.
      • et al.
      Gene expression rearrangements denoting changes in the biological state.
      ). These computations confirm, for example, that in PRAD the cloud centers are much closer than in COAD or LUSC.
      The numbers in Table 1 and Fig. 2 also indicate an apparent correlation between ln I and ΔS, i.e., ln I = −0.75 ΔS − 3.40, or I ∼ (Vn/Vt)^0.75. The p-value of the linear fit in the log-log plot for these magnitudes is 0.008.
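      The quoted linear relation can be checked against the central values of Table 1 with an ordinary least-squares fit. The sketch below makes two assumptions that are ours, not the authors': the fit is unweighted (the error bars are ignored), and the overlap column of Table 1 is read as −ln I, since the overlap I defined in Eq. (7) cannot exceed 1.

```python
import math

# (Delta S, -ln I) central values copied from Table 1
table = {
    "BRCA": (16.43, 13.25), "COAD": (22.11, 28.16), "HNSC": (9.93, 10.33),
    "KIRC": (19.24, 17.51), "KIRP": (23.32, 25.89), "LIHC": (26.06, 16.12),
    "LUAD": (23.40, 16.21), "LUSC": (25.36, 21.84), "PRAD": (5.87, 6.60),
    "STAD": (14.38, 14.12), "THCA": (13.38, 9.77), "UCEC": (15.75, 22.06),
}

xs = [ds for ds, _ in table.values()]        # Delta S
ys = [-mlni for _, mlni in table.values()]   # ln I (non-positive)

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

slope = sxy / sxx                    # unweighted least-squares slope
intercept = mean_y - slope * mean_x
r = sxy / math.sqrt(sxx * syy)       # Pearson correlation coefficient

print(f"ln I = {slope:.2f} * dS + ({intercept:.2f}),  r = {r:.2f}")
```

      Under these assumptions the fitted slope and intercept come out close to the values quoted in the text, and the negative slope expresses that a larger entropy difference goes with a smaller overlap.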
      The nature of this dependence is intriguing. The fact is that the larger the entropy difference is (the ratio of basin volumes), the smaller is the overlap between the tumor and normal sample clouds. An interpretation for this fact is provided in the next section.

      Fitness landscape and transition rates

      The normal homeostatic state shall be protected against transitions to the cancer state by a barrier. Otherwise the transitions are unavoidable because both the fitness and the number of available states in the cancer region are much higher than in the normal region.
      It is natural to assume that the intermediate region holds a low-fitness barrier, as schematically represented in Fig. 1 lower panel for LUSC. Indeed, the normal homeostatic state is a state with regulated fitness (
      • Benoit B.
      • Hochmuth C.E.
      • Jasper H.
      Maintaining Tissue Homeostasis: Dynamic Control of Somatic Stem Cell Activity.
      ). In cancer, on the other hand, these constraints are removed and tumor growth is only limited by the availability of space and nutrients. The intermediate region is a space for senescence or different kinds of illness, where fitness is reduced and the compensation mechanisms are not capable of keeping homeostasis.
      In Fig. 1, lower panel, we provide a schematic 1D representation of the fitness landscape. The x axis, as in the upper panel, is PC1, which is identified as the cancer axis (
      • Gonzalez A.
      • Perera Y.
      • Perez R.
      On the gene expression landscape of cancer.
      ). The normal and cancer states are well separated along this axis. The y axis, on the other hand, is a sketch for the fitness (with a minus sign), which is obtained simply by smoothing the histogram of samples. In other words, we assume that the observed density of samples at a given point of GES is proportional to the fitness.
      The absolute maximum of fitness is at the cancer attractor (denoted C in the figure). The normal homeostatic state (H) is a local metastable maximum, which should be characterized by a mean decay time, τH. In the figure, the fitness at the homeostatic maximum is normalized to unity. Notice that with a rough estimation of the fitness landscape, we could get, in principle, an estimation for τH, and thus the risk of cancer in a tissue.
      The time for the reverse process to occur, τC, that is, from the tumor to the normal state, is expected to be much larger than τH. We could get a rough value for it by using a kind of detailed balance equation (
      • Wang J.
      Landscape and flux theory of non-equilibrium dynamical systems with application to biology.
      ):
      $$\tau_C = \tau_H\,\frac{N_{\text{states}}(C)}{N_{\text{states}}(H)} = \tau_H\exp(\Delta S). \tag{8}$$

      Eq. (8) forces the product of the decay rate (1/τ) and the number of microstates to be equal in both states, H and C. Taking τH ≈ 60 years, we get for prostate tumors, for example, τC ≈ 10^6 years ≈ 1 My. For thyroid cancer, on the other hand, τC ≈ 200 My. These are fictitious numbers, not related to any biological processes. We compute them with the only purpose of confirming that the progression to cancer is an almost irreversible process.
      On the other hand, it is a curious fact that the required times for early multicellular organisms to evolve to modern metazoans are precisely hundreds of My (
      • Knoll A.H.
      • Nowak M.A.
      The timetable of evolution.
      ). At the level of conglomerates of cells, one can imagine evolution as jumps against entropy, that is, from states like C to states like H. These are highly improbable processes that, however, may be the source of further advantages at a different level of organization. When one says that it may take 200 My to occur, it means that from the many cell conglomerates living in this time period a few of them could make the transition and start a new line of evolution.
      Eq. (8) for the decay time lacks an important factor: the effect of the barrier, which is related to the magnitude I. Wider barriers, corresponding to lower values of I (that is, higher −ln I), should slow down the transitions. From this perspective, the correlations between ln I and ΔS are quite natural. Let us use the same Eq. (8), but now taking τC ≈ 200 My as a reference to estimate τH. A more ordered tissue (greater entropy difference) has a smaller τH, and it should be separated from the tumor by a wider barrier in order to prevent the transitions. This argument, although qualitative and preliminary, indicates a possible and very interesting relation between the topology of GES (volumes and intersections) and the decay rate, which is related to the risk of cancer in a tissue.

      Concluding remarks

      We initiated in papers (
      • Gonzalez A.
      • Nieves J.
      • Sosa P.V.
      • et al.
      Gene expression rearrangements denoting changes in the biological state.
      ,
      • Gonzalez A.
      • Perera Y.
      • Perez R.
      On the gene expression landscape of cancer.
      ) a quantitative study of the topology of GES in tumors. In particular, the distances between the center of the tumor and normal regions and their r.m.s. radii along the PC1 axis were computed and were shown to correlate with the characteristics of the GE distribution functions (
      • Gonzalez A.
      • Nieves J.
      • Sosa P.V.
      • et al.
      Gene expression rearrangements denoting changes in the biological state.
      ).
      In the present paper we deal with two more magnitudes quantifying the topology of GES. First, we estimated the volumes (hypervolumes) of the basins of attraction for the normal and cancer regions in each of the 12 types of cancer described in Table 1. Using an analogy from statistical physics (
      • Moore C.C.
      Ergodic theorem, ergodic theory, and statistical mechanics.
      ) and semiclassical quantum mechanics (
      • Berry M.V.
      Evolution of semiclassical quantum states in phase space.
      ), in which the volume of phase space is related to the number of states, we have related the computed volumes to the number of accessible biological microstates. Volumes are measured by means of a “configurational” entropy-like magnitude, constructed from the probability density of samples in the space. The latter is obtained from a multivariate gaussian fit to the observed distribution of samples. The second magnitude characterizing the topology of GES is the overlap between the normal and tumor clouds, computed from the same probability densities.
      There are subtle details concerning the computation of these quantities that are discussed in the Supporting Material. The first is related to the effective dimension employed in the calculations. We used the variance distribution in the PC analysis and ideas from information theory (
      • Cavanaugh J.E.
      • Neath A.A.
      The Akaike information criterion: Background, derivation, properties, application, interpretation, and refinements.
      ) to define the effective dimension. The same effective dimension, 20, is taken for all of the tissues in such a way that they may be compared. The second issue concerns the number of samples. We use the optimal combination of numbers of tumor and normal samples to obtain the best estimates for S and I.
      The paper has three main results. 1) The number of accessible states is much higher for tumors than for normal samples. 2) All studied tumor localizations have roughly the same number of accessible states, whereas normal tissues exhibit higher variability. 3) The overlap between the tumor and normal sample clouds is roughly proportional to exp(−0.75 ΔS).
      The reduced number of accessible states for the normal tissue can be interpreted as a higher level of organization, compared with the tumor. The nearly constant entropy of tumors points to the common evolutionary origin of tissues, in accordance with the atavistic theory. The higher variability of entropy in normal tissues, on the other hand, can be taken as a manifestation of tissue differentiation and structure. Finally, the correlation between cloud overlapping and the entropy difference is interpreted as a way of slowing down the transition to the cancer state in more organized tissues, indicating a possible very interesting relation between the topology of GES (volumes and intersections) and the risk of cancer in a tissue.
      The results seem consistent with the fundamentals of evolution theory and the atavistic theory of cancer.

      Availability of Data and Materials

      The information about the data we used, the procedures, and results are integrated in a public repository that is part of the project “Processing and Analyzing Mutations and Gene Expression Data in Different Systems”: https://github.com/DarioALeonValido/evolp. To process the data set we include a script in ../evolp/Entropy_Tumors/. The script reads the TCGA data replicated in the folder ../databases_external/TCGA/ and the data coming from the PCA analysis located in path ../databases_generated/TCGA_pca/.

      Author contributions

      A.G. conceived and coordinated the work. F.Q. and A.G. processed the experimental data. F.Q. and D.A.L. contributed to the GitHub repository. M.L.B. and P.V.S. introduced the information theory concepts. All authors analyzed and interpreted the results, contributed to the manuscript, and approved the final version.

      Declaration of interests

      The authors declare no competing interests.

      Acknowledgments

      A.G. acknowledges the Cuban Program for Basic Sciences, the Office of External Activities of the Abdus Salam Centre for Theoretical Physics, and the University of Electronic Science and Technology of China for support. The research is carried out under a project of the Platform for Bio-informatics of BioCubaFarma, Cuba. The data for the present analysis come from the TCGA Research Network: https://www.cancer.gov/tcga (
      • Tomczak K.
      • Czerwińska P.
      • Wiznerowicz M.
      The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge.
      ). Authors are grateful to Gabriel Gil for comments and a critical reading of the manuscript. Authors acknowledge useful suggestions made in the revision process.

      Supporting material

      References

        • da Silva-Diz V.
        • Lorenzo-Sanz L.
        • Bernat-Peguera A.
        Cancer cell plasticity: Impact on tumor progression and therapy response.
        Semin. Cancer Biol. 2018; 53: 48-58 https://doi.org/10.1016/j.semcancer.2018.08.009
        • Davies P.C.W.
        • Lineweaver C.H.
        Cancer tumors as Metazoa 1.0: tapping genes of ancient ancestors.
        Phys. Biol. 2011; 8: 015001 https://doi.org/10.1088/1478-3975/8/1/015001
        • Domazet-Lošo T.
        • Tautz D.
        Phylostratigraphic tracking of cancer genes suggests a link to the emergence of multicellularity in metazoa.
        BMC Biol. 2010; 8: 66 https://doi.org/10.1186/1741-7007-8-66
        • Lineweaver C.H.
        • Davies P.C.W.
        • Vincent M.D.
        Targeting cancer's weaknesses (not its strengths): Therapeutic strategies suggested by the atavistic model.
        Bioessays. 2014; 36: 827-835 https://doi.org/10.1002/bies.201400070
        • Cisneros L.
        • Bussey K.J.
        • Davies P.
        • et al.
        Ancient genes establish stress-induced mutation as a hallmark of cancer.
        PLoS One. 2017; 12: e0176258 https://doi.org/10.1371/journal.pone.0176258
        • Trigos A.S.
        • Pearson R.B.
        • Goode D.L.
        • et al.
        Somatic mutations in early metazoan genes disrupt regulatory links between unicellular and multicellular genes in cancer.
        eLife. 2019; 8: e40947 https://doi.org/10.7554/eLife.40947.001
        • Alberts B.
        • Bray D.
        • Raff M.
        • et al.
        Essential cell biology.
        Garland Science, 2013
        • Wright S.
        The roles of mutation, inbreeding, crossbreeding, and selection in evolution.
        in: Proceedings of the Sixth International Congress of Genetics. 1932: 356-366
        • Tomczak K.
        • Czerwińska P.
        • Wiznerowicz M.
        The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge.
        Contemp. Oncol. (Pozn). 2015; 19: A68-A77 https://doi.org/10.5114/wo.2014.47136
        • Blainey P.C.
        • Quake S.R.
        Dissecting genomic diversity, one cell at a time.
        Nat. Methods. 2014; 11: 19-21 https://doi.org/10.1038/nmeth.2783
        • Moore C.C.
        Ergodic theorem, ergodic theory, and statistical mechanics.
        Proc. Natl. Acad. Sci. U S A. 2015; 112: 1907-1911 https://doi.org/10.1073/pnas.1421798112
        • Berry M.V.
        Evolution of semiclassical quantum states in phase space.
        J. Phys. A Math. Gen. 1979; 12: 625-642 https://doi.org/10.1088/0305-4470/12/5/012
        • Wang J.
        Landscape and flux theory of non-equilibrium dynamical systems with application to biology.
        Adv. Phys. 2015; 64: 1-137 https://doi.org/10.1080/00018732.2015.1037068
        • Kauffman S.A.
        Metabolic stability and epigenesis in randomly constructed genetic nets.
        J. Theor. Biol. 1969; 22: 437-467 https://doi.org/10.1016/0022-5193(69)90015-0
        • Huang S.
        • Ernberg I.
        • Kauffman S.
        Cancer attractors: A systems view of tumors from a gene network dynamics and developmental perspective.
        Semin. Cell Dev. Biol. 2009; 20: 869-876 https://doi.org/10.1016/j.semcdb.2009.07.003
        • Wold S.
        • Esbensen K.
        • Geladi P.
        Principal component analysis.
        Chemometrics Intell. Lab. Syst. 1987; 2: 37-52 https://doi.org/10.1016/0169-7439(87)80084-9
        • Gonzalez A.
        • Nieves J.
        • Sosa P.V.
        • et al.
        Gene expression rearrangements denoting changes in the biological state.
        Sci. Rep. 2021; 11: 8470 https://doi.org/10.1038/s41598-021-87764-0
        • Cavanaugh J.E.
        • Neath A.A.
        The Akaike information criterion: Background, derivation, properties, application, interpretation, and refinements.
        WIREs Comput. Stat. 2019; 11: e1460 https://doi.org/10.1002/wics.1460
        • Gonzalez A.
        • Perera Y.
        • Perez R.
        On the gene expression landscape of cancer.
        arXiv. 2019; (Preprint at)
        • Caticha A.
        Entropic inference and the foundations of physics.
        Brazilian Chapter of the International Society for Bayesian Analysis-ISBrA, Sao Paulo, Brazil, 2012
        • Lever J.
        • Krzywinski M.
        • Altman N.
        Principal component analysis.
        Nat. Methods. 2017; 14: 641-642 https://doi.org/10.1038/nmeth.4346
        • Ringnér M.
        What is principal component analysis?
        Nat. Biotechnol. 2008; 26: 303-304 https://doi.org/10.1038/nbt0308-303
        • Cover T.M.
        • Thomas J.A.
        Elements of Information Theory.
        2nd Edition. Wiley-Interscience, 2006
        • Friedmann-Morvinski D.
        • Verma I.M.
        Dedifferentiation and reprogramming: origins of cancer stem cells.
        EMBO Rep. 2014; 15: 244-253 https://doi.org/10.1002/embr.201338254
        • Feldman D.P.
        • McTague C.S.
        • Crutchfield J.P.
        The organization of intrinsic computation: Complexity-entropy diagrams and the diversity of natural information processing.
        Chaos. 2008; 18: 043106 https://doi.org/10.1063/1.2991106
        • Benoit B.
        • Hochmuth C.E.
        • Jasper H.
        Maintaining Tissue Homeostasis: Dynamic Control of Somatic Stem Cell Activity.
        Cell Stem Cell. 2011; 9: 402-411 https://doi.org/10.1016/j.stem.2011.10.004
        • Knoll A.H.
        • Nowak M.A.
        The timetable of evolution.
        Sci. Adv. 2017; 3: e1603076 https://doi.org/10.1126/sciadv.1603076