A parameterless one-dimensional model for carcinogenesis in the gene expression space

Any variation in gene expression is a shift in GE space. We conceptualize two types of GE variations: small shifts and large rearrangements. Naively, one can associate small shifts with variations in the expression of one or a few genes, whereas large GE rearrangements are coordinated variations in the expression of many genes.

Small variations in GE levels occur spontaneously and can have different origins. First, somatic mutations in the human genome are known to occur at a rate of 8 per cell generation17. Second, epigenetic events (mainly methylation and phosphorylation) that alter normal expression levels also accumulate at a certain rate18. Both processes could be stimulated by inherited mutations19,20 or external carcinogens21.

We can thus write the following equation for the \(x_1\) coordinate, which characterizes the micro-state of a crypt at an instant \(t=n+1\):

$$\begin{aligned} x_1^{(n+1)}=x_1^{(n)}+\delta x_1, \end{aligned} \tag{1}$$

where

$$\begin{aligned} \delta x_1=\mathbf{v}_\mathbf{1}\cdot \delta {\hat{\mathbf{e}}}=\sum _{i}v_{1i}\, \delta \hat{e}_i, \end{aligned} \tag{2}$$

and \(\delta \hat{e}_i\) corresponds to a random variation of the expression of the i-th gene. Eq. (1) describes a chain of Markov events22. On the other hand, Eq. (2) shows that fluctuations in expression levels are filtered by the \(\mathbf{v_1}\) vector.

Fig. 2 shows the 30 genes with the greatest contributions to \(\mathbf{v_1}\) in COAD11. Positive, \(v_{1i}>0\), and negative, \(v_{1i}<0\), amplitudes correspond to over- and under-expressed (silenced) genes in tumor progression, respectively. We highlight the CST1 and AQP8 genes. The former is a known marker for colon cancer23, while the latter plays an important role in the homeostasis of the colon24 and must be silenced in tumors.

Figure 2

The 30 genes with the most significant contributions to the \(\mathbf{v_1}\) vector in COAD. The x-axis is the sequence number of a given gene in the TCGA data. CST1 is highlighted among the overexpressed genes and AQP8 among the silenced genes.

The maximum value of \(|v_{1i}|\) defines a scale, D, for fluctuations in \(x_1\). In COAD, it coincides with the modulus of the \(v_{1i}\) linked to the AQP8 gene. In order to get a simple estimate of cancer risk, we can adopt the following model for the fluctuations: \(\delta x_1=D\,r\), where r is a uniformly distributed random number in (-1,1). This model may result from an assumption of independent variations, i.e. random amplitudes and signs in the individual gene variations \(\delta \hat{e}_i\), so that most of them cancel. In this way, Eq. (1) for small displacements in GE space describes a 1D Brownian or Poisson process25.
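The fluctuation model \(\delta x_1=D\,r\) can be illustrated with a short simulation. The following is a minimal sketch, not the authors' code: the function name, the starting point at the origin, and the numerical values of `D`, `n_steps` and `n_walkers` are assumptions chosen only for illustration. It checks that the spread of the final positions grows like \(D\sqrt{t}\), as expected for a diffusive process.

```python
import numpy as np


def simulate_final_positions(D, n_steps, n_walkers, seed=0):
    """Final x_1 after n_steps updates x <- x + D * r, with r ~ U(-1, 1).

    All walkers start at x_1 = 0 (an assumed origin in the normal zone).
    """
    rng = np.random.default_rng(seed)
    steps = D * rng.uniform(-1.0, 1.0, size=(n_walkers, n_steps))
    return steps.sum(axis=1)


# Illustrative values only; they are not taken from the paper.
D = 0.1
n_steps = 400
final_x = simulate_final_positions(D, n_steps, n_walkers=5000, seed=1)

# A uniform r on (-1, 1) has variance 1/3, so the spread after n_steps
# independent steps should be close to D * sqrt(n_steps / 3).
expected_std = D * np.sqrt(n_steps / 3.0)
print(f"empirical std = {final_x.std():.3f}, expected = {expected_std:.3f}")
```

The sketch only probes the \(\sqrt{t}\) scaling of the spread; the numerical prefactor depends on the step distribution that is assumed.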

One can use the well-known fact that in a Brownian process, the final amplitudes at a given instant are normally distributed, i.e. the probability density is given by:

$$\begin{aligned} p(x)=\sqrt{a/\pi }\, e^{-a (x-x_0)^2}, \end{aligned} \tag{3}$$


where \(a=2/(D^2 t)\). We will evaluate the probability that a trajectory starting in the normal zone reaches the tumor zone. As pointed out above, the minimum step length is \(R=\bar{x}_1-R_n-R_t\). Thus, a risk estimate can be obtained from:

$$\begin{aligned} \int _{R}^{\infty } p(x)\, \mathrm{d}x=\text {Erfc}(\sqrt{a R^2}), \end{aligned} \tag{4}$$


where \(\text {Erfc}(z)\) is the complementary error function. The argument of this function is \(z=\sqrt{a R^2}=\sqrt{2/t}\,R/D\), in principle a large number. We can then use the asymptotic behavior \(\text {Erfc}(z)\approx \exp (-z^2)/(\sqrt{\pi }z)\) for large z. The cancer risk in COAD is obtained by multiplying the escape probability for a single crypt by the number of crypts, or by the number of stem cells, which is proportional to it:

$$\begin{aligned} \text {risk} \sim N_{sc}\, \frac{D\sqrt{t}}{R}\,e^{-2(R/(D\sqrt{t}))^2 }, \end{aligned} \tag{5}$$

or, in logarithmic form,

$$\begin{aligned} \ln (\text {risk}/N_{sc}) = \text {const} + \ln (D\sqrt{t}/R)-2(D\sqrt{t}/R)^{-2}. \end{aligned} \tag{6}$$


This expression is general enough to apply to tissues other than the colon. The constant in Eq. (6) may account for other effects such as, for example, the role of the immune system. Microregions escaping the normal region and forming a prototumor could be under attack by the immune system at a very early stage26. By definition, the constant is less than zero because the overall constant in Eq. (5) is less than one.
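As a numerical check of the steps leading to Eq. (6), the sketch below evaluates the complementary error function, its asymptotic form, and the parameter-dependent part of Eq. (6). The function names and the values of `D`, `t` and `R` are hypothetical and serve only as an illustration; they are not taken from Table 2.

```python
import math


def escape_probability(D, t, R):
    """Erfc(z) with z = sqrt(2 / t) * R / D, as written in the text."""
    z = math.sqrt(2.0 / t) * R / D
    return math.erfc(z)


def escape_probability_asymptotic(D, t, R):
    """Large-z form exp(-z**2) / (sqrt(pi) * z)."""
    z = math.sqrt(2.0 / t) * R / D
    return math.exp(-z * z) / (math.sqrt(math.pi) * z)


def eq6_rhs(D, t, R):
    """Parameter-dependent part of Eq. (6): ln(u) - 2 * u**-2, u = D*sqrt(t)/R."""
    u = D * math.sqrt(t) / R
    return math.log(u) - 2.0 / (u * u)


# Illustrative values only (they give z = 5).
D, t, R = 1.0, 2.0, 5.0

exact = escape_probability(D, t, R)
approx = escape_probability_asymptotic(D, t, R)
print(f"Erfc(z) = {exact:.3e}, asymptotic form = {approx:.3e}")

# The log of the asymptotic form equals eq6_rhs plus a negative constant,
# ln(1 / sqrt(2 * pi)), which is one contribution to "const" in Eq. (6).
offset = math.log(approx) - eq6_rhs(D, t, R)
print(f"constant offset = {offset:.4f}")
```

In this sketch the asymptotic prefactor alone already yields a negative additive constant; any further effect, such as the immune response mentioned above, would have to be absorbed into the same constant.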

In Table 2, we compile a set of parameters for a group of tumors. The geometry of the normal and tumor regions, i.e. the parameters \(\bar{x}_1\), \(R_n\) and \(R_t\), comes from Ref. 10. The value of D is estimated as the maximum of \(|v_{1i}|\)11. On the other hand, the number of tissue stem cells, \(N_{sc}\), the stem cell renewal rate, \(m_{sc}\), and the lifetime cancer risk (when available) are taken from Refs. 27,28. Reported risk values represent averages over more than 380 cancer registries from different cities and countries around the world28.

Table 2 A set of compiled parameters for a group of tumors.

We can test Eq. (6) for cancer risk in a tissue resulting from small random variations in GE levels using the data included in Table 2. A plot of the left-hand side versus the right-hand side of Eq. (6) must lead to a straight line with a slope close to one and a constant less than zero. Note that the life expectancy in Ref. 27 is assumed to be 80 years. Thus, t is obtained by multiplying the stem cell renewal rate, \(m_{sc}\), by 80 years.
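Purely as an illustration of the procedure, the sketch below fits a straight line to the left-hand side of Eq. (6) against its right-hand side. It uses synthetic numbers generated from the model itself, not the Table 2 values, so by construction it recovers a slope of one; the function names, the arrays and the constant `const_true` are assumptions made for this example.

```python
import numpy as np


def eq6_rhs(D, t, R):
    """Right-hand side of Eq. (6) without the constant."""
    u = D * np.sqrt(t) / R
    return np.log(u) - 2.0 / u**2


def fit_eq6(risk, n_sc, D, t, R):
    """Least-squares line through ln(risk / N_sc) versus eq6_rhs."""
    x = eq6_rhs(D, t, R)
    y = np.log(risk / n_sc)
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept


# Synthetic parameters for 8 hypothetical tissues (NOT the Table 2 data).
D = np.linspace(0.8, 1.5, 8)
R = np.full(8, 40.0)
m_sc = np.linspace(5.0, 60.0, 8)   # stem cell renewals per year
t = m_sc * 80.0                    # 80-year lifetime, as in the text
n_sc = np.logspace(6, 9, 8)

# Generate risks that obey Eq. (6) exactly with an arbitrary negative constant.
const_true = -25.0
risk = n_sc * np.exp(const_true + eq6_rhs(D, t, R))

slope, intercept = fit_eq6(risk, n_sc, D, t, R)
print(f"slope = {slope:.6f}, intercept = {intercept:.6f}")
```

With real data, a slope far from one in such a fit would signal that Eq. (6) does not capture the dependence of the risk on the parameters.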

The results of this test are shown in Fig. 3. We get an almost flat curve (slope = \(2.1\times 10^{-5}\), or \(1.5\times 10^{-4}\) if we leave LUAD and THCA out of the fit), indicating that the proposed dependence of the risk on the parameters is not correct. Thus, the observed cancer risk cannot be explained by random variations of low amplitude in GE values. In the next section, we will consider large GE rearrangements, or equivalently large jumps in GE space.

Figure 3

A test of how Eq. (6) describes cancer risk in 8 tissues. The data in Table 2 are used for this purpose. A very small slope is obtained in both the full fit and the fit without LUAD and THCA; thus, small-amplitude fluctuations in the gene expression space may not explain cancer risk in these tissues.

Note that we use an expression like \(t=m_{sc}\times \text {age}\) over a very wide age range. It is well known that \(m_{sc}\) experiences a significant decrease due to aging29,30. However, also due to aging, there is an accumulation of epigenetic events and DNA damage leading to reduced fitness and a shift to the low fitness zone. Thus, aging acts in the same direction as low amplitude fluctuations in GE values.

Christy J. Olson