Umlud's place: Projecting beyond the data = bad methodology

As tempting as it might be to extrapolate what seems to be a very obvious curve, it is methodologically wrong to do so for very many reasons. The biggest one is that you don't know what the data will actually show when you look beyond it.

Today, the Pharyngula blog presents this graph of THE FUTURE:

It purports to show the demise of Christianity in the year 2240, based on about 40 years of data that begin around 1970. From this, the extrapolation for the next 230 years is done as a straight line, and that's just methodologically insanely incorrect.

True enough, the graph does seem to show a very strong linear correlation from 1970ish to 2010ish, but all that means is that we know the trend during that time. It might be able to say something about 2011, maybe even something about 2015. However, the further out one draws the graph based only on this single parameter (percentage of Christians in the US), the less powerful the prediction becomes, and a straight line does nothing to model the underlying drivers of that parameter.

When constructing a relationship that provides predictive power, the dependent variable (the value on the y-axis) needs to be a result of the independent variable (the value on the x-axis). Only then can you say that knowing the value of x lets you determine the value of y. The best such predictions come from things that have direct cause-effect relationships, such as simple kinematic equations.

Since simple kinematic equations deal with direct cause-effect relationships (because they assume no friction and free movement of the mass), it is possible to say that if I have a mass moving with an initial velocity of 10 m/s and an acceleration of 0.1 m/s/s, then after 10 seconds the mass would have traveled 105 m -- even if I never actually moved a mass with that velocity and acceleration for that amount of time. The prediction is only correct in a condition where there is no friction, which is a rather limiting assumption. Even with this limitation, kinematics still works very well, because the equations describe a deterministic outcome based on specific variables.
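The arithmetic behind that 105 m figure can be checked in a few lines of Python. This is only a minimal sketch of the standard constant-acceleration formula, s = v0*t + (1/2)*a*t^2; the function name `displacement` is a label chosen here for illustration, and friction is ignored, as in the text.

```python
def displacement(v0, a, t):
    """Distance traveled (m) under constant acceleration, ignoring friction.

    v0: initial velocity in m/s
    a:  acceleration in m/s^2
    t:  elapsed time in s
    """
    return v0 * t + 0.5 * a * t ** 2


# Values from the example above: 10 m/s, 0.1 m/s/s, 10 s.
# 100 m comes from the initial velocity and 5 m from the acceleration.
print(displacement(10, 0.1, 10))
```

The prediction works for any time you plug in because the formula encodes the cause (velocity and acceleration), not just a pattern seen in past measurements.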

In the graph presented (facetiously) by Pharyngula, the year is not a predictor of the percentage of Christians in the US. If it were, then the US would have had a greater-than-100% population of Christians during the 1950s! Furthermore, if year were a predictor of the percentage of Christians in the US, what would it mean when the line goes negative after 2240? Nothing, of course: a share of the population cannot be negative. This shows in another way why the relationship cannot be extrapolated beyond the data. In many cases, what year you are in is not a great predictor of a phenomenon.
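To make the out-of-range problem concrete, here is a rough Python sketch of a straight line treated as a "predictor" of a percentage. The only number taken from the graph is the zero crossing in 2240. The year 1955 for the 100% crossing is an assumption made here as a stand-in for "during the 1950s", so the slope is illustrative and is not the actual fit from the graph.

```python
ZERO_YEAR = 2240      # from the graph: where the straight line reaches 0%
HUNDRED_YEAR = 1955   # ASSUMPTION: stand-in for "during the 1950s"

# Slope in percentage points per year implied by those two crossings.
SLOPE = -100.0 / (ZERO_YEAR - HUNDRED_YEAR)


def linear_percent(year):
    """Percentage 'predicted' by the straight line for a given year."""
    return SLOPE * (year - ZERO_YEAR)


def is_valid_percentage(value):
    """A share of a population must lie between 0 and 100."""
    return 0.0 <= value <= 100.0


for year in (1950, 1990, 2300):
    value = linear_percent(year)
    print(year, round(value, 1), is_valid_percentage(value))
```

Under these assumed numbers, the line gives a possible value for 1990 but impossible ones for 1950 (above 100%) and 2300 (below 0%), which is the sign that the line describes the observed window and nothing else.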

When I was taking undergraduate biology, I took one class of sport science (a.k.a. kinesiology). In it, one of the lecturers pulled up a graph of the changes in performance of male and female athletes over a variety of track events over time. All this was part of an explanation that while male athletes are able -- due to major physiological differences between men and women -- to have significantly faster sprint times than women, the two sexes were much more even when it came to long-distance running. She pointed to a series of graphs that showed how female sprint times -- while declining throughout the 1980s -- remained well above male times, and how female long-distance running times had been declining rapidly and seemed to be approaching male times. She made the unfortunate error of hypothesizing -- based on the existing trend -- that female runners would catch up to male runners within the next 20 to 30 years. I think of this example every time I look at extrapolating from a limited dataset, and I will try to re-create the long-distance case here.

If we were in 1981 (I will use 1981 instead of 1996 -- when I was a student at university taking that class -- because it describes a situation that can be then tested based on currently existing data, as you will see), and were looking at the yearly best times for men's and women's 1500m outdoors track races, we would see that over the previous 10 years, women's 1500m times had been dropping more precipitously than men's times. Following that trend as a linear function, we would have then made the prediction that women would be running the 1500m faster than men by the year 2005. However -- not too strangely -- this didn't happen, and women -- although closing the gap -- have not yet equaled or surpassed men in terms of yearly 1500m track races (it's a similar story for the Boston Marathon).
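The kind of calculation behind such a prediction can be sketched in Python. The starting times and yearly rates below are made-up illustrative values, not the real yearly bests; they were picked only so that the two straight lines cross in 2005, the year mentioned above. The function name `crossing_year` is likewise just a label chosen for this sketch.

```python
def crossing_year(ref_year, men_time, men_slope, women_time, women_slope):
    """Year in which two straight-line trends meet.

    Times are in seconds at ref_year; slopes are in seconds per year
    (negative means times are getting faster).
    Returns None if the women's line is not closing on the men's line.
    """
    gap = women_time - men_time            # seconds still to close
    closing_rate = men_slope - women_slope  # seconds closed per year
    if closing_rate <= 0:
        return None
    return ref_year + gap / closing_rate


# HYPOTHETICAL inputs for illustration only (not measured data):
# men at 212 s improving 0.5 s/year, women at 236 s improving 1.5 s/year.
print(crossing_year(1981, 212.0, -0.5, 236.0, -1.5))
```

Nothing in this calculation knows about training, talent pools, or physiology. It assumes both rates stay constant forever, which is exactly the assumption that failed.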

Why haven't female 1500m athletes surpassed male 1500m athletes, and why did it look like they could have done so in the first place? The answers help explain why looking only at the year to predict future performance is bad methodology. In this specific case, I chalk it up to two major factors: training and physiology. Women have only recently been allowed to compete in the 1500m (and Boston Marathon), starting in 1970 (and 1966). During the years when there was no way to get official sanction to compete in international events, there was little incentive for women to train in the event, and so the pool of potential athletes from which to draw was much smaller and less funded, and -- as a result -- its runners were slower than the runners of today. (Running technology and training practices have also evolved over time, which can only add to the differences we see between the runners of today and the runners of 40 years ago.) However, after women were allowed to compete in long-distance running, more money allowed for better training and for a larger pool of athletes. This meant that during the initial years of officially sanctioned competition, the ability of the athletes improved along with the raw talent pool, and the annual best times dropped precipitously. At some point during the 1980s, though, women started to approach the physiological limit of the human body (even if we assume that running technology and training practices have continued to improve over this time as well). From this point on, women's times seem to have remained effectively unchanged, oscillating up and down, but hovering always just under 240 seconds (4 minutes).

In contrast, during the same period, men already had a relatively large talent pool from which to draw and had probably reached the limits of physiology based on previous training; their slow improvement over time is likely due to advances in running technology and training practice. Just as the women's best annual times hover around 240 seconds, the men's best annual times hover around 210 seconds. I would imagine that neither men's nor women's record times will decrease significantly without some major advance in running technology or change in physiology (or the allowance of performance-enhancing drugs).

This explanation for the change in women's best annual 1500m times before and after 1980 presents a mechanism (that I personally think is plausible) for why the imaginary extrapolation done in 1981 never came true. It also shows why doing such extrapolations (even over a much shorter period of time than in the graph presented in Pharyngula) is fraught with problems.

Going back to the presented graph, we therefore have to determine what might be the underlying reasons for the decline, and whether the decline of Christians as a percentage of the US population would level off. I believe that the major forces behind this trend might be immigration and disillusionment. The United States has had increased immigration from non-Christian countries since the 1970s. There has also been an increased distancing from Christianity during this time, as people feel that Christianity (or possibly any religion) has less relevance to their daily lives. These are the two major reasons for the decline: growth in the population of non-Christians and people leaving Christianity. Immigration alone will not lead to an eradication of Christianity in the population, and it would also be unlikely to follow a linear trajectory. Nor should the arrival of non-Christian immigrants be seen as purely a loss for Christianity, since it provides Churches with a new population from which to draw followers; these people may not be disillusioned by Christianity, like those who "left the fold."

People moving away from the Christian faith is likely to change how the faith is practiced, as the Christian Church (as a whole) or smaller denominations attempt to keep the faith relevant to those who have "lost the way". In other words, the Church will (again, like it has several times in the past) have to re-model itself in order to stay relevant to a changing perception of existence and man's place in it. Christians are highly unlikely to just stand idly by and let Christianity fade into oblivion.

In the end, in an ever more cosmopolitan country, domination by one religion at a percentage greater than 95% seems less and less possible. People want to come to the US, even though they aren't Christian. People who are Christian may well feel that it no longer applies to their lives. And people will be introduced into Christianity. The same is likely to happen with all the other religions, philosophies, and codes that abound in the country.

Bottom line: Christianity dead by 2240? It's bad methodology, and the underlying social mechanisms that keep social institutions alive will not allow for it to happen.

Wednesday, July 14, 2010
