Spearman’s Law of Diminishing Returns: Implications for High End Testing

 

By Bob Williams

 

 

The vast majority of standard IQ tests have been designed to test for the middle range of ability, roughly 2.5d[1] above and below the mean.  Over that range tests are designed to force a Gaussian distribution curve.  At the high end of the intelligence spectrum standard tests begin to have problems, such as inadequate ceiling, insufficient statistical verification, the potential for increased errors, etc.  Of the host of factors that contribute to the difficulties in finding or even designing an ideal high end test, there is one that is intriguing and of unknown magnitude—Spearman's Law of Diminishing Returns (SLDR).  The following discussion is aimed at exploring the nature of SLDR and its implications.

 

 

SLDR

 

Charles Spearman discovered and reported psychometric g (also known as Spearman’s g, the general factor, or just g) in 1904, and he later made the observation that is now known as SLDR.  He wrote (Spearman 1927): “The correlations [between different tests] always become smaller—showing the influence of g on any ability to grow less—in just those classes of person which, on the whole, possess this g more abundantly. The rule is, then, that the more ‘energy’ [i.e., g] a person has available already, the less advantage accrues to his ability from further increments of it.”  See Jensen (2003), Appendix A.  This can be restated in a number of ways; for example, it suggests that the variance in intelligence among bright people is less related to their differences in g than is the case for dull people and, conversely, that the correlations between test scores are higher for dull people than the correlations between the same tests are for bright people.  SLDR is a psychometric extension of the general concept of diminishing returns that has been discussed extensively in economics and is seen in engineering and other fields.

 

Jensen considered SLDR important enough to devote an appendix to it in The g Factor (Jensen 1998).  Here are a few comments from that appendix:

 

The higher a person’s level of g, the less important it becomes in the variety of abilities the person possesses.  Higher-g persons have more diversified abilities, with more of the total variance in their abilities existing in the non-g factors (i.e., the various group factors and specificity).

 

Like money, g isn’t very important if one has enough of it.

 

Persons with low IQs have less efficient central processes, hence overall low performance on most kinds of cognitive tasks.  Persons with higher IQs have more efficient central processes but may vary considerably in the less central, narrower processes.  Consequently, there should be higher correlations (and more g variance) among various tests in a low-IQ group and lower correlations (less g variance) in a high-IQ group.

 

[Related only by inclusion in the referenced appendix: At the end of the appendix, Jensen included two paragraphs that suggest that the heritability of IQ increases at high IQ levels and decreases at lower levels.]

 

 

IQ testing

 

IQ tests are typically composed of a number of subtests, each consisting of multiple test items.  The idea is to test known areas of intelligence (such as verbal, spatial, numerical, etc.) and to sum the scores.  In most cases, the subtest scores are not weighted, but the Woodcock-Johnson III is an exception.[2]  Irrespective of the scoring method, various cognitive abilities are combined to produce a final raw score.  That score is then converted into IQ by determining its standing (percentile) with respect to a large sample of test scores that are forced to fit a Gaussian curve.[3]  This process has been carefully documented (Jensen 1980) and is the basis of the most common means of intelligence evaluation.  Most of the literature on test design, norming, and distributions is based on the range of about +/- 2.5d.  For most purposes, this range is adequate.  Throughout most of the range, the g loading of a test is responsible for almost all of its external validity.  The g loading can only be determined by factor analyzing[4] the test for a relatively large number of testees.
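The percentile-to-IQ conversion described above can be sketched in a few lines of Python.  This is only an illustration, not the procedure of any particular test: the function names, the toy norming sample, and the scale of mean 100 and standard deviation 15 are all assumptions made for the example.

```python
from statistics import NormalDist

def percentile_rank(raw_score, norm_sample):
    """Fraction of the norming sample scoring below raw_score (ties count half)."""
    below = sum(1 for s in norm_sample if s < raw_score)
    ties = sum(1 for s in norm_sample if s == raw_score)
    return (below + 0.5 * ties) / len(norm_sample)

def iq_from_percentile(p, mean=100.0, sd=15.0):
    """Map a percentile (0 < p < 1) onto a Gaussian scale.

    The mean of 100 and SD of 15 are assumed conventions for this sketch.
    """
    return mean + sd * NormalDist().inv_cdf(p)

# Hypothetical norming sample: raw scores 1..99, one testee per score.
norm_sample = list(range(1, 100))
p = percentile_rank(50, norm_sample)   # the median raw score
print(p, iq_from_percentile(p))

# A score 2.5d above the mean corresponds to roughly the 99.4th percentile.
print(NormalDist().cdf(2.5))
```

Note that a raw score above every score in the norming sample gives a percentile of exactly 1, for which the Gaussian conversion is undefined.  In this sketch, the size of the norming sample therefore caps the IQ that can be assigned, which is one way of seeing the ceiling problem.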

 

So, the issue arises as to whether SLDR matters enough to cause a disruption to the normal testing procedure for testees who score above the usual range of interest.  If the SLDR effect is large at the upper end, then very bright individuals may be very bright not just because they fall higher on a g scale, but because they possess specific cognitive abilities that account for a significant portion of their high test scores.  Unfortunately, it is difficult to impossible to simply measure g loading for small increments of the total distribution.[5]

 

 

Loading and factor analysis

 

If an IQ test is given to a large, randomly selected group of testees, the resulting responses can be factor analyzed, using a procedure that was invented by Charles Spearman.  It should be noted that other methods of factor analysis can be, and have been, suggested, but the procedural differences do not obscure the important utility of factor analysis, which is to identify the correlations between test items and groups of test items.

 

A very brief explanation of factor analysis follows.  See Appendix A for additional comments and a short discussion of group factors.

 

The correlations between test items are determined from large numbers of responses and grouped so that those with the highest correlations form related groups.  Tests are often constructed in the format of subtests that typically show the expected correlations.  In the case of the WAIS-III, these are Information, Vocabulary, Similarities, Comprehension, Arithmetic, Digit Span, Letter-Number Sequencing, Picture Arrangement, Picture Completion, Matrix Reasoning, Block Design, Coding, Symbol Search, and Object Assembly.  Each of these subtests consists of multiple test items of varying difficulty.  All of these test items are correlated to varying degrees.  As the correlations are combined, higher level categories can be identified.  In the case of the four-factor model (Juan-Espinosa, et al., 2002), they combine as follows:

 

Verbal Comprehension

Information, Vocabulary, Similarities, Comprehension

 

Perceptual Organization

Picture Arrangement, Picture Completion, Matrix Reasoning, Block Design

 

Working Memory

Arithmetic, Digit Span, Letter-Number Sequencing

 

Processing Speed

Symbol Search, Coding

 

These four factors can define g as the factor that is common to all four groups.  It is apparent from the nature of these categories that some of the test items (see Processing Speed) must be related to time.  Others (most of them) are not time related and are usually not timed.  The ultimate goal of the test should be to get an accurate value of g (remember this is a measure of the correlation between the group factors that emerge from the test).  After g is removed from the other factors, the residue consists of random error plus specific abilities that do not call upon g.  If the residual abilities are evaluated for their correlation with measures of external validity (academic performance, job performance, etc.), very little correlation will be found, at least over the usual range of interest.[6]  At the high end, this may not be true.

 

All IQ test items load on three factors:  g, s, and e, such that $g^2 + s^2 + e^2 = 1$.

 

g = the general factor; the final extraction of a factor analysis[7]

s = specificity; an ability that is not correlated with g

e = random error[8]

 

It is obvious that, if e is constant, there exists a trade-off between g and s, such that increasing either of them necessarily must be accompanied by a reduction in the other.  This means that if g loading decreases at the high end, the s loading must increase.  Specificity may be thought of as an error (true only in the sense that it is not generally the objective of the test), as a specific ability, or as a learned response.  When test items are novel, they must be resolved by cognitive processes that call upon innate abilities that are presumably the essence of g.  Few, if any, test items actually have zero specificity loading for all testees.  Some people, for example, have a non-g ability that applies to certain test items (such as series completion, or rotation) and other people have abilities that enhance their performance on other categories of test items. 
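The trade-off can be made concrete with a small calculation.  The sketch below treats the squared loadings as shares of an item's variance that sum to 1; the function name and the numbers used (g loadings of .80 and .60, an error share of .10) are hypothetical values chosen only for illustration.

```python
def specificity_share(g_loading, error_share):
    """Return s^2, given a g loading and the error share e^2.

    Uses the identity g^2 + s^2 + e^2 = 1 from the text.
    """
    s2 = 1.0 - g_loading ** 2 - error_share
    if s2 < 0:
        raise ValueError("g^2 + e^2 exceeds 1; no variance is left for s")
    return s2

# Hypothetical item with a constant error share of .10.
for g in (0.80, 0.60):
    print(f"g loading {g:.2f}: g^2 = {g ** 2:.2f}, s^2 = {specificity_share(g, 0.10):.2f}")
```

With the error share held constant, dropping the g loading from .80 to .60 moves the specificity share from .26 to .54, so in this example the non-g share of the variance roughly doubles.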

 

Besides individual ability differences that are presumably genetic, individuals have different specific abilities that are the result of learning a specific ability.  For example, the digit span test used in the Wechsler is moderately g loaded, but an individual can be trained to recall many multiples of the usually found span.[9]  Such training lowers the g loading of that test item and increases its s loading.[10] For testing purposes, this “teaching to the test” basically increases the testing error by displacing some of its g loading.  There may be implications here, if SLDR causes an increase in overall test s loading for very bright people.

 

 

Verification of SLDR

 

A number of papers have been published that report studies of the magnitude of SLDR.  Some have found that SLDR is real and measurable, but also that it is not a large effect.  A few papers have reported no effect.  The procedural requirements of measuring g loading as a function of intelligence are such that the only realistic option is to compare the top half of the distribution against the bottom half, or to compare samples that lie above and below the mean, but not close to the top end.  If the distribution function is not linear, there could be large effects at the upper end that are completely missed by such studies.  In fact, the procedures used could not possibly identify large magnitude effects at the upper end because the small number of data points in that range would necessarily be lost as the top half is factor analyzed.

 

The only recourse at present, is to attempt to reason and extrapolate what may be happening at the upper end of the intelligence spectrum.  There is no doubt that intelligence extends well beyond the range of most standardized tests; the question is how much of that high ability is due to g and how much is due to specific abilities that are not correlated with other abilities.

 

Perhaps the strongest indicator that SLDR is large enough to matter is the well documented decrease in test to test correlation as a function of IQ group.    Detterman (1991) wrote: “Low IQ subjects showed much higher correlations than high IQ subjects. Intercorrelations of IQ subtests, correlations of cognitive ability measures with each other, and correlations of IQ with measures of cognitive abilities all displayed the same effect.  For both the WAIS-R and WISC-R, average subtest correlations were highest in the low ability group. Correlations declined systematically with increasing IQ. In both studies, correlations were found to be two times higher in low IQ groups than in high IQ groups.   Measures from the basic tasks correlated more highly in the low IQ group than in the high IQ group.” (Detterman 1991)
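The link between g loading and test-to-test correlation can be illustrated with a small simulation.  This is a sketch under a deliberately simple assumption, not a reanalysis of any data: every test score is modeled as a g loading times g plus an independent unique part, so the expected correlation between two tests is the product of their loadings.  The loadings of .85 and .65, the group size, and the number of tests are hypothetical.

```python
import random
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def simulate_group(g_loading, n_people, n_tests, rng):
    """Each score = g_loading * g + sqrt(1 - g_loading^2) * unique part."""
    unique = sqrt(1.0 - g_loading ** 2)
    tests = [[] for _ in range(n_tests)]
    for _ in range(n_people):
        g = rng.gauss(0.0, 1.0)
        for scores in tests:
            scores.append(g_loading * g + unique * rng.gauss(0.0, 1.0))
    return tests

def mean_intercorrelation(tests):
    """Average Pearson correlation over all pairs of tests."""
    pairs = list(combinations(tests, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

rng = random.Random(0)  # fixed seed so the run is repeatable
low_r = mean_intercorrelation(simulate_group(0.85, 2000, 4, rng))
high_r = mean_intercorrelation(simulate_group(0.65, 2000, 4, rng))
print(f"mean correlation, higher g loading: {low_r:.2f}")
print(f"mean correlation, lower g loading:  {high_r:.2f}")
```

Under this model the expected correlations are .85 x .85 = .72 and .65 x .65 = .42, so a modest drop in g loading produces a much larger drop in the average correlation between tests.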

 

Some attempts to verify SLDR have been flawed by low N or by very small differences in IQ scores between the groups being compared.  Kane and Brand (2001) reported more robust findings for two groups separated by 2d (IQs were 85 and 115) and measured by the WJ-R:

 

 

Primary Ability                     g Loading    g Loading
                                    (Low IQ)     (High IQ)

Fluid Intelligence (gf)             .89          .80
Visual Processing (gv)              .88          .77
Processing Speed (gs)               .95          .75
Long-Term Retrieval (glr)           .69          .65
Crystallized Intelligence (gc)      .84          .39
Auditory Processing (ga)            .81          .65
Short-Term Memory (gsm)             .72          .39
Quantitative Reasoning (gq)         .86          .72

The above data clearly show an SLDR effect, even for a separation that is only 1d above and below the mean.  At 3, 4, or more d above the mean, and with a nonlinear effect, there could be a large increase in the role of non-g abilities.
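The size of the effect in the Kane and Brand loadings quoted above can be summarized with a short calculation.  The loadings are copied from the table; the summary statistics (the mean loading per group and the ability with the largest drop) are my own choice of illustration and are not taken from the original paper.

```python
# g loadings from Kane and Brand (2001) as quoted above: (low IQ, high IQ).
LOADINGS = {
    "Fluid Intelligence (gf)": (0.89, 0.80),
    "Visual Processing (gv)": (0.88, 0.77),
    "Processing Speed (gs)": (0.95, 0.75),
    "Long-Term Retrieval (glr)": (0.69, 0.65),
    "Crystallized Intelligence (gc)": (0.84, 0.39),
    "Auditory Processing (ga)": (0.81, 0.65),
    "Short-Term Memory (gsm)": (0.72, 0.39),
    "Quantitative Reasoning (gq)": (0.86, 0.72),
}

# Mean g loading across the eight primary abilities, per group.
mean_low = sum(low for low, _ in LOADINGS.values()) / len(LOADINGS)
mean_high = sum(high for _, high in LOADINGS.values()) / len(LOADINGS)

# Ability whose g loading falls the most from the low to the high group.
largest_drop = max(LOADINGS, key=lambda name: LOADINGS[name][0] - LOADINGS[name][1])

print(f"mean g loading, low IQ group:  {mean_low:.2f}")
print(f"mean g loading, high IQ group: {mean_high:.2f}")
print(f"largest drop: {largest_drop}")
```

The mean loading falls from about .83 in the low group to about .64 in the high group, and the largest single drop is for crystallized intelligence (.84 to .39).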

 

A very recent study (Facon 2006) examined the appearance of differentiated abilities (SLDR) as a function of age.  Below the age of 12 years, the differences between high and low IQ groups were small, but they were substantial by age 13-15.  The high IQ subjects maintained a relatively constant differentiation between subtest correlations, while low IQ subjects showed increasing subtest correlations after age 12.  Facon also presents a good review of various SLDR studies, commenting “… [SLDR] is now considered practically a fact that remains only to be explained.”

 

 

Biological factors

 

It is well known that a number of biological factors relate to g.  Jensen: “The g factor arises from the empirical fact that scores on a large variety of independently designed tests of extremely diverse cognitive abilities all turn out to be positively correlated with one another. The g factor appears to be a biological property of the brain, highly correlated with measures of information-processing efficiency, such as working memory capacity, choice and discrimination reaction times, and perceptual speed. It is highly heritable and has many biological correlates, including brain size, evoked potentials, nerve conduction velocity, and cerebral glucose metabolic rate during cognitive activity.”  [psycoloquy.99.10.023.intelligence-g-factor.1.Jensen]

 

These biological factors are ultimately limited.  Brain volume, for example, does not extend beyond some limiting point.  Nerve conduction velocity is limited by its own mechanism.  It does not seem unreasonable to postulate that the contributions from the various biological factors are individually subject to natural limits that are nonlinear as they approach their upper extremes.  When reaction time (RT) or inspection time (IT)[11] measurements are taken, they are found to correlate negatively with g, but cannot extend indefinitely because reaction times are ultimately limited by nerve conduction velocity.  A good example of this is shown in Brand (1996), Figure II, 5.  The figure shows a steep IT slope for the lower half of the IQ range and only a slight slope for the upper half.  The point is that, if g is primarily the product of biological components, those components may not vary linearly with increasing IQ test performance.  If so, SLDR is consistent with lowered rates of intelligence contribution from the biological correlates.

 

The other side of the biological question is whether or not group factors[12] are due to the same causes as g, or due to something else.  Jensen made the interesting speculation that when we ultimately understand the role of neural circuits in the brain, they will turn out to have little to do with g and perhaps account entirely for group factors (see Appendix A for a brief discussion of group factors).  This is consistent with the general picture that has emerged, showing g to be a reflection of physiological factors.

 

If the above assumptions are at least partially true, there should be a divergence between g and group factor residuals as intelligence increases, with less of the net cognitive power explained by g and more by group factors.  If this is nonlinear (the author’s guess), the contribution from group factors may be significant at the upper end.  Perhaps an experiment will eventually be designed to quantify what, if anything, is happening between g and s.

 

 

Do group factors matter?

 

Below the right tail, the external validity of IQ tests is almost entirely due to its g loading.  Jensen has repeatedly pointed out that if the g loading is factored out at the group factor level, the external validity of all of the residuals combined is nil.  But at the upper end of the spectrum, he commented (Jensen 2000) “In groups of people with high levels of g, relatively more of the variance in test scores lies in the lower-order group factors and in test specificity. Higher g persons tend to invest their g in a greater variety of intellectual activities and interests than lower g persons; that is, cognitive abilities are more differentiated at higher levels of ability. Analogously, wealthy people spend their money on a greater variety of things than do poor people.”  This observation is easily seen in daily experience and confirms the significance of SLDR at the individual level, not only as it applies to g loading, but also as it applies to human behavior.

 

If a significant part of the variance in intelligence at the high end is attributable to group factor residuals, then those factors cannot be ignored, at least for very bright individuals.  Among the most successful studies of high intelligence is the collection of data sets begun by Julian Stanley[13].  Interestingly, Stanley focused on math ability and the longitudinal studies included the Study of Mathematically Precocious Youth (SMPY) that is based on SAT-M scores given at about age 12 ½.   These longitudinal studies are still collecting data and have demonstrated striking career achievement differentiation between very bright people within the top percentile.  That is, the bottom of the top 1 percent has performed very well with respect to the rest of the IQ distribution, but not as well as the top quartile of that single 1 percent range.  (Wai, Lubinski, and Benbow 2005)  After hearing Lubinski and Wai present related papers in 2004 and 2005, I asked Lubinski if he thought that the cohorts in the SMPY groups achieved their intelligence and success as a result of high g or group factor related abilities.  His reply can be summarized simply as: “both.”

 

 

Implications for testing at the high end

 

(Jensen 2003):  “The total scores of individuals in the upper range of the ability distribution are considerably less g loaded, and consequently are more adulterated by non-g factors and test specificity, than are the scores in the lower range.    In high-ability groups, those tests that have the larger g loadings in the whole population systematically show the least decrement in g loading, and those tests that have the smallest g loadings show the most decrement. … A battery composed of diverse subtests is needed in order to minimize the proportion of the variance in the total test scores that is contributed by lower-order factors and test specificity. It should be possible to select or construct a large battery of highly reliable subtests having quite diverse content and information-processing demands yet maximizes their g variance, with all of the subtests having approximately equal g loadings.”  [underlining added]  Jensen was arguing that a battery of the most heavily g loaded subtests would be optimum for measuring intelligence, but pointed out that such tests would essentially hide non-g abilities that might be important at high levels.

 

Evans (1999): “The possibility of a breakdown of g at higher levels of intelligence, even with a narrow range of tests (as in the Armed Services Vocational Aptitude Battery) implies that we may have to reexamine the nature of intelligence.” … “There may be a single driving factor at low levels of g, but this may be manifested in a variety of different ways at high levels of g.”

 

This leads to the consideration that testing at the high end has inherent obstacles beyond those that relate to low N, verification of the incremental increases in difficulty, too few test items, and the lack of sufficient data to establish external validity.  The very nature of existing IQ tests, combined with SLDR means that one test will sort people this way, another will sort them that way, etc., because each test will measure and weigh group factors differently.  If the tests are designed to minimize group factor s loadings, g will presumably be measured properly but a large source of the variance in high end intelligence will be excluded.  Either testing for the specific abilities or trying to exclude them will introduce some difficulties.

 

Consider the case of the typical IQ test (structured with multiple subtests) that will pick up specific abilities.  What is the proper way to treat the non-g portions of the total score?  If they are added equally, the test design will strongly influence the scores of individuals on the basis of the particular group factors that are measured.  Some tests do not even attempt to measure some group factors, so an individual with strong specific ability in such a group factor would be penalized.  Arguably, some group factors have a lot of external validity (math, for example), while others may have less, none, or validity only in certain situations (musical ability).  Is it proper to score a non-g verbal increment equally to an identical non-g math increment, even if the external validities are different?  The answer likely lies with the intent of the test designer, but his decision will not serve to give greater clarity to the reported score.

 

One solution that has a rather limited window of opportunity is the use of well normalized tests for testees who are below the intended age range for the test.  This is the procedure used by Julian Stanley and those who have carried on the search for talented youth.  When the SAT (at least the old SAT) is given at age 12 ½, there is an inherent benefit that the test is very well normed against an older age group and it has a very high ceiling for the younger age group.  If the measurements are restricted to the SAT-M, it probably benefits additionally by picking up math abilities that, at least when combined with high g, seem to enhance the prediction of adult measures of career success.

 

 

Conclusions

 

The variance in intelligence among highly intelligent people is due to a combination of g and narrow abilities.  This divergence between g and s poses a dilemma for designers of tests with high ceilings.  If a test is designed to minimize the influence of narrow abilities, a significant part of the cognitive advantage of bright testees will be missed; if a test is designed to measure specific abilities, the test will inherently reflect a non-standard assessment of intelligence that will not correlate well with other IQ tests.

 

 

Appendix A

 

Group Factors

 

The items that are produced by the first extraction of a factor analysis create groups of related abilities.  At this level, there will be many such groups, but fewer than the number of test items from which they were extracted.  These groups of correlations are then put into a hierarchical matrix and the process is repeated, resulting in fewer correlated groups, known as first-order factors.  When groups are selected, they are assumed to be composed of at least three independent variables of the test (that is, three groups from the prior level of extraction).

 

The first-order factors are obviously going to be more general than the groups from which they were extracted.  The process is repeated again, producing sets of still more general factors, known as second-order factors.  The correlations between the second-order factors form a single factor: psychometric g.

 

Proceeding through the extraction explanation again: the analysis begins with test items; those produce groupings known as tests; those produce first-order factors; those produce second-order factors; those produce g.  In a narrowly constructed test, g may emerge at the second order.  Keep in mind that these are all mathematical representations of correlations.  The things being extracted are correlated; the parts of each extraction that are uncorrelated with anything else are stripped.  The extraction from the second-order factors is the thing that is common to each second-order factor.  The rest of the second-order factors are the uncorrelated parts and are known as the residual variance.
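The idea of pulling out "the thing that is common" can be sketched numerically.  The example below is a simplification and not the hierarchical procedure described above: it takes a single correlation matrix and finds its first principal component by power iteration, then scales it into loadings.  The function name and the 4 x 4 matrix (every pair of tests correlating .50) are hypothetical.

```python
from math import sqrt

def first_component_loadings(corr, iterations=200):
    """Loadings on the first principal component of a correlation matrix.

    Power iteration: repeatedly multiply a vector by the matrix and
    renormalize.  Assumes positive correlations, so that the dominant
    component is a general factor with all-positive loadings.
    """
    n = len(corr)
    v = [1.0] * n
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue for the unit vector v.
    eigenvalue = sum(
        v[i] * sum(corr[i][j] * v[j] for j in range(n)) for i in range(n)
    )
    loadings = [sqrt(eigenvalue) * x for x in v]
    return loadings, eigenvalue

# Hypothetical battery of four tests, every pair correlating .50.
corr = [[1.0 if i == j else 0.5 for j in range(4)] for i in range(4)]
loadings, eigenvalue = first_component_loadings(corr)
print([round(x, 3) for x in loadings], round(eigenvalue, 3))
```

For this matrix the eigenvalue is 2.5 and each test loads about .79 on the common component.  Principal components slightly overstate loadings relative to a true factor analysis because they include each test's unique variance, which is one reason the real procedure is more involved than this sketch.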

 

A great deal of information can be derived from the factor analysis, for example, the loadings of each test on the g factor.  A good bit of the mathematics (such as treating the uncorrelated components as orthogonal) is more or less obvious to anyone who has dealt with related mathematics.  The actual process of going through a factor analysis is much more complicated than this simple sketch and is the subject of entire books.  Most good books on psychometrics contain several pages of discussion of factor analysis and offer a far better explanation than my overly condensed one.  Although factor analysis was developed by a psychometrician for the specific application we are discussing, it has become widely used in many unrelated fields, such as economics.

 

So, where are the group factors?  They are the second-order factors from which g is extracted.  Jensen contends that there are 7 or 8 group factors that have been reliably established.  Other sources have suggested as many as 10 group factors.  The number of first-order factors cannot be reliably stated, but the practical number is claimed to be 50-60.

 

Group factors have identities, based on the paths that support them, and are named accordingly.  The naming may vary from one source to another, but typical examples are verbal, numerical, spatial visualization, memory, and mechanical.

 

It turns out that IQ test scores are rather simply determined by scoring subtests.  Ergo, they capture both g and non-g factors.  But it has been shown that almost all (around 96%) of the external validity of an IQ test is accounted for by its ability to measure g.  That means that the group factor residuals, while accounting for real cognitive abilities, do not contribute much to the measurement of intelligence ... at least over the range of +/- 2.0 or 2.5d.

 

 

Appendix B

 

Tilt

 

Although there are several group factors (see Appendix A), intelligence testing often focuses on math and verbal.  There are frequent references to differences between the sexes and between population groups in these two categories.  When there is a significant difference in individual or group performance between these two categories, the term “tilt” is used.  For example, the United States has a verbal tilt, while East Asian countries have a math tilt (Hunt 2005).  There are economic advantages associated with the math tilt.  These may have implications with respect to the findings that men outperform women in a number of career and cognitive disciplines.  In fact, Murray (2003) discusses the universal attainment gap between the sexes (p. 289) and documents the relative achievements of women in Chapter 12, where he shows that about 2.2% of the significant figures in history from 800 B.C. to 1950 were women.

 

Tilt clearly does not explain all of the reasons why women have not established a rate of accomplishment that matches men.  In the past 2 to 3 years, Lynn, Irwing, and others have demonstrated that mean female IQ is on the order of 5 points below mean male IQ in adults.  This finding is missed in Murray (2003), as the associated studies had not yet been published.  He did not correct that omission in Murray (2005), but did provide an excellent discussion of the findings of male and female achievements (from his earlier book).

 

Aside from its relationship to group factors, tilt has some connection to the point made in the body of this text.  If testing at the high end is significantly related to how the non-g measurements are treated, then what is the most appropriate basis for weighting them?  External validity?  If so, should math scores be weighted more than verbal scores?  It is unlikely that there is a way to derive answers to these questions.  Arguments might be constructed to weigh non-g abilities differently, according to the intention of the test.  Likewise, those non-g components could be discarded, but at the expense of measuring factors that may matter among bright people, even if they matter much less throughout the rest of the IQ spectrum.

 

 

 

REFERENCES

 

Abad, F. J., Colom, R., Juan-Espinosa, M., & García, L. F. (2003). Intelligence differentiation in adult samples. Intelligence, 31, 157-166.

 

Brand, C. (1996). The g Factor: General Intelligence and Its Implications. Chichester, England: Wiley

 

Detterman, D.K. and Daniel, M.H. (1989). Correlations of mental tests with each other and with cognitive variables are highest for low IQ groups. Intelligence 13, pp. 349–359.

 

Detterman, D. K. (1991). Reply to Deary and Pagliari: Is g intelligence or stupidity?  Intelligence, Volume 15, Pages 251-255

 

Juan-Espinosa, M., García, L. F., Escorial, S., Rebollo, I., Colom, R., and  Abad, F. J. (2002).  Age dedifferentiation hypothesis,  Intelligence 30, pp. 395-408.

 

Evans, Martin G. (1999).  On the asymmetry of g, Joseph L. Rotman School of Management, University of Toronto

A similar version of the above paper was offered a year later:

Evans, Martin G. (2000).  Implications of the Asymmetry of g for predictive validity. Yale Conference on Intelligence.

 

Facon, Bruno (2006). Does age moderate the effect of IQ on the differentiation of cognitive abilities during childhood?  Intelligence 34, 375-386.

 

Hunt, Earl (2005).  Speaking at the International Society for Intelligence Research conference.

 

Jensen, Arthur R. (1980). Bias in mental testing. New York: Free Press.

 

Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger.

 

Jensen, Arthur R. (2000). Is There a Self-Awareness of One's Own G Level?, Psycoloquy: 11,#40 Intelligence G Factor (39)

 

Jensen, Arthur R. (2003). Regularities in Spearman’s Law of Diminishing Returns, Intelligence 31/ 95–105

 

Kane, H. & Brand, C. R. (2001). 'The Structure of Intelligence in groups of varying cognitive ability: a test of Carroll's three-stratum theory.' [Provisionally accepted for Intelligence.]

 

Murray, C. (2003).  Human Accomplishment.  New York: Harper Collins.

 

Murray, C. (2005).  The Inequality Taboo, Commentary Magazine.

 

Spearman, C.E., 1927. The abilities of man, Macmillan, London.

 

Wai, J., Lubinski, D., and Benbow, C. P. (2005). Creativity and Occupational Accomplishments Among Intellectually Precocious Youths: An Age 13 to Age 33 Longitudinal Study, Journal of Educational Psychology, Vol. 97, No. 3, 484–492.

 

 


 

 



[1] d = standard deviation units

[2] The WJ-III weights each subtest according to its g-loading.

[3] The fact that a distribution fits a particular function (in this case, Gaussian) over a wide range, does not imply that the distribution will take any particular form outside of that range.  It simply confirms that it is reasonable to use the fit that works over the range that is verified.   There are no good statistics to support the shape of the right tail above this range.  Even the best standard tests begin to look like extrapolations or worse above this range.

[4] Factor analysis is explained in moderate detail in Jensen (1998) and Brand (1996) and briefly in Appendix A.

[5] I asked both Jensen and Bouchard about the practicality of resolving the decline in g loading as IQ increases.  Jensen told me that it was inherently difficult (his implication was more along the lines of "impossible") because the methodology would break down.  Bouchard mostly agreed, but suggested that it may be possible to get a quartile view.

[6] g is the sine qua non of test validity. The removal of g (by statistical regression) from any psychometric test or battery, leaving only group factors and specificity, absolutely destroys their practical validity....

(Arthur Jensen. The g Factor. p270.)

[7] g, unlike any of the primary, or first-order, factors revealed by factor analysis, cannot be described in terms of the knowledge content of cognitive test items, or in terms of skills, or even in terms of theoretical cognitive processes. It is not essentially a psychological or behavioral variable, but a biological one, a property of the brain.  From -- Précis of The g factor: The science of mental ability. Westport, CT: Praeger.

[8] Test size (number of test items) is a limiting factor in what a test can accomplish.  Besides limiting resolution, it determines the magnitude of random error that is retained in a test.  Since random error is, surprise, "random," it declines as the number of test items increases.  If the test contained a huge number of test items, random error would be reduced to insignificance.

[9] Murray (2005) contains an interesting discussion of digit span.  Footnote 59, from that article: “The average adult gets a digits-backward score of 5 (Jensen 1998: 263). You may compare your own score with the highest I have observed, 13 and 12, achieved respectively by José Zalaquett, former chairman of Amnesty International, and the political analyst Charles Krauthammer. Zalaquett’s score might have been higher if he had not been in a car weaving through traffic at 70 miles per hour on the New Jersey Turnpike. Krauthammer’s score might have been higher if he hadn’t been driving.”

[10] The terms g and s are discussed under the heading “Loading and factor analysis.”

[11] For a thorough discussion of how RT and IT are measured, see Jensen (1998).

[12] The term “group factors” is sometimes replaced by “specific abilities,” or “narrow abilities.”

[13] Robert Plomin (1995). Genetics and Intelligence, In N. Colangelo & S. Assouline (Eds.), Talent Development III . Scottsdale, AZ: Gifted Psychology Press

"The high-ability samples for the second phase of the project have been selected from the Study of Mathematically Precocious youth (SMPY), begun two decades ago by Julian Stanley and now co-directed by Camilla Benbow and David Lubinski (e.g., Lubinski, D., & Benbow, C. P. (1992). Gender differences in abilities and preferences among the gifted: Implications for the math-science pipeline. Current Directions in Psychological Science, 1, 61-66.). SMPY includes a total of more than 5,000 gifted students who are currently being tracked. Subjects are selected through above-level testing, a procedure in which 7th and 8th graders scoring in the top 2- 3% on conventional achievement tests are invited to take the College Board Scholastic Aptitude Test (SAT). These students generate score distributions on the SAT indistinguishable from those of 11th and 12th grade high school students. The especially able children are selected for in-depth assessments plus extensive longitudinal tracking at 5- to 10-year intervals. Since 1972, more than a million 7th and 8th graders have been tested with the SAT, and more than 100,000 such students now take the SAT annually. These tests are remarkably predictive of exceptional academic achievements into adulthood (Benbow, C. P. (1992). Academic achievement in mathematics and science of students between ages 13 and 23: Are there differences among students in the top one percent of mathematical ability? Journal of Educational Psychology, 84, 51-61.). "