Spearman’s Law of Diminishing Returns:
Implications for High End Testing
By Bob Williams
The vast majority of standard IQ tests have been designed to
test for the middle range of ability, roughly 2.5d[1]
above and below the mean. Over that range
tests are designed to force a Gaussian distribution curve. At the high end of the intelligence spectrum
standard tests begin to have problems, such as inadequate ceiling, insufficient
statistical verification, the potential for increased errors, etc. Of the host of factors that contribute to
the difficulties in finding or even designing an ideal high end test, there is
one that is intriguing and of unknown magnitude—Spearman's Law of Diminishing
Returns (SLDR). The following
discussion is aimed at exploring the nature of SLDR and its implications.
SLDR
Charles Spearman discovered and reported psychometric g (also known as Spearman’s g, the general factor, or just g) in 1904 and he
made the observation that is now known as SLDR. He wrote (Spearman 1927): ‘‘The correlations [between different
tests] always become smaller—showing the influence of g on any ability to grow
less—in just those classes of person which, on the whole, possess this g more
abundantly. The rule is, then, that the more ‘energy’ [i.e., g] a person has
available already, the less advantage accrues to his ability from further
increments of it.’’ See Jensen (2003),
Appendix A. This can be translated in a
number of ways; for example, it
suggests that the variance in intelligence among bright people is less related to
their differences in g than is the
case for dull people and, conversely, that the correlations between test scores
are higher for dull people than for bright people taking the same tests.
SLDR is a psychometric
extension of the general concept of diminishing returns that has been discussed
extensively in economics and is seen in engineering and other fields.
Jensen considered SLDR important enough to devote an appendix to
it in The g Factor (Jensen 1998). Here
are a few comments from that appendix:
The higher a person’s level of g, the less important it becomes in the
variety of abilities the person possesses.
Higher-g persons have more
diversified abilities, with more of the total variance in their abilities
existing in the non-g factors (i.e.,
the various group factors and specificity).
Like money, g isn’t very important if one has enough of it.
Persons with low IQs have less efficient
central processes, hence overall low performance on most kinds of cognitive
tasks. Persons with higher IQs have
more efficient central processes but may vary considerably in the less central,
narrower processes. Consequently, there
should be higher correlations (and more g
variance) among various tests in a low-IQ group and lower correlations (less g variance) in a high-IQ group.
[Related only by inclusion in the referenced appendix: At the
end of the appendix, Jensen included two paragraphs that suggest that the
heritability of IQ increases at high IQ levels and decreases at lower levels.]
IQ testing
IQ tests are typically composed of a number of subtests, each
consisting of multiple test items. The
idea is to test known areas of intelligence (such as verbal, spatial,
numerical, etc.) and to sum the scores.
In most cases, the subtest scores are not weighted, but the
Woodcock-Johnson III is an exception.[2] Irrespective of the scoring method, various
cognitive abilities are combined to produce a final raw score. That score is then converted into IQ by
determining its standing (percentile) with respect to a large sample of test
scores that are forced to fit a Gaussian curve.[3] This process has been carefully documented
(Jensen 1980) and is the basis of the most common means of intelligence evaluation. Most of the literature on test design,
norming, and distributions is based on the range of about +/- 2.5d.
For most purposes, this range is adequate. Throughout most of the range, the g loading of tests is responsible for almost all of the external
validity of the test. The g loading can only be determined by
factor analyzing[4] the test for
a relatively large number of testees.
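The percentile conversion described above can be illustrated with a short sketch. It is only a sketch under stated assumptions: the function name raw_to_iq and the toy norming sample are hypothetical, and the scale assumes a mean of 100 with 15 IQ points per d, which is the scale implied by the 85 and 115 groups discussed later in this article. Real norming involves age bands and smoothing that are not shown here.

```python
from statistics import NormalDist

def raw_to_iq(raw_score, norm_sample, mean=100.0, sd=15.0):
    """Map a raw score to a deviation IQ via its percentile in a norming sample."""
    n = len(norm_sample)
    below = sum(1 for x in norm_sample if x < raw_score)
    ties = sum(1 for x in norm_sample if x == raw_score)
    # Mid-rank percentile: everyone below, plus half of those tied.
    pct = (below + 0.5 * ties) / n
    # Keep the percentile strictly inside (0, 1) so the inverse normal is defined.
    pct = min(max(pct, 0.5 / n), 1.0 - 0.5 / n)
    # Force the Gaussian shape: percentile -> z score -> IQ scale.
    z = NormalDist().inv_cdf(pct)
    return mean + sd * z

# Hypothetical norming sample: raw scores 1 through 99, one testee each.
norm_sample = list(range(1, 100))
print(raw_to_iq(50, norm_sample))    # median raw score
print(raw_to_iq(200, norm_sample))   # beyond the sample's top score
print(raw_to_iq(1000, norm_sample))  # far beyond it, same IQ as the line above
```

Note that any raw score beyond the top of the norming sample receives the same IQ. That is one concrete form of the ceiling problem mentioned at the beginning of this article.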
So, the issue arises as to whether SLDR matters enough to cause
a disruption to the normal testing procedure for testees who score above the
usual range of interest. If the SLDR
effect is large at the upper end, then very bright individuals may be very
bright not just because they fall higher on a g scale, but because they possess specific cognitive abilities that
account for a significant portion of their high test scores. Unfortunately, it is difficult, if not impossible,
to measure g loading directly for small
increments of the total distribution.[5]
Loading and factor analysis
If an IQ test is given to a large, random sample of testees,
the resulting responses can be factor analyzed, using a procedure that was
invented by Charles Spearman. It should
be noted that other means of factor analysis can be and have been suggested, but
the procedural differences do not diminish the important utility of factor
analysis, which is to identify the correlations between test items and groups
of test items.
A very brief explanation of factor analysis follows. See Appendix A for additional comments and a
short discussion of group factors.
The correlations between test items are determined from large
numbers of responses and grouped so that those with the highest correlations
form related groups. Tests are often
constructed in the format of subtests that typically show the expected
correlations. In the case of the
WAIS-III, these are Information, Vocabulary, Similarities, Comprehension,
Arithmetic, Digit Span, Letter-Number Sequencing, Picture Arrangement, Picture
Completion, Matrix Reasoning, Block Design, Coding, Symbol Search, and Object
Assembly. Each of these subtests
consists of multiple test items of varying difficulty. All of these test items are correlated to
varying degrees. As the correlations
are combined, higher level categories can be identified. In the case of the four-factor model
(Juan-Espinosa, et al., 2002), they combine as follows:
Verbal Comprehension: Information, Vocabulary, Similarities, Comprehension
Perceptual Organization: Picture Arrangement, Picture Completion, Matrix Reasoning, Block Design
Working Memory: Arithmetic, Digit Span, Letter-Number Sequencing
Processing Speed: Symbol Search, Coding
These four factors can define g as the factor that is common to all four groups. It is apparent from the nature of these
categories that some of the test items (see Processing Speed) must be related
to time. Others (most of them) are not
time related and are usually not timed.
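For readers who want to see numerically what "the factor that is common to all" of a set of measures means, here is a minimal sketch. It assumes the simplest possible case: a single common factor and no group factors, so that the correlation between any two measures is the product of their loadings. Under that assumption each loading can be recovered from triads of correlations. The loadings and the function name triad_loadings are made up for illustration; this is not the hierarchical procedure applied to real batteries, and the numbers are not WAIS-III data.

```python
from itertools import combinations
from math import sqrt

def triad_loadings(r):
    """Estimate loadings on one common factor from a correlation matrix r.

    Under a single-factor model r[i][j] = l_i * l_j, so for any other
    pair (j, k): l_i = sqrt(r[i][j] * r[i][k] / r[j][k]).
    """
    n = len(r)
    loadings = []
    for i in range(n):
        others = [k for k in range(n) if k != i]
        estimates = [sqrt(r[i][j] * r[i][k] / r[j][k])
                     for j, k in combinations(others, 2)]
        # Average over all triads that include measure i.
        loadings.append(sum(estimates) / len(estimates))
    return loadings

# Hypothetical loadings for four measures (illustrative values only).
true_loadings = [0.8, 0.7, 0.6, 0.5]
r = [[1.0 if i == j else true_loadings[i] * true_loadings[j]
      for j in range(4)] for i in range(4)]

estimated = triad_loadings(r)
print([round(x, 3) for x in estimated])
```

With real data the triads would disagree with one another because of sampling error and group factors, which is one reason large samples are needed.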
The ultimate goal of the test should be to get an accurate value of g (remember this is a measure of the
correlation between the group factors that emerge from the test). After g
is removed from the other factors, the residue consists of random error plus
specific abilities that do not call upon g. If the residual abilities are evaluated for
their correlation with measures of external validity (academic performance, job
performance, etc.) very little correlation will be found, at least over the
usual range of interest.[6] At the high end, this may not be true.
All IQ test items load on three factors, g, s, and e, such that $g^2 + s^2 + e^2 = 1$, where:
g = the general factor; the final extraction of a factor analysis[7]
s = specificity; an ability that is not correlated with g
e = random error[8]
It is obvious that, if e is constant, there exists a
trade-off between g and s, such that increasing either of them
necessarily must be accompanied by a reduction in the other. This means that if g loading decreases at the high end, the s loading must
increase. Specificity may be thought of
as an error (true only in the sense that it is not generally the objective of
the test), as a specific ability, or as a learned response. When test items are novel, they must be
resolved by cognitive processes that call upon innate abilities that are
presumably the essence of g. Few, if any, test items actually have zero
specificity loading for all testees.
Some people, for example, have a non-g
ability that applies to certain test items (such as series completion, or
rotation) and other people have abilities that enhance their performance on
other categories of test items.
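The trade-off between g and s can be made concrete with a little arithmetic. The sketch below assumes, hypothetically, that the error loading e stays fixed at 0.3 and simply solves the identity stated above (squared loadings on g, s, and e summing to 1) for s. The function name and the numbers are illustrative only.

```python
from math import sqrt

def specificity_loading(g, e):
    """Solve g^2 + s^2 + e^2 = 1 for s, given loadings g and e."""
    s_squared = 1.0 - g * g - e * e
    if s_squared < 0:
        raise ValueError("g^2 + e^2 exceeds 1; no valid s loading exists")
    return sqrt(s_squared)

E_FIXED = 0.3  # hypothetical error loading, held constant

for g in (0.8, 0.7, 0.6):
    s = specificity_loading(g, E_FIXED)
    print(f"g = {g:.2f}  ->  s = {s:.3f}  (s^2 = {s * s:.2f})")
```

With e held at 0.3, a drop in g loading from 0.8 to 0.6 raises the squared s loading from 0.27 to 0.55.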
Besides individual ability differences that are presumably
genetic, individuals have different specific abilities that are the result of
learning. For
example, the digit span test used in the Wechsler is moderately g loaded, but an individual can be
trained to recall many multiples of the usually found span.[9] Such training lowers the g loading of that test item and
increases its s loading.[10]
For testing purposes, this “teaching to the test” basically increases the
testing error by displacing some of its g
loading. There may be implications
here, if SLDR causes an increase in overall test s loading for very
bright people.
Verification of SLDR
A number of papers have been published that report studies of
the magnitude of SLDR. Some have found
that SLDR is real and measurable, but also that it is not a large effect. A few papers have reported no effect. The procedural requirements of measuring g loading as a function of intelligence
are such that the only realistic option is to compare the top half of the
distribution against the bottom half, or to compare samples that lie above and
below the mean, but not close to the top end.
If the distribution function is not linear, there could be large effects
at the upper end that are completely missed by such studies. In fact, the procedures used could not
possibly identify large magnitude effects at the upper end because the small
number of data points in that range would necessarily be lost as the top half
is factor analyzed.
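A small simulation may help show what such a split-half comparison looks like and why it says little about the far right tail. Everything in it is an assumption made for illustration: the latent g is drawn from a standard normal distribution, each of four hypothetical tests reflects g at full strength below the mean and at 40 percent strength above it, the noise level is arbitrary, and the sample is split on the latent g itself rather than on an observed score. None of these numbers comes from the studies cited here.

```python
import random
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def mean_intercorrelation(rows):
    """Average correlation over all pairs of tests (columns of rows)."""
    cols = list(zip(*rows))
    pairs = list(combinations(range(len(cols)), 2))
    return sum(pearson(cols[i], cols[j]) for i, j in pairs) / len(pairs)

def simulate(n=4000, n_tests=4, seed=0):
    rng = random.Random(seed)
    low, high, n_above_3d = [], [], 0
    for _ in range(n):
        g = rng.gauss(0.0, 1.0)
        # Assumed diminishing return: g counts less above the mean.
        effect = g if g < 0 else 0.4 * g
        scores = [effect + rng.gauss(0.0, 0.6) for _ in range(n_tests)]
        (low if g < 0 else high).append(scores)
        if g > 3.0:
            n_above_3d += 1
    return mean_intercorrelation(low), mean_intercorrelation(high), n_above_3d

low_r, high_r, n_tail = simulate()
print(f"mean subtest correlation, lower half: {low_r:.2f}")
print(f"mean subtest correlation, upper half: {high_r:.2f}")
print(f"simulated cases above +3d: {n_tail}")
```

In this sketch the lower half shows the higher average correlation, while the number of simulated cases above +3d is tiny (about 1 in 740 is expected under a normal distribution), far too few to factor analyze on their own.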
The only recourse at present is to attempt to reason and
extrapolate what may be happening at the upper end of the intelligence
spectrum. There is no doubt that
intelligence extends well beyond the range of most standardized tests; the
question is how much of that high ability is due to g and how much is due to specific abilities that are not correlated
with other abilities.
Perhaps the strongest indicator that SLDR is large enough to
matter is the well-documented decrease in test-to-test correlation as a
function of IQ group. Detterman
(1991) wrote: “Low IQ subjects showed much higher correlations than high IQ
subjects. Intercorrelations of IQ subtests, correlations of cognitive ability
measures with each other, and correlations of IQ with measures of cognitive
abilities all displayed the same effect.
For both the WAIS-R and WISC-R, average subtest correlations were
highest in the low ability group. Correlations declined systematically with
increasing IQ. In both studies, correlations were found to be two times higher
in low IQ groups than in high IQ groups.
Measures from the basic tasks correlated more highly in the low IQ group
than in the high IQ group.” (Detterman 1991)
Some attempts to verify SLDR have been flawed by low N or by
very small differences in IQ scores between the groups being compared. Kane and Brand (2001) reported more robust
findings for two groups separated by 2d
(IQs were 85 and 115) and measured by the WJ-R:
| Primary Ability | g Loading (Low IQ) | g Loading (High IQ) |
|---|---|---|
| Fluid Intelligence (gf) | .89 | .80 |
| Visual Processing (gv) | .88 | .77 |
| Processing Speed (gs) | .95 | .75 |
| Long-Term Retrieval (glr) | .69 | .65 |
| Crystallized Intelligence (gc) | .84 | .39 |
| Auditory Processing (ga) | .81 | .65 |
| Short-Term Memory (gsm) | .72 | .39 |
| Quantitative Reasoning (gq) | .86 | .72 |
The above data clearly show an SLDR effect, even for
groups that are only 1d above and
below the mean. At 3, 4, or more d above the mean, and with a nonlinear
effect, there could be a large increase in the role of non-g abilities.
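Because the square of a loading is the proportion of variance attributable to that factor, the Kane and Brand figures can be restated in terms of g variance. The short calculation below uses the loadings exactly as reported above. The variable names are mine, and averaging across the eight abilities is just a convenient summary, not something taken from Kane and Brand.

```python
# g loadings from the table above: (low IQ group, high IQ group)
g_loadings = {
    "gf":  (0.89, 0.80),
    "gv":  (0.88, 0.77),
    "gs":  (0.95, 0.75),
    "glr": (0.69, 0.65),
    "gc":  (0.84, 0.39),
    "ga":  (0.81, 0.65),
    "gsm": (0.72, 0.39),
    "gq":  (0.86, 0.72),
}

# Squared loading = share of that ability's variance attributable to g.
g_variance = {name: (low ** 2, high ** 2)
              for name, (low, high) in g_loadings.items()}

for name, (low_var, high_var) in g_variance.items():
    print(f"{name:>3}: low IQ {low_var:.2f}   high IQ {high_var:.2f}")

mean_low = sum(v[0] for v in g_variance.values()) / len(g_variance)
mean_high = sum(v[1] for v in g_variance.values()) / len(g_variance)
print(f"mean g variance: low IQ {mean_low:.2f}, high IQ {mean_high:.2f}")
```

On these figures the average share of variance attributable to g is about 0.70 in the low IQ group and about 0.43 in the high IQ group.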
A very recent study (Facon 2006) examined the appearance of
differentiated abilities (SLDR) as a function of age. Below the age of 12 years, the differences between high and low
IQ groups were small, but they were substantial by ages 13-15. The high IQ subjects maintained a relatively
constant differentiation between subtest correlations, while low IQ subjects
showed increasing subtest correlations after age 12. Facon also presents a good review of various SLDR studies,
commenting “… [SLDR] is now considered practically a fact that remains
only to be explained.”
Biological factors
It is well known that a number of biological factors relate to g.
Jensen: “The g factor arises
from the empirical fact that scores on a large variety of independently
designed tests of extremely diverse cognitive abilities all turn out to be
positively correlated with one another. The g
factor appears to be a biological property of the brain, highly correlated with
measures of information-processing efficiency, such as working memory capacity,
choice and discrimination reaction times, and perceptual speed. It is highly
heritable and has many biological correlates, including brain size, evoked
potentials, nerve conduction velocity, and cerebral glucose metabolic rate
during cognitive activity.”
[psycoloquy.99.10.023.intelligence-g-factor.1.Jensen]
These biological factors are ultimately limited. Brain volume, for example, does not extend
beyond some limiting point. Nerve
conduction velocity is limited by its own mechanism. It does not seem unreasonable to postulate that the contributions
from the various biological factors are individually subject to natural limits
that are nonlinear as they approach their upper extremes. When reaction time (RT) or inspection time
(IT)[11]
measurements are taken, they are found to correlate negatively with g, but cannot extend indefinitely
because reaction times are ultimately limited by nerve conduction
velocity. A good example of this is
shown in Brand (1996), Figure II, 5.
The figure shows a steep IT slope for the lower half of the IQ range and
only a slight slope for the upper half.
The point is that, if g is
primarily the product of biological components, those components may not vary
linearly with increasing IQ test performance.
If so, SLDR is consistent with lowered rates of intelligence
contribution from the biological correlates.
The other side of the biological question is whether or not
group factors[12] are due to
the same causes as g, or due to
something else. Jensen made the
interesting speculation that when we ultimately understand the role of neural
circuits in the brain, they will turn out to have little to do with g and perhaps account entirely for group
factors (see Appendix A for a brief discussion of group factors). This is consistent with the general picture
that has emerged, showing g to be a
reflection of physiological factors.
If the above assumptions are at least partially true, there
should be a divergence between g and
group factor residuals as intelligence increases, with less of the net
cognitive power explained by g and
more by group factors. If this is
nonlinear (the author’s guess), the contribution from group factors may be
significant at the upper end. Perhaps
an experiment will eventually be designed to quantify what, if anything, is
happening between g and s.
Do group factors matter?
Below the right tail, the external
validity of IQ tests is almost entirely due to their g loading. Jensen has
repeatedly pointed out that if the g
loading is factored out at the group factor level, the external validity of all
of the residuals combined is nil. But
at the upper end of the spectrum, he commented (Jensen 2000) “In groups of
people with high levels of g, relatively more of the variance in test scores
lies in the lower-order group factors and in test specificity. Higher g persons
tend to invest their g in a greater variety of intellectual activities and
interests than lower g persons; that is, cognitive abilities are more
differentiated at higher levels of ability. Analogously, wealthy people spend
their money on a greater variety of things than do poor people.” This observation is easily seen in daily
experience and confirms the significance of SLDR at the individual level, not
only as it applies to g loading, but
also as it applies to human behavior.
If a significant part of the
variance in intelligence at the high end is attributable to group factor
residuals, then those factors cannot be ignored, at least for very bright
individuals. Among the most successful
studies of high intelligence is the collection of data sets begun by Julian
Stanley[13]. Interestingly, Stanley focused on math
ability, and the longitudinal studies included the Study of Mathematically
Precocious Youth (SMPY), which is based on SAT-M scores obtained at about age 12½.
These longitudinal studies are
still collecting data and have demonstrated striking career achievement
differentiation between very bright people within the top percentile. That is, the bottom of the top 1 percent has
performed very well with respect to the rest of the IQ distribution, but not as
well as the top quartile of that single 1 percent range (Wai, Lubinski, and Benbow 2005). After hearing Lubinski and Wai present
related papers in 2004 and 2005, I asked Lubinski if he thought that the
cohorts in the SMPY groups achieved their intelligence and success as a result
of high g or group factor related
abilities. His reply can be summarized
simply as: “both.”
Implications for testing at the high end
(Jensen 2003): “The
total scores of individuals in the upper range of the ability distribution are considerably
less g loaded, and consequently are more adulterated by non-g factors and
test specificity, than are the scores in the lower range. … In
high-ability groups, those tests that have the larger g loadings in the whole population systematically show the least
decrement in g loading, and those tests that have the smallest g loadings show the most decrement. … A battery composed of diverse subtests is
needed in order to minimize the proportion of the variance in the total test
scores that is contributed by lower-order factors and test specificity. It
should be possible to select or construct a large battery of highly reliable
subtests having quite diverse content and information-processing demands yet
maximizes their g variance, with all of the subtests having approximately
equal g loadings.” [underlining
added] Jensen was arguing that a
battery of the most heavily g loaded
subtests would be optimum for measuring intelligence, but pointed out that such
tests would essentially hide non-g
abilities that might be important at high levels.
Evans (1999):
“The possibility of a breakdown of g
at higher levels of intelligence, even with a narrow range of tests (as in the
Armed Services Vocational Aptitude Battery) implies that we may have to
reexamine the nature of intelligence.” … “There may be a single driving factor
at low levels of g, but this may be manifested in a variety of different
ways at high levels of g.”
This leads to
the consideration that testing at the high end has inherent obstacles beyond
those that relate to low N, verification of the incremental increases in
difficulty, too few test items, and the lack of sufficient data to establish
external validity. The very nature of
existing IQ tests, combined with SLDR, means that one test will sort people this
way, another will sort them that way, etc., because each test will measure and
weight group factors differently. If the
tests are designed to minimize group factor s loadings, g will presumably be measured properly
but a large source of the variance in high end intelligence will be
excluded. Either testing for the
specific abilities or trying to exclude them will introduce some difficulties.
Consider the
case of the typical IQ test (structured with multiple subtests) that will pick
up specific abilities. What is the
proper way to treat the non-g
portions of the total score? If they
are added equally, the test design will strongly influence the scores of
individuals on the basis of the particular group factors that are measured. Some tests do not even attempt to measure
some group factors, so an individual with strong specific ability in such a
group factor would be penalized.
Arguably, some group factors have a lot of external validity (math, for
example), while others may have less, none, or validity only in certain
situations (musical ability). Is it
proper to score a non-g verbal
increment equally to an identical non-g
math increment, even if the external validities are different? The answer likely lies with the intent of
the test designer, but his decision will not serve to give greater clarity to
the reported score.
One solution
that has a rather limited window of opportunity is the use of well-normed
tests for testees who are below the intended age range for the test. This is the procedure used by Julian Stanley
and those who have carried on the search for talented youth. When the SAT (at least the old SAT) is given
at age 12½, there is an inherent benefit: the test is very well normed
against an older age group, and it has a very high ceiling for the younger age
group. If the measurements are
restricted to the SAT-M, it probably benefits additionally by picking up math
abilities that, at least when combined with high g, seem to enhance the prediction of adult measures of career
success.
Conclusions
The variance
in intelligence among highly intelligent people is due to a combination of g and narrow abilities. This divergence between g and s poses a dilemma for designers of tests with high
ceilings. If a test is designed to
minimize the influence of narrow abilities, a significant part of the cognitive
advantage of bright testees will be missed; if a test is designed to measure
specific abilities, the test will inherently reflect a non-standard assessment
of intelligence that will not correlate well with other IQ tests.
Appendix A: Group Factors
The items that are produced by the first extraction of a factor
analysis create groups of related abilities.
At this level, there will be many such groups, but fewer than the number
of test items from which they were extracted.
These groups of correlations are then put into a hierarchical matrix and
the process is repeated, resulting in fewer correlated groups, known as
first-order factors. When groups are
selected, they are assumed to be composed of at least three independent
variables of the test (that is, three groups from the prior level of
extraction).
The first-order factors are obviously going to be more general
than the groups from which they were extracted. The process is repeated again, producing sets of still more
general factors, known as second-order factors. The correlations between the second-order factors form a single
factor— psychometric g.
Proceeding through the extraction explanation again: the analysis
begins with test items; those produce groupings known as tests; those produce
first-order factors; those produce second-order factors; those produce g.
In a narrowly constructed test, g
may emerge at the second-order. Keep in
mind that these are all mathematical representations of correlations. The things being extracted are correlated;
the parts of each extraction that are uncorrelated with anything else are
stripped. The extraction from the second-order factors is the thing
that is common to each second-order factor.
The rest of the second-order factors are the uncorrelated parts and are
known as the residual variance.
A great deal of information can be derived from the factor
analysis, for example, the loadings of each test on the g factor. A good bit of the
mathematics (such as treating the uncorrelated components as orthogonal) is
more or less obvious to anyone who has dealt with related mathematics. The actual process of going through a factor
analysis is much more complicated than this simple sketch and is the subject of
entire books.
Most good books on psychometrics contain several pages of discussion of
factor analysis and offer a far better explanation than my overly condensed
one. Although factor analysis was developed
by a psychometrician for the specific application we are discussing, it has
become widely used in many unrelated fields, such as economics.
So, where are the group factors? They are the second-order factors from which g is extracted. Jensen contends
that there are 7 or 8 group factors that have been reliably established. Other sources have suggested as many as 10
group factors. The number of
first-order factors cannot be reliably stated, but the practical number is
claimed to be 50-60.
Group factors have identities, based on the paths that support
them, and are named accordingly. The
naming may vary from one source to another, but typical examples are verbal,
numerical, spatial visualization, memory, and mechanical.
It turns out that IQ test scores are rather simply determined by
scoring subtests. Ergo, they capture
both g and non-g factors. But it has been
shown that almost all (around 96%) of the external validity of an IQ test is
accounted for by its ability to measure g. That means that the group factor residuals,
while accounting for real cognitive abilities, do not contribute much to the
measurement of intelligence ... at least over the range of +/- 2.0 or 2.5d.
Appendix B: Tilt
Although there are several group factors (see Appendix A),
intelligence testing often focuses on math and verbal. There are frequent references to differences
between the sexes and between population groups in these two categories. When there is a significant difference in
individual or group performance in these two categories, the term “tilt” is
used. For example, the United States
has a verbal tilt, while East Asian countries have a math tilt (Hunt 2005).
There are economic advantages
associated with the math tilt. These
may have implications with respect to the findings that men outperform women in
a number of career and cognitive disciplines.
In fact, Murray (2003) discusses the universal attainment gap between
the sexes (p. 289) and documents the relative achievements of women in Chapter 12,
where he shows that about 2.2% of the significant figures in history from
800 B.C. to 1950 were women.
Tilt clearly does not fully explain why women have
not established a rate of accomplishment that matches that of men. In the past 2 to 3 years, Lynn, Irwing, and
others have demonstrated that mean female IQ is on the order of 5 points below
mean male IQ in adults. This finding is
missed in Murray (2003), as the associated studies had not yet been published. He did not correct that omission in Murray
(2005), but did provide an excellent discussion of the findings of male and
female achievements (from his earlier book).
Aside from its relationship to group factors, tilt has some
connection to the point made in the body of this text. If testing at the high end is significantly
related to how the non-g measurements
are treated, then what is the most appropriate basis for weighting them? External validity? If so, should math scores be weighted more than verbal scores? It is unlikely that there is a way to derive
answers to these questions. Arguments
might be constructed to weight non-g
abilities differently, according to the intention of the test. Likewise, those non-g components could be discarded, but at the expense of measuring
factors that may matter among bright people, even if they matter much less
throughout the rest of the IQ spectrum.
REFERENCES
Abad, F. J., Colom, R., Juan-Espinosa, M., & García, L. F. (2003). Intelligence differentiation in adult samples. Intelligence, 31, 157-166.
Brand, C. (1996). The g Factor: General Intelligence and Its Implications. Chichester, England: Wiley.
Detterman, D. K., & Daniel, M. H. (1989). Correlations of mental tests with each other and with cognitive variables are highest for low IQ groups. Intelligence, 13, 349-359.
Detterman, D. K. (1991). Reply to Deary and Pagliari: Is g intelligence or stupidity? Intelligence, 15, 251-255.
Juan-Espinosa, M., García, L. F., Escorial, S., Rebollo, I., Colom, R., & Abad, F. J. (2002). Age dedifferentiation hypothesis. Intelligence, 30, 395-408.
Evans, Martin G. (1999). On the asymmetry of g, Joseph L. Rotman School of Management, University of Toronto
A similar version of the above paper was offered a year later:
Evans, Martin G. (2000). Implications of the Asymmetry of g for predictive validity. Yale Conference on Intelligence.
Facon, Bruno (2006). Does age moderate the effect of IQ on the differentiation of cognitive abilities during childhood? Intelligence, 34, 375-386.
Hunt, Earl (2005). Speaking at the International Society for Intelligence Research conference.
Jensen, Arthur R. (1980). Bias in mental testing. New York: Free Press.
Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger.
Jensen, Arthur R. (2000). Is There a Self-Awareness of One's Own G Level?, Psycoloquy: 11,#40 Intelligence G Factor (39)
Jensen, Arthur R. (2003). Regularities in Spearman’s Law of Diminishing Returns. Intelligence, 31, 95-105.
Kane, H. & Brand, C. R. (2001). 'The Structure of Intelligence in groups of varying cognitive ability: a test of Carroll's three-stratum theory.' [Provisionally accepted for Intelligence.]
Murray, C. (2003). Human Accomplishment. New York: Harper Collins.
Murray, C. (2005). The Inequality Taboo, Commentary Magazine.
Spearman, C.E., 1927. The abilities of man, Macmillan, London.
Wai, J., Lubinski, D., & Benbow, C. P. (2005). Creativity and Occupational Accomplishments Among Intellectually Precocious Youths: An Age 13 to Age 33 Longitudinal Study. Journal of Educational Psychology, 97(3), 484-492.
[1] d = standard deviation units
[2] The WJ-III weights each subtest according to its g-loading.
[3] The fact that a distribution fits a particular function (in this case, Gaussian) over a wide range, does not imply that the distribution will take any particular form outside of that range. It simply confirms that it is reasonable to use the fit that works over the range that is verified. There are no good statistics to support the shape of the right tail above this range. Even the best standard tests begin to look like extrapolations or worse above this range.
[4] Factor analysis is explained in moderate detail in Jensen (1998) and Brand (1996) and briefly in Appendix A.
[5] I asked both Jensen and Bouchard about the practicality of resolving the decline in g loading as IQ increases. Jensen told me that it was inherently difficult (his implication was more along the lines of "impossible") because the methodology would break down. Bouchard mostly agreed, but suggested that it may be possible to get a quartile view.
[6] “g is the sine qua non of test validity. The removal of g (by statistical regression) from any psychometric test or battery, leaving only group factors and specificity, absolutely destroys their practical validity....” (Arthur Jensen, The g Factor, p. 270.)
[7] g, unlike any of the primary, or first-order, factors revealed by factor analysis, cannot be described in terms of the knowledge content of cognitive test items, or in terms of skills, or even in terms of theoretical cognitive processes. It is not essentially a psychological or behavioral variable, but a biological one, a property of the brain. From -- Précis of The g factor: The science of mental ability. Westport, CT: Praeger.
[8] Test size (number of test items) is a limiting factor in what a test can accomplish. Besides limiting resolution, it determines the magnitude of random error that is retained in a test. Since random error is, surprise, "random," it declines as the number of test items increases. If the test contained a huge number of test items, random error would be reduced to insignificance.
[9] Murray (2005) contains an interesting discussion of digit span. Footnote 59, from that article: “The average adult gets a digits-backward score of 5 (Jensen 1998: 263). You may compare your own score with the highest I have observed, 13 and 12, achieved respectively by José Zalaquett, former chairman of Amnesty International, and the political analyst Charles Krauthammer. Zalaquett’s score might have been higher if he had not been in a car weaving through traffic at 70 miles per hour on the New Jersey Turnpike. Krauthammer’s score might have been higher if he hadn’t been driving.”
[10] The terms g and s are discussed under the heading “Loading and factor analysis.”
[11] For a thorough discussion of how RT and IT are measured, see Jensen (1998).
[12] The term “group factors” is sometimes replaced by “specific abilities,” or “narrow abilities.”
[13] Robert Plomin (1995). Genetics and Intelligence, In N. Colangelo & S. Assouline (Eds.), Talent Development III . Scottsdale, AZ: Gifted Psychology Press
"The high-ability samples for the second phase of the project have been selected from the Study of Mathematically Precocious youth (SMPY), begun two decades ago by Julian Stanley and now co-directed by Camilla Benbow and David Lubinski (e.g., Lubinski, D., & Benbow, C. P. (1992). Gender differences in abilities and preferences among the gifted: Implications for the math-science pipeline. Current Directions in Psychological Science, 1, 61-66.). SMPY includes a total of more than 5,000 gifted students who are currently being tracked. Subjects are selected through above-level testing, a procedure in which 7th and 8th graders scoring in the top 2- 3% on conventional achievement tests are invited to take the College Board Scholastic Aptitude Test (SAT). These students generate score distributions on the SAT indistinguishable from those of 11th and 12th grade high school students. The especially able children are selected for in-depth assessments plus extensive longitudinal tracking at 5- to 10-year intervals. Since 1972, more than a million 7th and 8th graders have been tested with the SAT, and more than 100,000 such students now take the SAT annually. These tests are remarkably predictive of exceptional academic achievements into adulthood (Benbow, C. P. (1992). Academic achievement in mathematics and science of students between ages 13 and 23: Are there differences among students in the top one percent of mathematical ability? Journal of Educational Psychology, 84, 51-61.). "