The General Character of Philosophy
Philosophy is uncertain by definition.
Philosophy differs from mathematics and the experimental sciences because it deals with systems of thought wherein neither axioms nor data are known with certainty. This can be considered as simply a matter of definition: a field of knowledge based on fixed and known axioms, and thus defined in terms of these axioms, becomes a branch of mathematics, and (as we shall discuss) one whose data is certain becomes an experimental science. What is left--and that is a lot--remains philosophy.
Philosophy's uncertainty is related to that of language.
Mathematical axioms are stated in terms of basic, technical words whose meaning is fixed and certain in their context. Words clearly defined in terms of them have correspondingly clear meaning. Thus, a triangle is defined in terms of straight lines. As long as the meaning of the latter is axiomatic, then the meaning of the word triangle is clear and certain.
As we shall see, meanings of words in experimental science are more problematic because they can shift when axioms shift. Even space and time change meanings in the shift from non-relativistic to relativistic theory. But between periods of change, meanings stabilize and the situation becomes that of mathematics. Thus, in both mathematics and science, a clear technical vocabulary can be created.
Philosophy constantly seeks but never attains axioms, and the clarity and certainty of definition they provide. Its words are often defined and given technical meanings, but since axiomatic meanings are unavailable (i.e., philosophers rarely agree on each other's axioms), 'technical' no longer implies either certainty or clarity. In this sense, despite a technical vocabulary, philosophic language is really like ordinary language.
The meanings of a substantial fraction of the words used in ordinary language are highly contingent and therefore ambiguous. Words directly related to things and phenomena are not contingent, but idea-words, like 'cause', are. Since reasoning can be no more certain than the meaning of the words (or symbols) it uses, philosophic conclusions are correspondingly uncertain and inconclusive.
Socrates sought greater certainty through sharper definition.
This is why Socrates, who, as we shall see, sought to extend systematic logical thought beyond natural philosophy, concentrated so much of his efforts on definitions. As Aristotle tells us in his Metaphysics:
Now Socrates devoted his attention to the moral virtues, and was the first to seek universal definitions concerning these things . . .; and he did well to ask what a thing is; for he sought to reason logically, and what a thing is, is the beginning of logical reasoning .... There are two innovations which may be justly attributed to Socrates, inductive reasoning and universal definitions. Both of these are about the beginning of scientific knowledge.
For example, the Euthyphro begins with the proposition that,
pious is to prosecute the wrongdoer; not to prosecute is impious.
and Socrates asks:
Tell me then what this form itself is, so that I may look upon it, and use it as a model...
By form he means a definition. The word form is used in the sense of an ideal which can be copied: a template. Given the form of piety, particular examples of behavior can be tested to see if they fit. A form is also like a formula. Given the formula for the Pythagorean theorem, sets of three numbers can be tested to see if they fit, that is, if they are an instance of it. Most importantly, just as the Pythagorean formula must be available before it can be combined with other formulas to deduce new formulas, so also a definition of piety is required before the concept can be logically combined with others to reach new conclusions.
The power of verbal logic is severely limited in comparison to mathematical logic.
Although definitions in verbal and mathematical logic have the same purpose, the former are severely limited compared to the latter because of the ambiguity of ordinary language. Ambiguity creates paradox, and Socrates often demonstrated how quickly paradoxical conclusions were reached as one tried to create chains of logic. An example extracted from the Platonic dialogue Phaedo is outlined in Figure 1. Socrates, using a series of questions, guides his follower, Simmias, to the conclusion shown at the top.
The structure of the argument is indicated schematically in Figure 2. In contrast to mathematics where relatively few axioms can generate unlimited numbers of theorems, the typical chain of deductions found in philosophy requires the introduction of new premises every step of the way--none are axioms. Therefore a careful thinker must consider them at best as only probable. They are as uncertain as the words used in them.
Premises in Figure 2 are shown as blackened statements and enter into the logic successively, as do the corresponding premises shown in Figure 1. The validity of each of those in Figure 1 is contingent on numerous assumptions, for example, that each human is made up of an independent body and soul, that the soul can be separated from the body and remain conscious, that the soul gains no valuable experience while being bound to the body, and that topics of interest to philosophers are not related to life. Thus, their logical product, the conclusion that philosophers are in training for dying, is far less likely than any of the premises leading to it. It has not been observed, in fact, that philosophers do fear death least of all men.
In sum, logical structures in philosophy are not anchored in generally accepted axioms, they are bedeviled by undefined words, and they require continuous infusions of uncertain premises. As a result, chains of logic can be made to lead anywhere the philosopher wants them to go and when they get there, they have no certainty.
Verbal truth hardly exists.
The preceding discussion suggests that, outside of mathematics and the experimental sciences, however one may wish to define truth, it hardly exists. Mathematics has its axioms (agreed-upon truths) from which infinite numbers of further truths may be derived. Experimental science has its experimental truths, from which general laws may be inferred and tested by their ability to predict new experimental truths. Philosophy can scarcely state either an axiom or an observation because of the uncertainty of the words it uses and, even ignoring this uncertainty, because of the divergence of opinions concerning basic truths.
Furthermore, reaching further conclusions from axioms or observations requires logic. Conclusions reached in one step of logic are trivially related to their assumptions. It takes many steps to reach deep and surprising conclusions (a staple quality of the sciences). But, as shown above, many steps of verbal logic rapidly lose any probability of validity. The net conclusion is that philosophy produces few truths, and none with any certainty.
If truth cannot be the goal of philosophy, what is?
Despite the impossibility of deriving truth from its arguments, the study of philosophy serves a number of very important purposes. First, it repeatedly demonstrates the effects of the problem of verbal logic, thereby teaching sophistication in assessing philosophic conclusions. Second, it should help its students use philosophy (to find, or create, their own philosophic structures and worldviews) with sophisticated insight into what they are doing, both its meaning and its limitations.
But why are we interested in these goals? Why are we interested in assessing the philosophy of others and in creating our own? If philosophy does not provide truth, what does it provide? As a step towards answering this, we will now re-examine and clarify the concept of information and relate philosophy to information processing, and to information reduction in particular.
Information
The technical definition of information does not imply meaning.
Shannon developed the concept of information while working for Bell Telephone, a communications company. Such companies are not paid on the basis of a message's meaning. As long as a customer is willing to pay for transmission, a company has no reason to be concerned with meaning. Correspondingly, the technical definition of information does not involve meaning. This is easy to forget since it disagrees with the everyday definition. However meaningless a message may be, it must nonetheless be faithfully transmitted. Information theory is concerned with finding and measuring the most efficient way to faithfully transmit messages.
A company which charges for transmission by the letter seeks to transmit on average the fewest bits per letter.
Customers are also unconcerned about a message's form during transmission. A company can thus code messages so as to minimize transmission costs; this is a problem Shannon studied.
If a company charges by alphabetical letter, it tries to send as many letters as possible over the life of its channels. If the channel has a useful lifetime T and a channel capacity C, and is fully employed, it supplies a total of CT bits. The company wants to send the maximum number of letters, L, using these CT bits. Equivalently, it wants CT/L, the average number of bits per letter (the cost/product ratio), to be minimized.
A communications company must be prepared to send any message a customer happens to bring. Messages may be imagined to exist in a pool waiting to be picked by the public at random (like differently colored balls in the proverbial urn of probability theory). Their relative probabilities of being picked are the only thing the company knows. Clearly, an efficient code should use the fewest bits for the most probable messages.
The number of bits a message requires when most efficiently coded is defined as its information content.
Transmissions are more efficient when more frequent items are given shorter codes.
We have already discussed improvements in efficiency in various contexts. In the simplest context, such as a stationery business that transmits orders only for sheets of paper, pens, pencils, and erasers, the most efficient code was seen to be that given in Table 1 when all items are equally likely to be ordered.
| Item | paper | pen | pencil | eraser |
|---|---|---|---|---|
| Letter | a | b | c | d |
| Binary code | 00 | 01 | 10 | 11 |

Table 1: Coding Table
The best code makes the expected length of bit strings as short as possible. Suppose it is known beforehand that 99.9% of orders will be for paper. The code shown in Table 1 treats all items equally. Using it will probably lead to messages having hundreds of consecutive orders of paper: long strings of 00s. Shortening them is an obvious way to improve efficiency.
A couple of years after Shannon published his initial papers on information theory, during which time he and others had tried to discover the best possible code for a situation such as this, the problem was assigned as homework to a class on information theory at MIT and solved by a student, D. A. Huffman. Using Huffman's code in the present problem leads to the following table:
| Letter | a | b | c | d |
|---|---|---|---|---|
| Binary code | 1 | 01 | 001 | 0001 |

Table 2: A Huffman code
Now aabbcad translates into 11010100110001. At first sight this does not seem usable: how does one distinguish where the code for each letter begins and ends? In the previous code, this was done by simply dividing the bits into pairs. In the present code, one starts from the left and identifies a sequence of bits with a letter as soon as possible. Inserting commas to separate identifiable sequences, we get: 1,1,01,01,001,1,0001
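The left-to-right decoding rule can be sketched in a few lines of Python. This is a minimal illustration, not part of the original discussion: the function names `encode` and `decode` are hypothetical, and the code table is copied from Table 2.

```python
# Code table copied from Table 2.
TABLE_2 = {"a": "1", "b": "01", "c": "001", "d": "0001"}


def encode(letters: str, code: dict[str, str] = TABLE_2) -> str:
    """Concatenate the code word of each letter."""
    return "".join(code[ch] for ch in letters)


def decode(bits: str, code: dict[str, str] = TABLE_2) -> str:
    """Scan from the left, emitting a letter as soon as a code word is complete."""
    letter_for = {word: letter for letter, word in code.items()}
    letters, current = [], ""
    for bit in bits:
        current += bit
        if current in letter_for:  # works because no code word begins another
            letters.append(letter_for[current])
            current = ""
    if current:
        raise ValueError("bit string ends in the middle of a code word")
    return "".join(letters)


coded = encode("aabbcad")
decoded = decode(coded)
```

The scan works without separators because no code word in Table 2 is the beginning of another. The same property would still hold if d were coded as 000 rather than 0001; the sketch keeps the table as given.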
This code is better than the previous one because each long string of double bits, such as 00000000, becomes 1111: halved in length. This more than makes up for the longer codes for c and d. Because of the expected great preponderance of long sequences of a's, this code has almost double the efficiency of the previous one. Over the life of the channel, it permits almost twice as many messages to be sent.
The average number of bits per letter used in this code is slightly greater than 1. But, remembering that the definition of information refers to the best possible code, that does not mean the average information per letter is 1. In fact, using what is known as block coding, it will be seen that 1 is far more bits than necessary and that the average information per letter in this example is close to zero.
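These figures can be checked numerically. The sketch below is illustrative only: the text fixes the probability of paper at 99.9% but does not say how the remaining 0.1% is divided, so an equal split among the other three items is assumed here. The entropy formula used for the "best possible" figure is the one developed later in this chapter.

```python
import math

# 99.9% paper is from the text; the equal split of the remaining 0.1%
# among pen, pencil and eraser is an assumption made for illustration.
probabilities = {"a": 0.999, "b": 0.001 / 3, "c": 0.001 / 3, "d": 0.001 / 3}

table_1 = {"a": "00", "b": "01", "c": "10", "d": "11"}
table_2 = {"a": "1", "b": "01", "c": "001", "d": "0001"}


def average_bits(code: dict[str, str], probs: dict[str, float]) -> float:
    """Expected code length in bits per letter."""
    return sum(probs[letter] * len(code[letter]) for letter in probs)


def entropy_bits(probs: dict[str, float]) -> float:
    """Lower limit on bits per letter for independent letters."""
    return -sum(p * math.log2(p) for p in probs.values())


avg_table_1 = average_bits(table_1, probabilities)
avg_table_2 = average_bits(table_2, probabilities)
best_possible = entropy_bits(probabilities)
```

Under the assumed split, Table 1 costs 2 bits per letter, Table 2 costs about 1.002, and the limit is roughly 0.013 bits per letter, which is the sense in which the information per letter is close to zero.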
Block coding is needed to calculate information.
When the items in Table 1 each have a 25% chance of being ordered, there is no longer an advantage in using a shorter code for any letter. As long as each letter (a, b, c, d) has an equal probability of appearing at any point in the string, the code in Table 1 cannot be improved. Using it, a message string of $N$ letters needs $2N$ bits. Since the number of bits per letter is 2, and cannot be less than 2, the information contained in one letter truly is exactly 2 bits.
A simple generalization of this beyond the case of 4 items is as follows: if the number of items, $A$, is an integral power of 2 (that is, $A = 2^1, 2^2, 2^3, 2^4, \ldots, 2^a$, where $a$ is an integer), and if each item is equally likely to occur, then each item requires a code of exactly $\log_2 A = a$ bits: the information received per item transferred is $a$ bits. A message string of $N$ items carries $Na$ bits of information.
But what if $A$ is not an integer power of 2? It then turns out that the expression $\log_2 A$ for the information per item is still correct. For example, when $A = 3$, each item requires at least $\log_2 3 \approx 1.58$ bits to be transmitted on average and hence carries 1.58 bits of information. This can be understood as follows.
Given 3 possible items to be ordered, a, b, and c, orders could be collected into blocks of $N = 8$ before being sent. There are $3^8 = 6561$ different blocks possible. One block might be the string aaaaaaaa, another might be aaaaaaab, another cbaacbca, and so on. If each item has equal probability of being ordered, each block has a probability of one out of $3^8$ of being ordered. We now assign a unique number to each block. The sender sends the number in binary as a bit string; the receiver translates it back into the correct string of letters using the code book.
As $\log_2 3^8 = \log_2 6561 = 8 \log_2 3 \approx 12.68$, at least 13 bits are needed to number the different blocks; 12 bits would be too few. The largest number representable by 13 bits is thirteen 1s: 1111111111111 (binary) $= 2^{13} - 1 = 8191$ (decimal). Note that because $8191 > 6561$, these 13 bits actually permit us to represent more numbers than are needed.
The average number of bits per letter used by this code is (13 bits)/(8 letters) = 1.625 bits/letter. But it was stated that the information carried by a letter for any $A$ is given by $\log_2 A$ which, in this case ($A = 3$), is 1.58 bits/letter. Since the information is the number of bits per letter for the most efficient code, this means that the block coding we have used, 8 letters per block, is almost but not quite the most efficient.
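How the cost per letter approaches the limit as blocks grow can be sketched as follows. The function name `bits_per_letter` is hypothetical; the only assumption is the scheme just described, in which every possible block of equally likely letters is given its own binary number.

```python
import math


def bits_per_letter(alphabet_size: int, block_length: int) -> float:
    """Bits per letter when every block of the given length gets its own number."""
    blocks = alphabet_size ** block_length
    # Number the blocks 0 .. blocks-1; bit_length gives the bits needed
    # to write the largest of these numbers in binary.
    bits = (blocks - 1).bit_length()
    return bits / block_length


limit = math.log2(3)
for n in (1, 2, 8, 50, 200):
    print(n, bits_per_letter(3, n), limit)
```

For three items, a block of 1 costs 2 bits per letter and a block of 8 costs 1.625; longer blocks come closer to the limit of about 1.585, although not steadily, since the rounding up to a whole number of bits varies from one block length to the next.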
Block coding allows the calculation of the information in strings of uncorrelated symbols having unequal probabilities.
We have seen that, when transmitting messages made up of $A$ symbols, each having equal probability, it takes a minimum of $\log_2 A$ bits to code each symbol: the information per symbol is defined as this minimum. Furthermore, achieving this on average requires the use of block coding, except when $\log_2 A$ is an integer. With this as background, we can now discuss Shannon's solution to the following problem: when the symbols making up messages appear with unequal probabilities, find the information per symbol.
Shannon was mainly concerned with the case in which the symbols were the alphabetical letters making up ordinary language. These letters have probabilities neither equal to, nor independent of, each other. For instance, in English the chance that an e appears at some point in a string of letters is not the same as that of, say, a u, nor is it independent of what precedes it. A preceding th gives a high probability of an e; a preceding q gives a low probability: probabilities are correlated. Such correlations complicate matters considerably, and will be initially neglected.
The uncorrelated frequencies of letters in English are known. The most common letter is 'e', which has a frequency of about $p_e = 0.131$: of every thousand letters picked at random, 131 will, on average, be e. The least common letters are x, j, q, and z, which each occur about $p_x = p_j = p_q = p_z = 0.001$ of the time. Thus, e appears 131 times more frequently than x, j, q, or z.
What then is the most efficient code based solely on letter frequency? The method is basically the same as already discussed when the frequencies were equal.
First, imagine combining many messages into one, thus sending large blocks (long strings of letters) of length $L$. If $p_{-}$ is the probability of a space, almost all such strings will contain close to $p_{-}L$ spaces, $p_a L$ a's, ..., and $p_z L$ z's, and all strings with these counts are almost equally probable. The shortest code is simply to assign a number to each such string.
How many such strings are there? The answer is well known.
$$N = \frac{L!}{(p_{-}L)!\,(p_a L)!\,\cdots\,(p_z L)!}$$
Equation 1: Number of Strings With Most Probable Distribution
Thus each message sent will simply be a number, in binary, lying between 1 and $N$. It will be a number having $\log_2 N$ bits; that value divided by $L$ is the number of bits per letter, as shown in Equation 2.
$$\frac{\log_2 N}{L} \approx -\sum_{i} p_i \log_2 p_i, \qquad i \in \{-, a, \ldots, z\}$$
Equation 2: Average Number Of Bits Per Symbol
Equation 2 calculated using the actual probabilities of letters in English yields 4.14 bits per letter. If language were made up of letters in any order, the average information carried by each letter in a message in English would be 4.14 bits.
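The relation between the count of strings and the bits per symbol can be checked numerically. The sketch below is an illustration under stated assumptions: it uses a made-up three-symbol distribution (0.5, 0.25, 0.25) rather than the English letter frequencies, which are not tabulated here, and it counts the strings having exactly the most probable symbol counts (a multinomial coefficient) using log-factorials so that large blocks do not overflow.

```python
import math


def entropy_bits(probabilities: list[float]) -> float:
    """Bits per symbol from the probabilities alone: -sum p log2 p."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def block_bits_per_symbol(probabilities: list[float], length: int) -> float:
    """log2(N)/L, where N counts strings with exactly p_i * L of each symbol."""
    counts = [round(p * length) for p in probabilities]
    if sum(counts) != length:
        raise ValueError("choose a length for which every p * length is whole")
    # ln N = ln L! - sum ln (p_i L)!, using lgamma(n + 1) = ln n!
    ln_n = math.lgamma(length + 1) - sum(math.lgamma(c + 1) for c in counts)
    return ln_n / math.log(2) / length


example = [0.5, 0.25, 0.25]  # illustrative distribution, not English letters
from_formula = entropy_bits(example)
from_counting = block_bits_per_symbol(example, 4000)
```

For this distribution the formula gives 1.5 bits per symbol, and counting strings of length 4000 gives a value slightly below that; the gap shrinks as the block length grows.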
As a practical matter, 4.14 bits provides a goal for an actual code; the block-numbering procedure itself is impractical. For one thing, it would require creating a numbered list of the $N$ (many millions) different messages of length $L$ involved. The practical procedure is to use a Huffman method to assign short codes to probable letters. The figure of 4.14 bits per letter estimates how well such a code can be expected to perform if done correctly.
Letters correlated in words lower information.
Letters generally appear combined as words. Knowledge of spelling is knowledge about words which, if used in designing a code, reduces the average number of bits per letter. One way to do this is simply to code the words themselves instead of the letters. The analysis then proceeds exactly as before except that words replace letters.
A list of English words in order of frequency shows that 'the' is the most commonly used word, occurring about 10% (more exactly, 13%) of the time. Second is 'of', occurring about (10/2)%, and so on. The $n$th word has a frequency approximately given by Zipf's law: $p_n = 0.1/n$.
Inserting Zipf's law into Shannon's formula, Equation 2, leads to:
$$-\sum_{n} p_n \log_2 p_n = -\sum_{n} \frac{0.1}{n} \log_2 \frac{0.1}{n}$$
The sum is 9.14 bits/word. Since the average word in English is 4.5 letters long, the average letter takes about $9.14/4.5 \approx 2$ bits. This would be the information per letter in English if its words appeared in random order restricted only by their relative frequencies. Thus, taking into account knowledge of English spelling cuts the information content of letters by about half.
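The sum can be evaluated directly. One caution: the probabilities 0.1/n cannot be summed over all n, since their total grows without limit, so the list of words must stop somewhere, and the text does not say where. The sketch below assumes a cutoff at rank 8727, chosen because it reproduces the quoted figure; both the cutoff and the function name are assumptions.

```python
import math

ASSUMED_CUTOFF = 8727  # assumption: the text does not state where the sum stops
LETTERS_PER_WORD = 4.5  # average word length quoted in the text


def zipf_word_entropy(cutoff: int, scale: float = 0.1) -> float:
    """-sum over n = 1..cutoff of p_n log2 p_n, with p_n = scale / n."""
    return -sum(
        (scale / n) * math.log2(scale / n) for n in range(1, cutoff + 1)
    )


bits_per_word = zipf_word_entropy(ASSUMED_CUTOFF)
bits_per_letter = bits_per_word / LETTERS_PER_WORD
```

With this cutoff the sum comes to about 9.14 bits per word, or about 2 bits per letter. A different cutoff gives a different total, so the figure should be read as depending on that choice.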
Knowledge of English lowers the information content of messages in English.
Word order is restricted by grammar, custom, culture and so on: knowledge which lowers a language's information per letter. But these restrictions are very difficult to evaluate. Shannon showed how the whole effect of this knowledge could be estimated. His own explanation goes as follows:
The new method of estimating entropy [information] exploits the fact that anyone speaking a language possesses, implicitly, an enormous knowledge of the statistics of the language. Familiarity with the words, idioms, clichés and grammar enables him to fill in missing or incorrect letters in proof-reading, or to complete an unfinished phrase in conversation. An experimental demonstration of the extent to which English is predictable can be given as follows: select a short passage unfamiliar to the person who is to do the predicting. He is then asked to guess the first letter in the passage. If the guess is correct he is so informed, and proceeds to guess the second letter. If not, he is told the correct first letter and proceeds to his next guess. This is continued through the text. As the experiment progresses, the subject writes down the correct text up to the current point for use in predicting future letters. The result of a typical experiment of this type is given below. Spaces were included as an additional letter, making a 27 letter alphabet. The first line is the original text; the second line contains a dash for each letter correctly guessed. In the case of incorrect guesses the correct letter is copied in the second line.
THE-ROOM-WAS-NOT-VERY-LIGHT-A-SMALL-OBLONG-
- - - - ROO- - - - - - - - NOT-V- - - - - I- - - - - - SM- - - - OBL- - - - -
READING-LAMP-ON-THE-DESK-SHED-GLOW-ON-
REA- - - - - - - - - - - O- - - - - - - D- - - - SHED-GLO- - - O- -
POLISHED-WOOD-BUT-LESS-ON-THE-SHABBY-RED-CARPET-
P- L- S- - - - - - O- - - BU- - L- S- - O- - - - - - SH- - - - - - R- - - C- - - - - -
Of a total of 129 letters, 89 or 69% were guessed correctly. The errors, as would be expected, occurred most frequently at the beginnings of words and syllables where lines of thought have more possibilities of branching.
Both lines contain the same information because it is possible, at least in principle, to recover the first line from the second. To accomplish this we need an identical twin, B, of the individual, A (or a copy of the machine), who produced the second line from the first through the guesses and answers described. B (who must be mathematically, not just biologically, identical to A) will guess what A guessed in the same situation.
Thus, suppose B is shown the reduced text (the second line). Everywhere there is a dash, B will guess correctly. Everywhere there is a letter, B's incorrect guess is corrected. Thus, B recreates the original message. A creates the code, and B decodes it, because B and A both have the same knowledge base.
This experiment allows us to estimate the information in written English. The reduced text consists of 26+1+1=28 symbols: the 26 letters of the alphabet, the space between words, and the dash indicating a correct guess. Correct guesses (dashes) occur with a frequency of about 0.69. Let us assume for simplicity that the relative letter frequencies in the reduced and unreduced texts are equal. Equation 2 can then be applied to the 28 symbols used in the reduced text and yields 1 bit per letter. Shannon, performing a careful statistical analysis of more complete data (and using a more elaborate version of this guessing scheme), came up with upper and lower bounds of 1.3 and 0.6 bits respectively.
Shannon's twins can be automated so that, at least in principle, English can be coded, transmitted at the average cost of 1 (±35%) bit per letter, and finally decoded purely by machine. This is what information means in its technical sense. It might mean more, but any other meaning ascribed to it is speculative.
Knowledge
Knowledge decreases the information contained in a message.
Imagine two sets of twins, A,B and C,D, and suppose twins A,B know more than C,D. For example, A,B have come across the word 'oblong' in their reading, but C,D do not read much and have never seen the word. As in the example above, A,B might well have guessed how to end the word beginning 'obl' whereas C,D probably would not. Because of many such instances, coded messages between C and D need to be longer than those between A and B.
Thus the number of bits required, on average, to transmit messages will be smaller the more knowledge is possessed by the coder/decoder system. Therefore that number of bits, the information content of a message, is not a property of that message alone; a message contains less information the more knowledge is built into the system handling it!
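The twin scheme, and the effect of giving one pair of twins more knowledge than the other, can be imitated with a toy program. Everything in this sketch is an assumption made for illustration: the "knowledge" is just a short word list, the guessing rule (complete the first known word that fits, otherwise guess a space) is invented here, and the function names are hypothetical. It is far cruder than a human guesser, but coder and decoder share the same rule, which is all the scheme requires.

```python
def guess_next(prefix: str, vocabulary: list[str]) -> str:
    """Deterministic guess of the next character, given the text so far."""
    partial = prefix.rsplit(" ", 1)[-1]  # letters of the word in progress
    for word in vocabulary:
        if word.startswith(partial) and len(word) > len(partial):
            return word[len(partial)]
    return " "  # no longer known word fits: guess that the word has ended


def reduce_text(text: str, vocabulary: list[str]) -> str:
    """Coder: write '-' for each correct guess, else the actual character.

    The text itself must not contain '-'.
    """
    return "".join(
        "-" if guess_next(text[:i], vocabulary) == ch else ch
        for i, ch in enumerate(text)
    )


def restore_text(reduced: str, vocabulary: list[str]) -> str:
    """Decoder (the twin): repeat the same guesses to fill in the dashes."""
    text = ""
    for symbol in reduced:
        text += guess_next(text, vocabulary) if symbol == "-" else symbol
    return text


# Hypothetical vocabularies: twins A,B know "OBLONG"; twins C,D do not.
vocab_cd = ["THE", "A", "ON", "SMALL", "READING", "LAMP", "DESK"]
vocab_ab = vocab_cd + ["OBLONG"]

message = "A SMALL OBLONG READING LAMP ON THE DESK"
reduced_ab = reduce_text(message, vocab_ab)
reduced_cd = reduce_text(message, vocab_cd)

# Characters that had to be sent explicitly rather than guessed.
sent_ab = sum(ch != "-" for ch in reduced_ab)
sent_cd = sum(ch != "-" for ch in reduced_cd)
```

Each pair recovers the message exactly from its own reduced text, but the pair lacking the word must be sent four more letters of it explicitly. The same message thus carries more information for the less knowledgeable pair.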
Ordinary reading illustrates the same principle. The beginner laboriously reads syllable by syllable; later, reading is word by word; eventually, a reader can scan whole sentences at a time. Processing speed increases at each stage as a result of learning, that is, the addition of knowledge to the data processing system. The eye skips letters and words, treating the written material as if it were coded, reading the reduced message even though the full one is before it, and the mind in effect fills in the blanks using its embedded knowledge. Reading just the code is reading just the information. In the example above, the reader's visual system processed the 'obl' but not the 'ong'; it was saved the effort of unnecessary processing.
The greater the reader's knowledge, the less information in a message, for that reader. In fact, a reader who knows enough to already know the content of a message before it is sent receives no information from it when it is sent!
Randomization maximizes information as it minimizes meaning.
The fact that a reader receives information does not mean that meaning is received. Consider a message consisting entirely of the 26 alphabetical letters in random order: pure noise. As none of it could be skipped, the minimum number of bits needed to code it would be $\log_2 26 \approx 4.7$ per letter. That is also the maximum information per letter in a message made up of alphabetical letters. A message of complete noise has maximum information! It also has no meaning! Furthermore, none of it could be used to increase the knowledge of the system receiving it.
Without knowledge, all messages convey a maximum of information.
Consider a communications system designed for messages that could include random strings of letters. As noted, minimum-length coding would require block coding, which on average would consume $\log_2 26$ bits per letter. That would be the information per letter carried by a message.
This system has no knowledge concerning the probabilities of letters or of words in English because its designers have no expectation of encountering the correlations these would produce. Suppose, however, that only proper English is sent over this system. Although such messages have properties which could be used to code them with less than $\log_2 26$ bits per letter, their coding has been fixed in the system's design. The information carried per letter remains as it was designed to be.
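The difference between the two designs can be put in numbers. The sketch below is an illustration under stated assumptions: the sample is the opening of the passage from the guessing experiment with spaces removed, and the "knowledgeable" design is given the sample's own letter counts as a stand-in for knowledge of English letter frequencies, which is an optimistic simplification. Function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET_SIZE = 26  # the design assumes 26 equally likely letters


def fixed_design_bits(message: str) -> float:
    """Cost when every letter is coded as one of 26 equally likely symbols."""
    return len(message) * math.log2(ALPHABET_SIZE)


def frequency_aware_bits(message: str) -> float:
    """Cost for a design that knows how often each letter occurs (Equation 2)."""
    counts = Counter(message)
    total = len(message)
    return -sum(c * math.log2(c / total) for c in counts.values())


sample = "THEROOMWASNOTVERYLIGHTASMALLOBLONGREADINGLAMPONTHEDESK"
without_knowledge = fixed_design_bits(sample) / len(sample)
with_knowledge = frequency_aware_bits(sample) / len(sample)
```

The first design spends about 4.7 bits on every letter whatever is sent; the second spends fewer on the same text. The message is identical in both cases; only the knowledge built into the system differs.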
Knowledge is expressed through correlations, connections and structure.
The reader connected the 'ong' to the 'obl'; some sort of corresponding physical connection had to have been established in the brain. A set of such connections forms a structure. One example of such a structure is the neuronal circuits of the visual system (the templates) previously discussed. What other forms such structures take in the brain is at present unknown.
A connection between the parts of oblong led to one part being stimulated subsequent to the perception of the other. In visual circuits, on the other hand, all parts of the circuit work simultaneously to produce a single output signal representing, say, the perception of a line segment.
These physical structures represent knowledge, and their physical connections correspond to correlations in messages. The letters obl are commonly followed by ong in English; their appearances are correlated.
Knowledge, acting as a template, imposes a predetermined structure on incoming data.
Consider a straight edge: the simplest of physical templates. Input two points to a straight edge, and it fills in a straight line between them. The points, however, may actually be part of a curved line, or may be part of no line at all. Input certain signals from the retina to a line segment sensor in the visual system and a straight line segment, with its infinity of intermediate points is perceived--filled in by the mind. There may very well be no such line segments in physical reality (whatever that means!) at all; we create that reality.
Similarly, there is a mental template connecting obl with ong; that which was called up by the scanning of the former set of letters to fill in the latter. But ong may not be there. The word may be obligatory rather than oblong.
Luckily, the brain has mechanisms to correct such an error. If the first word assumed does not make sense (does not fit into a larger template, a larger knowledge structure), we re-scan. But what happens when, upon re-scanning, sense can still not be made of the message? If the message cannot fit in, if no sense can be made of it, it is discarded: as an error, as noise, or as nonsense.
All knowledge structures act in some degree as templates which impose their structure on the data they process, and in the final analysis, the brain (including especially one's world-view) completely circumscribes the information it receives.
These points are worth re-expression as follows.
The discussion of the visual system in the previous chapter started with the fact that the rate of bit flow from the retina is reduced by a factor of ten million. The brain has a capacity of about 50 bps, and the purpose of the visual system is to reduce the original data flow to the much smaller information flow: to extract information from data.
As noted, reduction is accomplished in two ways: by the elimination of irrelevant data, and by the elimination of redundant data. But how is irrelevance determined by the sensory system? Clearly, data that does not trigger a neuronal circuit does nothing; in effect, it is eliminated. Data that does trigger a circuit is relevant. These circuits constitute the knowledge of the visual world possessed by the visual system. Thus what we can learn is determined by what we already know, because the latter determines what information gets through to us. This is a fundamental principle of epistemology; it pertains to all learning! It tells us that everything we learn is biased by what we (think we) already know.
Intelligence is the ability to convert information into knowledge.
Consider a transmission system that contains a mechanism for recording and statistically analyzing the frequency of letters in the messages sent over it. To do so, it needs not only such a mechanism (such intelligence) but also to have individually recognized each letter as information. This latter condition is analogous to having an open mind.
If the system also has a mechanism to alter its coding to take advantage of the different frequencies of letters it senses, it will thereby increase its knowledge and decrease the load that messages put on its data processing capacity. Thus, learning increases the rate at which data can be processed by a system without physically increasing its size. Intelligence is the ability to learn from information in order to convert it to knowledge, thereby increasing the rate at which data may be processed to extract knowledge. Intelligence acts as the catalyst that enables knowledge to increase, thereby also accelerating its rate of increase.
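Such a learning system can be sketched in idealized form. The assumptions here are several and should be read as such: the system starts with no preference among the four letters of the stationery example, it counts each letter as it passes and re-estimates the probabilities after every one, and each letter is charged the ideal cost of $-\log_2 p$ bits under the current estimate (a cost that practical coders can approach but that this sketch does not actually implement as a bit string). The message and function names are hypothetical.

```python
import math
from collections import Counter


def fixed_code_bits(message: str, alphabet: str) -> float:
    """Cost when the code never changes: log2(alphabet size) per letter."""
    return len(message) * math.log2(len(alphabet))


def adaptive_code_bits(message: str, alphabet: str) -> float:
    """Ideal cost when letter frequencies are learned while transmitting."""
    counts = Counter({symbol: 1 for symbol in alphabet})  # no initial preference
    total = len(alphabet)
    bits = 0.0
    for symbol in message:
        bits += -math.log2(counts[symbol] / total)  # cost under current knowledge
        counts[symbol] += 1  # learn from the letter just seen
        total += 1
    return bits


orders = "aaaaaaaaab" * 20  # hypothetical order stream, mostly paper
cost_fixed = fixed_code_bits(orders, "abcd")
cost_adaptive = adaptive_code_bits(orders, "abcd")
```

A decoder that keeps the same counts stays in step with the coder, just as the twins do. For this stream the learning system needs well under half the bits of the fixed one, and the saving grows as more of the stream is seen.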
Information need not be useful.
A string of completely random letters represents a maximum of information that is generally completely useless. Before proceeding with the idea of utility, however, it is important to emphasize the word 'generally'; random strings are not necessarily useless. On the one hand, a string determined by the flipping of a coin is certainly useless. On the other hand, the digits of a transcendental number like pi are random in the sense that they satisfy all the statistical tests of randomness, but they are certainly not useless. But I will ignore such deep questions for the time being.
The brain learns in at least two ways. Most obviously, it learns at the conscious level, at which people do in fact gather data and construct theories, as in the practice of science. This is analogous to the example discussed above in which there is a mechanism built into a system that can alter its coding based on information it accumulates.
In contrast to this there are the processes of nerve growth by which we learn to recognize figures, sounds, and so on. These processes are unconscious and generally unknown. A reasonable working hypothesis is that they operate by trial and error, perhaps as follows: a small change is made (a connection made, a nerve grown) and if this further reduces the incoming data, decreasing the redundant and ultimately irrelevant bits (noise) passed on to the brain, the change stays; otherwise it is allowed to wither.
The purpose of this model is merely to bring out one idea that is probably true independent of its particulars (none of which is probably true), and that is that in a growth process, it is immediate utility that counts. Information is useful to a system that changes incrementally by growth only if it produces an immediate effect, that is, only if the system immediately converts information into knowledge.
Philosophy increases knowledge by increasing the connections between our most basic concepts.
The philosopher seeking definitions, axioms, and logical deductions is making connections between verbal concepts. An increase in the number of connections creates an increase in knowledge. Philosophy is distinguished from other forms of knowledge increase by its focus on the most basic verbal concepts.

Figure 1. A fanciful picture of knowledge structure
What does 'basic' mean? At least two answers are possible, both probably correct. They can be understood by analogy with the help of the structure shown in Figure 1. Each circle represents a concept (itself some sort of structure). The connections between concepts give rise to the structure that is knowledge. Circles near the edge represent concepts close to experience, for instance that of a particular object. The two blackened circles represent examples of more basic concepts. One is more basic because of its many connections (such as green-ness, beauty, ...), the other because of its greater abstraction, meaning its more indirect connection to experience (such as good-ness, truth, ...). Both types of basic concepts have, on average, closer connections with other concepts than do concepts on the periphery of knowledge.
A philosopher would be a person who is interested in connecting these more basic concepts (as in Keats's "Truth is Beauty, and Beauty, Truth"). One reason for doing so, for increasing knowledge in this way, has now been described in terms of data processing. For suppose your worldview included a connection between truth and beauty, and a concept was presented to you as being true but did not appear beautiful. You would reject it; it would not be accepted as knowledge but rejected as error, in the same way as your sensory system rejects certain data as noise because it does not fit in with the template structure of your already accepted knowledge.
Because of the many connections of basic concepts, connections between them have greater effect than other connections. They lead to the creation of many closer indirect connections. Their effect on knowledge is correspondingly greater. We shall see in more detail how this works in both philosophy and science.