As you know, the English alphabet is far from being a regular and consistent system for representing all the sounds of English. For instance, think of the letter group "ough". How many different ways can it sound?

Word       Rhymes with (in Standard American dialect)
through    true
though     go
cough      off
thought    not
tough      stuff

As you can see, "ough" can produce a myriad of sounds, seemingly at random. In addition, these endings may rhyme differently in other dialects of English. Therefore linguists cannot rely on such a whimsical system to represent the sounds of a language scientifically. The solution was the creation of symbols explicitly designed to represent all sounds that humans can produce. We call such systems "Phonetic Alphabets".

Unfortunately, consensus is the last thing linguists have among them, and consequently several systems exist. The most famous one is the International Phonetic Alphabet or IPA, but the American Phonetic Alphabet is also quite widespread. I have chosen to adhere to the American system on this page because that's what I was taught. If you are familiar with the IPA there shouldn't really be any problems once you learn which symbols correspond to each other in the two systems.

The following are some of the signs of the American phonetic system. When used for transcription, sounds are put inside square brackets, i.e. [ ]. Related and similar sounds in a language often occur in complementary distribution, that is, each of these sounds appears only in its own set of environments. For example, in English, the "t" in "top" sounds different from that in "stop". However, the "t"-sound in "stop" (which is less powerful than the "t" at the beginning of a word) only occurs after an "s" sound, while the "t" in "top" occurs everywhere else, and therefore these two sounds are in complementary distribution. We call this set of sounds a phoneme, and write it between two slashes, i.e. / /.

Formally, /t/ becomes [t] after [s], and becomes [tʰ] everywhere else. The superscript h means that the consonant before it is produced with a little more air.
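For readers who like to see a rule spelled out step by step, here is a minimal Python sketch of the statement above. It follows the simplified rule exactly as given here (plain [t] after [s], aspirated [tʰ] everywhere else) and nothing more. The function name `realize_t` and the choice to represent a word as a list of phoneme symbols are my own assumptions for illustration, not part of any standard notation.

```python
def realize_t(phonemes):
    """Turn a list of phoneme symbols into phones, applying only the /t/ rule.

    Hypothetical helper: /t/ is written "t" and the aspirated phone "tʰ".
    Every other symbol is copied through unchanged.
    """
    phones = []
    for i, symbol in enumerate(phonemes):
        if symbol == "t":
            # The environment that matters is the sound right before /t/.
            after_s = i > 0 and phonemes[i - 1] == "s"
            phones.append("t" if after_s else "tʰ")
        else:
            phones.append(symbol)
    return phones


if __name__ == "__main__":
    print(realize_t(["s", "t", "a", "p"]))  # "stop": /t/ follows /s/
    print(realize_t(["t", "a", "p"]))       # "top": /t/ starts the word
```

Because each environment gives exactly one result, the two phones never compete for the same position, which is what complementary distribution means.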


Some important points:

  • V+ denotes "voiced", and V- denotes "voiceless". These terms simply describe whether the vocal cords vibrate while you make a sound. If you put your hand on your throat and alternate between saying "cod" and "god", you'll notice that "god" makes your vocal cords (in your larynx) vibrate more. This vibration is what we call voicing.
  • [p], [t], and [k] are unaspirated. For people who know Spanish well, they correspond to the sounds in 'pelo', 'té', and 'cosa'. In English such sounds do not occur at the beginning of a word, but mostly after the consonant [s], such as in 'space'. Compare 'space' and 'pace', and you'll notice how the /p/ in 'pace' is stronger.
  • As just mentioned, the sounds /p/, /t/, and /k/ in English are aspirated when they occur at the beginning of a word, meaning that more air is pushed out. In linguistics they are transcribed as [pʰ], [tʰ], and [kʰ]. You may think that it is impossible to have aspirated /b/, /d/, and /g/, but Proto-Indo-European and the Indic languages have them (as in the name of the great Indian epic Mahabharata).
  • The columns on the chart refer to points of articulation, that is, places in your mouth where sounds are produced. Bilabial means both of your lips come together, and the sound comes out there (you can feel the vibration between your lips if you try). Labio-dental means your lower lip touches your upper teeth. Inter-dental sounds are relatively rare in the world; to make one you put your tongue between your two rows of teeth.
  • Apico-alveolar means putting the tip of your tongue right behind your upper row of teeth. Apico-palatal sounds are also called Retroflex. They are pronounced like the Apico-alveolar sounds except with your tongue curled back a little. The most common example for an American English speaker is the 'r' in "road". Retroflex /d/ and /t/ occur in Indian languages (both Indo-European and Dravidian).
  • Lamino-palatals are very much like apico-palatals, but instead of the tip of your tongue being the highest point, the blade (the part behind the tip) almost touches the roof of your mouth.
  • Dorso-velar, or just velar, sounds are produced between the back of your tongue and the back of your palate. Their cousins, the Uvular sounds, make your uvula vibrate, like the Parisian French /r/.
  • Glottal simply means your larynx.
  • The categories that form the bold rows refer to the type of articulation. Stops are sounds that are maintained for a very short amount of time. You can't stretch them no matter how hard you try. On the other hand, Fricatives can persist for as long as you have breath. Compare /t/ and /s/.
  • Sometimes you can merge stops and fricatives to get Affricates, which start as a stop and turn into a fricative. The /ch/ in English "church" is one example of an affricate. It starts as a /t/, and turns into a /sh/ sound.
  • Nasals are, well, nasal. They make your sinuses vibrate.
  • I have no idea why Liquids are called liquids. The voiced apico-palatal liquid /r/ occurs in American English "red" and the voiced apico-alveolar liquid /l/ is like in English "lock", not "table".
  • The flap is the Spanish short /r/, i.e. the one in "toro". It also occurs in Italian, Japanese, and American English, in the last case as the /dd/ in "ladder" or the /tt/ in "butter" when said rapidly.
  • Semi-vowels are really vowels that appear as the less-powerful part of a diphthong. In other words, they are non-syllabic vowels.
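Since every consonant on the chart is pinned down by three labels (voicing, point of articulation, type of articulation), the chart can be thought of as a small lookup table. The Python sketch below is only an illustration of that idea. The class name `Consonant`, the function `find`, and the handful of entries are my own choices for the example; only /r/ and /l/ are placed exactly as described in the list above, and the other six are filled in from their usual descriptions, not copied from the chart.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Consonant:
    symbol: str
    voiced: bool   # True for V+, False for V-
    place: str     # point of articulation (a chart column)
    manner: str    # type of articulation (a chart row)


# A small, hypothetical sample of the chart, not the whole thing.
CONSONANTS = [
    Consonant("p", False, "bilabial", "stop"),
    Consonant("b", True, "bilabial", "stop"),
    Consonant("t", False, "apico-alveolar", "stop"),
    Consonant("s", False, "apico-alveolar", "fricative"),
    Consonant("k", False, "dorso-velar", "stop"),
    Consonant("g", True, "dorso-velar", "stop"),
    Consonant("l", True, "apico-alveolar", "liquid"),
    Consonant("r", True, "apico-palatal", "liquid"),
]


def find(voiced=None, place=None, manner=None):
    """Return the symbols that match every feature that was given."""
    return [
        c.symbol
        for c in CONSONANTS
        if (voiced is None or c.voiced == voiced)
        and (place is None or c.place == place)
        and (manner is None or c.manner == manner)
    ]


if __name__ == "__main__":
    print(find(voiced=False, manner="stop"))  # the voiceless stops
    print(find(place="apico-alveolar"))       # everything made behind the upper teeth
```

Asking for the voiceless stops gives back p, t, and k, the same three sounds that are aspirated at the beginning of an English word.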


Even though the symbols look like English letters, don't be tempted to pronounce them as if they were. For instance, the symbol [i] really sounds like the 'ee' in "reed". The symbol [e] doesn't sound like the 'e' in 'be', but more like the first vowel in French 'être'.

When you say a vowel, you unconsciously move your tongue and lips into a unique configuration characterized by the following attributes:

  • Unrounded vs rounded. This feature applies to your lips. If you say [u] as in "room", you'll notice that your lips form a circle and you look like you're about to kiss someone. On the other hand, if you say [i] as in "feet" your lips are straight. That's why, before you take a picture in America, you tell the people you're about to capture on film to say "cheese": [i] makes the lips look like they are smiling.
  • High to low. You probably never noticed this, but when you say a vowel, part of your tongue rises toward the roof of your mouth while other parts stay near the bottom. The height of your tongue's peak determines the vowel you say. The sound [i] as in "feet" forces your tongue higher up than, say, the sound [a] as in "father".
  • Front, central, and back. The same peak that I just described can also change its position in your mouth. When the peak is closest to your teeth, it is in front. Toward the throat is back. Between the two is, obviously, central. With [i], the peak of the tongue is a little bit behind your teeth, while with [u] the peak of the tongue is at the back of your mouth, near where the hard palate changes to the soft palate. If you can't picture it, try feeling around with your finger.
  • Vowels can be long or short. A long vowel is denoted by a colon (:) after the vowel. The best example in English of long vs short can be found in cases like "sad" (long) and "sat" (short). Notice how the 'a' (phonetically [æ]) sounds longer in "sad" than in "sat". So, "sad" is transcribed as [sæ:d] while "sat" is [sæt].
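The colon convention for length is simple enough to read mechanically. Here is a minimal Python sketch that splits a transcription into sounds and marks which ones are long. It assumes the square brackets have already been removed and that every sound is written with a single character, which holds for the two examples above but not for transcriptions in general. The function name `split_length` is made up for this illustration.

```python
def split_length(transcription):
    """Split a bracket-free transcription into (symbol, is_long) pairs.

    Hypothetical helper: a colon marks the sound just before it as long.
    """
    segments = []
    for ch in transcription:
        if ch == ":":
            if not segments:
                raise ValueError("a length mark must follow a sound")
            symbol, _ = segments[-1]
            segments[-1] = (symbol, True)  # lengthen the previous sound
        else:
            segments.append((ch, False))
    return segments


if __name__ == "__main__":
    print(split_length("sæ:d"))  # "sad", with the long vowel
    print(split_length("sæt"))   # "sat", with the short vowel
```

The two results differ in the flag on [æ] and in the final consonant, which is the contrast between "sad" and "sat" described above.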


In many languages of the world, tone plays an important role in distinguishing one morpheme from another.

Notice that tone isn't the same as stress or intonation. All of these involve changes in the pitch of the voice. Stress, sometimes also known as accent, is the rise and fall of the pitch throughout the syllables of a word. In English, there is usually one highest stress in a word, as in "kéyboard" or "exáct", but in some cases two stresses occur, one higher than the other, as in "singularity". Intonation is the rise and fall of the pitch throughout the words of a sentence. Notice how the statement "You are sick" sounds different from the question "You are sick?" In the statement, the words have more or less even pitches with respect to each other. On the other hand, the question's pitch peaks at the adjective "sick". Both contrast with an exclamation like "You are sick!", which places the highest pitches on "You" and "sick".

Tone is somewhat like stress in that it is also the rise and fall of the pitch throughout a word. However, tone is used to distinguish words that have the same sounds but may have unrelated meanings, while stress is not. (Actually, in a few cases, stress does serve to distinguish different meanings or versions of the same word, but never as consistently as tone does.)

Furthermore, the beginning pitch and the ending pitch of a tone are central to distinguishing words. A slightly different beginning or ending pitch means a different word. On the other hand, the highest point in a stress can be any degree of pitch above the unstressed syllables. The difference doesn't matter as long as the stress rises above the other syllables.

There are several ways of representing tones in Romanization. Pinyin (for transcribing Mandarin) and Vietnamese use diacritics. Some phonetic transcriptions use single-digit numbers. So 1 in Cantonese is the high falling tone, 2 is the low falling tone, and so on. Neither system directly indicates the shape of the tone.

There are two other systems that do directly illustrate the tonal change. One uses a vertical bar to denote a scale, and horizontal or diagonal lines to represent the change in pitch.

The best system that I have seen is a two-digit number, with each digit ranging from 1 to 5. The first (leftmost) digit is the starting pitch, and the second (rightmost) digit is the ending pitch. Together, they tell you at which pitch to start and at which to end.
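To show how directly this notation can be read, here is a small Python sketch that takes a two-digit tone number and reports the starting pitch, the ending pitch, and the overall shape. It assumes that 5 is the highest pitch and 1 the lowest, which is how the tone numbers on this page are used (the high falling tone is written 53). The function name `describe_tone` and the labels "rising", "falling", and "level" are my own choices for this illustration.

```python
def describe_tone(code):
    """Read a two-digit tone number such as "53".

    Returns (start, end, contour). Hypothetical helper; assumes pitch
    levels run from 1 (lowest) to 5 (highest).
    """
    if len(code) != 2 or any(ch not in "12345" for ch in code):
        raise ValueError("expected two digits, each from 1 to 5")
    start, end = int(code[0]), int(code[1])
    if end > start:
        contour = "rising"
    elif end < start:
        contour = "falling"
    else:
        contour = "level"
    return start, end, contour


if __name__ == "__main__":
    for code in ("53", "13", "33"):
        print(code, describe_tone(code))
```

A single-digit tone label needs a lookup table before it tells you anything, while here the shape of the tone falls out of comparing the two digits.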

Since I am a native speaker of Cantonese, I'll use its tonal system for demonstration. In traditional Cantonese, there are 9 basic tones, but in my dialect (Hong Kong) the high rising and low rising tones have become indistinguishable. Also, the high falling tone has become very similar to the high-level tone (which doesn't technically exist in Cantonese but can be found in Mandarin). I will try to reproduce all the distinguishing details in these tones, but don't take my pronunciation as canonical. The rest are relatively close to reality.

Description     Example                                  Sounds
High falling    [ma53] "mother"                          AU | WAV
Low falling     [ma31] "sesame; hemp"                    AU | WAV
High rising     [ma35] ???                               AU | WAV
Low rising      [ma13] "horse"                           AU | WAV
Mid level       [ma33] "question marker"                 AU | WAV
Low level       [ma11] "to scold"                        AU | WAV
High short      [pok55] "to hit" (quite onomatopoeic)    AU | WAV
Mid short       [pok44] "to struggle (restlessly)"       AU | WAV
Low short       [pok22] "thin"                           AU | WAV