Archive for the 'language' Category

I know, I know … all I ever write about these days are me and my braces. “What a boring blog, roscivs,” I can hear you all say. “Where are the endless discussions of the nuances of copyright? Where are the random riffs on economics? Where are the random yet interesting YouTube videos?”

Well, those things have taken a back seat in my life to braces. (But fear not, interesting linguistical tidbits will be found in this braces-related post nonetheless.) Fortunately, not everything in my life has been replaced by a fixation on braces. In fact, my eating habits have nearly returned to normal. This last week I started eating English muffins again, and yesterday I even ate a hamburger! Carefully, I admit, and with no bacon or pickles, but I successfully chewed through a hamburger and fries with no pain or discomfort whatsoever. Normality ensues!

In my Japanese conversation exchange last week, I learned all sorts of braces-related words, like “teeth”, “braces”, “dentist”, “orthodontist”—and a very interesting Japanese word, “歯軋り”, or ha-gi-shi-ri. It means, according to my dictionary, “involuntary nocturnal tooth grinding”.

My conversation partner asked me to explain why I got braces. In English, the explanation is a little lengthy. “Well, my teeth interfere with each other, and so I grind my teeth at night, which causes problems, so I figured if I got my teeth straightened out it would help with that.”

In Japanese, the sentence is basically, “ha-gi-shi-ri-de-su”. Because of involuntary nocturnal tooth grinding. How convenient! The Japanese have a word for it indeed!

Shades of Sapir-Whorf, eh? Does language alter the way we think? Are Japanese people more easily able to express these sorts of concepts because of the way their language is shaped?

As I’ve discussed many times, I think the causal chain is reversed—it is because the Japanese more often discuss these sorts of concepts that they’ve shaped their language in such a way that makes it easy. (My conversation partner hypothesized that this is because Americans are able to release their stress in ways other than gnashing of the teeth, whereas the Japanese tend to keep their stress bottled up inside. I can’t speak to the accuracy of this theory, but it seems interesting.)

In support of the “culture-shapes-language, not vice-versa” argument, I present to you the word bruxism. If I’m not mistaken, it is a word nearly identical in meaning to 歯軋り. So, in theory, my English explanation could be just as succinct as my Japanese explanation. Why did you get braces? “Oh, because of bruxism.”

But most people wouldn’t understand what I was saying. They either wouldn’t recognize the word, or wouldn’t recognize the further context that such a statement implied (e.g. the other medical problems that such a condition causes). Why is that? Why would a Japanese person understand a wealth of information behind the word 歯軋り, but not an American with bruxism? It’s not because there’s something lacking in the language. The word is there, ready to be used. The difference is in the culture, in how often the word is used, and in what contexts the word is used. If suddenly it became an important concept culturally, the English language would be fully “prepared to do its work” in conveying those ideas clearly and succinctly.

Humans like classifying things. There’s no doubt about it. We often end up putting things into categories without even realizing it. Whether it’s music, colors, movies, food, or pretty much anything else we interact with on a daily basis, we’re consciously or subconsciously attempting to slice and dice the world around us into discrete, individual groups.

There’s a very good reason for this. I didn’t really understand it until I took a class on artificial intelligence—which is, at its core, the task of trying to get a computer to do the same sort of thing. Many people have a very romantic view of artificial intelligence, but what most people in the field are actually doing is simply taking a lot of data and trying to get a computer to classify it correctly. The usual name for this is “machine learning”—amusing, since at first glance the ideas of “learning” and “classification” seem to be rather unrelated.

But the thing that I discovered in this class is that, if you classify everything perfectly—that is to say, every different thing is in its own separate bucket—then there’s no way to make any sort of inferences or predictions about things you haven’t seen before.

For example, let’s say I know that white stones, when thrown at a window, will break said window, and I know that black stones will also break windows—but I’ve never seen a green stone before. How do I know if a green stone will also break a window? I have to make some sort of classification or generalization based on what I’ve seen before. White stones and black stones are both part of the “small, hard thing” category, and things in that classification break windows. If I see a green stone, and can correctly classify it as a “small, hard thing” based on its attributes, I have successfully learned something additional about green stones that wasn’t present in the data I was originally given. Green stones break windows!
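The stone example can be sketched in a few lines of Python. Everything here is invented for illustration: the attribute tuples, the `category` rule, and the two observations. The sketch only contrasts giving every item its own bucket with grouping items by shared attributes.

```python
# Hypothetical observations: (color, size, hardness) -> did it break the window?
observations = {
    ("white", "small", "hard"): True,
    ("black", "small", "hard"): True,
}

def predict_by_memorizing(item):
    """Every item is its own bucket, so unseen items get no prediction."""
    return observations.get(item)  # None means "no idea"

def category(item):
    """Ignore color and group items by size and hardness only."""
    _color, size, hardness = item
    return (size, hardness)

def predict_by_category(item):
    """Predict from any previously seen item in the same category."""
    for seen, broke_window in observations.items():
        if category(seen) == category(item):
            return broke_window
    return None

green_stone = ("green", "small", "hard")
memorized = predict_by_memorizing(green_stone)
generalized = predict_by_category(green_stone)
print(memorized, generalized)
```

With perfect one-bucket-per-item classification the green stone gets no answer at all; once color is ignored, it lands in the same bucket as the white and black stones and inherits their behavior.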

Unfortunately, whenever you make these sorts of predictions or inferences, you may get things wrong some of the time. If the only things you’ve seen break windows are green stones, green metal, and green-painted bricks, then you might mistakenly think that a green feather will break windows too. Similarly, if every Frenchman you’ve met has been rude and snotty, then you might mistakenly think that all Frenchmen are conceited and haughty. This is an unfortunate side effect of classification—but the alternative (to treat every person you meet as a completely separate individual, in their own group) is to never learn anything about people you’ve never met based on people you have met who are similar to them.
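The green-feather mistake is easy to reproduce. In this hypothetical Python sketch, the learner has only ever seen green things break windows, so it generalizes on color, the one attribute its examples share. The data and the majority-vote rule are assumptions made up for the example.

```python
# Hypothetical observations: (color, kind) -> did it break the window?
seen = [
    (("green", "stone"), True),
    (("green", "metal"), True),
    (("green", "brick"), True),
]

def predict_by_color(item):
    """Generalize on color alone, the only feature the seen items share."""
    color, _kind = item
    outcomes = [broke for (c, _k), broke in seen if c == color]
    if not outcomes:
        return None  # never seen anything of this color
    # Majority vote among same-colored things seen so far.
    return sum(outcomes) * 2 >= len(outcomes)

feather_prediction = predict_by_color(("green", "feather"))
print(feather_prediction)
```

The learner confidently predicts that the feather breaks the window. Nothing is wrong with the procedure itself; the examples simply never gave it a reason to look at anything other than color.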

There’s another major difficulty with our human tendency towards classification. The real world isn’t discrete; it’s continuous. While some songs are obviously “rap” and others are obviously “pop”, some are in between. While some hues are obviously blue and others are obviously green, some are more ambiguous. Is this film a comedy or is it a drama? Are tomatoes fruits or vegetables?

Recently I heard a discussion of whether “Mexican” ought to be considered a separate language from “Spanish”, or whether it was simply a dialect. Unfortunately, there are no linguistic criteria for distinguishing between a language and a dialect. There is a saying, “A language is a dialect with an army and navy”. Why can’t linguists draw some sort of line, saying, “Everything past this point is a separate language”?

The fact of the matter is that the language-dialect spectrum is a continuum. It’s much the same as with biological species. Most intermediate forms have died out, so to the classification-happy human brain, there appear to be distinct categories into which we can place “separate languages”, just as we can place “separate species”. But then there are exceptions like Larus gulls and Ensatina salamanders for species, or Arabic for languages.

So is Mexican a different language from Spanish? Is American a different language from English? Are either on their way towards divergence into separate languages?

It simply depends on how you define your classifications, a totally arbitrary concept imposed upon a non-discrete continuum.

(From Gödel, Escher, Bach, by Douglas Hofstadter.)

[In ordinary prose … problems of translation do occur]. Suppose you are translating a novel from Russian to English, and come across a sentence whose literal translation is, “She had a bowl of borscht.” Now perhaps many of your readers will have no idea what borscht is. You could attempt to replace it by the “corresponding” item in their culture—thus, your translation might run, “She had a bowl of Campbell’s Soup.” Now if you think this is a silly exaggeration, take a look at the first sentence of Dostoevsky’s novel Crime and Punishment in Russian and then a few different English translations. I happened to look at three different English paperback translations, and found the following curious situation.

The first sentence employs the street name “S. Pereulok” (as transliterated). What is the meaning of this? A careful reader of Dostoevsky’s work who knows Leningrad (which used to be called “St. Petersburg”—or should I say “Petrograd?”) can discover by doing some careful checking of the rest of the geography in the book (which incidentally is also given only by its initials) that the street must be “Stoliarny Pereulok”. Dostoevsky probably wished to tell his story in a realistic way, yet not so realistically that people would take literally the addresses at which crimes and other events were supposed to have occurred. In any case, we have a translation problem; or to be more precise, we have several translation problems, on several different levels.

First of all, should we keep the initial so as to reproduce the aura of semi-mystery which appears already in this first sentence of the book? We would get “S. Lane” (“lane” being the standard translation of “pereulok”). None of the three translators took this tack. However, one chose to write “S. Place”. The translation of Crime and Punishment which I read in high school took a similar option. I will never forget the disoriented feeling I experienced when I began reading the novel and encountered these streets with only letters for names. I had some sort of intangible malaise about the beginning of the book; I was sure that I was missing something essential, and yet I didn’t know what it was … I decided that all Russian novels were very weird.

Now we could be frank with the reader (who, it may be assumed, probably won’t have the slightest idea whether the street is real or fictitious anyway!) and give him the advantage of our modern scholarship, writing “Stoliarny Lane” (or “Place”). This was the choice of translator number 2, who gave the translation as “Stoliarny Place”.

What about number 3? This is the most interesting of all. This translation says “Carpenter’s Lane”. And why not, indeed? After all, “stoliar” means “carpenter” and “ny” is an adjectival ending. So now we might imagine ourselves in London, not Petrograd, and in the midst of a situation invented by Dickens, not Dostoevsky. Is that what we want? Perhaps we should just read a novel by Dickens instead, with the justification that it is “the corresponding work in English”. When viewed on a sufficiently high level, it is a “translation” of the Dostoevsky novel—in fact, the best possible one! Who needs Dostoevsky!

We have come all the way from attempts at great literary fidelity to the author’s style, to high-level translations of flavor. Now if this happens already in the first sentence, can you imagine how it must go on in the rest of the book? What about the point where a German landlady begins shouting in her German-style Russian? How do you translate broken Russian spoken with a German accent, into English?

Then one may also consider the problems of how to translate slang and colloquial modes of expression. Should one search for an “analogous” phrase, or should one settle for a word-by-word translation? If you search for an analogous phrase, then you run the risk of committing a “Campbell’s soup” sort of blunder; but if you translate every idiomatic phrase word by word, then the English will sound alien. Perhaps this is desirable, since the Russian culture is an alien one to speakers of English. But a speaker of English who reads such a translation will constantly be experiencing, thanks to the unusual turns of phrase, a sense—an artificial sense—of strangeness, which was not intended by the author, and which is not experienced by readers of the Russian original.

Problems such as these give one pause in considering such statements as this one, made by Warren Weaver, one of the first advocates of translation by computer, in the late 1940s: “When I look at an article in Russian, I say, ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’” Weaver’s remark simply cannot be taken literally; it must rather be considered a provocative way of saying that there is an objectively describable meaning hidden in the symbols, or at least something pretty close to objective; therefore, there would be no reason to suppose a computer could not ferret it out, if sufficiently well programmed.

Sealionii brings us a bit of doggerel macaronic verse:

O sibili siemgo
Fortibus es inaro.
O nobili demis trux
Watis inem? Causand dux.

This little bit of faux-Latin has brought me no end of amusement. A bit of Googling reveals hundreds of different variations—the poem is apparently quite old. Google Books has a scanned copy of a publication from 1892 containing a variant quoted from the New York Tribune: “I Sabilli Hoeres ago / Fortibus es in : Aro / Nosces Mari the be trux / Votis innem . . . pes an dux.” I can only imagine that if this clever little rhyme had made it to the New York Tribune by 1892, it must be considerably older than that. The oldest reference I can find is from something published by Oxford University Press in 1849—over a hundred and fifty years ago.

The relative ancientness of it, and its many extant variations, make it quite a rich source for linguistical curiosities. The only line that remains consistent appears to be “fortibus es”, although even that has variations (such as fortubus es and fortebus es, as well as non-bus related variations), making it difficult to find all the different possibilities. But simply from the first ten Google hits for fortibus es we find quite an amazing number of differences.

In the first line, we find “Civili”, “Civile”, “Sybilli”, “si vile”, “Isabili”, “sibile”, “Sybili”, and “sibili”—some with the vocative “O” and some without. “Si ergo” is the most common continuation, although “si ego” is also relatively common. Sealionii is the only hit for “siemgo”, although there is a “sidemgo” and also “heres ago”, which is closer to the much older versions.

After “fortibus es” we find “in ero”, “in aero”, “in aro”, “enero”, and “inero”. That line, as I mentioned earlier, stays the most consistent among all the versions.

The next line has some curious variations, however. The 1849 version reads, “No ces billi themis trux,” whereas the 1892 version has Mary as the contradictory party. All the first-page Google hits had either “nobili” or “novili” (again with or without the vocative), but no extraneous “says” in between. The latter half of the line has plenty of variations, though—while “trux” remains constant, the word before can be “demis”, “doser”, “deus”, “themis”, “deis”, or “thebi”. Even curiouser is an extra bit in some versions in between the address to Billy/Willy and the correction: “doser nobus” and “deus nobuses es”.

The final line is the curiousest still. The question ranges from “watis inem” to “vadis inem”, “vatis inem”, “vattis inem”, “sivat sinem”, “vadis indem”, “vatis enim”, “vates enim”, “se vatis enim”, and “vatis indem”. The final answer, despite “peas and ducks” being the climax of the older versions, is almost always cows and ducks, although various spellings are of course present: “causand dux”, “causen dux”, “causet dux”, “causan dux”, “causem dux”, and “causa an dux”. (I guess cows somehow make more sense than peas?) There are only two exceptions: one hit which didn’t include the fourth line at all, and one “pax a dux”, which I can only assume is meant to be read as “packs of ducks”.

But the bizarre bit comes in the “English” reading supplied for the last line. In two of the ten hits, the poster of the verse describes the last line as “geese and ducks”. This led me to discover yet another series of variations, seemingly morphed into Spanish and almost all, even stranger, talking about “lorries” rather than “buses”. What makes Señor go together with lorries? I don’t know, but Google says they do.

Si senor, der dago
Forti loris inaro
Demant loris, demam trux
Fulla cowsan ensan dux.

Last week, I saw an article on Slashdot about languages going extinct. Somewhat surprisingly, the majority opinion seemed to be that this was a good thing. A rather representative comment stated:

A language is just a communication protocol. Would you say that having 7000 incompatible networking protocols is a good thing? No, it patently isn’t. Thousands of incompatible languages simply help create pockets of ignorance and deprivation.

Another poster points out that people can take advantage of the ignorance that comes from “incompatible languages”:

If someone in, say, America were to tell you that the Canadians as a whole are preaching holy Jihad upon the infidel Americans, everyone would just call him nuts. There are maybe millions of people who live close to the border or travel across the border, and can tell you relatively first hand what the Canadians actually say. Or if not, you can just order a newspaper and read for yourself what they do say. Even if they were to manage to find one nutcase preaching holy war, everyone would point out just that: it’s just one idiot that no one else takes seriously.

Now try Americans vs Arabs, Arabs vs Jews, or whatever other manipulation across a language barrier. Now that works much better, doesn’t it? You can cherry-pick which extremists (on both sides) to translate out of context, to make it sound like a whole language or ethnic group is hell-bent on wiping you off the face of the Earth. (Never mind that no group that size ever agreed on anything else, for as long as we have a recorded history.)

Perhaps this is one of the reasons for the Italian expression, “traduttore, traditore”—translator, traitor.

But this perspective was by no means the only one. There were also quite a few posters shocked at the prevalent opinion that the fewer languages the better. “Not all languages are equally expressive,” asserted one poster. “There are some things you just cannot say in certain languages because they lack the constructs and idioms. … Various words just have no real translation.”

Anyone who has spoken multiple languages won’t find this idea difficult to swallow. It is quite common to find words or phrases that are difficult to translate from one language to another; speakers of multiple languages will often switch back and forth between languages to more richly express the concepts they want to talk about. I frequently use the Afrikaans word snaaks, which means “funny”—but not “ha ha” funny. The English word is ambiguous, but the Afrikaans word is not.

Does this mean Afrikaans is more expressive? What if I want to be more ambiguous? A different Slashdot thread on machine translation describes a Japanese poem that is “written to be gender-ambiguous and person-ambiguous,” something that happens by default in Japanese, but is difficult or impossible to write in English.


But this discussion of languages seems bizarre in the light of one of linguistics’ foundational claims, that “all languages are equal in their communicative and expressive abilities”[1]. Or, in the words of Edward Sapir, “The outstanding fact about any language is its formal completeness … To put this … in somewhat different words, we may say that a language is so constructed that no matter what any speaker of it may desire to communicate … the language is prepared to do his work.”[2]

How do we reconcile these two different perspectives? I think the mistake in thinking that “not all languages are equally expressive” is one of confusing cause and effect. The lack of a word (e.g. snaaks) doesn’t prevent you from expressing the concept (“funny, but not ‘ha ha’ funny”) at all. And if the concept is important enough in the culture where that language is used, the longer phrase will, over time, become shortened into a single word.

Utahraptor in Dinosaur Comics gives a good example: “As things become more prominent, they move to become words. Like ‘electronic mail’ becoming ‘e-mail’ and finally ‘email’—that was due to email becoming more popular, not because people were creating the word in order to MAKE it more popular. You know?”

Douglas Hofstadter gives a cogent example in a single language, highlighting the distinction: “people who grow up in a rural area are more aware of, say, the difference between a pickup and a truck, than a city dweller is. A city dweller may call them both ‘trucks.’ It is not the difference in the native language, but the difference in culture (or subculture), that gives rise to this perceptual difference.” The country mouse and the city mouse are speaking the same language—languages equal in their communicative and expressive abilities—and yet if you hear the city mouse say “truck,” it’s more ambiguous what he’s referring to.

I believe you can think of all the world’s languages being equivalent in a similar way. English has the ability to express gender-ambiguity and person-ambiguity just as it has the ability to express truck/pickup ambiguity, but we don’t use words that way because it’s not important to us culturally. English speakers are the country bumpkin to Japan’s city slicker in that respect.

If there is an apparent “hole” in the language—things you seemingly cannot say because it’s missing certain constructs or idioms—it is for the same reason that English as spoken in different regions or different countries has words or phrases that the same language spoken elsewhere lacks. And if a language disappears but the culture of its speakers remains unchanged, then the language those people adopt will change and grow to accommodate the exact same concepts and idioms that the newly-extinct language could convey.

Today, on one of the mailing lists I’m on, someone mentioned a building that was “kiddy-corner” to another building. I’ve often heard the phrase “kitty-corner” to refer to diagonal adjacency, but I wasn’t sure as to the origins of the word, nor whether “kiddy-corner” was an acceptable variant. After some googling, I found this list of variants:

catty-corner, catercorner, catacorner, catta-, catter-, catti-; rarely caddy-, cally-, caper-, catacorners, cat-cornered, catercorn, catty-corned, catty-cornereds, catty-corner, caty-corner, katter-kornered.

That’s an awful lot of entries, but no “kiddy corner”. World Wide Words has more on the etymology of the word:

The first part comes from the French word quatre, four. It’s actually quite an old expression that first appeared in English as the name for the four in dice, soon Anglicised to cater.

Eventually it lost all connection with its original meaning, and people thought it had something to do with cats, so “catercorner” became “cattycorner” became “kittycorner”. And now, apparently, it has further eggcorned into “kiddycorner”, at least for some people.

On Monday I began my Japanese class again! It’s pretty exciting to be learning another language. Today was the second day of class, and although I need a ton more practice to get the vocabulary cemented, a lot of things are coming faster and faster.

So in honor of all this, I decided to order Hikaru No Go—the manga—in Japanese! I’m so excited! I can actually read bits and pieces of it (assisted of course by my knowledge of the storyline in general, but I suppose that’s how we learn to read our mother tongue as well). All the kanji have furigana, which makes it even easier for me to understand what’s going on. Hip, hip hooray for Hikaru!

One of the coolest Google features (from a corpus linguistics point of view) is the ability to do wildcard searches in the middle of phrases.

For example, if you search for “what has * here”, Google will find all pages that have that phrase with something substituted for the asterisk. This lets you see first of all what word is most common in such a phrase (in this case, “happened”). Or you might be looking for different variations of the phrase (such as “transpired”, which shows up on the first page).
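The same kind of slot-filling query can be approximated offline with a regular expression. This is only a rough Python sketch over a made-up four-sentence corpus, not a description of how Google implements the feature. It assumes the pattern contains exactly one asterisk and that the asterisk stands for a single word.

```python
import re
from collections import Counter

# A made-up miniature corpus standing in for indexed web pages.
corpus = [
    "Nobody knows what has happened here since the storm.",
    "Tell me what has happened here.",
    "We may never learn what has transpired here.",
    "What has changed here is the weather.",
]

def wildcard_counts(pattern, texts):
    """Count the single words that fill the '*' slot in `pattern`."""
    before, after = pattern.split("*")  # assumes exactly one asterisk
    regex = re.compile(
        re.escape(before.strip()) + r"\s+(\w+)\s+" + re.escape(after.strip()),
        re.IGNORECASE,
    )
    counts = Counter()
    for text in texts:
        counts.update(word.lower() for word in regex.findall(text))
    return counts

counts = wildcard_counts("what has * here", corpus)
print(counts.most_common())
```

Sorting the fillers by frequency gives the "most common word in this slot" view, while the full list of keys gives the variations.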

In my most recent case, I heard the phrase “sometimes a cigar is just a cigar,” a cautionary tale against reading too much into things, and I was interested in the origins of the phrase. Why is the word “cigar” used, as opposed to any number of other ordinary objects? It turns out that the phrase is a very popular one for substitutions: a search for “sometimes a * is just a *” reveals eight different alternatives to “cigar”, such as “snake”, “squirrel”, and “fool”—just on the front page. Paging through the results, there’s no end to variations on the phrase.

The origin of the phrase, and the reason for the word choice, appears to trace back to Sigmund Freud. Freud, of course, is infamous for his sexual imagery, constructing elaborate meanings for everyday items or occurrences, framing them in terms of repressed sexuality. He also commonly smoked cigars. One day, according to legend, a cheeky student asked him what his obsession with cigars signified. Freud allegedly responded, “Sometimes a cigar is just a cigar.”

I have no way of knowing whether this story is accurate, but of all the times Freud seemed to read too much into something seemingly innocuous, this was one time I wonder if he didn’t read quite enough into it. Perhaps he was protesting too much his innocence?

[Image: BoubaOrKiki.png]

Which shape would you call “Kiki” and which shape would you call “Bouba”?

It turns out that over 90% of humans, no matter what their native tongue is, will choose the same answer.

Interestingly enough, the same is not true of autists, who will agree only slightly more than half the time.

The reason I brought up the topic of formal grammars yesterday is that John Backus, a pioneer in the field of computer science and creator of the first high-level programming language (FORTRAN), passed away last Saturday. He’s probably known best for Backus-Naur Form (BNF), a formal notation for describing formal grammars. In BNF, my example grammar from yesterday would look like this:

<S> ::= <N> <V> <N>
<V> ::= saw | kissed
<N> ::= Bob | Jill

The major differences are the use of “::=” instead of an arrow, and the use of a vertical pipe “|” as sort of a shorthand to compress multiple production rules into one.
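Because this grammar is finite, every string it generates can be listed mechanically. The Python sketch below encodes the three BNF rules as a dictionary (the encoding and the `expand` helper are illustrative choices, not a standard format) and expands `<S>` exhaustively.

```python
from itertools import product

# The grammar from the text: each nonterminal maps to its alternatives,
# and each alternative is a sequence of symbols.
grammar = {
    "S": [["N", "V", "N"]],
    "V": [["saw"], ["kissed"]],
    "N": [["Bob"], ["Jill"]],
}

def expand(symbol):
    """Return every string derivable from `symbol` (finite grammars only)."""
    if symbol not in grammar:  # terminal: stands for itself
        return [symbol]
    results = []
    for alternative in grammar[symbol]:
        # Expand each symbol, then combine the choices in order.
        parts = [expand(s) for s in alternative]
        results.extend(" ".join(choice) for choice in product(*parts))
    return results

sentences = expand("S")
for sentence in sentences:
    print(sentence)
```

Two choices of noun, two of verb, and two of noun again give 2 × 2 × 2 = 8 sentences. The function would recurse forever on a grammar with a recursive rule, so it is only suitable for finite grammars like this one.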

Not only is BNF a simple and standard way of representing grammars, but there are computer programs that will take as input a grammar in BNF, and automatically generate code that will parse any given string (that is, work out which production rules derive it). This turns out to be very convenient, since nearly all computer languages can be described in BNF. (Like Java, for example.)
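To give a feel for what "parsing" means here, the following is a hand-written Python stand-in for the small Bob/Jill grammar. It is not the output of any real parser generator, and the function name and return format are made up for the example. It reports which production rules derive a string, or `None` if the string is not in the language.

```python
def parse_sentence(text):
    """Return the derivation of `text` as a list of rule applications,
    or None if the text is not in the language."""
    nouns = {"Bob", "Jill"}
    verbs = {"saw", "kissed"}
    words = text.split()
    if len(words) != 3:  # <S> always expands to exactly three words
        return None
    subject, verb, obj = words
    if subject not in nouns or verb not in verbs or obj not in nouns:
        return None
    return [
        "<S> ::= <N> <V> <N>",
        f"<N> ::= {subject}",
        f"<V> ::= {verb}",
        f"<N> ::= {obj}",
    ]

derivation = parse_sentence("Bob saw Jill")
rejected = parse_sentence("Bob Jill saw")
print(derivation)
print(rejected)
```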

Now, not every grammar is as simple as my example grammar, which could only generate eight unique strings. Nearly every formal grammar that is remotely interesting can generate an infinite number of strings. For example, here’s another very simple grammar in BNF:

<S> ::= <S> <S> | a

This grammar can generate strings like, “a”, or “a a”, or “a a a”, and so on. It can generate an infinite number of strings (or you can say the language of this grammar contains an infinite number of strings), but obviously it doesn’t contain every possible string. “a b a”, for example, is not in this language.
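An infinite language cannot be listed in full, but it can be explored up to a length limit. This Python sketch applies the two rules of the grammar repeatedly until nothing new fits under the limit; the `generate` and `in_language` helpers and the length cap are assumptions introduced for the example.

```python
def generate(max_length):
    """Strings derivable from <S> ::= <S> <S> | a, up to `max_length` symbols."""
    language = {("a",)}  # <S> ::= a
    changed = True
    while changed:
        changed = False
        for left in list(language):
            for right in list(language):
                combined = left + right  # <S> ::= <S> <S>
                if len(combined) <= max_length and combined not in language:
                    language.add(combined)
                    changed = True
    return sorted(" ".join(s) for s in language)

def in_language(text):
    """A string is in the language iff it is one or more 'a' tokens."""
    tokens = text.split()
    return len(tokens) >= 1 and all(t == "a" for t in tokens)

strings = generate(4)
print(strings)
print(in_language("a a a"), in_language("a b a"))
```

Raising the limit always produces more strings, which is what "infinite" means in practice, while the membership test still rejects anything containing a symbol other than "a".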

Chomsky, when describing formal grammars and formal languages, separated them out into four distinct categories of complexity, also called the Chomsky Hierarchy. The very simplest kind is called a “regular grammar”. Regular grammars are only allowed to have two types of rules:

<X> ::= z
<X> ::= <Y> z

In other words, a rule can either consist of a single nonterminal on the left, and a single terminal on the right, or a single nonterminal on the left, and a single nonterminal on the right followed by a single terminal. The grammar I listed above does not qualify, since it has a rule with two nonterminals on the right-hand side. However, the language it describes (strings of “a”s) can be described by a regular grammar.
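The two permitted rule shapes are simple enough to check mechanically. In this Python sketch a grammar is a list of `(left, right)` pairs with nonterminals written in angle brackets. That representation and the `is_regular` name are invented for illustration, and the check covers only the two shapes given above. It confirms that the strings-of-"a" grammar as written is not regular, without giving away a regular grammar for the same language.

```python
def is_regular(rules):
    """Check that every rule has the shape X ::= z or X ::= Y z.

    `rules` is a list of (left, right) pairs where `right` is a list of
    symbols; nonterminals are written in angle brackets, like "<S>".
    """
    def is_nonterminal(symbol):
        return symbol.startswith("<") and symbol.endswith(">")

    for left, right in rules:
        if not is_nonterminal(left):
            return False
        if len(right) == 1 and not is_nonterminal(right[0]):
            continue  # X ::= z
        if (len(right) == 2 and is_nonterminal(right[0])
                and not is_nonterminal(right[1])):
            continue  # X ::= Y z
        return False
    return True

strings_of_a = [("<S>", ["<S>", "<S>"]), ("<S>", ["a"])]
template = [("<X>", ["z"]), ("<X>", ["<Y>", "z"])]
print(is_regular(strings_of_a), is_regular(template))
```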

Would anyone care to try their hand at finding it?