Photography by Timothy Archibald
Illustrations by Giorgia Virgili
to hear his daughter’s first word. In Madeline’s babbling, he’d already discerned those classic baby sounds “ba,” “da” and “ma,” but when she was 10½ months old, she began saying “BAba” each time she saw Brown Bear, Brown Bear, What Do You See? The book, written by Bill Martin Jr. and illustrated by Eric Carle, was one of her favorites. At first, Frank doubted that “BAba” constituted a word (the etymological root for “babble” is, after all, the repeated use of “ba” by toddlers), but as he observed Madeline speaking it, he noted its “word-y” qualities: the stress on the first syllable, the descending intonation and a hint of an R after each B. She made the sound only when the book was around, “with the exception of one or two potential false alarms when another book was present,” he wrote in his blog. This was, indeed, language, he decided. Then, three weeks later, she stopped using the word, and he never heard it again.
Frank, ’03, wasn’t just an attentive father describing the nuances of his firstborn’s proto-language with the zeal of a connoisseur; he was also a Stanford psychologist specializing in that earliest of linguistic fermentations: children’s language acquisition. For the past five years, the associate professor has been building Wordbank, an online trove in which he collects the utterances of tykes from 8 to 36 months. So far, he has gathered those of 39,964 females, 40,113 males and 2,900 children whose genders are unidentified. They hail from 29 language groups, including Cantonese, Hebrew, Kigiriama, Norwegian, Turkish, French in Quebec and France, and English in Australia, the United Kingdom and the United States.
While Wordbank has many uses, its primary purpose is to answer a question that has long haunted linguists: How much of language acquisition is innate and therefore the same everywhere on earth, and how much of it is affected by environment? “Early language is our first clue about this process,” Frank says. “The approach we take is directly inspired by this idea of what’s universal across languages and across the process of language learning.”
The challenge in Wordbank—and Frank’s forte—lies in making sense of the sundry infantile proclamations that he has accumulated in the millions. He and his team have spent years building computational tools to create order from hullabaloo, and the first results began coming in around the time that Madeline was making her earliest forays into speech. They revealed that while education and nurturing are, of course, extremely important, in the end, tots and their linguistic tactics are unpredictable.
“There are a lot of differences between kids that can’t be explained by their demographics or their backgrounds,” Frank says. “Kids are really variable, and I find that liberating as a parent—that you can relax a little bit and watch them grow in the direction and at the pace that they want to, knowing that a lot of that variability is out of your control. It’s about the path that they want to take into language.”
The biggest constant, it turns out, may be difference. Rates and styles of language learning vary within social classes, schools—even the same home. In the forthcoming book on Wordbank, Variability and Consistency in Early Language Learning, Frank and three colleagues write, “Although some 18-month-olds already produce 50–75 words, others produce no words at all, and will not do so until they are two years or older.”
Even when there are patterns, such as in the most common first words (among the first 10 words uttered in many languages are “mommy,” “daddy,” “woof woof,” “no,” “bye,” “hi,” “yes,” “vroom,” “ball” and “banana”), babies can also be distinct in how they emerge onto the linguistic stage, as Madeline’s use of “BAba” reminds us. Many of Wordbank’s other findings show similar consistency and variability, such as how firstborns speak compared with their siblings, whether toddlers prefer nouns or verbs, which words are more likely to be spoken by girls or by boys, and how girls master language more rapidly than boys.
Though Wordbank can’t always reveal the reasons children learn in the ways that they do, its data allows researchers to see the patterns in child learning that hold steady across cultures. It also provides them with new avenues for exploration, allowing them to conduct studies with greater precision, searching for potentially larger, subtler or more complex factors that influence language acquisition.
And, like the children whose data it stores, Wordbank is growing, absorbing new data that, along with its code, is open to everyone.
Stanford’s first steps toward becoming a hub for the study of hubbub took place in the 1950s, when linguistics professor Charles Ferguson became interested in how people spoke to infants and pets. After Eve Clark joined the linguistics faculty in 1971, she took over teaching language acquisition. In 1973, she and a committee of graduate students began organizing the Child Language Research Forum, the first—and for many years the only—conference on language acquisition, which ran until 2009. During Clark’s half-century in the field, Stanford researchers made a number of discoveries, such as that small children know a great deal about how language is used and adapt their role-playing to take into account gender, social status and setting. However, much of the research from that time was in response to MIT linguist Noam Chomsky’s proposal that children had an innate capacity for language. “He argued that children didn’t need feedback,” Clark says, “and that they could learn things that weren’t even present in the input they were getting.” Research at Stanford, in contrast, showed that a staggering 60 percent of children’s errors in word choice, word form and pronunciation were implicitly corrected when parents interpreted the talk. (“So a child might say,” explains Clark, “‘I come that in,’ meaning ‘I brought it in,’ and the parent might say, ‘Oh, you brought it in?’”) Furthermore, when children used verbs incorrectly, their parents provided interpretation of this nature 90 percent of the time.
The debate around innateness was still very much alive in 1999, when Mike Frank came to Stanford as an undergraduate with a fascination for languages. He double-majored in comparative literature and symbolic systems, an interdisciplinary program created in 1985 by faculty in philosophy, linguistics, computer science and psychology. “Language was this window into human uniqueness,” he says, “and the uniqueness of our ability to tell stories and narratives to define ourselves. Language allows us to coordinate our activities at unprecedented scale and leads to a tremendous number of uniquely human achievements.”
During his sophomore year, Frank investigated whether the language we speak changes how we think about the world. Under Lera Boroditsky, PhD ’01, a Stanford doctoral student and later assistant professor who now teaches at UC San Diego, he worked on a study evaluating whether Russian speakers, who have two words for blue—one for light blue and one for dark blue—distinguished those shades more readily than English speakers, for whom the two colors are just called blue. (They do, the study concluded.)
Frank also steeped himself in the history of linguistics—the debates over whether all humans, regardless of culture, have a similar universal linguistic template in their brains or whether “language emerges from an intersection of specific abilities and orientations, not just innate grammar,” as he puts it. The latter theories argued “that languages are learned through social interaction and that learning is more gradual,” he says. Michael Ramscar, then a professor of psychology, told him that the best way to investigate philosophical questions about the nature and origins of language was to study children. That, Frank remembers thinking, “was an immensely powerful and exciting argument.”
As a doctoral student at MIT, he created computer models to predict how children would learn under different circumstances—for instance, how a child might acquire language when observing other people speaking and interacting as opposed to when being taught words directly by an adult.
“But once you create the theory,” Frank says, “you need to go out and get the data to test it.” This is precisely what he began to do in 2010, after he joined the Stanford faculty. “I looked around and there weren’t any more data on offer. Nobody had the data that I needed.”
In 2015, Frank approached psychologist Virginia Marchman, who is now one of his co-authors on the Wordbank book. Marchman was on the advisory board of the MacArthur-Bates Communicative Development Inventories (CDIs), questionnaires created by language researchers in 1988 to allow parents to record how their children communicated. Having parents inventory their kids’ vocabularies at home, in their natural environment, had been shown to be more effective than studying children in the lab. Researchers around the world also adapted CDIs to their languages, using words important in those cultures. And in every region, before the researchers could use CDIs to evaluate individual children, they had to do norming studies—surveys of thousands of monolingual children to establish local norms. The studies turned out to be an untapped trove.
“Each of those groups had CDIs for thousands of kids, often in a filing cabinet or in an Excel file or whatever on their computer,” Frank recalls. So he made a proposal to Marchman: Would it be possible to bring all that data together to stimulate innovation and answer the most challenging questions about linguistics?
The idea appealed to her. “Making data open and accessible to other people is good for the field,” she says, “and it’s good for science in general.” She told him that the CDI board meeting would be the following week in San Diego and invited him to make his pitch.
Shortly after doing so, he began receiving CDIs, but several years passed before many of the researchers responded. “I like to say that I started with the Field of Dreams model: ‘If you build it, they’ll come running and they’ll give you your data,’” Frank says. “But I ended up much more with a sense that if you build something really compelling, then it forms a way for you to ask them repeatedly to contribute.”
The heart of Wordbank is its openness. Looking back on his presentation at that CDI conference five years ago, Frank sees it as the moment when he transitioned away from focusing on theory. “That experience moved me toward being somebody who works on getting data out there and sharing it openly and trying to create tools for dealing with those data.”
Wordbank pages have a link to GitHub, a software development and sharing platform, where users can download the data as well as the code that Frank and his team developed to analyze it. This allows other researchers to evaluate how Wordbank’s results were derived, to apply the code to their own work or to crunch the information in a different way.
The data itself has many applications, Frank explains—from studying cognitive development to evaluating notions of fairness among children. There is one limitation researchers are working to remedy: It’s hard to use Wordbank to study language acquisition in multilingual children, since the bulk of its CDIs were taken from norming studies, which tested only monolingual children to ensure consistency.
Despite his pivot to information-sharing, Frank remains committed to his theoretical investigation. “We use our data to do a crosslinguistic look at what is consistent across languages and try to use the data to constrain our theories,” he says. “But it all comes down to understanding why and how kids learn language—what’s the shared core of these abilities across different languages and cultures.”
Frank’s tenacity has paid off with insights about how children around the world engage with language. At times, Wordbank has shown consistency within one language group but variability across groups, as with the question of whether kids prefer nouns or verbs early on. Children in most Western language groups, such as French, Norwegian and English, tend to learn nouns first. “You’ve got these really annoying verbs like ‘make’ or ‘do’ that are hard to figure out from context,” Frank says, “because you could make the bed, you could make lunch or you could make a mess. That’s a complicated thing to figure out by looking, because there’s not that much in common between making the bed and making a mess.” Cantonese and Mandarin, however, have concrete verbs that small children can identify and learn early on by watching those who speak them.
Wordbank also reveals how children’s birth order affects their speech. Firstborns often speak earlier than later-born children, most likely because they get more one-on-one attention from parents. And they favor different words than their siblings do. Whereas firstborns gabble on about animals and favorite colors, the rest of the pack cut to the chase with “brother,” “sister,” “hate” and such treats as “candy,” “popsicles” and “donuts.” The social dynamics of siblings, it would appear, prime their vocabularies for a reality different from the firstborns’ idyllic world of sheep, owls, the green of the earth and the blue of the sky.
Children also adopt vocabulary quite differently depending on their mother’s level of education. In American English, among the words disproportionately favored by the children of mothers who have not completed secondary education are “so,” “walker,” “gum,” “candy,” “each,” “could,” “wish,” “but,” “penny” and “be” (ordered starting with the highest frequency). The words favored by the children of mothers in the “college and above” category are “sheep,” “giraffe,” “cockadoodledoo,” “quack quack,” the babysitter’s name, “gentle,” “owl,” “zebra,” “play dough” and “mittens.” (Frank tends to focus on word production, which is more reliably measured than comprehension because it involves less subjective evaluation by parents.)
Since few American children gambol with giraffes or zebras or—in a country where more than 82 percent of people live in urban areas—even with sheep, ducks and roosters, Wordbank users can surmise that the favored words for this group were learned from children’s books and trips to the zoo, rather than from expeditions on the Serengeti. Given that Frank’s wife, Alison Kamhi, ’03, a Fulbright scholar and an immigration attorney, is in the “college and above” category, it’s no surprise that “BAba,” Madeline’s first word, was inspired by a book about a brown bear—an apex predator she has surely never had to outrun.
One area of remarkable consistency across language groups is the degree to which the language of children is gendered. The words more likely to be used by American girls than by boys are “dress,” “vagina,” “tights,” “doll,” “necklace,” “pretty,” “underpants,” “purse,” “girl” and “sweater,” whereas those favored by boys are “penis,” “vroom,” “tractor,” “truck,” “hammer,” “bat,” “dump,” “firetruck,” “police” and “motorcycle.”
Even for those who don’t speak many of the languages in Wordbank, a quick scan of the lists reveals easily recognizable words, especially for the boys: “vroum” (Quebecois French), “vrn vrn” (Czech), “brum brum” (Italian), “br/brm/brum” (Latvian) and so on. In nearly every list of boys’ words, “tractor,” “helicopter,” “police,” “hammer,” “motorcycle” and other mechanical objects stand out. The words for girls rely less on onomatopoeia (the creation of a word for an object by evoking the sound associated with it). On their lists, “pretty” and “dress” make frequent appearances.
Wordbank also includes information on British Sign Language—and children use it in a significantly less gendered way than they speak British English. The top three words more likely to be signed by boys than by girls are “peekaboo,” “hello” and “shower”; the three more likely to be spoken are “tyre,” “vroom” and “cowboy.” The pattern holds true for girls, though in sign language and spoken English, “pretty” remains a favorite.
“You don’t have to be an expert in gender socialization,” says Frank, “to see that it’s interesting that you’re getting these sex-linked words early on.” The challenge in analyzing the data, he points out, is in determining whether children speak this way because of nature or nurture. “We don’t know whether it’s the parents saying these words to the kids or the kid being interested or both.”
Mika Braginsky, a lab tech who helped create Wordbank and co-authored the forthcoming book, agrees that assigning significance to such gendered results is a challenge. They (Braginsky is nonbinary) say, “By ‘girls’ and ‘boys,’ we have the assigned sex at birth of these kids. There’s not really a way to disassociate what is and isn’t genetics- or socialization-related.”
Wordbank’s results become even more difficult to explain where they show the rates of learning for girls compared with boys.
“Girls are more or less better at just about everything,” says Frank. “If you go into a preschool classroom in the United States, you might notice that the girls talk more than the boys on average. They have larger vocabularies. They’re better with language. Is that because of gender socialization in the United States or some feature of the way we culturally interact with different kids? Or is that due to a more invariant mechanism that’s kind of the same across kids in different cultures? It turns out it’s actually the latter. Across most of the languages that we have data for, girls have a bigger vocabulary than the boys and with a relatively similar degree.”
Wordbank can’t explain why girls acquire language with relative ease; it doesn’t tell us whether the gender difference results from societal features that hold constant across cultures or earlier development in babies with two X chromosomes. (Decades of studies by Eve Clark show no differences in production or comprehension between boys and girls; Clark doesn’t know why Wordbank would yield different results but considers the possibility that parents might talk more to girls and therefore have a clearer sense of their vocabulary when completing CDIs.)
Wordbank has, however, presented a few clear patterns—how children’s interests and social environment appear to drive language learning in ways that are surprisingly similar across cultures, and how variable children are in the speed and approach with which they acquire language. “We have found some interesting consistencies across cultures and languages,” Frank says. “I still hesitate to call them universals.”
Though as a new father Frank found reassurance in the varied rates of learning, he also saw the long-term repercussions of the different speeds at which children acquire language.
“Something really striking is how well different aspects of children’s language hang together. Kids who gesture more early on also have bigger vocabularies. Kids who have bigger vocabularies tend to combine words more and have a stronger knowledge of grammar. They tend to put the right endings on words. So one of the things that’s really consistent across culture is that we see all of the different parts of language hanging together. Language is kind of one unified system or one unified skill, which is, I think, fascinating from a cognitive science perspective. If you go to a linguistics department, there are different courses on syntax, grammar, morphologies, phonology, but in acquisition, they all fit together. They’re all part of the same system, and that is really robustly true across all the languages we look at.”
Anne Fernald, associate professor emerita of psychology at Stanford, has shown that socioeconomic factors affect language learning and that underserved children often have smaller vocabularies. Marchman, who works in Fernald’s research group, explains that early levels of language acquisition correlate with performance in many areas later in life—“with your literacy level,” she says, “with how well you do in math, with high school graduation rates. We’ve learned that birth to 5 is an important critical period in development. Language is one of the important skills that we can give our children early on. So I’m interested, given that there’s so much variability, in when that variability is just natural variability and when it is telling us that a child might need some extra help.”
And yet, even in households with similar levels of income and education, variability is high, which is why Frank emphasizes that children have many styles and paths in terms of language acquisition. “We may have a naive story like, ‘Oh, well, parents are really different in all parts of the United States and we’re a diverse nation with lots of different kids from different backgrounds. Maybe that’s why there’s variability.’ But if you go to, for example, a Beijing Mandarin sample where all the kids are monolingual and going to the same state-sponsored early childhood care, the variability is just as high.”
On the flip side, Wordbank allows educators and medical professionals to better identify the normal range of variation. Frank points out that if children are unusually delayed in comprehension or production by the age of 2, parents should consider consulting a pediatrician.
Fortunately for Frank, as Madeline was developing language, he saw from the Wordbank data that she was within the normal range. “It was just super fun to watch the interesting and idiosyncratic things that she did as she broke into language,” he recalls.
He now also has a son, Jonah, whose first word he awaits.
As for whether Wordbank has provided answers to all the questions with which generations of linguists have struggled or has validated the computational models of how children learn that Frank devised in his PhD days, he isn’t sure.
He still needs more data.