Let’s say you live in the New York Metropolitan region, the area defined by New York City and its surrounding suburbs. You want to learn a language, but you don’t want to learn one to travel somewhere. Instead, you want to learn a language to communicate with people in your region who speak that language. What language should you learn?
I argue that your best bet is Spanish, Chinese (Mandarin/Cantonese), Russian, Korean, or Bengali. I base this on a combination of three metrics: the total number of speakers of each language, the number of those speakers who do not speak English well, and the percentage of each linguistic community that does not speak English well.
I use Table B16001 of the 2019 American Community Survey (ACS) to answer this question. The dataset covers the New York-Newark-Jersey City, NY-NJ-PA Metro Area (not just New York City), which you can find defined here. You can find the raw dataset here.
1. The “Big 5” rank in the top 10 on three key metrics. Six additional languages came close
Using the ACS data, I calculated three separate rankings for each language: total number of speakers (TNS RANK), total number of speakers who do not speak English well (TNS – LEP RANK), and the percentage of total speakers who do not speak English well (L%T RANK). The Big 5 ranked in the top 10 of each category, as seen below. (The spreadsheet is linked here, and the full data set with rankings can be found here.)
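To make the three rankings concrete, here is a minimal sketch in Python of how they could be computed and combined. It is not the spreadsheet used for this post. The language names and counts are made-up placeholders rather than ACS figures, and the helper names (`rank_desc`, `rank_languages`) are my own.

```python
# Hypothetical placeholder data, NOT figures from ACS Table B16001.
# Each entry maps a language to (total speakers, speakers who do not speak English well).
counts = {
    "Language A": (1000, 500),
    "Language B": (800, 100),
    "Language C": (600, 360),
    "Language D": (400, 300),
    "Language E": (200, 20),
}


def rank_desc(values):
    """Return a 1-based rank for each key, largest value first.

    Ties are broken by input order, which is a simplification.
    """
    ordered = sorted(values, key=values.get, reverse=True)
    return {key: position for position, key in enumerate(ordered, start=1)}


def rank_languages(counts, top_n):
    """Compute the three rankings and the languages in the top N of all three."""
    total = {lang: t for lang, (t, lep) in counts.items()}
    lep = {lang: l for lang, (t, l) in counts.items()}
    share = {lang: l / t for lang, (t, l) in counts.items()}

    tns_rank = rank_desc(total)   # total number of speakers
    lep_rank = rank_desc(lep)     # speakers who do not speak English well
    pct_rank = rank_desc(share)   # percentage who do not speak English well

    # A language makes the shortlist only if its worst rank is still within top_n.
    shortlist = [
        lang for lang in counts
        if max(tns_rank[lang], lep_rank[lang], pct_rank[lang]) <= top_n
    ]
    return tns_rank, lep_rank, pct_rank, shortlist


tns_rank, lep_rank, pct_rank, shortlist = rank_languages(counts, top_n=3)
print(shortlist)
```

With the real table, `top_n` would be 10. The point of taking the worst of the three ranks is that a language has to do well on every metric, not just on average.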
Some other languages ranked highly but did not make the top 10 in every category. They include Haitian, Yiddish, Arabic, Portuguese, Polish, and Italian.
2. These three key metrics reflect three assumptions about a typical language learner
These metrics reflect three assumptions. First, I assume our hypothetical language learner wants to learn a language with a large number of speakers. Second, I assume that they want to use the target language, and not English. Hence, they want a language with many speakers who do not speak English well. Third, I assume they want to visit communities where the language is used as a daily language of communication. The higher the percentage of a language’s speakers who do not speak English well, the more likely it is that everyone in that community, including those who do speak English well, will opt to use the target language rather than English. Hence the third metric, the percentage of total speakers who do not speak English well.
3. However, some languages are geographically concentrated, and the Census Bureau’s language groupings don’t always reflect linguistic reality
There are a couple of things to keep in mind that complicate this picture.
- Geographic concentration. Although the dataset covers the entire metro region, some of these languages are highly concentrated in a few areas, generally in New York City. For example, of the estimated 192,600 speakers of Haitian, 115,751, or about 60%, live in just three counties: Kings County, NY (Brooklyn), Queens County, NY (Queens), and Essex County, NJ (the county that contains Newark, NJ). Nearly half of Russian speakers (123,340) live in Brooklyn, and nearly two thirds of Bengali speakers (97,223) live in Queens and Brooklyn. About one third of Korean speakers (51,332) live in Bergen County, NJ.
- Language versus dialect. There is no universally agreed-upon definition of what constitutes a dialect versus a separate language. Generally speaking, for two varieties to be considered dialects of the same language, they need to have some degree of mutual intelligibility. But for non-linguistic reasons, separate languages are sometimes considered dialects, and mutually intelligible dialects are sometimes considered different languages.
- In this case, the Census Bureau and many speakers of Cantonese and Mandarin consider their languages to be dialects of Chinese, despite the fact that they are completely mutually unintelligible. In the opposite direction, speakers of Hindi and Urdu consider themselves to speak different languages, despite the fact that the normal, everyday language is nearly the same, just written in different scripts.
- If languages were grouped based on linguistic criteria, then Chinese would be split, and Hindi/Urdu would be combined. I reran the analysis with this reshuffling, which gives the ranking shown here. The top 5 languages become Spanish, Mandarin, Cantonese, Russian, and Korean. Bengali gets kicked out of the top 10, and Hindustani (Hindi/Urdu) becomes the fifth most spoken language.
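The regrouping described above can be sketched as follows. This is an illustration only, not the analysis behind the rankings in this post. It assumes you already have separate counts for Mandarin, Cantonese, Hindi, and Urdu. The numbers are made-up placeholders, and `regroup` and `order_by_total` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical placeholder counts, NOT ACS figures.
# Each entry maps a language to (total speakers, speakers who do not speak English well).
fine_counts = {
    "Mandarin": (500, 300),
    "Cantonese": (400, 280),
    "Hindi": (300, 60),
    "Urdu": (250, 75),
    "Bengali": (450, 225),
}

# Two alternative groupings of the same rows (labels are illustrative).
census_style = {"Mandarin": "Chinese", "Cantonese": "Chinese"}
linguistic = {"Hindi": "Hindustani", "Urdu": "Hindustani"}


def regroup(counts, grouping):
    """Sum the (total, LEP) pairs under each group label.

    Languages that are not listed in `grouping` keep their own name.
    """
    merged = defaultdict(lambda: [0, 0])
    for label, (total, lep) in counts.items():
        group = grouping.get(label, label)
        merged[group][0] += total
        merged[group][1] += lep
    return {group: tuple(pair) for group, pair in merged.items()}


def order_by_total(grouped):
    """List group labels from most to fewest total speakers."""
    return sorted(grouped, key=lambda g: grouped[g][0], reverse=True)


by_census = regroup(fine_counts, census_style)
by_linguistic = regroup(fine_counts, linguistic)

print(order_by_total(by_census))
print(order_by_total(by_linguistic))
```

Even with toy numbers, the ordering changes depending on which grouping you choose, which is why the choice of language categories matters for a ranking like this one.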
Notes
The source data is not a full census, but rather a random sample of households
The data used is Table B16001 of the 2019 American Community Survey (ACS). The ACS is conducted by the Census Bureau but is not the same as the decennial Census. Instead of contacting every resident of a selected geography, it uses random sampling and surveys to glean insights about the US population.
1-year estimates were used instead of 5-year estimates to use more recent data
This analysis uses 1-year estimates instead of 5-year estimates. 5-year estimates are more accurate, but the last set of 5-year estimates for this data set is from 2015. I opted to use a more recent data set even though it might be less accurate.
The dataset covers the entire New York City Metropolitan area, not just New York City
The geography used for this analysis is the New York-Newark-Jersey City, NY-NJ-PA Metro Area. It includes the City of New York, as well as several counties adjacent to it.
Motivation is more effective than utility for language learning
Your reason for learning a language shouldn’t be based on number of speakers alone. You should want to learn a language because you’re interested in speaking that language, or in learning more about the cultures that speak it. If you truly have no preference about which language to learn, and believe that you can maintain interest in any language, then it might make sense to use metrics like these. But if you have an interest in a different language that doesn’t appear here, you should pursue that.