How well does OpenAI support different languages?
In most examples I know, if not all, OpenAI uses English. But OpenAI actually supports multiple languages both as input and output. But how well does OpenAI support the different languages? I set out to find out, completely unscientifically.
I googled around and found a long list of supported languages. I initially implemented all the languages and then tested to what extent OpenAI can actually write meaningful and relatively complex text in them.
The initial test was whether OpenAI could author a meaningful text using Markdown that contains:
- a SQL query that finds the average movie length rounded down to 2 decimal places in each category (see below)
- a short subsequent explanation
The SQL query should be:
SELECT
c.name AS category,
ROUND(AVG(f.length), 2) AS average_length
FROM
film f
JOIN film_category fc ON f.film_id = fc.film_id
JOIN category c ON fc.category_id = c.category_id
GROUP BY
c.name;
The prompts sent to OpenAI contains:
- user input (“What is the average length of films in each category? 2 decimals.”)
- prompt template
- dvdrental database schema
- desired language added as “system” instruction
OpenAI’s response in English looked like this. The failed languages that didn’t pass the test (note some of the failed languages can still handle simpler prompts):
- Albanian
- Arabic
- Armenian
- Awadhi
- Azerbaijani
- Bashkir
- Belarusian
- Bosnian
- Brazilian Portuguese
- Bulgarian
- Cantonese (Yue)
- Chinese
- Croatian
- Czech
- Danish
- Dogri
- English (British)
- Estonian
- Faroese
- Georgian
- Gujarati
- Haryanvi
- Indonesian
- Irish
- Japanese
- Javanese
- Kannada
- Kashmiri
- Kazakh
- Konkani
- Korean
- Kyrgyz
- Latvian
- Lithuanian
- Macedonian
- Maithili
- Malay
- Maltese
- Mandarin Chinese
- Marathi
- Marwari
- Min Nan
- Moldovan
- Mongolian
- Montenegrin
- Nepali
- Norwegian
- Oriya
- Pashto
- Persian (Farsi)
- Polish
- Rajasthani
- Romanian
- Russian
- Santali
- Serbian
- Sindhi
- Sinhala
- Slovak
- Slovene
- Slovenian
- Ukrainian
- Urdu
- Uzbek
- Welsh
- Wu
The languages that succeeded and generated acceptable responses:
- Basque
- Bengali
- Bhojpuri
- Catalan
- Chhattisgarhi
- Dutch
- English
- Finnish
- French
- Galician
- German
- Greek
- Hindi
- Hungarian
- Italian
- Mandarin
- Portuguese
- Punjabi
- Sanskrit
- Spanish
- Swedish
- Turkish
- Vietnamese
Of course, I can’t guarantee that the accompanying explanation in the different languages is meaningful, but the answers had a correct SQL query and a short explanation in a language I assume to be correct. OpenAI may mix similar languages, e.g. support for the relatively small language Bhojpuri is rather odd.
Without adding a database schema, I asked OpenAI to find: “get all users who live in downtown Boston using lat/lng” for Postgres. Most of the above languages fail and give sub-optimal answers. Only English, Mandarin, Spanish, French and surprisingly Finnish give satisfactory answers. German comes close, but without being able to find lat/lng for Boston. The same goes for Bengali.
Here are the generated snippets by each language:
If we switch the desired database engine to Trino (AWS Athena), then only English and Mandarin makes the cut. So, completely unscientifically, I would conclude that the best supported languages are English and Mandarin. Finnish is surprisingly well supported given it only has around 5 million native speakers.