
Japanese-Chinese Translation with GenAI: What Works and What Doesn’t | Towards Data Science


Alex (Qian) Wan: Alex (Qian) is a designer specializing in AI for B2B products. She is currently working at Microsoft, focusing on machine learning and Copilot for data analysis. Previously, she was the Gen AI design lead at VMware.
Eli Ruoyong Hong: Eli is a design lead at Robert Bosch specializing in AI and immersive technology, developing systems that bridge technical innovation with human social dynamics to create more culturally aware and socially responsive technologies.

Imagine you’re scrolling through social media and come across a post about a house makeover written in another language. Here’s a direct, word-for-word translation:

Finally, cleaned up this house completely and adjusted the design plan. Next, just waiting for the construction team to come in. Looking forward to the final result! Hope everything goes smoothly!

Illustration by Qian (Alex) Wan.

If you were the English translator, how would you translate this? Gen AI responded with:

I finally finished cleaning up this house and have adjusted the design plan. Now, I’m just waiting for the construction team to come in. I’m really looking forward to the final result and hope everything goes smoothly!

The translation seems clear and grammatically perfect. However, what if I told you this is a social post from a person who is notorious for exaggerating their wealth? They don't own the house; they just left out the subject to make it seem like they do. Gen AI mistakenly added "I" without acknowledging the ambiguity. A better translation would be:

The house has finally been cleaned up, and the design plan has been adjusted. Now, just waiting for the construction team to come in. Looking forward to seeing the final result—hope everything goes smoothly!

Languages in which "unstated" context plays an important role in literature and daily life are called "high-context languages".

Translating high-context languages such as Chinese and Japanese is uniquely challenging for many reasons. Because these languages omit pronouns and use metaphors closely tied to history and culture, translators depend heavily on context and are expected to have deep knowledge of culture, history, and even regional differences to ensure accuracy.

This has long been an issue in traditional translation tools such as Google Translate and DeepL. Fortunately, in the era of Gen AI, translation has improved significantly thanks to context awareness, and Gen AI can generate much more human-like content. Motivated by this technological advancement, we decided to develop a Gen-AI-powered translation browser extension for daily reading.

Our extension connects to Gen AI models through their APIs. One of the challenges we encountered was choosing the model. Given the diverse options on the market, this was a multi-month battle. We realized there might be many people like us: not technical, on a low budget, but interested in using Gen AI to bridge the language gap, so we tested 10 models in the hope of bringing useful insights to readers.

This article documents our journey of testing different models for Chinese-Japanese translation, evaluating the results against specific criteria, and providing practical tips and tricks to resolve issues and increase translation quality.

This article is for anyone working with, or interested in, multilingual generative AI for topics like ours: maybe you are a team member at an AI-model company looking for potential improvements. It will help you understand the key factors that uniquely and significantly impact the accuracy of Chinese and Japanese translations.

It may also inspire you if you're developing a Gen AI agent dedicated to language translation. And if you're looking for a high-quality Gen AI model for your daily reading translation, this article will guide you in selecting models based on your needs. You'll also find tips and tricks for writing better prompts that can significantly improve translation output quality.

This article is primarily based on our own experience. We focused on certain Gen AI models as of Feb 2, 2025 (when Gemini 2.0 and DeepSeek were released), so some of our observations may differ from current performance as AI models keep evolving.

We are non-experts, and we tried our best to present accurate information based on research and real testing. The work is purely for fun, self-learning, and sharing, but we hope it sparks discussion about Gen AI's cultural perspectives.

Many examples in this article are generated with the help of Gen AI to avoid copyright concerns.

Our initial consideration was straightforward. Since our translation needs involve Chinese, Japanese, and English, translation among these three languages was the priority. However, very few companies detail this capability specifically in their documentation. The only thing we found was Gemini's documentation, which reports multilingual benchmark performance:

| Capability | Benchmark | Gemini 1.5 Flash | Gemini 1.5 Pro |
| --- | --- | --- | --- |
| Multilingual | Global MMLU (Lite) | 73.7% | 80.8% |

Benchmark description: MMLU translated by human translators into 15 languages. The lite version includes 200 Culturally Sensitive and 200 Culturally Agnostic samples per language.

Kavukcuoglu, Koray. 2025. “Gemini Model Updates.” Google DeepMind Blog, February. https://blog.google/technology/google-deepmind/gemini-model-updates-february-2025/.

Second, but equally important, is price. We were cautious about the budget and tried not to go bankrupt under the usage-based pricing model, so Gemini 1.5 Flash became our primary choice at the time. We also proceeded with this model because it is the most beginner-friendly option, with well-documented instructions and a user-friendly testing environment (Google AI Studio), which reduces friction when deploying and scaling a project.

While Gemini 1.5 Flash set a strong foundation, during our first dry run we found it has some limitations. To ensure a smooth translation and reading experience, we evaluated a few other models as backups:

  • Grok-beta (xAI): In late 2024, Grok didn't have as much fame as OpenAI's models, but what attracted us was its lack of content filters (overly aggressive filtering is one of the issues we observed during translation, discussed later). Grok offered $20 in free credits per month before 2025, which made it an attractive, budget-friendly option for frugal users like us.
  • DeepSeek-V3: We integrated DeepSeek right after its entry into the market because it has richer Chinese training data than the alternatives (they collaborated with staff from Peking University on data labeling). Another reason was its jaw-droppingly low price: with the discount, it was nearly 1/100 the cost of Grok-beta. However, long response times were a big issue.
  • OpenAI GPT-4o: It has good documentation and strong performance, but we didn't seriously consider it because there is no free tier, which conflicts with our budget constraints. We used it as a reference and later integrated it for testing purposes only.

We also explored a hybrid solution: providers that offer multiple models.

  • Groq w/ DeepSeek: Groq was among the first integrated model platforms to deploy DeepSeek. The hosted version is distilled from Meta's Llama; at 70B parameters it is less powerful, but latency is acceptable. They offered a free tier, though with noticeable TPM (tokens-per-minute) constraints.
  • Siliconflow:  A platform with many Chinese model choices, and they offered free credits.

When using those models for daily translation (mostly between Simplified Chinese, Japanese, and English), we found many noticeable issues.

1. Inconsistent translation of proper nouns/terminology

When a word or phrase has no official translation (or has several competing official translations), AI models tend to produce inconsistent renderings within the same document.

For example, the Japanese name "Asuka" has multiple potential translations in Chinese. Human translators usually choose one based on the character's setting (in some cases there is a Japanese kanji reference, and the translator can simply use its Chinese form). A female character could be translated as "明日香", while a male character might be translated as "飞鸟" (meaning-based) or "阿斯卡" (phonetic-based). However, AI output sometimes switches between different versions within the same text.

There are also many different official translations for the same noun in the Chinese-speaking regions. One example is the spell “Expecto Patronum” in Harry Potter. This has two accepted translations: 

Simplified Chinese translation of the Harry Potter spell "Expecto Patronum" as “呼神护卫,” with the English interpretation “Let the Guardian Spirit Arise.” Noted as the version from People's Literature Publishing House in mainland China.

Traditional Chinese translation of the Harry Potter spell "Expecto Patronum" as “疾疾,護法現身,” with the English interpretation “Hasten! Protector, Show Yourself.” Noted as the version from Crown Culture Corporation in Taiwan.

Although our prompts specify translation into Simplified Chinese, the AI sometimes goes back and forth between the Simplified and Traditional Chinese versions.

2. Overuse of pronouns

One thing Gen AI often struggles with when translating from a higher-context language into a lower-context one is adding extra pronouns.

In Chinese and Japanese literature, there are a few ways to refer to a person. As in many other languages, third-person pronouns like "she"/"her" are commonly used. To avoid ambiguity or repetition, the two approaches below are also very common:

  • Using character names.
  • Using descriptive phrases ("the girl", "the teacher").

This writing preference is why pronouns are used much less frequently in Japanese and Chinese. In translations into Chinese, pronouns are retained only about 20–30% of the time, and in Japanese this number can go even lower.

What I also want to emphasize is this: there is nothing inherently right or wrong about how frequently, when, and where to add pronouns (in fact, it's common practice for translators). But it carries risks: it can make the translated sentence unnatural and misaligned with readers' habits, or worse, misinterpret the intended meaning and cause mistranslation.

Below is a Japanese-to-English translation:

Original Japanese sentence (pronoun omitted)

Jack sees the CEO entering the building. With confidence, excitement, and strong hope in heart, go to conference room.

AI-generated translation (w/ incorrect pronoun)

Jack sees the CEO entering the building. With confidence, excitement, and strong hope in his heart, he goes to the conference room.

In this case, the author intentionally avoids mentioning the pronoun, leaving room for interpretation. However, because the AI is trying to follow the grammar rules, it conflicts with the author’s design.

Better translation that preserves the original intent

Jack sees the CEO entering the building. With confidence, excitement, and strong hope in heart, heads to the conference room.

3. Incorrect pronoun usage in AI translation

Added pronouns can also lead to a higher rate of incorrect pronouns caused by biased training data; often these are gender-based errors. In the example above, the CEO is actually a woman, so the translation is incorrect. AI often defaults to male pronouns unless explicitly prompted:

Jack sees the CEO entering the building. With confidence, excitement, and strong hope in her heart, she goes to the conference room.

Another common issue is that AI overuses "I" in translations. For some reason, this issue persists across almost all models, including GPT-4o, Gemini 1.5, Gemini 2.0, and Grok: GenAI models default to first-person pronouns when the subject is unclear.

4. Mixing Kanji, Simplified Chinese, and Traditional Chinese

Another issue we encountered was AI models mixing Simplified Chinese, Traditional Chinese, and Kanji in the output. Because of historical and linguistic reasons, many modern Kanji characters are visually similar to Chinese but have regional or semantic differences.

Some mixed use is technically incorrect but might be acceptable. For example:

The character meaning "correct" or "to face" is shown in three scripts: Simplified Chinese (对), Traditional Chinese (對), and Japanese Kanji (対). The image highlights visual differences and similarities across the writing systems used in Chinese and Japanese.

Those three characters also look visually similar, and they share certain meanings, so it could be acceptable in some casual scenarios, but not for formal or professional communication.

However, other cases can lead to serious translation issues. Below is an example:

The characters "手纸" or "手紙" are shown in three forms: Simplified Chinese, Traditional Chinese, and Japanese Kanji. In Simplified and Traditional Chinese, "手纸/手紙" means “toilet paper,” while in Japanese Kanji, "手紙" means “letter.” The image highlights how identical or similar-looking characters can have different meanings across languages.

If AI directly uses this word when converting Japanese to Chinese (in a modern context), the sentence "Jane received a letter from her distant family" could end up as "Jane received toilet paper from her distant family," which is both incorrect and unintentionally funny.

Please note that browser-rendered text can also have issues if characters are missing from the system font library.

5. Punctuation

Gen AI sometimes doesn't do a great job of distinguishing punctuation conventions among Chinese, Japanese, and English. Below is an example of how the three languages write dialogue differently (in modern common writing style):

Text example showing English punctuation. The sentence reads: I said, “Hello.” A comma appears before the opening quotation mark, and the period is inside the closing quotation mark. Highlighted are the comma, quotation marks, and period placement.
Text example showing Chinese punctuation. The sentence reads: 我说:“你好。” (meaning "I said: 'Hello.'"). A full-width colon appears after the verb, followed by Chinese-style quotation marks. The period is placed before the closing quotation mark. Key punctuation is highlighted.
Text example showing Japanese punctuation. The sentence reads: 私は言った。「こんにちは。」 (meaning "I said: 'Hello.'"). Japanese-style corner brackets enclose the quote, and the sentence ends with a Japanese period inside the closing bracket. Punctuation marks are highlighted.

This might seem minor but could impact professionalism.
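As a workaround, a lightweight post-processing pass can normalize stray half-width punctuation in Chinese output. The mapping below is a simplified sketch of ours, not the extension's actual implementation; a real pass would also need to handle quotation marks contextually.

```python
# Toy normalizer: replace half-width Western punctuation with full-width
# Chinese equivalents in translated output. The period is deliberately
# excluded, since blindly mapping "." would corrupt numbers like 3.14.
PUNCT_MAP = {",": "，", ":": "：", ";": "；", "?": "？", "!": "！"}

def normalize_cn_punctuation(text: str) -> str:
    """Apply the character-level punctuation mapping."""
    return "".join(PUNCT_MAP.get(ch, ch) for ch in text)
```

For example, `normalize_cn_punctuation("我说:你好!")` returns `"我说：你好！"` while leaving `"3.14"` untouched.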

6. False content filtering triggers

We also found that Gen AI content filters might be more sensitive to Japanese and Chinese (this happened when using Gemini 1.5 Flash), even when the content was completely harmless. For example:

人並みにはできますよ!

I can do it at an average level!

Roughly speaking, there were about 2 out of 26 samples that triggered false content filters. This issue showed up randomly.

Completely out of curiosity and to better understand the Chinese/Japanese translation ability of different Gen AI models, we conducted structured testing on 10 models from 7 providers.

Testing setup

Task: Each AI model was used to translate an article written in Japanese into simplified Chinese through our translation extension. The Gen AI models were connected through API.

Sample: We selected a 30-paragraph third-person article. Each paragraph is a sample, with character counts varying from 4 to 120.

Processed result: Each model was tested three times, and we used the median result for analysis.

Evaluation metrics

We fully respect that the quality of translation is subjective, so we picked three metrics that are quantifiable and represent the challenges of high-context language translation.

Pronoun error rate

This metric represents the frequency of erroneous pronouns that appeared in the translated sample, which includes the following cases:

  • Gender pronoun errors (e.g., using "he" instead of "she").
  • Mistakenly switching from a third-person pronoun to another perspective.

A paragraph was marked as affected (+1) if any incorrect pronoun was detected.

Non-Chinese return rate

Some models randomly output Kanji, Hiragana, or Katakana in their responses. We initially planned to count every sample containing any of these, but every paragraph contained at least one non-Chinese character, so we adjusted the evaluation to make it more meaningful:

  • If the returned translation contains Hiragana, Katakana, or Kanji that affect readability, it is counted as a translation error. For example, if the AI outputs 対 instead of 对, it won't be flagged, since the two are visually similar and the meaning is unaffected.
  • Our translation extension has a built-in non-Chinese character detection function. If non-Chinese characters are detected, the system retranslates the text up to three times; if they remain, it displays an error message.
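The detection-and-retry flow can be sketched as below, assuming Python and simple Unicode-range checks (the function names are ours, not the extension's actual code). Kana can be caught by code point; Kanji share the CJK Unified Ideographs block with Chinese and therefore need the readability review instead:

```python
import re

# Hiragana is U+3040–U+309F, Katakana is U+30A0–U+30FF. Kanji cannot be
# separated from Chinese characters by code point alone.
KANA = re.compile(r"[\u3040-\u309F\u30A0-\u30FF]")

def contains_kana(text: str) -> bool:
    """True if the translation still contains Hiragana or Katakana."""
    return bool(KANA.search(text))

def translate_with_retry(paragraph: str, translate_fn, max_attempts: int = 3) -> str:
    """Retranslate up to three times while kana leak through, then surface
    an error message, mirroring the extension's flow."""
    for _ in range(max_attempts):
        result = translate_fn(paragraph)
        if not contains_kana(result):
            return result
    return "[Translation error: non-Chinese characters remained]"
```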

Pronoun Addition Rate

If the translated sample contains any pronoun that doesn’t exist in the original paragraph, it will be flagged.

Scoring formula

All three metrics were calculated using the following formula, where N is the number of affected paragraphs (samples). Note that if a paragraph contains multiple errors of the same type, it is counted once.

Rate = N / 30 × 100%

Quality score: To get a better sense of overall quality, we also calculated a quality score by weighting the three metrics according to their impact on translation: Pronoun Error Rate > Non-CN Return Rate > Pronoun Addition Rate.
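In code, the per-metric rate and an illustrative quality score might look like the sketch below. The article only fixes the weight ordering, so the numeric weights here are our own assumption:

```python
def metric_rate(affected: int, total: int = 30) -> float:
    """Rate = N / 30 × 100%; a paragraph counts at most once per metric."""
    return affected / total * 100

# Hypothetical weights; the text only states the ordering
# Pronoun Error > Non-CN Return > Pronoun Addition.
WEIGHTS = {"pronoun_error": 0.5, "non_cn_return": 0.3, "pronoun_addition": 0.2}

def quality_score(rates: dict) -> float:
    """Fold the three weighted error rates into a 0–10 score (10 = error-free)."""
    penalty = sum(WEIGHTS[k] * rates[k] for k in WEIGHTS) / 100.0
    return round(10 * (1 - penalty), 2)
```

With 15 of 30 paragraphs affected, `metric_rate(15)` gives 50.0; a model with zero errors on all three metrics scores 10.0.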

In the first run, we provided only a foundational prompt specifying the persona and translation task, without any specific translation guidelines. The goal was to evaluate baseline AI translation performance.

Table showing AI translation results for different models in the first run using a basic prompt. Columns include Rank, Quality Score (1–10), Pronoun Error Rate, non-CN Return Rate, and Pronoun Addition Rate. Claude-3.5 Sonnet ranked highest with a quality score of 7.94 and the lowest pronoun error rate (25%). The lowest-ranked model, deepseek-r1-distill-llama-70b, had a quality score of 6.11 and the highest pronoun addition rate (76.92%).

Observation

Generally speaking, the overall translation quality is not sufficient to give readers an "optimal reading experience".

Even the highest-rated model, Claude 3.5 Sonnet, still had a roughly 25% pronoun error rate, meaning an obvious translation deficiency could be spotted in roughly one of every four paragraphs. Interestingly, we found that the incorrectly added pronouns were always the first-person "I". It might be because "I" sits closer to the verb vectors than other pronouns in embedding space.

Pronoun Addition Rates exceeded 50% in most models. This frequency aligns much more with English writing habits than with Chinese (20–30%) or Japanese (even lower), which likely stems from the models' training data. According to OpenAI's dataset statistics, GPT-3's training data is 92.65% English, 0.11% Japanese, 0.10% Simplified Chinese, and 0.02% Traditional Chinese. This overwhelming focus on English reveals a potential reason for the translation struggles, including the mixing of Simplified and Traditional Chinese we observed in testing.

| Language | Number of words | % of total words |
| --- | --- | --- |
| English | 181,014,683,608 | 92.64708% |
| Japanese | 217,047,918 | 0.11109% |
| Simplified Chinese | 193,517,396 | 0.09905% |
| Traditional Chinese | 38,583,893 | 0.01975% |

(OpenAI, “Languages by Word Count in GPT-3 Dataset,” last modified 2020, https://github.com/openai/gpt-3/blob/master/dataset_statistics/languages_by_word_count.csv).

We applied a few not-so-fancy fixes to get consistently good translations.

Re-translation with different models

If conditions allow (budget and technical feasibility), you can use backup models to re-translate cases the primary model cannot handle. This applies to untranslated Japanese text (non-Chinese returns). We primarily used Grok-beta for this until mid-January 2025.
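The fallback chain can be sketched as below. The provider order and the kana check are our simplification; in practice each `translate` callable would wrap one model's API:

```python
import re

def translate_with_fallback(paragraph: str, providers: list) -> str:
    """Try providers in order (e.g., primary model first, then a backup
    such as Grok-beta); fall back when the output still contains
    untranslated kana (a 'non-Chinese return')."""
    result = paragraph
    for translate in providers:
        result = translate(paragraph)
        if not re.search(r"[\u3040-\u30FF]", result):  # no Hiragana/Katakana left
            return result
    return result  # best effort from the last backup
```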

Translation guidance: pronoun 

To prevent the AI from inserting subjects unnecessarily, we specifically instruct it to ignore grammar rules. Here are the hints we use:

**Pronoun Handling Requirements:** 

* **Pronoun Consistency** Follow the original text strictly.

* **Pronoun handling** Do not add subjects unless explicitly mentioned in the original text, even if it results in grammatical errors.

Meanwhile, providing examples is very useful for helping the AI understand your requirements.

**Pronoun Handling**

* **Original Japanese sentence (subject omitted):** ジャックは最高経営責任者が建物に入るのを見た。自信と興奮、そして強い希望を胸に、会議室へ向かった

* **Incorrect AI-generated translation (unnecessary subject added):** Jack sees the CEO entering the building. With confidence, excitement, and strong hope in his heart, he goes to the conference room

* **Good example (grammatically correct without pronoun):** Jack sees the CEO entering the building. With confidence, excitement, and strong hope in heart, heads to the conference room.

* **Acceptable example (omitted subject but grammatically incorrect):** "Jack sees the CEO entering the building. With confidence, excitement, and strong hope in heart, go to conference room."
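Putting persona, task, and these rules together, a request might be assembled like this. This is a minimal sketch in the common OpenAI-style chat format; the message layout and function names are ours, not a prescribed API:

```python
PRONOUN_RULES = (
    "**Pronoun Handling Requirements:**\n"
    "* **Pronoun Consistency** Follow the original text strictly.\n"
    "* **Pronoun handling** Do not add subjects unless explicitly mentioned "
    "in the original text, even if it results in grammatical errors."
)

def build_messages(source_text: str, target_lang: str = "Simplified Chinese") -> list:
    """Persona and rules go in the system message; the text to
    translate goes in the user message."""
    system = f"You are a professional translator into {target_lang}.\n{PRONOUN_RULES}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": source_text},
    ]
```

The returned list can be passed as the `messages` argument of an OpenAI-compatible chat-completion call.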

Translation guidance: glossary

I also wrote a glossary list like the one below. This significantly reduces erroneous pronouns and standardizes terminology translation.

| Japanese | English | Chinese | Notes |
| --- | --- | --- | --- |
| シカゴ | Chicago | 芝加哥 | Official location name |
| 俺 | I | 我 | First-person pronoun, informal, bold, and rough in tone, mostly used by males |
| アスカ | Asuka | 飞鸟 | A young male character name |
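To keep a glossary maintainable, it can be rendered into the prompt programmatically. A sketch with the entries above; the variable and function names are illustrative:

```python
GLOSSARY = [
    # (Japanese, English, Chinese, notes)
    ("シカゴ", "Chicago", "芝加哥", "Official location name"),
    ("俺", "I", "我", "First-person pronoun, informal, mostly used by males"),
    ("アスカ", "Asuka", "飞鸟", "A young male character name"),
]

def glossary_block(entries) -> str:
    """Render the glossary as a markdown table to append to the prompt,
    pinning one translation per term so the model stops switching."""
    rows = ["| Japanese | English | Chinese | Notes |",
            "| --- | --- | --- | --- |"]
    rows += [f"| {ja} | {en} | {zh} | {note} |" for ja, en, zh, note in entries]
    return "\n".join(rows)
```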

Adjusting Model Parameters

Generally speaking, lowering the sampling parameters helps avoid randomness. For someone who likes writing prompts, having the AI follow the prompt strictly matters much more than creative output, so we lowered top-p, top-k, and temperature. DeepSeek officially recommends a temperature of 1.3 for translation, but for better prompt adherence we set it to 1.0 or lower, and we reduced top-k by 20. This works pretty well: Gemini 1.5 Flash used to randomly output a full paragraph that didn't exist in the original article, and the issue never showed up again after adjusting the parameters.

This method reduces variability but is not scalable, because each model responds differently depending on their size, advancement, etc. 
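For reference, the kind of generation config this implies looks roughly like the sketch below. Parameter names follow the common OpenAI/Gemini-style APIs; the exact values are illustrative of "lower than default", not a universal recommendation:

```python
# DeepSeek's docs suggest temperature 1.3 for translation; we trade some
# fluency for stricter prompt adherence by going to 1.0 or lower.
# The top_p value is a hypothetical example.
DEEPSEEK_PARAMS = {"temperature": 1.0, "top_p": 0.8}

# For Gemini, we also lower top_k to curb the random off-source paragraphs
# we saw with defaults. All values here are illustrative.
GEMINI_GENERATION_CONFIG = {"temperature": 0.7, "top_p": 0.8, "top_k": 20}
```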

For the second round of testing, we applied the translation guidance for comparison.

Observation

After applying translation guidance, the overall translation quality of all models improved significantly. Below is a detailed comparison of the performance of different AI models under these improved conditions.

Table displaying updated AI translation results for different models with performance improvement. Columns include Rank, Quality Score (1–10) with score changes in green, Pronoun Error Rate, non-CN Return Rate, and Pronoun Addition Rate. Claude-3.5 Sonnet ranks first with a quality score of 9.68 (+1.74) and 0% pronoun errors. Most models show significant quality gains compared to the previous run, with lower pronoun error and addition rates.

You can easily tell that with translation guidance the translation quality has been significantly improved. 

On the primary metric, Pronoun Error Rate, the front-runners Claude 3.5 Sonnet, OpenAI GPT-4o, and DeepSeek V3 showed strong accuracy. Gemini 2.0 Flash and Moonshot-V1 (Kimi) had minor issues but were sufficient for most non-professional Japanese-to-Chinese translation needs.

On Pronoun Addition Rate, Claude 3.5 Sonnet followed the translation guidance strictly and executed accurately, with only an 8% rate. Gemini 2.0 Flash had a 20% rate, an acceptable result that aligns with Chinese writing habits.

When choosing an AI model for English-Chinese-Japanese translation, the best selection depends on personal needs, considering factors such as budget, requests-per-minute (RPM) limits, and ecosystem compatibility.

Comparison table of AI models showing quality scores, pricing, free tier availability, input/output pricing per million tokens, and rate/tokens per minute. Claude-3.5 Sonnet has the highest quality score (9.68) with premium pricing and no free tier. Gemini-2.0-flash and gemini-1.5-flash offer the lowest input/output costs and generous free tier limits. DeepSeek models have the lowest paid tier costs. The table includes RPM (requests per minute) and TPM (tokens per minute) for both free and paid tiers, with some models showing unlimited or undefined constraints.

For those without budget constraints, Claude-3.5 Sonnet and OpenAI GPT-4o are the strongest choices because of their overall strong performance.

For entry-level developers in North America, Gemini 2.0 Flash is an excellent choice because of its affordable price and good response time. Another reason we chose it as our primary provider is that Google's cloud service ecosystem (OCR, cloud storage, etc.) makes it easier to scale development projects.

For Gen AI power users looking to balance price and quality, DeepSeek offers low prices, unlimited RPM, and open-source flexibility: a strong choice for cost-sensitive users who don't want to compromise on translation quality. However, when using the official API platform from North America, we experienced long response times, which can be a limitation for real-time or long-context translation. Fortunately, many services host DeepSeek on other infrastructure (such as Microsoft Azure, Groq, and Siliconflow), you could deploy it on your own servers, and using it within China avoids these issues. Additionally, model size can significantly affect translation performance: if you can, use the full-power 671B version for best results.

We understand these tests are not perfect. Even though we tried to ensure a diverse and adequate data volume, there is much room for improvement: our sample size is not large enough for statistical significance, AI model performance fluctuates over time, issues like terminology inconsistency weren't captured (though they might matter to some readers), and overall translation quality couldn't be fully reflected quantitatively. We offer the tests for learning and hope they serve as reference points for you.

We are really grateful for the advances in Generative AI, which have helped bridge language gaps and make knowledge more accessible to people who speak different languages and come from different cultures.

However, many challenges remain to be overcome, especially for non-English languages.

Some argue that translation doesn't need advanced AI models, but "good enough" is not enough. That view might be correct from a cost perspective and makes sense from an English-centric one. However, if the standard of "good" is based on official performance reports from AI providers, it might not accurately reflect non-English translation performance. As you can clearly see, translation of high-context languages such as Japanese and Chinese still struggles with accuracy and fluency. There is still a road ahead: better contextual understanding and cultural awareness are necessary to improve AI translation quality.

Cost

DeepSeek has brought more competition to the AI translation market. Pricing is still a key factor for most people and sometimes carries more weight than performance.

If you have mid to high-volume daily translation needs (academic reading, news, video caption, etc.), using a premium model can cost anywhere from $20 to $80 per month. For businesses dealing with localization and internationalization, these costs would increase quickly.

No way around it: prompting for better translation

Another major challenge is that AI models still require users to write long, complex prompts to achieve basic readability. For example, when translating professional topics in certain niche domains, I have no choice but to write prompts of over 5,000 English characters (almost an entire document) just to guide the AI to acceptable quality. Not to mention that longer prompts mean higher token usage.

If AI is truly going to break language barriers, translation models must become more accurate, more context-aware, and less dependent on long prompts. There's still a lot of work to do to make AI translation easy, cost-effective, and truly accessible to everyone, but AI has already achieved more than anyone could have imagined, and I celebrate and am grateful for these technological advancements.


