Importing bilingual tables to a TM

Kevin Lossner

Jun 21, 2024

Many ways to go, some more sensible than others

5 Comments

José

Jun 25, 2024

Or just go here

https://porkopek.github.io/Multisearch/#/align-texts

paste the source text, paste the target text, click Align and you have the TMX ready to export in seconds

Kevin Lossner

Jun 25, 2024

Ah, José, you always do the coolest stuff! But these texts are already aligned; aligning them again can create issues, particularly where the number of sentences in each row is not the same. (This is often the case for me, as I rewrite crappy German sources into better-flowing English and often combine and split sentences.)

It would be relatively trivial to write a macro or WSH script that preserves the cell content when generating the TMX from a table (or a tab-delimited list). I suppose I might do that if no one else feels like it, but I've been trying for 22 years now to quit programming, so I'm usually reluctant to play those games, because I fear there will be no end to it given all the crap in software I use that needs fixing.
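To make that idea concrete, here is a minimal Python sketch (not a macro or WSH script, and not anyone's existing tool) that turns a tab-delimited list into TMX while keeping each cell as a single translation unit, with no re-segmentation. The function names, the `creationtool` value and the choice of `segtype="block"` are assumptions made for illustration; adjust the header values to whatever your CAT tool expects.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the attribute xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def parse_tab_delimited(text):
    """Return (source, target) pairs from 'source<TAB>target' lines."""
    rows = []
    for line in text.splitlines():
        if "\t" not in line:
            continue  # skip lines that do not have two columns
        source, target = line.split("\t", 1)
        rows.append((source, target))
    return rows


def rows_to_tmx(rows, src_lang, tgt_lang):
    """Build a TMX 1.4 document with one <tu> per row; cells are kept whole."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "table2tmx-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "block",                  # assumption: one cell = one block
        "o-tmf": "tab-delimited",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in rows:
        if not source.strip() or not target.strip():
            continue  # skip rows with an empty cell
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text.strip()
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    sample = "Erster Satz. Zweiter Satz.\tOne combined sentence.\nHallo\tHello\n"
    print(rows_to_tmx(parse_tab_delimited(sample), "de", "en"))
```

Because each row becomes exactly one unit, a cell with two German sentences and one English sentence stays paired as written.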

In any case, your comment reminds me that I wanted to revisit some of the many things you are doing with terminology, etc. and review those on the TT Substack, because there's a lot there that users of various tools could benefit from. An overwhelming amount of stuff, actually.

José

Jun 25, 2024

I bet the program gets the alignment right in 95% of the cases where the number of sentences differs between segments. Just try it and see whether it produces the same number of segments as you have.

But for this case, I think alignment is trivial, because it only has to align on paragraphs, and these tables produce paragraphs when copied. So it only needs to split on the newline character '\n'. I'll make that an option to make things easier.
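The paragraph-only approach can be sketched in a few lines of Python. This is an illustration of the idea, not the code behind the tool linked above; the function name `align_by_paragraph` is hypothetical, and the sketch assumes that each copied table column yields one paragraph per row.

```python
def align_by_paragraph(source_text, target_text):
    """Pair source and target paragraphs one-to-one, splitting only on '\\n'."""
    src = [p.strip() for p in source_text.split("\n") if p.strip()]
    tgt = [p.strip() for p in target_text.split("\n") if p.strip()]
    if len(src) != len(tgt):
        # A mismatch means the columns were not copied row for row.
        raise ValueError(
            f"paragraph counts differ: {len(src)} source vs {len(tgt)} target"
        )
    return list(zip(src, tgt))


if __name__ == "__main__":
    pairs = align_by_paragraph(
        "Erster Satz. Zweiter Satz.\nDritter Satz.",
        "One combined sentence.\nThird sentence.",
    )
    for source, target in pairs:
        print(source, "=>", target)
```

Since nothing is split at sentence boundaries, rows with different sentence counts on each side stay paired as they were in the table.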

Joe Zhou

Jun 22, 2024

Thank you Kevin for your detailed explanation.

What you mentioned pertains more to a special case, whereas another scenario is:

The same document contains the entire source text followed by the entire translation.

In this case, you can easily split the source text and translation into two separate documents by copying and pasting, and then align them for import into the TM. The reason for aligning is to enable post-alignment editing, such as checking and adjusting segment alignment quality by splitting or merging segments as needed. memoQ's LiveDocs supports aligning document pairs, which can function similarly to a TM. However, automatic alignment is usually not 100% accurate and still requires manual editing based on the actual content. Of course, this is not the main focus of the current discussion.

The same document contains a paragraph of the source text followed by a paragraph of the translation (arranged side-by-side or top-bottom).

Here, we are referring to paragraphs, not segments. The specific formats could be:

First paragraph of the source text - First paragraph of the translation

......

Nth paragraph of the source text - Nth paragraph of the translation

Or:

First paragraph of the source text

First paragraph of the translation

......

Nth paragraph of the source text

Nth paragraph of the translation

How should this text format be handled? Before importing into the TM, segments are usually aligned, whereas the original document is aligned by paragraphs. There may even be cases where paragraph alignment intersects with alignments of several paragraphs. The method you mentioned doesn't seem to directly address such a situation; at the very least, it requires converting paragraph alignment into segment alignment (how would that be achieved?).

If the alignment feature supports importing a single document with bilingual texts, it would be similar to aligning two documents. The source text and translation could be automatically split into segments (automatic splitting isn't 100% accurate) and then checked, edited, and modified in the alignment editor. This method is more intuitive and aligns with the existing experience of most CAT tool users.

Kevin Lossner

Jun 22, 2024

The case you mention here doesn't necessarily involve a table. Alternating source and target text is not a difficult problem to solve; I wrote a little WSH script years ago to deal with that, at least for sentence-level segmentation. I would have to see if it handles the paragraph structure you discuss, but if not, that is a small change.
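For the alternating top-bottom layout, a minimal Python sketch might look like the following. It is not the WSH script mentioned above; the function name `pair_alternating` is hypothetical, and the sketch assumes that non-empty paragraphs strictly alternate source, target, source, target.

```python
def pair_alternating(text):
    """Pair paragraphs that alternate source, target, source, target, ..."""
    paragraphs = [p.strip() for p in text.splitlines() if p.strip()]
    if len(paragraphs) % 2 != 0:
        # An odd count means at least one paragraph has no partner.
        raise ValueError("odd number of paragraphs; cannot pair them all")
    # Even positions are source paragraphs, odd positions their translations.
    return list(zip(paragraphs[0::2], paragraphs[1::2]))


if __name__ == "__main__":
    document = (
        "Erster Absatz.\n\nFirst paragraph.\n\n"
        "Zweiter Absatz.\n\nSecond paragraph.\n"
    )
    for source, target in pair_alternating(document):
        print(source, "=>", target)
```

The resulting pairs are paragraph-level units; breaking them down into sentence-level segments would still need an alignment step and manual checking.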

Victor Parra wrote a nice article describing how he converts paragraph-level segmented bitexts using Rainbow. I am thinking of reworking his approach using a few features of memoQ, in which case you will see it here.

memoQuickies Substack