We’ve looked at English dates and related regular expressions from a couple of perspectives so far:
auto-translation rules to convert dates in English to various target languages
English long dates generated from German long and short dates
Now we’re going to consider a QA issue that I’ve run across far too often over the years: the use of the wrong date formats in an English text. We’ll look at how to detect this problem and at ways to fix it.
How does this happen? some may ask. A number of possible ways. It’s not uncommon with mixed teams of translators from different backgrounds translating a set of documents. Even with a styleguide clearly stating the format to be used, it is often difficult for a native of the United Kingdom to write US date formats consistently, and vice versa. This problem also shows up a lot when a translator is asked to use conventions not typical for their accustomed variant of English.
No matter how “well-trained” one’s habits of review are, it’s also not easy to notice these problems consistently when reading a longer text; I tend to get caught up in the narrative of the translation and don’t always notice inconsistencies in dates or numbers if I am concentrating on the quality of expression in the text.
That’s where bits of regex and automated checking tools can save us time and embarrassment.
You can apply filters as discussed in a previous article to see if dates in a particular format are present, but if you want a process for identifying and fixing the problems of incorrect target formats, more is necessary.
There are three possible approaches:
Use auto-translation rules that map date structures in the source text to the desired date structures in the target language. The German to UK and German to US long date rules from an earlier post are good examples of this. If the chosen QA profile is set to check the auto-translation rules selected in the project, warnings will be displayed in the working grid and also when a QA check is initiated.
The Regex tab of the QA profile can also be used to perform specific checks on the target language and suggest corrections. Find & replace regexes saved in the Regex Assistant library are good candidates to use for this:
The regex used here can also be adapted to do plausibility checks on the numbers in the date. Notice that the example above strips leading zeroes for the replacement. And there is one flaw that may cause problems in some text: as written, the expression is case-sensitive. Adding
(?i)
to the front of the Find expression will fix that.
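As one possible reading of the plausibility checks mentioned above, here is a minimal Python sketch that only accepts day numbers from 1 to 31. The day sub-pattern and the names `STRICT_UK_LONG_DATE` and `has_plausible_uk_date` are assumptions made for illustration and are not part of any saved resource; the month and year parts follow the UK long date expression discussed in this article, with (?i) already in front. Python’s `re` module is used only as a convenient way to try the pattern outside memoQ.

```python
import re

# Assumed stricter day pattern: 1-31 only, optional leading zero left uncaptured.
# The \b keeps a match from starting in the middle of a longer number.
STRICT_UK_LONG_DATE = re.compile(
    r"(?i)\b0?(3[01]|[12]\d|[1-9])\s?"
    r"((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[abcehilmnoprstuyz\.]*)"
    r",?\s?(\d{4})"
)


def has_plausible_uk_date(text: str) -> bool:
    """Return True if text contains a UK long date with a day from 1 to 31."""
    return STRICT_UK_LONG_DATE.search(text) is not None


print(has_plausible_uk_date("Signed on 03 JUNE 2020"))  # True: (?i) ignores case
print(has_plausible_uk_date("Signed on 32 June 2020"))  # False: day out of range
```

A pattern like this cannot know how many days a given month has, so 31 June 2020 would still pass; it only catches values that can never be a day of the month.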
Note also that in neither the Regex Assistant editor’s replacement field, nor in the Correction (replacement) field of the QA profile’s Regex tab (nor in the memoQ find & replace dialogs) do you have any visual clues that a space is, in fact, a non-breaking space. Unfortunately, the only part of memoQ where non-breaking characters can be seen is in the translation and editing grid (where non-printing character visibility can be toggled with the pilcrow icon ¶ on the Edit ribbon).
Find & replace in the translation and editing grid. This is the way to go for on-the-fly fixes.
For monolingual editing jobs where a lot of cleanup is needed for date formats, it might be helpful to have sets of English-to-English rules to cope with whatever transformations are needed. I usually keep sets of rules that convert long dates to short dates or whatever special requirements I encounter in client styleguides.
So for finding those UK long dates, the expression we’re considering now looks like this:
(?i)0?(\d{1,2})\s?((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[abcehilmnoprstuyz\.]*),?\s?(\d{4})
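To try this expression outside memoQ, a minimal Python sketch follows. The Find expression is copied unchanged from the line above. The replacement is an assumption: it is taken to be $2 $1, $4, inferred from the renumbered version given later in this article, and it uses ordinary spaces rather than non-breaking ones. Python’s `re` module writes group references as \1 where the replacement fields discussed here use $1, and the function name is hypothetical.

```python
import re

# Find expression from the article, split over two string literals for readability.
UK_LONG_DATE = re.compile(
    r"(?i)0?(\d{1,2})\s?((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"[abcehilmnoprstuyz\.]*),?\s?(\d{4})"
)

# Assumed replacement $2 $1, $4 (month, day, year) in Python's \N notation.
US_REPLACEMENT = r"\2 \1, \4"


def uk_to_us_long_dates(text: str) -> str:
    """Rewrite UK long dates (3 June 2020) as US long dates (June 3, 2020)."""
    return UK_LONG_DATE.sub(US_REPLACEMENT, text)


print(uk_to_us_long_dates("Delivery is due on 03 June 2020."))
# Delivery is due on June 3, 2020.
print(uk_to_us_long_dates("Invoiced 12 Sept. 2021 and paid 1 oct 2021"))
# Invoiced Sept. 12, 2021 and paid oct 1, 2021
```

The leading zero disappears because 0? sits outside the first capture group, and the month is carried over exactly as written, including abbreviations and lower case.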
But what if we run into cases like 3rd June 2020? Well, that’s a simple fix. Just add the relevant character possibilities behind the first digit group, something like
(?i)0?(\d{1,2})([dhnrst]{2})?\s?((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[abcehilmnoprstuyz\.]*),?\s?(\d{4})
But the new group for the ordinal suffix becomes $2 and shifts the numbering of the groups after it, so our replacement expression needs to be rewritten as $3 $1, $5
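The shift in group numbers is easy to see by listing the captured groups. The following Python sketch uses the expression with the optional ordinal group exactly as written above; the variable name is hypothetical, and the "old" replacement $2 $1, $4 is an assumption inferred from the renumbered one. Python writes $3 as \3.

```python
import re

# Expression from the article with the extra group for the ordinal suffix.
UK_LONG_DATE_ORDINAL = re.compile(
    r"(?i)0?(\d{1,2})([dhnrst]{2})?\s?"
    r"((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[abcehilmnoprstuyz\.]*)"
    r",?\s?(\d{4})"
)

match = UK_LONG_DATE_ORDINAL.search("3rd June 2020")
print(match.groups())
# ('3', 'rd', 'June', 'Jun', '2020')

# Renumbered replacement: month is now group 3 and year is group 5.
print(UK_LONG_DATE_ORDINAL.sub(r"\3 \1, \5", "3rd June 2020"))
# June 3, 2020

# Assumed old replacement, left unchanged, picks up the wrong groups.
print(UK_LONG_DATE_ORDINAL.sub(r"\2 \1, \4", "3rd June 2020"))
# rd 3, Jun
```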
But a better solution might be
(?i)0?(\d{1,2})[dhnrst]*\s?((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[abcehilmnoprstuyz\.]*),?\s?(\d{4})
or even
(?i)0?(\d{1,2})[dhnrstof\s]*\s?((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[abcehilmnoprstuyz\.]*),?\s?(\d{4})
which would handle cases like the 2nd of April, 2015, though if you’ll be seeing dates like that, you might want to capture “the” too for a clean find & replace.
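One way to pick up the article is sketched below in Python. The optional prefix (?:\bthe\s+)? is an assumption added for illustration, not an expression from the saved resources. It is written as a non-capturing group so that the group numbers stay the same and the assumed replacement $2 $1, $4 (here \2 \1, \4) still applies; the names are hypothetical.

```python
import re

# Assumed variant: an optional leading "the" is matched but not captured,
# so it is dropped on replacement without renumbering the other groups.
UK_LONG_DATE_WITH_ARTICLE = re.compile(
    r"(?i)(?:\bthe\s+)?0?(\d{1,2})[dhnrstof\s]*\s?"
    r"((Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[abcehilmnoprstuyz\.]*)"
    r",?\s?(\d{4})"
)


def to_us_long_date(text: str) -> str:
    """Rewrite dates like 'the 2nd of April, 2015' as 'April 2, 2015'."""
    return UK_LONG_DATE_WITH_ARTICLE.sub(r"\2 \1, \4", text)


print(to_us_long_date("We met on the 2nd of April, 2015."))
# We met on April 2, 2015.
print(to_us_long_date("Due 1st of December 2019"))
# Due December 1, 2019
```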
And then there are cases like this third day of August, 2023, but we’ll leave that as an exercise for the student….
As you can see, a lot of variations are typically possible with regex, which might add to the confusion for some people. It’s a good thing you don’t have to learn that lousy syntax, right? All you need instead is a good library of saved regex resources to apply!
Converting US to UK-formatted dates works similarly. Pre-configured resources for that are available in the download package at the bottom of this article.
I haven’t discussed short date conversions to other short date formats, which will be the subject of a future article.
The resources in the download package below are as follows:
a Regex Assistant library export with US to UK and UK to US long date find & replace regexes
auto-translation rulesets for converting UK long dates to US long dates and vice versa
sample test data