How do I? > Match dirty data

There is often a need to match real-world data that is 'dirty'. For example, you might want to match the company name in two datasets to perform a Lookup.

In the first dataset there are company names:

Smith Industries

Weyland-Yutani

ACME LTD

In the second dataset there are company names:

Smith Industries Ltd

Weyland-yutani,Inc.

Acme

You can try to match these using the fuzzy matching option in Lookup, but this is slow for big datasets and can result in false positives. So it is much better to try to clean/normalize the data as much as possible before matching.

You can do this by using a Replace transform to replace any of the following with a space:

•punctuation characters

• ‘inc’, ‘incorporated’, ‘ltd’, ‘limited’ etc (use \b to denote a word boundary in a regex, so you only replace a whole word, not part of a word)

Then use a Whitespace transform to remove leading and trailing whitespace and convert consecutive spaces into a single space.

You can then do a Lookup with exact match and case sensitive unchecked:

If there are still issues (e.g. typos in company names) you could try setting Matching to Fuzzy.

Once you are happy with the match, you can add an output to write it to file.