Easy Data Transform newsletter
17-Mar-2022

Easy Data Transform v1.26.0 released

Easy Data Transform v1.26.0 is now available for Windows and Mac. Improvements include:

Find out more and download the new release from the release page:

Download the update

This is a free upgrade for everyone with a valid license. If you don’t yet have a license, you can buy one here.

Thanks to everyone who sent us bug reports and suggestions for improvements.

Videos

Easy Data Transform YouTube channel

We’ve been adding more videos to our YouTube channel. The latest two videos are:

You can subscribe to the channel to get notifications of new videos. It would really help us with the all-powerful YouTube algorithm if you subscribe to the channel and ‘Like’ a few videos!

Easy Data Transform YouTube channel

Fuzzy matching

Often datasets will have records that are duplicates, but are not identical. For example an address might be recorded as:

100 avenue street, townsville, ohio

And:

100 avenue st., townsville, ohio

Previously the Dedupe transform used exact matching and wouldn’t match these two addresses. But now you can use ‘fuzzy’ matching and specify the percentage difference between two records that is acceptable.

Fuzzy dedupe

For example, doing a fuzzy match to:

100 avenue street, townsville, ohio

Gives:

VALUE FUZZY MATCH
100 avenue street, townsville, ohio 100%
100 avnue street, townsville, ohio 98%
100 avenue street townsville ohio 95%
100 avenue st., townsville, ohio 89%
100 avenue st, townsville 72%
100 av. st., citysville, texas 52%
townsville, ohio 46%
742 evergreen terrace, springfield, oregon 36%


The difference between 2 values is calculated as the number of character additions, substitutions or deletions required to turn one string into the other (the ‘Levenshtein Distance’), divided by the number of characters in the longest value.

You can now also choose to ignore whitespace and ignore punctuation in the Dedupe transform.

Note that fuzzy matching is significantly slower than exact matching. And the time taken increases with:

It may require a bit of experimenting to find the right value of fuzzy match. For this you can use the new Explore Duplicates feature.

Explore Duplicates window

Sometimes you still need a ‘human in the loop’. In which case you can set the new Mode option to Add duplicate information, output the dataset and modify it ‘by hand’.

Dirty dirty data

Real world data is invariably ‘dirty’, with a range of issues, including: empty values, outlier values and unexpected characters. We have now improved the Column Values window so that you can get an idea of the data in each column at a glance.

Column values profile

If it is a large dataset you can sample from it, for speed.

Profile sampling

You can also identify extaneous characters in each column in the new Characters tab.

Data cleansing

If you hover over a cell you can see up to 3 example values where that character is present.

Data profiling drilldown

You can use the Characters tab to give you instant feedback on changes made with the Replace and Whitespace transforms. Again you can sample large datasets, for speed.

And finally…

Slava Ukraini!

We have been watching the unfolding situation in Ukraine with horror. But also with huge admiration for the Ukrainians, fighting for their freedom. It is only an accident of birth that it is them, rather than us, facing down Russian tanks and artillery. If you can spare some cash, here are a couple of donation links:

https://redcross.org.ua/en/donate/
https://www.comebackalive.in.ua/

Andy Brice
Oryx Digital Ltd
17-Mar-2022

Questions or problems?

Email support@easydatatransform.com