<< Click to Display Table of Contents >> Navigation: How do I? > Dedupe a dataset |
There are 2 transforms for removing duplicate rows from datasets:
•Dedupe removes duplicate rows, keeping only the first row in each group that it considers to be duplicates of each other
•Unique creates a single aggregate row from each group that it considers to be duplicates of each other
Dedupe is simpler. Unique has more flexibility.
To keep only the first row with each Customer Id from this dataset:
To get this dataset:
Drag the dataset file onto the Center pane of Easy Data Transform. Then click the Dedupe transform in the Left pane.
Then check the Customer Id column in the Right pane.
Only the first row with each Customer Id is kept. Use Sort if you want to change the order before the Dedupe transform.
If you only want to remove rows with the same Customer Id and Product Id, check both the Customer Id and Product Id columns.
If you want to use fuzzy matching to also remove rows that are similar, but not identical, set Matching to Fuzzy.
To see what rows will be removed and experiment with different options, click the Explore Duplicates... button.
Note that Dedupe takes account of whitespace. So you might need a Whitespace transform before the Dedupe transform.
See also the Dedupe documentation.
To aggregate all rows with the same Customer Id:
To get this dataset with:
•The first Name for that Customer Id
•A comma separated list of Product Ids for that Customer Id
•The total Cost for that Customer Id
•The last Date for that Customer Id
Drag the dataset file onto the Center pane of Easy Data Transform. Then click the Unique transform in the Left pane (if you can't see Unique, then check show advanced is checked in the Left pane).
Then set the Unique options as shown below:
One aggregated row is created for each Customer Id. Use Sort if you want to change the order before the Unique transform.
If you only want to create a new aggregate row for rows with the same Customer Id and Product Id, set both the Customer Id and Product Id columns to Keep unique.
Note that Unique takes account of whitespace. So you might need a Whitespace transform before the Unique transform.
See also the Unique documentation.
See also: