Summary

<< Click to Display Table of Contents >>

Navigation:  Reference > Transforms >

Summary

Description

Summarize the values in the selected columns.

 

Example

Create a summary, including dates and statistics:

 

how to summary a data table example

 

Inputs

One.

 

Options

Select the Columns you wish to summarize.

Check include dates if you wish to check for date values using supported date formats. This can be slow for large datasets.

Check include statistics if you wish to add the sum, average, median and standard deviation for numeric values.

 

Notes

Whether values are interpreted as number or dates depends on Supported date formats and locale.

Empty values is the number of values in the column that are completely empty. Values with whitespace do not count as empty.

Non-empty values is the number of values in the column that are not completely empty. Values with whitespace do not count as empty.

Numeric values is the number of numeric of values in the column that can be interpreted as a number.

Integer values is the number of numeric of values in the column that can be interpreted as integers (whole numbers). "1.0" (depending on your locale), "0", "-1" and "1e3" are considered integers.

Real values is the number of numeric values in the column that can be interpreted as reals  (floating point numbers). "1.2345" (depending on your locale) and "1e-3" are considered reals.

Boolean values is the number of values in the column that can be interpreted as booleans  (true/false). "TRUE", "True" and "false" are considered booleans.

Date values is the number of values in the column that can be interpreted as a date. Only shown if check for dates is checked.

Text values is the number of values in the column that cannot be interpreted as empty, numeric or date.

Distinct values is the number of different values in the column. Date and numeric values are treated as text (e.g. '7' is treated as different to '7.0' and '1/1/2020' is treated as different to '01/01/2020'). Comparison between values is sensitive to case and whitespace.

Unique values is the number of values that only occur once in the column. Date and numeric values are treated as text. Comparison between values is sensitive to case and whitespace.

Duplicated values is the number of values that occur more than once in the column. Date and numeric values are treated as text. Comparison between values is sensitive to case and whitespace.

 

distinct-duplicated-unique

 

Min length is the minimum number of characters of non-empty values in the column. Whitespace is counted. Date and numeric values are treated as text.

Max length is the maximum number of characters of values in the column. Whitespace is counted. Date and numeric values are treated as text.

Average length is the arithmetic mean of the number of characters of non-empty values in the column. Whitespace is counted. Date and numeric values are treated as text.

Min numeric is the minimum numeric value in the column. Non-numeric and empty values are ignored.        

Max numeric is the maximum numeric value in the column. Non-numeric and empty values are ignored.        

Range numeric is Max numeric - Min numeric.

Negative numeric is the number of negative numeric values in the column.

Zero numeric is the number of numeric zero values in the column.

Positive numeric is the number of positive numeric values in the column.

Sum numeric is the sum of numeric values in the column. Non-numeric and empty values are ignored. Only shown if include statistics is checked.

Average numeric is the arithmetic mean of numeric values in the column. Non-numeric and empty values are ignored. Only shown if include statistics is checked.

Median numeric is the median of numeric values in the column. Non-numeric and empty values are ignored. Only shown if include statistics is checked.

Mode numeric is the mode of numeric values in the column. Non-numeric and empty values are ignored. Only shown if include statistics is checked.

Stddev numeric is the sample standard deviation (equivalent to Excel function stddev.s) of numeric values in the column. Non-numeric and empty values are ignored. Only shown if include statistics is checked.

Variance numeric is the square of the sample standard deviation. Only shown if include statistics is checked.

Q1 numeric is the first quartile (calculated using Method 1 here). Only shown if include statistics is checked.

Q3 numeric is the third quartile  (calculated using Method 1 here). Only shown if include statistics is checked.

IQR numeric is the Inter Quartile Range. Only shown if include statistics is checked.

Skew numeric is the alternative Pearson Mode Skewness  (3 * (Mean – Median) / Standard Deviation ). Only shown if include statistics is checked.

Min date is the minimum date value in the column. Only shown if include dates is checked.

Max date is the maximum date value in the column. Only shown if include dates is checked.

Most frequent lists the most common text in the column. Empty values are not counted. Date and numeric values are treated as text. Comparison between values is sensitive to case and whitespace.

 

Notes

If your dataset is large, you might want to Sample it first.

Use Num Format to change the precision of numerical results.

You can use Whitespace to remove any whitespace at the start or end of values before Summary.

If you wish to have a row displayed per column you can Transpose the table first.

Mode may not work as expected with non-integer values, due to precision issues. You can fix this by using a Num Format transform first.

 

See also

Count

Ngram

Pivot

Stats