SUM Enchanted Procedure:
The Advanced Power of PROC SUMMARY


David Beam
Systems Seminar Consultants, Inc.
Madison, WI 53711
(608) 278-9964

 

Abstract
Many SAS users are uncomfortable or unsure about the inherent power of PROC SUMMARY. The creation of summarized data sets, with emphasis on the CLASS statement will be demonstrated in this paper.

Explanations of how SAS organizes the output file will help the user take advantage of this powerful SAS tool. There are many techniques of preparing new variables before going into PROC SUMMARY that can allow the procedure to generate counters and statistics, as well as the traditional sums.

The use of PROC FORMAT together with PROC SUMMARY can also allow much flexibility in the use of this procedure.

Introduction
PROC SUMMARY is a basic 'statistical' procedure, in that numeric variables can be used to compute a variety of statistics, including averages, sums, standard deviations, etc. Often overlooked is how this procedure can save programming time when used to full potential.

This paper is not meant to be a statistics course, but will look at the procedure for very common business analysis needs.

In the simple sense, PROC SUMMARY will read in a data set and create an output data set with whatever statistics are needed. The statistics coming out can be grouped together by values of the CLASS variables, or only a single total can be created.

The two procedures, MEANS and SUMMARY, are now (with release 6 or higher of The SAS System) virtually identical. The main difference is that SUMMARY does not produce a printed report as a default, but PROC MEANS assumes a report is needed.

The basic syntax of PROC SUMMARY is shown here. Many of the optional statements and keywords will be discussed later. This is NOT meant to replace the SAS documentation, refer to the SAS Language: Reference manual for complete description of the procedure.

PROC SUMMARY <option list> <statistic-keyword-list>;

VAR variable-list;

CLASS variable-list;

FREQ variable;

WEIGHT variable;

ID variable-list;

BY variable-list;

OUTPUT <OUT=SAS-data-set>

<output-statistic-list>

<statistic(variable)=new-var-name>

<statistic(variable)=new-var-name>

...

;

RUN;

Partial List of Keywords on the PROC Statement
DATA= data set name (input)

DESCENDING

orders the output data set in order of the _TYPE_ variable, default is ascending

MISSING

treats missing values for the CLASS variable as a valid subgroup

NWAY

only the highest value for the _TYPE_ variable will be output

ORDER= DATA | EXTERNAL | FORMATTED |

FREQ | INTERNAL

specifies sorting order of the CLASS variables for output of the data set. Default is INTERNAL, which is sequential unformatted values.

Partial List of Keyword Statistics Available
N the count of observations having non-missing values for the selected variable

NMISS the count of missing values for the variable

MIN the minimum value

MAX the maximum value

SUM the total of the specified variable

MEAN the average (mean)

STD the standard deviation

If no keyword statistics are present, PROC SUMMARY produces the following default statistics:

N, MEAN, STD, MIN, MAX

Statements used with PROC SUMMARY
The following is a partial list of the statements that can be used to control the procedure:

BY variables;

Names variable(s) that will be used to obtain separate statistics when the value changes. The data set must be sorted or indexed by these variables.

CLASS variables;

Used to form sub-groups of output data. Similar to the BY statement, but does not require sorting the data first.

With multiple variables on the CLASS statement, the procedure will generate statistics for each unique value of the CLASS variable, as well as combination statistics for all combinations of all CLASS values.

OUTPUT statement

This "drives" the procedure. The OUTPUT statement can name the new data set being created, select the variables and statistics to generate, and name the new variables being created.

There is no limit to the number of OUTPUT statements in a single PROC SUMMARY.

Sample Data Set
The examples that follow will use the following set of data as input. The CONTENTS of the data set is:

# Variable Type Len Pos

-----------------------------------

1 STATE Char 2 0

2 SALESREP Char 5 2

3 MONTH Char 2 7

4 SALES Num 8 9

5 COST Num 8 17

 

PROC PRINT of the sample data set:

STATE SALESREP MONTH SALES COST

CA DONNA 01 231 180

CA FRED 02 74 83

CA MARY 01 156 148

CA MARY 02 143 97

CA MARY 02 98 65

NY DONNA 01 . .

NY FRED 02 210 155

NY SAM 02 104 96

OK FRED 02 64 72

OK SAM 01 237 184

OK SAM 01 . .

OK SAM 02 94 72

WI FRED 01 53 53

WI FRED 02 197 142

WI JERRY 01 . .

WI JERRY 01 215 184

WI JERRY 02 14 8

WV DONNA 01 315 207

WV DONNA 02 96 78

WV SAM 02 174 169

WV SAM 02 97 53

====== ======

2572 2046


Case 1 - Summary by State
For our first example, we will create a summary file for each state. We need to know total sales and the count of missing values for sales in each state, as well as the highest (MAX) sales in each state.

PROC SUMMARY DATA= sugi.sales;

CLASS state;

VAR sales;

OUTPUT OUT=example1

SUM(sales) = totsales

MAX(sales) = maxsales

NMISS(sales) = badsales

;

RUN;

 

PROC PRINT DATA=example1;

RUN;

 

OBS STATE _TYPE_ _FREQ_ TOTSALES MAXSALES BADSALES

 

1 0 21 2572 315 3

2 CA 1 5 702 231 0

3 NY 1 3 314 210 1

4 OK 1 4 395 237 1

5 WI 1 5 479 215 1

6 WV 1 4 682 315 0

 

The output data set contains one observation with total statistics for the complete incoming data set, as well as one observation for each value of the CLASS variable.

The _TYPE_ variable created by SAS shows the level of the totals on that observation:

The _TYPE_ = 0 observation are the stats for the complete data set.

The _TYPE_ = 1 level are the stats for the values of STATE, or the class variable.

When more variables are present on the CLASS statement, then SAS produces even more output observations. The value of the _TYPE_ variable will increase as well. A summary observation is produced for each possible combination of values of the CLASS variables.

Case 2 - Summary by State and Salesrep
Now we will run the same PROC SUMMARY as above, changing the CLASS statement to include two variables:

 

PROC SUMMARY DATA= sugi.sales;

CLASS state salesrep;

VAR sales;

OUTPUT OUT=example2

SUM(sales) = totsales

MAX(sales) = maxsales

NMISS(sales) = badsales

;

RUN;

 

PROC PRINT DATA=example2 NOOBS

WIDTH=MIN HEADING=V;

RUN;

 

S T M B

A O A A

L _ _ T X D

S E T F S S S

T S Y R A A A

A R P E L L L

T E E Q E E E

E P _ _ S S S

0 21 2572 315 3

DONNA 1 4 642 315 1

FRED 1 5 598 210 0

JERRY 1 3 229 215 1

MARY 1 3 397 156 0

SAM 1 6 706 237 1

CA 2 5 702 231 0

NY 2 3 314 210 1

OK 2 4 395 237 1

WI 2 5 479 215 1

WV 2 4 682 315 0

CA DONNA 3 1 231 231 0

CA FRED 3 1 74 74 0

CA MARY 3 3 397 156 0

NY DONNA 3 1 . . 1

NY FRED 3 1 210 210 0

NY SAM 3 1 104 104 0

OK FRED 3 1 64 64 0

OK SAM 3 3 331 237 1

WI FRED 3 2 250 197 0

WI JERRY 3 3 229 215 1

WV DONNA 3 2 411 315 0

WV SAM 3 2 271 174 0

As can be seen, the output now contains values of the _TYPE_ variable that range from 0 to 3:

 

_TYPE_ Description

 

0 totals for complete data set

1 totals for each SALESREP

2 totals for each STATE

3 totals for SALESREP within STATE

 

The data set produced by PROC SUMMARY can now be used to generate a variety of reports, by using the WHERE statement to pull the required data:

 

PROC PRINT DATA=example2(DROP=salesrep)

NOOBS WIDTH=MIN HEADING=V;

TITLE "State Totals Only";

WHERE _type_ = 2;

 

RUN;

 

State Totals Only

 

T M B

O A A

_ _ T X D

S T F S S S

T Y R A A A

A P E L L L

T E Q E E E

E _ _ S S S

 

CA 2 5 702 231 0

NY 2 3 314 210 1

OK 2 4 395 237 1

WI 2 5 479 215 1

WV 2 4 682 315 0

 

Case 3 - Three (or more) variables on the CLASS statement

 

Change the above program so the CLASS statement contains three variables:

 

PROC SUMMARY DATA= sugi.sales;

CLASS state salesrep month;

VAR sales;

OUTPUT OUT=example3

SUM(sales) = totsales

MAX(sales) = maxsales

NMISS(sales) = badsales

;

RUN;

 

It now becomes difficult to determine what the values of the _TYPE_ variable will represent. The following simple DATA step can be used to get the first observation for each value of the _TYPE_ variable. These can then be printed to review what the level of totals the _TYPE_ variable represent.

 

This program works because the data set produced by PROC SUMMARY is output in the order of the _TYPE_ values:

 

/* Produce data set with one observation per value */

/* of the _TYPE_ variable that is created from */

/* PROC SUMMARY earlier. */

DATA small;

SET example3;

BY _type_;

if FIRST._type_;

RUN;

 

PROC PRINT DATA = small;

TITLE "One per value of _TYPE_";

VAR _type_ state salesrep month;

run;

 

One per value of _TYPE_

 

OBS _TYPE_ STATE SALESREP MONTH

 

1 0

2 1 01

3 2 DONNA

4 3 DONNA 01

5 4 CA

6 5 CA 01

7 6 CA DONNA

8 7 CA DONNA 01

 

This report can now be used to easily see what the _TYPE_ values represent:

 

_TYPE_ Description

 

0 totals for complete data set

1 totals for each MONTH alone

2 totals for each SALESREP alone

3 totals for each combination of MONTH and SALESREP

4 totals for STATE alone

5 each MONTH and STATE combo

6 each SALESREP and STATE combo

7 final detail, totals for each combination of MONTH, SALESREP and STATE

 

Using the NWAY Option

 

Adding the keyword NWAY to the PROC statement tells SAS that the output data set should only contain the highest value of the _TYPE_ variable. So:

 

PROC SUMMARY DATA=sugi.sales NWAY;

...

will create a data set with only _TYPE_ = 7 in the above example.

 

Case 5 - No output Statistics Specified

 

If you use a CLASS statement and an OUTPUT statement but do not specify the desired statistic keywords, you will get a very different type of data set.

 

SAS produces five observations for each value of the CLASS variables, one observation for each of the five default statistics: N, MEAN, STD, MIN and MAX.

 

The name of the statistic generated is stored in the variable _STAT_ in the new data set.. The following program and print show the results:

 

PROC SUMMARY DATA= sugi.sales;

CLASS state;

VAR sales;

OUTPUT OUT=example4;

RUN;

 

PROC PRINT DATA= example4;

RUN;

 

OBS STATE _TYPE_ _FREQ_ _STAT_ SALES

 

1 0 21 N 18.000

2 0 21 MIN 14.000

3 0 21 MAX 315.000

4 0 21 MEAN 142.889

5 0 21 STD 78.926

6 CA 1 5 N 5.000

7 CA 1 5 MIN 74.000

8 CA 1 5 MAX 231.000

9 CA 1 5 MEAN 140.400

10 CA 1 5 STD 60.550

11 NY 1 3 N 2.000

12 NY 1 3 MIN 104.000

13 NY 1 3 MAX 210.000

14 NY 1 3 MEAN 157.000

15 NY 1 3 STD 74.953

16 OK 1 4 N 3.000

17 OK 1 4 MIN 64.000

18 OK 1 4 MAX 237.000

19 OK 1 4 MEAN 131.667

20 OK 1 4 STD 92.446

21 WI 1 5 N 4.000

22 WI 1 5 MIN 14.000

23 WI 1 5 MAX 215.000

24 WI 1 5 MEAN 119.750

25 WI 1 5 STD 101.125

26 WV 1 4 N 4.000

27 WV 1 4 MIN 96.000

28 WV 1 4 MAX 315.000

29 WV 1 4 MEAN 170.500

30 WV 1 4 STD 103.029

 

This data set can be used for specific kinds of reports, where each statistic is a separate observation. A common use for this data would be a PROC PLOT, with each statistic as a separate plot.

 

Case 7 - Counting a Variety of Situations

 

Frequently we need counts of activity happening with the data, but don't have the specific information needed as separate variables in the data set.

 

For example, in the sample data set, suppose we need to know the number or count of observations that had a cost/sales ratio of greater than 75%. We also need a count of SALES under $150.00.

 

One method would be to pass the data set through a DATA step, create new variables with either a value 1 or a missing value to signify the desired "true or false" conditions. These can then be counted with PROC SUMMARY, at the same time other statistics are generated.

 

DATA changed;

SET sugi.sales;

/************************************/

/* assign 1 to counter variable for */

/* profit percent over 75% */

/* and sales under $150 */

/************************************/

profit = ROUND(cost/sales, .01);

IF profit > .75 THEN over75 = 1;

IF sales < 150 THEN und150 = 1;

RUN;

 

PROC SUMMARY DATA= changed NWAY;

CLASS state;

VAR sales cost over75 und150;

OUTPUT OUT=example5(DROP=_FREQ_)

SUM(sales) = totsales

SUM(cost) = totcost

N(over75) = over75

N(und150) = und150

;

RUN;

 

PROC PRINT DATA=example5 NOOBS LABEL SPLIT='*'

WIDTH=MIN;

TITLE "Counting Computed events in data";

LABEL over75 = "Count*Ratio*> 75%"

und150 = "Count*Sales*< $150";

RUN;

 

Counting Computed events in data

 

Count Count

Ratio Sales

STATE _TYPE_ TOTSALES TOTCOST > 75% < $150

CA 1 702 573 3 3

NY 1 314 251 1 2

OK 1 395 328 3 3

WI 1 479 387 2 3

WV 1 682 507 2 2

 

Case 8 - Using Formatted CLASS Variables

 

Suppose you really want to group the STATE values together into regions for the summary totals. You could create a new variable containing the region assignments in a DATA step.

 

A better approach is to use PROC FORMAT to assign the region codes based on STATE values, and tell PROC SUMMARY to use the formatted region assignments.

 

PROC FORMAT;

VALUE $regions

'WI' = 'Midwest'

'CA', 'OK' = 'West'

'NY', 'WV' = 'East'

;

RUN;

 

PROC SUMMARY DATA= sugi.sales NWAY;

CLASS state;

FORMAT state $regions.;

VAR sales cost;

OUTPUT OUT=example6

SUM(sales) = totsales

SUM(cost) = totcost

;

RUN;

 

PROC PRINT DATA=example6 NOOBS

WIDTH=MIN;

TITLE "Proc Summary used formatted States";

RUN;

 

Proc Summary used formatted States

 

STATE _TYPE_ _FREQ_ TOTSALES TOTCOST

 

West 1 9 1097 901

East 1 7 996 758

Midwest 1 5 479 387

 

NOTE - the data set created above has a variable STATE, that contains the unformatted value of state. The $REGIONS. Format was carried with the data set and used in PROC PRINT. The value of STATE came out, as a default, in ascending order ('CA' first).

 

If you do a PROC PRINT with the following FORMAT assigned to STATE, you will see the original STATE value is passed to the data set:

 

FORMAT STATE $2.;

 

Using the ORDER= option on PROC SUMMARY

 

To change to order of the output observations to reflect the formatted values, the above PROC SUMMARY should be changed to request SAS use the formatted values instead of the unformatted values for output

(the results are shown below):

 

PROC SUMMARY DATA= sugi.sales NWAY ORDER=FORMATTED;

 

STATE _TYPE_ _FREQ_ TOTSALES TOTCOST

 

East 1 7 996 758

Midwest 1 5 479 387

West 1 9 1097 901

 

Remember that the variable STATE still contains the unformatted two character state code for the first observation in the formatted CLASS grouping..

 

Other values for the ORDER= option include:

 

DATA in original order of the incoming data set

EXTERNAL or FORMATTED

As shown above, using the assigned formats

FREQ output is in descending frequency counts (levels with most observations first)

INTERNAL

unformatted value (default)

 

Using the MAXID or MINID options

 

Suppose you want to compute the total sales by STATE, and find out the name of the SALESREP that had the highest and lowest sales in each STATE.

 

Using the MAXID and MINID options on the OUTPUT statement, name the variable that contains the value you want to see to identify the highest/lowest value for that value of the CLASS variable:

 

PROC SUMMARY DATA= sugi.sales;

CLASS state;

VAR sales;

OUTPUT OUT=example7

SUM(sales) = totsales

MAXID(sales(salesrep)) = best

MINID(sales(salesrep)) = worst

;

RUN;

 

PROC PRINT DATA=example7 NOOBS

WIDTH=MIN;

TITLE "Best/Worst Salesreps in Each State";

TITLE2 "and in Total U.S.";

RUN;

 

Best/Worst Salesreps in Each State

and in Total U.S.

 

STATE _TYPE_ _FREQ_ TOTSALES BEST WORST

 

0 21 2572 DONNA JERRY

CA 1 5 702 DONNA FRED

NY 1 3 314 FRED SAM

OK 1 4 395 SAM FRED

WI 1 5 479 JERRY JERRY

WV 1 4 682 DONNA DONNA

 

NOTE - it is important to understand that the above shows which salesrep had the highest/lowest individual observation in the data set, not the highest/lowest total sales. In 'CA', Mary had total sales higher than Donna, but Donna had the single highest sales amount in that state.

 

Missing Values were ignored in the statistics, including the selection of the lowest sales by salesrep.

 

CONCLUSION

 

PROC SUMMARY is one of my favorite tools to reduce a large set of data to meaningful and manageable totals. With some creative use of new numeric variables, formats and options in the procedure, this tool can perform very complex summarization tasks.

 

I suggest reviewing the SAS documentation for further use of this PROC, specifically the more statistically oriented features.

 

SAS is a registered trademark of SAS Institute Inc.

 

References

 

SAS Institute Inc. , SAS Procedures Guide, Version 6, Third Edition, Cary, NC: SAS Institute Inc.

 

Author

 

The author will be glad to answer questions and accept suggestions at the following address:

David Beam
Systems Seminar Consultants
2997 Yarmouth Greenway Drive
Madison, WI 53711

 

voice: (608) 278-9964
Fax:  (608) 278-0065

E-mail:  <train@sys-seminar.com>

Website:  <http://www.sys-seminar.com>