The CORR procedure computes Pearson correlation coefficients, three nonparametric measures of association, and the probabilities associated with these statistics. The correlation statistics include

Pearson product-moment correlation
Spearman rank-order correlation
Kendall's tau-b coefficient
Hoeffding's measure of dependence, D
Pearson, Spearman, and Kendall partial correlation

Pearson product-moment correlation is a parametric measure of a linear relationship between two variables. For nonparametric measures of association, Spearman rank-order correlation uses the ranks of the data values and Kendall's tau-b uses the number of concordances and discordances in paired observations. Hoeffding's measure of dependence is another nonparametric measure of association that detects more general departures from independence. A partial correlation provides a measure of the correlation between two variables after controlling for the effects of other variables.
With only one set of analysis variables specified, the default correlation analysis includes descriptive statistics for each analysis variable and Pearson correlation statistics for these variables. You can also compute Cronbach's coefficient alpha for estimating reliability.
With two sets of analysis variables specified, the default correlation analysis includes descriptive statistics for each analysis variable and Pearson correlation statistics between these two sets of variables.
You can save the correlation statistics in a SAS data set for use with other statistical and reporting procedures.
For a Pearson or Spearman correlation, Fisher's z transformation can be used to derive confidence limits and a p-value under a specified null hypothesis $H_0\colon\rho = \rho_0$. Either a one-sided or a two-sided alternative can be used for these statistics.
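For orientation, the textbook form of Fisher's z transformation is sketched here. It assumes a sample correlation $r$ from $n$ bivariate normal observations and omits any bias adjustment, so it may differ in detail from what the procedure reports:

$z = \tanh^{-1}(r) = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$

Under those assumptions $z$ is approximately normal with variance $1/(n-3)$, so approximate two-sided $100(1-\alpha)\%$ confidence limits for $\rho$ are

$\tanh\left(z \pm \frac{z_{1-\alpha/2}}{\sqrt{n-3}}\right)$

where $z_{1-\alpha/2}$ is the corresponding standard normal quantile.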

Getting Started

The following statements create the data set Fitness, which has been altered to contain some missing values:

   *----------------- Data on Physical Fitness -----------------*
   | These measurements were made on men involved in a physical |
   | fitness course at N.C. State University.                   |
   | The variables are Age (years), Weight (kg),                |
   | Runtime (time to run 1.5 miles in minutes), and            |
   | Oxygen (oxygen intake, ml per kg body weight per minute)   |
   | Certain values were changed to missing for the analysis.   |
   *------------------------------------------------------------*;
   data Fitness;
      input Age Weight Oxygen RunTime @@;
      datalines;
   44 89.47 44.609 11.37    40 75.07 45.313 10.07
   44 85.84 54.297  8.65    42 68.15 59.571  8.17
   38 89.02 49.874   .      47 77.45 44.811 11.63
   40 75.98 45.681 11.95    43 81.19 49.091 10.85
   44 81.42 39.442 13.08    38 81.87 60.055  8.63
   44 73.03 50.541 10.13    45 87.66 37.388 14.03
   45 66.45 44.754 11.12    47 79.15 47.273 10.60
   54 83.12 51.855 10.33    49 81.42 49.156  8.95
   51 69.63 40.836 10.95    51 77.91 46.672 10.00
   48 91.63 46.774 10.25    49 73.37   .    10.08
   57 73.37 39.407 12.63    54 79.38 46.080 11.17
   52 76.32 45.441  9.63    50 70.87 54.625  8.92
   51 67.25 45.118 11.08    54 91.63 39.203 12.88
   51 73.71 45.790 10.47    57 59.08 50.545  9.93
   49 76.32   .      .      48 61.24 47.920 11.50
   52 82.78 47.467 10.50
   ;

The following statements invoke the CORR procedure and request a correlation analysis:

   ods html;
   ods graphics on;

   proc corr data=Fitness plots;
   run;

   ods graphics off;
   ods html close;

This graphical display is requested by specifying the experimental ODS GRAPHICS statement and the experimental PLOTS option. For general information about ODS graphics, refer to Chapter 15, "Statistical Graphics Using ODS" (SAS/STAT User's Guide). For specific information about the graphics available in the CORR procedure, see the section "ODS Graphics."

The CORR Procedure

4 Variables:	Age Weight Oxygen RunTime

Simple Statistics
Variable	N	Mean	Std Dev	Sum	Minimum	Maximum
Age	31	47.67742	5.21144	1478	38.00000	57.00000
Weight	31	77.44452	8.32857	2401	59.08000	91.63000
Oxygen	29	47.22721	5.47718	1370	37.38800	60.05500
RunTime	29	10.67414	1.39194	309.55000	8.17000	14.03000

Figure 1.1: Univariate Statistics
By default, all numeric variables not listed in other statements are used in the analysis. Observations with nonmissing values for each variable are used to derive the univariate statistics for that variable.

Pearson Correlation Coefficients Prob > |r| under H0: Rho=0 Number of Observations
	Age	Weight	Oxygen	RunTime
Age	1.00000 31	-0.23354 0.2061 31	-0.31474 0.0963 29	0.14478 0.4536 29
Weight	-0.23354 0.2061 31	1.00000 31	-0.15358 0.4264 29	0.20072 0.2965 29
Oxygen	-0.31474 0.0963 29	-0.15358 0.4264 29	1.00000 29	-0.86843 <.0001 28
RunTime	0.14478 0.4536 29	0.20072 0.2965 29	-0.86843 <.0001 28	1.00000 29

Figure 1.2: Pearson Correlation Coefficients
By default, Pearson correlation statistics are computed from observations with nonmissing values for each pair of analysis variables. With missing values in the analysis, the "Pearson Correlation Coefficients" table shown in Figure 1.2 displays the correlation, the p-value under the null hypothesis of zero correlation, and the number of nonmissing observations for each pair of variables.
The table displays a correlation of -0.86843 between RunTime and Oxygen, which is significant with a p-value less than 0.0001. That is, there exists an inverse linear relationship between these two variables. As RunTime (time to run 1.5 miles in minutes) increases, Oxygen (oxygen intake, ml per kg body weight per minute) decreases.
The experimental PLOTS option displays a symmetric matrix plot for the analysis variables. The inverse linear relationship between Oxygen and RunTime is also shown in Figure 1.3.
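As a rough check on the p-value reported for Oxygen and RunTime, the test of zero correlation for a Pearson coefficient is conventionally based on a t statistic with $n - 2$ degrees of freedom. The following worked step assumes that standard form and uses the values from Figure 1.2 ($r = -0.86843$, $n = 28$):

$t = r\sqrt{\frac{n-2}{1-r^2}} = -0.86843\sqrt{\frac{26}{1-0.75417}} \approx -8.93$

A t value of this size on 26 degrees of freedom corresponds to a two-sided p-value far below 0.0001, which agrees with the table entry.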

PROC CORR < options > ;
   BY variables ;
   FREQ variable ;
   PARTIAL variables ;
   VAR variables ;
   WEIGHT variable ;
   WITH variables ;

The BY statement specifies groups in which separate correlation analyses are performed.

The FREQ statement specifies the variable that represents the frequency of occurrence for other values in the observation.

The PARTIAL statement identifies controlling variables to compute Pearson, Spearman, or Kendall partial-correlation coefficients.

The VAR statement lists the numeric variables to be analyzed and their order in the correlation matrix. If you omit the VAR statement, all numeric variables not listed in other statements are used.

The WEIGHT statement identifies the variable whose values weight each observation to compute Pearson product-moment correlation.

The WITH statement lists the numeric variables with which correlations are to be computed.

The PROC CORR statement is the only required statement for the CORR procedure. The rest of this section provides detailed syntax information for each of these statements, beginning with the PROC CORR statement. The remaining statements are in alphabetical order.
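As an illustration of how these statements combine, the following sketch reuses the Fitness data set from the "Getting Started" section. The choice of variables in each step is arbitrary and only meant to show the statement layout:

```sas
/* Correlations between Age (WITH) and the other measurements (VAR) */
proc corr data=Fitness;
   var Weight Oxygen RunTime;
   with Age;
run;

/* Partial correlation of Oxygen and RunTime, controlling for Age and Weight */
proc corr data=Fitness;
   var Oxygen RunTime;
   partial Age Weight;
run;
```

In the first step the WITH variable labels the rows of the correlation table and the VAR variables label the columns, which gives a rectangular table rather than a full symmetric matrix.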

PROC CORR Statement

PROC CORR < options > ;

The following table summarizes the options available in the PROC CORR statement.

Table 1.1: Summary of PROC CORR Options

Tasks		Options
Specify data sets
	Input data set	DATA=
	Output data set with Hoeffding's D statistics	OUTH=
	Output data set with Kendall correlation statistics	OUTK=
	Output data set with Pearson correlation statistics	OUTP=
	Output data set with Spearman correlation statistics	OUTS=
Control statistical analysis
	Exclude observations with nonpositive weight values from the analysis	EXCLNPWGT
	Exclude observations with missing analysis values from the analysis	NOMISS
	Request Hoeffding's measure of dependence, D	HOEFFDING
	Request Kendall's tau-b	KENDALL
	Request Pearson product-moment correlation	PEARSON
	Request Spearman rank-order correlation	SPEARMAN
	Request Pearson correlation statistics using Fisher's z transformation	FISHER PEARSON
	Request Spearman rank-order correlation statistics using Fisher's z transformation	FISHER SPEARMAN
Control Pearson correlation statistics
	Compute Cronbach's coefficient alpha	ALPHA
	Compute covariances	COV
	Compute corrected sums of squares and crossproducts	CSSCP
	Compute correlation statistics based on Fisher's z transformation	FISHER
	Exclude missing values	NOMISS
	Specify singularity criterion	SINGULAR=
	Compute sums of squares and crossproducts	SSCP
	Specify the divisor for variance calculations	VARDEF=
Control printed output
	Display a specified number of ordered correlation coefficients	BEST=
	Suppress Pearson correlations	NOCORR
	Suppress all printed output	NOPRINT
	Suppress p-values	NOPROB
	Suppress descriptive statistics	NOSIMPLE
	Display ordered correlation coefficients	RANK

The following options (listed in alphabetical order) can be used in the PROC CORR statement:

ALPHA

calculates and prints Cronbach's coefficient alpha. PROC CORR computes separate coefficients using raw and standardized values (the standardized values scale each variable to unit variance). For each VAR statement variable, PROC CORR computes the correlation between the variable and the total of the remaining variables. It also computes Cronbach's coefficient alpha using only the remaining variables.

If a WITH statement is specified, the ALPHA option is invalid. When you specify the ALPHA option, the Pearson correlations will also be displayed. If you specify the OUTP= option, the output data set also contains observations with Cronbach's coefficient alpha. If you use the PARTIAL statement, PROC CORR calculates Cronbach's coefficient alpha for partialled variables. See the section "Partial Correlation."
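A minimal sketch of an ALPHA request follows. The data set Items and the variables Q1 through Q5 are hypothetical questionnaire items, not part of this documentation; the NOMISS option is added here so that every item is scored on the same respondents:

```sas
/* Items and Q1-Q5 are hypothetical names used only for illustration */
proc corr data=Items alpha nomiss;
   var Q1-Q5;
run;
```

No WITH statement appears, because the ALPHA option is invalid when one is specified.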

BEST=n

prints the n highest correlation coefficients for each variable, where $n \geq 1$. Correlations are ordered from highest to lowest in absolute value. If you omit the BEST= option, PROC CORR prints correlations in a rectangular table using the variable names as row and column labels.

If you specify the HOEFFDING option, PROC CORR displays the D statistics in order from highest to lowest.
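For example, the following sketch asks for the two largest correlations (in absolute value) for each variable in the Fitness data set from the "Getting Started" section. The value 2 is an arbitrary choice:

```sas
/* List only the two strongest correlations for each variable */
proc corr data=Fitness best=2;
run;
```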

COV

displays the variance and covariance matrix. When you specify the COV option, the Pearson correlations will also be displayed. If you specify the OUTP= option, the output data set also contains the covariance matrix with the corresponding _TYPE_ variable value 'COV'. If you use the PARTIAL statement, PROC CORR computes a partial covariance matrix.

CSSCP

displays a table of the corrected sums of squares and crossproducts. When you specify the CSSCP option, the Pearson correlations will also be displayed. If you specify the OUTP= option, the output data set also contains a CSSCP matrix with the corresponding _TYPE_ variable value 'CSSCP'. If you use a PARTIAL statement, PROC CORR prints both an unpartial and a partial CSSCP matrix, and the output data set contains a partial CSSCP matrix.

DATA=SAS-data-set

names the SAS data set to be analyzed by PROC CORR. By default, the procedure uses the most recently created SAS data set.

EXCLNPWGT

excludes observations with nonpositive weight values from the analysis. By default, PROC CORR treats observations with negative weights like those with zero weights and counts them in the total number of observations.

FISHER < ( fisher-options ) >

requests confidence limits and p-values under a specified null hypothesis, $H_0\colon\rho = \rho_0$ , for correlation coefficients using Fisher's z transformation. These correlations include the Pearson correlations and Spearman correlations.

The following fisher-options are available:

ALPHA= $\alpha$: specifies the level of the confidence limits for the correlation, $100(1-\alpha)\%$ . The value of the ALPHA= option must be between 0 and 1, and the default is ALPHA=0.05.
BIASADJ= YES | NO: specifies whether or not the bias adjustment is used in constructing confidence limits. The BIASADJ=YES option also produces a new correlation estimate using the bias adjustment. By default, BIASADJ=YES.
RHO0= ${\rho}_{0}$: specifies the value ${\rho}_{0}$ in the null hypothesis $H_0\colon\rho = \rho_0$ , where $-1 \lt {\rho}_{0} \lt 1$ . By default, RHO0=0.
TYPE= LOWER | UPPER | TWOSIDED: specifies the type of confidence limits. The TYPE=LOWER option requests a lower confidence limit from the lower alternative $H_1\colon\rho \lt \rho_{0}$ , the TYPE=UPPER option requests an upper confidence limit from the upper alternative $H_1\colon\rho \gt \rho_{0}$ , and the default TYPE=TWOSIDED option requests two-sided confidence limits from the two-sided alternative $H_1\colon\rho \neq \rho_{0}$ .
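The fisher-options can be combined in a single request. The following sketch uses the Fitness data set from the "Getting Started" section; the particular values (a null correlation of -0.5 and 90% limits without bias adjustment) are arbitrary and chosen only to show the syntax:

```sas
/* 90% two-sided limits and a test of H0: rho = -0.5, no bias adjustment */
proc corr data=Fitness nosimple
          fisher(rho0=-0.5 alpha=0.10 biasadj=no type=twosided);
   var Oxygen RunTime;
run;
```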

HOEFFDING

requests a table of Hoeffding's D statistics. This D statistic is 30 times larger than the usual definition and scales the range between -0.5 and 1 so that large positive values indicate dependence. The HOEFFDING option is invalid if a WEIGHT or PARTIAL statement is used.

KENDALL

requests a table of Kendall's tau-b coefficients based on the number of concordant and discordant pairs of observations. Kendall's tau-b ranges from -1 to 1.

The KENDALL option is invalid if a WEIGHT statement is used. If you use a PARTIAL statement, probability values for Kendall's partial tau-b are not available.

NOCORR

suppresses the display of Pearson correlations. If you specify the OUTP= option, the data set type remains CORR. To change the data set type to COV, CSSCP, or SSCP, use the TYPE= data set option.
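A sketch of the TYPE= data set option in this role follows. It uses the Fitness data set from the "Getting Started" section, and FitCov is an arbitrary output name chosen for illustration:

```sas
/* Display covariances only and mark the output data set as TYPE=COV */
proc corr data=Fitness nocorr cov outp=FitCov(type=cov);
run;
```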

NOMISS

excludes observations with missing values from the analysis. Otherwise, PROC CORR computes correlation statistics using all of the nonmissing pairs of variables. Using the NOMISS option is computationally more efficient.
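For the Fitness data set from the "Getting Started" section, a NOMISS request might look like the following sketch:

```sas
/* Use only observations that are complete on all analysis variables */
proc corr data=Fitness nomiss;
run;
```

Three of the 31 observations in Fitness have a missing Oxygen or RunTime value, so this analysis would be based on the remaining 28 observations for every pair of variables, instead of the 28 to 31 observations per pair shown in Figure 1.2.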

NOPRINT

suppresses all displayed output. Use NOPRINT if you want to create an output data set only.

NOPROB

suppresses displaying the probabilities associated with each correlation coefficient.

NOSIMPLE

suppresses printing simple descriptive statistics for each variable. However, if you request an output data set, the output data set still contains simple descriptive statistics for the variables.

OUTH=output-data-set

creates an output data set containing Hoeffding's D statistics. The contents of the output data set are similar to those of the OUTP= data set. When you specify the OUTH= option, the Hoeffding's D statistics will be displayed, and the Pearson correlations will be displayed only if the PEARSON, ALPHA, COV, CSSCP, SSCP, or OUT= option is also specified.

OUTK=output-data-set

creates an output data set containing Kendall correlation statistics. The contents of the output data set are similar to those of the OUTP= data set. When you specify the OUTK= option, the Kendall correlation statistics will be displayed, and the Pearson correlations will be displayed only if the PEARSON, ALPHA, COV, CSSCP, SSCP, or OUT= option is also specified.

OUTP=output-data-set

OUT=output-data-set

creates an output data set containing Pearson correlation statistics. This data set also includes means, standard deviations, and the number of observations. The value of the _TYPE_ variable is 'CORR'. When you specify the OUTP= option, the Pearson correlations will also be displayed. If you specify the ALPHA option, the output data set also contains six observations with Cronbach's coefficient alpha.
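The following sketch saves the Pearson statistics for the Fitness data set from the "Getting Started" section and then lists them. FitCorr is an arbitrary data set name chosen for illustration:

```sas
/* Save Pearson statistics without printing the usual tables */
proc corr data=Fitness noprint outp=FitCorr;
   var Age Weight Oxygen RunTime;
run;

/* Inspect the saved rows; _TYPE_ identifies the statistic in each row */
proc print data=FitCorr;
run;
```

Listing the data set is a simple way to see which rows hold the means, standard deviations, counts, and correlations before passing it to another procedure.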

OUTS=SAS-data-set

creates an output data set containing Spearman correlation coefficients. The contents of the output data set are similar to those of the OUTP= data set. When you specify the OUTS= option, the Spearman correlation coefficients will be displayed, and the Pearson correlations will be displayed only if the PEARSON, ALPHA, COV, CSSCP, SSCP, or OUT= option is also specified.

PEARSON

requests a table of Pearson product-moment correlations. If you do not specify the HOEFFDING, KENDALL, SPEARMAN, OUTH=, OUTK=, or OUTS= option, the CORR procedure produces Pearson product-moment correlations by default. Otherwise, you must specify the PEARSON, ALPHA, COV, CSSCP, SSCP, or OUT= option for Pearson correlations. The correlations range from -1 to 1.

RANK

displays the ordered correlation coefficients for each variable. Correlations are ordered from highest to lowest in absolute value. If you specify the HOEFFDING option, the D statistics are displayed in order from highest to lowest.

SINGULAR=p

specifies the criterion for determining the singularity of a variable if you use a PARTIAL statement. A variable is considered singular if its corresponding diagonal element after Cholesky decomposition has a value less than p times the original unpartialled value of that variable. The default value is 1E-8. The value of p must be between 0 and 1.

SPEARMAN

requests a table of Spearman correlation coefficients based on the ranks of the variables. The correlations range from -1 to 1. If you specify a WEIGHT statement, the SPEARMAN option is invalid.
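Because requesting any nonparametric measure turns off the default Pearson table, all four measures have to be named explicitly when they are wanted together. A sketch using the Fitness data set from the "Getting Started" section:

```sas
/* Request all four measures of association in one step */
proc corr data=Fitness pearson spearman kendall hoeffding;
   var Weight Oxygen RunTime;
run;
```

No WEIGHT or PARTIAL statement is used here, since the HOEFFDING option is invalid with either and the KENDALL and SPEARMAN options are invalid with a WEIGHT statement.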

SSCP

displays a table of the sums of squares and crossproducts. When you specify the SSCP option, the Pearson correlations will also be displayed. If you specify the OUTP= option, the output data set contains an SSCP matrix and the corresponding _TYPE_ variable value is 'SSCP'. If you use a PARTIAL statement, the unpartial SSCP matrix is displayed, and the output data set does not contain an SSCP matrix.

VARDEF=d

specifies the variance divisor in the calculation of variances and covariances. The following table shows the possible values for the value d and associated divisors, where k is the number of PARTIAL statement variables. The default is VARDEF=DF.

Table 1.2: Possible Values for VARDEF=

Value	Divisor	Formula
DF	degrees of freedom	n - k - 1
N	number of observations	n
WDF	sum of weights minus one	$(\Sigma w_{i}) - k - 1$
WEIGHT | WGT	sum of weights	$\Sigma w_{i}$

The variance is computed as

$\frac{1}{d} \sum_{i} (x_i - \bar{x})^2$

where $\bar{x}$ is the sample mean.

If a WEIGHT statement is used, the variance is computed as

$\frac{1}{d} \sum_{i} w_i (x_i - \bar{x}_w)^2$

where $w_i$ is the weight for the $i$th observation and $\bar{x}_w$ is the weighted mean.

If you use the WEIGHT statement and VARDEF=DF, the variance is an estimate of $s^2$, where the variance of the $i$th observation is $V(x_i) = s^2/w_i$. This yields an estimate of the variance of an observation with unit weight.

If you use the WEIGHT statement and VARDEF=WGT, the computed variance is asymptotically an estimate of $s^2/\bar{w}$ , where $\bar{w}$ is the average weight (for large n). This yields an asymptotic estimate of the variance of an observation with average weight.
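To show where the divisor choice enters, the following sketch pairs a WEIGHT statement with VARDEF=WDF. The data set Survey and the variables Income, Spending, and SampWt are hypothetical names introduced only for this illustration:

```sas
/* Survey, Income, Spending, and SampWt are hypothetical names */
proc corr data=Survey vardef=wdf exclnpwgt cov;
   var Income Spending;
   weight SampWt;
run;
```

The EXCLNPWGT option is included so that observations with zero or negative SampWt values are dropped rather than counted in the number of observations.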

Wednesday, March 7, 2012

Proc Corr in SAS
