### IMSL_K_MEANS

The IMSL_K_MEANS function performs a K-means (centroid) cluster analysis.

The IMSL_K_MEANS function is an implementation of Algorithm AS 136 by Hartigan and Wong (1979). This function computes K-means (centroid) Euclidean metric clusters for an input matrix starting with initial estimates of the K-cluster means. The IMSL_K_MEANS function allows for missing values coded as NaN (Not a Number) and for weights and frequencies.
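As a rough illustration of what starting from initial estimates of the cluster means involves, the following Python sketch assigns each observation to its nearest seed under the squared Euclidean metric. It is not the IMSL implementation: the function name `nearest_seed` and the toy data are hypothetical, and missing values, weights, and frequencies are ignored for brevity.

```python
def nearest_seed(x, seeds):
    """Return the 0-based index of the closest seed for each row of x."""
    labels = []
    for row in x:
        # Squared Euclidean distance from this observation to every seed.
        dists = [sum((a - b) ** 2 for a, b in zip(row, seed)) for seed in seeds]
        labels.append(dists.index(min(dists)))
    return labels


# Hypothetical toy data: four observations of two variables, two seeds.
x = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.1], [4.8, 5.3]]
seeds = [[1.0, 1.0], [5.0, 5.0]]
labels = nearest_seed(x, seeds)
print(labels)  # [0, 0, 1, 1]
```

In this toy case the first two observations fall to the first seed and the last two to the second. An assignment like this serves only as a starting point, which a K-means procedure then refines.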

Let p = N_ELEMENTS(x(0, *)) be the number of variables to be used in computing the Euclidean distance between observations. The idea in K-means cluster analysis is to find a clustering (or grouping) of the observations so as to minimize the total within-cluster sums-of-squares. In this case, the total sums-of-squares within each cluster is computed as the sum of the centered sum-of-squares over all non-missing values of each variable. That is:

$$\phi = \sum_{i=1}^{K} \sum_{j=1}^{p} \sum_{m=1}^{n_i} f_{\nu_{im}} \, w_{\nu_{im}} \, \delta_{\nu_{im},j} \left( x_{\nu_{im},j} - \bar{x}_{ij} \right)^2$$

where $\nu_{im}$ denotes the row index of the m-th observation in the i-th cluster in the matrix X; $n_i$ is the number of rows of X assigned to group i; $f$ denotes the frequency of the observation; $w$ denotes its weight; $\delta$ is 0 if the j-th variable on observation $\nu_{im}$ is missing, otherwise $\delta$ is 1; and $\bar{x}_{ij}$ is the average of the non-missing observations for variable j in group i. This method sequentially processes each observation and reassigns it to another cluster if doing so results in a decrease of the total within-cluster sums-of-squares. See Hartigan and Wong (1979) or Hartigan (1975) for details.
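The total within-cluster sum of squares described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the IMSL code: the function name `within_ss` and the toy data are hypothetical, and the cluster mean is assumed to be the frequency- and weight-weighted average of the non-missing values (with the default frequencies and weights of 1, this is the plain average).

```python
import math


def within_ss(x, labels, k, freq=None, weights=None):
    """Total within-cluster sum of squares, skipping NaN entries."""
    n, p = len(x), len(x[0])
    freq = freq or [1.0] * n
    weights = weights or [1.0] * n
    total = 0.0
    for i in range(k):
        rows = [r for r in range(n) if labels[r] == i]
        for j in range(p):
            # Keep only rows where variable j is present (delta = 1).
            present = [r for r in rows if not math.isnan(x[r][j])]
            denom = sum(freq[r] * weights[r] for r in present)
            if denom == 0:
                continue  # no usable values for this variable in this cluster
            # Assumed weighted mean of the non-missing values.
            mean = sum(freq[r] * weights[r] * x[r][j] for r in present) / denom
            total += sum(
                freq[r] * weights[r] * (x[r][j] - mean) ** 2 for r in present
            )
    return total


# Hypothetical toy data: the second observation is missing its second variable.
x = [[1.0, 2.0], [3.0, float("nan")], [10.0, 10.0], [12.0, 14.0]]
before = within_ss(x, [0, 1, 1, 1], 2)  # observation 1 placed in the far cluster
after = within_ss(x, [0, 0, 1, 1], 2)   # observation 1 moved next to observation 0
print(after < before)  # True: the move lowers the total, so it would be accepted
```

The comparison at the end mirrors the reassignment rule: an observation is moved only when the move lowers the total.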

## Example

This example performs K-means cluster analysis on Fisher’s iris data, which is obtained by IMSL_STATDATA. The initial cluster seed for each iris type is an observation known to be in the iris type.

`seeds = MAKE_ARRAY(3,4)`
`x = IMSL_STATDATA(3)`
`seeds(0, *) = x(0, 1:4)`
`seeds(1, *) = x(50, 1:4)`
`seeds(2, *) = x(100, 1:4)`
`; Use Columns 1, 2, 3, and 4 of data matrix x, only.`
`cluster_group = IMSL_K_MEANS(x(*, 1:4), seeds, $`
`  Means_Cluster = means_cluster, Ssq_Cluster = ssq_cluster, $`
`  Counts_Cluster = counts_cluster)`
`format = '(a, 10i4)'`
`; Print cluster membership in groups of 10.`
`FOR i = 0, 140, 10 DO BEGIN &$`
`  PRINT, 'observation: ', i + INDGEN(10) + 1, $`
`  FORMAT = format &$`
`  PRINT, 'cluster: ', cluster_group(i:i+9), $`
`  FORMAT = format &$`
`  PRINT &$`
`END`
`observation:  1   2   3   4   5   6   7   8   9  10`
`  cluster     : 1   1   1   1   1   1   1   1   1   1`
`observation: 11  12  13  14  15  16  17  18  19  20`
`  cluster     : 1   1   1   1   1   1   1   1   1   1`
`observation: 21  22  23  24  25  26  27  28  29  30`
`  cluster     : 1   1   1   1   1   1   1   1   1   1`
`observation: 31   32  33  34  35  36  37  38  39  40`
`  cluster     : 1   1   1   1   1   1   1   1   1   1`
`observation: 41   42  43  44  45  46  47  48  49  50`
`  cluster     : 1   1   1   1   1   1   1   1   1   1`
`observation: 51   52  53  54  55  56  57  58  59  60`
`  cluster     : 2   2   2   2   2   2   2   2   2   2`
`observation: 61   62  63  64  65  66  67  68  69  70`
`  cluster     : 2   2   2   2   2   2   2   2   2   2`
`observation: 71   72  73  74  75  76  77  78  79  80`
`  cluster     : 2   2   2   2   2   2   2   2   2   2`
`observation: 81   82  83  84  85  86  87  88  89  90`
`  cluster     : 2   2   2   2   2   2   2   2   2   2`
`observation: 91   92  93  94  95  96  97  98  99 100`
`  cluster     : 2   2   2   2   2   2   2   2   2   2`
`PM, [[INDGEN(3) + 1],[means_cluster]], Title = 'Cluster Means:', $`
`  FORMAT = '(i3, 5x, 4f8.4)'`
` `
`Cluster Means:`
`1     5.0060  3.4280  1.4620  0.2460`
`2     5.9016  2.7484  4.3935  1.4339`
`3     6.8500  3.0737  5.7421  2.0711`
` `
`PM, [[INDGEN(3) + 1],[ssq_cluster]], $`
`  Title = 'Cluster Sums of Squares:', FORMAT = '(i3, 5x, f8.4)'`
` `
`Cluster Sums of Squares:`
`1    15.1510`
`2    39.8210`
`3    23.8795`
` `
`PM, [[INDGEN(3) + 1],[counts_cluster]], Title = $`
`  'Number of Observations per Cluster:'`
` `
`Number of Observations per Cluster:`
`1    50`
`2    62`
`3    38`

## Errors

### Warning Errors

STAT_NO_CONVERGENCE: Convergence did not occur.

## Syntax

Result = IMSL_K_MEANS(X, Seeds [, COUNTS_CLUSTER=variable] [, /DOUBLE] [, FREQUENCIES=array] [, ITMAX=value] [, MEANS_CLUSTER=variable] [, SSQ_CLUSTER=variable] [, VAR_COLUMNS=array] [, WEIGHTS=array])

## Return Value

The cluster membership for each observation is returned.
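To show how a membership vector relates to per-cluster counts and means, here is a minimal Python sketch. It assumes clusters are labeled 1 through K, as in the example output on this page. The function name `summarize` and the toy data are hypothetical, and missing values, weights, and frequencies are not handled.

```python
def summarize(x, membership, k):
    """Counts and per-variable means for clusters labeled 1..k."""
    p = len(x[0])
    counts = [0] * k
    sums = [[0.0] * p for _ in range(k)]
    for row, label in zip(x, membership):
        counts[label - 1] += 1  # labels are 1-based
        for j, value in enumerate(row):
            sums[label - 1][j] += value
    # Assumes every cluster has at least one member.
    means = [[s / counts[i] for s in sums[i]] for i in range(k)]
    return counts, means


# Hypothetical toy data: three observations, two clusters.
x = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
counts, means = summarize(x, [1, 1, 2], 2)
print(counts)  # [2, 1]
print(means)   # [[2.0, 3.0], [10.0, 20.0]]
```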

## Arguments

### Seeds

Two-dimensional array containing the cluster seeds, i.e., estimates for the cluster centers. The seed value for the j-th variable of the i-th seed should be in seeds(i, j).

### X

Two-dimensional array containing observations to be clustered. The data value for the i-th observation of the j-th variable should be in x(i, j).

## Keywords

### COUNTS_CLUSTER (optional)

Named variable into which an array containing the number of observations in each cluster is stored.

### DOUBLE (optional)

If present and nonzero, then double precision is used.

### FREQUENCIES (optional)

One-dimensional array containing the frequency of each observation of matrix x. Default: FREQUENCIES(*) = 1

### ITMAX (optional)

Maximum number of iterations. Default: 30

### MEANS_CLUSTER (optional)

Named variable into which a two-dimensional array containing the cluster means is stored.

### SSQ_CLUSTER (optional)

Named variable into which a one-dimensional array containing the within-cluster sum of squares for each cluster is stored.

### VAR_COLUMNS (optional)

One-dimensional array containing the columns of x to be used in computing the metric. Columns are numbered 0, 1, 2, ..., N_ELEMENTS(x(0, *)) – 1. Default: VAR_COLUMNS(*) = 0, 1, 2, ..., N_ELEMENTS(x(0, *)) – 1

### WEIGHTS (optional)

One-dimensional array containing the weight of each observation of matrix x. Default: WEIGHTS(*) = 1

## Version History

 6.4 Introduced