Welcome to the Harris Geospatial product documentation center. Here you will find reference guides, help documents, and product libraries.


  >  Docs Center  >  IDL Reference  >  Advanced Math and Stats  >  IMSL_K_MEANS

IMSL_K_MEANS

IMSL_K_MEANS

The IMSL_K_MEANS function performs a K-means (centroid) cluster analysis.



Syntax


Result = IMSL_K_MEANS(x, seeds [, COUNTS_CLUSTER=variable]
[, /DOUBLE] [, FREQUENCIES=array] [, ITMAX=value] [, MEANS_CLUSTER=variable] [, SSQ_CLUSTER=variable] [, VAR_COLUMNS=array] [, WEIGHTS=array] )

Return Value


The cluster membership for each observation is returned.

Arguments


seeds

Two-dimensional array containing the cluster seeds, i.e., estimates for the cluster centers. The seed value for the j-th variable of the i-th seed should be in seeds (i, j).

x

Two-dimensional array containing observations to be clustered. The data value for the i-th observation of the j-th variable should be in x(i, j).

Keywords


COUNTS_CLUSTER

Named variable into which an array containing the number of observations in each cluster is stored.

DOUBLE

If present and nonzero, double precision is used.

FREQUENCIES

One-dimensional array containing the frequency of each observation of matrix x. Default: Frequencies(*) = 1

ITMAX

Maximum number of iterations. Default: Itmax = 30

MEANS_CLUSTER

Named variable into which a two-dimensional array containing the cluster means is stored.

SSQ_CLUSTER

Named variable into which a one-dimensional array containing the within sum-of-squares for each cluster is stored.

VAR_COLUMNS

One-dimensional array containing the columns of x to be used in computing the metric. Columns are numbered 0, 1, 2, ..., N_ELEMENTS(x(0, *)). Default: Vars_Columns(*) = 0, 1, 2, ..., N_ELEMENTS(x(0, *)) – 1

WEIGHTS

One-dimensional array containing the weight of each observation of matrix x. Default: Weights(*) = 1

Discussion


The IMSL_K_MEANS function is an implementation of Algorithm AS 136 by Hartigan and Wong (1979). This function computes K-means (centroid) Euclidean metric clusters for an input matrix starting with initial estimates of the K-cluster means. The IMSL_K_MEANS function allows for missing values coded as NaN (Not a Number) and for weights and frequencies.

Let p = N_ELEMENTS(x (0, *)) be the number of variables to be used in computing the Euclidean distance between observations. The idea in K-means cluster analysis is to find a clustering (or grouping) of the observations so as to minimize the total within-cluster sums-of-squares. In this case, the total sums-of-squares within each cluster is computed as the sum of the centered sum-of-squares over all non-missing values of each variable. That is:

where nim denotes the row index of the m-th observation in the i-th cluster in the matrix X; ni is the number of rows of X assigned to group i; f denotes the frequency of the observation; w denotes its weight; d is 0 if the j-th variable on observation nim is missing, otherwise d is 1; and:

is the average of the non-missing observations for variable j in group i. This method sequentially processes each observation and reassigns it to another cluster if doing so results in a decrease of the total within-cluster sums-of-squares. See Hartigan and Wong (1979) or Hartigan (1975) for details.

Example


This example performs K-means cluster analysis on Fisher's iris data, which is obtained by IMSL_STATDATA. The initial cluster seed for each iris type is an observation known to be in the iris type.

seeds = MAKE_ARRAY(3,4)  
x = IMSL_STATDATA(3)
seeds(0, *) = x(0, 1:4)
seeds(1, *) = x(50, 1:4)
seeds(2, *) = x(100, 1:4)
; Use Columns 1, 2, 3, and 4 of data matrix x, only.
cluster_group = IMSL_K_MEANS(x(*, 1:4), seeds, $
   Means_Cluster = means_cluster, Ssq_Cluster = ssq_cluster, $
   Counts_Cluster = counts_cluster)
FORMAT = ';(a, 10i4)'
FOR i = 0, 140, 10 DO BEGIN &$
   PRINT, ';observation: ',i + INDGEN(10)+1, $
   FORMAT = format &$
   PRINT, ';cluster: ', cluster_group(i:i+9), $
   FORMAT = format &$
   PRINT &$
END
; Print cluster membership in groups of 10.

observation:  1 2 3 4 5 6 7 8 9 10
   cluster : 1 1 1 1 1 1 1 1 1 1
observation: 11 12 13 14   15 16 17  18 19 20
   cluster : 1 1 1 1 1 1 1 1 1 1
observation: 21 22 23 24   25 26 27  28 29 30
   cluster : 1 1 1 1 1 1 1 1 1 1
observation: 31 32 33 34   35 36 37   38 39 40
   cluster : 1 1 1 1 1 1 1 1 1 1
observation: 41 42 43 44   45 46 47   48 49 50
   cluster : 1 1 1 1 1 1 1 1 1 1
observation: 51 52 53 54   55 56 57   58 59 60
   cluster : 2 2 3 2 2 2 2 2 2 2
observation: 61 62 63 64   65 66 67   68 69 70
   cluster : 2 2 2 2 2 2 2 2 2 2
observation: 71 72 73 74   75 76 77   78 79 80
   cluster : 2 2 2 2 2 2 2 3 2 2
observation: 81 82 83 84   85 86 87   88 89 90
   cluster : 2 2 2 2 2 2 2 2 2 2
observation: 91 92 93 94   95 96 97   98 99 100
   cluster : 2 2 2 2 2 2 2 2 2 2
observation: 101 102 103 104  105 106 107  108 109 110
   cluster : 3 2 3 3 3 3 2 3 3 3
observation: 111 112 113 114  115 116 117  118 119 120
   cluster : 3 3 3 2 2 3 3 3 3 2
observation: 121 122 123 124  125 126 127  128 129 130
   cluster : 3 2 3 2 3 3 2 2 3 3
observation: 131 132 133 134  135 136 137  138 139 140
   cluster : 3 3 3 2 3 3 3 3 2 3
observation: 141 142 143 144  145 146 147  148 149 150
   cluster : 3 3 2 3 3 3 2 3 3 2

PM, [[INDGEN(3) + 1],[means_cluster]], Title = ';Cluster Means:',$
   FORMAT = ';(i3, 5x, 4f8.4)'

Cluster Means:
   1 5.0060 3.4280 1.4620 0.2460
   2 5.9016 2.7484 4.3935 1.4339
   3 6.8500 3.0737 5.7421 2.0711

PM, [[INDGEN(3) + 1],[ssq_cluster]], $
   Title = ';Cluster Sums of Squares:', FORMAT = '(i3, 5x, f8.4)'

Cluster Sums of Squares:
   1 15.1510
   2 39.8210
   3 23.8795

PM, [[INDGEN(3) + 1],[counts_cluster]], Title = $
   ';Number of Observations per Cluster:'

Number of Observations per Cluster:
   1 50
   2 62
   3 38

Errors


Warning Errors

STAT_NO_CONVERGENCE—Convergence did not occur.

Version History


6.4
Introduced



© 2018 Harris Geospatial Solutions, Inc. |  Legal
My Account    |    Store    |    Contact Us