Friday, December 28, 2018

Clustering

  • Partition-based clustering
    • k-means, k-median, fuzzy c-means
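As a quick illustration of the partition-based family, here is a k-means sketch using scikit-learn; the two-blob toy data and parameter values are made up for demonstration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(5, 0.3, size=(50, 2))])

# Partition-based methods require the number of clusters k up front
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

Note that k must be specified in advance, which is the main practical difference from the hierarchical and density-based methods below.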
  • Hierarchical clustering
    • Produces trees of clusters
    • Agglomerative, divisive
    • Advantages
      • It does not require the number of clusters to be specified in advance.
      • Produces a dendrogram, which helps with understanding the data.
    • Disadvantages
      • It is greedy: once a merge or split is made, it can never be undone in later steps.
      • It can sometimes be difficult to identify the number of clusters from the dendrogram.
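A minimal agglomerative sketch with SciPy (toy data; the linkage method and cut level are illustrative). The tree is built first, and the number of clusters is chosen afterwards by cutting it:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, size=(20, 2)),
               rng.normal(4, 0.2, size=(20, 2))])

# Agglomerative (bottom-up) clustering; Z encodes the full merge tree
Z = linkage(X, method="ward")

# Cut the tree into 2 clusters after the fact -- no k needed during fitting
labels = fcluster(Z, t=2, criterion="maxclust")
# scipy.cluster.hierarchy.dendrogram(Z) would plot the merge tree for inspection
```

The divisive variant works top-down instead, splitting one all-inclusive cluster repeatedly.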
  • Density-based clustering
    • Produces arbitrary shaped clusters
    • Locates regions of high density, and separates outliers
    • DBSCAN
      • Does not require specification of the number of clusters
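A DBSCAN sketch with scikit-learn (toy data; `eps` and `min_samples` are illustrative). Points in low-density regions are labeled as noise rather than forced into a cluster:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
dense = rng.normal(0, 0.2, size=(40, 2))   # one high-density region
outlier = np.array([[10.0, 10.0]])          # an isolated point far away
X = np.vstack([dense, outlier])

# eps: neighborhood radius; min_samples: density threshold for a core point
db = DBSCAN(eps=0.8, min_samples=5).fit(X)
labels = db.labels_  # outliers get the special label -1
```

Instead of k, DBSCAN takes a neighborhood radius and a density threshold, which is why it can separate outliers and find arbitrarily shaped clusters.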
  • Time-series clustering
    • Time-series clustering by features
      • Raw data
      • Autocorrelation
      • Spectral density
      • Extreme value behavior
    • Model-based time-series clustering
      • Forecast-based clustering
      • Model with a cluster structure
    • Time-series clustering by dependence
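As a sketch of feature-based time-series clustering, each series can be summarized by its first few autocorrelation coefficients and the resulting feature vectors clustered with k-means. The `acf_features` helper, the toy series, and the lag count are all illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def acf_features(series, nlags=5):
    # First nlags sample autocorrelation coefficients as a feature vector
    s = series - series.mean()
    denom = np.dot(s, s)
    return np.array([np.dot(s[:-k], s[k:]) / denom for k in range(1, nlags + 1)])

rng = np.random.default_rng(3)
t = np.arange(200)
# Two groups: slowly oscillating sine waves vs. white noise
slow = [np.sin(2 * np.pi * t / 50) + 0.1 * rng.normal(size=200) for _ in range(5)]
noise = [rng.normal(size=200) for _ in range(5)]

# Cluster in feature space, not on the raw series
F = np.array([acf_features(s) for s in slow + noise])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(F)
```

The sine waves have strong positive autocorrelation at small lags while white noise has none, so the two groups separate cleanly in feature space even though the raw series all have similar scale.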
  • Clustering high dimensional data
    • Many clustering algorithms are designed for low-dimensional data (roughly 1-3 dimensions)
    • These methods may not work well as the number of dimensions grows to 20 or more, since distances between points become less discriminative in high-dimensional space
  • Methods for clustering high dimensional data
    • Methods can be grouped into two categories
      • Subspace clustering
        • CLIQUE, ProClus, and bi-clustering approaches
      • Dimensionality reduction approaches
        • Spectral clustering and various dimensionality reduction methods
    • Clustering should consider not only the number of dimensions but also which attributes/features are used
      • Feature selection
      • Feature transformation
        • Principal component analysis, singular value decomposition
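A sketch of the dimensionality-reduction approach: project high-dimensional data onto a few principal components with PCA, then cluster in the reduced space. The 50-dimensional toy data and the choice of two components are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# 50-dimensional data whose cluster structure lives in a few directions
X = np.vstack([rng.normal(0, 0.5, size=(30, 50)) + 3,
               rng.normal(0, 0.5, size=(30, 50)) - 3])

# Reduce to a handful of components, then cluster in the low-dimensional space
Z = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
```

Subspace methods such as CLIQUE instead search for clusters within subsets of the original attributes, keeping the dimensions interpretable rather than transforming them.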
