Friday, December 28, 2018

LCS

Longest common subsequence: the longest sequence of characters that appears, in order but not necessarily contiguously, in both strings. The pystrgrp library linked below clusters strings by LCS-based similarity.
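
For reference, the LCS length can be computed with a standard dynamic program; a minimal, space-optimized sketch (the helper name lcs_length is mine, not part of pystrgrp):

def lcs_length(a, b):
    # dp[j] = LCS length of the prefix of a seen so far and b[:j].
    dp = [0] * (len(b) + 1)
    for ch in a:
        prev = 0  # value of dp[j-1] from the previous row
        for j, bj in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = prev + 1 if ch == bj else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

print(lcs_length('1234567', '1234568'))  # 6 ('123456')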

pystrgrp: https://drive.google.com/open?id=1Ig_ATnmLUJIuHbFPRGlvZdM3Xd5Yp32U

Example

from pystrgrp import Strgrp

def pystrgrp(strings):
    # Group strings whose LCS similarity exceeds the 0.7 threshold.
    clusters = Strgrp(0.7)
    for string in (x.strip() for x in strings):
        seq, ident = string.split(',')  # each entry looks like "<sequence>,<id>"
        clusters.add(seq, ident)
    return clusters

data = sorted(['12345,1', '1234567,2', '1234568,3', '2345678,4',
               '2345679,5', '345678,6', '1234578,7', '3456789,8',
               'abcdefg,9', 'bcdefg,10'], reverse=False)

grps = pystrgrp(data)
grps

grps_list = [g for g in grps]
grps_list

import pandas as pd

# Flatten the groups into (cluster, key, value) rows, then build the
# DataFrame once rather than concatenating inside the loop.
records = []
for i, grp in enumerate(grps_list):
    for item in grp:
        print(i, item.key(), item.value())
        records.append((i, item.key(), item.value()))

df = pd.DataFrame(records, columns=['cluster', 'seq', 'id'])

df

Clustering

  • Partition-based clustering
    • k-means, k-median, fuzzy c-means (a k-means sketch follows this list)
  • Hierarchical clustering
    • Produces trees of clusters
    • Agglomerative, divisive
    • Advantages
      • Does not require the number of clusters to be specified.
      • Produces a dendrogram, which helps with understanding the data (see the SciPy sketch after this list).
    • Disadvantages
      • Merges (or splits) can never be undone once they are made.
      • It is sometimes difficult to identify the number of clusters from the dendrogram.
  • Density-based clustering
    • Produces arbitrarily shaped clusters
    • Locates regions of high density and separates outliers
    • DBSCAN
      • Does not require the number of clusters to be specified (see the sketch after this list)
  • Time-series clustering
    • Time-series clustering by features (see the autocorrelation-feature sketch after this list).
      • Raw data.
      • Autocorrelation.
      • Spectral density.
      • Extreme value behavior.
    • Model-based time series clustering.
      • Forecast based clustering.
      • Model with a cluster structure.
    • Time-series clustering by dependence.
  • Clustering high-dimensional data
    • Many clustering algorithms are designed for low-dimensional data (roughly one to three dimensions)
    • These methods may not work well once the number of dimensions grows to 20 or more, since distances become less informative in high dimensions
  • Methods for clustering high dimensional data
    • Methods can be grouped into two categories
      • Subspace clustering
        • CLIQUE, ProClus, and bi-clustering approaches
      • Dimensionality reduction approaches
        • Spectral clustering and various dimensionality reduction methods
    • Clustering should consider not only the number of dimensions but also which attributes/features are used
      • Feature selection
      • Feature transformation
        • Principal component analysis, singular value decomposition (a PCA sketch follows below)
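
A minimal partition-based sketch with scikit-learn's KMeans; the toy points are made up for illustration, and note that k-means must be told the number of clusters up front:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Unlike hierarchical clustering, k-means requires n_clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index for each point
print(km.cluster_centers_)  # centroid coordinates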
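
A minimal agglomerative sketch with SciPy: linkage builds the merge tree, dendrogram draws it, and fcluster cuts the tree into flat clusters afterwards (same made-up toy data):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

Z = linkage(X, method='ward')  # agglomerative merges, closest pairs first
dendrogram(Z)                  # no cluster count needed to build the tree
plt.show()

# Decide on the number of clusters after inspecting the dendrogram.
print(fcluster(Z, t=2, criterion='maxclust'))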
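
A minimal density-based sketch with scikit-learn's DBSCAN; eps and min_samples define what counts as a dense region, and no cluster count is given (toy data again):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 2.0], [1.2, 2.1], [1.1, 1.9],
              [8.0, 8.0], [8.1, 8.2], [25.0, 80.0]])

db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)  # -1 marks the isolated point as noise/outlier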
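
A sketch of feature-based time-series clustering, assuming autocorrelations at a few lags are enough to separate the groups; the acf helper and the simulated series are illustrative, not from any library:

import numpy as np
from sklearn.cluster import KMeans

def acf(x, nlags=5):
    # Sample autocorrelations at lags 1..nlags as a feature vector.
    x = x - x.mean()
    var = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / var for k in range(1, nlags + 1)])

np.random.seed(0)
t = np.linspace(0, 4 * np.pi, 200)
# Five noisy sinusoids (strongly autocorrelated) and five white-noise series.
series = [np.sin(t) + 0.1 * np.random.randn(200) for _ in range(5)] + \
         [np.random.randn(200) for _ in range(5)]

features = np.array([acf(s) for s in series])
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features))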
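
A sketch of the dimensionality-reduction route: synthetic 50-dimensional data whose structure lives in a 2-D subspace is projected onto its top two principal components before running k-means (the data-generating code is made up for illustration):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

np.random.seed(1)
# Two latent 2-D blobs, embedded into 50 dimensions with a little noise.
latent = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
X = latent @ np.random.randn(2, 50) + 0.05 * np.random.randn(100, 50)

X2 = PCA(n_components=2).fit_transform(X)  # keep the top two components
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X2))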