Friday, December 28, 2018

LCS

Longest common subsequence: the longest sequence of characters that appears, in order but not necessarily contiguously, in both strings. The pystrgrp library linked below clusters strings by LCS-based similarity.
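
For reference, the LCS length can be computed with a standard dynamic program; a minimal, space-optimized sketch (the helper name lcs_length is mine, not part of pystrgrp):

def lcs_length(a, b):
    # dp[j] = LCS length of the prefix of a seen so far and b[:j].
    dp = [0] * (len(b) + 1)
    for ch in a:
        prev = 0  # value of dp[j-1] from the previous row
        for j, bj in enumerate(b, start=1):
            cur = dp[j]
            dp[j] = prev + 1 if ch == bj else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

print(lcs_length('1234567', '1234568'))  # 6 ('123456')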

pystrgrp: https://drive.google.com/open?id=1Ig_ATnmLUJIuHbFPRGlvZdM3Xd5Yp32U

Example

from pystrgrp import Strgrp

def pystrgrp(strings):
    # Group strings whose LCS similarity exceeds the 0.7 threshold.
    clusters = Strgrp(0.7)
    for string in (x.strip() for x in strings):
        seq, ident = string.split(',')  # each entry looks like "<sequence>,<id>"
        clusters.add(seq, ident)
    return clusters

data = sorted(['12345,1', '1234567,2', '1234568,3', '2345678,4',
               '2345679,5', '345678,6', '1234578,7', '3456789,8',
               'abcdefg,9', 'bcdefg,10'], reverse=False)

grps = pystrgrp(data)
grps

grps_list = [g for g in grps]
grps_list

import pandas as pd

# Flatten the groups into (cluster, key, value) rows, then build the
# DataFrame once rather than concatenating inside the loop.
records = []
for i, grp in enumerate(grps_list):
    for item in grp:
        print(i, item.key(), item.value())
        records.append((i, item.key(), item.value()))

df = pd.DataFrame(records, columns=['cluster', 'seq', 'id'])

df

Clustering

  • Partition-based clustering
    • k-means, k-median, fuzzy c-means (a k-means sketch follows this list)
  • Hierarchical clustering
    • Produces trees of clusters
    • Agglomerative, divisive
    • Advantages
      • Does not require the number of clusters to be specified.
      • Produces a dendrogram, which helps with understanding the data (see the SciPy sketch after this list).
    • Disadvantages
      • Merges (or splits) can never be undone once they are made.
      • It is sometimes difficult to identify the number of clusters from the dendrogram.
  • Density-based clustering
    • Produces arbitrarily shaped clusters
    • Locates regions of high density and separates outliers
    • DBSCAN
      • Does not require the number of clusters to be specified (see the sketch after this list)
  • Time-series clustering
    • Time-series clustering by features (see the autocorrelation-feature sketch after this list).
      • Raw data.
      • Autocorrelation.
      • Spectral density.
      • Extreme value behavior.
    • Model-based time series clustering.
      • Forecast based clustering.
      • Model with a cluster structure.
    • Time-series clustering by dependence.
  • Clustering high-dimensional data
    • Many clustering algorithms are designed for low-dimensional data (roughly one to three dimensions)
    • These methods may not work well once the number of dimensions grows to 20 or more, since distances become less informative in high dimensions
  • Methods for clustering high dimensional data
    • Methods can be grouped into two categories
      • Subspace clustering
        • CLIQUE, ProClus, and bi-clustering approaches
      • Dimensionality reduction approaches
        • Spectral clustering and various dimensionality reduction methods
    • Clustering should consider not only the number of dimensions but also which attributes/features are used
      • Feature selection
      • Feature transformation
        • Principal component analysis, singular value decomposition (a PCA sketch follows below)
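
A minimal partition-based sketch with scikit-learn's KMeans; the toy points are made up for illustration, and note that k-means must be told the number of clusters up front:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# Unlike hierarchical clustering, k-means requires n_clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster index for each point
print(km.cluster_centers_)  # centroid coordinates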
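
A minimal agglomerative sketch with SciPy: linkage builds the merge tree, dendrogram draws it, and fcluster cuts the tree into flat clusters afterwards (same made-up toy data):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

Z = linkage(X, method='ward')  # agglomerative merges, closest pairs first
dendrogram(Z)                  # no cluster count needed to build the tree
plt.show()

# Decide on the number of clusters after inspecting the dendrogram.
print(fcluster(Z, t=2, criterion='maxclust'))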
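
A minimal density-based sketch with scikit-learn's DBSCAN; eps and min_samples define what counts as a dense region, and no cluster count is given (toy data again):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 2.0], [1.2, 2.1], [1.1, 1.9],
              [8.0, 8.0], [8.1, 8.2], [25.0, 80.0]])

db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)  # -1 marks the isolated point as noise/outlier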
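
A sketch of feature-based time-series clustering, assuming autocorrelations at a few lags are enough to separate the groups; the acf helper and the simulated series are illustrative, not from any library:

import numpy as np
from sklearn.cluster import KMeans

def acf(x, nlags=5):
    # Sample autocorrelations at lags 1..nlags as a feature vector.
    x = x - x.mean()
    var = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / var for k in range(1, nlags + 1)])

np.random.seed(0)
t = np.linspace(0, 4 * np.pi, 200)
# Five noisy sinusoids (strongly autocorrelated) and five white-noise series.
series = [np.sin(t) + 0.1 * np.random.randn(200) for _ in range(5)] + \
         [np.random.randn(200) for _ in range(5)]

features = np.array([acf(s) for s in series])
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features))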
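
A sketch of the dimensionality-reduction route: synthetic 50-dimensional data whose structure lives in a 2-D subspace is projected onto its top two principal components before running k-means (the data-generating code is made up for illustration):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

np.random.seed(1)
# Two latent 2-D blobs, embedded into 50 dimensions with a little noise.
latent = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
X = latent @ np.random.randn(2, 50) + 0.05 * np.random.randn(100, 50)

X2 = PCA(n_components=2).fit_transform(X)  # keep the top two components
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X2))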