Longest common subsequence
pystrgrp: https://drive.google.com/open?id=1Ig_ATnmLUJIuHbFPRGlvZdM3Xd5Yp32U
Example
from pystrgrp import Strgrp

def pystrgrp(strings):
    # Group "seq,id" records; 0.7 is the similarity threshold for
    # putting two sequences into the same cluster.
    clusters = Strgrp(0.7)
    for string in (x.strip() for x in strings):
        seq, seq_id = string.split(',')
        clusters.add(seq, seq_id)
    return clusters
data = sorted(['12345,1', '1234567,2', '1234568,3', '2345678,4',
               '2345679,5', '345678,6', '1234578,7', '3456789,8',
               'abcdefg,9', 'bcdefg,10'], reverse=False)
grps = pystrgrp(data)
grps_list = list(grps)    # each element is one group of similar strings
import pandas as pd

# Flatten the groups into a DataFrame with one row per string.
df = pd.DataFrame()
for i, grp in enumerate(grps_list):
    for item in grp:
        print(i, item.key(), item.value())
        df = pd.concat([df, pd.DataFrame([(i, item.key(), item.value())],
                                         columns=['cluster', 'seq', 'id'])],
                       ignore_index=True)
print(df)
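As the title suggests, the grouping above is driven by longest-common-subsequence similarity. For reference, here is a minimal pure-Python sketch of the classic LCS dynamic program, with one common normalisation (2·LCS/(|a|+|b|)); it illustrates the idea and is not pystrgrp's actual implementation:

def lcs_length(a, b):
    # Classic O(len(a) * len(b)) dynamic program for the length of the
    # longest common subsequence of a and b.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            if ca == cb:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def lcs_similarity(a, b):
    # One common normalisation: 2 * LCS / (|a| + |b|), in [0, 1].
    # Whether pystrgrp uses exactly this formula is an assumption.
    return 2 * lcs_length(a, b) / (len(a) + len(b))

print(lcs_similarity('1234567', '1234568'))   # ~0.857, above the 0.7 threshold
print(lcs_similarity('1234567', 'abcdefg'))   # 0.0, well below it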
Friday, December 28, 2018
Clustering
- Partition-based clustering
  - k-means, k-median, fuzzy c-means
- Hierarchical clustering
  - Produces trees of clusters
  - Agglomerative, divisive
  - Advantages
    - It does not require the number of clusters to be specified.
    - Produces a dendrogram, which helps with understanding the data.
  - Disadvantages
    - It cannot undo a merge or split once the step has been taken.
    - It is sometimes difficult to identify the number of clusters from the dendrogram.
- Density-based clustering
  - Produces arbitrarily shaped clusters
  - Locates regions of high density and separates outliers
  - DBSCAN
    - Does not require the number of clusters to be specified (see the comparison sketch after this list)
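As a rough comparison of the three families above, here is a minimal scikit-learn sketch on synthetic blob data (KMeans, AgglomerativeClustering, and DBSCAN are standard scikit-learn estimators; the data and parameter values are only illustrative):

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs

# Toy 2-D data with three dense blobs.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Partition-based: the number of clusters must be given up front.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical (agglomerative): cut at 3 clusters here; the full merge
# tree (dendrogram) can be inspected via scipy.cluster.hierarchy instead.
agg = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Density-based: no cluster count needed; low-density points get label -1 (noise).
db = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print(len(set(km)), len(set(agg)), len(set(db) - {-1}))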
- Time-series clustering
  - Time-series clustering by features (see the sketch after this list)
    - Raw data
    - Autocorrelation
    - Spectral density
    - Extreme value behavior
  - Model-based time series clustering
    - Forecast based clustering
    - Model with a cluster structure
  - Time-series clustering by dependence
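A minimal sketch of feature-based time-series clustering: each series is mapped to a small vector of autocorrelation coefficients, and ordinary hierarchical clustering is run on those vectors (NumPy and SciPy assumed; the two groups of series are synthetic and chosen only to separate cleanly):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def acf_features(series, nlags=5):
    # First nlags autocorrelation coefficients as a feature vector.
    x = series - series.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, nlags + 1)])

rng = np.random.default_rng(0)
# Two synthetic groups: random walks (strong autocorrelation) vs. white noise.
walks = [np.cumsum(rng.normal(size=100)) for _ in range(5)]
noise = [rng.normal(size=100) for _ in range(5)]
features = np.array([acf_features(s) for s in walks + noise])

# Hierarchical clustering on the feature vectors, cut into two clusters.
labels = fcluster(linkage(features, method='ward'), t=2, criterion='maxclust')
print(labels)    # the two groups should receive different labels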
- Clustering high dimensional data
  - Many clustering algorithms deal with 1-3 dimensions
  - These methods may not work well when the number of dimensions grows to 20 or more
- Methods for clustering high dimensional data
  - Methods can be grouped into two categories
    - Subspace clustering
      - CLIQUE, ProClus, and bi-clustering approaches
    - Dimensionality reduction approaches (see the sketch after this list)
      - Spectral clustering and various dimensionality reduction methods
- Subspace clustering
  - Clusters may exist only in subsets of the attributes/features rather than across all dimensions
  - Feature selection
  - Feature transformation
    - Principal component analysis, singular value decomposition
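A minimal sketch of the dimensionality-reduction route: project synthetic 50-dimensional data onto a few principal components with PCA, then cluster in the reduced space (scikit-learn assumed; the data and component count are illustrative):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic high-dimensional data: three blobs in 50 dimensions.
X, _ = make_blobs(n_samples=300, n_features=50, centers=3, random_state=0)

# Reduce to a handful of principal components, then cluster there.
X_reduced = PCA(n_components=5).fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
print(np.bincount(labels))    # roughly 100 points per cluster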