Clustering Using REpresentatives: An E cient Clustering...
Transcript of Clustering Using REpresentatives: An E cient Clustering...
![Page 1: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/1.jpg)
Clustering Using REpresentatives: An EfficientClustering Algorithm for Large Databases
Bilza Marques de [email protected]
SCC5895 - Analise de Agrupamento de Dados
Seminario09 de Dezembro de 2010
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 1 / 31
![Page 2: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/2.jpg)
Conteudo
1 Introducao
2 CURE
3 Melhorias - Large Scale Databases
4 Resultados
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 2 / 31
![Page 3: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/3.jpg)
Principais Referencias
Guha, S., Rastogi, R., & Shim, K. (1998).Cure: an efficient clustering algorithm for large databases.In Proceedings of the 1998 ACM SIGMOD international conference onManagement of data, SIGMOD ’98 (pp. 73–84). New York, NY, USA:ACM.
Guha, S., Rastogi, R., & Shim, K. (2001).Cure: an efficient clustering algorithm for large databases.Information Systems, 26, 35–58.
Theodoridis, S. & Koutroumbas, K. (2009).Pattern Recognition, (pp. 683–685).Academic Press, 4th edition.
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 3 / 31
![Page 4: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/4.jpg)
1 Introducao
2 CURE
3 Melhorias - Large Scale Databases
4 Resultados
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 4 / 31
![Page 5: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/5.jpg)
Introducao
Problema de clustering: Dado um conjunto de objetos, agrupar os objetosem clusters tal que cada objetos em um cluster seja mais similares aosdemais objetos no mesmo cluster que a objetos em outros clusters.
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 5 / 31
![Page 6: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/6.jpg)
Representantes
Quando unem grupos, algoritmos aglomerativos utilizam“representantes”:
centroides, medoides: dmean
todos os objetos do grupo: dmin, dmax, dave
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 6 / 31
![Page 7: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/7.jpg)
Deficiencias abordagens classicas
Abordagens classicas favorecem clusters esfericos e tamanhossimilares ou sao frageis na presenca de outliers
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 7 / 31
![Page 8: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/8.jpg)
1 Introducao
2 CURE
3 Melhorias - Large Scale Databases
4 Resultados
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 8 / 31
![Page 9: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/9.jpg)
Clustering Using REpresentatives
Tecnica de Agrupamento de Dados Hierarquica
Cada grupo e representado por min(c, |Ci|) pontos que descrevem aforma* do grupo
Dois grupos, C1 e C2, sao unidos se ∀Ci, Cj
dcure(C1, C2) == min(dcure(Ci, Cj))
dcure(Ci, Cj) e a menor distancia entre representantes de Ci e Cj
Os grupos sao unidos ate que sejam obtidos k grupos
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 9 / 31
![Page 10: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/10.jpg)
c objetos bem distribuıdos
+ +
+ +
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 10 / 31
![Page 11: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/11.jpg)
c objetos bem distribuıdos
+ +
+ +
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 10 / 31
![Page 12: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/12.jpg)
c objetos bem distribuıdos
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 11 / 31
![Page 13: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/13.jpg)
c objetos bem distribuıdos
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 11 / 31
![Page 14: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/14.jpg)
c representantes
Cada representante e encolhidos sentido ao centroide em α
p = p+ α(x− p)
+ +
Dois grupos sao unidos de acordo com os representantes
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 12 / 31
![Page 15: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/15.jpg)
c representantes
Cada representante e encolhidos sentido ao centroide em α
p = p+ α(x− p)
+ +
Dois grupos sao unidos de acordo com os representantes
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 12 / 31
![Page 16: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/16.jpg)
Algoritmo
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 13 / 31
![Page 17: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/17.jpg)
Algoritmo
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 13 / 31
![Page 18: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/18.jpg)
1 Introducao
2 CURE
3 Melhorias - Large Scale Databases
4 Resultados
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 14 / 31
![Page 19: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/19.jpg)
Melhorias - Large Scale Databases
Bases de dados com centenas de milhares de objetos
Limitacoes de memoria
Inviabilidade Computacional O(n2 log n)
Base de dados
Elimina outliers
Agrupa grupos parciais
Rotulação dados no disco
Amostragem sobre
os dados
Particionamento das
amostras
Agrupamento parcial
cada partição
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 15 / 31
![Page 20: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/20.jpg)
Amostragem
Amostra suficientemente grande preserva caracterısticas dos grupos eelimina outliers
De acordo com Chernoff bounds, para clusters bem definidos (densosintra e esparso inter):
smin = ξkρ+ kρ log(1
δ) + kρ
√(log(
1
δ))2 + 2ξ log(
1
δ)
Nao depende do numero de objetos n, mas sim do numero de clustersbem definidos, k
Com grupos de densidades variadas e necessario assumir k de acordocom regioes densas
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 16 / 31
![Page 21: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/21.jpg)
Particionamento
Particionamento da amostra
Agrupamento hierarquico ate npq
Permite melhor tratamento de outliers*
Reduz ainda mais custo computacional
a2 + b2 < (a+ b)2
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 17 / 31
![Page 22: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/22.jpg)
Remocao de outliers
Modelo ainda susceptıvel a outliers
Amostragem e parametro α podem nao eliminar todo efeito de
Uma vez que outliers tendem a ser agrupados no final da hierarquia
Grupos com poucos objetos sao removidos da amostra
ao final de cada particaoquando atinge o numero de grupos k
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 18 / 31
![Page 23: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/23.jpg)
Rotulacao em disco
Objetos em disco sao rotulados de acordo com os representantes degrupos
Cada objeto e atribuıdo ao grupo do representante mais proximo
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 19 / 31
![Page 24: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/24.jpg)
1 Introducao
2 CURE
3 Melhorias - Large Scale Databases
4 Resultados
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 20 / 31
![Page 25: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/25.jpg)
Comparacao com BIRCH e MST
n = 100000, s = 2500, c = 10, α = 0.3, p = 1
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 21 / 31
![Page 26: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/26.jpg)
Comparacao com BIRCH e MST
n = 100000, s = 2500, c = 10, α = 0.3, p = 1
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 22 / 31
![Page 27: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/27.jpg)
Comparacao com BIRCH e MST
n = 121560, s = 3000, c = 100, α = 0.15, p = 1
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 23 / 31
![Page 28: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/28.jpg)
Fator de encolhimento α
n = 100000, s = 2500, c = 10, p = 1, α = [0.1, 0.9]
Quando α = 0 similar ao MST, quando α = 1, similar ao BIRCH
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 24 / 31
![Page 29: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/29.jpg)
Fator de encolhimento α
n = 100000, s = 2500, c = 10, p = 1, α = [0.1, 0.9]
Quando α = 0 similar ao MST, quando α = 1, similar ao BIRCH
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 24 / 31
![Page 30: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/30.jpg)
Numero de representantes, c
n = 100000, s = 2500, α = 0.3, p = 1, c = [1, 100]
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 25 / 31
![Page 31: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/31.jpg)
Numero de representantes, c
n = 100000, s = 2500, α = 0.3, p = 1, c = [1, 100]
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 25 / 31
![Page 32: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/32.jpg)
Tamanho da amostra de representantes, s
n = 100000, α = 0.3, c = 10, p = 1, s = [500, 5000]
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 26 / 31
![Page 33: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/33.jpg)
Numero de particoes, p
n = 100000, α = 0.3, c = 10, s = 2500, p = [1, 100],
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 27 / 31
![Page 34: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/34.jpg)
Performance comparado ao BIRCH
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 28 / 31
![Page 35: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/35.jpg)
Performance
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 29 / 31
![Page 36: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/36.jpg)
Obrigado pela atencao!
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 30 / 31
![Page 37: Clustering Using REpresentatives: An E cient Clustering …wiki.icmc.usp.br/images/1/1b/CURE_BilzaAraujo.pdf · 2018-09-25 · Clustering Using REpresentatives: An E cient Clustering](https://reader033.fdocuments.co/reader033/viewer/2022060409/5f1009d97e708231d4472371/html5/thumbnails/37.jpg)
Bibliografia I
Guha, S., Rastogi, R., & Shim, K. (1998).Cure: an efficient clustering algorithm for large databases.In Proceedings of the 1998 ACM SIGMOD international conference onManagement of data, SIGMOD ’98 (pp. 73–84). New York, NY, USA:ACM.
Guha, S., Rastogi, R., & Shim, K. (2001).Cure: an efficient clustering algorithm for large databases.Information Systems, 26, 35–58.
Theodoridis, S. & Koutroumbas, K. (2009).Pattern Recognition, (pp. 683–685).Academic Press, 4th edition.
Bilza Araujo (SCC5895 - AAD) Clustering Using REpresentatives: CURE 31 / 31