Density Functions

Copyright © Andrew W. Moore Slide 1

Probability Densities in Data Mining

Andrew W. Moore

Professor, School of Computer Science, Carnegie Mellon University

www.cs.cmu.edu/[email protected]

412-268-7599


Contents
• Why they are important.
• Notation and fundamentals of continuous PDFs.
• Continuous multivariate PDFs.
• Combining discrete and continuous random variables.


Why are they important?
• Real numbers occur in at least 50% of database records
• Can't always quantize them
• So we need to understand how to describe where they come from
• A great way of saying what's a reasonable range of values
• A great way of saying how multiple attributes should reasonably co-occur


Why are they important?
• Can immediately get us Bayes classifiers that are sensible with real-valued data
• You'll need to intimately understand PDFs in order to do kernel methods, clustering with mixture models, analysis of variance, time series and many other things
• Will introduce us to linear and non-linear regression


A PDF of American Ages in 2000


Population of PR by age group

group    freq     midpoint   freq.rela
0-4      284593    2.5       0.0737516
5-9      301424    7.5       0.0781133
10-14    305025   12.5       0.0790465
15-19    305577   17.5       0.0791895
20-24    299362   22.5       0.0775789
25-29    277415   27.5       0.0718914
30-34    262959   32.5       0.0681452
35-39    265154   37.5       0.0687140
40-44    258211   42.5       0.0669147
45-49    239965   47.5       0.0621863
50-54    233597   52.5       0.0605361
55-59    206552   57.5       0.0535274
60-64    169796   62.5       0.0440022
65-69    141869   67.5       0.0367650
70-74    112416   72.5       0.0291323
75-79     85137   77.5       0.0220630
80-84     57953   82.5       0.0150184
85+       51801   87.5       0.0134241


[Figure: pdf of the population age in PR in 2000. x-axis: pobpr$midpoint (0 to 80); y-axis: pobpr$freq.rela (0.02 to 0.08).]


A PDF of American Ages in 2000

Let X be a continuous random variable. If p(x) is a Probability Density Function for X then…

$$P(a < X \le b) = \int_{x=a}^{b} p(x)\,dx$$

$$P(30 < \text{Age} \le 50) = \int_{\text{age}=30}^{50} p(\text{age})\,d\text{age} = 0.36$$


Properties of PDFs

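$$P(a < X \le b) = \int_{x=a}^{b} p(x)\,dx$$

That means…

$$p(x) = \lim_{h \to 0} \frac{P\left(x - \frac{h}{2} < X \le x + \frac{h}{2}\right)}{h}$$

$$\frac{\partial}{\partial x} P(X \le x) = p(x)$$

By the mean value theorem for integrals,

$$P\left(x - \tfrac{h}{2} < X \le x + \tfrac{h}{2}\right) = \int_{x-h/2}^{x+h/2} p(t)\,dt = \left[(x + h/2) - (x - h/2)\right] p(w) = h\,p(w)$$

where $x - h/2 < w < x + h/2$. Then

$$\frac{P(x - h/2 < X \le x + h/2)}{h} = p(w)$$

and p(w) tends to p(x) as h tends to zero.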


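$$P(X \le x) = \int_{-\infty}^{x} p(t)\,dt$$

This is the cumulative distribution function. It is a nondecreasing function.

$$\frac{d}{dx}P(X \le x) = \lim_{h \to 0} \frac{P(X \le x+h) - P(X \le x)}{h} = \lim_{h \to 0} \frac{\int_{x}^{x+h} p(t)\,dt}{h} = \lim_{h \to 0} \frac{h\,p(w)}{h} = \lim_{h \to 0} p(w), \qquad x < w < x+h$$

Note that p(w) tends to p(x) as h tends to 0. This shows that the derivative of the distribution function gives the density function.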


Properties of PDFs

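$$P(a < X \le b) = \int_{x=a}^{b} p(x)\,dx$$

Therefore…

$$\int_{x=-\infty}^{\infty} p(x)\,dx = 1$$

$$\frac{\partial}{\partial x} P(X \le x) = p(x)$$

Therefore…

$$\forall x:\ p(x) \ge 0$$

The derivative of a nondecreasing function is greater than or equal to zero.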


• What is the meaning of p(x)?

If p(5.31) = 0.06 and p(5.92) = 0.03, then when a value of X is sampled from the distribution, X is twice as likely to fall in a small interval around 5.31 as in an equally small interval around 5.92.


Yet another way to view a PDF

A recipe for sampling a random age.

1. Generate a random dot from the rectangle surrounding the PDF curve. Call the dot (age,d)

2. If d < p(age) stop and return age

3. Else try again: go to Step 1.
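The three-step recipe above is rejection sampling, and it can be sketched in a few lines of R. The age curve itself is not available here, so this sketch assumes a triangular density on [-1, 1] as a stand-in; `p_tri` and `sample_one` are hypothetical helper names, not part of the slides.

```r
# Rejection sampling from a density p on [lo, hi] with p(x) <= p_max.
# The triangular density is only an illustrative stand-in for the age PDF.
p_tri <- function(x, w = 1) ifelse(abs(x) <= w, (w - abs(x)) / w^2, 0)

sample_one <- function(p, lo, hi, p_max) {
  repeat {
    x <- runif(1, lo, hi)      # Step 1: random dot (x, d) in the rectangle
    d <- runif(1, 0, p_max)
    if (d < p(x)) return(x)    # Step 2: keep x if the dot lies under the curve
  }                            # Step 3: otherwise loop and try again
}

set.seed(1)
draws <- replicate(5000, sample_one(p_tri, -1, 1, 1))
c(mean = mean(draws), var = var(draws))  # theory for w = 1: mean 0, variance 1/6
```

The tighter the rectangle hugs the curve, the fewer dots are rejected, so `p_max` should be close to the true maximum of the density.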


Test your understanding
• True or False:

$$\forall x:\ p(x) \le 1$$

$$\forall x:\ P(X = x) = 0$$




Expectations
E[X] = the expected value of random variable X
= the average value we'd see if we took a very large number of random samples of X

$$= \int_{x=-\infty}^{\infty} x\,p(x)\,dx$$

= the first moment of the shape formed by the axes and the blue curve
= the best value to choose if you must guess an unknown person's age and you'll be fined the square of your error

E[age] = 35.897


Expectation of a function
E[f(X)] = the expected value of f(x) where x is drawn from X's distribution
= the average value we'd see if we took a very large number of random samples of f(X)

$$= \int_{x=-\infty}^{\infty} f(x)\,p(x)\,dx$$

Note that in general:

$$E[f(X)] \ne f(E[X])$$

$$E[\text{age}^2] = 1786.64 \qquad (E[\text{age}])^2 = 1288.62$$




Variance and Standard Deviation
σ² = Var[X] = the expected squared difference between x and E[X]

$$= \int_{x=-\infty}^{\infty} (x - \mu)^2\,p(x)\,dx$$

= amount you'd expect to lose if you must guess an unknown person's age and you'll be fined the square of your error, and assuming you play optimally

Var[age] = 498.02

σ = Standard Deviation = "typical" deviation of X from its mean

$$\sigma = \sqrt{\mathrm{Var}[X]} \qquad \sigma = 22.32$$

$$\sigma^2 = E(X^2) - [E(X)]^2$$


Statistics for PR
• E(age) = 35.17
• Var(age) = 501.16
• Std. dev.(age) = 22.38
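These figures can be reproduced from the grouped PR table by treating each class midpoint as the value of the whole class. The sketch below assumes exactly that, including the table's midpoint of 87.5 for the open-ended 85+ class, and uses the identity σ² = E(X²) − [E(X)]².

```r
# Midpoints and relative frequencies copied from the PR age-group table.
midpoint <- seq(2.5, 87.5, by = 5)
freq_rel <- c(0.0737516, 0.0781133, 0.0790465, 0.0791895, 0.0775789, 0.0718914,
              0.0681452, 0.0687140, 0.0669147, 0.0621863, 0.0605361, 0.0535274,
              0.0440022, 0.0367650, 0.0291323, 0.0220630, 0.0150184, 0.0134241)

m <- sum(midpoint * freq_rel)            # E(age) as a weighted sum
v <- sum(midpoint^2 * freq_rel) - m^2    # Var(age) = E(age^2) - E(age)^2
c(mean = m, var = v, sd = sqrt(v))
```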


Well-known density functions

• The uniform or rectangular density
• The triangular density
• The exponential density
• The Gamma and Chi-square densities
• The Beta density
• The Normal or Gaussian density
• The t and F densities


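The rectangular distribution

$$p(x) = \begin{cases} \dfrac{1}{w} & \text{if } |x| \le \dfrac{w}{2} \\ 0 & \text{if } |x| > \dfrac{w}{2} \end{cases}$$

$$E[X] = 0 \qquad \mathrm{Var}[X] = \frac{w^2}{12}$$

[Figure: rectangle of height 1/w over the interval from -w/2 to w/2.]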


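The triangular distribution

$$p(x) = \begin{cases} \dfrac{w - |x|}{w^2} & \text{if } |x| \le w \\ 0 & \text{if } |x| > w \end{cases}$$

$$E[X] = 0 \qquad \mathrm{Var}[X] = \frac{w^2}{6}$$

[Figure: triangle with peak 1/w at x = 0, falling to zero at x = -w and x = w.]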


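The Exponential distribution

$$p(x) = \begin{cases} \dfrac{1}{\theta}\,e^{-x/\theta} & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

$$E[X] = \theta \qquad \mathrm{Var}[X] = \theta^2$$

[Figure: exponential density with rate 0.1 (θ = 10), plotted as 0.1 * exp(-0.1 * x) for x from 0 to 100.]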
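The mean and variance of the exponential density can be checked numerically. This is a sketch that assumes the scale parameterization p(x) = (1/θ) exp(-x/θ), with θ = 10 chosen so that the curve equals the plotted 0.1 * exp(-0.1 * x); `integrate` is base R's one-dimensional quadrature routine.

```r
theta <- 10                              # scale; 1/theta = 0.1 matches the plotted curve
p <- function(x) exp(-x / theta) / theta

total <- integrate(p, 0, Inf)$value                            # should be 1
m <- integrate(function(x) x * p(x), 0, Inf)$value             # E[X]
v <- integrate(function(x) (x - m)^2 * p(x), 0, Inf)$value     # Var[X]
c(total = total, mean = m, var = v)      # theory: 1, theta, theta^2
```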


The Standard Normal distribution

$$p(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right)$$

$$E[X] = 0 \qquad \mathrm{Var}[X] = 1$$




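General Gaussian

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

$$E[X] = \mu \qquad \mathrm{Var}[X] = \sigma^2$$

[Figure: Gaussian density with μ = 100, σ = 15.]

Shorthand: We say X ~ N(μ, σ²) to mean "X is distributed as a Gaussian with parameters μ and σ²". In the above figure, X ~ N(100, 15²).

Also known as the normal distribution or bell-shaped curve.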


The Error Function
Assume X ~ N(0,1)
Define ERF(x) = P(X<x) = Cumulative Distribution of X

$$ERF(x) = \int_{z=-\infty}^{x} p(z)\,dz = \frac{1}{\sqrt{2\pi}} \int_{z=-\infty}^{x} \exp\left(-\frac{z^2}{2}\right) dz$$


Using The Error Function
Assume X ~ N(μ, σ²)

$$P(X < x \mid \mu, \sigma^2) = ERF\left(\frac{x - \mu}{\sqrt{\sigma^2}}\right)$$
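In R, the quantity these slides call ERF (the standard normal CDF) is `pnorm`; note that the conventional erf found in math libraries is scaled differently. A small sketch of the standardization step, using the μ = 100, σ = 15 Gaussian from the earlier slide and an arbitrary query point x = 120:

```r
ERF <- function(x) pnorm(x)     # standard normal CDF, the slides' "ERF"
mu <- 100; sigma <- 15          # parameters of the Gaussian shown earlier
x <- 120                        # arbitrary query point for illustration

via_erf <- ERF((x - mu) / sigma)               # standardize, then use N(0,1)
direct  <- pnorm(x, mean = mu, sd = sigma)     # let pnorm do the standardizing
c(via_erf = via_erf, direct = direct)
```

Both routes should give the same probability, which is the point of the standardization formula.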


The Central Limit Theorem
• If (X1, X2, … Xn) are i.i.d. continuous random variables
• Then define

$$z = f(x_1, x_2, \ldots, x_n) = \frac{1}{n} \sum_{i=1}^{n} x_i$$

• As n → infinity, p(z) → Gaussian with mean E[Xi] and variance Var[Xi]/n

Somewhat of a justification for assuming Gaussian noise is common
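A quick simulation illustrates the theorem. The sketch below assumes exponential X_i with rate 0.1 (so E[X_i] = 10 and Var[X_i] = 100), a choice made only because that density is far from Gaussian; the sample mean z should then have mean near 10 and variance near Var[X_i]/n.

```r
set.seed(42)
n <- 50          # number of variables averaged in each z
reps <- 10000    # number of simulated values of z

z <- replicate(reps, mean(rexp(n, rate = 0.1)))
c(mean_z = mean(z), var_z = var(z))   # compare with 10 and 100 / 50 = 2
# hist(z, freq = FALSE) would show the roughly bell-shaped density of z
```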


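Density function estimators

• Histograms: $\hat{f}(x) = \dfrac{k}{nh}$, where h is the class width and k is the number of observations in the class containing x

• K-nearest neighbors: $\hat{f}(x) = \dfrac{k}{2\,n\,d_k}$, where $d_k$ is the distance from x to its k-th nearest neighbor

• Kernel density estimators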
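The first two estimators are short enough to write out directly. This is a minimal sketch on the ten-point sample used in the histogram example; `knn_density` and `hist_density` are hypothetical helper names, and the class breaks 0, 2, 4, 6, 8 are an assumption.

```r
x <- c(7.3, 6.8, 7.1, 2.5, 7.9, 6.5, 4.2, 0.5, 5.6, 5.9)

knn_density <- function(t, x, k) {
  d_k <- sort(abs(x - t))[k]          # distance from t to its k-th nearest observation
  k / (2 * length(x) * d_k)
}

hist_density <- function(t, x, breaks) {
  bin <- findInterval(t, breaks)              # index of the class containing t
  k <- sum(findInterval(x, breaks) == bin)    # observations in that class
  h <- breaks[bin + 1] - breaks[bin]          # class width
  k / (length(x) * h)
}

knn_density(5, x, k = 3)                          # 3 / (2 * 10 * 0.9)
hist_density(5, x, breaks = seq(0, 8, by = 2))    # 3 / (10 * 2)
```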


Density function estimation: histogram

> x=c( 7.3, 6.8, 7.1, 2.5, 7.9, 6.5, 4.2, 0.5, 5.6, 5.9)
> hist(x,freq=F,main="Estimacion de funcion de densidad-histograma")
> rug(x,col=2)

[Figure: density histogram of x; x-axis 0 to 8, y-axis Density from 0.00 to 0.25.]


[Figure: density estimation by kNN at 20 points with k = 1, 3, 5, 7; four panels plotting fest against x.]


Kernel estimation of a univariate density function.

In the univariate case, the kernel estimator of the density function f(x) is obtained as follows. Let x1,…,xn be a sample of a random variable X with density function f(x), and define the empirical distribution function by

$$F_n(x) = \frac{\#\{\text{obs} \le x\}}{n}$$

which is an estimator of the cumulative distribution function F(x) of X. Since the density function f(x) is the derivative of the distribution function F, a finite-difference approximation of the derivative gives

$$\hat{f}(x) = \frac{F_n(x+h) - F_n(x-h)}{2h}$$

where h is a positive value close to zero. This is equivalent to the proportion of points in the interval (x-h, x+h) divided by 2h. The equation above can be written as

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

where the weight function K is defined by

$$K(z) = \begin{cases} 0 & \text{if } |z| > 1 \\ 1/2 & \text{if } |z| \le 1 \end{cases}$$


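Sample: 6, 8, 9, 12, 20, 25, 18, 31

$$\hat{f}(15) = \frac{\text{proportion of points in } (15-h,\ 15+h)}{2h}$$

With h = 4:

$$\hat{f}(15) = \frac{\text{proportion of points in } (11, 19)}{8} = \frac{2/8}{8} = \frac{2}{64} = \frac{1}{32}$$

$$\hat{f}(15) = \frac{1}{8 \cdot 4}\left[0 + 0 + 0 + \tfrac{1}{2} + 0 + 0 + \tfrac{1}{2} + 0\right] = \frac{1}{32}$$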
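The worked example can be checked with a few lines of R. This sketch assumes h = 4, which is the value implied by the interval (11, 19); `K_unif` and `kde` are hypothetical helper names.

```r
x <- c(6, 8, 9, 12, 20, 25, 18, 31)

K_unif <- function(z) ifelse(abs(z) <= 1, 0.5, 0)    # uniform kernel

# Kernel estimate at a single point t: (1 / (n h)) * sum of K((t - x_i) / h)
kde <- function(t, x, h, K) sum(K((t - x) / h)) / (length(x) * h)

kde(15, x, h = 4, K = K_unif)    # 1/32 = 0.03125
```

Only the observations 12 and 18 fall within 4 units of 15, so each contributes 1/2 and every other term is zero.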


This K is called the uniform kernel and h is called the bandwidth, a smoothing parameter that indicates how much each sample point contributes to the estimate at the point x. In general, K and h must satisfy certain regularity conditions, such as:

• $\int_{-\infty}^{\infty} K(z)\,dz = 1$
• K(z) must be bounded and absolutely integrable on (-∞, ∞)
• Usually, but not always, K(z) ≥ 0 and K is symmetric, so any symmetric density function can be used as a kernel
• $\lim_{n \to \infty} h(n) = 0$


Choice of the bandwidth h

$$h = 1.06\,s\,n^{-0.2}$$

where n is the number of data points and s is the sample standard deviation.

$$h = 0.79\,(Q_3 - Q_1)\,n^{-0.2}$$
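Both rules are one-liners in R. The sketch below applies them to the ten-point sample from the histogram example; it assumes R's default quantile definition inside `IQR`, so Q3 − Q1 may differ slightly from a hand calculation that uses another convention.

```r
x <- c(7.3, 6.8, 7.1, 2.5, 7.9, 6.5, 4.2, 0.5, 5.6, 5.9)
n <- length(x)

h_sd  <- 1.06 * sd(x) * n^(-0.2)     # rule based on the sample standard deviation
h_iqr <- 0.79 * IQR(x) * n^(-0.2)    # rule based on the interquartile range Q3 - Q1
c(h_sd = h_sd, h_iqr = h_iqr)
```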


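THE GAUSSIAN KERNEL

In this case the kernel is a smoother weight function in which all points contribute to the estimate of f(x) at x. That is,

$$K(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} z^2\right)$$

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x - x_i)^2}{2h^2}}$$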
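The Gaussian kernel estimator can be written by hand in R, since `dnorm` is exactly K(z). This is a sketch on the ten-point sample from the histogram example with an assumed bandwidth h = 0.5; `kde_gauss` is a hypothetical helper name.

```r
x <- c(7.3, 6.8, 7.1, 2.5, 7.9, 6.5, 4.2, 0.5, 5.6, 5.9)

# Evaluate (1 / (n h)) * sum_i K((t - x_i) / h) at every point of t
kde_gauss <- function(t, x, h) {
  sapply(t, function(t0) sum(dnorm((t0 - x) / h)) / (length(x) * h))
}

grid <- seq(-5, 15, length.out = 2001)
fest <- kde_gauss(grid, x, h = 0.5)
area <- sum(fest) * (grid[2] - grid[1])   # Riemann sum; should be close to 1
area
```

Base R's `density(x, bw = 0.5)` computes the same kind of Gaussian estimate on a grid, so the hand-written version is mainly useful for seeing the formula at work.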


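THE TRIANGULAR KERNEL

$$K(z) = \begin{cases} 1 - |z| & \text{for } |z| < 1 \\ 0 & \text{otherwise} \end{cases}$$

THE "BIWEIGHT" KERNEL

$$K(z) = \begin{cases} \dfrac{15}{16}(1 - z^2)^2 & \text{for } |z| < 1 \\ 0 & \text{otherwise} \end{cases}$$

THE EPANECHNIKOV KERNEL

$$K(z) = \begin{cases} \dfrac{3}{4\sqrt{5}}\left(1 - \dfrac{z^2}{5}\right) & \text{for } |z| < \sqrt{5} \\ 0 & \text{otherwise} \end{cases}$$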
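Each of these kernels should integrate to 1, which is easy to confirm numerically. The sketch below assumes the definitions K(z) = 1 − |z| and (15/16)(1 − z²)² on |z| < 1, and (3/(4√5))(1 − z²/5) on |z| < √5; the function names are hypothetical.

```r
K_tri  <- function(z) ifelse(abs(z) < 1, 1 - abs(z), 0)
K_biw  <- function(z) ifelse(abs(z) < 1, 15 / 16 * (1 - z^2)^2, 0)
K_epan <- function(z) ifelse(abs(z) < sqrt(5), 3 / (4 * sqrt(5)) * (1 - z^2 / 5), 0)

# Integrate each kernel over its support
areas <- c(
  triangular   = integrate(K_tri, -1, 1)$value,
  biweight     = integrate(K_biw, -1, 1)$value,
  epanechnikov = integrate(K_epan, -sqrt(5), sqrt(5))$value
)
areas
```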


[Figure: density estimation at 20 points using a Gaussian kernel with h = .5, "opt1", "opt2", 4; four panels plotting fest against x.]


Bivariate random variables

p(x,y) = probability density of random variables (X,Y) at location (x,y)


Bivariate density function estimators

• Histograms: $\hat{f}(\mathbf{x}) = \dfrac{k}{nA}$, where A is the area of the class

• K-nearest neighbors: $\hat{f}(\mathbf{x}) = \dfrac{k}{n A_k}$, where $A_k$ is the area of the region that reaches out to the k-th nearest neighbor

• Kernel density estimators


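Bivariate kernel estimation

Let $\mathbf{x}_i = (x_i, y_i)$ be the observed values and $\mathbf{t} = (t_1, t_2)$ a point of the plane where the joint density is to be estimated.

$$\hat{f}(\mathbf{t}) = \frac{1}{n h_1 h_2} \sum_{i=1}^{n} K\left(\left[(\mathbf{t} - \mathbf{x}_i)'\,H^{-1}\,(\mathbf{t} - \mathbf{x}_i)\right]^{1/2}\right), \qquad H = \begin{pmatrix} h_1^2 & 0 \\ 0 & h_2^2 \end{pmatrix}$$

If h1 = h2 = h:

$$\hat{f}(\mathbf{t}) = \frac{1}{n h^2} \sum_{i=1}^{n} K\left(\frac{\|\mathbf{t} - \mathbf{x}_i\|}{h}\right)$$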


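• $(a_1, a_2)\,H^{-1}\,(a_1, a_2)' = \dfrac{a_1^2}{h^2} + \dfrac{a_2^2}{h^2} = \dfrac{\|\mathbf{a}\|^2}{h^2}$, where

$$H^{-1} = \begin{pmatrix} 1/h^2 & 0 \\ 0 & 1/h^2 \end{pmatrix}$$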


Density estimation: bivariate Gaussian kernel

$$\hat{f}(\mathbf{t}) = \frac{1}{n h_1 h_2} \sum_{i=1}^{n} \frac{1}{2\pi}\, e^{-\frac{(x_i - t_1)^2}{2h_1^2} - \frac{(y_i - t_2)^2}{2h_2^2}}$$
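The bivariate Gaussian kernel estimator can be evaluated by hand at a single point. The auto-mpg data is not reproduced in these slides, so this sketch assumes synthetic stand-in data and arbitrary bandwidths h1 = 3, h2 = 250; `kde2_gauss` is a hypothetical helper name.

```r
set.seed(7)
xs <- rnorm(200, 25, 8)        # synthetic stand-in for an mpg-like variable
ys <- rnorm(200, 3000, 700)    # synthetic stand-in for a weight-like variable

# f_hat(t) = 1 / (2 pi n h1 h2) * sum_i exp(-(x_i - t1)^2 / (2 h1^2) - (y_i - t2)^2 / (2 h2^2))
kde2_gauss <- function(t1, t2, xs, ys, h1, h2) {
  sum(exp(-(xs - t1)^2 / (2 * h1^2) - (ys - t2)^2 / (2 * h2^2))) /
    (2 * pi * length(xs) * h1 * h2)
}

est <- kde2_gauss(25, 3000, xs, ys, h1 = 3, h2 = 250)
est
```

With a diagonal H the kernel factorizes, so the same value is the average over i of the product of two one-dimensional Gaussian densities.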


[Figure: perspective plot of the joint density estimated by the kernel method.]

library(MASS)  # provides kde2d
f1= kde2d(autompg1$V1, autompg1$V5,n=100)
persp(f1$x,f1$y,f1$z)


[Figure: contour plot of the estimated density; x-axis mpg (10 to 40), y-axis weight (1500 to 4500).]

contour(f1, levels=c(8e-6,2e-5, 2.8e-5),col=c(2,3,4), xlab="mpg",ylab="weight")


In 2 dimensions

Let X,Y be a pair of continuous random variables, and let R be some region of (X,Y) space…

$$P((X,Y) \in R) = \iint_{(x,y) \in R} p(x,y)\,dy\,dx$$


In 2 dimensions


P( 20<mpg<30 and 2500<weight<3000) = volume under the 2-d surface within the red rectangle
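The "volume under the surface" can be approximated by summing the density over a fine grid of cells inside the rectangle. The estimated mpg/weight density is not available here, so this sketch assumes a hypothetical joint density (independent Gaussians), which also has a closed-form answer to compare against.

```r
# Hypothetical joint density; a stand-in for the estimated mpg/weight surface
p <- function(x, y) dnorm(x, 25, 8) * dnorm(y, 3000, 700)

gx <- seq(20, 30, length.out = 201)        # cell edges along x
gy <- seq(2500, 3000, length.out = 201)    # cell edges along y
dx <- gx[2] - gx[1]; dy <- gy[2] - gy[1]
xm <- gx[-1] - dx / 2                      # cell midpoints
ym <- gy[-1] - dy / 2

approx_prob <- sum(outer(xm, ym, p)) * dx * dy    # midpoint-rule volume
exact_prob  <- (pnorm(30, 25, 8) - pnorm(20, 25, 8)) *
               (pnorm(3000, 3000, 700) - pnorm(2500, 3000, 700))
c(approx = approx_prob, exact = exact_prob)
```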


In 2 dimensions


P( [(mpg-25)/10]² + [(weight-3300)/1500]² < 1 ) = volume under the 2-d surface within the red oval


In 2 dimensions


Take the special case of region R = "everywhere". Remember that with probability 1, (X,Y) will be drawn from "somewhere". So…

$$\int_{x=-\infty}^{\infty} \int_{y=-\infty}^{\infty} p(x,y)\,dy\,dx = 1$$


In 2 dimensions


$$p(x,y) = \lim_{h \to 0} \frac{P\left(x - \frac{h}{2} < X \le x + \frac{h}{2} \ \wedge\ y - \frac{h}{2} < Y \le y + \frac{h}{2}\right)}{h^2}$$


In m dimensions

Let (X1,X2,…Xm) be an m-tuple of continuous random variables, and let R be some region of $\mathbb{R}^m$ …

$$P((X_1, X_2, \ldots, X_m) \in R) = \int\!\!\int\!\cdots\!\int_{(x_1, x_2, \ldots, x_m) \in R} p(x_1, x_2, \ldots, x_m)\,dx_m \cdots dx_2\,dx_1$$


Independence

If X and Y are independent then knowing the value of X does not help predict the value of Y

$$X \perp Y \iff \forall x,y:\ p(x,y) = p(x)\,p(y)$$

mpg, weight NOT independent


Independence



The contours say that acceleration and weight are independent


Multivariate Expectation

$$\mu_{\mathbf{X}} = E[\mathbf{X}] = \int \mathbf{x}\,p(\mathbf{x})\,d\mathbf{x}$$

E[mpg,weight] =(24.5,2600)

The centroid of the cloud


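Multivariate Expectation
> f1= kde2d(autompg1$mpg, autompg1$weight,n=100)
> dx=f1$x[2]-f1$x[1]
> dy=f1$y[2]-f1$y[1]
> dx
[1] 0.379798
> dy
[1] 35.62626
> meanmpg=sum(f1$x*f1$z)*dx*dy
[1] 22.48855
> # note: the rows of f1$z are indexed by f1$x, so f1$y*f1$z below recycles f1$y along the mpg index;
> # sum(t(f1$z)*f1$y)*dx*dy is the sum weighted by the weight values
> meanweight=sum(f1$y*f1$z)*dx*dy
[1] 2848.638
> #estimated mean
> mean(autompg1$weight)
[1] 2977.584
> mean(autompg1$mpg)
[1] 23.44592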


Multivariate Expectation

$$E[f(\mathbf{X})] = \int f(\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$$


Test your understanding
Question: When (if ever) does E[X + Y] = E[X] + E[Y]?

• All the time? (Yes, always.)
• Only when X and Y are independent?
• It can fail even if X and Y are independent?


Bivariate Expectation

$$E[f(x,y)] = \iint f(x,y)\,p(x,y)\,dy\,dx$$

$$E[X] = \iint x\,p(x,y)\,dy\,dx$$

$$E[Y] = \iint y\,p(x,y)\,dy\,dx$$

$$E[X + Y] = \iint (x + y)\,p(x,y)\,dy\,dx$$

$$E[X + Y] = E[X] + E[Y]$$




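Bivariate Covariance

$$\mathrm{Cov}[X,Y] = \sigma_{xy} = E[(X - \mu_x)(Y - \mu_y)]$$

$$\mathrm{Cov}[X,X] = \sigma_{xx} = \sigma_x^2 = \mathrm{Var}[X] = E[(X - \mu_x)^2]$$

$$\mathrm{Cov}[Y,Y] = \sigma_{yy} = \sigma_y^2 = \mathrm{Var}[Y] = E[(Y - \mu_y)^2]$$

Write $\mathbf{X} = \begin{pmatrix} X \\ Y \end{pmatrix}$, then

$$\mathrm{Cov}[\mathbf{X}] = E[(\mathbf{X} - \mu_{\mathbf{x}})(\mathbf{X} - \mu_{\mathbf{x}})^T] = \Sigma = \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix}$$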


Estimated covariance and standard deviations of mpg and weight

> cov(autompg1[,c(1,5)])
               mpg      weight
mpg       60.91814   -5517.441
weight -5517.44070  721484.709
> sd(autompg1$mpg)
[1] 7.805007
> sd(autompg1$weight)
[1] 849.4026




Covariance Intuition

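E[mpg,weight] = (24.5, 2600)

$\sigma_{\text{mpg}} = 8$, $\sigma_{\text{weight}} = 700$

[Figure: scatter of mpg against weight with the principal eigenvector of Σ drawn through the centroid.]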


Regression Line

$$E(Y \mid X = x) = \mu_y + \frac{\sigma_{xy}}{\sigma_x^2}\,(x - \mu_x)$$

Notice that the regression line passes through $(\mu_x, \mu_y)$.
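The slope σxy/σx² and the fact that the line passes through the means can be checked against `lm`. The auto-mpg data is not included in these slides, so this sketch assumes synthetic data that only loosely resembles the mpg/weight relationship.

```r
set.seed(11)
x <- rnorm(300, 23, 8)                          # synthetic predictor
y <- 5100 - 90 * x + rnorm(300, 0, 400)         # synthetic response

slope     <- cov(x, y) / var(x)                 # sigma_xy / sigma_x^2
intercept <- mean(y) - slope * mean(x)          # forces the line through the means

fit <- lm(y ~ x)
rbind(formula = c(intercept, slope), lm = coef(fit))
```

The two rows should agree, since least squares produces exactly this slope and intercept when sample moments are used in place of μ and σ.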


Regression Line
> l1=lm(weight~mpg,data=autompg1)
> l1

Call:
lm(formula = weight ~ mpg, data = autompg1)

Coefficients:
(Intercept)          mpg
    5101.11       -90.57

> #slope of regression line
> slope= -5517.44/60.918
[1] -90.571


First principal component
> a=cov(autompg1[,c(1,5)])
> eigen(a)
$values
[1] 721526.90386     18.72329

$vectors
             [,1]        [,2]
[1,] -0.007647317 0.999970759
[2,]  0.999970759 0.007647317

> #slope of first principal component
> .99997/-.00764
[1] -130.8861


Covariance Fun Facts

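$$\mathrm{Cov}[\mathbf{X}] = E[(\mathbf{X} - \mu_{\mathbf{x}})(\mathbf{X} - \mu_{\mathbf{x}})^T] = \Sigma = \begin{pmatrix} \sigma_x^2 & \sigma_{xy} \\ \sigma_{xy} & \sigma_y^2 \end{pmatrix}$$

• True or False: If σxy = 0 then X and Y are independent. False
• True or False: If X and Y are independent then σxy = 0. True
• True or False: If σxy = σx σy then X and Y are deterministically related. True
• True or False: If X and Y are deterministically related then σxy = σx σy. False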

How could you prove or disprove these?


Test your understanding
Question: When (if ever) does Var[X + Y] = Var[X] + Var[Y]?

• All the time?
• Only when X and Y are independent? True whenever they are independent; in general the identity holds exactly when Cov[X,Y] = 0
• It can fail even if X and Y are independent?
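Both "test your understanding" questions can be explored by simulation. The sketch below assumes a deliberately dependent pair (Y built from X plus noise): the means still add, while the variances add only after the 2 Cov[X,Y] term is included.

```r
set.seed(3)
x <- rnorm(1e5)
y <- 0.8 * x + rnorm(1e5)     # deliberately dependent on x

means <- c(lhs = mean(x + y), rhs = mean(x) + mean(y))
vars  <- c(lhs = var(x + y),
           with_cov    = var(x) + var(y) + 2 * cov(x, y),
           without_cov = var(x) + var(y))
means
vars
```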


Marginal Distributions

$$p(x) = \int_{y=-\infty}^{\infty} p(x,y)\,dy$$


Conditional Distributions

p(x|y) = p.d.f. of X when Y = y

[Figure: conditional densities p(mpg | weight = 4600), p(mpg | weight = 3200) and p(mpg | weight = 2000).]


Conditional Distributions

p(x|y) = p.d.f. of X when Y = y

[Figure: conditional density p(mpg | weight = 4600).]

$$p(x \mid y) = \frac{p(x,y)}{p(y)}$$

Why?


Independence Revisited

It’s easy to prove that these statements are equivalent…


$$\forall x,y:\ p(x,y) = p(x)\,p(y)$$

$$\forall x,y:\ p(x \mid y) = p(x)$$

$$\forall x,y:\ p(y \mid x) = p(y)$$


More useful stuff

(These can all be proved from definitions on previous slides)

$$\int_{x=-\infty}^{\infty} p(x \mid y)\,dx = 1$$

$$p(x \mid y, z) = \frac{p(x, y \mid z)}{p(y \mid z)}$$

Bayes Rule:

$$p(x \mid y) = \frac{p(y \mid x)\,p(x)}{p(y)}$$


Mixing discrete and continuous variables

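$$p(x, A = v) = \lim_{h \to 0} \frac{P\left(x - \frac{h}{2} < X \le x + \frac{h}{2} \ \wedge\ A = v\right)}{h}$$

$$\sum_{v=1}^{n_A} \int_{x=-\infty}^{\infty} p(x, A = v)\,dx = 1$$

Bayes Rule:

$$p(x \mid A) = \frac{P(A \mid x)\,p(x)}{P(A)}$$

Bayes Rule:

$$P(A \mid x) = \frac{p(x \mid A)\,P(A)}{p(x)}$$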
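The second form of Bayes Rule is what turns class-conditional densities into a classifier. This is a sketch with made-up ingredients: two classes with hypothetical priors 0.75 and 0.25 and hypothetical Gaussian class densities; none of these numbers come from the slides.

```r
prior <- c(A1 = 0.75, A2 = 0.25)              # P(A = v), hypothetical

lik <- function(x) c(A1 = dnorm(x, 10, 2),    # p(x | A = v), hypothetical
                     A2 = dnorm(x, 14, 2))

posterior <- function(x) {
  joint <- lik(x) * prior     # p(x, A = v) = p(x | A = v) P(A = v)
  joint / sum(joint)          # divide by p(x), the sum of the joint over v
}

posterior(12)    # x = 12 is equally likely under both classes, so this equals the prior
posterior(15)    # a larger x shifts weight toward class A2
```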


[Figure: estimate of the mixed joint density P(education, salary > 50k); curves of fest against x (0 to 15) for class 1 and class 2.]


[Figure: conditional density estimation of education by class, and estimation of the posterior P(Class | Education); education levels 1 to 16, class 1 and class 2.]



[Figures: P(EduYears,Wealthy); P(Wealthy | EduYears); P(EduYears | Wealthy), shown with renormalized axes.]


Exercises
• Suppose X and Y are independent real-valued random variables distributed between 0 and 1:
  • What is p[min(X,Y)]?
  • What is E[min(X,Y)]?
• Prove that E[X] is the value u that minimizes E[(X-u)²]
• What is the value u that minimizes E[|X-u|]?
