Recommender Systems: Collaborative Filtering

Vanessa Gómez Verdejo

Leganés, 7 November 2012

Contents of the talk…
• Recommender systems
  • http://www.recommenderbook.net/
  • Tutorial: International Joint Conference on Artificial Intelligence 2011
  • "Recommender Systems: An Introduction"
  • "Recommender Systems Handbook"

Interest
• Websites: Amazon.com, YouTube, Netflix, Yahoo, Tripadvisor, Last.fm, and IMDb.
• Conferences, workshops, journals:
  • Since 2007: ACM Recommender Systems
  • Special sessions or tutorials: ICML 2011, IJCAI 2011, KDD 2010, …
  • Special issues: AI Communications (2008); IEEE Intelligent Systems (2007); International Journal of Electronic Commerce (2006); International Journal of Computer Science and Applications (2006); ACM Transactions on Computer-Human Interaction (2005); and ACM Transactions on Information Systems (2004).
• Books
• Toolboxes:
  • Scout Portal Toolkit (SPT)
  • Recommender CRM Personalization Engine
  • Duine Toolkit
  • CoFE (the COllaborative Filtering Engine)
  • PREA: Personalized Recommendation Algorithms Toolkit

Recommender systems
• A recommender system (RS) can be viewed as a function
• Input:
  • User information (e.g. ratings, preferences, demographic variables, …)
  • Items (with or without descriptors of their features)
• Output:
  • Relevance of an item => ranking of items
    • Recommend unknown items that users may like

Types of recommender systems

Personalized recommendations

Collaborative: "Tell me what's popular among my peers"

Content-based: "Show me more of the same what I've liked"


Content-based recommendations
• Exploit item information
  • Recommend comedies to users who liked comedies in the past
• Most are based on information retrieval methods
• 3 BLOCKS:
  • Information extraction to characterize the items
    • Item descriptors are extracted automatically (keywords, term frequency, inverse document frequency, …)
  • User profile or preferences
    • Defined explicitly
    • The user profile is learned (ML: k-NN, Naïve Bayes, SVMs, …)
  • Recommendation: find or rank the items "similar" to the user's preferences
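The three blocks above can be sketched in a few lines. This is only an illustration under stated assumptions: the item names and keyword counts are hypothetical, the profile is simply the sum of the descriptors of liked items, and cosine similarity is one common choice of "similarity", not necessarily the one the talk had in mind.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Block 1: item descriptors (hypothetical keyword counts).
items = {
    "movie_a": Counter({"comedy": 2, "romance": 1}),
    "movie_b": Counter({"comedy": 1, "family": 1}),
    "movie_c": Counter({"war": 2, "drama": 1}),
}

# Block 2: user profile built from the items the user liked.
liked = ["movie_a"]
profile = Counter()
for name in liked:
    profile.update(items[name])

# Block 3: rank the unseen items by similarity to the profile.
candidates = [name for name in items if name not in liked]
ranking = sorted(candidates, key=lambda n: cosine(profile, items[n]), reverse=True)
print(ranking)
```

The comedy ranks above the war drama because it shares a keyword with the profile; an item with no shared keywords always scores zero, which is the overspecialization risk of this family of methods.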

Advantages and disadvantages
• ADVANTAGES:
  • The recommendation depends only on the tastes of the active user
  • Recommendations are easy to explain
  • New items: items that no user has rated yet can still be recommended
• DISADVANTAGES:
  • Limited by the content analysis (the features that represent the items)
  • Overspecialization:
    • the system recommends items similar to those the user already likes
    • it has no capacity to "surprise"
  • New users: if a user has rated only a few items, the system cannot give good recommendations

Knowledge-based: "Tell me what fits based on my needs"

Knowledge-based recommendations
• The user explicitly provides the parameters of the product they want
• Constraint-based recommendations:
  • Find the items that satisfy all the constraints
  • Find a set of items that satisfy the largest (weighted) set of constraints
  • Interaction with the user about which constraints can be relaxed
  • OUTPUT: a list of items ordered by the number of constraints they satisfy
• "Case-based recommender systems":
  • Similarities are defined between the features of the item and the requirements of the user
  • w_r weights the importance of each requirement
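The similarity formula itself did not survive on the slide. A common form for case-based recommenders, assumed here, is a weighted average of per-requirement similarities, similarity(p, REQ) = Σ_r w_r · sim(p, r) / Σ_r w_r. The sketch below follows that assumption; the local similarity function, the value ranges and the camera example are all hypothetical.

```python
def local_similarity(value, target, value_range):
    """Similarity in [0, 1] for one numeric requirement (an illustrative choice)."""
    return max(0.0, 1.0 - abs(value - target) / value_range)

def case_similarity(item, requirements, weights, ranges):
    """Weighted average of local similarities; weights[r] plays the role of w_r."""
    total_weight = sum(weights[r] for r in requirements)
    weighted = sum(
        weights[r] * local_similarity(item[r], requirements[r], ranges[r])
        for r in requirements
    )
    return weighted / total_weight

# Hypothetical example: the user wants a camera around 300 (price) with 10x zoom,
# and cares twice as much about price as about zoom.
requirements = {"price": 300, "zoom": 10}
weights = {"price": 2.0, "zoom": 1.0}
ranges = {"price": 500, "zoom": 20}
camera = {"price": 400, "zoom": 10}

score = case_similarity(camera, requirements, weights, ranges)
print(round(score, 3))
```

Sorting the catalogue by this score gives the ranked output; raising w_r for a requirement makes mismatches on it cost more.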

Advantages and disadvantages
• ADVANTAGES:
  • Recommendations are easy to explain
  • Users do not need to have rated the items:
    • No "cold start problem"
  • No overspecialization
• DISADVANTAGES:
  • Acquiring the knowledge is not easy:
    • Very precise recommendations require several interaction cycles with the user

Hybrid: combinations of various inputs and/or composition of different mechanisms


Evaluation
• The rating history is used to evaluate the rating predictions
• Error measures
  • Mean Absolute Error (MAE)
  • Root Mean Square Error (RMSE)
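Both error measures are short enough to write out. The following is a minimal sketch; the held-out ratings and predictions are made-up numbers used only to exercise the functions.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error over paired ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Square Error over paired ratings."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical held-out ratings and the corresponding predictions.
actual = [4, 3, 5, 1]
predicted = [3.5, 3, 4, 2]

print(mae(actual, predicted), rmse(actual, predicted))
```

RMSE penalizes large errors more heavily than MAE, which is why the two values differ on the same predictions.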

Evaluation as information retrieval
• The recommendation task is often viewed as an information retrieval task:
  • Retrieve (recommend) all the items that the recommender system considers "good"
• We only know whether an item is relevant or not (not its rating)

                            Reality
                            Actually Good          Actually Bad
Prediction   Rated Good     True Positive (tp)     False Positive (fp)
             Rated Bad      False Negative (fn)    True Negative (tn)

[Figure: two overlapping sets, "All good items" and "All recommended items".]

Measures
• Precision: fraction of the recommended items that are relevant
• Recall: fraction of all good items that are recommended
• F1: combines precision and recall
• The position at which an item is recommended matters.
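As a small worked example, the three set-based measures can be computed directly from the list of recommended items and the set of good items. The item identifiers below are the ones used in the deck's rank score example; the function name is an assumption.

```python
def precision_recall_f1(recommended, good):
    """Set-based precision, recall and F1 for one recommendation list."""
    hits = len(set(recommended) & set(good))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(good) if good else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

good = {"Item 237", "Item 899"}
recommended = ["Item 345", "Item 237", "Item 187"]

precision, recall, f1 = precision_recall_f1(recommended, good)
print(precision, recall, f1)
```

One of the three recommended items is good (precision 1/3) and one of the two good items is recommended (recall 1/2). None of these measures changes if the list is reordered, which motivates position-aware scores.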

Rank Score

Actually good: Item 237, Item 899

Recommended (predicted as good): Item 345, Item 237 (hit), Item 187

$$R_i = \frac{0}{2^{\frac{1-1}{2-1}}} + \frac{1}{2^{\frac{2-1}{2-1}}} + \frac{0}{2^{\frac{3-1}{2-1}}} = 0.5$$
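The rank score arithmetic above can be reproduced with a short function. This is a sketch that assumes the usual half-life form, in which a hit at position j contributes 1 / 2^((j - 1) / (alpha - 1)); the parameter name `alpha` is an assumption, and the value 2 matches the exponents shown in the example.

```python
def rank_score(recommended, good, alpha=2):
    """Half-life rank score: hits lower in the list count exponentially less."""
    return sum(
        1.0 / 2 ** ((position - 1) / (alpha - 1))
        for position, item in enumerate(recommended, start=1)
        if item in good
    )

good = {"Item 237", "Item 899"}
recommended = ["Item 345", "Item 237", "Item 187"]

score = rank_score(recommended, good)
print(score)
```

The only hit sits at position 2, so the score is 1/2. Moving the same hit to position 1 would give 1.0, which is what distinguishes this measure from precision and recall.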

Associative Retrieval Techniques for the Sparsity Problem • 131

that each successive item in a list is less likely to be viewed by the user with an exponential decay. The final recommendation utility score over all the test customers is:

$$R = 100 \, \frac{\sum_i R_i}{\sum_i R_i^{\max}},$$ (14)

where $R_i^{\max}$ is the maximum achievable utility if all future purchases of user i had been at the top of the ranked list. In our experiments, we set the number of recommendations for all collaborative filtering approaches studied to 50. Thus, the recommendation list contained exactly 50 books.

To measure the degree of sparsity of the consumer–product interaction matrix, we used the following graph density definition in (15).

$$\text{Graph density} = \frac{\text{Number of actual links present in the graph}}{\text{Number of possible links in the graph}}.$$ (15)

In our experimental study, we experimented with the following 4 approaches that represent the extant collaborative filtering approaches that do not explore transitive associations.

• 3-Hop. The 3-hop algorithm is a simple graph-based collaborative filtering algorithm that makes recommendations based on paths with length 3 as illustrated in Section 3.2.

• User-Based (Correlation). This approach calculates the Pearson correlation coefficients between the users and then recommends items based on the purchases of customers that are highly correlated with the target customer.²

• User-Based (Vector Similarity). This approach calculates user similarities using the vector similarity function and then recommends items based on the purchases of customers that are similar to the target customer.²

• Item-Based. This approach calculates item similarities instead of user similarities based on the transactional data and then recommends items that are similar to the target customer's previous purchases. In our study, we applied the vector similarity function to calculate the item similarities.³

The 3-hop approach is the simplest of the graph-based approaches and functions as the comparison baseline. We decided to compare spreading-activation-based approaches with the User-based (Correlation) and User-based (Vector Similarity) approaches because in previous studies [Breese et al. 1998], they had been shown to deliver excellent performance for general recommendation tasks. The item-based approach [Sarwar et al. 2001] was chosen as representative of approaches specifically designed to deal with the sparsity problem. This approach has been shown to perform better than other methods in certain applications [Karypis 2001; Sarwar et al. 2001].

We experimented with three different spreading activation algorithms including the LCM, BNB and Hopfield algorithms introduced in Section 4. When comparing with other collaborative filtering algorithms, we chose the Hopfield

² Specific algorithm implementation followed that in [Breese et al. 1998].
³ Specific algorithm implementation followed that in [Sarwar et al. 2001].

ACM Transactions on Information Systems, Vol. 22, No. 1, January 2004.

Collaborative filtering

Collaborative filtering
• IDEA
  • Users who had similar tastes in the past will have similar tastes in the future
  • Use the wisdom of the crowd to recommend items
• INPUT
  • Users rate a catalogue of items (implicitly or explicitly)
  • User-item rating matrix
• OUTPUT
  • A (numerical) prediction of how much a user likes or dislikes an item
  • A list of N recommended items

User-based neighborhood methods
• IDEA:
  • Given the "active user" (Alice) and an item she has not yet seen
  • Find the set of users who are most similar to Alice (they like similar items) and who have rated the item
  • Use, for example, the average of their ratings to predict whether Alice will like the item
  • Apply this process to every item Alice has not rated and recommend those with the highest predicted rating

        Item1  Item2  Item3  Item4  Item5
Alice     5      3      4      4      ?
User1     3      1      2      3      3
User2     4      3      4      3      5
User3     3      3      1      5      4
User4     1      5      5      2      1

Similarity between users
• Correlation coefficient

a, b: users
r_{a,p}: rating of user a for item p
P: set of items rated by both a and b

        Item1  Item2  Item3  Item4  Item5
Alice     5      3      4      4      ?
User1     3      1      2      3      3      sim(A,1) = 0.85
User2     4      3      4      3      5      sim(A,2) = 0.70
User3     3      3      1      5      4      sim(A,3) = 0
User4     1      5      5      2      1      sim(A,4) = -0.79
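The correlation formula is not legible in the transcript. The definitions given (a, b, r_{a,p}, P) match the standard Pearson coefficient, sim(a, b) = Σ_{p∈P} (r_{a,p} − r̄_a)(r_{b,p} − r̄_b) / (√Σ_{p∈P} (r_{a,p} − r̄_a)² · √Σ_{p∈P} (r_{b,p} − r̄_b)²). The sketch below assumes that form and, as a further assumption, takes each mean over the co-rated items P only; with that choice the values come out close to the ones listed in the table.

```python
import math

# Rating table from the slide; Alice has not rated Item5.
ratings = {
    "Alice": {"Item1": 5, "Item2": 3, "Item3": 4, "Item4": 4},
    "User1": {"Item1": 3, "Item2": 1, "Item3": 2, "Item4": 3, "Item5": 3},
    "User2": {"Item1": 4, "Item2": 3, "Item3": 4, "Item4": 3, "Item5": 5},
    "User3": {"Item1": 3, "Item2": 3, "Item3": 1, "Item4": 5, "Item5": 4},
    "User4": {"Item1": 1, "Item2": 5, "Item3": 5, "Item4": 2, "Item5": 1},
}

def pearson(a, b):
    """Pearson correlation over the items rated by both users."""
    common = sorted(set(ratings[a]) & set(ratings[b]))
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings[a][p] for p in common) / len(common)
    mean_b = sum(ratings[b][p] for p in common) / len(common)
    num = sum((ratings[a][p] - mean_a) * (ratings[b][p] - mean_b) for p in common)
    den_a = math.sqrt(sum((ratings[a][p] - mean_a) ** 2 for p in common))
    den_b = math.sqrt(sum((ratings[b][p] - mean_b) ** 2 for p in common))
    return num / (den_a * den_b) if den_a and den_b else 0.0

for user in ("User1", "User2", "User3", "User4"):
    print(user, round(pearson("Alice", user), 3))
```

Under these assumptions User1 and User2 come out as positively correlated neighbors, User3 as uncorrelated and User4 as negatively correlated, in line with the table.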

Computing predictions
• N is the number of neighbors: how should it be chosen?
• Weight the ratings of the different users => use the similarity as the weight
• Include the mean of the active user and of the neighboring users
• Neighbors: users 1 and 2 (sim > 0)

$$\mathrm{pred}(\text{Alice}, \text{item5}) = 4 + \frac{0.85\,(3 - 2.4) + 0.7\,(5 - 3.8)}{0.85 + 0.7} = 4.87$$
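The prediction above is a similarity-weighted average of the neighbors' deviations from their own means, added to the mean of the active user. A minimal sketch that reproduces the arithmetic, with the function and argument names being assumptions:

```python
def predict(active_mean, neighbors):
    """neighbors: list of (similarity, neighbor_rating, neighbor_mean) tuples."""
    numerator = sum(sim * (rating - mean) for sim, rating, mean in neighbors)
    denominator = sum(sim for sim, _, _ in neighbors)
    return active_mean + numerator / denominator

# Alice's mean is 4; User1 (sim 0.85) rated Item5 with 3 and has mean 2.4;
# User2 (sim 0.70) rated Item5 with 5 and has mean 3.8.
prediction = predict(4.0, [(0.85, 3, 2.4), (0.70, 5, 3.8)])
print(round(prediction, 2))
```

Both neighbors rated Item5 above their own averages, so the prediction lands above Alice's mean even though User1's raw rating (3) is below it.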

User-based vs. model-based approaches
• User-based CF is a "memory-based" method
  • The whole rating matrix is needed to find the neighbors and then make the prediction
  • In real applications there are more users (tens of millions) than items (millions) => the methods scale poorly
    • m = # users, n = # items
    • Space complexity O(m^2)
    • Time complexity O(m^2 n) (distance computation)
  • Rating matrices are very sparse => few co-rated items between users
• "Model-based" approaches
  • The model is learned offline (pre-processing or "model learning")
  • Only the prediction has to be computed online
  • The models are updated periodically
  • Building the model can be computationally expensive

Item-based CF
• IDEA:
  • Use similarities between items (rather than between users) to make the predictions
• EXAMPLE:
  • Look for the items that are similar to Item5
  • Use Alice's ratings of those items to predict her rating of Item5

        Item1  Item2  Item3  Item4  Item5
Alice     5      3      4      4      ?
User1     3      1      2      3      3
User2     4      3      4      3      5
User3     3      3      1      5      4
User4     1      5      5      2      1

• Similarity measure: cosine similarity
  • Take the average rating of each user into account -> the ratings are mean-centered
  • U: set of users who have rated both items a and b
• Prediction

$$\mathrm{pred}(\text{Alice}, \text{item5}) = \bar{r}_{\text{Alice}} + \frac{\sum_{i \in N} \mathrm{sim}(\text{item}_i, \text{item5})\; r_{\text{Alice},\text{item}_i}}{\sum_{i \in N} \mathrm{sim}(\text{item}_i, \text{item5})}$$
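The mean-centered cosine measure described above can be sketched as follows. The formula itself is missing from the transcript, so this assumes the usual "adjusted cosine" form: each rating is centered on its user's mean (taken here over all of that user's ratings, which is an assumption), and the cosine is computed over the users U who rated both items.

```python
import math

# Rating table from the slide; Alice has not rated Item5.
ratings = {
    "Alice": {"Item1": 5, "Item2": 3, "Item3": 4, "Item4": 4},
    "User1": {"Item1": 3, "Item2": 1, "Item3": 2, "Item4": 3, "Item5": 3},
    "User2": {"Item1": 4, "Item2": 3, "Item3": 4, "Item4": 3, "Item5": 5},
    "User3": {"Item1": 3, "Item2": 3, "Item3": 1, "Item4": 5, "Item5": 4},
    "User4": {"Item1": 1, "Item2": 5, "Item3": 5, "Item4": 2, "Item5": 1},
}

def user_mean(user):
    return sum(ratings[user].values()) / len(ratings[user])

def adjusted_cosine(item_a, item_b):
    """Cosine similarity between two items over mean-centered ratings."""
    users = [u for u in ratings if item_a in ratings[u] and item_b in ratings[u]]
    dev_a = [ratings[u][item_a] - user_mean(u) for u in users]
    dev_b = [ratings[u][item_b] - user_mean(u) for u in users]
    num = sum(x * y for x, y in zip(dev_a, dev_b))
    den = math.sqrt(sum(x * x for x in dev_a)) * math.sqrt(sum(y * y for y in dev_b))
    return num / den if den else 0.0

sim_1_5 = adjusted_cosine("Item1", "Item5")
print(round(sim_1_5, 2))
```

Item1 and Item5 come out as strongly similar (about 0.80), so Alice's high rating of Item1 would pull her predicted rating of Item5 upward.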

Item-based CF

Preprocessing
• The model has to be learned offline to solve the scalability problem
• Approach proposed by Amazon.com (in 2003) (29 million users and millions of items)
  • Compute all pairwise item similarities offline
  • Compute the prediction in real time
  • The neighborhood (N) is usually quite small (the user rates few items)
• This preprocessing works for item-based CF (and not for user-based CF) because similarities between items tend to be more stable than similarities between users
• Memory requirements (n items) => store n^2 similarities
  • In practice, far fewer: the matrix is sparse (items with no co-ratings)
• Reductions
  • Set a minimum threshold on co-ratings (item pairs with few ratings in common can be dropped, keeping only pairs rated by at least n' users)
  • Limit the neighborhood size (N) (this may affect accuracy)

$$\mathrm{pred}(\text{Alice}, \text{item5}) = \bar{r}_{\text{Alice}} + \frac{\sum_{i \in N} \mathrm{sim}(\text{item}_i, \text{item5})\; r_{\text{Alice},\text{item}_i}}{\sum_{i \in N} \mathrm{sim}(\text{item}_i, \text{item5})}$$

Sparsity problems
• Cold start problem
  • How can we recommend new items?
  • What do we recommend to new users?
  • Straightforward solutions
    • Ask or force users to rate a set of items
    • Use another method for these cases (content-based, demographic information, or non-personalized recommendations)
• In neighborhood-based approaches:
  • The set of similar users or items may be very small => poor predictions
  • Alternatives
    • Recursive CF
    • Exploit transitivity between neighbors

Algorithms for sparse datasets
• Recursive CF
  • There is a neighbor very close to Alice (User1), but he has not yet rated Item5.
  • Idea:
    • Apply some CF method to predict the rating User1 would give to Item5.
    • Use the predicted value instead of the rating of a less similar neighbor.

        Item1  Item2  Item3  Item4  Item5
Alice     5      3      4      4      ?
User1     3      1      2      3      ?      sim = 0.85 (predict the rating of User1)
User2     4      3      4      3      5
User3     3      3      1      5      4
User4     1      5      5      2      1

• Transitivity between neighbors
  • Graph-based models
    • Idea: use paths of a fixed length to make recommendations
    • Length 3: recommend item 3 to user 1
    • Length 5: item 1 can also be recommended to user 1

Algorithms for sparse datasets

SVD
• Matrix factorization models map users and items to a latent factor space of dimension f.
• The interactions between users and items are modeled as inner products in that space.
• This new space gives a new representation of the data that can help interpret the predictions:
  • In movie recommendation, factors such as comedy vs. drama, amount of action, …
  • The dimensions can also be completely uninterpretable

[Figure: Alice, Bob, Mary and Sue plotted in a two-dimensional latent space; both axes range from -1 to 1.]

Matrix factorization

U_k      Dim1    Dim2
Alice    0.47   -0.30
Bob     -0.44    0.23
Mary     0.70   -0.06
Sue      0.31    0.93

V_k^T
Dim1    -0.44   -0.57    0.06    0.38    0.57
Dim2     0.58   -0.66    0.26    0.18   -0.36

Σ_k      Dim1    Dim2
Dim1     5.63    0
Dim2     0       3.23

• SVD: $M_k = U_k \times \Sigma_k \times V_k^T$
• Prediction: $\hat{r}_{ui} = \bar{r}_u + U_k(\text{Alice}) \times \Sigma_k \times V_k^T(\text{EPL}) = 3 + 0.84 = 3.84$
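The prediction on this slide can be checked with the rounded factor values from the tables. Two points are assumptions rather than facts stated on the slide: that the item labelled EPL corresponds to the fourth column of V_k^T, and that Alice's mean rating is 3 (read off the "3 + 0.84" in the result). With the two-decimal factors shown, the interaction term comes out near 0.83, slightly below the 0.84 on the slide, which was presumably computed from unrounded values.

```python
# Rank-2 factors copied from the slide (rounded to two decimals).
U_k = {
    "Alice": (0.47, -0.30),
    "Bob": (-0.44, 0.23),
    "Mary": (0.70, -0.06),
    "Sue": (0.31, 0.93),
}
sigma_k = (5.63, 3.23)  # diagonal of Sigma_k
V_k_T = [
    (-0.44, -0.57, 0.06, 0.38, 0.57),   # Dim1
    (0.58, -0.66, 0.26, 0.18, -0.36),   # Dim2
]

def predict(user, item_column, user_mean):
    """r_hat = user mean + U_k(user) x Sigma_k x V_k^T(item)."""
    interaction = sum(
        U_k[user][d] * sigma_k[d] * V_k_T[d][item_column]
        for d in range(len(sigma_k))
    )
    return user_mean + interaction

EPL_COLUMN = 3  # assumption: EPL is the fourth item column
prediction = predict("Alice", EPL_COLUMN, user_mean=3.0)
print(round(prediction, 2))
```

Because Σ_k is diagonal, the product reduces to a weighted inner product between the user's and the item's two latent coordinates.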

Other CF methods
• Rule-based
• Probabilistic methods

        Item1  Item2  Item3  Item4  Item5
Alice     1      0      0      0      ?
User1     1      0      1      0      1
User2     1      0      1      0      1
User3     0      0      0      1      1
User4     0      1      1      0      0

RULE: Item1 => Item5
• support (2/4)
• confidence (2/2)
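The support and confidence of the rule can be recomputed from the binary table. In this sketch each user is a transaction holding the items marked 1; Alice is left out because her Item5 value is the one being predicted. The function names are assumptions.

```python
# Binary table from the slide, one set of "liked" items per user.
transactions = {
    "User1": {"Item1", "Item3", "Item5"},
    "User2": {"Item1", "Item3", "Item5"},
    "User3": {"Item4", "Item5"},
    "User4": {"Item2", "Item3"},
}

def support(lhs, rhs):
    """Fraction of all transactions that contain both sides of the rule."""
    both = sum(1 for items in transactions.values() if lhs in items and rhs in items)
    return both / len(transactions)

def confidence(lhs, rhs):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    with_lhs = [items for items in transactions.values() if lhs in items]
    if not with_lhs:
        return 0.0
    return sum(1 for items in with_lhs if rhs in items) / len(with_lhs)

print(support("Item1", "Item5"), confidence("Item1", "Item5"))
```

Since Alice has Item1 and the rule Item1 => Item5 holds with confidence 2/2, a rule-based recommender would suggest Item5 to her.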


        Item1  Item2  Item3  Item4  Item5
Alice     1      3      3      2      ?
User1     2      4      2      2      4
User2     1      3      3      5      1
User3     4      5      2      3      3
User4     1      1      5      2      1

X = (Item1 = 1, Item2 = 3, Item3 = …)

Advances in Collaborative Filtering

"Recommender Systems Handbook"

Yehuda Koren & Robert Bell

Objectives
Obtain a model that:
• Does not depend on the chosen similarity measure
• Includes implicit knowledge (different sources of knowledge)
  • Explicit information:
    • Users directly state their interest in products.
    • Star ratings
  • Implicit information:
    • User behavior is observed to obtain information.
    • Purchase history, browsing history, search patterns, mouse movements.
    • Which items have been rated?
• Is a factorizable neighborhood model (lower computational cost)
• Supports temporal modeling
• Is robust to overfitting

New neighborhood model
• The model includes weights to be optimized
• The "baseline" estimate of the rating r_ui is given by the model b_ui = μ + b_u + b_i:
  • μ: global mean rating
  • b_u: deviation due to user u (relative to the mean μ)
  • b_i: deviation of item i (relative to the mean μ)
• We want to estimate Joe's rating of the movie Titanic:
  • μ = 3.5 (mean of all ratings)
  • b_u = -0.3 (Joe is critical and rates below average)
  • b_i = 0.5 (Titanic is rated above average)
  • b_ui = 3.5 - 0.3 + 0.5 = 3.7
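The Joe and Titanic example is plain arithmetic, written out here so that the baseline can be reused as a building block. The function name is an assumption.

```python
def baseline(mu, b_u, b_i):
    """Baseline estimate b_ui = mu + b_u + b_i."""
    return mu + b_u + b_i

# Values from the slide: global mean 3.5, Joe's bias -0.3, Titanic's bias +0.5.
b_joe_titanic = baseline(mu=3.5, b_u=-0.3, b_i=0.5)
print(round(b_joe_titanic, 2))
```

The baseline captures only user and item tendencies; it says nothing yet about how this particular user interacts with this particular item.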

New neighborhood model
• SVD++ (Koren, 2008): allows implicit information to be included in the model
• For each item, a new factor c_ij is introduced to model the implicit information
• If there is implicit information about item j, c_ij refines the estimate for i.
• The coefficient c_ij is expected to be larger the more closely related items j and i are.
• |R(u)|^{-1/2} normalizes the sums with respect to the number of items rated by each user
• Reduce the complexity by considering only the items similar to i
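The prediction rule for this slide is not legible in the transcript. The sketch below assumes the form used in the Koren and Bell chapter this part of the deck follows, r̂_ui = μ + b_u + b_i + |R(u)|^{-1/2} Σ_{j∈R(u)} ((r_uj − b_uj) w_ij + c_ij), with explicit and implicit feedback taken over the same set R(u). All numeric values are hypothetical.

```python
import math

def predict(mu, b_u, b_i, rated, baselines, w_i, c_i):
    """Neighborhood rule with learned weights.

    rated:     {j: r_uj} ratings the user gave to other items (the set R(u))
    baselines: {j: b_uj} baseline estimates for those ratings
    w_i, c_i:  {j: w_ij} and {j: c_ij} for the target item i
    """
    if not rated:
        return mu + b_u + b_i
    norm = len(rated) ** -0.5  # the |R(u)|^(-1/2) normalization
    interaction = sum(
        (r_uj - baselines[j]) * w_i.get(j, 0.0) + c_i.get(j, 0.0)
        for j, r_uj in rated.items()
    )
    return mu + b_u + b_i + norm * interaction

# Hypothetical parameters for one user and one target item.
prediction = predict(
    mu=3.5, b_u=-0.3, b_i=0.5,
    rated={"A": 5, "B": 2},
    baselines={"A": 3.6, "B": 3.0},
    w_i={"A": 0.2, "B": 0.1},
    c_i={"A": 0.05},
)
print(round(prediction, 3))
```

Missing weights default to zero here, which mirrors the last bullet: restricting the sum to the items most similar to i amounts to dropping the remaining w_ij and c_ij.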

New neighborhood model
• Parameter estimation
  • Gradient descent algorithm
  • λ and γ are selected by cross-validation
  • The parameter k (neighborhood size) has to be selected
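To show how λ and γ enter the estimation, here is a minimal stochastic gradient descent sketch. It fits only the baseline part (μ, b_u, b_i) rather than the full neighborhood model, and it assumes the usual roles of γ as the step size and λ as the regularization weight; the default values are arbitrary, not tuned by cross-validation. The data is the Alice table used earlier in the deck.

```python
rows = {
    "Alice": [5, 3, 4, 4],
    "User1": [3, 1, 2, 3, 3],
    "User2": [4, 3, 4, 3, 5],
    "User3": [3, 3, 1, 5, 4],
    "User4": [1, 5, 5, 2, 1],
}
ratings = [
    (user, f"Item{k}", r)
    for user, values in rows.items()
    for k, r in enumerate(values, start=1)
]

def sse(data, mu, b_u, b_i):
    """Sum of squared errors of the baseline predictor on the data."""
    return sum((r - (mu + b_u.get(u, 0.0) + b_i.get(i, 0.0))) ** 2 for u, i, r in data)

def sgd_baseline(data, gamma=0.01, lam=0.02, epochs=100):
    """Fit user and item biases by SGD on the regularized squared error."""
    mu = sum(r for _, _, r in data) / len(data)
    b_u, b_i = {}, {}
    for _ in range(epochs):
        for u, i, r in data:
            bu, bi = b_u.get(u, 0.0), b_i.get(i, 0.0)
            err = r - (mu + bu + bi)
            b_u[u] = bu + gamma * (err - lam * bu)
            b_i[i] = bi + gamma * (err - lam * bi)
    return mu, b_u, b_i

mu, b_u, b_i = sgd_baseline(ratings)
print(round(sse(ratings, mu, {}, {}), 2), round(sse(ratings, mu, b_u, b_i), 2))
```

The second printed number should be smaller than the first: the learned biases explain part of the variance that the global mean alone leaves unexplained. The weights w_ij and c_ij of the full model would be updated with the same pattern, one gradient step per observed rating.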

Netflix

• The Netflix competition began in 2006
  • $1,000,000 prize for a 10% improvement over the Cinematch system (RMSE = 0.9514)
  • Ratings are integers between 1 and 5
  • Approximately 100 million ratings of 17,770 movies by 480,000 users
  • Test set: 1.4 million (recent) ratings
• Results are evaluated with RMSE
• In 2009 the "BellKor's Pragmatic Chaos" team achieved RMSE = 0.8567

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{(u,i) \in K} (r_{ui} - \hat{r}_{ui})^2}{|K|}}$$

Evaluation on Netflix

Advances in Collaborative Filtering 19

Formula (16) follows from approximating the variance of a correlation by $\sigma_{ij}^2 = 1/(n_{ij} - 1)$, the value for $\rho_{ij}$ near 0.

Notice that the literature suggests additional alternatives for a similarity measure [27, 28].

4.2 Similarity-based interpolation

Here we describe the most popular approach to neighborhood modeling, and apparently also to CF in general. Our goal is to predict $r_{ui}$ – the unobserved rating by user u for item i. Using the similarity measure, we identify the k items rated by u that are most similar to i. This set of k neighbors is denoted by $S^k(i;u)$. The predicted value of $r_{ui}$ is taken as a weighted average of the ratings of neighboring items, while adjusting for user and item effects through the baseline predictors

$$\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in S^k(i;u)} s_{ij}\,(r_{uj} - b_{uj})}{\sum_{j \in S^k(i;u)} s_{ij}}.$$ (17)

Note the dual use of the similarities for both identification of nearest neighbors and as the interpolation weights in equation (17).

Sometimes, instead of relying directly on the similarity weights as interpolation coefficients, one can achieve better results by transforming these weights. For example, we have found at several datasets that squaring the correlation-based similarities is helpful. This leads to a rule like: $\hat{r}_{ui} = b_{ui} + \frac{\sum_{j \in S^k(i;u)} s_{ij}^2\,(r_{uj} - b_{uj})}{\sum_{j \in S^k(i;u)} s_{ij}^2}$. Toscher et al. [31] discuss more sophisticated transformations of these weights.

Similarity-based methods became very popular because they are intuitive and relatively simple to implement. They also offer the following two useful properties:

1. Explainability. The importance of explaining automated recommendations is widely recognized [13, 30]; see also Chapter ??. Users expect a system to give a reason for its predictions, rather than present "black box" recommendations. Explanations not only enrich the user experience, but also encourage users to interact with the system, fix wrong impressions and improve long-term accuracy. The neighborhood framework allows identifying which of the past user actions are most influential on the computed prediction.

2. New ratings. Item-item neighborhood models can provide updated recommendations immediately after users enter new ratings. This includes handling new users as soon as they provide feedback to the system, without needing to re-train the model and estimate new parameters. This assumes that relationships between items (the $s_{ij}$ values) are stable and barely change on a daily basis. Notice that for items new to the system we do have to learn new parameters. Interestingly, this asymmetry between users and items meshes well with common practices: systems need to provide immediate recommendations to new users (or new ratings by old users) who expect quality service. On the other hand, it

Evaluation on Netflix
• Training time:
• Space complexity:
• Reducing the complexity reduces accuracy

Models based on matrix factorization
• SVD decompositions are hard to apply to the user-item rating matrix because of the large number of missing ratings.
• Computing the latent projections from very few matrix entries tends to produce overfitted models
• Early work relied on imputation, filling in the rating matrix:
  • Computationally very expensive:
    • Computing the imputed values
    • Handling the resulting matrix
  • If the imputation method is inaccurate, the rating matrix can be distorted

Model factorization
• We factorize the coefficients that relate the items: $w_{ij} = q_i^T x_j$
• The term … depends only on the user u (not on the item i) => this makes it easier to fit the model parameters.

Model factorization
• Training time:
• Space complexity:
• Results on Netflix

Extension to user-user
• The weights now measure the relations between users
• R(i) is the set of users who rated item i
• USE: applications in which the items change quickly (e.g. news)
• DRAWBACK: complexity (m >> n)
  • Training time
  • Space complexity
• SOLUTION: factorization
  • Training time
  • Space complexity: O(n + mf)
• Does not take implicit knowledge into account

Evaluation on Netflix
• Item-item model
• User-user model
• The user-user model is faster
• Its accuracy is worse, but it improves on the model without implicit knowledge (>0.91)

Model fusion
• Classical fusion: the models are trained separately and then combined
• Joint optimization of the models:
• RESULTS:
  • 100 factors: RMSE = 0.8966
  • 200 factors: RMSE = 0.8953
• Neighborhood models can also be combined with latent factor models
  • Item-item model + SVD++: RMSE 0.887

• The starting model would now be:

• The item bias is modeled with a stationary part and a time-dependent part (modeled by bins)

• The user bias is usually modeled with

• Occasionally there are anomalous spikes in behavior; these are modeled by including a parameter b_{u,t} that can absorb the anomaly.


We start with our choice of time-changing item biases $b_i(t)$. We found it adequate to split the item biases into time-based bins, using a constant item bias for each time period. The decision of how to split the timeline into bins should balance the desire to achieve finer resolution (hence, smaller bins) with the need for enough ratings per bin (hence, larger bins). For the movie rating data, there is a wide variety of bin sizes that yield about the same accuracy. In our implementation, each bin corresponds to roughly ten consecutive weeks of data, leading to 30 bins spanning all days in the dataset. A day t is associated with an integer Bin(t) (a number between 1 and 30 in our data), such that the movie bias is split into a stationary part and a time changing part

$$b_i(t) = b_i + b_{i,\mathrm{Bin}(t)}.$$ (6)

While binning the parameters works well on the items, it is more of a challenge on the users side. On the one hand, we would like a finer resolution for users to detect very short lived temporal effects. On the other hand, we do not expect enough ratings per user to produce reliable estimates for isolated bins. Different functional forms can be considered for parameterizing temporal user behavior, with varying complexity and accuracy.

One simple modeling choice uses a linear function to capture a possible gradual drift of user bias. For each user u, we denote the mean date of rating by $t_u$. Now, if u rated a movie on day t, then the associated time deviation of this rating is defined as

$$\mathrm{dev}_u(t) = \mathrm{sign}(t - t_u) \cdot |t - t_u|^{\beta}.$$

Here $|t - t_u|$ measures the number of days between dates t and $t_u$. We set the value of β by cross validation; in our implementation β = 0.4. We introduce a single new parameter for each user called $\alpha_u$ so that we get our first definition of a time-dependent user-bias

$$b_u^{(1)}(t) = b_u + \alpha_u \cdot \mathrm{dev}_u(t).$$ (7)

This simple linear model for approximating a drifting behavior requires learning two parameters per user: $b_u$ and $\alpha_u$.

A more flexible parameterization is offered by splines. Let u be a user associated with $n_u$ ratings. We designate $k_u$ time points – $\{t_1^u, \ldots, t_{k_u}^u\}$ – spaced uniformly across the dates of u's ratings as kernels that control the following function:

$$b_u^{(2)}(t) = b_u + \frac{\sum_{l=1}^{k_u} e^{-\sigma |t - t_l^u|}\, b_{t_l}^u}{\sum_{l=1}^{k_u} e^{-\sigma |t - t_l^u|}}$$ (8)

The parameters $b_{t_l}^u$ are associated with the control points (or, kernels), and are automatically learned from the data. This way the user bias is formed as a time-weighted combination of those parameters. The number of control points, $k_u$, balances flexibility and computational efficiency. In our application we set $k_u = n_u^{0.25}$, letting it grow with the number of available ratings. The constant σ determines the smoothness of the spline; we set σ = 0.3 by cross validation.
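The binned item bias of equation (6) and the linear user drift of equation (7) are easy to sketch. The following is an illustration under assumptions: days are integers counted from the start of the dataset, bins are 70 days wide (roughly ten weeks) and numbered from 0 rather than from 1, and all bias values are hypothetical.

```python
import math

def dev(t, t_mean, beta=0.4):
    """Time deviation dev_u(t) = sign(t - t_u) * |t - t_u|^beta."""
    diff = t - t_mean
    if diff == 0:
        return 0.0
    return math.copysign(abs(diff) ** beta, diff)

def user_bias_linear(b_u, alpha_u, t, t_mean, beta=0.4):
    """Equation (7): b_u(t) = b_u + alpha_u * dev_u(t)."""
    return b_u + alpha_u * dev(t, t_mean, beta)

def item_bias_binned(b_i, b_i_bins, t, bin_size_days=70):
    """Equation (6): b_i(t) = b_i + b_{i,Bin(t)}, with 0-based bins (assumption)."""
    return b_i + b_i_bins.get(t // bin_size_days, 0.0)

# Hypothetical user whose mean rating date is day 100.
late = user_bias_linear(b_u=-0.3, alpha_u=0.1, t=116, t_mean=100)
early = user_bias_linear(b_u=-0.3, alpha_u=0.1, t=84, t_mean=100)
print(round(early, 3), round(late, 3))

# Hypothetical item with a learned offset for the second bin only.
print(item_bias_binned(0.5, {1: 0.2}, t=75))
```

With β below 1 the deviation grows sublinearly, so a rating given a year from the user's mean date shifts the bias much less than proportionally compared with one given a week away.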










Temporal modeling

• So the final model is:

• And it is learned by minimizing

• Results on Netflix
  • RMSE = 0.8885

Temporal modeling

36 Yehuda Koren and Robert Bell

with a latent factor model (SVD++), thereby achieving improved prediction accuracy with RMSE below 0.887. Therefore, other possibilities with potentially better accuracy should be explored before considering the integration of item-item and user-user models.

5.3 Temporal dynamics at neighborhood models

One of the advantages of the item-item model based on global optimization (Subsection 5.1), is that it enables us to capture temporal dynamics in a principled manner. As we commented earlier, user preferences are drifting over time, and hence it is important to introduce temporal aspects into CF models.

When adapting rule (34) to address temporal dynamics, two components should be considered separately. First component, $\mu + b_i + b_u$, corresponds to the baseline predictor portion. Typically, this component explains most variability in the observed signal. Second component, $|R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} (r_{uj} - b_{uj}) w_{ij} + c_{ij}$, captures the more informative signal, which deals with user-item interaction. As for the baseline part, nothing changes from the factor model, and we replace it with $\mu + b_i(t_{ui}) + b_u(t_{ui})$, according to (6) and (9). However, capturing temporal dynamics within the interaction part requires a different strategy.

Item-item weights ($w_{ij}$ and $c_{ij}$) reflect inherent item characteristics and are not expected to drift over time. The learning process should capture unbiased long term values, without being too affected from drifting aspects. Indeed, the time changing nature of the data can mask much of the longer term item-item relationships if not treated adequately. For instance, a user rating both items i and j high within a short time period, is a good indicator for relating them, thereby pushing higher the value of $w_{ij}$. On the other hand, if those two ratings are given five years apart, while the user's taste (if not her identity) could considerably change, this provides less evidence of any relation between the items. On top of this, we would argue that those considerations are pretty much user-dependent; some users are more consistent than others and allow relating their longer term actions.

Our goal here is to distill accurate values for the item-item weights, despite the interfering temporal effects. First we need to parameterize the decaying relations between two items rated by user u. We adopt exponential decay formed by the function $e^{-\beta_u \cdot \Delta t}$, where $\beta_u > 0$ controls the user specific decay rate and should be learned from the data. We also experimented with other decay forms, like the more computationally-friendly $(1 + \beta_u \Delta t)^{-1}$, which resulted in about the same accuracy, with an improved running time.

This leads to the prediction rule

$$\hat{r}_{ui} = \mu + b_i(t_{ui}) + b_u(t_{ui}) + |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} e^{-\beta_u \cdot |t_{ui} - t_{uj}|} \big( (r_{uj} - b_{uj}) w_{ij} + c_{ij} \big).$$ (44)

The involved parameters, $b_i(t_{ui}) = b_i + b_{i,\mathrm{Bin}(t_{ui})}$, $b_u(t_{ui}) = b_u + \alpha_u \cdot \mathrm{dev}_u(t_{ui}) + b_{u,t_{ui}}$, $\beta_u$, $w_{ij}$ and $c_{ij}$, are learned by minimizing the associated regularized squared error

$$\sum_{(u,i) \in K} \Big( r_{ui} - \mu - b_i - b_{i,\mathrm{Bin}(t_{ui})} - b_u - \alpha_u \mathrm{dev}_u(t_{ui}) - b_{u,t_{ui}} - |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} e^{-\beta_u \cdot |t_{ui} - t_{uj}|} \big( (r_{uj} - b_{uj}) w_{ij} + c_{ij} \big) \Big)^2 + \lambda_{12} \Big( b_i^2 + b_{i,\mathrm{Bin}(t_{ui})}^2 + b_u^2 + \alpha_u^2 + b_{u,t}^2 + w_{ij}^2 + c_{ij}^2 \Big).$$ (45)

Minimization is performed by stochastic gradient descent. We run the process for 25 iterations, with $\lambda_{12} = 0.002$, and step size (learning rate) of 0.005. An exception is the update of the exponent $\beta_u$, where we are using a much smaller step size of $10^{-7}$. Training time complexity is the same as the original algorithm, which is: $O(\sum_u |R(u)|^2)$. One can tradeoff complexity with accuracy by sparsifying the set of item-item relations as explained in Subsection 5.1.

As in the factor case, properly considering temporal dynamics improves the accuracy of the neighborhood model within the movie ratings dataset. The RMSE decreases from 0.9002 [17] to 0.8885. To our best knowledge, this is significantly better than previously known results by neighborhood methods. To put this in some perspective, this result is even better than those reported by using hybrid approaches such as applying a neighborhood approach on residuals of other algorithms [2, 23, 31]. A lesson is that addressing temporal dynamics in the data can have a more significant impact on accuracy than designing more complex learning algorithms.

We would like to highlight an interesting point. Let u be a user whose preferences are quickly drifting ($\beta_u$ is large). Hence, old ratings by u should not be very influential on his status at the current time t. One could be tempted to decay the weight of u's older ratings, leading to "instance weighting" through a cost function like

$$\sum_{(u,i) \in K} e^{-\beta_u \cdot |t - t_{ui}|} \Big( r_{ui} - \mu - b_i - b_{i,\mathrm{Bin}(t_{ui})} - b_u - \alpha_u \mathrm{dev}_u(t_{ui}) - b_{u,t_{ui}} - |R(u)|^{-\frac{1}{2}} \sum_{j \in R(u)} \big( (r_{uj} - b_{uj}) w_{ij} + c_{ij} \big) \Big)^2 + \lambda_{12}(\cdots).$$

Such a function is focused at the current state of the user (at time t), while de-emphasizing past actions. We would argue against this choice, and opt for equally weighting the prediction error at all past ratings as in (45), thereby modeling all past user behavior. Therefore, equal-weighting allows us to exploit the signal at each of the past ratings, a signal that is extracted as item-item weights. Learning those weights would equally benefit from all ratings by a user. In other words, we can deduce that two items are related if users rated them similarly within a short time frame, even if this happened long ago.
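The time-decayed interaction term of rule (44) can be sketched as follows, together with the two decay forms mentioned in the excerpt. The function names, the data layout and all numeric values are hypothetical; only the formulas come from the text.

```python
import math

def exp_decay(beta_u, dt):
    """Exponential decay e^(-beta_u * dt)."""
    return math.exp(-beta_u * dt)

def hyperbolic_decay(beta_u, dt):
    """The cheaper alternative (1 + beta_u * dt)^(-1)."""
    return 1.0 / (1.0 + beta_u * dt)

def temporal_interaction(t_ui, rated, w_i, c_i, beta_u, decay=exp_decay):
    """Interaction part of rule (44).

    rated: {j: (r_uj, b_uj, t_uj)} for the items j in R(u)
    """
    if not rated:
        return 0.0
    norm = len(rated) ** -0.5
    return norm * sum(
        decay(beta_u, abs(t_ui - t_uj)) * ((r_uj - b_uj) * w_i.get(j, 0.0) + c_i.get(j, 0.0))
        for j, (r_uj, b_uj, t_uj) in rated.items()
    )

w_i = {"A": 0.2}
# Same rating of item A, given one day versus 101 days before the target rating.
recent = temporal_interaction(101, {"A": (5, 3.5, 100)}, w_i, {}, beta_u=0.01)
old = temporal_interaction(101, {"A": (5, 3.5, 0)}, w_i, {}, beta_u=0.01)
print(round(recent, 3), round(old, 3))
```

The same rating contributes less when it lies further from the target date, and a larger beta_u (a faster-drifting user) shrinks old contributions more quickly. Passing `decay=hyperbolic_decay` swaps in the second form without changing anything else.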
