Sunday, February 23, 2014

Recommender systems vocabulary: What is a user preference?

Over the last couple of months, I played around with building some small recommender systems based on GitHub and Twitter data. The idea of a recommender is to predict the preference (or taste) of a user for „something“ (e.g. a GitHub repo or a Twitter user) based on some given data. This data is usually related to the user (e.g. past purchases on Amazon) or to the item (e.g. other people's ratings of the item, or some features of the content itself such as a genre). Since I find the terms „preference“ and „taste“ a bit vague, I wanted to do some research into their meaning and usage.
Surprisingly, I could not find any definition of the terms when I surveyed the recommender system literature. 

First of all, „taste“, even after looking at the Wikipedia page, remains a rather nebulous term with a lot of different meanings. The one that seems most closely related to recommender systems comes from sociology and describes taste as "an individual's personal and cultural patterns of choice and preference". So it seems to indicate something similar to preference, but more long-term or broad.

Looking at the Wikipedia article for „preference“, it seems that the term is primarily used in economics, but also has a definition in the field of psychology. The two definitions are related, but do not mean exactly the same thing. In psychology, a preference is a (not necessarily stable) judgement of an object by a person (in the sense of liking or disliking the object). In economics, the term is used to indicate that something is preferred over something else, which implies an ordering of a set of objects from the most liked to the least liked. The ordering can be based on the relative happiness, satisfaction, utility or enjoyment that the item provides to the person.

To sum up, both in psychology and in economics, preferences relate to some subjective evaluation of an object (liking or disliking). In economics, however, the term has the additional meaning of an order of liking from most to least. Applied to recommender systems, predicting how much a user will like an item in isolation is more in line with the psychology definition, whereas recommending a „top n“ list of the items the user will like most follows the economics definition.
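
To make the distinction concrete, here is a minimal Ruby sketch. The item names and predicted scores are made up for illustration, and the two helper methods are hypothetical names, not part of any recommender library.

```ruby
# Hypothetical predicted preference scores for one user (item => score on a 1..5 scale).
predicted = {
  "repo_a" => 4.2,
  "repo_b" => 2.9,
  "repo_c" => 4.8,
  "repo_d" => 3.5
}

# "Psychology" view: how much will the user like one item, taken in isolation?
def predicted_rating(scores, item)
  scores.fetch(item)
end

# "Economics" view: order the items by preference and keep the n most preferred.
def top_n(scores, n)
  scores.sort_by { |_item, score| -score }.first(n).map { |item, _score| item }
end

puts predicted_rating(predicted, "repo_b")   # 2.9
puts top_n(predicted, 2).inspect             # ["repo_c", "repo_a"]
```

Note that the top-n list only needs the relative order of the scores, not their absolute values.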

There are two primary ways to measure preferences: either explicitly, by asking the user (e.g. to rate their preference on a scale or to rank items by preference), or implicitly, by inferring the preference from the behavior of the person (e.g. purchase or click data). In psychology, there is a similar concept of "attitude", which describes a person's "expression of favor or disfavor toward a person, place, thing, or event" and is also measured both explicitly and implicitly. Studies correlating the two have shown only a moderate link between a person's explicit (stated) and implicit (revealed) attitudes, with the latter being better predictors of a person's behavior (S. Stürmer: Einführung in die Sozialpsychologie I, Fernuniversität Hagen, 2011).

Friday, December 20, 2013

Link: Glossary of fundamental machine learning terms

I just stumbled on something useful I have been compiling for myself and meaning to publish for a while: A glossary of fundamental machine learning terms at

Wednesday, November 6, 2013

New study project: A recommender for GitHub projects

This is something I started to develop with two other women, Lina and Julia, at the Berlin Geekettes Hackathon, to get some experience with a real-world recommendation project. Right now, it is a content-based recommender, but the project is perfect for playing around with other recommendation approaches.
The recommender allows users to browse a list of the top 5 recommended GitHub repositories on topics they are interested in. Interests can be entered manually or fetched from the user's XING profile. A demo site is hosted on Heroku. It currently uses only a small subset of GitHub repositories (approx. 2500), so it only works for common interests like Ruby or JavaScript.

We used Java and Spring MVC. The REST APIs are accessed via Scribe (XING) and Apache HttpClient (GitHub). The current recommender algorithm is a self-developed content-based approach.

There are three main parts:

  • Fetching the GitHub repository metadata (description and first 1000 characters of the readme file) and caching it in a MongoDB database. Currently this is done once initially for a subset of approx. 2500 repositories. A scheduled job to update the repo metadata and add newly created repos will be developed later, when the open problems (see below) are resolved.
  • Fetching the interests from the user directly and/or from their XING profile (uses the „haves“ field of the profile)
  • Calculating the recommendations (computes a score based on the overlap between the interest keywords and the concatenated readme and description, and keeps the five highest-scoring repos in a list that is returned after all repos have been analyzed)
 The source code is available on GitHub
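
The scoring step can be sketched roughly as follows. This is a Ruby illustration under stated assumptions, not the project's actual Java code: the tokenizer, the method names and the sample repositories are all made up, and it sorts the scored repos instead of maintaining a running top-five list.

```ruby
# Split a text into unique lowercase word tokens (very naive tokenizer).
def tokens(text)
  text.downcase.scan(/[a-z0-9+#]+/).uniq
end

# Score = number of interest keywords that occur in description + readme.
def score(interests, repo)
  words = tokens("#{repo[:description]} #{repo[:readme]}")
  (interests.map(&:downcase) & words).size
end

# Keep the `limit` highest-scoring repos that match at least one interest.
def recommend(interests, repos, limit = 5)
  repos.map { |repo| [score(interests, repo), repo] }
       .select { |s, _repo| s > 0 }
       .sort_by { |s, _repo| -s }
       .first(limit)
       .map { |_s, repo| repo[:name] }
end

# Hypothetical cached metadata for three repositories.
repos = [
  { name: "rails-blog", description: "A blog engine in Ruby", readme: "Uses Rails and JavaScript" },
  { name: "js-charts",  description: "Charts in JavaScript",  readme: "No dependencies" },
  { name: "c-parser",   description: "A parser in C",         readme: "Fast" }
]

puts recommend(["Ruby", "JavaScript"], repos).inspect
```

Multi-word interests would need a different matching strategy, since this sketch compares single tokens only.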

Open problem
According to Wikipedia, there are over 5 million repositories on GitHub. Retrieving, storing and processing them all is pretty unrealistic. For example, it takes us 2 API requests per repository to fetch the description and readme, plus we need to paginate through the list of all repositories in increments of 100 to fetch the initial list of repository URLs. API use for an authenticated user is limited to 5000 calls per hour, which means at most 2500 repositories per hour. So fetching the repo data alone (without using any tricks to circumvent the rate limits) would take more than 5,000,000 / 2500 = 2000 hours, that is 83 days or almost 3 months. Continuing from there, the „full text search“ approach used right now to identify repo candidates runs in O(number of repositories), i.e. it scales linearly, which means it also takes significant computational effort to search through each repository metadata object for each user request.
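
The estimate can be checked with a few lines of Ruby. All numbers come from the paragraph above; the only addition is that the sketch also counts the pagination requests for the repository list.

```ruby
total_repos         = 5_000_000
requests_per_repo   = 2       # description + readme
page_size           = 100     # repositories per listing page
rate_limit_per_hour = 5_000   # authenticated API calls per hour

detail_requests  = total_repos * requests_per_repo          # 10_000_000
listing_requests = (total_repos.to_f / page_size).ceil      # 50_000

hours_details = detail_requests / rate_limit_per_hour.to_f                      # 2000.0
hours_total   = (detail_requests + listing_requests) / rate_limit_per_hour.to_f # 2010.0
days_total    = hours_total / 24                                                # 83.75

puts "details only: #{hours_details} h, with listing: #{hours_total} h (#{days_total} days)"
```

The pagination requests add only about 10 hours, so the two requests per repository dominate the total.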

Currently, I see two avenues from here: either parallelize the fetching and the calculation of recommendations, or find a way to identify a meaningful subset of the repositories to operate on. Parallelization is technically more interesting, but also much more challenging (meaning time-consuming for a hobby project) and, more importantly, requires more advanced computing infrastructure (meaning expensive for a hobby project). So I would prefer the second approach, which could be based on the GitHub Search API, which allows filtering repos based on e.g. popularity, recent updates or content keywords.
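
As a rough illustration of the second approach, the following Ruby sketch only builds a request URL for GitHub's repository search endpoint. No request is sent. The helper name is hypothetical, and the parameters (q, sort, order, per_page, page) and qualifiers (language:, stars:) are written as I understand them from the GitHub API documentation, so they should be checked there before use.

```ruby
require "uri"

# Build a repository search URL that filters by keyword, language and popularity.
def github_search_uri(keyword, language: nil, min_stars: nil, page: 1)
  qualifiers = [keyword]
  qualifiers << "language:#{language}" if language
  qualifiers << "stars:>=#{min_stars}" if min_stars
  query = URI.encode_www_form(
    "q"        => qualifiers.join(" "),
    "sort"     => "stars",   # most popular repositories first
    "order"    => "desc",
    "per_page" => 100,
    "page"     => page
  )
  URI("https://api.github.com/search/repositories?#{query}")
end

puts github_search_uri("recommender", language: "ruby", min_stars: 50)
```

Each interest keyword would then need only a handful of search requests instead of a scan over all cached repositories.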

Thursday, August 8, 2013

Mining personality clues from digital data: Some more thoughts

I would like to elaborate a little on the generalized approach to personality mining described in my last blog post. That approach, found in a paper by Kartelj, Filipovic and Milutinović, consists of three steps. For my own understanding, I want to add some comments on these steps.
  1. Picking and compiling the data from which personality traits should be identified: This data usually consists of digital traces of some sort of interactive behavior (e.g. written communication, expressions of preference or interest through Facebook likes, Twitter follows). This differs from traditional personality testing, where personality is measured based on self-reported survey answers, that is, based on the introspection of the surveyed person. It is not entirely clear if these two approaches measure exactly the same thing. They might differ in the same way that explicitly and implicitly measured attitudes are not exactly the same thing and are only correlated with medium strength.
  2. Determining the personality traits to be identified: The default model used by all of the studies I have seen so far seems to be the "Big Five" personality factors model. Being a well-respected personality model, the "Big Five" seems like an obvious choice for showing that personality can indeed be measured from online behavioral data. For applying the measured personality to different practical problems, however, I would love to see approaches that use different, more specialized personality models. E.g. Wang et al. (2009) used the Big Five model to suggest friends to bloggers based on personality similarities, but I am not sure if the Big Five model has ever been validated as a predictor of friendship potential. The same goes for predicting hotel choices, as has been done by Roshchina et al. (2011).
  3. Building a model of personality traits based on the selected data: Here, a standard supervised machine learning approach is usually used:
    1. First, the participants are selected and their personality is measured using an established, validated survey format.
    2. Second, the machine learning system (typical choices are regression or SVMs) is trained on a training subset of the gathered input data and the personality trait scores for each participant to obtain the personality model.
    3. Third, the obtained personality model is validated using either the correlation of the traditional measurement results with the new approach's results, or (which is the better choice, but frequently not done) using a validation set (a new subset of input data / trait score pairs).
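
The three sub-steps can be sketched in plain Ruby as follows. This is a toy illustration under strong assumptions: there is only a single indicator, the model is a simple least-squares line rather than an SVM, and the data points are made up. It shows validation on held-out pairs, measured by the correlation between predicted and survey-measured scores.

```ruby
# Pearson correlation between two equally long arrays of numbers.
def pearson(xs, ys)
  n  = xs.size.to_f
  mx = xs.sum / n
  my = ys.sum / n
  cov = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) }
  sx  = Math.sqrt(xs.sum { |x| (x - mx)**2 })
  sy  = Math.sqrt(ys.sum { |y| (y - my)**2 })
  cov / (sx * sy)
end

# Fit y = a + b * x by ordinary least squares; returns [a, b].
def fit_line(xs, ys)
  n  = xs.size.to_f
  mx = xs.sum / n
  my = ys.sum / n
  b  = xs.zip(ys).sum { |x, y| (x - mx) * (y - my) } / xs.sum { |x| (x - mx)**2 }
  [my - b * mx, b]
end

# Step 1: made-up pairs of [indicator value, survey-measured trait score].
data = [[1, 2.1], [2, 2.9], [3, 4.2], [4, 4.8], [5, 6.1], [6, 7.0], [7, 7.9], [8, 9.2]]

# Split into a training subset and a held-out validation subset.
train      = data.values_at(0, 2, 4, 6)
validation = data.values_at(1, 3, 5, 7)

# Step 2: train the model on the training subset only.
a, b = fit_line(train.map(&:first), train.map(&:last))

# Step 3: validate on pairs the model has never seen.
predictions = validation.map { |x, _score| a + b * x }
r = pearson(predictions, validation.map(&:last))
puts "validation correlation: #{r.round(3)}"
```

Correlating predictions with the same data the model was trained on would overstate its quality, which is why the held-out subset is the better choice.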

Sunday, May 19, 2013

Cool survey paper: "Novel Approaches to Automated Personality Classification: Ideas and Their Potentials"

I chose to read this paper by the mathematicians and engineers Kartelj, Filipovic and Milutinović mainly because it surveys several approaches to what they call the "problem of automatic personality classification". Additionally, they also suggest some applications of automatic personality classification and improvements to the algorithms used in each of the approaches. I found it to be a very useful introductory paper for understanding data mining approaches to personality testing. 

First, the authors describe the general steps of automatic personality classification: 
  1. Gathering corpus data: Selecting and collecting the material to be analyzed, e.g. blog posts, Facebook profiles or (more traditionally) student essays
  2. Determining the personality characteristics: Defining the personality model to use (the "Big Five" is the model of choice for all approaches surveyed in the paper, but other models are possible)
  3. Building a model: Defining the indicators (independent variables) for the personality traits (dependent variables), and the algorithms used to compute the personality traits based on the measurements for the indicators. Indicators might be word choice and frequency in blog posts, essays or Facebook profiles, friend and follower counts, or Facebook likes. For word choice and frequency, there exist databases of words that have been established to correlate with certain Big Five personality traits, such as LIWC (Linguistic Inquiry and Word Count) and the MRC (Medical Research Council) Psycholinguistic Database. Algorithms used to compute the personality traits include linear regression, clustering, ranking, support vector machines (SVMs) and others.
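
To illustrate what word-frequency indicators might look like, here is a minimal Ruby sketch. The two categories and their word lists are invented for this example and are not taken from any real word database. The resulting relative frequencies would serve as the independent variables of the model.

```ruby
# Relative frequency of each lexicon category among the words of a text.
def category_frequencies(text, lexicon)
  words = text.downcase.scan(/[a-z']+/)
  return {} if words.empty?

  lexicon.each_with_object({}) do |(category, vocabulary), freqs|
    hits = words.count { |word| vocabulary.include?(word) }
    freqs[category] = hits.to_f / words.size
  end
end

# Hypothetical mini-lexicon; real databases are far larger.
lexicon = {
  "social"   => %w[friend friends party we talk],
  "negative" => %w[worried sad afraid hate]
}

features = category_frequencies("We had a party with friends, but I was worried.", lexicon)
puts features.inspect
```
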
In the second section, the authors survey several studies undertaken to identify personality traits from different types of internet corpus data, plus one paper that uses automatic methods on student essays. All except for the last one of these use linguistic approaches to identify personality traits and some sort of machine-learning algorithm to compute them.
  • Identifying personality traits from Tripadvisor comments based on LIWC and MRC (Roshchina et al., 2011): The mined personality traits are fed into a recommender system that identifies hotel recommendations based on users with similar personality. The paper describes the architecture of the Tripadvisor system and compares the performance of different machine-learning algorithms on the automatic personality classification (APC) problem.
  • Identifying personality through computational analysis of student essays (Mairesse et al., 2007): A paper that compares the performance of three different statistical methods for identifying personality traits from student essays using linguistic cues. The developed "Personality Recognizer Algorithm" is, for example, also used in the Tripadvisor paper.
  • Suggesting friends to bloggers based on automatic personality assessment of blog posts: Wang et al. (2009) present an approach to suggesting other bloggers as potential friends based on the similarity of the bloggers' personalities.
  • Lexical Predictors of Personality Type (Sushant et al., 2005) is a study of different linguistic approaches to automatic personality identification. Differing from the common word database approaches, they consider text stylistics and usage of function words to identify the traits extraversion and neuroticism.
  • In Our Twitter Profiles, Our Selves: Predicting Personality with Twitter (Quercia et al., 2011), the authors describe an approach to mine personality data from Twitter profiles based on three statistics: "following", "followers", and "listed" counts.

Wednesday, March 27, 2013

Gathering Data: Connecting to Facebook using Ruby

One of my target sources for gathering psychological data to analyze is, unsurprisingly, Facebook. So this weekend I set out to learn how to read and write data using the Facebook API. Since I also wanted to practice my Ruby, I decided to use Ruby and Rails rather than the standard PHP to accomplish this. Very luckily for me, an "Advanced Workshop" run by the wonderful Rails Girls Berlin took place this Saturday, and the coaches included Alex Koppel, the creator of a Ruby-based toolkit for the Facebook Graph API. He tutored me in setting up my first Rails-based Facebook application. Here, I want to share how I accomplished this. It turned out to be much easier and more fun than I imagined :)

There are three main steps needed to connect to Facebook from Ruby on Rails:
  1. Register your new application on the Facebook Developer Site. This simply means filling out a short form on the site to acquire the identification data, most importantly the App ID/API key and App secret.
  2. Authentication: Provide a mechanism for a user to log into their profile and grant you the necessary permissions. Facebook uses the standard OAuth for this. OAuth works roughly in this order:
    1. The app sends a request (read or write) on behalf of the user to the website he wants to access (e.g. Facebook). The app is commonly identified to the website by using the "App ID/API key". 
    2. The website presents a login screen to the user, requesting the necessary permissions on behalf of the app. 
    3. When the user has logged in successfully and granted the permissions, the website returns a token identifying the user to the app using the callback URL that the app provided when requesting the login. 
    4. The app can now perform the allowed requests (read or write) on behalf of the user using the token in the communication with the website. 
  3. Use the token with the read and post functions of the Facebook Graph API to interact with the user profile. Facebook provides a rich and well-documented list of API functions with the new Graph API (older, deprecated versions of the API, such as the "REST API", also exist). In terms of the Graph API, any object existing on Facebook, such as a user, a page or a post, is considered a node in the graph, with the edges ("connections") representing the relationships between objects (e.g. user A being connected to page B via a like).
While the first of these three steps is dead simple, I want to explain in more detail how I achieved the last two steps using Ruby on Rails. For experimentation purposes, I generated an empty application called "fbapitest" running on localhost. 
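
Before going into the gems, the OAuth flow described above can be illustrated by the two URLs involved. This is only a sketch: the helper names are hypothetical, the endpoint paths and parameter names are written as I understand them from Facebook's login documentation and should be checked there, and in practice OmniAuth builds the login redirect for you.

```ruby
require "uri"

# First step of the flow: the URL the user is sent to in order to log in and grant permissions.
def facebook_login_url(app_id, callback_url, permissions)
  params = URI.encode_www_form(
    "client_id"    => app_id,                 # identifies the app to Facebook
    "redirect_uri" => callback_url,           # where Facebook sends the user back to
    "scope"        => permissions.join(",")   # permissions requested on behalf of the app
  )
  "https://www.facebook.com/dialog/oauth?#{params}"
end

# Last step of the flow: a Graph API request made on behalf of the user with the token.
def graph_api_url(path, token)
  "https://graph.facebook.com/#{path}?#{URI.encode_www_form('access_token' => token)}"
end

puts facebook_login_url("123", "http://localhost:3000/auth/facebook/callback",
                        %w[publish_actions user_likes])
puts graph_api_url("me/likes", "TOKEN")
```
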


I used the OmniAuth gem along with the Facebook strategy provider gem to establish the session with Facebook. The installation and general usage of these gems are well explained in the respective ReadMes on GitHub. Other than putting the gems in your Gemfile and installing the bundle, the most important part of the setup is creating a configuration file called 'config/initializers/omniauth.rb', which contains your API key and secret, and optionally the permissions you want to request from Facebook.

Rails.application.config.middleware.use OmniAuth::Builder do
  # Key and secret are read from environment variables; :scope lists the requested permissions
  provider :facebook, ENV['FACEBOOK_KEY'], ENV['FACEBOOK_SECRET'],
           :scope => 'publish_actions,user_likes'
end

See the ReadMe section "Usage" for details on how to configure this, and the section "Permissions" of the Facebook API documentation for the available permissions.

To use the gems after installation, create a link or button in your desired view pointing to the OmniAuth pseudo-URL '/auth/facebook' (a path on your own server). Example:
 <%= link_to 'Facebook Login', '/auth/facebook' %>
Calling this URL performs a redirect to Facebook, which requests a login and, if needed, additional permissions (if the user is already logged in and no additional permissions are needed, nothing will visibly happen). After a successful authentication with Facebook, two things happen:
  1. The public user data of the authenticated Facebook user is made available to the app in an authentication hash provided by OmniAuth, which in particular contains the token. This token needs to be stored somewhere, usually in the session.
  2. The user is redirected to the callback URL, which needs to be configured in routes.rb to point to the controller action that should be run after a successful login (usually one that calls the API using the token). Example:
    match '/auth/:provider/callback', :to => 'sessions#create'
An important bit to keep in mind is that if you decide you want to request more or other permissions from Facebook, you need to modify the 'config/initializers/omniauth.rb' config file. This file is only read during the Rails server startup, so you will need to restart the server to request the changed permissions.
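
As a sketch of what the callback action might do with the authentication data: as far as I know, OmniAuth exposes the auth hash as request.env["omniauth.auth"], with the keys "uid", "info" and "credentials". The helper name and the controller shown in the comment are hypothetical, and the sample values are made up.

```ruby
# Pull the pieces a controller typically needs out of the OmniAuth auth hash.
def facebook_session_data(auth)
  {
    "uid"   => auth["uid"],
    "name"  => auth["info"]["name"],
    "token" => auth["credentials"]["token"]
  }
end

# In a Rails controller this might be used like this (hypothetical SessionsController):
#
#   def create
#     session[:facebook] = facebook_session_data(request.env["omniauth.auth"])
#     redirect_to root_path
#   end

# Made-up auth hash with the same shape, for trying out the helper without Rails.
sample_auth = {
  "provider"    => "facebook",
  "uid"         => "1234567",
  "info"        => { "name" => "Jane Doe" },
  "credentials" => { "token" => "ABC123" }
}

puts facebook_session_data(sample_auth).inspect
```
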

Querying the Facebook API
I used the "Koala" gem to query the Facebook API, since I was lucky enough to be coached directly by Alex, the author of the gem. Koala provides an object-oriented interface for the API that I found quite simple to use. It follows the graph metaphor by providing methods such as "@graph.get_object()", "@graph.get_connections()" and "@graph.put_connections()" to interact with objects of the Facebook data graph. The data returned by Facebook is provided in hash format (to access the hash elements, use strings rather than symbols as keys, e.g. like["name"]). The ReadMe provides the basic syntax for calling the API, and the Facebook API documentation lists all the possible options for querying and posting. To translate the URL format used in the API documentation into Koala method calls, see the following examples:

Type of request | API syntax | Koala syntax
Requesting an object | | profile = @graph.get_object("me") or profile = @graph.get_object("ProfileOrPageIdNumber")
Requesting connection data | | friends = @graph.get_connections("me", "friends")
Requesting relational data | | @graph.get_connections("me", "mutualfriends/#{friend_id}")
Posting data | POST request to [USER_ID]/feed with token and message as parameters | @graph.put_connections("me", "feed", :message => "I am writing on my wall!")

There are more advanced functions in the Koala toolkit that I did not need so far, but are also documented in the ReadMe. A cool tool for exploring the options of the Graph API is the online API explorer.
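
The point about string keys is easy to trip over, so here is a small Ruby sketch. The sample data is made up and merely shaped like a list of likes as returned by the API. No Koala call is made.

```ruby
# Hypothetical sample, shaped like the result of @graph.get_connections("me", "likes").
likes = [
  { "name" => "Ruby on Rails", "category" => "Software", "id" => "1" },
  { "name" => "Berlin",        "category" => "City",     "id" => "2" }
]

names   = likes.map { |like| like["name"] }   # string keys work
missing = likes.first[:name]                  # symbol keys silently return nil

# Group the liked objects by category for further analysis.
by_category = likes.group_by { |like| like["category"] }

puts names.inspect
puts missing.inspect
puts by_category.keys.inspect
```
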

Sunday, March 17, 2013

Cool Research: Prediction of personal traits and attributes from Facebook likes

Michal Kosinski and David Stillwell (Cambridge Psychometrics Centre) and Thore Graepel (Microsoft Research) managed to predict personal traits such as gender, sexual orientation and intelligence from a person's Facebook likes using regression models, with a success rate of 60% to 90% (depending on the trait). The authors gathered the data of over 58,000 volunteers from the US who used the Cambridge-developed Facebook app "MyPersonality", which rates the user's "Five Factor" personality traits based on the open-source item pool IPIP. They combined this data with the data users provided through their Facebook profile (such as gender, age, sexual orientation) and through additional surveys (e.g. whether the parents divorced, alcohol and drug consumption). A "like" might include anything the user gave a "thumbs up" on Facebook, including friends' photos and status updates, but also Facebook pages, books, movies or songs.

For the binary attributes (predicted with logistic regression), the highest prediction accuracy was achieved for ethnicity (Caucasian vs. African American) and gender (0.93 success rate), the lowest for drug use (0.65 success rate). The other predicted traits include political preference (Democrat vs. Republican), sexual orientation, religion (Christianity vs. Islam), alcohol and cigarette consumption, relationship status (single vs. in a relationship) and whether the parents had separated by the time the user was 21.

For the numerical attributes (predicted with linear regression), the highest prediction accuracy was achieved for age (correlation between actual and predicted values of 0.75), the lowest for "satisfaction with life" (0.17 correlation). Somewhere in between (correlation range 0.3 to 0.53) fall the predictions for intelligence, emotional stability, agreeableness, extraversion, conscientiousness, openness, density of friendship network and number of Facebook friends. Compared to the predictive power of the original surveys used to measure these traits (Big Five model, intelligence and satisfaction with life), the prediction based on Facebook likes performs worse: between 1/3 of the predictive power (satisfaction with life) and 4/5 (openness), and for most traits about half.

Unfortunately the authors do not provide access to the actual regression models they obtained, making it hard to evaluate the quality of the models, replicate the findings or build on the results. For example, the high classification success for male sexual orientation might simply result from the fact that within the sample, around 95% of the users were heterosexual, so assigning a default value of "heterosexual" would result in 95% predictive success. 
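
The base-rate argument can be made concrete with a few lines of Ruby. The label counts below simply restate the 95% figure from the paragraph above and are not actual data from the study. The helper name is hypothetical.

```ruby
# Accuracy of a "model" that always predicts the most frequent label.
def majority_baseline_accuracy(labels)
  largest_class = labels.group_by { |label| label }.values.map(&:size).max
  largest_class.to_f / labels.size
end

# Made-up sample with the class balance mentioned above.
labels = ["heterosexual"] * 95 + ["homosexual"] * 5

puts majority_baseline_accuracy(labels)   # 0.95
```

Any reported success rate is therefore only meaningful in comparison to this baseline.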

Follow-up research pointers: The authors mention an array of related studies that predict similar traits based on other types of digital data collections such as web browsing history, the contents of personal Web sites, music collections, the number of friends, the density of friendship networks, the location within the friendship network or the language used on social networking sites (see intro section of the article for references to the respective papers).

Kosinski, M., Stillwell D.J., Graepel, T. (2013) Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences (PNAS)