Labeling Subgraphs

Mon, Nov 14, 2011 - 3:53am -- Isaac Sukin

On Friday I went to a series of talks for the kickoff of the Marketing and Social Systems Engineering (MKSE) department at UPenn, chaired by one of my professors. There were four speakers -- I missed the second -- and the talks ranged from monetizing people's browsing behavior on the internet to proving that humans exist.

I've been building an open-source suite of social networking software as a hobby since 2007 so the underlying topic of network science has interested me for a long time. I have mainly focused on building user engagement. Towards that end I've thought a lot lately about the right way to represent relationships on a social networking website given varied successes and failures of Facebook's Friend Lists and new Smart Lists as well as Google+'s Circles. My expertise is in Drupal and the choices there all require tediously, manually building your network. I'm convinced that the best approach for usability is to automatically identify people's friend groups. In general it's a good principle not to make users do anything that is not directly related to accomplishing their task of participating on your site, and manually maintaining friend lists is awkward anyway.

The problem of automatically identifying friend groups can essentially be thought of as a subgraph problem -- that is, if we present a social network as a graph and select a node to represent you, how do we identify related cliques of which you are a part? And once we've found those cliques, how do we label them (i.e. how do we know what kind of relationship you have with those people)?

Andrew Tomkins, director of engineering at Google, was the first speaker. Among other things, he talked about algorithms for completing a subgraph with "missing" links -- basically if you're friends with two people they are something like 80% likely to know each other -- which he said Google was using to suggest people you might know on Google+. He also talked about Google+'s Circles, which are of course manually built. So after the talk I came up and said that sounds a lot like you're working on automatically constructing Circles like Facebook's Smart Lists. He wouldn't say whether Google was working on that but he did offer some tips about subgraph labeling which I'll get to in a second.

The third talk by Jon Kleinberg addressed the subgraph labeling problem directly. Dr. Kleinberg talked about research he did on geotags on Flickr photos and trying to identify whether they could be matched with textual tags to identify major landmarks around the globe (spoiler: they can). He brought up two labeling techniques: time and geography. Namely if you have identified two groups a user thinks is interesting, and that user has communicated with one of the groups more frequently, that group may be her work colleagues whereas the other group might be high school friends with which she hasn't talked in awhile. Additionally if two people have been in the same places within 30 minutes of each other, and those places are a certain large distance apart, it is extremely likely that those two people know each other, and depending on their age might often be family.

My problem is that Drupal is a Content Management System, and any kind of system can be plugged into it as an extension or "module." So even if I have a graph of the social network on a site, I might not know that a certain property means "age" or "geographic location" or "job" simply because the system is generic and those properties are created by an administrator in different ways (or not at all) for each site. This creates two problems: it makes finding relevant subgraphs harder, and it makes labeling them much harder.

Dr. Tomkins pointed out that one solution to the labeling problem is simply to have site administrators be able to explicitly tell the automatic-friend-group module what each of the properties mean. This would require knowing how properties could be defined (typically by the core Profile module) and having a predefined list of properties our new module would look at. Approaches for identifying the subgroups could also include a learning system where various decision metrics would be tested automatically and the most successful ones would be used for that site. Additionally there are tests that don't require knowing what specific properties mean like recency of communication, various graph properties (size of subgroup, clustering coefficient, etc.) that could be useful for both problems. Finally, if we know how user properties can be defined (such as by the core Profile module) but the administrator hasn't told us what each one means, we can still offer suggestions from the list of values users have entered to any of the properties and learn from the results, or even try to parse the property names for common values.

Ultimately, a social network has to be pretty gigantic for any of this to work or make a difference in people's experience on the site. But subgraph data can also be used to show more relevant content on the site, for example by bubbling up content on the activity stream that is more relevant to the people and groups you care most about. Hopefully some of these lessons will turn into code in the near future!