keeping promises

I said I'd have a similarity matrix for you, didn't I? Well, here it is.

congressional similarity matrix

The Sunlight Foundation is holding a mashup contest, and I've been screwing around with the data they've made available on opencongress.org to see if there's anything cool that I can do with it.

I started off by scraping every member of congress's voting record, then computing, for every pair of members, how similar their records are (this photo wasn't taken just because I'm pretentious — I had another reason, too). This graph is sort of my way of checking my work so far: it orders members by party, then state, then district. So the block in the upper left represents the democrats, and the one in the lower right the republicans. The diagonal white line is where each member is plotted against themselves (their records will always be identical, of course, so it's worth blanking this out to make clear that it isn't meaningful data). The brighter the color, the more similar the voting records.
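For the curious, the comparison itself is nothing fancy. Here's a simplified sketch of the idea in Python (the file name, the +1/-1/0 vote coding, and the agreement-fraction similarity measure are illustrative assumptions, not necessarily what my actual script does):

    import numpy as np

    # Assumed layout: one row per member, one column per roll call,
    # votes coded +1 (yea), -1 (nay), 0 (absent or not yet serving).
    votes = np.loadtxt("votes.tsv", delimiter="\t")

    n = votes.shape[0]
    similarity = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Only count roll calls where both members actually voted.
            both = (votes[i] != 0) & (votes[j] != 0)
            if both.any():
                similarity[i, j] = (votes[i][both] == votes[j][both]).mean()

    np.fill_diagonal(similarity, 0)  # blank the self-comparison diagonal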

Of course, some members are too new to have cast enough votes to make analyzing them worthwhile, and I haven't bothered to filter them out yet (hence the scattered black dots). And this isn't the interface I intend to use for this data — it's static and visually boring, and unlikely to wow any judges.

But it's still kind of interesting, and you can draw a few conclusions from it (many of which are easier to see in the black & white version, behind the jump). For one thing, you can see that the republicans vote like one another more consistently than the democrats do — as a result, their block is darker (on the black & white graph; it's brighter on the color version, but it's a little hard to compare red pixels to blue ones directly due to how our eyes work). For another, you can see bands of dissimilar color that are often a few pixels wide (again, particularly among the democrats). These correspond to particular states' delegations that tend to vote unlike the rest of their party. Unsurprisingly, these tend to be folks like Iowa democrats and Texas republicans.

Those are just my initial impressions, though, and I could be misinterpreting something. Tomorrow I'll put up a simple DHTML browser that lets you see which column belongs to which legislator. That should make it easier to see whatever glaring mistakes I've made.

congressional similarity matrix (black & white version)

Comments

Next step -- clustering!

Hierarchical clustering would probably be the most fun. (If you want to do it real easy-like, you could go here:
http://rana.lbl.gov/EisenSoftware.htm
and grab both the 'Cluster' software that does the clustering, and 'TreeView', which gives you a reasonable tree-based view of the hierarchical results. Of course, there are a lot of other freely-available clustering tools out there, but Eisen's stuff is kinda standard among certain subsets of the bioinformatics community.) And then you could see if certain branches of the hierarchical clustering correspond with well-known groups or sub-groups of congress-people.

Try it! It's easy and fun.

You could also cluster the columns of your vote-matrix, the votes themselves. Sort of an instant classification of legislation, based on how similarly different people vote on each measure.
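In matrix terms that's just running the same machinery on the transpose. With a members-by-roll-calls matrix like the hypothetical one sketched in the post:

    import numpy as np

    votes = np.loadtxt("votes.tsv", delimiter="\t")  # as in the earlier sketch

    # Rows of votes.T are roll calls, so correlating them compares
    # bills to one another rather than legislators.
    vote_similarity = np.corrcoef(votes.T)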

Of course, the next step after clustering is classification, too. Can you hold out a congressman or two, and then classify them based on their votes?
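The simplest thing that could possibly work there is a nearest-neighbor guess: hold a member out, find whoever votes most like them, and predict that colleague's party. A sketch, reusing the hypothetical vote coding and agreement measure from the post:

    def classify_held_out(votes, parties, held_out):
        """Guess a held-out member's party from their most similar colleague.

        votes:   NumPy array, members x roll calls, coded +1/-1/0
        parties: one party label per member
        """
        target = votes[held_out]
        best, best_sim = None, -1.0
        for i in range(len(votes)):
            if i == held_out:
                continue
            # Compare only roll calls where both members voted.
            both = (votes[i] != 0) & (target != 0)
            if both.any():
                sim = (votes[i][both] == target[both]).mean()
                if sim > best_sim:
                    best, best_sim = i, sim
        return parties[best]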

 

Cool. But what sort of input do those programs take? I'm on a Mac and my copy of Parallels has been misbehaving.

I'm definitely interested in ways of presenting this information in a coherent (and neat-looking) manner. I think I'm probably going to throw it into some kind of dynamic tree-view in Processing, grouping legislators by those they're most similar to. Next up, in terms of adding new information: adding senators, segmenting legislation by issue (although recalculating similarity for everything will take forever unless I can figure out a way to massively optimize my code), and maybe figuring each representative's distance from their party's modal vector (if that makes sense).
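By "modal vector" I mean something like the party's most common vote on each roll call. A rough sketch of what I have in mind (the sign-of-the-sum trick is just a cheap stand-in for a real per-vote mode):

    import numpy as np

    def party_modal_vector(votes, parties, party):
        """The most common vote (+1 or -1) on each roll call within a party."""
        members = votes[np.array(parties) == party]
        return np.sign(members.sum(axis=0))  # 0 where the party splits evenly

    def distance_from_mode(member_row, modal):
        """Fraction of roll calls where a member defects from the party mode."""
        both = (member_row != 0) & (modal != 0)
        return (member_row[both] != modal[both]).mean()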

 

Well, the Eisen programs work from tab-delimited text files. And I think that there's a version of Cluster that runs on Macs, but maybe not TreeView. And parsing the hierarchical clustering results of 'Cluster' isn't the most fun thing in the world... so maybe my suggestion wasn't so useful.
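Dumping your matrix to tab-delimited text is easy enough, in any case. Something like this generic version would get you most of the way (check Cluster's manual for the exact header rows it expects; 'names' is a hypothetical list of legislator labels):

    # similarity: the square matrix from your post; names: one label per member.
    with open("similarity.tsv", "w") as f:
        f.write("MEMBER\t" + "\t".join(names) + "\n")
        for name, row in zip(names, similarity):
            f.write(name + "\t" + "\t".join("%.4f" % v for v in row) + "\n")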

But when you're talking about 'grouping legislators by those they're most similar to,' then we're thinking along the same lines. This is clustering, and standard clustering usually comes in one of a few basic flavors: 'hierarchical', 'k-means', and some 'spectral' approaches.

A good intro to clustering in general, and k-means clustering in particular, is in Chap. 20 of MacKay's Information Theory, Inference, and Learning Algorithms (freely available, and very good). K-means clustering tries to divide up your total set of datapoints (legislators) into k groups, each of which is summarized by a single composite datapoint (what you call your 'modal vector,' I think). Most variations on the k-means technique come from how you summarize a group of datapoints, and how you calculate the distance of a datapoint to that summary.
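The whole algorithm fits in a few lines, too. Here's plain Lloyd's-style k-means with Euclidean distance and mean-as-summary (real implementations handle the corner cases more carefully):

    import numpy as np

    def kmeans(points, k, iters=20):
        """Alternate assigning points to the nearest centroid and
        re-summarizing each cluster as the mean of its members."""
        centroids = points[np.random.choice(len(points), k, replace=False)].astype(float)
        for _ in range(iters):
            # Distance step: squared Euclidean distance to every centroid.
            dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = dists.argmin(axis=1)
            # Summary step: each cluster's centroid becomes its members' mean.
            for j in range(k):
                members = points[labels == j]
                if len(members):  # leave an emptied cluster's centroid alone
                    centroids[j] = members.mean(axis=0)
        return labels, centroids

    # e.g., labels, centroids = kmeans(votes.astype(float), k=2)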

Hierarchical clustering still requires you to specify a distance function between datapoints, and a means of summarizing two or more datapoints into a new summary datapoint, but it doesn't require the number of clusters as an input. Instead, it builds a tree structure on the total set of datapoints: first, you find the two points that are closest, according to your distance function. Join them in your tree, and remove them from your dataset. Replace them with their 'summary', and repeat. Each time you find a new closest pair of points (which each might be summaries of earlier sets of points), you join their sub-trees and replace them all with a new (total) summary point. When you've joined the last two datapoints, you've built a total tree whose structure represents the ordering of the closest pairs that you found and joined since you started. Presumably (you hope) the trees and sub-trees in your hierarchical clustering correspond to natural divisions and sub-divisions in your datapoints.
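SciPy will do that whole join-and-summarize loop for you, if you'd rather not hand-roll it. A sketch starting from your symmetric similarity matrix (agreement fractions in [0, 1]; 'names' is again a hypothetical list of member labels):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram
    from scipy.spatial.distance import squareform

    # Turn similarity (fraction of agreeing votes) into a distance.
    distance = 1.0 - similarity
    np.fill_diagonal(distance, 0.0)        # self-distance must be zero
    tree = linkage(squareform(distance), method='average')
    dendrogram(tree, labels=names)
    plt.show()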

Finally, 'spectral' clustering methods work directly on graph representations of your datapoints. These techniques are related both to matrix representations of graphs (and to finding the connected components of those graphs) and to methods that factor a matrix into other matrices with certain properties. For instance, PCA (Principal Components Analysis) on the covariance matrix of a dataset that looks like yours is pretty standard.
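PCA, at least, is a one-screen job with the same hypothetical votes matrix (eigendecomposition of the covariance matrix; an SVD of the centered matrix is equivalent and often cheaper):

    import numpy as np

    votes = np.loadtxt("votes.tsv", delimiter="\t")  # as in the earlier sketches
    X = votes.astype(float)
    X -= X.mean(axis=0)                     # center each roll-call column
    cov = np.cov(X, rowvar=False)           # covariance between roll calls
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh, since cov is symmetric
    top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
    coords = X.dot(top2)                    # a 2-D point per legislator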

There are a lot of reasonable Wikipedia articles on all this stuff too, of course.

Anyway, this is all a really long-winded way of saying, "Yeah, if you put the legislator-voting data through some kind of clustering method, and then presented the results in an interactive environment that let a user browse the clusters as well as the individual legislators... that'd be All Kinds of Great."

 
