Archive for the ‘tech’ Category

big data and doing big things

Spencer is right: this Wired piece about DARPA’s Nexus 7 initiative is very good. Nexus 7 is an ambitious data processing effort meant to synthesize both traditional signals (e.g. vehicle tracking data) and unorthodox signals (market fruit prices seem to be their favorite example) into useful intelligence through sophisticated analytic techniques taken from the social sciences.

And it’s a pretty good reminder of why I’m wary of the Big Data movement.  These were my two favorite bits:

On the surface, there wasn’t much to it: just a graph of violence in the Jalalabad region, and a plot of those fruit prices. When the level of violence was stable — reliably low, or reliably high — so were those prices. Fruit sellers knew what to expect. But when there were sudden swings in the number of attacks, the prices shot up.

Therefore, the Nexus 7 team said, you could use the fruit as an indirect indicator of instability.

The reaction was less than rapturous.

“Right from the start, I’m like: Oh. My. God,” one of the people who attended a Nexus 7 presentation tells Danger Room. “A high school kid could do that.”

Afterward, Dugan presented the pilot as a triumph — a “big breakthrough” that impressed a bevy of four-star generals.

Privately, she was underwhelmed. Dugan was looking for projects that could save troops’ lives, and maybe even bend the direction of the war. By that standard, fruit-price swings seemed pretty inconsequential.

But the presenters maintained an aura of confidence. Oh, this is just a test. Give us more data sources, they said, and we’ll make better connections. We’ve got the hardware: a cloud computing platform that would soak up all kinds of classified and open source intelligence data. We’ve got the software: these social science PhDs and counterinsurgency veterans, who can figure out how to apply that data to rebuild Afghanistan.

and:

“One assumed there was some secret mound of data to be exploited. But it’s just not true.”

I’ve fallen prey to this temptation: thinking that your mastery of awesome tools means you’re about to do some awesome stuff (perhaps via some cleverly counterintuitive Freakonomic insight). Unfortunately, it’s not that easy. You actually need to have a great idea before great things will happen, and it’s difficult to come up with great ideas unless you both know and care — deeply — about the topic you’re planning to examine.

It’s important to acknowledge that the story of Nexus 7 seems to be told, to some extent, from the perspective of people in the military establishment who feel insulted or threatened by the project.  But that in itself is telling: it’s never a good idea to enter a field of inquiry with the assumption that those who preceded you were well-meaning simpletons — particularly when your reasons for thinking so boil down to a difference in the complexity of your tools.

I think this same story is about to unfold in the tech industry, albeit with a more cheerful tone.  Consider this recent post from Read Write Web about the explosion in job listings mentioning the phrase “data scientist”:

“Right now, everybody with data knows that there’s value in there, that they should be doing something,” says Edd Dumbill, program chair for Strata, O’Reilly’s new conference on Data. “Trouble is, nobody’s entirely clear on the next steps, but they do know that a data scientist can help frame questions and transform data into useful insight.”

They don’t “know” this. They’re assuming it.  And this leaves me worried, because the ability to draw meaning from mountains of information is almost always going to depend on the specific question being examined more so than the tools being used or the investigator’s level of enthusiasm for the idea of quantitative analysis.

It’s not that I don’t believe in the techniques and tools that have these folks so excited.  It’s not even that I think nothing will come of data-rich firms applying quantitative analytic techniques. These things have got me excited, too!  I’m trying to make sure we take advantage of the same kinds of tools at work.  Still, there’s no substitute for good ideas.

To me, this wave of hype doesn’t seem much different from the one that occurred at the start of the last decade.  “Look at the power of webservers and online payment processing!” we exclaimed. “Can you imagine the benefits they’ll yield when applied to the problem of selling pet food?”

Those things are powerful. But that’s beside the point.

Wakemate

About three weeks ago I finally received my Wakemate. A part of the burgeoning quantified self movement and yet another example of a product made possible by the last half-decade’s debut of cheap silicon accelerometers, it’s exactly the kind of thing you’d expect me to buy.

Wakemate is built on three ideas borrowed from sleep research. First: we experience a recurring cycle of sleep states during a night’s rest. Pretty much everyone’s aware of this, if only because it was part of an episode of Star Trek. Over the course of a night you spend progressively less time in a deep sleep state, and more in light states where dreaming occurs.

Second: these sleep states are measurable using a technique called actigraphy. As this paper explains, during sleep the motion of your non-dominant wrist seems to correlate pretty well with more precise measures of sleep state. You can get a decent measurement of sleep state just by tracking what your left hand is up to.

Third: your level of grogginess upon waking varies depending on which part of your sleep cycle you’re in when your alarm goes off. This is known as sleep inertia, and the WM’s creators have a few paper excerpts about it here.

The Wakemate folks took these three ideas and combined them — in a way sure to elicit much (potentially justified) tongue-clucking from sleep researchers — into a product. Put on a wristband, load a program on your phone, and set a twenty-minute window during which you’d like to wake up. The device keeps watch during that time period for moments when you seem to be in a light sleep state, doing its best to find one and rouse you in a way that minimizes grogginess (if it doesn’t find one, it’ll wake you up at the end of the time window). The idea’s so clever that I barely care whether it works.
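For the curious, the wake-window logic described above can be sketched in a few lines of Python. This is a guess at the general shape, not Wakemate's actual code: the per-minute movement counts, the `LIGHT_SLEEP_THRESHOLD` cutoff and the function name are all invented for illustration.

```python
# Hypothetical sketch of a wake-window alarm; not Wakemate's real algorithm.
# Assumption: the wristband yields one movement count per minute, and more
# movement means lighter sleep.

LIGHT_SLEEP_THRESHOLD = 5  # made-up cutoff for "light sleep"


def pick_wake_minute(movement_per_minute, window_start, window_end):
    """Return the minute index at which to sound the alarm.

    Scans the window for the first minute that looks like light sleep.
    If none is found, falls back to the end of the window (the fail-safe).
    """
    for minute in range(window_start, window_end):
        if movement_per_minute[minute] >= LIGHT_SLEEP_THRESHOLD:
            return minute
    return window_end


# Invented movement counts, one per minute.
night = [0, 1, 0, 0, 2, 7, 1, 0]
print(pick_wake_minute(night, 2, 7))  # first light-sleep minute in the window
print(pick_wake_minute(night, 0, 4))  # no light sleep: fail-safe at window end
```

A real device presumably does something smarter than a fixed threshold, but the fail-safe branch is the part that matters for the rest of this story.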

Snakebit

I first heard about all of this from my colleague Kevin back in February of last year. It sounded like an interesting idea, and for just $5 you could reserve your place in line for the device (it ultimately cost me $50; it’s now selling for $60). Wakemate is a Y Combinator startup, and its founders went through a semi-hilarious series of problems as they tried to ship their first product. Bad wristbands. Delayed electronics. Problems with Apple certification. The thing finally arrived, months late; the next day I got an email warning me that the included power adapter might burn my house down. And for the first week or so, the app only woke me up at the end of the 20-minute window — at the fail-safe point — seemingly because it wasn’t able to communicate with the wristband (I had to reboot the latter unit multiple times to get the night’s data downloaded). With the exception of the charger (any USB adapter will do), all of these problems have been fixed. But it was a bumpy ride. Kevin still hasn’t received his.

Surprisingly Plausible

Here’s the source data from last Thursday’s sleep, and Wakemate’s classification of that data into sleep states.

This seems kind of reasonable! Check out the huge spike at the beginning of the accelerometer time series. That’s when I was still awake and reading. Over the course of the night I went through about four cycles, spending less time in deep sleep each iteration. You can see four clusters of movement data, too. This isn’t the cleanest night’s worth of data — I didn’t feel like clicking through all of them to find the tidiest — but as I’ve looked at these over the past few weeks, I haven’t yet seen any patterns that seemed implausible either in terms of the reported sleep cycle pattern or its correlation to the underlying movement data.

Does It Work?

At first I was a bit disappointed: the central gimmick of the WM didn’t seem to be working. If anything, I seemed to be groggier than usual when I woke up. But as I already mentioned, I eventually realized that the alarm was only going off at the end of the twenty-minute window. I emailed WM’s extremely responsive support line and was told that the issue had already been fixed in software and was just waiting on Apple certification. Happily enough, I was able to download the update by that evening. And although the days since have seen a suspicious number of wakings during the first minute of the alarm period, I’m actually surprised to report that it might be working. I’m still plenty groggy during the minute or two when I futz with the alarm (and report my level of alertness using the software slider). But I’ll be damned if I don’t seem to snap out of it sooner than usual.

On the other hand, this may not have anything to do with the timing of the alarm: it might just be that I’m getting more sleep. Which brings me to the best thing about Wakemate.

Data Porn

I was most excited for the alarm functionality, but the analytics package that WM provides has proven to be its most compelling feature. Your nightly sleep data is uploaded each morning and placed into an attractive interface. You can easily find information about time spent asleep, how long it took you to fall asleep, and how many times you woke up in the night. It’ll also show you how your recent performance in these areas compares to your career average, and to that of the entire population of WM users.

You can also tag each night’s sleep when you set the alarm — did you read before bed? go to the gym? drink alcohol? — and perform comparisons between tags.

Perhaps less helpfully, WM provides a “Sleep Score”. I can’t find any detailed information about how this is calculated — I suspect that this opacity is intentional, both to allow the formula to be tweaked and to keep users from trying to game it. And while it’s sort of amusing to have competitive sleeping leaderboards (how does Justin Sweetman sleep so virtuosically?), the scores seem to me to be basically bullshit. I tend to score highest when I’ve gone to bed late and with alcohol in my system; as you might guess, my scores don’t correlate very well with how rested I feel. You seem to be penalized for “low quality” sleep, even if it means more sleep — in other words, collapsing from exhaustion and sleeping like a corpse for three hours might earn you a higher sleep score than getting a normal night’s rest.

Since I’m on a bit of an Excel kick, here’s a plot of my sleep scores versus minutes asleep (WM recently added the ability to download your data as a CSV, which is nice of them).

Admittedly, I don’t yet really have enough data for that trend line to be meaningful. But I have my suspicions.
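If you'd rather skip Excel, the same trend line can be fit with a few lines of Python. This is a minimal sketch: the column names `minutes_asleep` and `sleep_score` are assumptions about the export, and the sample rows are invented purely to exercise the code. They are not real sleep data and prove nothing.

```python
import csv
import io


def fit_trend_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Placeholder rows with invented numbers and invented column names;
# the real export's layout may differ.
SAMPLE_CSV = """minutes_asleep,sleep_score
300,70
360,68
420,66
480,64
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
minutes = [float(r["minutes_asleep"]) for r in rows]
scores = [float(r["sleep_score"]) for r in rows]
slope, intercept = fit_trend_line(minutes, scores)
print(f"score changes by {slope:.3f} per extra minute asleep")
```

To run it against a real export, swap the `io.StringIO(SAMPLE_CSV)` for an open file handle and adjust the column names to match.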

Still, I’ve actually found the product to be worthwhile, not just as an interesting exercise in navel-gazing. For instance, it turns out there’s a reason my Sundays aren’t very productive:

I honestly had no idea I was getting so little rest on weekends.

In general, I’d say that it’s been surprising and useful to have the amount of time I spend asleep quantified. I’ve always needed a relatively large amount of rest in order to function. I have nothing but admiration (and jealousy) for those of you who get five hours a night, hop out of bed, write a thousand words and run a half marathon. But I just can’t do it. At the absolute depths of puberty/hibernation my body, when left to its own devices, was helping itself to twelve or thirteen hours of sleep a night. That’s thankfully not necessary any more, but I’m certainly not at my best when I get less than eight hours.

Wakemate has actually been useful for telling me when I’m not taking very good care of myself, and has provided a small but real incentive for paying attention to when I should call it a night. Admittedly, you can see that incentive diminishing in the above graph as the novelty of the WM wears off. Still, I’ve found the information useful.

Anyway, if it sounds appealing, you might want to give it a try — although until I’m more convinced of the alarm’s utility, I’d suggest considering the FitBit as well. I haven’t tried FB, but in addition to sleep analysis it quantifies your activity during the day, which might be interesting. It hasn’t got any anti-sleep-inertia alarm functionality, but perhaps that’ll be added later.

shell scripts work better than giant puppets

There were two good pieces on NPR this morning discussing the reaction to Wikileaks.

First, there’s this story, which briefly gets at some of the concerns about Twitter as a platform for activism that I tried to express in this piece. The point Shirky makes is an important one: so much vital expression now happens in privately-controlled mediums that nominal speech rights are often a bit beside the point. This, combined with the fact that we seem to be relitigating the “should shield laws apply to bloggers?” question, makes me think that we’re collectively more confused about how you can and ought to be able to speak on the internet than I would’ve expected. Hopefully we’ll muddle our way through to a productive conclusion.

Second, they did a piece discussing the costs of prosecuting participants in Operation Payback, and it’s also worth a read. I’m sympathetic to the basic dilemma: going after the participants is a lot like calling in Interpol for a vandalism case. Individual actors are only responsible for substantial damage when considered together; and nobody’s very happy about the idea of locking up bored rich kids. Still, it’s worth keeping in mind that this is what a hard computer crime problem looks like. It’s not thrilling to me to read that the FBI and DOJ are basically shrugging their shoulders and saying, “Eh, that sounds too hard.” This is all the more galling when these agencies are bothering to pursue computer crime enforcement agendas around intellectual property. Like it or not, distributed international attacks are the class of problem most in need of solving.

One other thing I’d add: keep an eye on how useful the “cybersecurity” community makes itself during this process. My guess is that they’ll keep their traps shut (or at least spend their time gleefully fretting about what kind of funding requests Stuxnet will necessitate) while the “computer security” community does the hard work of grappling with the Operation Payback DDoS. On the other hand, I suppose the cybersec guys were probably the ones behind DDoSing Wikileaks, so, y’know, your tax dollars at work.

Third: about that DDoS. I’m not sure what to say about it, really. I find the argument that it can be considered civil disobedience to be more compelling than I would’ve expected. I’m somewhat sympathetic to arguments like the one Tim makes here (or that’s discussed in this comment). But I think the idea that past instances of civil disobedience were done in an orderly manner by uniformly thoughtful people who did their best not to inconvenience anyone else is probably wishful thinking. I’m no expert on this stuff, but that strikes me as the sort of perceptual shift that happens after a movement is vindicated by history. You’re going to have to piss some people off. Otherwise you might as well go schedule a protest march for all the good it’ll do you.

And although I’m not particularly sympathetic to Payback’s targeting of Amazon — cowardly though the firm’s behavior may have been, too many others rely on the AWS infrastructure, and there are plenty of hosts out there — the payment-processing and DNS control points now being counter-attacked by Operation Payback really do represent worrying concentrations of power. These systems are controlled by entities that are immune to public oversight, yet seem to be completely compliant when state agents ask them to restrict their customers’ liberty. That’s a recipe that should worry libertarians a lot more than it seems to.

On the other hand, there’s Gawker. And although it’s probably a mistake to conflate all of these actors, there do seem to be some connections. The “Anonymous” community is becoming a sort of petulant digital Fight Club that’s going to be very difficult to combat, and which behaves in a chaotic and unprincipled manner. That they’re targeting those who dare to talk about them is pretty dismaying; that the only way to respond seems likely to be a combination of quiet state co-option of ISPs and throwing children in jail — that’s incredibly depressing.

a charlatan-friendly ecosystem

Alex Payne points to a blog post by Ben Laurie that discusses Diaspora and Haystack, and how projects like these can attract huge amounts of press, only to flame out as their charismatic founders’ incompetence is revealed.

I agree with Ben’s post, but it’s worth being a bit more explicit about what allows these situations to arise: the quality of most tech journalism is abysmal. I mean really inexcusably bad. Mainstream publications regularly assign the software industry to writers whose understanding of the field would be unacceptable in an intern. The most esteemed practitioners in the tech press are either focused on the consumer electronic user experience or are building personal brands around faith-based tech triumphalist movements.

In this sort of environment, it should be no surprise that an embarrassing hype cycle can emerge — one that talented self-promoters will use to enhance their status and wealth. I find it difficult to assign all that much blame to those self-promoters: the whole problem is that they don’t know any better. What more can we expect? Besides, it’s very easy to start believing your own bullshit once people with seemingly-meaningful professional credentials start validating it. Self-promoters will self-promote; it’s not realistic to expect them to be the ones providing diligence.

I suspect that the problem may have to do with the structure of the industry: if you know much about it, you’re probably going to be able to make more money participating in it than writing about it. I don’t know enough about finance to really judge, but it seems as though that press sector suffers from a similar systemic disability — certainly all can agree that the financial press didn’t cover itself in glory in advance of the recent financial crisis. Once that story became big enough, talented generalist journalists filtered in and did the job properly.

But unless and until the skill premium for the software industry diminishes relative to journalism I’m not sure there’s a good way to align incentives in a way that fixes this problem. The best we can do is to recognize that the journalists who wrote excitedly about Haystack and Diaspora made a mistake; they were fooled, and they wasted our time. There’s no need to tar and feather anyone, but their credibility needs to suffer if we want this situation to improve.

Maybe we don’t need it to improve! It’s not that important, to be perfectly honest. But it sure does bug the hell out of me.

I could use some GMaps help

Way back when, I wrote a Google Maps application for DCist that overlaid the DC Metro system on the usual GMaps tiles. People found it useful — me especially, since I think it helped me land a job at EchoDitto.  Its only real innovation was some simple, hacked-up geometry that would horrify a cartographer, but which allowed me to make an attractive map that recalled the more stylized WMATA map.  It wasn’t rocket science, but I still occasionally get emails from developers asking me how I did it (which is slightly bizarre, given that the code is right there for them to see).

In 2007 the GMaps API got an update, and I converted the project into something called a mapplet. I had to rewrite a few things, but it was more or less the same.  The main difference was that mapplets were used through the maps.google.com interface — you could add a bunch at the same time, but you could still use Local Search and permalinking and comments about businesses and other Googly innovations from within the interface.  I didn’t have to implement any of that stuff!  Instead, users could simply have their polished Google Maps experience supplemented by my modest mapplet.  Handy.

Unfortunately, over the last few weeks I’ve started receiving reports that the mapplet’s behaving weirdly.  Load the mapplet, then do a search for something — the station markers will disappear, and sometimes some of the lines that are supposed to connect them will, too.  It looked to me like an event handler had started working differently, so I went to investigate.

Alas!  It turns out that v2 of the API has been deprecated.  They’re on to v3 (not so bad) and they’ve discontinued the mapplet platform entirely (bad)!

I can still make the lines appear on a Google Map.  But I don’t think I can do it on the maps.google.com interface.  This is a drag: I don’t think the thing’s half as useful as a standalone product as it is when it supplements search functionality.  And I really don’t want to reimplement the entire maps.google.com interface (even though, yes, they expose the API for their local search stuff).

So! Developers! Anyone out there dealt with this? I’m not eager to dump a huge amount of time back into this project — a project that’s increasingly unnecessary thanks to Google Transit and the addition of transit stations to the GMaps tileset, but which is still useful when you’re working at a modestly wide zoom level.  But it would be nice to get things working again.

browser warnings

Kevin Drum elects to take security advice from Microsoft. This is not a good idea!

In a nutshell: MS says that users ignore warnings about unsigned encryption keys (strictly speaking, certificates that aren’t signed by an authority the browser trusts), which makes those warnings useless.  Some browsers, like Firefox, make it really difficult to ignore unsigned keys, but that’s annoying, and MS says we should abandon such efforts.

This is wrong.  Those warnings are saying: “The URL you entered means that you’ve asked for a snoop-proof connection, so okay, your connection to this server is encrypted; however, I can’t verify that this server is who it’s claiming to be.”  Your conversation will be encrypted, but you can’t be sure who’s on the other end of it: you could be subject to a so-called man-in-the-middle attack, whereby someone hijacks the local network segment you’re on and starts speaking on behalf of, say, bankofamerica.com.  The magic of the certificate authority system means that they can’t do this without generating a warning.

There is one caveat: if they simply don’t try to use encryption, the warning won’t be generated.  This is one of the reasons why you’re supposed to check for https:// in the URL whenever you submit sensitive information; not just so that your password isn’t available to everyone else on the Bolt Bus (though that’s a good reason, too), but also so that anyone pretending to be a server they’re not will get caught.  Unfortunately, people aren’t very good at checking for that little s in the URL or that little key icon or whatever other little security indicator your browser provides, so the bad guys just direct their victims to unsecured sites.
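For the programmers in the audience, the two checks described above look roughly like this from the client side, sketched with Python's standard `ssl` module. It illustrates the idea and is not a description of what any particular browser does; `asks_for_encryption` is a helper name made up for this example.

```python
import ssl
from urllib.parse import urlparse


def asks_for_encryption(url):
    """True if the URL requests an encrypted (https) connection."""
    return urlparse(url).scheme == "https"


# The default context refuses certificates it can't trace to a trusted
# certificate authority, and also checks that the certificate matches
# the hostname being contacted.
context = ssl.create_default_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # certificate must validate
print(context.check_hostname)                    # name on cert must match

# Neither check ever runs if the URL doesn't ask for encryption at all.
print(asks_for_encryption("https://bankofamerica.com/"))
print(asks_for_encryption("http://bankofamerica.com/"))
```

The last two lines are the caveat in miniature: all the certificate machinery in the world does nothing for a connection that never requested it.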

That’s too bad, but it isn’t a sign that warnings about unsigned keys are bad ideas.  Actually, the fact that phishers have drifted away from that link in the security chain means that it’s working properly.  Weakening it isn’t any kind of solution.  MS security researchers would be better-served by spending more time thinking about how to get people to notice when they ought to be using a secure site.

engineering only seems cool after the fact

The American Prospect on Trivium!  Worlds colliding!  It’s great!

I read the article, though, and found myself disagreeing with its author, Marisa Meltzer.  How can you provide an accounting of Tumblr’s success and not use the word “Gawker” once?  The service was launched in New York in the right kind of scene, was pitched in the right kind of way, attracted the right kind of writers, and grew from there. As with so many successful products, you can’t simply run Tumblr’s featureset backward through the perfectly-deterministic mechanism of the imagined market and arrive at a proper accounting of why it succeeded and its competitors failed.

I agree that it’s worth talking about the subtle mechanics of the site — how there’s a thumbs-up mechanism but no thumbs-down; I think the consolidation of reader and blog into a single form is a genuinely interesting idea. But Meltzer’s account seemed to me a bit like explaining the success of the iPod via a discussion of its clickwheel.

making mintyboost

Kriston and I built Lady Ada’s Game of Life kit over at HacDC a while back, and it roused the interest of a few folks. Sommer asked that I be on the lookout for another kit that might be a good candidate for learning to solder. I eventually suggested another Lady Ada kit: the MintyBoost, a simple circuit that lets you top up your USB gadgets’ batteries through the use of a pair of AAs. By the time I got my act together, six(!) people had expressed interest in building the kit, pushing supplies of my tools (somewhat scarce) and electronics knowledge (decidedly meager) to their limits.

But it went surprisingly well! Everyone’s kit worked immediately; in fact, they even seemed to work with the iPhone 3GSes that were on hand — something that the MintyBoost can’t necessarily be relied upon to do.

[Image: glamor shot]

I think that my favorite part was watching everyone mess around with the Dremel. That’s some advanced nerdery for you.

I think that folks had fun; we might do this again, perhaps with more of an Arduino bent. If you want in, let me know.

fly, EAGLE, fly

I made my first EAGLE schematic!  I’m sure it’s horribly broken, but I’m still feeling pretty good about it.


If you’re interested in getting the source file, head over to the post on Sunlight Labs.  And if you do, please be gentle.  Advice on what I’ve gotten wrong would be welcome, though.

ALSO: Thanks to the diligent outreach efforts of Sunlight’s own Nicko Margolies, the original post has now been picked up by Hack-a-Day and MAKE. Neat! And better still, educational: I’ve already had a number of revisions suggested to me by the Hack-a-Day commenters.

ALSO ALSO: Engadget, too! Though I’ve gotta say: man are the commenters there morons.

[Image: it'd be so money, bro]

The wages of internet success, I suppose.

AAAAAND: Gizmodo. That’ll just about do it, I think. Their commenters are mean. Which I like. I want the apparent EE to offer some more information, though.  I think he’s at least partly wrong.

fear of a black hat

Earlier this week my old boss, JP, sent me a note saying that the full-text RSS script I’d written was being shut down.  EchoDitto was nice enough to continue hosting it on their servers, but it had been consuming an increasing share of resources, and finally it needed to be killed.  Sad.

Then, horrifying.  JP had mentioned it idly, but until I saw the Google Alert for this I didn’t quite realize what had been going on.  The self-described black-hat search engine optimization crowd — the folks who assemble sites peppered with ads that are designed to attract search engine traffic, aka “link farms” — had been using my script to steal other people’s content and republish it on their own sites.  Using this sort of genuine content helped them snag traffic more effectively than they could with the gibberish that you sometimes see in spam, so they were sorry to see the script go.

Well, I appreciate the attention, but it’s not exactly what I had in mind for my work when I released it.  So I explained the situation, and bid them adieu:

[Image: blackhat_seo_lg]

I was at least somewhat aware of this danger, and consequently didn’t open-source the code (thank goodness).  It was still a bit of a shock to see the reality of what I’d wrought, though.

I still believe what I wrote in that initial post: it’s pointless for online publishers to try to control how their readers consume content.  If you wish to publish digitally, you need to accept the realities that come with doing so.  Pretending otherwise is just going to inconvenience your readers and slow your business’s necessary evolution.  Dragging that evolution out seems likely to make it more painful, not less.

For those curious, the algorithm was exactly what many in comments had guessed.  I used some regular expressions to build a hierarchical structure representing all the <div> and <td> elements in each page associated with an RSS item.  These tags are the ones most often used to provide a page’s layout, and the full text of an entry can usually be found in a single instance of such a container.  I then traversed this structure, looking for the text excerpt from the original RSS item.  Once found, the rest of that container’s contents could be pulled out.  It’s a simple idea, though the realities of HTML — and the difficulty of preserving byte offsets between a sanitized working copy and the original — made the actual implementation require quite a bit more cleverness (and caching) than it may sound like.
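Here is a minimal sketch of that container search in Python. Note the swap: it uses the standard library's `html.parser` instead of regular expressions, it ignores the byte-offset and caching problems entirely, and every name in it is invented for illustration. Treat it as the idea in miniature, not a reconstruction of the original script.

```python
from html.parser import HTMLParser

CONTAINER_TAGS = {"div", "td"}


class ContainerTextCollector(HTMLParser):
    """Collect the text inside every <div> and <td> element."""

    def __init__(self):
        super().__init__()
        self.open_containers = []   # stack of text buffers for open containers
        self.finished = []          # text of each container, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINER_TAGS:
            self.open_containers.append([])

    def handle_endtag(self, tag):
        if tag in CONTAINER_TAGS and self.open_containers:
            self.finished.append("".join(self.open_containers.pop()))

    def handle_data(self, data):
        # Text belongs to every container that is currently open.
        for buffer in self.open_containers:
            buffer.append(data)


def find_full_text(page_html, excerpt):
    """Return the text of the smallest container holding the excerpt."""
    collector = ContainerTextCollector()
    collector.feed(page_html)
    matches = [text for text in collector.finished if excerpt in text]
    return min(matches, key=len).strip() if matches else None


page = (
    "<div><div>Site navigation</div>"
    "<div>First sentence of the post. And the rest of it.</div></div>"
)
print(find_full_text(page, "First sentence of the post."))
```

Picking the smallest matching container is what keeps the navigation out of the result. The sketch assumes well-nested tags, which real-world HTML frequently fails to provide.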

The result worked pretty well. Still, there were a few problems with the approach.  For one thing, comments were frequently included in the same container as the main entry.  For another, the script would fail if the RSS entry text was a summary of the item rather than an excerpt. I think that both of these are surmountable problems: a better approach would examine the “textiness” of each container using a variety of scoring metrics.  Something similar could be used for detecting the start of comments (which tend to be peppered with timestamps and quotations of the original text, and to occur after a big <h[1-6]> containing the word “comment”).  I took a stab at a new, Pythonic implementation using Adrian Holovaty’s templatemaker and a few other tools, but a lack of immediate success (and much higher computational demands than the original script) made the project fall by the wayside.  Now that I know how it might be used, I’m even less likely to pick it back up.
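To make "textiness" a little more concrete, here is one hypothetical scoring metric, sketched in Python. It is an assumption about what such a score could look like, not something the original script or the abandoned rewrite actually did: it measures how much of a chunk of HTML is visible text sitting outside of links.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")
LINK_PATTERN = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def textiness(fragment):
    """Score a chunk of HTML by how much of it is plain, unlinked prose.

    Returns a value between 0 and 1: characters of visible text that sit
    outside links, divided by the fragment's total length.
    """
    if not fragment:
        return 0.0
    without_links = LINK_PATTERN.sub("", fragment)
    visible_text = TAG_PATTERN.sub("", without_links)
    return len(visible_text.strip()) / len(fragment)


# Two invented fragments: one looks like an entry, one like navigation.
article = "<p>A long paragraph of actual writing, the kind a reader came for.</p>"
sidebar = '<ul><li><a href="/a">Home</a></li><li><a href="/b">About</a></li></ul>'
print(textiness(article) > textiness(sidebar))
```

A single ratio like this would be easy to fool, which is presumably why a variety of metrics would be needed in practice.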

But I’d still love to see my algorithm adapted by the people who make RSS readers, and would be happy to talk to any interested and qualified parties about making that happen.  Those SEO morons don’t appear to be particularly technically proficient — I’m not too worried about them managing to steal content via a client-side app (though certainly some thought should be given to the matter before giving the store away, say via Applescript).