Thursday, 8 October 2009

Investigating Terms of Service

During the project advisory board meeting we briefly discussed the legal issues involved in archiving web2.0 sites. I’ve been doing a bit of investigating already, looking at what the various service providers’ Terms of Service say and thought I’d share what I’ve found.

Each Terms of Service is basically the same, though a few are a bit more specific about what is and is not allowed. Here’s a basic table where you can see briefly what each ToS contains (sorry it's a bit small):

As you can see, all the ToS agree that the account holder owns their content, which is good news for archivists as well as account holders, but they also agree that the site provider owns all copyright, trademarks, logos and any other intellectual property in the site itself. This means that if an archive wants to harvest a site interface, not just a user’s data, then it needs to obtain the site provider’s permission.

A second problem is that most sites restrict data harvesting. Facebook bans it outright, whereas Twitter only prohibits scraping; crawling is allowed “if done in accordance with the provisions of the robots.txt file” (which aren't stated). Myspace only prohibits automated harvesting of data “for the purposes of sending unsolicited or unauthorised material”, which implies that harvesting data for archival purposes is allowed. However, this isn’t stated directly, and since some stipulations are quite specific I’d be inclined to check with the service provider rather than rely on assumptions.
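For anyone wondering what crawling "in accordance with robots.txt" might look like in practice, here's a minimal sketch in Python using the standard library's `urllib.robotparser`. The robots.txt rules, the `ExampleArchiveBot` user agent and the example.com URLs are all made up for illustration; they are not taken from any of the services above. A real crawler would fetch the live file instead of parsing a hard-coded string, and passing this check says nothing about whether the rest of the ToS permits the harvest.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only. A real crawler
# would point the parser at the live file with set_url(...) and read().
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text held in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules allow this user agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# "ExampleArchiveBot" is an assumed name for an archive's crawler.
print(may_crawl(parser, "ExampleArchiveBot", "https://example.com/public/page"))
print(may_crawl(parser, "ExampleArchiveBot", "https://example.com/private/page"))
```

With these made-up rules the first URL is allowed and the second is not. The check only covers the technical side, so I'd still want to ask the provider.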

Interestingly, Twitter used to have a rather vague ToS which said nothing about other people using their logos and trademarks. However, they updated their terms on 18th September and now restrictions on using Twitter’s intellectual property are written in.

So altogether it looks like an archivist can’t do much with a web2.0 account without the service provider’s permission. Now it just depends how amenable they’d be to granting it.


Kevin Ashley said...

This isn't the whole story, though. Some of these services provide access to their underlying data via other means. Twitter is an obvious example, where the API is designed to do exactly the sort of things that the ToS for the web interface prohibit. Automated retrieval of Twitter data via the API is encouraged, albeit rate-limited. For archival purposes, this will almost always be sufficient, unless one is specifically concerned with the appearance of Twitter's web interface as opposed to the content it is delivering.

The same is true in a different way of blogs on WordPress, which are accessible via RSS feeds. That alternative approach is the key behind the ArchivePress project.

Susan Thomas said...

We need to look more closely at the APIs different services make available for getting stuff out (and judge whether we care about properties that might be lost in the process).

This post by Matt Asay on moving data into/between/out of cloud services is interesting:

pixelatedpete said...

Cool post! I love that most of 'em pretty much say "Yeah, you own everything, but it is also ours - thanks!". While we dream of freedom, it still remains the case that you don't get sommit for nuffin! :-)

Does seem odd to have a bunch of restrictions on the Web pages and yet give everything away via RSS or other mechanical means. Seems like a tangled legal mess that no one is quite sure about so I'd be inclined to grab the bits worth grabbing, keep them safe somewhere (is that really *using* it?), and wait to see what happens.

To my knowledge no one has (yet) sued Google for indexing their Web content (or indeed resurfacing it out of the Google cache). Archiving a Web page strikes me as being akin to indexing a Web page.


ps. Google, I hereby revoke your non-exclusive, royalty-free, world-wide license to use this comment. Not even read it. If you're an employee of Google, stop reading now. Oh. Damn. Too late... :-)

dnt said...

Good discussion. The practicality of archiving blogs and other contributory websites is problematic. One of the few times I was refused permission to archive a blog was when the blog owner felt unable to grant me permission to archive other people's material. This was one of the very few occasions when a site owner was sufficiently aware of rights issues to raise concerns.

I wonder how publishing a blog under a Creative Commons Licence - accepted by all contributors - would affect things?