
Should twitter have its own data storage?

Now that twitter’s OAuth feature is in public beta, some people argue that twitter should also provide a data storage facility through its API. This would allow developers to attach payloads to different twitter objects, but it would create another information silo instead of leveraging existing data storage providers.

I agree with Kevin Marks when he asks “Why should the storage and the event stream come from the same provider?”.

Quoting Dave Winer’s article “Twitter and OAuth, interesting brew”:

Maybe it won’t be Twitter or Facebook, but whoever builds the next consensus platform will have open data storage APIs in addition to identity. It’s a vital part of identity. We’ve been waiting too damned long for this.

Let’s see what happens.

Do you agree with twitter following limits?

Quoting the twitter status blog post “Twitter Status - A note about per day following limits.”:

While there are technical reasons behind having some limit on following activites, this per-day limit exists to discourage spamminess. Also, it is unlikely that anyone can actually read tweets from thousands of accounts which makes the mass following activity disingenuous.

What do you think?

Testing the tarpipe Wordpress plugin

Testing

The real time Web?

Much has been said about friendfeed’s latest UI redesign and how it enables a real time view of content from across the Web. Is it really real time? I mean, content is pulled periodically from other applications into friendfeed so that it can be displayed to the end user.

This post was triggered in part by a tweet from Ian Mikutel:

Read your preso on Activity Streams & Context. Does new FriendFeed with Real Time everywhere ruin your “middleman” argument now?

Ian is talking about slide 12 from my presentation “Activity Streams and Contexts”, prepared for Google’s first MiniBarcamp in Lisbon, Portugal, on 4/2/09 (original here). Note that this was actually before friendfeed launched the redesign:

I claim that a data propagation architecture based on a middleman (friendfeed, for instance) pulling information from different Web applications is better in terms of scaling than letting everyone pull data from each other without any type of agreement. I also say that it can’t be real time, because it needs to obtain data periodically from different end points, thus wasting time on that process.
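The scaling argument can be made concrete with some back-of-the-envelope arithmetic. This is only a sketch under my own assumptions: every application polls on the same fixed interval, and in the middleman case each application also reads the aggregated stream back from the middleman once per interval. The function names are made up for illustration.

```python
def mesh_pulls(n_apps):
    """Pulls per polling interval when every application polls every other one."""
    return n_apps * (n_apps - 1)


def middleman_pulls(n_apps):
    """Pulls per interval when a middleman polls each application once
    and each application reads the aggregated stream once (assumption)."""
    return 2 * n_apps


def worst_case_delay(poll_interval_s, hops):
    """Upper bound on how stale an update can be after `hops` polling steps."""
    return poll_interval_s * hops


for n in (5, 50, 500):
    print(n, mesh_pulls(n), middleman_pulls(n))

# An update that is polled by the middleman and then polled again by the
# reader can be up to two intervals old, which is why this is not real time.
print(worst_case_delay(poll_interval_s=60, hops=2))
```

The number of pulls grows quadratically without a middleman and linearly with one, but every polling hop adds up to one full interval of delay.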

So, where is the real time Web? Is the approach friendfeed is presenting the best we can do? I think we can do much better.

How to share a secret

“How to Share a Secret” is the title of a paper written in 1979 by Adi Shamir (best known for his work on the RSA algorithm). The paper describes a method for dividing information into n pieces so that any k of them are enough to reconstruct it, while knowledge of k − 1 or fewer pieces gives absolutely no information about the original. Quoting the author:

This technique enables the construction of robust key management schemes for cryptographic systems that can function securely and reliably even when misfortunes destroy half the pieces and security breaches expose all but one of the remaining pieces.

What’s so fascinating about this technique? Here’s a list of properties taken from its article on Wikipedia:

  • The scheme is “information-theoretically secure”, meaning that its security doesn’t depend on the attacker’s computing power;
  • It’s also “perfectly secure”, which means that its output gives no information whatsoever about its input;
  • It’s minimal, because the size of each piece doesn’t exceed the size of the original information;
  • It’s extensible, allowing the addition of new pieces without having to change existing ones;
  • It’s dynamic, meaning that you can change all the pieces without changing the original data;
  • It’s also flexible, allowing each party to receive a different number of pieces, related to their power within the whole chain.
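To make the idea less abstract, here is a minimal sketch of a (k, n) threshold scheme in Python, following the polynomial construction from the paper: the secret is the constant term of a random polynomial of degree k − 1 over a prime field, each piece is a point on that polynomial, and any k points recover the constant term by Lagrange interpolation. The prime, the function names and the integer encoding of the secret are my own choices for illustration, and it needs Python 3.8 or later for the modular inverse. Don't use it for real secrets.

```python
import secrets

# Field modulus (assumption): the Mersenne prime 2**127 - 1.
# The secret must be an integer smaller than this.
PRIME = 2**127 - 1


def split_secret(secret, k, n, prime=PRIME):
    """Split `secret` into n shares so that any k of them can recover it."""
    if not 0 <= secret < prime:
        raise ValueError("secret must be in the range [0, prime)")
    if not 1 <= k <= n < prime:
        raise ValueError("need 1 <= k <= n < prime")
    # Random polynomial of degree k - 1 whose constant term is the secret.
    coefficients = [secret] + [secrets.randbelow(prime) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for coefficient in reversed(coefficients):  # Horner's rule
            y = (y * x + coefficient) % prime
        shares.append((x, y))
    return shares


def recover_secret(shares, prime=PRIME):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        numerator, denominator = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                numerator = (numerator * -xj) % prime
                denominator = (denominator * (xi - xj)) % prime
        # pow(d, -1, p) is the modular inverse of d (Python 3.8+).
        secret = (secret + yi * numerator * pow(denominator, -1, prime)) % prime
    return secret
```

For example, `split_secret(123456789, k=3, n=5)` returns five `(x, y)` pairs, and passing any three of them to `recover_secret` gives back `123456789`. Adding a sixth piece only means evaluating the same polynomial at a new `x`, which is the “extensible” property from the list above.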

If you read through the paper you’ll find an example describing a situation where a number of signatures is required to pay a check. Again, quoting the original:

If each executive is given a copy of the company’s secret signature key, the system is convenient but easy to misuse. If the cooperation of all the company’s executives is necessary in order to sign each check, the system is safe but inconvenient. The standard solution requires at least three signatures per check, and it is easy to implement with a (3, n) threshold scheme. (…) An unfaithful executive must have at least two accomplices in order to forge the company’s signature in this scheme.

Now, suppose we’re not talking about some company’s executives but instead about Web Services, and instead of a secret signature, we’re talking about users’ credentials. The bank will become some Web application where the user is registered and the money will become the user’s data on that application.

As a side note, I think this analogy describes the problems we’ve been having lately with the password anti-pattern, more specifically with third-party applications asking for your twitter credentials. The problem exists not only when you give your password to a third-party application, but also when you use other authentication mechanisms: if the third-party application is hijacked, your data can be compromised.

Now, back to our solution: suppose you build your application separating particular objects physically by using Web Services. Also, suppose these Web Services are invoked in a secure way. In this scenario, whenever you want to execute a specific task, a set of Web Services must be called in a specific order.

Web Services interaction

If each Web Service is given access to a copy of the credential, the system will be very easy to misuse. A possible solution is to divide a secret that can later on decrypt the credential — I’m not getting into details about this process right now — into different tokens that are spread across all the Web Services involved in the process.

Now, even if one part of the system is hijacked or in some way compromised, there’s still no way to decrypt the credential and use it in unexpected ways. Even if the attackers gain access to the Decrypt Web Service, they must also reproduce the list of tokens in the correct order.
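Since I'm not going into the details of the process here, the following is only one possible way to get the “tokens in the correct order” behaviour, not a description of an actual implementation. Each Web Service holds a random token, and the key that decrypts the credential is derived by chaining the tokens with HMAC-SHA256 in the order the services are called. The service names and function names are hypothetical. Unlike Shamir's threshold scheme, this sketch needs every token, not just k of them.

```python
import hashlib
import hmac
import secrets


def issue_tokens(service_names):
    """Give each (hypothetical) Web Service its own random 32-byte token."""
    return {name: secrets.token_bytes(32) for name in service_names}


def derive_key(tokens_in_call_order):
    """Fold the tokens, in call order, into a single 32-byte key.

    Each step keys HMAC-SHA256 with the previous result, so the same
    tokens presented in a different order produce a different key.
    """
    key = b"\x00" * 32  # fixed starting value (assumption)
    for token in tokens_in_call_order:
        key = hmac.new(key, token, hashlib.sha256).digest()
    return key


# Hypothetical call chain; only this exact order yields the right key.
call_order = ["auth", "profile", "decrypt"]
tokens = issue_tokens(call_order)
key = derive_key([tokens[name] for name in call_order])
```

An attacker who compromises one service learns a single token, which is not enough to rebuild `key`, and even with all the tokens the call order still has to be reproduced.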

You can argue that if someone eavesdrops on the connection to the application where the credentials are being sent, e.g. twitter, they could still see them in plain text. That’s a whole different problem, not addressed by this solution. OAuth, for instance, addresses it by requiring requests to be signed with a secret known only to sender and receiver, together with one-time values (nonces) so that a captured request can’t be replayed.

Cloud Balancer

The latest Amazon S3 blackout made me think about what can be done to protect your application when it depends on S3 or any other cloud service.

Does this mean that you should all stop using cloud services and go back to your own data center? No way! There are better and more reasonable solutions. They just need a bit of thought and some experimentation.

Some people, like Dave Winer, can even see that a new service can emerge from this need. In his article “Amazon S3 down all day”, Dave proposes a possible solution:

It seems there is a business opportunity here — it would be easy to hook up an external service to S3, and for a fee, keep a mirror on another server. Then it would be a matter of redirecting domains to point at the other server when S3 goes down.

So, here’s my proposed solution for this specific problem: use several equivalent storage services for redundancy. Technically speaking, you should write your data to multiple services at once and create a read procedure that selects the fastest available service and uses it.

Cloud Balancer diagram

This solution can use tarpipe to let you write data to multiple destinations at once, and Gnip can inform your application about the best service to read from.
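The write-to-all, read-from-the-fastest idea can be sketched in a few lines of Python. This is not how tarpipe or Gnip work. The `InMemoryStore` class is a made-up stand-in for a real storage service such as S3, with an artificial latency and an “is it up?” switch, and the `CloudBalancer` name is just borrowed from the title of this post.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


class InMemoryStore:
    """Hypothetical stand-in for a remote storage service."""

    def __init__(self, name, latency=0.0, available=True):
        self.name = name
        self.latency = latency      # simulated network delay in seconds
        self.available = available  # set to False to simulate an outage
        self.data = {}

    def _call(self):
        time.sleep(self.latency)
        if not self.available:
            raise ConnectionError(f"{self.name} is down")

    def put(self, key, value):
        self._call()
        self.data[key] = value

    def get(self, key):
        self._call()
        return self.data[key]


class CloudBalancer:
    def __init__(self, stores):
        self.stores = list(stores)

    def write(self, key, value):
        """Write to every store in parallel; return the names that succeeded."""
        succeeded = []
        with ThreadPoolExecutor(max_workers=len(self.stores)) as pool:
            futures = {pool.submit(s.put, key, value): s for s in self.stores}
            for future in as_completed(futures):
                if future.exception() is None:
                    succeeded.append(futures[future].name)
        if not succeeded:
            raise ConnectionError("no storage service accepted the write")
        return succeeded

    def read(self, key):
        """Ask every store in parallel; return the first successful answer."""
        pool = ThreadPoolExecutor(max_workers=len(self.stores))
        try:
            futures = [pool.submit(s.get, key) for s in self.stores]
            for future in as_completed(futures):
                if future.exception() is None:
                    return future.result()
        finally:
            pool.shutdown(wait=False)  # don't wait for the slower stores
        raise ConnectionError("no storage service could serve the read")
```

With one fast store, one slow store and one store that is down, `write` reports the two that accepted the data and `read` returns as soon as the fast one answers. A real version would also have to re-sync a service that missed writes while it was down, which this sketch ignores.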

Lifestreaming aggregators

Lifestreaming aggregators became popular as the number of different applications where you could participate — either by updating your status or by uploading something — increased. The aggregators are here to relieve you from the burden of going to multiple locations to find out what your friends or contacts are up to.

Services like plaxo, spokeo, friendfeed or socialthing (lifestream blog has a more comprehensive list) start by asking for your identity on different services. Then they aggregate the information you update on those services and let you and your contacts access it from a central location.

This obviously means that the aggregators wouldn’t exist if the services they’re collecting information from weren’t popular. Those lifestreaming services opened the way to this new wave of applications. So, what is lifestreaming, anyway?

Lifestreaming itself is the ability to publish quick updates about what you’re doing or thinking at the moment. Those posts can be very short and textual, or they can contain other media such as pictures or video. There is a considerable number of lifestreaming services available, with twitter being the most popular.

Some people are questioning the true value of these lifestreaming services. Major concerns are a) the increasing volume of information you’ll have to process; and b) the decreasing willingness to participate by writing more extensive thoughts.

Some evidence of these concerns can be found at the Micro Persuasion blog, where Steve Rubel explains that:

We are reaching a point where the number of inputs we have as individuals is beginning to exceed what we are capable as humans of managing. The demands for our attention are becoming so great, and the problem so widespread, that it will cause people to crash and curtail these drains. Human attention does not obey Moore’s Law.

This is also affecting blogging, as Sarah Perez from ReadWriteWeb believes:

When people post an article on a blog these days, the conversations are occurring offsite. The blog link could be submitted to Digg, Mixx, and/or FriendFeed, and conversations may occur around the topic on those sites instead. The original blog post, meanwhile, has 0 comments.

So, how can lifestreaming and its aggregators be a good thing if they’re disrupting the way you’re used to interacting on the Web? I believe we’re at the beginning of a much wider paradigm shift where the interaction will move away from localized items, like blog posts, and start spreading all across the Web. What will matter in the future is not the place where you posted your thought or your comment but instead its context.

How will it evolve? Microformats will probably play an important role, as they allow you to refer to dispersed pieces of information and define the context of the information you’re publishing. Aggregators and search engines will also play an important role, clustering information according to its context.

Sending errors to your ticketing system

I’ve been thinking about developing a PHP Logger that will talk to your favorite ticketing system. The idea is to capture code- and application-generated errors and create tickets accordingly.

At first glance, I could use the Reflection API to get meta-information about the code. Tickets could then be created and assigned to the appropriate person based on the @author doc tag.
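To show the mechanics without committing to a design, here is a rough sketch of the idea. It's written in Python rather than PHP, with the standard `inspect` module standing in for PHP's Reflection API and a docstring standing in for the doc comment. The `create_ticket` callback is hypothetical: it's where a call to the API of Basecamp, Goplan or any other ticketing system would go.

```python
import functools
import inspect
import re
import traceback

AUTHOR_TAG = re.compile(r"@author\s+(.+)")


def find_author(func, default="unassigned"):
    """Read the @author tag from a function's docstring, if there is one."""
    doc = inspect.getdoc(func) or ""
    match = AUTHOR_TAG.search(doc)
    return match.group(1).strip() if match else default


def ticket_from_error(func, error):
    """Build a plain dict describing the ticket to be created."""
    return {
        "title": f"{type(error).__name__} in {func.__name__}: {error}",
        "assignee": find_author(func),
        "body": "".join(
            traceback.format_exception(type(error), error, error.__traceback__)
        ),
    }


def ticketed(create_ticket):
    """Decorator: report any exception raised by the function as a ticket."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as error:
                create_ticket(ticket_from_error(func, error))
                raise  # the error still propagates to the caller
        return wrapper
    return decorator
```

Decorating a function with `@ticketed(some_callback)` is enough: when it raises, the callback receives a ticket whose assignee is whatever follows `@author` in the docstring, or `unassigned` if the tag is missing.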

The ticket Logger could talk to existing Web project management and ticketing applications like Basecamp, Goplan and Hiveminder (from the creators of RT).

Do you think you would benefit from this type of Logger? What ticketing system are you using on your own projects?

Collaborative filtering

Are you tired of your feed reader? Do you wish you could find more interesting posts, or perhaps new blogs related to your current tastes and preferences?

Apparently Dave Winer feels the same way:

I want rating services to provide clues about what I should be subscribing to. I want them to find not what’s popular with the masses but what will be valuable to me.

He then hits the sweet spot:

It’s a simple matter to apply collaborative filtering to this problem, we’ve even done it in SYO. These ideas need revisiting now that everyone else seems to have caught on that this is a problem worth solving.

Paolo Avesani, who’s already been studying this subject for some time, understands that tags alone are not enough to propose recommendations. Quoting the paper “An Analysis of the Use of Tags in a Blog Recommender System” [Hayes et al., 2007] (PDF):

In the blog domain, however, we find that tags are rather poor at partitioning blog data. Using content-based clustering, we observe that a small proportion of users in every cluster have independently used the same tag tokens to describe his/her posts.

We definitely need something new. What about using collaborative filtering algorithms to gain knowledge about the users’ tastes and eventually recommend them interesting content? The Pearson correlation algorithm is probably a good candidate.

Pearson's correlation formula
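As a small illustration of how this could work for blogs, the sketch below computes the Pearson correlation between two readers over the items both have rated, and then recommends unread items using the ratings of positively correlated readers, weighted by that correlation. The reader names, blog names and the 1 to 5 rating scale are invented for the example, and the weighting scheme is just one common way of turning similarities into recommendations.

```python
from math import sqrt


def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items rated by both users (-1.0 to 1.0)."""
    common = [item for item in ratings_a if item in ratings_b]
    n = len(common)
    if n < 2:
        return 0.0
    xs = [ratings_a[item] for item in common]
    ys = [ratings_b[item] for item in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # one user rated everything the same; no signal
    return covariance / sqrt(var_x * var_y)


def recommend(user, ratings, top=3):
    """Rank items the user hasn't rated by similarity-weighted ratings."""
    totals, weights = {}, {}
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        similarity = pearson(ratings[user], their_ratings)
        if similarity <= 0:
            continue  # ignore users with opposite or unrelated tastes
        for item, score in their_ratings.items():
            if item not in ratings[user]:
                totals[item] = totals.get(item, 0.0) + similarity * score
                weights[item] = weights.get(item, 0.0) + similarity
    ranked = sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)
    return [item for _, item in ranked[:top]]


# Invented sample data: blog ratings on a 1 to 5 scale.
ratings = {
    "ana":   {"blog_a": 5, "blog_b": 3, "blog_c": 1},
    "bruno": {"blog_a": 5, "blog_b": 3, "blog_c": 1, "blog_d": 5, "blog_e": 1},
    "carla": {"blog_a": 1, "blog_b": 3, "blog_c": 5, "blog_d": 1, "blog_e": 5},
}
print(recommend("ana", ratings))
```

Here ana and bruno agree on everything they have both rated, while carla has the opposite taste, so only bruno's ratings count and `blog_d` is ranked ahead of `blog_e`.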

I suggest watching Tayfun Şen’s excellent presentation about collaborative filtering.

Adegga and AVIN

Do you enjoy wine? You’re gonna love Adegga, the place where you can discover new wines by looking at other people’s tastings and findings.

Adegga logo

Adegga just came out of a closed beta and right now anyone can start using it as long as they get invited by another user. That’s right, during an initial phase, registration is only available to invited users. This is a way of controlling the growth and making any necessary adjustments as the user base expands.

Adegga features

So, what can you do at Adegga? First, you can use it as your cellar organizer, keeping a history of all the wines you buy and taste. You can also attach notes to any wine, making it easier to contextualize it afterwards.

The breakthrough for me was the “watchlist” concept. You can actually follow your friends’ tastings as they update them. You can even access the list through RSS and your favorite feed reader. Here are some of the wines I’ve tasted:

bpedro’s tasted wines at Adegga

Every wine in Adegga is identified by a code crafted with attention to details like the country of origin, the region, the wine type, etc. This code is called AVIN and aims to be the ISBN of the wine world.

What a great concept. Imagine being able to correctly identify a wine by reading its AVIN. Imagine the possibilities when wine producers begin labeling the AVIN code on their bottles.