[rdfweb-dev] The Scutter Volume problem

Julian Bond julian_bond at voidstar.com
Tue Mar 16 20:35:29 UTC 2004


There's a class of applications that need to aggregate large quantities 
of RDF and particularly FOAF. PLINK is a good example of this. With the 
entry of Livejournal, I think we just jumped the shark.

This raises a series of problems. As far as I can see, even if you start 
by only storing subsets of the data you immediately hit the problem of 
what happens when the source data changes. This means that on the second 
scutter run you have to delete old data from a particular feed and 
refresh it with the newly found data. This in turn means that you just 
about have to store triples with a timestamp and source URL against each 
one.
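
To make the refresh step concrete, here is a minimal sketch of a store that keeps a source URL and fetch time against every triple, so a second scutter run can drop a feed's old triples and load the new ones in one transaction. The table layout and the names open_store and refresh_source are assumptions for illustration, not taken from any existing scutter.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) a triple table with per-triple provenance."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS triples ("
        " s TEXT, p TEXT, o TEXT,"
        " source TEXT NOT NULL,"   # URL of the feed the triple came from
        " fetched REAL NOT NULL)"  # Unix time of the scutter run
    )
    # Refreshing a feed deletes by source, so index that column.
    db.execute("CREATE INDEX IF NOT EXISTS by_source ON triples(source)")
    return db


def refresh_source(db, source, triples, now=None):
    """Replace everything previously loaded from `source` with `triples`."""
    now = time.time() if now is None else now
    with db:  # one transaction: the delete and the inserts commit together
        db.execute("DELETE FROM triples WHERE source = ?", (source,))
        db.executemany(
            "INSERT INTO triples (s, p, o, source, fetched)"
            " VALUES (?, ?, ?, ?, ?)",
            [(s, p, o, source, now) for (s, p, o) in triples],
        )
```

Calling refresh_source again for the same URL leaves only the newest triples from that feed and does not touch triples from other feeds.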

Now let's say there are 100K feeds with 100 triples each. That's 10M 
triples. A typical triple might be 500 bytes, so our store is going to 
be 5 GB or so. Now 100K people in FOAF is nothing. It could be more 
like 2-3M now and 10M in 6 months. I could be wrong, but is there any 
RDQL store that can handle that? So now we're into SQL queries. Doing 
any sort of real world work with this in any sort of real time is tough 
as the SQL queries get complex quickly with lots of self joins.
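
A rough sketch of both points, under stated assumptions: the sizing function just multiplies out the figures above (one feed per person, 100 triples per feed, 500 bytes per triple), and the query shows how even a simple question against a plain three-column triple table needs the table joined to itself twice. The table layout, the sample data and the name friend_names are made up for illustration.

```python
import sqlite3


def store_size_bytes(feeds, triples_per_feed=100, bytes_per_triple=500):
    """Back-of-envelope store size using the per-feed figures from the post."""
    return feeds * triples_per_feed * bytes_per_triple


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
db.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("_:a", "foaf:mbox", "mailto:alice@example.org"),
    ("_:a", "foaf:knows", "_:b"),
    ("_:b", "foaf:name", "Bob"),
    ("_:a", "foaf:knows", "_:c"),
    ("_:c", "foaf:name", "Carol"),
])

# "Names of the people known by whoever owns this mailbox": each extra
# hop in the graph pattern is one more self join on the same table.
FRIEND_NAMES = """
SELECT n.o
FROM triples AS m
JOIN triples AS k ON k.s = m.s AND k.p = 'foaf:knows'
JOIN triples AS n ON n.s = k.o AND n.p = 'foaf:name'
WHERE m.p = 'foaf:mbox' AND m.o = ?
ORDER BY n.o
"""


def friend_names(db, mbox):
    """Run the three-way self join for one mailbox URI."""
    return [row[0] for row in db.execute(FRIEND_NAMES, (mbox,))]
```

With these assumptions, store_size_bytes(100_000) is 5 GB and store_size_bytes(10_000_000) is 500 GB, before any indexes.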

So we'll probably have to smush the data into a relational model as 
well. And we'll have to recreate this frequently because we're now 2 
stages removed from the original. Say the smushed version is 1/10 the 
size of the triple store. That's still a lot of data.
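
For readers who have not met the term, smushing here means merging nodes that share a value for an identifying property such as foaf:mbox_sha1sum. A minimal sketch using union-find follows; the property list, the prefixed-name strings and the function name smush are assumptions for illustration, and a real implementation would take the inverse functional properties from the schema.

```python
# Properties treated as identifying (an assumed, hard-coded list).
IFPS = {"foaf:mbox", "foaf:mbox_sha1sum", "foaf:homepage"}


def smush(triples):
    """Merge subjects that share an identifying property value.

    Returns a set of triples rewritten to one canonical node per person.
    """
    triples = list(triples)  # iterated twice below
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the lexicographically smaller id as the canonical one.
            parent[max(ra, rb)] = min(ra, rb)

    first_seen = {}
    for s, p, o in triples:
        if p in IFPS:
            if (p, o) in first_seen:
                union(s, first_seen[(p, o)])
            else:
                first_seen[(p, o)] = s

    # Simplification: an object is treated as a node only if it also
    # appears somewhere as a subject; everything else is left alone.
    subjects = {s for s, _, _ in triples}
    merged = set()
    for s, p, o in triples:
        merged.add((find(s), p, find(o) if o in subjects else o))
    return merged
```

Because the output is a set, duplicate statements about the same person collapse, which is where the reduction in size comes from.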

This looks like a Technorati-sized problem now and a Google-sized problem 
within 2 years. Not the sort of thing you can run on a laptop!

Am I being unnecessarily alarmist?
Are there any shortcuts here?
Does it mean that FOAF is only really useful as a data transfer protocol 
where the data is discarded immediately after being used?

-- 
Julian Bond Email&MSM: julian.bond at voidstar.com
Webmaster:                 http://www.ecademy.com/
Personal WebLog:          http://www.voidstar.com/
M: +44 (0)77 5907 2173      T: +44 (0)192 0412 433


