[rdfweb-dev] The Scutter Volume problem
julian_bond at voidstar.com
Tue Mar 16 20:35:29 UTC 2004
There's a class of applications that need to aggregate large quantities
of RDF and particularly FOAF. PLINK is a good example of this. With the
entry of Livejournal, I think we just jumped the shark.
This raises a series of problems. As far as I can see, even if you start
by only storing subsets of the data, you immediately hit the problem of
what happens when the source data changes. This means that on the second
scutter run you have to delete the old data from a particular feed and
refresh it with the newly found data. This in turn means you just
about have to store a timestamp and source URL against each triple.
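A minimal sketch of that refresh step, assuming a hypothetical SQLite-backed store where every triple carries its source URL and fetch time (table and function names are my own, not from any real scutter):

```python
import sqlite3
import time

# Hypothetical quad-ish store: each triple is kept with the source URL it
# was scuttered from and the timestamp of the run that fetched it.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE triples (
    subj TEXT, pred TEXT, obj TEXT,
    source TEXT, fetched REAL)""")

def refresh(con, source, new_triples):
    """Second-run behaviour: throw away everything previously stored
    from `source`, then insert the freshly fetched triples."""
    now = time.time()
    con.execute("DELETE FROM triples WHERE source = ?", (source,))
    con.executemany(
        "INSERT INTO triples VALUES (?, ?, ?, ?, ?)",
        [(s, p, o, source, now) for (s, p, o) in new_triples])
    con.commit()

feed = "http://example.org/foaf.rdf"  # made-up feed URL
refresh(con, feed, [("#me", "foaf:name", "Alice")])
# Second scutter run: the feed changed, old data must go.
refresh(con, feed, [("#me", "foaf:name", "Alice B.")])
```

The per-source DELETE is the cost the post is pointing at: without the source column you cannot tell which rows a changed feed invalidates.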
Now let's say there are 100K feeds with 100 triples each. That's 10M
triples. A typical triple might be 500 bytes, so our store is going to
be 5 GB or so. And 100K people in FOAF is nothing: it could be more
like 2-3M now, and 10M in 6 months. I could be wrong, but is there any
RDQL store that can handle that? So now we're into SQL queries, and
doing any real-world work with this in anything like real time is tough,
because the SQL queries get complex quickly, with lots of self-joins.
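To make the self-join point concrete, here's a toy sketch (my own, using SQLite on a bare subject/predicate/object table): answering "names of people known by people Alice knows" already takes four copies of the same table, one per triple pattern.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE triples (subj TEXT, pred TEXT, obj TEXT)")
con.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("a", "foaf:name",  "Alice"),
    ("a", "foaf:knows", "b"),
    ("b", "foaf:name",  "Bob"),
    ("b", "foaf:knows", "c"),
    ("c", "foaf:name",  "Carol"),
])

# One self-join per triple pattern: t1 finds Alice, t2 her foaf:knows,
# t3 the people *they* know, t4 those people's names.
rows = con.execute("""
    SELECT t4.obj
    FROM triples t1
    JOIN triples t2 ON t2.subj = t1.subj AND t2.pred = 'foaf:knows'
    JOIN triples t3 ON t3.subj = t2.obj  AND t3.pred = 'foaf:knows'
    JOIN triples t4 ON t4.subj = t3.obj  AND t4.pred = 'foaf:name'
    WHERE t1.pred = 'foaf:name' AND t1.obj = 'Alice'
""").fetchall()
# rows -> [('Carol',)]
```

Every extra hop in the query adds another self-join over a 10M-row table, which is where "real time" gets hard.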
So we'll probably have to smush the data into a relational model as
well, and we'll have to recreate that frequently because it's now two
stages removed from the original. Say the smushed version is 1/10 the
size of the triple store. That's still a lot of data.
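For readers unfamiliar with the term: smushing merges nodes from different files that share an inverse-functional property such as foaf:mbox_sha1sum, since FOAF has no global IDs. A rough sketch of one pass, with made-up node identifiers:

```python
import hashlib

def sha1(mbox):
    # foaf:mbox_sha1sum is the SHA-1 hex digest of the mailto: URI.
    return hashlib.sha1(mbox.encode()).hexdigest()

# Triples scuttered from two different files describing the same person:
triples = [
    ("file1#p1", "foaf:mbox_sha1sum", sha1("mailto:alice@example.org")),
    ("file1#p1", "foaf:name",         "Alice"),
    ("file2#x",  "foaf:mbox_sha1sum", sha1("mailto:alice@example.org")),
    ("file2#x",  "foaf:weblog",       "http://example.org/blog"),
]

by_key = {}   # sha1sum value -> canonical node id (first one seen)
alias = {}    # node id -> canonical node id
for s, p, o in triples:
    if p == "foaf:mbox_sha1sum":
        alias[s] = by_key.setdefault(o, s)

# Rewrite subjects so both files collapse into one person record.
smushed = [(alias.get(s, s), p, o) for s, p, o in triples]
```

Because the smushed table is derived from the triple store, which is itself derived from the feeds, any feed change forces both layers to be rebuilt, which is the "two stages removed" problem above.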
This looks like a Technorati-sized problem now and a Google-sized problem
within 2 years. Not the sort of thing you can run on a laptop!
Am I being unnecessarily alarmist?
Are there any shortcuts here?
Does it mean that FOAF is only really useful as a data transfer protocol
where the data is discarded immediately after being used?
Julian Bond Email&MSM: julian.bond at voidstar.com
Personal WebLog: http://www.voidstar.com/
M: +44 (0)77 5907 2173 T: +44 (0)192 0412 433
More information about the foaf-dev mailing list