[rdfweb-dev] The Scutter Volume problem

Morten Frederiksen mof-rdf at mfd-consult.dk
Tue Mar 16 23:00:38 UTC 2004


On Tuesday 16 March 2004 21:35, Julian Bond wrote:
> There's a class of applications that need to aggregate large quantities
> of RDF and particularly FOAF. PLINK is a good example of this. With the
> entry of Livejournal, I think we just jumped the shark.
Agreed, it's not a trivial problem by any means.

> This raises a series of problems. As far as I can see, even if you start
> by only storing subsets of the data you immediately hit the problem of
> what happens when the source data changes. This means that on the second
> scutter run you have to delete old data from a particular feed and
> refresh it with the newly found data. This in turn means that you just
> about have to store triples with a timestamp and source URL against each
> one.
I use the Redland [1] library for storing triples. Redland supports a 
concept called contexts: each triple can be "tagged" with a context node. 
This node can be a URI, and as such can itself be part of the stored graph. 
This makes it possible to store not only timestamps but also other 
scutter-related information. See the work-in-progress vocabulary on the 
wiki [2].
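
To illustrate the idea of tagging triples with their source, here is a 
minimal in-memory sketch. It does not use the Redland API; the class name 
ContextStore and its methods are made up for this example, and a real store 
would keep the timestamp as triples about the context node rather than in a 
separate dict.

```python
from collections import defaultdict
from datetime import datetime, timezone


class ContextStore:
    """Toy store: every triple is filed under the URL it was fetched from."""

    def __init__(self):
        self.by_context = defaultdict(set)  # source URL -> set of (s, p, o)
        self.fetched_at = {}                # source URL -> time of last fetch

    def refresh(self, source_url, triples, when=None):
        """Replace everything previously loaded from source_url."""
        self.by_context[source_url] = set(triples)
        self.fetched_at[source_url] = when or datetime.now(timezone.utc)

    def triples(self):
        """Return all triples, regardless of which source they came from."""
        merged = set()
        for context_triples in self.by_context.values():
            merged |= context_triples
        return merged


store = ContextStore()
url = "http://example.org/foaf.rdf"  # hypothetical feed URL

# First scutter run: two triples from this feed.
store.refresh(url, [("_:a", "foaf:name", "Alice"), ("_:a", "foaf:nick", "al")])

# Second run: the feed changed, so the old triples from this source are dropped.
store.refresh(url, [("_:a", "foaf:name", "Alice B.")])
```

Because the old data is located by its source, a second scutter run only 
touches the triples that came from the refreshed feed.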

> Now let's say there are 100K feeds with 100 triples each. That's 10M
> triples. A typical triple might be 500 bytes, so our store is going to
> be 5 Gb or so. Now a 100K people in FOAF is nothing. It could be more
> like 2-3M now and 10M in 6 months. I could be wrong but is there any
> RDQL store that can handle that?
How do you define an "RDQL Store"?
I don't suppose you're confusing the query language with the storage 
mechanism?
Redland has a triple querying API, backed by several different storage types 
such as BDB, plain files and MySQL (written by yours truly), and an 
RDQL/Squish-like interface is in the works as we speak. As part of that, RDQL 
queries will be rewritten to SQL queries.

On your size estimates, I think you're right about the number of files, but I 
think the triplecount is higher, especially since LJ are putting out 2k+ 
triples per user.

In any case, I imported Jim's (very) partial scutter dump from a few weeks 
ago, about 6.7M triples, and the raw table files take up only about 0.5G, 
with indices using a little more for a total of 1.2G. Each triple actually 
only takes up 41 bytes, with the nodes of the graph (resources, bnodes and 
literals) stored in separate tables.
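
For comparison, the arithmetic behind both sets of numbers can be written 
out as below. This is only a back-of-envelope sketch: it assumes "G" means 
10^9 bytes, and the projection at the end assumes that a larger store would 
have the same overall bytes-per-triple ratio as this partial dump, which may 
well not hold.

```python
GB = 10**9  # assumption: "G" in the figures above means 10^9 bytes

# Julian's estimate: 100K feeds, 100 triples each, 500 bytes per triple.
feeds = 100_000
triples_per_feed = 100
bytes_per_triple_guess = 500
estimated_triples = feeds * triples_per_feed
estimated_size_gb = estimated_triples * bytes_per_triple_guess / GB

# Figures from the import: 6.7M triples, 41 bytes per triple row, 1.2G total.
imported_triples = 6_700_000
triple_row_bytes = 41
triple_rows_gb = imported_triples * triple_row_bytes / GB

# Overall cost per triple once node tables and indices are included.
total_gb = 1.2
overall_bytes_per_triple = total_gb * GB / imported_triples

# Julian's 10M triples at the overall ratio observed in the import.
projected_gb = estimated_triples * overall_bytes_per_triple / GB

print(f"estimate: {estimated_triples:,} triples, {estimated_size_gb:.1f} GB")
print(f"triple rows alone: {triple_rows_gb:.2f} GB")
print(f"overall: {overall_bytes_per_triple:.0f} bytes per triple")
print(f"10M triples at that ratio: {projected_gb:.1f} GB")
```

The triple rows themselves come to well under the 0.5G of raw table files, 
so presumably the node tables account for most of the remainder.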

A side note: Because of this normalized database structure, it would help 
enormously if LJ put URIs on the user's interests. Why? Because as it is, all 
interests are separate, and they each have a title. If the users referred to 
a common interest instead, it would only be necessary to store one title for 
each unique interest, and even only one URI. That's almost a reduction of the 
total triplecount by 2/3 right there...
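
A toy version of this effect, using sqlite3. The schema is a simplification 
(a single nodes table instead of separate tables for resources, bnodes and 
literals), and the property names, user URIs and interest URI are purely 
illustrative, not what LJ actually emits. The exact saving depends on how 
many triples each interest carries, so the counts here are not meant to 
reproduce the 2/3 figure.

```python
import sqlite3


def build(triples):
    """Load triples into a normalized store: nodes once, triples as id rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, value TEXT UNIQUE)")
    db.execute(
        "CREATE TABLE triples (s INTEGER, p INTEGER, o INTEGER, UNIQUE (s, p, o))"
    )

    def node_id(value):
        # Each distinct node value is stored only once.
        db.execute("INSERT OR IGNORE INTO nodes (value) VALUES (?)", (value,))
        row = db.execute("SELECT id FROM nodes WHERE value = ?", (value,))
        return row.fetchone()[0]

    for s, p, o in triples:
        db.execute(
            "INSERT OR IGNORE INTO triples VALUES (?, ?, ?)",
            (node_id(s), node_id(p), node_id(o)),
        )
    return db


def counts(db):
    """Return (number of triples, number of distinct nodes)."""
    n_triples = db.execute("SELECT COUNT(*) FROM triples").fetchone()[0]
    n_nodes = db.execute("SELECT COUNT(*) FROM nodes").fetchone()[0]
    return n_triples, n_nodes


users = [f"http://example.org/users/{name}" for name in ("alice", "bob", "carol")]

# As it is: every user has a private interest node carrying its own title.
separate = []
for i, user in enumerate(users):
    separate.append((user, "foaf:interest", f"_:interest{i}"))
    separate.append((f"_:interest{i}", "dc:title", "RDF"))

# With a shared URI: one interest node and one title, however many users.
shared_uri = "http://example.org/interests/rdf"
shared = [(user, "foaf:interest", shared_uri) for user in users]
shared.append((shared_uri, "dc:title", "RDF"))

print(counts(build(separate)))
print(counts(build(shared)))
```

With separate interest nodes the triple count grows by two per user, with a 
shared URI by one, and the title is stored once in total.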

> So now we're into SQL queries. Doing
> any sort of real world work with this in any sort of real time is tough
> as the SQL queries get complex quickly with lots of self joins.
Indeed it will.
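
To show where the self joins come from, here is a small sketch that turns a 
list of triple patterns into SQL over a single triples table. It is not how 
Redland rewrites queries; the function patterns_to_sql, the table layout 
(plain text columns s, p, o) and the sample data are all assumptions made 
for the example.

```python
import sqlite3


def patterns_to_sql(select_vars, patterns):
    """Translate triple patterns into one SQL query, one table alias each."""
    first_seen = {}  # variable name -> column where it first appeared
    conditions, params = [], []
    for i, pattern in enumerate(patterns):
        for column, term in zip("spo", pattern):
            ref = f"t{i}.{column}"
            if term.startswith("?"):
                if term in first_seen:
                    # A repeated variable becomes a join condition.
                    conditions.append(f"{ref} = {first_seen[term]}")
                else:
                    first_seen[term] = ref
            else:
                # A constant becomes a bound parameter.
                conditions.append(f"{ref} = ?")
                params.append(term)
    tables = ", ".join(f"triples AS t{i}" for i in range(len(patterns)))
    columns = ", ".join(first_seen[v] for v in select_vars)
    where = " AND ".join(conditions) if conditions else "1"
    return f"SELECT {columns} FROM {tables} WHERE {where}", params


def demo():
    """Names of everyone a (hypothetical) person 'ex:alice' knows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
    db.executemany(
        "INSERT INTO triples VALUES (?, ?, ?)",
        [
            ("ex:alice", "foaf:knows", "ex:bob"),
            ("ex:alice", "foaf:knows", "ex:carol"),
            ("ex:bob", "foaf:name", "Bob"),
            ("ex:carol", "foaf:name", "Carol"),
            ("ex:dave", "foaf:name", "Dave"),
        ],
    )
    sql, params = patterns_to_sql(
        ["?name"],
        [("ex:alice", "foaf:knows", "?y"), ("?y", "foaf:name", "?name")],
    )
    return sql, sorted(row[0] for row in db.execute(sql, params))


sql, names = demo()
print(sql)
print(names)
```

Every additional pattern in the query adds another alias of the same table, 
which is why even modest queries end up as multi-way self joins.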

> So we'll probably have to smush the data into a relational model as
> well. And we'll have to recreate this frequently because we're now 2
> stages removed from the original. Say the Smushed version is 1/10 the
> size of the triple store. That's still a lot of data.
Yes, having a (separate) query-friendly database structure is one solution, 
but one that will only work for a specific limited set of information, e.g. 
FOAF data. One advantage of the RDF model is the ability to handle arbitrary 
information. However, there's also work in progress on combining the two, 
rewriting queries at run-time, depending on the location of the requested 
information.

> Am I being unnecessarily alarmist?
No, I don't think so. A simple brain-dead scutter is now a thing of the past. 
Scutters have to be more sophisticated, possibly by limiting the stored part 
of the retrieved data as you suggest, or by limiting the web-space to be 
scuttered, see the wiki for an outline [3].

> Are there any shortcuts here?
I don't see any, but please let us know if you find any!

> Does it mean that FOAF is only really useful as a data transfer protocol
> where the data is discarded immediately after being used?
Partially, yes: nobody is interested in having stale and unused data lying 
around. But this is not much different from the old web: your browser may 
have a cache, but in general all you save are the links.


[1] http://www.redland.opensource.ac.uk/notes/contexts.html
[2] http://rdfweb.org/topic/ScutterVocab
[3] http://rdfweb.org/topic/ScutterStrategies

Please feel free to add to the above wiki pages...


Regards,
Morten


