[rdfweb-dev] The Scutter Volume problem

Morten Frederiksen mof-rdf at mfd-consult.dk
Tue Mar 16 23:00:38 UTC 2004

On Tuesday 16 March 2004 21:35, Julian Bond wrote:
> There's a class of applications that need to aggregate large quantities
> of RDF and particularly FOAF. PLINK is a good example of this. With the
> entry of Livejournal, I think we just jumped the shark.
Agreed, it's not a trivial problem by any means.

> This raises a series of problems. As far as I can see, even if you start
> by only storing subsets of the data you immediately hit the problem of
> what happens when the source data changes. This means that on the second
> scutter run you have to delete old data from a particular feed and
> refresh it with the newly found data. This in turn means that you just
> about have to store triples with a timestamp and source URL against each
> one.
I use the Redland [1] library for storing triples, and Redland supports a 
concept called contexts, which lets you "tag" each triple with an extra node. 
This context node can be a URI, and as such can itself be part of the stored 
graph. That makes it possible to store not only timestamps but other 
scutter-related information as well. See a work-in-progress vocabulary on the 
wiki [2].

> Now let's say there are 100K feeds with 100 triples each. That's 10M
> triples. A typical triple might be 500 bytes, so our store is going to
> be 5 Gb or so. Now a 100K people in FOAF is nothing. It could be more
> like 2-3M now and 10M in 6 months. I could be wrong but is there any
> RDQL store that can handle that?
How do you define an "RDQL store"? I suspect you may be confusing the query 
language with the storage layer. Redland has a triple-querying API backed by 
several different storage types such as BDB, plain files and MySQL (written 
by yours truly), and an RDQL/Squish-like interface is in the works as we 
speak. As part of that, RDQL queries will be rewritten to SQL queries.
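To illustrate the rewriting idea (by hand, not Redland's actual generated SQL): each triple pattern in a query becomes one reference to the triples table, and shared variables become join conditions.

```python
import sqlite3

# Illustration only: an RDQL-style query like
#   SELECT ?name WHERE (?x, foaf:name, ?name), (?x, foaf:knows, ?y)
# maps to a self-join on the triples table -- one table alias per
# triple pattern, joined on the shared variable ?x.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
FOAF = "http://xmlns.com/foaf/0.1/"
db.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("_:a", FOAF + "name", "Alice"),
    ("_:a", FOAF + "knows", "_:b"),
    ("_:b", FOAF + "name", "Bob"),
])

rows = db.execute("""
    SELECT t1.o FROM triples t1
    JOIN triples t2 ON t1.s = t2.s
    WHERE t1.p = ? AND t2.p = ?""",
    (FOAF + "name", FOAF + "knows")).fetchall()
print(rows)  # [('Alice',)] -- Bob has a name but knows nobody here
```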

On your size estimates, I think you're right about the number of files, but I 
think the triple count is higher, especially since LJ are putting out 2k+ 
triples per user.

In any case, I imported Jim's (very) partial scutter dump from a few weeks 
ago, about 6.7M triples, and the raw table files take up only about 0.5G, 
with indices using a little more for a total of 1.2G. Each triple actually 
only takes up 41 bytes, with the nodes of the graph (resources, bnodes and 
literals) stored in separate tables.
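A normalized layout of roughly this shape (assumed for illustration; not the exact MySQL schema) is what gets a stored triple down to a few dozen bytes: node values live once in their own table, and the triples table holds only small integer keys.

```python
import sqlite3

# Sketch of a normalized triple store, assumed shape only: long URIs and
# literals are interned once in a nodes table, and each triple is just
# three integer keys -- which is how a triple can cost tens of bytes.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, value TEXT UNIQUE);
    CREATE TABLE triples (s INTEGER, p INTEGER, o INTEGER);
""")

def node_id(value):
    # Store the node text once; later triples reuse its integer id.
    db.execute("INSERT OR IGNORE INTO nodes (value) VALUES (?)", (value,))
    return db.execute("SELECT id FROM nodes WHERE value = ?",
                      (value,)).fetchone()[0]

name = "http://xmlns.com/foaf/0.1/name"
for person in ("_:alice", "_:bob"):
    db.execute("INSERT INTO triples VALUES (?, ?, ?)",
               (node_id(person), node_id(name), node_id(person + " label")))

# The long predicate URI is stored once, however many triples use it.
print(db.execute("SELECT COUNT(*) FROM nodes WHERE value = ?",
                 (name,)).fetchone()[0])  # 1
```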

A side note: Because of this normalized database structure, it would help 
enormously if LJ put URIs on users' interests. Why? Because as it is, all 
interests are separate nodes, and each one carries its own title. If the 
users referred to a common interest instead, it would only be necessary to 
store one title (and indeed one URI) per unique interest. That's almost a 
two-thirds reduction of the total triple count right there...
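A back-of-the-envelope check of that two-thirds figure, under purely hypothetical numbers and the assumption that each listed interest currently costs about three triples (the link from the person, the node's type, and its title), while a shared URI leaves only the link triple per user, with type and title stored once per unique interest:

```python
# Hypothetical numbers, chosen only to check the rough 2/3 claim.
users, interests_per_user, unique_interests = 100_000, 20, 5_000

# As-is: ~3 triples per listed interest (link + type + title, assumed).
as_is = users * interests_per_user * 3

# With shared URIs: 1 link triple per user, plus type + title stored
# once per unique interest, amortized across everyone who lists it.
shared = users * interests_per_user * 1 + unique_interests * 2

print(as_is, shared, round(1 - shared / as_is, 3))  # 6000000 2010000 0.665
```

So under these assumptions the reduction comes out at roughly 66%, i.e. "almost 2/3".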

> So now we're into SQL queries. Doing
> any sort of real world work with this in any sort of real time is tough
> as the SQL queries get complex quickly with lots of self joins.
Indeed it is, and they do.

> So we'll probably have to smush the data into a relational model as
> well. And we'll have to recreate this frequently because we're now 2
> stages removed from the original. Say the Smushed version is 1/10 the
> size of the triple store. That's still a lot of data.
Yes, having a (separate) query-friendly database structure is one solution, 
but one that will only work for a specific, limited set of information, e.g. 
FOAF data. One advantage of the RDF model is its ability to handle arbitrary 
information. However, there's also work in progress on combining the two, 
rewriting queries at run-time depending on the location of the requested 
data.
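The "smushed" relational side might look like the sketch below: pick out just the FOAF properties you actually query on and flatten them into one row per person. The column choices are illustrative, not a proposed standard schema.

```python
import sqlite3

# Sketch of smushing FOAF triples into a flat, query-friendly table.
# Property selection and column names are illustrative assumptions.
FOAF = "http://xmlns.com/foaf/0.1/"
triples = [
    ("_:a", FOAF + "name", "Alice"),
    ("_:a", FOAF + "mbox_sha1sum", "deadbeef"),
    ("_:a", FOAF + "weblog", "http://example.org/alice/"),
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (id TEXT PRIMARY KEY, name TEXT, "
           "mbox_sha1sum TEXT, weblog TEXT)")

# Map the handful of properties we care about onto columns; everything
# else in the graph is simply ignored by this derived structure.
wanted = {FOAF + "name": "name",
          FOAF + "mbox_sha1sum": "mbox_sha1sum",
          FOAF + "weblog": "weblog"}
for s, p, o in triples:
    if p in wanted:
        db.execute("INSERT OR IGNORE INTO people (id) VALUES (?)", (s,))
        db.execute(f"UPDATE people SET {wanted[p]} = ? WHERE id = ?", (o, s))

print(db.execute("SELECT name, weblog FROM people").fetchone())
# ('Alice', 'http://example.org/alice/')
```

The trade-off is exactly the one described above: queries against `people` are fast and simple, but the table only captures the properties it was built for and must be regenerated when the triple store changes.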

> Am I being unnecessarily alarmist?
No, I don't think so. A simple brain-dead scutter is now a thing of the past. 
Scutters have to be more sophisticated, possibly by limiting the stored part 
of the retrieved data as you suggest, or by limiting the web-space to be 
scuttered; see the wiki for an outline [3].

> Are there any shortcuts here?
I don't see any, but please let us know if you find any!

> Does it mean that FOAF is only really useful as a data transfer protocol
> where the data is discarded immediately after being used?
Partially, yes: nobody is interested in having stale and unused data lying 
around. But this is not much different from the old web: your browser may 
have a cache, but in general all you keep are the links.

[1] http://www.redland.opensource.ac.uk/notes/contexts.html
[2] http://rdfweb.org/topic/ScutterVocab
[3] http://rdfweb.org/topic/ScutterStrategies

Please feel free to add to the above wiki pages...
