[rdfweb-dev] Re: Syntactic profiling (FOAF document formats)

Julian Bond julian_bond at voidstar.com
Thu Aug 28 10:47:17 UTC 2003


Edd Dumbill <edd at usefulinc.com> wrote:
>Using an RDF parser to process FOAF is *easier* than regex land.
>Especially if you go further than just using a parser and use a toolkit
>like Redland, Drive or Jena because then your data model is done for you
>as well.

As discussed earlier, none of these three is suitable for me. The best I
can find for my platform is RAP, and that's barfing on large files.

>I say let's write some more
>complex FOAF consuming applications and then see what we did or did not
>require of the syntax.

That's what led me to this.

>I'd like to see us take a break from this argument and talk about issues
>related to processing the data once it's parsed.  There are some
>meaningful and deep issues we need to figure out.

Let's talk about those then[1].

A common approach currently used in scutters is to grab all the RDF you 
can and dump it into a triple store. Usually the triple store also has 
some secondary data, such as where and when each triple was found. I
don't think this is scalable unless you have lots of processing power
and disk space available, and in the extreme case it might require
Google-sized resources.
Even with the current universe of RDF containing FOAF this is leading to 
triple stores with a million or so triples.
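A minimal sketch of the "grab everything" approach described above, assuming a single SQLite table stands in for the triple store. The table layout and the function names (open_store, add_triples, triples_from) are hypothetical and not taken from any existing scutter.

```python
import sqlite3
import time

def open_store(path=":memory:"):
    """Create a tiny triple store; source and found_at are the secondary data."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS triples ("
        " subject TEXT, predicate TEXT, object TEXT,"
        " source TEXT, found_at REAL)"
    )
    return db

def add_triples(db, source, triples, found_at=None):
    """Store every triple from one fetched document, tagged with where and when."""
    found_at = time.time() if found_at is None else found_at
    db.executemany(
        "INSERT INTO triples VALUES (?, ?, ?, ?, ?)",
        [(s, p, o, source, found_at) for (s, p, o) in triples],
    )
    db.commit()

def triples_from(db, source):
    """Return the (subject, predicate, object) rows recorded for one source."""
    rows = db.execute(
        "SELECT subject, predicate, object FROM triples"
        " WHERE source = ? ORDER BY rowid",
        (source,),
    )
    return rows.fetchall()
```

Every fetched document adds rows and nothing is ever summarised, which is why the table grows with the size of the crawl.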

One solution to this is to do more processing at collection time and 
store data-modelled information with links back to the source.

That leads us to the problem of the source of the data and what right 
the author of the data has to make the statements they're making. RDF
constructs like foaf:maker, foaf:made, and foaf:Document with added DC
all require additional indirection. It's only code, but it's code that
you won't immediately realise you need to write. It's all too easy to
write simpler queries against your triple store that miss these.
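To illustrate the extra indirection, here is a rough sketch that contrasts a simple query with one that looks at who made the source document. It assumes triples are held as (subject, predicate, object, source) tuples with plain string identifiers, and it assumes one possible policy: only accept a foaf:mbox from a document that names the same person as its foaf:maker. That policy and both function names are illustrative assumptions, not something FOAF prescribes.

```python
def naive_mboxes(quads, person):
    """Every foaf:mbox asserted about person, regardless of who said it."""
    return sorted(
        {o for (s, p, o, src) in quads if s == person and p == "foaf:mbox"}
    )

def self_asserted_mboxes(quads, person):
    """Only foaf:mbox values from documents that name person as foaf:maker."""
    # Extra indirection: first find the documents that state, about
    # themselves (s == src), that this person made them.
    made_by_person = {
        s for (s, p, o, src) in quads
        if p == "foaf:maker" and o == person and s == src
    }
    return sorted(
        {
            o for (s, p, o, src) in quads
            if s == person and p == "foaf:mbox" and src in made_by_person
        }
    )
```

The second query still trusts a document's own claim about its maker, so it narrows the problem rather than solving it.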

I don't know, maybe I'm not explaining myself well enough. *I am not* 
arguing for a non-RDF syntactic profile that is no longer parsable by 
RDF tools. Perhaps all I'm looking for is a best-practice document or a
busy developer's guide or something. Some way of saying to new people
aiming to auto-generate FOAF from code "look at this and do it like 
that". Maybe it's as trivial as saying "copy the output of foaf-a-matic 
while putting in your own data"[2][3].

Returning to processing strategies. Let's say I have a collection of SQL 
tables for Person, mbox, feed with person_mbox, person_feed and 
person_person link tables. I shouldn't have to store very much info 
against Person because I can always go back to the source feed. Except 
that without some additional metadata this approach has numerous 
problems with data ageing, validity of source, and so on.
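A sketch of what those tables might look like in SQLite, with one guess at the "additional metadata": each link row records which feed it came from and when it was last seen. All column names, the last_fetched and last_seen fields, and the stale_mbox_links helper are assumptions made for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE mbox   (id INTEGER PRIMARY KEY, address TEXT UNIQUE);
CREATE TABLE feed   (id INTEGER PRIMARY KEY, url TEXT UNIQUE, last_fetched REAL);
-- Link tables: source_feed and last_seen are the assumed extra metadata.
CREATE TABLE person_mbox (
    person_id   INTEGER REFERENCES person(id),
    mbox_id     INTEGER REFERENCES mbox(id),
    source_feed INTEGER REFERENCES feed(id),
    last_seen   REAL);
CREATE TABLE person_feed (
    person_id   INTEGER REFERENCES person(id),
    feed_id     INTEGER REFERENCES feed(id),
    source_feed INTEGER REFERENCES feed(id),
    last_seen   REAL);
CREATE TABLE person_person (
    subject_id  INTEGER REFERENCES person(id),
    object_id   INTEGER REFERENCES person(id),
    source_feed INTEGER REFERENCES feed(id),
    last_seen   REAL);
"""

def make_db():
    """Create an in-memory database with the sketch schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db

def stale_mbox_links(db, cutoff):
    """Person/mbox links whose last_seen time is older than cutoff."""
    return db.execute(
        "SELECT person_id, mbox_id FROM person_mbox"
        " WHERE last_seen < ? ORDER BY person_id, mbox_id",
        (cutoff,),
    ).fetchall()
```

With source_feed on every link it is at least possible to go back to the feed and re-check a statement, and last_seen gives a handle on data ageing.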

I find myself going round and round this and bouncing back and forth 
between simply storing every triple I ever find and dealing with the 
processing problems late, or processing early and then having bad data 
that I can't rectify because I've thrown away too much too early. The 
whole syntactic profile thing is an attempt to cut this Gordian knot by
trying to force a bit more structure onto the source data. The pushback
on this is substantial(!), so the consensus would appear to be that this
is impossible/wrong/misguided/displays a woeful misunderstanding.

[1]This is a threaded mailing list. Feel free to start new threads ;-)
[2]Danbri. Your work on the spec is superb. Something that would make it 
better is a few more examples.
[3]If the idea of syntactic profiles takes root, there's a next step 
which is to put into the file a reference as to which profile this 
particular feed is supposed to follow.

-- 
Julian Bond Email&MSM: julian.bond at voidstar.com
Webmaster:              http://www.ecademy.com/
Personal WebLog:       http://www.voidstar.com/
M: +44 (0)77 5907 2173   T: +44 (0)192 0412 433


