[foaf-protocols] P2P FOAF search

Dan Brickley danbri at danbri.org
Fri Jun 5 10:45:49 CEST 2009


On 5/6/09 10:08, Luke Maurits wrote:

>> So I encourage you to explore this but try to have a design which from
>> day 1 doesn't exclude these inter-related types of data. For example
>> searching in this way for information about opensource software could
>> use DOAP, an RDF vocab which links FOAF to the world of opensource
>> collaboration. Or SIOC, which covers in more detail the description of
>> content and discussion in online fora.
>
> A very valid point.  It makes me wonder if it wouldn't actually make more sense to build a P2P network for generic RDF searching and just develop (in addition to general clients, obviously) FOAF-specific clients that hide unnecessary generality from users for the sake of friendliness.  The downside to this is that such a network would probably get a lot more traffic on it.

It's a delicate balancing act! People in the RDF/SW community can 
occasionally tend towards generalisation at the expense of completeness 
of a concrete app. I'd rather see a useful P2P people-search app than an 
"interesting but not very useful" completely general app. On the other 
hand, some of the most useful and interesting descriptions of people are 
around the things they make and do, as well as topics, interests etc., 
so the general purpose aspect can't be kept out completely.

Re "downside ... more traffic", I assume each node would be able to 
control the kinds of queries it serviced vs ignored? One thing I noticed 
when running Jabber/XMPP RDF queries between two laptops, was that if I 
sent a big SPARQL query across, there could be quite a noticeable impact 
on the speed of the receiving machine as it dealt with the query (it was 
a kind of SELECT * FROM ... a media collection example). So restricting 
queries to the kind that can be handled quickly and efficiently is 
needed. I'd guess that means full-text based queries rather than those 
with lots of unconstrained variable bindings, and then the SPARQL stuff 
can be used for poking around to find properties of some specific object 
once it has been identified, perhaps.
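
A minimal sketch of that kind of gatekeeping, assuming a node decides 
from the query text alone. The function names and the variable limit are 
hypothetical, not part of jqbus or of any SPARQL library, and a real 
node would parse the query properly rather than use regular expressions:

```python
import re

# Hypothetical policy knob: how many distinct variables a "cheap" query may use.
MAX_VARIABLES = 3

def count_variables(query: str) -> int:
    """Count distinct ?variables appearing anywhere in the query text."""
    return len(set(re.findall(r"\?[A-Za-z_]\w*", query)))

def is_cheap(query: str) -> bool:
    """Crude textual heuristic: refuse SELECT * and variable-heavy queries."""
    if re.search(r"\bSELECT\s+\*", query, flags=re.IGNORECASE):
        return False
    return count_variables(query) <= MAX_VARIABLES

# An unconstrained "dump everything" query versus a lookup on one known resource.
big_query = "SELECT * WHERE { ?s ?p ?o }"
small_query = (
    "SELECT ?name WHERE { <http://example.org/alice#me> "
    "<http://xmlns.com/foaf/0.1/name> ?name }"
)

print(is_cheap(big_query), is_cheap(small_query))
```

A node could service queries that pass this test and ignore or defer the 
rest.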

>> Did you take a look at SPARQL yet?
>
> I have but I get the distinct impression that a few more pennies need to drop.  My current gloss of SPARQL is basically
> "SQL for RDF", i.e. a query language defined over triples rather than tables.

A fine summary. One of the precursor languages was called SquishQL, i.e. 
"SQL-ish...". The original proposal in that tradition of RDF query is 
here: http://www.w3.org/TandS/QL/QL98/pp/enabling.html

> There's obviously a lot more to it than that, though, since people are throwing around terms like "SPARQL endpoint" with more gravity than would seem appropriate given my current level of understanding.  Admittedly I need to do a lot more homework in this area.

OK, a quick gloss. The SPARQL group (W3C RDF Data Access WG) 
standardised several things: SPARQL the query language; a SPARQL 
protocol, defined in the abstract and with reference to a specific HTTP 
"binding"; and a SPARQL tabular result set format in XML. There is also 
a JSON version of that same format. When we talk about a SPARQL 
endpoint, we simply mean some database that speaks SPARQL and is visible 
through the Web from some base URI, e.g. 
http://danbri.example.com/myphotos or 
http://danbri.example.com/mycontacts. The HTTP-binding of the SPARQL 
protocol defines how you can compose SPARQL queries into long HTTP URIs 
that, when de-referenced, give you a response from the server, typically 
in the SPARQL XML or JSON result formats (or an error code).
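
As a rough illustration of the HTTP binding, here is how such a URI 
could be composed with Python's standard library. The endpoint is the 
example address above and the query is made up for illustration; only 
the "query" parameter name comes from the protocol itself:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Example endpoint URI from the text above (not a live server).
ENDPOINT = "http://danbri.example.com/mycontacts"

# An illustrative query; any SPARQL query string would do.
QUERY = """PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE { ?person foaf:name ?name } LIMIT 10"""

def sparql_get_url(endpoint: str, query: str) -> str:
    """Compose the GET form of a protocol request: endpoint?query=<encoded query>."""
    return endpoint + "?" + urlencode({"query": query})

url = sparql_get_url(ENDPOINT, QUERY)
print(url)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY)
```

A client would then fetch that URL, usually with an Accept header asking 
for application/sparql-results+xml or application/sparql-results+json.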

So a SPARQL endpoint fills a role similar to JDBC/ODBC, except the 
standard protocol and format mean the client libraries can be pretty 
thin. (It's no coincidence that a company like OpenLink (who were big 
into ODBC) show up and become very active in the SPARQL scene...).

My jqbus experiment explored the idea of attaching these same data 
access methods to an XMPP/Jabber account, since then the queries can 
pass into more personal data zones, like laptops, home media servers 
etc., without needing publicly accessible web servers. Jabber/XMPP 
deals with getting through firewalls/NAT, and then the fact that queries 
can be associated with human and organizational XMPP account holders 
means that we also have a nice hook for doing access control.

A couple more points.

1. SPARQL queries can also return RDF triples, rather than tables of 
bindings. The "CONSTRUCT" mechanism in the query language allows you to 
specify a simple template into which each row of bindings is re-bound. 
While a bit crude (no conditionals etc.) this is enough for some simple 
data transformation tricks. Another query form, "ASK", simply asks 
whether some pattern matches in the data or not, for which you get a 
nice concise boolean response.

2. SPARQL databases / endpoints aren't simple flat triple stores: there 
is a notion of data grouping and layering associated with the "GRAPH" 
keyword in the query language. Each triple in the store can be 
associated with a URI, which effectively groups it into a sub-set of 
data alongside other triples in the same graph.  SPARQL queries can 
include constraints that talk about these "GRAPH"s too, providing a hook 
for applications smart enough to care not only about data merging, but 
data provenance/sourcing too.
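
Both points can be mimicked in a few lines of plain Python, without any 
RDF library. This is only a toy model under stated assumptions: terms 
are plain strings, "ex:knownBy" is a made-up property, and the graph 
URIs are invented for the example.

```python
# Point 1: CONSTRUCT re-binds each row of variable bindings into a template.
# Roughly: CONSTRUCT { ?b ex:knownBy ?a } WHERE { ?a foaf:knows ?b }
rows = [
    {"a": "ex:alice", "b": "ex:bob"},
    {"a": "ex:bob", "b": "ex:alice"},
]
template = [("?b", "ex:knownBy", "?a")]

def bind(term, row):
    """Look up ?variables in the row; constants pass through unchanged."""
    return row.get(term[1:]) if term.startswith("?") else term

def construct(template, rows):
    """Instantiate the template once per row, skipping triples with unbound variables."""
    graph = set()
    for row in rows:
        for triple in template:
            bound = tuple(bind(term, row) for term in triple)
            if None not in bound:
                graph.add(bound)
    return graph

def ask(rows):
    """ASK is true when the pattern has at least one solution."""
    return len(rows) > 0

# Point 2: a store of quads, where the first element names the graph a triple sits in.
quads = [
    ("http://example.org/g/alice", "ex:alice", "foaf:knows", "ex:bob"),
    ("http://example.org/g/bob", "ex:bob", "foaf:knows", "ex:alice"),
    ("http://example.org/g/bob", "ex:bob", "foaf:name", "Bob"),
]

def match(quads, g=None, s=None, p=None, o=None):
    """Return quads matching the pattern; None acts like an unbound variable."""
    pattern = (g, s, p, o)
    return [q for q in quads
            if all(want is None or want == got for want, got in zip(pattern, q))]

def graphs_asserting(quads, s, p, o):
    """Roughly: SELECT ?g WHERE { GRAPH ?g { s p o } }"""
    return sorted({q[0] for q in match(quads, s=s, p=p, o=o)})

print(construct(template, rows))
print(graphs_asserting(quads, "ex:bob", "foaf:knows", "ex:alice"))
```

The last function is the provenance hook: it answers "which graph says 
so?" rather than just "is it said?".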

See http://svn.foaf-project.org/foaftown/2009/layers/ for a picture of 
this, and some sample RDF and queries. I've been meaning to blog this 
since scribbling it on a plane ride, so here goes with a first-cut 
explanation:

Picture of layers:
http://svn.foaf-project.org/foaftown/2009/layers/layercake2.jpg

Layer 1:
http://svn.foaf-project.org/foaftown/2009/layers/layer1.rdf
This is the top layer in the picture. It says that Alice knows Bob, and 
that her school-homepage is http://lookingglass.example.org and that the 
organization with that schoolhomepage has the name "LookingGlass 
School".  In this story, Alice is the author of the layer1 document, 
although the document doesn't claim that explicitly.

Layer 2:
This is the middle layer in the picture. It says that there is a person 
called Bob, and that Bob knows this person called Alice. It claims the 
same school-homepage for Bob, and says that the organization with that 
schoolhomepage has the name "The LookingGlass School". In this story, 
Bob is the author of the layer2 document, although the document doesn't 
claim that explicitly.

Layer 3:
This is the lower layer in the picture. It says that there is a person 
called Alice whose school-homepage is http://lookingglass.example.org. 
It gives an identifier (short form 'lcc' but this would expand to a full 
URI) for the school which has that as its homepage, and says that the 
name for this organization is "LookingGlass Community College (formerly 
LookingGlass School)". It also mentions two other documents that may 
contain related information.  In this story, the School is the 
author/publisher of this layer3 document, although that is also not 
stated explicitly in the data.

Layer 4: (not shown in the picture, except as the attribution images 
along the right hand column)
layerlist.rdf states that Alice made layer1.rdf, Bob made layer2.rdf and 
the lcc school made layer3.rdf. In this story, layer 4 is effectively 
our background knowledge, the administrative or "table of contents" 
layer which we believe naively or tentatively so as to have some basis 
for asking questions of the rest of the data.

This simple-minded picture has some subtlety. It shows that RDF can 
happily represent partial information, as well as its overlaps and 
sources. It shows that things are easier when each source uses the same 
URIs for the same entities, but that we can still cope with the lack of 
direct identifiers by reasoning about things in terms of identifying 
properties such as "homepage". It also shows some of the challenges this 
kind of real-world complexity can pose for implementations:

Q: Does the school confirm Alice's claim to have attended? A: Yes
Q: Does the school confirm Bob's claim to have attended? A: No
Q: Did Alice attend the school? A: We don't know.
Q: Did Bob attend the school? A: We don't know.
Q: Which claims of school attendance are made both by the student and 
the school? A: Alice's but not Bob's.
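
The last question can be worked through in plain Python. This is a hand 
simplification of the layers, not the contents of the actual .rdf files: 
people are identified by name, "schoolHomepage" stands in for the FOAF 
property, and the school's names for itself are left out.

```python
SCHOOL = "http://lookingglass.example.org"

# layer name -> set of (subject, property, value) claims made in that layer
layers = {
    "layer1": {("Alice", "knows", "Bob"), ("Alice", "schoolHomepage", SCHOOL)},
    "layer2": {("Bob", "knows", "Alice"), ("Bob", "schoolHomepage", SCHOOL)},
    "layer3": {("Alice", "schoolHomepage", SCHOOL)},
}

# Layer 4: the "table of contents" saying who made each layer.
made_by = {"layer1": "Alice", "layer2": "Bob", "layer3": "school"}

def claimed_by(author, person):
    """Does any layer made by `author` say that `person` attended the school?"""
    return any(
        made_by[layer] == author and (person, "schoolHomepage", SCHOOL) in claims
        for layer, claims in layers.items()
    )

def confirmed_students():
    """People whose attendance is claimed both by themselves and by the school."""
    people = {s for claims in layers.values()
              for (s, p, o) in claims if p == "schoolHomepage"}
    return sorted(p for p in people
                  if claimed_by(p, p) and claimed_by("school", p))

print(confirmed_students())
```

Note that this only reports who says what. Whether Alice or Bob actually 
attended remains "we don't know", as in the answers above.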

Ok, maybe this is more than enough to be going on with. If this 
explanation makes any sense to someone I'll blog it! See the file 
http://svn.foaf-project.org/foaftown/2009/layers/notes.txt for some 
example SPARQL queries, but note that you'll probably want to tweak the 
files somewhat to use different URIs if running this on your own SPARQL 
installation.


> Nevertheless, it still seems like a no-brainer even to me that SPARQL could/should be used as part of the hypothetical P2P network under discussion, from the points of view of (i) not re-inventing wheels that work well and (ii) enabling usage of non-FOAF vocabularies in search, as mentioned earlier.  This makes your jqbus project all the more relevant since the core of the P2P system would really just be passing SPARQL query strings around via XMPP.  The extra work would really just be having each node re-distribute queries it couldn't answer to other nodes on its XMPP roster, and adding a time-to-live counter (to prevent endless propagation of stale queries).  I'd need to refresh my understanding of XMPP to be sure there's not more required than this, though.  Anyway, if it does turn out to be such a relatively simple extension it feels like it would make more sense to add to your codebase than reimplement from scratch, at least for testing the idea out.

I'm not sure how healthy the codebase is, but it should give something 
that can be played with fairly easily, e.g. as a 2nd implementation of 
some proposed protocol. I updated it a year or so back to use more 
recent versions of the XMPP and RDF libraries, but as I mentioned (i) 
the specific binding of SPARQL to XMPP needs some more thought, and (ii) 
passing around full SPARQL queries might be a bit much, perhaps a 
simpler profile/subset is needed, or a full-text oriented alternative. 
Perhaps allow direct friends to run more expensive SPARQL queries, but 
strangers only get lighter/cheaper options...?
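
To make the idea concrete, here is an in-memory sketch of forwarding 
with a time-to-live plus a friends-only tier for expensive queries. It 
involves no XMPP at all: the Node class, its method names and the 
"expensive" flag are hypothetical, the roster is just a Python list, and 
calls are synchronous where a real network would be asynchronous.

```python
class Node:
    def __init__(self, name, answers=None):
        self.name = name
        self.roster = []               # direct contacts, standing in for an XMPP roster
        self.answers = answers or {}   # query string -> canned result
        self.seen = set()              # query ids already handled, to stop loops

    def handle(self, query_id, query, ttl, origin=None, expensive=False):
        """Answer locally if allowed, else forward to the roster until ttl runs out."""
        if query_id in self.seen:
            return None
        self.seen.add(query_id)
        # Expensive queries are only serviced for ourselves or for direct friends.
        trusted = origin is None or origin in self.roster
        if query in self.answers and (trusted or not expensive):
            return (self.name, self.answers[query])
        if ttl <= 0:
            return None
        sender = origin if origin is not None else self
        for peer in self.roster:
            result = peer.handle(query_id, query, ttl - 1, sender, expensive)
            if result is not None:
                return result
        return None

# A chain a - b - c, where only c can answer the query "q".
a, b, c = Node("a"), Node("b"), Node("c", {"q": "result from c"})
a.roster, b.roster, c.roster = [b], [a, c], [b]

print(a.handle("id1", "q", ttl=2))                  # reaches c within two hops
print(a.handle("id2", "q", ttl=1))                  # ttl runs out at b
print(a.handle("id3", "q", ttl=2, expensive=True))  # c does not know a directly
```

Here c answers a's cheap query two hops away, but declines the expensive 
one because a is not on its roster.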

cheers,

Dan

