[foaf-protocols] P2P FOAF search
danbri at danbri.org
Fri Jun 5 10:45:49 CEST 2009
On 5/6/09 10:08, Luke Maurits wrote:
>> So I encourage you to explore this but try to have a design which from
>> day 1 doesn't exclude these inter-related types of data. For example
>> searching in this way for information about opensource software could
>> use DOAP, an RDF vocab which links FOAF to the world of opensource
>> collaboration. Or SIOC, which covers in more detail the description of
>> content and discussion in online fora.
> A very valid point. It makes me wonder if it wouldn't actually make more sense to build a P2P network for generic RDF searching and just develop (in addition to general clients, obviously) FOAF-specific clients that hide unnecessary generality from users for the sake of friendliness. The downside to this is that such a network would probably get a lot more traffic on it.
It's a delicate balancing act! People in the RDF/SW community can
occasionally tend towards generalisation at the expense of completeness
of a concrete app. I'd rather see a useful P2P people-search app than an
"interesting but not very useful" completely general app. On the other
hand, some of the most useful and interesting descriptions of people are
around the things they make and do, as well as topics, interests etc.,
so the general purpose aspect can't be kept out completely.
Re "downside ... more traffic", I assume each node would be able to
control the kinds of queries it serviced vs ignored? One thing I noticed
when running Jabber/XMPP RDF queries between two laptops, was that if I
sent a big SPARQL query across, there could be quite a noticeable
on the speed of the receiving machine as it dealt with the query (it was
a kind of SELECT * FROM ... a media collection example). So restricting
queries to the kind that can be handled quickly and efficiently is
needed. I'd guess that means full-text based queries rather than those
with lots of unconstrained variable bindings, and then the SPARQL stuff
can be used for poking around to find properties of some specific object
once it has been identified, perhaps.
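For illustration, a node's triage of incoming queries could be as crude as this sketch (the heuristic and function names are entirely invented for the example, not part of any existing codebase):

```python
# Crude, invented heuristic for triaging incoming queries on a P2P node:
# reject unconstrained "dump everything" requests and prefer queries that
# bind only a few variables, which are likelier to be cheap to service.
import re

def looks_cheap(sparql: str, max_vars: int = 2) -> bool:
    """Accept a query only if its SELECT clause binds few variables
    and is not a 'SELECT *' style dump."""
    head = sparql.split("WHERE", 1)[0]
    if "*" in head:
        return False  # 'SELECT *' is the expensive case seen in the text
    return len(set(re.findall(r"\?\w+", head))) <= max_vars

print(looks_cheap(
    "SELECT ?name WHERE { ?p <http://xmlns.com/foaf/0.1/name> ?name }"))
# True
print(looks_cheap("SELECT * WHERE { ?s ?p ?o }"))
# False
```

A real node would of course want a smarter cost model, but even a filter this blunt lets a laptop ignore the worst of the traffic.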
>> Did you take a look at SPARQL yet?
> I have but I get the distinct impression that a few more pennies need to drop. My current gloss of SPARQL is basically
> "SQL for RDF", i.e. a query language defined over triples rather than tables.
A fine summary. One of the precursor languages was called SquishQL, ie.
"SQL-ish..."; SPARQL grew out of that tradition of RDF query languages.
> There's obviously a lot more to it than that, though, since people are throwing around terms like "SPARQL endpoint" with more gravity than would seem appropriate given my current level of understanding. Admittedly I need to do a lot more homework in this area.
OK, a quick gloss. The W3C Data Access WG standardised several things
under the SPARQL name: SPARQL the query language; a SPARQL protocol,
defined in the abstract and with reference to a specific HTTP "binding";
and a SPARQL tabular result-set format in XML, which also has a JSON
version. When we talk about a SPARQL endpoint, we simply mean some
database that speaks SPARQL and is visible through the Web from some
base URI, eg. http://danbri.example.com/myphotos or
http://danbri.example.com/mycontacts. The HTTP-binding of the SPARQL
protocol defines how you can compose SPARQL queries into long HTTP URIs
that, when de-referenced, give you a response from the server, typically
in the SPARQL XML or JSON result formats (or an error code).
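A small stdlib-only sketch of that HTTP binding and the JSON result shape (the endpoint URI is the hypothetical example above, and the response data is invented; no network request is actually made here):

```python
# Sketch of the SPARQL protocol's HTTP GET binding: the query text is
# sent as a 'query' parameter appended to the endpoint's base URI.
import json
from urllib.parse import urlencode

endpoint = "http://danbri.example.com/mycontacts"  # hypothetical endpoint
query = """PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name WHERE { ?person foaf:name ?name }"""

request_uri = endpoint + "?" + urlencode({"query": query})
print(request_uri)  # one long de-referencable HTTP URI

# The server typically answers in the SPARQL results format; this is
# the shape of the JSON flavour, with invented sample data:
sample_response = """
{ "head":    { "vars": ["name"] },
  "results": { "bindings": [
      { "name": { "type": "literal", "value": "Alice" } } ] } }
"""
data = json.loads(sample_response)
names = [b["name"]["value"] for b in data["results"]["bindings"]]
print(names)  # ['Alice']
```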
So a SPARQL endpoint fills a role similar to JDBC/ODBC, except the
standard protocol and format mean the client libraries can be pretty
thin. (It's no coincidence that a company like OpenLink, who were big
into ODBC, shows up and becomes very active in the SPARQL scene...)
My jqbus experiment explored the idea of attaching these same data
access methods to an XMPP/Jabber account, since then the queries can
pass into more personal data zones, like laptops, home media servers
etc., without needing publicly accessible web-servers. Jabber/XMPP
deals with getting through firewalls/NAT, and then the fact that queries
can be associated with human and organizational XMPP account holders
means that we also have a nice hook for doing access control.
A couple more points.
1. SPARQL queries can also return RDF triples, rather than tables of
bindings. The "CONSTRUCT" mechanism in the query language lets you
specify a simple triple template into which each row of variable
bindings is substituted. While a bit crude (no conditionals etc.) this
is enough for some simple data transformation tricks. The other SPARQL
query form, "ASK", tests whether some pattern matches in the data or
not, for which you get a nice concise boolean response.
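A toy sketch of that CONSTRUCT substitution idea (a real SPARQL engine does this internally; all names and URIs here are invented for the example):

```python
# Each row of variable bindings produced by the WHERE pattern is
# substituted into a fixed triple template.
template = ("?person", "http://xmlns.com/foaf/0.1/knows", "?friend")

rows = [  # bindings the WHERE clause might have produced
    {"?person": "http://example.org/alice",
     "?friend": "http://example.org/bob"},
]

def instantiate(tmpl, bindings):
    """Replace each ?variable in the template with its bound value."""
    return tuple(bindings.get(term, term) for term in tmpl)

triples = [instantiate(template, row) for row in rows]
print(triples)

# The ASK form, by contrast, just reports whether anything matched:
print(len(rows) > 0)  # True
```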
2. SPARQL databases / endpoints aren't simple flat triple stores, there
is a notion of data grouping and layering associated with the "GRAPH"
keyword in the query language. Each triple in the store can be
associated with a URI, which effectively groups it into a sub-set of
data alongside other triples in the same graph. SPARQL queries can
include constraints that talk about these "GRAPH"s too, providing a hook
for applications smart enough to care not only about data merging, but
data provenance/sourcing too.
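A minimal sketch of that grouping idea, with invented URIs: each triple is stored alongside the URI of the graph (document) it came from, and queries can optionally constrain which source they draw on, much as a GRAPH clause does:

```python
# Quads: (subject, predicate, object, graph-URI). The graph URI tags
# each triple with its source document.
quads = [
    ("alice", "knows", "bob",   "http://example.org/layer1.rdf"),
    ("bob",   "knows", "alice", "http://example.org/layer2.rdf"),
]

def match(quads, graph=None):
    """Return triples, optionally restricted to one named graph."""
    return [(s, p, o) for (s, p, o, g) in quads if graph in (None, g)]

print(match(quads))                                         # both triples
print(match(quads, graph="http://example.org/layer1.rdf"))  # layer1's only
```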
See http://svn.foaf-project.org/foaftown/2009/layers/ for a picture of
this, and some sample RDF and queries. I've been meaning to blog this
since scribbling it on a plane ride, so here goes with a first-cut
explanation.
Picture of layers:
This is the top layer in the picture. It says that Alice knows Bob, and
that her school-homepage is http://lookingglass.example.org and that the
organization with that school-homepage has the name "LookingGlass
School". In this story, Alice is the author of the layer1 document,
although the document doesn't claim that explicitly.
This is the middle layer in the picture. It says that there is a person
called Bob, and that Bob knows this person called Alice. It claims the
same school-homepage for Bob, and says that the organization with that
school-homepage has the name "The LookingGlass School". In this story,
Bob is the author of the layer2 document, although the document doesn't
claim that explicitly.
This is the lower layer in the picture. It says that there is a person
called Alice whose school-homepage is http://lookingglass.example.org.
It gives an identifier (short form 'lcc' but this would expand to a full
URI) for the school which has that as its homepage, and says that the
name for this organization is "LookingGlass Community College (formerly
LookingGlass School)". It also mentions two other documents that may
contain related information. In this story, the School is the
author/publisher of this layer3 document, although that is also not
stated explicitly in the data.
Layer 4: (not shown in the picture, except as the attribution images
along the right hand column)
layerlist.rdf states that Alice made layer1.rdf, Bob made layer2.rdf and
the lcc school made layer3.rdf. In this story, layer 4 is effectively
our background knowledge, the administrative or "table of contents"
layer which we believe naively or tentatively so as to have some basis
for asking questions of the rest of the data.
This simple-minded picture has some subtlety. It shows that RDF can
happily represent partial information, as well as its overlaps and
sources. It shows that things are easier when each source uses the same
URIs for the same entities, but that we can even deal with the lack of
direct identifiers, by reasoning about things in terms of identifying
properties such as "homepage". It also shows some of the challenges this
kind of real-world complexity can pose for implementations:
Q: Does the school confirm Alice's claim to have attended? A: Yes
Q: Does the school confirm Bob's claim to have attended? A: No
Q: Did Alice attend the school? A: We don't know.
Q: Did Bob attend the school? A: We don't know.
Q: Which claims of school attendance are made both by the student and
the school? A. Alice's but not Bob's.
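The questions above can be replayed over a toy quad store. The data mirrors the layers story, with each claim tagged by the document that made it; the names and the `confirmed_by_school` helper are simplified stand-ins invented for this sketch:

```python
# One set of (subject, property, object) claims per source document.
claims = {
    "layer1.rdf":  # Alice's own document
        {("Alice", "school-homepage", "http://lookingglass.example.org")},
    "layer2.rdf":  # Bob's own document
        {("Bob", "school-homepage", "http://lookingglass.example.org")},
    "layer3.rdf":  # the school's document (mentions Alice, not Bob)
        {("Alice", "school-homepage", "http://lookingglass.example.org")},
}

def confirmed_by_school(person):
    """A claim counts as 'confirmed' when both the person's own document
    and the school's document assert the same school-homepage."""
    triple = (person, "school-homepage", "http://lookingglass.example.org")
    own_doc = {"Alice": "layer1.rdf", "Bob": "layer2.rdf"}[person]
    return triple in claims[own_doc] and triple in claims["layer3.rdf"]

print(confirmed_by_school("Alice"))  # True  - school agrees with Alice
print(confirmed_by_school("Bob"))    # False - school is silent about Bob
```

Note that the "False" for Bob only means the school doesn't confirm him; as in the Q&A above, whether Bob actually attended remains unknown.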
Ok, maybe this is more than enough to be going on with. If this
explanation makes any sense to someone I'll blog it! See the file
http://svn.foaf-project.org/foaftown/2009/layers/notes.txt for some
example SPARQL queries, but note that you'll probably want to tweak the
files somewhat to use different URIs if running this on your own SPARQL
endpoint.
> Nevertheless, it still seems like a no-brainer even to me that SPARQL could/should be used as part of the hypothetical P2P network under discussion, from the points of view of (i) not re-inventing wheels that work well and (ii) enabling usage of non-FOAF vocabularies in search, as mentioned earlier. This makes your jqbus project all the more relevant since the core of the P2P system would really just be passing SPARQL query strings around via XMPP. The extra work would really just be having each node re-distribute queries it couldn't answer to other nodes on its XMPP roster, and adding a time-to-live counter (to prevent endless propagation of stale queries). I'd need to refresh my understanding of XMPP to be sure there's not more required than this, though. Anyway, if it does turn out to be such a relatively simple extension it feels like it would make more sense to add to your codebase than reimplement from scratch, at least for testing the idea out.
I'm not sure how healthy the codebase is, but it should give something
that can be played with fairly easily, eg. as a 2nd implementation of
some proposed protocol. I updated it a year or so back to use more
recent versions of the XMPP and RDF libraries, but as I mentioned (i)
the specific binding of SPARQL to XMPP needs some more thought, and (ii)
passing around full SPARQL queries might be a bit much, perhaps a
simpler profile/subset is needed, or a full-text oriented alternative.
Perhaps allow direct friends to run more expensive SPARQL queries, but
strangers only get lighter/cheaper options...?
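For what it's worth, the time-to-live forwarding that Luke describes might be sketched like this (every name and data structure here is invented for illustration; a real version would sit on top of the XMPP layer and add the friend/stranger query tiers):

```python
# A node answers a query locally if it can; otherwise it forwards the
# query to its roster with the TTL decremented, and the query dies when
# the TTL reaches zero. 'seen' prevents loops between mutual contacts.
def handle_query(node, query, ttl, seen=None):
    """Return the set of node names that serviced the query."""
    seen = set() if seen is None else seen
    if ttl <= 0 or node["name"] in seen:
        return set()
    seen.add(node["name"])
    if query in node["answers"]:        # answerable locally
        return {node["name"]}
    hits = set()
    for peer in node["roster"]:         # fan out with ttl - 1
        hits |= handle_query(peer, query, ttl - 1, seen)
    return hits

carol = {"name": "carol", "answers": {"q1"}, "roster": []}
bob   = {"name": "bob",   "answers": set(), "roster": [carol]}
alice = {"name": "alice", "answers": set(), "roster": [bob]}

print(handle_query(alice, "q1", ttl=3))  # {'carol'}
print(handle_query(alice, "q1", ttl=2))  # set() - TTL expired en route
```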