[rdfweb-dev] Smushing the Semantic Web / Invalid data in IFP's

Morten Frederiksen mof-rdf at mfd-consult.dk
Tue Mar 23 22:55:44 UTC 2004


Hi all,

On Monday 22 March 2004 12:12, Perry Lorier wrote:
> For the last two weeks I've been smushing JibberJims Scutter of FOAF
> data.  Most of this is due to my tinkering as I'm going along, and
> another large chunk was due to treating foaf:mbox_sha1 as case
> insensitive (I suspect a bug with my optimisations).  I believe that if
> I was to start again tomorrow, I'd get it done in about a week.
Hmm, methinks that sounds like a long time...

With a revised schema [1] for Redland's MySQL storage, I managed to smush the 
dataset in just over 16 hours:

rdf-smush [2004-03-22T20:46:56Z]: Model 'jim' contains 6733112 statements.
rdf-smush [2004-03-22T20:46:56Z]: Smushing on 
[http://xmlns.com/foaf/0.1/mbox]...
rdf-smush [2004-03-22T20:47:24Z]: Performing 829 rewrites for 
[http://xmlns.com/foaf/0.1/mbox]...
rdf-smush [2004-03-22T20:48:59Z]: Smushing on 
[http://xmlns.com/foaf/0.1/mbox_sha1sum]...
rdf-smush [2004-03-22T20:56:33Z]: Performing 4417 rewrites for 
[http://xmlns.com/foaf/0.1/mbox_sha1sum]...
rdf-smush [2004-03-22T21:07:43Z]: Smushing on 
[http://xmlns.com/foaf/0.1/jabberID]...
rdf-smush [2004-03-22T21:07:54Z]: Performing 91 rewrites for 
[http://xmlns.com/foaf/0.1/jabberID]...
rdf-smush [2004-03-22T21:08:24Z]: Smushing on 
[http://xmlns.com/foaf/0.1/aimChatID]...
rdf-smush [2004-03-22T21:10:45Z]: Performing 105 rewrites for 
[http://xmlns.com/foaf/0.1/aimChatID]...
rdf-smush [2004-03-22T21:11:23Z]: Smushing on 
[http://xmlns.com/foaf/0.1/icqChatID]...
rdf-smush [2004-03-22T21:11:45Z]: Performing 36 rewrites for 
[http://xmlns.com/foaf/0.1/icqChatID]...
rdf-smush [2004-03-22T21:11:58Z]: Smushing on 
[http://xmlns.com/foaf/0.1/yahooChatID]...
rdf-smush [2004-03-22T21:12:33Z]: Performing 19 rewrites for 
[http://xmlns.com/foaf/0.1/yahooChatID]...
rdf-smush [2004-03-22T21:12:40Z]: Smushing on 
[http://xmlns.com/foaf/0.1/msnChatID]...
rdf-smush [2004-03-22T21:13:01Z]: Performing 18 rewrites for 
[http://xmlns.com/foaf/0.1/msnChatID]...
rdf-smush [2004-03-22T21:14:08Z]: Smushing on 
[http://xmlns.com/foaf/0.1/homepage]...
rdf-smush [2004-03-22T21:15:18Z]: Performing 521 rewrites for 
[http://xmlns.com/foaf/0.1/homepage]...
rdf-smush [2004-03-22T21:18:26Z]: Smushing on 
[http://xmlns.com/foaf/0.1/weblog]...
rdf-smush [2004-03-22T22:11:03Z]: Performing 611188 rewrites for 
[http://xmlns.com/foaf/0.1/weblog]...
rdf-smush [2004-03-23T12:58:55Z]: Done.

The smushing code [2] isn't really geared towards this situation; it's much 
better at somewhat incremental smushing (plus, there was a whole lot of 
swapping going on). I'm trying to figure out how to optimize this smush-a-lot 
situation, probably by creating a rewrite graph, reducing it (some extraneous 
rewrites are performed when smushing on several properties), and then 
applying it. By splitting up the process, it may be possible to have indices 
turned off when doing the actual rewrites - after all, the load times went 
from hours to minutes when rebuilding the indices afterwards instead of 
maintaining them on the run...
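
To make the "rewrite graph" idea a bit more concrete, here is a minimal 
sketch in Python. It is not what rdf-smush.c does; the function names 
(build_rewrites, apply_rewrites), the in-memory triple tuples and the 
"smallest identifier wins" rule are all assumptions made for illustration. 
Nodes that share a value for an inverse functional property are merged with 
a union-find structure, so chains of rewrites across several properties 
collapse to one rewrite per node, which can then be applied in a single pass:

```python
def build_rewrites(triples, ifps):
    """Return {node: canonical_node} for nodes sharing an IFP value.

    `triples` is an iterable of (subject, predicate, object) strings and
    `ifps` a set of predicate URIs treated as inverse functional.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Assumption: the lexically smallest identifier is canonical.
            if rb < ra:
                ra, rb = rb, ra
            parent[rb] = ra

    first_seen = {}  # (predicate, value) -> first subject seen with it
    for s, p, o in triples:
        if p in ifps:
            key = (p, o)
            if key in first_seen:
                union(first_seen[key], s)
            else:
                first_seen[key] = s

    # Reduced rewrite set: one entry per non-canonical node.
    return {n: find(n) for n in parent if find(n) != n}


def apply_rewrites(triples, rewrites):
    """Rewrite subjects and objects in one pass, dropping duplicates.

    Assumption: literals are written with their quotes, so a literal
    can never collide with a node identifier.
    """
    seen = set()
    out = []
    for s, p, o in triples:
        t = (rewrites.get(s, s), p, rewrites.get(o, o))
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out
```

With three blank nodes where _:a and _:b share a foaf:mbox and _:b and _:c 
share a foaf:homepage, build_rewrites yields two rewrites (_:b and _:c both 
go to _:a) rather than a chain of _:c to _:b to _:a. Only apply_rewrites 
touches the statements, so that is the step that could run with indices 
turned off.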

> I also noticed, peering into the scroll, that there was a lot of rather
> invalid data in IFP's ...
Yep, that's a problem; I'm not sure how to handle that in an intelligent and 
manageable fashion.

However, I disagree with your numbers. The highest count for a value of 
foaf:jabberID is 15 for "wocky", with "Wocky" in second place with 8 
occurrences (and danbri tied for third with "wtf?" with 7 occurrences!).
I've confirmed these numbers with grep, cut, sort and uniq on the original 
input file. How did you arrive at your multi-hundred numbers?
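
For anyone who wants to repeat that kind of count without the shell tools, 
a rough Python equivalent could look like the sketch below. It assumes the 
input is N-Triples, one statement per line, with plain literal values; the 
function name count_literal_values is made up for this example. The count 
is case sensitive, so "wocky" and "Wocky" stay separate:

```python
import re
from collections import Counter

JABBER_ID = "http://xmlns.com/foaf/0.1/jabberID"


def count_literal_values(lines, predicate):
    """Count how often each literal value occurs for `predicate`.

    Assumption: `lines` are N-Triples statements such as
    _:a <http://xmlns.com/foaf/0.1/jabberID> "wocky" .
    Lines with other predicates or non-literal objects are ignored.
    """
    pattern = re.compile(
        r'^\S+\s+<' + re.escape(predicate) + r'>\s+"((?:[^"\\]|\\.)*)"'
    )
    counts = Counter()
    for line in lines:
        match = pattern.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Calling count_literal_values(open(path), JABBER_ID).most_common(3) on the 
input file (path being wherever the dump lives) lists the three most 
frequent values with their counts.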


[1] http://www.wasab.dk/morten/2004/03/redland-mysql-schema.sql
[2] http://www.wasab.dk/morten/2004/03/rdf-smush.c


Regards,
Morten


