[rdfweb-dev] fwd: new wordnet utilities

Dan Brickley danbri at w3.org
Fri Aug 1 02:20:12 UTC 2003


More wordnet fun...

----- Forwarded message from ted pedersen <tpederse at D.UMN.EDU> -----

From: ted pedersen <tpederse at D.UMN.EDU>
Date: Thu, 31 Jul 2003 19:04:35 -0500
To: WN-USERS at Princeton.EDU
Subject: new wordnet utilities
Message-ID: <200308010004.h7104ZN21885 at csdev01.d.umn.edu>
Reply-To: ted pedersen <tpederse at D.UMN.EDU>

We have a few new WordNet utility programs available at:

http://www.d.umn.edu/~tpederse/wordnet.html

compounds.pl : a Perl program that lists all of the compounds known to
WordNet. There are about 60,000 of those in WordNet 1.7.1.

compoundSval2.pl: a Perl program that takes a list of compounds as input
(could be the one created above, or anything else) and finds all the
compounds in a Senseval-2 formatted text.

glossExtract.pl : a Perl program that extracts all of the glosses of WordNet
into a single plain text file.

----------------

Now for some fun. :)

It turns out that glossExtract.pl creates a 1.3 million token corpus with
WordNet 1.7.1. I refer to this (informally at least) as the WordNet Gloss
Corpus.

So, I decided to do a few experiments on the Gloss Corpus using the Ngram
Statistics Package (http://www.d.umn.edu/~tpederse/nsp.html).

I ran NSP on the WordNet Gloss Corpus, and I found the top 50 most
significantly associated word bigrams, at least according to the
log-likelihood ratio. They are listed below in the usual NSP format...

w1<>w2<>rank score freq(w1,w2) freq(w1) freq(w2)

United<>States<>1 22689.2655 2414 2544 2420
type<>genus<>2 5656.5382 596 615 1247
North<>America<>3 5317.7833 694 1353 1073
Old<>World<>4 3551.0287 332 443 461
trade<>name<>5 3097.9315 269 344 350
North<>American<>6 2991.3092 455 1353 938
language<>spoken<>7 2841.5560 250 323 349
white<>flowers<>8 2053.0638 363 842 1707
computer<>science<>9 2045.7477 170 301 189
basic<>unit<>10 2020.0912 172 202 305
New<>Zealand<>11 1403.2408 124 496 124
yellow<>flowers<>12 1401.4121 252 590 1707
New<>York<>13 1391.6266 123 496 123
classification<>systems<>14 1375.3708 90 97 98
flower<>heads<>15 1287.8173 103 162 142
tropical<>American<>16 1267.1410 215 741 938
eastern<>North<>17 1227.0490 171 520 591
West<>Indies<>18 1224.4626 97 186 115
World<>War<>19 1216.8101 134 394 272
South<>America<>20 1202.5175 202 595 1073
Roman<>Catholic<>21 1156.1573 90 248 90
drug<>trade<>22 1048.7548 97 160 235
southwestern<>United<>23 1004.8809 136 283 803
western<>North<>24 954.0079 141 497 591
Civil<>War<>25 929.3860 77 82 272
Old<>Testament<>26 904.2413 93 443 134
Great<>Britain<>27 892.2307 63 104 71
eastern<>United<>28 886.3835 146 520 803
perennial<>herbs<>29 881.7449 114 378 401
nervous<>system<>30 878.1139 83 103 382
people<>living<>31 871.7389 82 202 159
flowers<>followed<>32 871.5022 90 605 109
northern<>hemisphere<>33 853.5015 96 436 184
temperate<>regions<>34 836.0081 83 158 259
southeastern<>United<>35 826.9463 123 329 803
feet<>high<>36 825.5220 65 99 116
spinal<>cord<>37 788.0139 55 81 69
human<>beings<>38 772.1660 63 228 69
Catholic<>Church<>39 752.3068 57 86 95
south<>central<>40 738.4469 67 101 226
blood<>vessels<>41 731.3608 65 347 72
counting<>order<>42 727.2843 55 55 195
purple<>flowers<>43 724.0645 123 247 1707
pink<>flowers<>44 718.8534 104 148 1707
reddish<>brown<>45 714.8954 72 132 282
low<>growing<>46 706.0062 75 287 164
ordinal<>number<>47 700.9192 55 55 240
national<>park<>48 699.9874 47 74 53
surgical<>removal<>49 699.0123 49 100 53
States<>writer<>50 688.6361 106 1174 225
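
For readers who want to work with rows in this format, here is a minimal
Python sketch (not part of NSP, which is written in Perl) that parses one
row and computes a log-likelihood ratio (G^2) from a 2x2 contingency table.
The names parse_nsp_line and log_likelihood are hypothetical. The total
number of bigrams (npp) is not given in this message, so it is a parameter
the caller must supply, and the sketch should not be expected to reproduce
the scores listed above. It also assumes that freq(w1) counts w1 in the
first position and freq(w2) counts w2 in the second position, which is
consistent with "North" having different counts in different rows.

```python
import math
from typing import NamedTuple


class BigramRow(NamedTuple):
    w1: str
    w2: str
    rank: int
    score: float
    n11: int  # freq(w1,w2): times the bigram occurs
    n1p: int  # freq(w1): times w1 occurs as the first word (assumed)
    np1: int  # freq(w2): times w2 occurs as the second word (assumed)


def parse_nsp_line(line: str) -> BigramRow:
    """Parse one row of the form 'w1<>w2<>rank score n11 n1p np1'."""
    w1, w2, rest = line.strip().split("<>")
    rank, score, n11, n1p, np1 = rest.split()
    return BigramRow(w1, w2, int(rank), float(score),
                     int(n11), int(n1p), int(np1))


def log_likelihood(n11: int, n1p: int, np1: int, npp: int) -> float:
    """G^2 = 2 * sum(observed * ln(observed / expected)) over the 2x2 table.

    npp is the total number of bigrams in the corpus (supplied by the
    caller; it must be large enough that every cell is non-negative).
    """
    observed = [
        n11,                    # w1 followed by w2
        n1p - n11,              # w1 followed by something else
        np1 - n11,              # something else followed by w2
        npp - n1p - np1 + n11,  # neither
    ]
    expected = [
        n1p * np1 / npp,
        n1p * (npp - np1) / npp,
        (npp - n1p) * np1 / npp,
        (npp - n1p) * (npp - np1) / npp,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)
```

A row can then be scored against a caller-supplied total, for example
log_likelihood(row.n11, row.n1p, row.np1, npp), where npp is whatever
bigram total the counting step reported for the corpus.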

I have not mulled over this data long enough to draw any conclusions, but
it's interesting, I think.

Please let us know if you have any questions or comments about any of
these programs!

Enjoy!
Ted

--
# Ted Pedersen                              http://www.umn.edu/~tpederse #
# Department of Computer Science                        tpederse at umn.edu #
# University of Minnesota, Duluth                                        #
# Duluth, MN 55812                                        (218) 726-8770 #

----- End forwarded message -----


