Building a Weighted Keyword List


Published on December 28, 2004 at 2:27 PM EST
Last updated on October 9, 2006 at 12:02 AM EST
In the Tutorials category.

Similar to the weighted category list, the weighted keyword list builds a list of keywords used on your site and uses varying font sizes to show how popular they are. This list might be more representative of what your site is about than the weighted category list is. Here’s the list for this site:

adobe backup book camera camporee car cars cat category cd christmas collection collector color com computer cow danandsherree detail digital dpi dvd elsewhere file files film fire florida image life light links monitor movable movie news nikon page paper park photo photos photoshop php pinball print profile radio raw river scan scanner scout scouts shot show shows site steering tamiya trip troop truck type upload vacation web west work year years

I used PHP to put this list together. The code is below; put it into an Index Template to use:

<?php

// Build the list of words and convert everything to lowercase.
$string = strtolower('<MTEntries lastn="10000"><MTEntryCategory remove_html="1" encode_php="q"> <MTEntryTitle remove_html="1" encode_php="q"> <MTEntryBody remove_html="1" encode_php="q"> <MTEntryMore remove_html="1" encode_php="q"> </MTEntries><MTOtherBlog blog_id="7"><MTEntries lastn="10000"><MTEntryCategory remove_html="1" encode_php="q"> <MTEntryTitle remove_html="1" encode_php="q"> <MTEntryBody remove_html="1" encode_php="q"> </MTEntries></MTOtherBlog>');

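// Split the text into words on whitespace, punctuation, and entity characters.
// Digits are kept so the numeric stop words below can match; empty pieces are dropped.
$wordlist = preg_split('/(?:&#0215;|[\s.?,()\-\'"=;&#|$\/:{}])+/', $string, -1, PREG_SPLIT_NO_EMPTY);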

// Build an array of the unique words and number of times they occur.
$a = array_count_values( $wordlist );

//Remove words that don't matter--"stop words."
$overusedwords = array( '', 'a', 'an', 'the', 'and', 'of', 'i', 'to', 'is', 'in', 'with', 'for', 'as', 'that', 'on', 'at', 'this', 'my', 'was', 'our', 'it', 'you', 'we', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', '10', 'about', 'after', 'all', 'almost', 'along', 'also', 'amp', 'another', 'any', 'are', 'area', 'around', 'available', 'back', 'be', 'because', 'been', 'being', 'best', 'better', 'big', 'bit', 'both', 'but', 'by', 'c', 'came', 'can', 'capable', 'control', 'could', 'course', 'd', 'dan', 'day', 'decided', 'did', 'didn', 'different', 'div', 'do', 'doesn', 'don', 'down', 'drive', 'e', 'each', 'easily', 'easy', 'edition', 'end', 'enough', 'even', 'every', 'example', 'few', 'find', 'first', 'found', 'from', 'get', 'go', 'going', 'good', 'got', 'gt', 'had', 'hard', 'has', 'have', 'he', 'her', 'here', 'how', 'if', 'into', 'isn', 'just', 'know', 'last', 'left', 'li', 'like', 'little', 'll', 'long', 'look', 'lot', 'lt', 'm', 'made', 'make', 'many', 'mb', 'me', 'menu', 'might', 'mm', 'more', 'most', 'much', 'name', 'nbsp', 'need', 'new', 'no', 'not', 'now', 'number', 'off', 'old', 'one', 'only', 'or', 'original', 'other', 'out', 'over', 'part', 'place', 'point', 'pretty', 'probably', 'problem', 'put', 'quite', 'quot', 'r', 're', 'really', 'results', 'right', 's', 'same', 'saw', 'see', 'set', 'several', 'she', 'sherree', 'should', 'since', 'size', 'small', 'so', 'some', 'something', 'special', 'still', 'stuff', 'such', 'sure', 'system', 't', 'take', 'than', 'their', 'them', 'then', 'there', 'these', 'they', 'thing', 'things', 'think', 'those', 'though', 'through', 'time', 'today', 'together', 'too', 'took', 'two', 'up', 'us', 'use', 'used', 'using', 've', 'very', 'want', 'way', 'well', 'went', 'were', 'what', 'when', 'where', 'which', 'while', 'white', 'who', 'will', 'would', 'your');

// Remove the stop words from the list.
foreach ($overusedwords as $word) {
 unset( $a[$word] ); }

// Sort the keys alphabetically.
ksort( $a );

// Print the data.
echo '<p class="noindent">';

// Assign a font-size to the word based on frequency of use.
foreach ($a as $word => $count) {
 if ($count <= 35) { $size = 75;
 } elseif ($count <= 50) { $size = 100;
 } elseif ($count <= 65) { $size = 125;
 } elseif ($count <= 80) { $size = 150;
 } elseif ($count <= 95) { $size = 175;
 } elseif ($count <= 110) { $size = 200;
 } elseif ($count <= 125) { $size = 225;
 } elseif ($count <= 140) { $size = 250;
 } elseif ($count <= 155) { $size = 275;
 } elseif ($count <= 170) { $size = 300;
 } else { $size = 340; } // Anything above 170 gets the largest size.

// The keyword needs to be referenced 30 or more times to register.
 if ($count >= 30) {
   echo ' <span style="font-size: ' . $size . '%;"><acronym title="This keyword occurs ' . $count . ' times.">' . $word . '</acronym></span> '; }
}

echo '</p>';

?>

You might want to change the captured data in the $string variable—perhaps to include MTEntryAuthor, MTEntryKeywords, or MTEntryDate.

The “stop list” is probably best edited after you have a list of generated keywords, to determine exactly what you want to exclude. I more-or-less aimed to generate a list of nouns. The list above is probably a good starting point for you, though.
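
One way to decide what belongs in the stop list is to look at the most frequent words before any filtering. The following is a minimal sketch, not part of the template above; the function name top_words and the sample text are made up for illustration, and the simple \W+ split stands in for the fuller punctuation pattern.

```php
<?php
// Hypothetical helper: return the $limit most frequent words in $text,
// most frequent first, so stop-word candidates are easy to spot.
function top_words($text, $limit = 20) {
    // Split on runs of non-word characters and drop empty pieces.
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($words);
    arsort($counts); // Highest counts first; the words stay attached as keys.
    return array_slice($counts, 0, $limit, true);
}

// Sample text standing in for the contents of $string.
$sample = 'The camera and the scanner and the camera';
foreach (top_words($sample, 3) as $word => $count) {
    echo $word . ': ' . $count . "\n";
}
```

Words near the top of this output that say nothing about the site's subject are the ones to add to $overusedwords.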

Only keywords that occur 30 or more times get listed. For me, that produced quite a few results. If it doesn’t produce enough (or produces too many) for you, edit that number as well as the “frequency of use” numbers. It’s probably a good starting point, though.
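
If retuning each threshold by hand gets tedious, one alternative is to scale the font size linearly between the smallest and largest counts. This is a different technique from the fixed steps in the template, offered only as a sketch; the function name scaled_size is hypothetical, the sample counts are made up, and the 75% to 340% range is borrowed from the code above.

```php
<?php
// Hypothetical helper: map $count linearly onto a font-size percentage
// between $minSize and $maxSize, given the smallest and largest counts.
function scaled_size($count, $minCount, $maxCount, $minSize = 75, $maxSize = 340) {
    if ($maxCount <= $minCount) {
        return $minSize; // All counts are equal: avoid dividing by zero.
    }
    $ratio = ($count - $minCount) / ($maxCount - $minCount);
    return (int) round($minSize + $ratio * ($maxSize - $minSize));
}

// Made-up counts standing in for the filtered $a array.
$a = array('camera' => 30, 'photo' => 115, 'scout' => 200);
$min = min($a);
$max = max($a);
foreach ($a as $word => $count) {
    echo '<span style="font-size: ' . scaled_size($count, $min, $max) . '%;">' . $word . '</span> ';
}
```

With this approach the least-used keyword that makes the list always gets the smallest size and the most-used always gets the largest, whatever the actual counts are.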

This article is tagged as: Keywords, Tutorials

Comments

So far, there are 17 comments and Trackbacks on this entry.

1

Hummm… yet another reason I should learn php since this looks very cool

2

Would there be a way to limit the listing to the top X tags, rather than any tags that occur X times? I’m thinking of seeing if I can use this in addition to the keyword index we’re discussing over on my site, and it seems that it might be a little easier just to limit the weighted list to the top 50 or 100 keywords.

Just brainstorming…:)
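
The top-X idea in this comment could be sketched roughly as follows: sort by count, keep the first X entries, then sort alphabetically again for display. The function name top_keywords and the sample counts are assumptions, not part of the original code.

```php
<?php
// Hypothetical helper: keep only the $limit most frequent keywords,
// returned in alphabetical order ready for display.
function top_keywords(array $counts, $limit) {
    arsort($counts);                              // Most frequent first.
    $top = array_slice($counts, 0, $limit, true); // Keep the first $limit entries with their keys.
    ksort($top);                                  // Back to alphabetical order.
    return $top;
}

// Made-up counts standing in for the filtered $a array.
$a = array('web' => 40, 'camera' => 120, 'scout' => 85, 'river' => 31);
print_r(top_keywords($a, 2));
```

In the article's code, this would replace the "30 or more times" check: run it on $a after the stop words are removed and loop over the result.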

3

Hmmm. Another thing: looking over the code above, it looks like rather than pulling from the keywords field, you’re pulling everything from the entry (Title, Entry, and Extended Entry) and processing that text. Seems like what I’ve got in mind for my site might be a bit simpler: it’d limit it to munging through the Keywords field, and I wouldn’t have to strip out commonly-used words, since the only words it would see are words I’d want it to be working with. I think I should be able to adapt your code to my purposes, though.

(It occurs to me that I’m basically using your comments to brainstorm my own ideas. Sorry ‘bout that! ;)

4

Weighted keyword lists are surely familiar to Flickr visitors. The users’ 150 most popular tags are collected on this page. In What’s this blog about?, Adam Kalsey shows which keywords in…

5

Blog entries starting in March, 2005 and beyond will now have subject-appropriate keywords attached to them. They’re currently being linked to Technorati’s tags, and may be used in the future to create a weighted keyword taxonomy list for this blog. Wh…

6

Technology helped me to find out what I am really thinking about - or at least what I am writing about on my blog: Thomas Korte’s Mind Map I added my “Mind Map” with the help of a piece of…

7

I set up folksonomy tags for my Movable Type blog with a fully functional weighted keyword list.

8

With a slight modification to the ‘echo’ code, you can also make each keyword searchable by clicking on it. Kinda cool:

// The keyword needs to be referenced 30 or more times to register.
if ($count >= 30) {
echo ' <span style="font-size: ' . $size . '%;"><acronym title="This keyword occurs ' . $count . ' times."><a href="<$MTCGIPath$><$MTSearchScript$>?IncludeBlogs=<$MTBlogID$>&search=' . $word . '">' . $word . '</a></acronym></span> '; }
}

9

Over on The Artful Manager Andrew Taylor is talking about a weighted word list – i.e. a visual representation of word usage. “Weighted word lists are popping up everywhere nowadays — from the new ‘tagging’ function of Technorati, to…

10

If you’re using tags, you may want to display a “tag cloud” that shows all your tags, weighted by frequency so that the more times the tag is used, the larger it appears (for an example, here’s my del.icio.us tag…

11

envol-flamants-roses.jpg

12

The tag cloud, as seen on the archives page, is adapted ever so slightly from the code posted by the good people at DanAndSherree.com….

13

Building a Weighted Keyword List offers a PHP script to build a ‘tagcloud’. One of the comments adds the ability to click on a keyword to get search results….

14

This is awesome - thanks!

15

Well, I must admit, I have been jealous of Fern’s blog for a little while now. I mean, it is pretty, it has all sorts of WordPress goodies on it and dammit, the girl talks about threesomes. Not that I…

16

So I added a new page to the site, it’s a visual representation of what I’ve been blogging about. The larger the word the more often it’s been mentioned. I’m still tweaking it but generally a word has to show up on the site (in the title or the first…

17

lightning.png superb!
