Building a Weighted Keyword List

Published on December 28, 2004 at 2:27 PM EST
In the Tutorials category.

Similar to the weighted category list, the weighted keyword list builds a list of keywords used on your site and uses varying font sizes to show how popular they are. This list might be more representative of what your site is about than the weighted category list is. Here’s the list for this site:

adobe backup book camera camporee car cars cat category cd christmas collection collector color com computer cow danandsherree detail digital dpi dvd elsewhere file files film fire florida image life light links monitor movable movie news nikon page paper park photo photos photoshop php pinball print profile radio raw river scan scanner scout scouts shot show shows site steering tamiya trip troop truck type upload vacation web west work year years

I used PHP to put this list together. The code is below; to use it, put it between <?php and ?> tags in an Index Template whose output file your server runs as PHP:


// Build the list of words and convert everything to lowercase.
$string = strtolower('<MTEntries lastn="10000"><MTEntryCategory remove_html="1" encode_php="q"> <MTEntryTitle remove_html="1" encode_php="q"> <MTEntryBody remove_html="1" encode_php="q"> <MTEntryMore remove_html="1" encode_php="q"> </MTEntries><MTOtherBlog blog_id="7"><MTEntries lastn="10000"><MTEntryCategory remove_html="1" encode_php="q"> <MTEntryTitle remove_html="1" encode_php="q"> <MTEntryBody remove_html="1" encode_php="q"> </MTEntries></MTOtherBlog>');

// Split the text into words wherever whitespace or punctuation occurs.
$wordlist = preg_split('/\s*[\s+\.|\?|,|(|)|\-+|\'|\"|=|;|&#0215;|\$|\/|:|{|}]\s*/i', $string);

// Build an array of the unique words and number of times they occur.
$a = array_count_values( $wordlist );

// List the words that don't matter: the "stop words."
$overusedwords = array( '', 'a', 'an', 'the', 'and', 'of', 'i', 'to', 'is', 'in', 'with', 'for', 'as', 'that', 'on', 'at', 'this', 'my', 'was', 'our', 'it', 'you', 'we', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', '10', 'about', 'after', 'all', 'almost', 'along', 'also', 'amp', 'another', 'any', 'are', 'area', 'around', 'available', 'back', 'be', 'because', 'been', 'being', 'best', 'better', 'big', 'bit', 'both', 'but', 'by', 'c', 'came', 'can', 'capable', 'control', 'could', 'course', 'd', 'dan', 'day', 'decided', 'did', 'didn', 'different', 'div', 'do', 'doesn', 'don', 'down', 'drive', 'e', 'each', 'easily', 'easy', 'edition', 'end', 'enough', 'even', 'every', 'example', 'few', 'find', 'first', 'found', 'from', 'get', 'go', 'going', 'good', 'got', 'gt', 'had', 'hard', 'has', 'have', 'he', 'her', 'here', 'how', 'if', 'into', 'isn', 'just', 'know', 'last', 'left', 'li', 'like', 'little', 'll', 'long', 'look', 'lot', 'lt', 'm', 'made', 'make', 'many', 'mb', 'me', 'menu', 'might', 'mm', 'more', 'most', 'much', 'name', 'nbsp', 'need', 'new', 'no', 'not', 'now', 'number', 'off', 'old', 'one', 'only', 'or', 'original', 'other', 'out', 'over', 'part', 'place', 'point', 'pretty', 'probably', 'problem', 'put', 'quite', 'quot', 'r', 're', 'really', 'results', 'right', 's', 'same', 'saw', 'see', 'set', 'several', 'she', 'sherree', 'should', 'since', 'size', 'small', 'so', 'some', 'something', 'special', 'still', 'stuff', 'such', 'sure', 'system', 't', 'take', 'than', 'their', 'them', 'then', 'there', 'these', 'they', 'thing', 'things', 'think', 'those', 'though', 'through', 'time', 'today', 'together', 'too', 'took', 'two', 'up', 'us', 'use', 'used', 'using', 've', 'very', 'want', 'way', 'well', 'went', 'were', 'what', 'when', 'where', 'which', 'while', 'white', 'who', 'will', 'would', 'your');

// Remove the stop words from the list.
foreach ($overusedwords as $word) {
 unset( $a[$word] ); }

// Sort the keys alphabetically.
ksort( $a );

// Print the data.
echo '<p class="noindent">';

// Assign a font-size to the word based on frequency of use.
foreach ($a as $word => $count) {
 if ($count <= 35) { $size = 75;
 } elseif ($count <= 50) { $size = 100;
 } elseif ($count <= 65) { $size = 125;
 } elseif ($count <= 80) { $size = 150;
 } elseif ($count <= 95) { $size = 175;
 } elseif ($count <= 110) { $size = 200;
 } elseif ($count <= 125) { $size = 225;
 } elseif ($count <= 140) { $size = 250;
 } elseif ($count <= 155) { $size = 275;
 } elseif ($count <= 170) { $size = 300;
 } else { $size = 340; } // Anything above 170 gets the largest size.

// The keyword needs to be referenced 30 or more times to register.
 if ($count >= 30) {
   echo ' <span style="font-size: ' . $size . '%;"><acronym title="This keyword occurs ' . $count . ' times.">' . $word . '</acronym></span> '; }
} // End of the foreach loop.

echo '</p>';


You might want to change the captured data in the $string variable—perhaps to include MTEntryAuthor, MTEntryKeywords, or MTEntryDate.

The “stop list” is probably best edited after you have generated a list of keywords, so you can see exactly what you want to exclude. I more or less aimed to generate a list of nouns. The list above is probably a good starting point for you, though.
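
One way to find stop-word candidates is to look at the most frequent words before anything is removed. Here is a minimal sketch of that idea. It is not part of the template above: the topWords() function and the sample text are made up for illustration, and the split pattern is a simplified one rather than the pattern used in the code.

```php
<?php
// Hypothetical helper: return the $limit most frequent words in $text,
// most frequent first, so stop-word candidates are easy to spot.
function topWords($text, $limit = 20)
{
    // Split on anything that is not a lowercase letter or digit
    // (a simplified pattern, not the one from the template).
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // Count occurrences, then sort by count, highest first, keeping the words as keys.
    $counts = array_count_values($words);
    arsort($counts);

    // Keep only the top entries; "true" preserves the keys.
    return array_slice($counts, 0, $limit, true);
}

// Sample text, made up for this example.
$sample = 'The camera took the photo. The photo was printed; the print was small.';
foreach (topWords($sample, 3) as $word => $count) {
    echo $word . ': ' . $count . "\n";
}
```

Running something like this against your own entries should make it clear which common words crowd out the nouns you care about.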

Only keywords with 30 or more occurrences get listed. For me, that produced quite a few results. If it doesn’t produce enough (or produces too many) for you, edit that number as well as the “frequency of use” numbers. The values above are probably a good starting point, though.
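
If tuning each threshold by hand gets tedious, the font size could instead be computed from the counts themselves. The sketch below is an alternative, not the method used in the code above: fontSizeFor() is a hypothetical function, the sample counts are made up, and the 75% to 340% range is simply borrowed from the smallest and largest sizes in the template.

```php
<?php
// Hypothetical alternative to the if/elseif ladder: map a count linearly
// onto a font-size range, so the thresholds need no hand-tuning.
function fontSizeFor($count, $minCount, $maxCount, $minSize = 75, $maxSize = 340)
{
    if ($maxCount <= $minCount) {
        return $minSize; // Avoid dividing by zero when all counts are equal.
    }

    // Clamp so counts outside the range still get a valid size.
    $count = max($minCount, min($maxCount, $count));

    // Position of this count between the lowest and highest counts (0 to 1).
    $ratio = ($count - $minCount) / ($maxCount - $minCount);

    return (int) round($minSize + $ratio * ($maxSize - $minSize));
}

// Made-up counts for illustration; 30 matches the cutoff used in the template.
$counts = array('camera' => 30, 'photo' => 115, 'scan' => 200);
$min = min($counts);
$max = max($counts);
foreach ($counts as $word => $count) {
    echo $word . ': ' . fontSizeFor($count, $min, $max) . "%\n";
}
```

With this approach, only the cutoff and the two size limits would need editing when your keyword counts grow.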