15 comments on “A word cloud where the x and y axes mean something”

    • Probably not too well. I tried it up to a few thousand terms and it looked fine. This is more of a prototype than ready-to-use software that could compete with word-cloud generators. I may eventually experiment with better layout algorithms to remove whitespace. I think that would address your question. Thanks for the feedback.

  1. One could make much better use of the x-axis than sorting alphabetically. Two ideas:
    - clustering based on closeness of the words (closeness in terms of meaning). This requires a database of distances. For satisfaction and related topics, just sorting by valence (positive or negative) would be great and much easier to do! With one short look you can tell something about the direction of the comments.
    - clustering based on the closeness of those words within the text, i.e. how many words lie between the words in a cluster.
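As a minimal sketch of the valence idea above, assuming a word-to-score lexicon is available: the `EXAMPLE_VALENCE` dictionary and the function name `x_positions_by_valence` are hypothetical and not part of the original visualization. Words are placed by rank rather than by raw score, so they spread evenly across the axis.

```python
# Hypothetical valence scores in [-1, 1]; a real sentiment lexicon would replace this.
EXAMPLE_VALENCE = {
    "great": 0.8,
    "helpful": 0.6,
    "slow": -0.4,
    "terrible": -0.9,
}


def x_positions_by_valence(words, valence, default=0.0):
    """Map each distinct word to an x position in [0, 1], most negative on the left."""
    unique = set(words)
    # Sort by valence first, then alphabetically so ties are deterministic.
    ordered = sorted(unique, key=lambda w: (valence.get(w, default), w))
    if len(ordered) == 1:
        return {ordered[0]: 0.5}
    last = len(ordered) - 1
    return {w: i / last for i, w in enumerate(ordered)}


positions = x_positions_by_valence(
    ["great", "terrible", "slow", "service"], EXAMPLE_VALENCE
)
for word, x in sorted(positions.items(), key=lambda item: item[1]):
    print(f"{x:.2f}  {word}")
```

Words missing from the lexicon fall back to a neutral score of 0.0 here, which is only one possible choice; dropping them or parking them in a separate column would work too.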

  2. I think word clouds have potential and you’re definitely moving in the right direction. Joint_Posterior’s ideas would also be excellent to incorporate. As far as color goes, I’d do random unless supplied with word lists that map to a color. I’ve wrapped Ian Fellows’ wordcloud function to do something like this. I’m going to release it in a package this summer. I’m very interested in the work you’re doing here. People are drawn to word clouds, so why not play on that and improve them to show some more useful information? They have the potential to show a lot of information in a very small space.

  3. I found your visualization intriguing and used it to visualize stemmed search terms from a web log analysis. The lower quantiles were small compared to the upper so I decided to go with quartiles. I agree that it would be nice to make more of the x-axis, but I’m not sure what that would be for my purposes.

  4. Would it make more sense to label the quantile groups with the central quantile rather than the upper bound? E.g. the group representing the upper quintile (80-100) would probably be better labelled “90” than “100”, I think.

    A definite improvement, in any case. In some situations some form of clustering of related terms might be able to take place along the x-axis, or perhaps by color.
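The midpoint labelling suggested above can be sketched as follows, assuming equal-width quantile groups; the function name `quantile_group_labels` is made up for illustration.

```python
def quantile_group_labels(n_groups):
    """Return (lower, upper, midpoint) percentages for n equal-width quantile groups."""
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    width = 100 / n_groups
    return [(i * width, (i + 1) * width, (i + 0.5) * width) for i in range(n_groups)]


quintiles = quantile_group_labels(5)
for lower, upper, mid in quintiles:
    # Label each group with its central quantile instead of its upper bound.
    print(f"{lower:.0f}-{upper:.0f} labelled {mid:g}")
```

With five groups, the top one spans 80-100 and gets the label 90, matching the example in the comment.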

  5. These are cool. I don’t think there’s any need to worry about supplanting the popular word clouds, since this fulfills a bit of a different need.

    How about letting color indicate verb, noun, adjective, etc.?

  6. @Tom that rather limits the flexibility of color. I have used color in this way but it’s more sensible to supply any word list(s) and have these be colored, particularly if bridgewater turns this into a function that he dumps into a package. So the function would detect how many word lists there are and either automatically choose a different color for each one or, if the user supplies a color list to the function, color the word lists with the user-defined colors. Often in data mining you’re interested in different word lists such as polarity, subjectivity etc.
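A rough sketch of the behaviour described above, not the commenter's actual wrapper: `assign_word_colors`, `DEFAULT_PALETTE`, and the sample word lists are all hypothetical. It counts the supplied word lists, picks one default color per list unless the caller passes its own, and leaves unlisted words to a fallback color.

```python
# Assumed default palette (hex colors); any set of distinct colors would do.
DEFAULT_PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]


def assign_word_colors(word_lists, colors=None):
    """Return a {word: color} mapping with one color per supplied word list."""
    if colors is None:
        if len(word_lists) > len(DEFAULT_PALETTE):
            raise ValueError("more word lists than default colors; pass colors=")
        colors = DEFAULT_PALETTE[: len(word_lists)]
    elif len(colors) != len(word_lists):
        raise ValueError("need exactly one color per word list")
    mapping = {}
    for words, color in zip(word_lists, colors):
        for word in words:
            # If a word appears in several lists, the first list wins.
            mapping.setdefault(word.lower(), color)
    return mapping


positive = ["good", "great"]
negative = ["bad", "awful"]
auto = assign_word_colors([positive, negative])
custom = assign_word_colors([positive, negative], colors=["green", "red"])

# Words outside every list fall back to a neutral color at lookup time.
print(custom.get("great", "grey"), custom.get("table", "grey"))
```

Grammatical categories, as suggested earlier in the thread, would then just be another set of word lists passed to the same function.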

    • A good point, I agree. I suppose mapping to grammatical elements would just be another word list.

      Let’s not forget as well the power in the simplicity of these graphs. There’s power in comprehensible complexity as well, so both directions should be considered. If I were making this and wanted to have color in as default (which I think is a good idea even if it’s meaningless), then I think coloring by grammar, without drawing attention to it with a scale or the like, would be a notch above random.

  7. @Tom your point is well taken; I think a default of that sort would be helpful. I’ve never been a real big fan of using color for frequency when it’s already given by word size, and so I think purely random colors, as I first suggested, make even less sense than that.

  8. Thanks for the great ideas and feedback. I think the coloring and spatial properties could definitely benefit from deeper text-mining features (parts of speech, co-occurrence, sentiment, etc.). I agree that this will not replace wordclouds any time soon. Hopefully some later iteration of this will look as nice as the standard cloud from an aesthetic point of view. I think one of the big things is the density of words and overlap. To improve that much I will need custom layout code (maybe igraph would be useful).

    Thanks again.

  9. igraph is one option; another way to go would be to tear apart Ian Fellows’ wordcloud code. I know that you can choose to use an R layout or a C layout. The C is much nicer but that requires learning a new language if you are not already a C user (I’m not). Obviously you’re doing something different but this is a possible starting point for dealing with spatial formatting issues associated with overplotting of large text files. Ian has been pretty responsive to me and he may be willing to work with you on something to produce a new hybrid word cloud. I don’t think you’ll ever get the same aesthetic appeal as the standard word cloud but that’s really not what you’re after. It’s something new altogether: it’s about presenting as much information as you can, in the most interpretable format and with as much appeal as you can, but the information presented takes precedence IMHO.

  10. If color does not signify anything, you should not use colors. Random is worse than a single color.

    Likewise, if the X-axis does not signify anything, you should not use an X-axis.

    Which would simply leave you with a list of words, sorted by frequency, with font as a monotonic function of freq. That would be the easiest way to represent the one salient feature you have (freq). I do like the addition of the quantile dividers, as that helps to visualize the distribution.

    I’ve never liked word-clouds, I think they are a very poor way of representing something in 2 dimensions that should be represented in 1. I find myself darting around the cloud randomly, looking for the biggest words. That’s just too much cognitive load. And along the way, small, unimportant words jump out at me for reasons completely unrelated to the word’s importance, misleading my subconscious into thinking they are important.

    If the words had some other salient property that you wanted to visualize (POS, sentiment, semantic relatedness, monetizability, etc. etc.), then use of 2 dimensions and/or color may be warranted.
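The one-dimensional alternative described above (a list sorted by frequency, with font size a monotonic function of frequency) could look roughly like this. The log scaling, the point-size range, and the name `frequency_list` are assumptions chosen for illustration; any increasing function of the count would satisfy the description.

```python
import math
from collections import Counter


def frequency_list(tokens, min_pt=8.0, max_pt=36.0):
    """Return (word, count, font_size) rows sorted from most to least frequent."""
    counts = Counter(tokens)
    if not counts:
        return []
    # Sort by descending count, then alphabetically to keep ties stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    low = math.log(min(counts.values()))
    high = math.log(max(counts.values()))
    span = high - low
    rows = []
    for word, n in ranked:
        # Log scaling is monotonic in n; if every count is equal, use max_pt.
        frac = (math.log(n) - low) / span if span else 1.0
        rows.append((word, n, min_pt + frac * (max_pt - min_pt)))
    return rows


rows = frequency_list("a a a b b c".split())
for word, n, size in rows:
    print(f"{word:<5} {n:>3}  {size:.1f}pt")
```

Quantile dividers like the ones praised in the comment could be drawn between rows of this list wherever the cumulative rank crosses a quantile boundary.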

    • Hi Chris,

      I agree with you. The main reason I did the colors was for readability. Since there is some word overlap it helped to have different colors for the overlapping words. But I think there are better uses for color.

