How it works
What a word cloud reveals — and what it hides
Word clouds are the fastest way to get a gut read on a block of text. In about a second you can tell whether an article is really about what its headline claims, whether a meeting transcript circled around the agenda, or whether a competitor's landing page is keyword-stuffed. The catch: size maps to raw count, not sentiment or importance, so the cloud tells you *frequency*, not *meaning*. Use it as the first pass, then read the underlying sentences.
How this calculator processes your text
The tokeniser runs four deterministic steps on your input so the cloud is repeatable and safe to share:
- Lowercase + Unicode normalise so *Calculadora*, *calculadora* and *CALCULADORA* count as one.
- Split on punctuation and whitespace — only letters, numbers, apostrophes and hyphens survive as part of a word.
- Drop stop-words from a bilingual list (English + Portuguese) and anything shorter than 2 characters.
- Count, rank, and keep the top 25 — font size scales linearly from 0.9 rem (lowest count) to 3 rem (highest).
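The four steps above can be sketched in Python. This is a hypothetical re-implementation for illustration — the live tool's exact stop-word list, normalisation form, and word pattern may differ:

```python
import re
import unicodedata
from collections import Counter

# Tiny bilingual sample; the real English + Portuguese list is far longer.
STOP_WORDS = {
    "the", "and", "of", "to", "in", "is", "it", "that",
    "de", "e", "o", "a", "que", "em", "um", "uma", "para",
}

def tokenise(text: str, cap: int = 25):
    # Step 1: lowercase + Unicode normalise so case/compatibility variants merge.
    text = unicodedata.normalize("NFKC", text).lower()
    # Step 2: split on punctuation/whitespace; letters, digits,
    # internal apostrophes and hyphens survive as part of a word.
    words = re.findall(r"[^\W_]+(?:['\-][^\W_]+)*", text)
    # Step 3: drop stop-words and anything shorter than 2 characters.
    words = [w for w in words if w not in STOP_WORDS and len(w) >= 2]
    # Step 4: count, rank, cap at the top `cap` entries.
    return Counter(words).most_common(cap)

def font_size(count: int, lo: int, hi: int) -> float:
    # Linear scale from 0.9 rem (lowest count) to 3 rem (highest).
    if hi == lo:
        return 3.0
    return 0.9 + (count - lo) / (hi - lo) * (3.0 - 0.9)
```

Because every step is deterministic, pasting the same text twice always yields the same cloud, which is what makes the output safe to share and compare.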
Good inputs for word clouds
The technique shines when you throw long-form text at it: a 1,500-word blog post, a 30-minute interview transcript, a full chapter, a quarter's worth of customer-support tickets. Short inputs (a few paragraphs) give noisy clouds because nearly every word occurs just once.
- Content audits: compare the cloud of a top-ranking page against yours to spot missing entity clusters.
- SEO research: confirm that a long-form piece actually covers the topic it targets.
- Qualitative research: scan open-ended survey answers for themes before coding them.
- Product meetings: paste a Slack channel export to see which topics dominated the quarter.
Reading the ranked table
The cloud is the eye-catching bit, but the table underneath is where decisions get made. It shows count and share (percent of total token occurrences after stop-word removal) so you can do a simple cumulative check: if the top five words together account for more than 40% of the token pool, your text is very narrow; under 15% and it is probably diffuse or poorly themed.
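The cumulative check described above is simple arithmetic; a minimal sketch, assuming the counts have already been stop-word-filtered as the table's share column implies:

```python
def top_share(counts: list[int], k: int = 5) -> float:
    """Share of total token occurrences held by the top-k words."""
    ranked = sorted(counts, reverse=True)
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0

def verdict(share: float) -> str:
    # Thresholds from the rule of thumb above: >40% narrow, <15% diffuse.
    if share > 0.40:
        return "very narrow"
    if share < 0.15:
        return "diffuse or poorly themed"
    return "typical"
```

For example, counts of `[50, 10, 10, 10, 10, 10]` give the top five words a 90% share (very narrow), while a flat distribution of one hundred single-occurrence words gives the top five only 5% (diffuse).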
Limits you should know about
The calculator is purely frequency-based. It does not lemmatise, so *run*, *running* and *ran* are three separate entries. It does not do bigrams, so "New York" splits into *new* and *york*. And stop-word lists are never exhaustive — add your own noise words manually to the input text if they dominate. For deeper NLP (TF-IDF, named-entity recognition, topic modelling) move to a dedicated tool.
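These limits are easy to demonstrate with a naive frequency counter (a hypothetical stand-in — any tokeniser that splits on whitespace and punctuation behaves the same way):

```python
import re
from collections import Counter

def naive_counts(text: str) -> Counter:
    # Pure frequency: no lemmatisation, no bigrams.
    return Counter(re.findall(r"[a-z']+", text.lower()))

c = naive_counts("Run, running, ran. New York is in New York.")
# 'run', 'running' and 'ran' land as three separate entries,
# and "New York" splits into 'new' and 'york'.
```

A lemmatiser would merge the first three into a single *run* entry, and a bigram pass would keep *new york* together — both of which need a dedicated NLP tool, not a frequency counter.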
