Text Analyzer

Analyze text statistics and utilities

Understanding Text Analysis

TL;DR

Text analysis counts characters, words, sentences, and paragraphs. It is essential for staying within character limits on social media, in SEO meta tags, and in SMS.

What is Text Analysis?

Text analysis (or text metrics) is the process of computing statistical properties of a piece of text: character count, word count, sentence count, paragraph count, and derived metrics like average word length and estimated reading time.

These metrics may seem simple, but they are critical in many professional contexts. Social media managers need to know character counts to stay within platform limits. SEO specialists optimize meta descriptions and title tags to specific lengths. Translators track word counts for billing. Writers use reading time estimates to calibrate article length.

Modern text analyzers go beyond simple counting. They handle Unicode correctly (where a single “character” can span multiple code points and multiple bytes), identify sentences by punctuation patterns rather than just counting periods, and distinguish between words separated by spaces, hyphens, or line breaks.

Character vs Word vs Sentence Counting

Character Counting

A character count seems straightforward, but edge cases abound:

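  • With or without spaces? Social media platforms and SMS count spaces toward the limit, while some editorial and translation guidelines ask for counts without spaces.
  • Unicode characters: A flag emoji occupies 8 bytes in UTF-8 but displays as 1 character. A string’s .length in JavaScript returns UTF-16 code units, not visual characters. [...str].length counts code points, which is closer but still splits flags and other multi-code-point emoji; use Intl.Segmenter to count grapheme clusters.
  • Newlines: \n is typically counted as 1 character, but \r\n (Windows line endings) is 2 code units, so the same text can report different counts depending on its line endings.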
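
The edge cases above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the analyzer's actual implementation: the function name countCharacters and its includeSpaces option are assumptions made for the example. It counts grapheme clusters with Intl.Segmenter, which also treats \r\n as a single unit.

```javascript
// Reused across calls; "grapheme" granularity yields user-perceived characters.
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Hypothetical helper: counts visible characters, optionally ignoring whitespace.
function countCharacters(text, { includeSpaces = true } = {}) {
  // When spaces are excluded, strip every whitespace character first.
  const input = includeSpaces ? text : text.replace(/\s/gu, "");
  let count = 0;
  for (const _segment of graphemeSegmenter.segment(input)) {
    count += 1;
  }
  return count;
}

const sample = "\u{1F1FA}\u{1F1F8} ok"; // a flag emoji, a space, and "ok"
console.log(sample.length);                                    // UTF-16 code units
console.log(countCharacters(sample));                          // grapheme clusters
console.log(countCharacters(sample, { includeSpaces: false })); // without the space
```

Whether a tool should report the code-unit count or the grapheme count depends on the limit being checked, so a real analyzer may want to expose both.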

Word Counting

Words are typically counted by splitting on whitespace and filtering out empty strings. But what counts as a “word” depends on context:

  • Hyphenated terms: is “well-known” one word or two?
  • Contractions: “don’t” is one word
  • Numbers: “42” and “3.14” are usually counted as words
  • URLs: “https://example.com/path” is typically one word

Most text analyzers split on whitespace boundaries (/\s+/), which handles the majority of cases correctly for European languages. Chinese and Japanese are written without spaces between words and require segmentation algorithms. Korean does use spaces, but whitespace splitting there yields phrase-like units rather than dictionary words.
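
As a rough sketch of both approaches, the snippet below shows whitespace splitting next to a segmenter-based count. The function names are illustrative assumptions, and the segmenter's results for languages written without spaces depend on the ICU data shipped with the runtime.

```javascript
// Whitespace-based count: split on runs of whitespace and drop empty strings.
function countWordsByWhitespace(text) {
  return text.split(/\s+/u).filter((token) => token.length > 0).length;
}

// Segmenter-based count for text that may not separate words with spaces.
// The default locale tag is an assumption chosen for illustration.
function countWordsBySegmenter(text, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let count = 0;
  for (const segment of segmenter.segment(text)) {
    // isWordLike is false for spaces and punctuation.
    if (segment.isWordLike) count += 1;
  }
  return count;
}

console.log(countWordsByWhitespace("well-known don't 42 3.14 https://example.com/path"));
console.log(countWordsBySegmenter("Hello, world!"));
```

The two functions can disagree on hyphenated terms and URLs, because the segmenter applies Unicode word-boundary rules instead of looking only at whitespace.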

Sentence Counting

Sentences are harder to count than you might expect. Naive counting by periods fails on abbreviations (“Dr. Smith”), decimal numbers (“3.14”), and ellipses (“Wait…”). Robust sentence detection looks for sentence-ending punctuation (., !, ?) followed by whitespace and a capital letter, while handling exceptions.
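
A minimal sketch of that heuristic follows. The abbreviation list is a deliberately short, hypothetical sample, and the function name countSentences is an assumption. Production-grade detection needs a much larger exception list or a trained segmenter.

```javascript
// Hypothetical, intentionally tiny abbreviation list for the example.
const ABBREVIATIONS = new Set(["dr", "mr", "mrs", "ms", "prof", "etc", "vs"]);

function countSentences(text) {
  const trimmed = text.trim();
  if (trimmed === "") return 0;

  let boundaries = 0;
  // Candidate boundary: closing punctuation, then whitespace, then an uppercase letter.
  const boundary = /([.!?…]+)\s+(?=\p{Lu})/gu;
  let match;
  while ((match = boundary.exec(trimmed)) !== null) {
    // Look at the word right before the punctuation to spot abbreviations.
    const before = trimmed.slice(0, match.index);
    const wordMatch = before.match(/(\p{L}+)$/u);
    const lastWord = wordMatch ? wordMatch[1].toLowerCase() : "";
    const isAbbreviation = match[1] === "." && ABBREVIATIONS.has(lastWord);
    if (!isAbbreviation) boundaries += 1;
  }
  // The last sentence has no capital letter after it, so add it here.
  return boundaries + 1;
}

console.log(countSentences("Dr. Smith paid 3.14 dollars. Wait... what happened? Nothing!"));
```

In this sketch the decimal point in 3.14 is ignored because no whitespace follows it, and the ellipsis is ignored because a lowercase letter follows it.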

Unicode Considerations

Unicode introduces complexity that trips up many text analyzers:

Grapheme clusters: A single visual character can consist of multiple Unicode code points. The “family” emoji sequence can be 7 or more code points but renders as one character. The letter “é” can be either one code point (U+00E9) or two (e followed by the combining acute accent U+0301), and the two forms look identical.

String length inconsistency: JavaScript’s String.length counts UTF-16 code units. Characters outside the Basic Multilingual Plane (many emoji) use surrogate pairs, so a flag emoji such as "🇺🇸" has a .length of 4 instead of 1. Array.from(str).length counts code points, which still gives 2 for that flag; use Intl.Segmenter for accurate grapheme counting.
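
To make the difference concrete, here is a small sketch that reports all three measures for the same string. The helper name measureLength is an assumption for the example.

```javascript
function measureLength(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codeUnits: str.length,                                // UTF-16 code units
    codePoints: Array.from(str).length,                   // Unicode code points
    graphemes: Array.from(segmenter.segment(str)).length, // user-perceived characters
  };
}

// A flag is two regional-indicator code points, each stored as a surrogate pair.
console.log(measureLength("\u{1F1FA}\u{1F1F8}"));
// "e" followed by a combining acute accent renders as a single letter.
console.log(measureLength("e\u0301"));
```

Only the graphemes value matches what a reader would count on screen.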

Byte count vs character count: UTF-8 uses 1 to 4 bytes per code point. ASCII characters are 1 byte, most accented European letters are 2 bytes, most CJK characters are 3 bytes, and most emoji are 4 bytes per code point (multi-code-point sequences take more). When byte limits matter (SMS, database columns), character count alone is insufficient.
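
A byte count can be obtained with the standard TextEncoder API, which always encodes to UTF-8. The sketch below is illustrative only. The helper names and the 255-byte limit in the usage line are assumptions, not values from any particular database.

```javascript
const utf8Encoder = new TextEncoder(); // TextEncoder only produces UTF-8

// Number of bytes the string occupies once encoded as UTF-8.
function utf8ByteLength(str) {
  return utf8Encoder.encode(str).length;
}

// Hypothetical check against a byte-based limit such as a fixed-width column.
function fitsByteLimit(str, maxBytes) {
  return utf8ByteLength(str) <= maxBytes;
}

console.log(utf8ByteLength("a"));         // ASCII letter
console.log(utf8ByteLength("\u00E9"));    // accented Latin letter
console.log(utf8ByteLength("\u4E2D"));    // CJK ideograph
console.log(utf8ByteLength("\u{1F600}")); // emoji outside the BMP
console.log(fitsByteLimit("caf\u00E9", 255));
```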

Common Use Cases

  • SEO optimization: Crafting meta descriptions (target 150-155 characters) and title tags (target 50-60 characters) within Google’s display limits
  • Social media management: Ensuring posts fit within character limits — Twitter/X (280), LinkedIn (3000 for posts, 120 for headlines), Instagram captions (2200)
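  • SMS segmentation: A single SMS holds 160 characters (GSM-7) or 70 characters (UCS-2, used when the text contains characters outside the GSM-7 alphabet). Longer messages are split into segments of 153 or 67 characters because each segment carries a concatenation header, so going one character over the single-message limit doubles the cost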
  • Content planning: Estimating reading time (average 200-250 words per minute) to calibrate article length for audience attention spans
  • Translation quoting: Translators commonly bill per word. Accurate word counts are essential for cost estimation before a project begins
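
The SMS case is the least obvious of these, so here is a hedged sketch of a segment estimator. It assumes the standard GSM 03.38 default alphabet and extension table, and multipart limits of 153 (GSM-7) and 67 (UCS-2) characters per segment. The function name estimateSmsSegments is hypothetical. The sketch ignores national language shift tables and does not model the rule that an escape pair or surrogate pair cannot be split across segments, so carriers may occasionally bill one segment more.

```javascript
// GSM 03.38 default alphabet: each of these characters costs one septet.
const GSM7_BASIC = new Set(
  "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?" +
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
);
// Extension table: each of these is sent as an escape pair and costs two septets.
const GSM7_EXTENDED = new Set("^{}\\[~]|€\f");

function estimateSmsSegments(text) {
  let septets = 0;
  let isGsm7 = true;
  for (const char of text) {
    if (GSM7_BASIC.has(char)) {
      septets += 1;
    } else if (GSM7_EXTENDED.has(char)) {
      septets += 2;
    } else {
      isGsm7 = false; // any other character forces UCS-2 for the whole message
      break;
    }
  }
  // UCS-2 messages are measured in UTF-16 code units, so most emoji cost 2.
  const units = isGsm7 ? septets : text.length;
  const singleLimit = isGsm7 ? 160 : 70;
  const multipartLimit = isGsm7 ? 153 : 67;
  let segments = 0;
  if (units > 0) {
    segments = units <= singleLimit ? 1 : Math.ceil(units / multipartLimit);
  }
  return { encoding: isGsm7 ? "GSM-7" : "UCS-2", units, segments };
}

console.log(estimateSmsSegments("a".repeat(161)));
console.log(estimateSmsSegments("See you soon \u{1F600}"));
```

A single emoji switches the whole message to UCS-2, which is why a short edit can more than double the number of billed segments.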

Try These Examples

Paragraph of Text

Analysis results: 143 characters (119 without spaces), 25 words, 3 sentences, 1 paragraph. The average word length is about 4.8 characters, counting attached punctuation. This text fits within a tweet (280 chars) and an SMS (160 chars).

The quick brown fox jumps over the lazy dog. This sentence contains every letter of the English alphabet. It is commonly used for font testing.

Empty String

An empty input produces all-zero results: 0 characters, 0 words, 0 sentences, 0 paragraphs. The analyzer handles empty input gracefully without errors.
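
For readers who want to reproduce these examples, the following is a minimal sketch of an analyzer, not the tool's actual code. The function name analyzeText, the naive sentence rule, and the default of 200 words per minute are assumptions. It counts UTF-16 code units, which is adequate for the ASCII sample paragraph but not for emoji-heavy text.

```javascript
function analyzeText(text, wordsPerMinute = 200) {
  // Words: split on whitespace runs and drop empty strings.
  const words = text.split(/\s+/u).filter((word) => word.length > 0);
  // Sentences (naive): closing punctuation followed by whitespace or end of text.
  const sentences = text
    .split(/[.!?]+(?:\s+|$)/u)
    .filter((part) => part.trim().length > 0);
  // Paragraphs: blocks separated by at least one blank line.
  const paragraphs = text
    .split(/\n\s*\n/u)
    .filter((part) => part.trim().length > 0);
  const charactersNoSpaces = text.replace(/\s/gu, "").length;

  return {
    characters: text.length, // UTF-16 code units
    charactersNoSpaces,
    words: words.length,
    sentences: sentences.length,
    paragraphs: paragraphs.length,
    // Word length here includes attached punctuation such as a trailing period.
    averageWordLength: words.length === 0 ? 0 : charactersNoSpaces / words.length,
    readingMinutes: words.length / wordsPerMinute,
  };
}

const sampleParagraph =
  "The quick brown fox jumps over the lazy dog. " +
  "This sentence contains every letter of the English alphabet. " +
  "It is commonly used for font testing.";

console.log(analyzeText(sampleParagraph));
console.log(analyzeText(""));
```

For the sample paragraph this sketch reports 25 words, 3 sentences, and 1 paragraph, and for the empty string every field is zero.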