A Robust Title Casing Algorithm

Like many problems in computer programming, title casing a string of words seems trivial. We humans do it automatically, but in reality we are applying a complex set of rules. Those rules are codified in the algorithm below, which I originally posted on my personal blog in 2012.


  • I’ve been thinking a lot about title casing lately. I’ve been tagging my huge music library, which exposes me to all the odd variations found in band, song, and album names. Additionally, I’ve recently written an algorithm to support the new Orchard CMS Coverflow module I’ve been working on. This post outlines the logic of the algorithm and includes the C# source code at the bottom.

    The Basic Rules

    In general, title casing means capitalizing the first letter of each word in a string, like this: “My Important Title.” Creating an algorithm to do this is trivial, but unfortunately there are exceptions to this rule.

    The first is that in English there are a handful of words that we agree not to capitalize in titles. The list that I came up with and included is:

    { the, of, or, and, an, a, in, is, are, to, on }

    Unless they are the first or the last word in the title, these words should be lowercased. See the results for yourself: “The Lord Of The Rings” vs. “The Lord of the Rings.” Notice that the first “the” is capitalized, but the one in the middle of the title is lowercased.

    The last word is important too. Take the string “…and the band played on.” The correct title casing should be “…And the Band Played On.”  The last “on” is capitalized because it is at the end. Contrast this with “hop on pop,” which should be cased “Hop on Pop.”
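
    As a minimal sketch of just these basic rules, the snippet below lowercases the listed words unless they come first or last. It assumes words are separated by plain spaces and ignores punctuation entirely; the names BasicTitleCase and Apply are made up for this illustration and are not part of the full listing at the bottom.

```csharp
using System;
using System.Collections.Generic;

public static class BasicTitleCase
{
    // The article's list of words to lowercase in the middle of a title.
    private static readonly HashSet<string> MinorWords = new HashSet<string>(
        new[] { "the", "of", "or", "and", "an", "a", "in", "is", "are", "to", "on" },
        StringComparer.OrdinalIgnoreCase);

    // Assumption: words are separated by spaces only; runs of spaces collapse to one.
    public static string Apply(string title)
    {
        if (string.IsNullOrEmpty(title))
            return title;

        string[] words = title.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < words.Length; i++)
        {
            bool isFirstOrLast = i == 0 || i == words.Length - 1;
            string lower = words[i].ToLowerInvariant();

            // Minor words stay lowercase only in the middle of the title.
            words[i] = !isFirstOrLast && MinorWords.Contains(lower)
                ? lower
                : char.ToUpperInvariant(lower[0]) + lower.Substring(1);
        }
        return string.Join(" ", words);
    }
}
```

    With this sketch, BasicTitleCase.Apply("hop on pop") gives “Hop on Pop”, while the trailing “on” in “and the band played on” is capitalized because it is the last word.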

    Exceptions for Specifically Cased Words

    The list of words to lowercase isn’t the only list of special words we need to consider. Perhaps more important in today’s brand-conscious world are casing exceptions. These are quite common, like Apple’s “i” products: iPhone, iPad, iPod, etc. If you title case those names, they become downright unrecognizable: “Iphone.” At Planet Telex, we’ve built websites for DEMOGala, ScriptSave, WellDyneRx, BioClaim, and others who wouldn’t be happy to have their brand incorrectly cased as all lowercase except for the first letter.

    An example from my music library is the band MUTEMATH. The correct branding is all caps. If my algorithm makes it “Mutemath,” not only is it wrong, it’s totally lame. The nuances don’t stop there, though: consider the band “Portugal. The Man.” Yes, that period is supposed to be there, and the “The” should also be capitalized. That is how the band does it, but it is also natural to the English language. We expect a capital letter after a period. If my algorithm generates “Portugal. the Man,” it is also incorrect and lame.

    So a successful casing algorithm needs two lists of special words: one to specify words to lowercase when in the middle of the title, and the other to specify words that should be cased specifically, like “MUTEMATH” and “iPhone”.
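
    One way to picture the second list is as a case-insensitive lookup that hands back the exact casing to emit. The following is only a sketch under that assumption; the BrandCasing class, its Lookup method, and the sample entries are hypothetical, and the full listing at the bottom stores its lists differently.

```csharp
using System;
using System.Collections.Generic;

public static class BrandCasing
{
    // Sample entries only; a real list would be much longer.
    private static readonly Dictionary<string, string> Cased =
        Build("iPhone", "iPad", "iPod", "MUTEMATH");

    // Keys compare case-insensitively; each value keeps the exact casing to emit.
    private static Dictionary<string, string> Build(params string[] words)
    {
        var map = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (string word in words)
            map[word] = word;
        return map;
    }

    // Returns the specific casing for a word, or null when the word is not listed.
    public static string Lookup(string word)
    {
        if (word == null)
            return null;

        string cased;
        return Cased.TryGetValue(word, out cased) ? cased : null;
    }
}
```

    A caller would check this lookup first and fall back to the ordinary rules only when it returns null, so “IPHONE” and “iphone” both come out as “iPhone”.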

    Nuances in Punctuation

    A robust title casing algorithm needs to be aware of which symbols that separate words should trigger exceptions to the general lowercase rule, like “Portugal. The Man” or “Pinion/Terrible Lie” (which could produce “Pinion/terrible Lie” in an algorithm that didn’t respect the “/” character).

    To surmount this complexity, I’ve created two lists of characters that separate words: a list of “weak” separators and a list of “strong” separators. All of these characters can be seen as flags that separate one word from another; the difference is that after a “strong” separator, the following word should be capitalized, even if it is in the lowercase list.

    The two weak separators are the space and comma. There are more strong ones:

    { . ? ! ( ) { } [ ] < > / & }
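
    The two separator lists can be sketched as a small classifier. The SeparatorKind enum and the Separators class are names invented for this illustration; the character sets are the ones given above.

```csharp
using System;

public enum SeparatorKind { None, Weak, Strong }

public static class Separators
{
    private const string WeakChars = " ,";
    private const string StrongChars = ".?!(){}[]<>/&";

    // Classifies a single character as a weak separator, a strong separator, or neither.
    public static SeparatorKind Classify(char c)
    {
        if (WeakChars.IndexOf(c) >= 0)
            return SeparatorKind.Weak;
        if (StrongChars.IndexOf(c) >= 0)
            return SeparatorKind.Strong;
        return SeparatorKind.None;
    }
}
```

    Under this sketch, the “/” in “Pinion/Terrible Lie” classifies as Strong, so the word after it would be capitalized even if it appeared in the lowercase list.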

    Algorithm Overview

    With the assistance of the lists I’ve defined as well as a few helper methods, the basic algorithm iterates over each character, building words and then adding them when separators are encountered. A separate function handles applying the rules of casing to a single word; the iterator function simply has to drive it.

    The biggest complexity is dealing with the possible variations in punctuation. The least obvious rule, which takes several comment lines to explain in the code, is that once a strong separator is encountered, any spaces that follow must be ignored until the next word is written. This way, the word “and” following both a “)” character and then a space character is correctly uppercased.
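
    To make that rule concrete, here is a stripped-down sketch that only tracks which separator should govern each word and does no casing at all. The EffectiveSeparator class and its Scan method are hypothetical names for this illustration; the real loop in the listing below does the same bookkeeping inline.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EffectiveSeparator
{
    private const string StrongChars = ".?!(){}[]<>/&";
    private const string AllChars = " ," + StrongChars;

    // Pairs each word with the separator that should govern its casing.
    // '\0' means no separator has been seen since the previous word.
    public static List<KeyValuePair<string, char>> Scan(string text)
    {
        var result = new List<KeyValuePair<string, char>>();
        var word = new StringBuilder();
        char effective = '\0';

        foreach (char c in text)
        {
            if (AllChars.IndexOf(c) < 0)
            {
                word.Append(c); // Not a separator: keep building the word.
                continue;
            }

            if (word.Length > 0)
            {
                result.Add(new KeyValuePair<string, char>(word.ToString(), effective));
                word.Clear();
                effective = '\0';
            }

            // A space that follows a strong separator must not overwrite it.
            bool keepStrong = StrongChars.IndexOf(effective) >= 0 && c == ' ';
            if (!keepStrong)
                effective = c;
        }

        if (word.Length > 0)
            result.Add(new KeyValuePair<string, char>(word.ToString(), effective));

        return result;
    }
}
```

    Scanning “(live) and loud” with this sketch pairs “and” with “)” rather than with the space that follows it, which is exactly why the word ends up capitalized.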

    The Code

    The following code is a slightly revised version of the code included in the Planet Telex .NET Library. Some formatting is changed to better fit on the page, and the class name has been contrived for this example. Download or fork the source code at our Planet Telex GitHub account.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using PlanetTelex.Properties;
    
    namespace PlanetTelex.Utilities
    {
      /// <summary>
      /// This class demonstrates a robust title casing algorithm.
      /// </summary>
      public class TitleCaseUtility
      {
        private enum WordPosition { First, Middle, Last }
        private char[] _weakSeparators = new[] { ' ', ',' };
        private char[] _strongSeparators = 
            new[] {'.','?','!','(',')','{','}','[',']','<','>','/','&'};
    
        private IEnumerable<char> AllSeparators
        {
            get { return _weakSeparators.Concat(_strongSeparators); }
        }
    
        /// <summary>
        /// This is the list of specifically cased words embedded in assembly resources.
        /// </summary>
        private static IEnumerable<string> ToCase
        {
            get
            {
                if (_toCase == null)
                    _toCase = Resources.TitleCaseToCase.Split(',', ' ');
    
                return _toCase;
            }
        }
        private static string[] _toCase;
    
        /// <summary>
        /// This is the list of words to lowercase embedded in assembly resources.
        /// </summary>
        private static IEnumerable<string> ToLower
        {
            get
            {
                if (_toLower == null)
                    _toLower = Resources.TitleCaseToLower.Split(',', ' ');
    
                return _toLower;
            }
        }
        private static string[] _toLower;
    
        /// <summary>
        /// This helper method uppercases the first letter of any given string.
        /// </summary>
        public string UppercaseFirstLetter(string toUppercase)
        {
            if (string.IsNullOrEmpty(toUppercase))
                return string.Empty;
    
            return char.ToUpper(toUppercase[0]) + toUppercase.Substring(1);
        }
    
        /// <summary>
        /// This helper method applies casing rules to a single word.
        /// </summary>
        private string CaseWord(string wordToCase, WordPosition wordPosition, 
            char precedingSeparator, string[] casedWords)
        {
            // If the word is in our embedded specifically cased list, return that casing.
            if (ToCase.Contains(wordToCase, StringComparer.OrdinalIgnoreCase))
              return ToCase.FirstOrDefault(
                s => s.Equals(wordToCase, StringComparison.OrdinalIgnoreCase));
    
            // If the word is in the provided specifically cased list, return that casing.
            if (casedWords != null && 
              casedWords.Contains(wordToCase, StringComparer.OrdinalIgnoreCase))
              return casedWords.FirstOrDefault(
                s => s.Equals(wordToCase, StringComparison.OrdinalIgnoreCase));
    
            // If the word is in our embedded list to lowercase, in the middle of the title, 
            // and not after a strong separator, it should be lowercased.
            if (ToLower.Contains(wordToCase, StringComparer.OrdinalIgnoreCase) && 
              wordPosition == WordPosition.Middle && 
              _weakSeparators.Contains(precedingSeparator))
              return wordToCase.ToLower();
    
            // The default casing uppercases the first letter and lowercases the rest.
            return UppercaseFirstLetter(wordToCase.ToLower());
        }
    
        /// <summary>
        /// Replaces a section of a string. This method will help us with a fringe case.
        /// </summary>
        public string ReplaceAt(string toReplaceAt, int removeStartIndex, 
            int removeCount, string toInsert)
        {
            // Argument validation.
            if (toReplaceAt == null) 
                throw new ArgumentNullException("toReplaceAt");
            if (removeStartIndex >= toReplaceAt.Length) 
                throw new ArgumentOutOfRangeException("removeStartIndex");
            // Removing up to and including the final character is valid, so only
            // reject ranges that run past the end of the string.
            if (removeStartIndex + removeCount > toReplaceAt.Length) 
                throw new ArgumentOutOfRangeException("removeCount", 
                    Resources.IndexPlusCountExceedsSize);
    
            // Remove and insert.
            string removed = toReplaceAt.Remove(removeStartIndex, removeCount);
            return removed.Insert(removeStartIndex, toInsert);
        }
    
        /// <summary>
        /// The main title casing algorithm.
        /// </summary>
        public string TitleCase(string toTitleCase, string[] casedWords)
        {
            if (toTitleCase == null)
                return null;
    
            StringBuilder stringBuilder = new StringBuilder();
            string currentWord = string.Empty;
            string lastWord = string.Empty;
            char lastSeparator = '\0';
            int wordCount = 0;
    
            foreach (char c in toTitleCase)
            {
                if (AllSeparators.Contains(c)) // The current character is a separator.
                {
                    if (currentWord.Length > 0)
                    {
                      WordPosition position = wordCount == 0 ? 
                        WordPosition.First : WordPosition.Middle;
                      stringBuilder.Append(
                        CaseWord(currentWord, position, lastSeparator, casedWords));
                      lastWord = currentWord;
                      currentWord = string.Empty;
                      lastSeparator = '\0';
                      wordCount++;
                    }
                    stringBuilder.Append(c);
                    // Set lastSeparator to the current character, unless it is a space AND
                    // the lastSeparator is a strong separator. This is so CaseWord will 
                    // work correctly after strong and space separators happen in succession.
                    if (!(_strongSeparators.Contains(lastSeparator) && char.IsWhiteSpace(c)))
                        lastSeparator = c;
                }
                else // The current character is not a separator.
                    currentWord += c;
            }
    
            if (currentWord.Length > 0) // Add the last word.
                stringBuilder.Append(
                    CaseWord(currentWord, WordPosition.Last, lastSeparator, casedWords));
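            // Re-case the last word when the title ends in a separator. Skip this when
            // no word was found at all (an empty or separator-only input).
            else if (lastWord.Length > 0)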
            {
                string title = stringBuilder.ToString();
                int lastWordIndex = 
                    title.LastIndexOf(lastWord, StringComparison.OrdinalIgnoreCase);
                string toInsert = CaseWord(lastWord, WordPosition.Last, '\0', casedWords);
                return ReplaceAt(title, lastWordIndex, lastWord.Length, toInsert);
            }
            return stringBuilder.ToString();
        }
      }
    }

     

