How do I split a string by a number of characters?

The following code snippet shows how to split a string into chunks of a fixed number of characters. We create a method called splitToNChar() that takes two arguments. The first argument is the string to be split and the second argument is the split size.

The splitToNChar() method splits the string in a for loop. First we create a List object that will store the parts of the split string. Next we loop over the text in steps of the split size, take the substring of that size (the last part may be shorter when the length is not a multiple of the size), and store it in the List. After the entire string has been read we convert the List into an array of String using the List’s toArray() method.

Let’s see the code snippet below:

package org.kodejava.example.lang;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitStringForEveryNChar {
    public static void main(String[] args) {
        String text = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

        System.out.println(Arrays.toString(splitToNChar(text, 3)));
        System.out.println(Arrays.toString(splitToNChar(text, 4)));
        System.out.println(Arrays.toString(splitToNChar(text, 5)));
    }

    /**
     * Split text into n number of characters.
     *
     * @param text the text to be split.
     * @param size the split size.
     * @return an array of the split text.
     */
    private static String[] splitToNChar(String text, int size) {
        List<String> parts = new ArrayList<>();

        int length = text.length();
        for (int i = 0; i < length; i += size) {
            parts.add(text.substring(i, Math.min(length, i + size)));
        }
        return parts.toArray(new String[0]);
    }
}

When run, the code snippet will output:

[ABC, DEF, GHI, JKL, MNO, PQR, STU, VWX, YZ]
[ABCD, EFGH, IJKL, MNOP, QRST, UVWX, YZ]
[ABCDE, FGHIJ, KLMNO, PQRST, UVWXY, Z]
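The loop above assumes a positive split size: with a size of 0 the index never advances, and a negative size makes substring() fail. The following is a minimal sketch of one way to guard against that, assuming we prefer to reject such input up front. The class name SplitToChunksSketch and the method splitToChunks() are hypothetical names used only for illustration, and the method returns a List instead of an array.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitToChunksSketch {
    public static void main(String[] args) {
        // Prints [ABCD, EFGH, IJ]
        System.out.println(splitToChunks("ABCDEFGHIJ", 4));
    }

    // Hypothetical helper: same loop as splitToNChar(), plus a size check.
    static List<String> splitToChunks(String text, int size) {
        if (size <= 0) {
            // A size of 0 would never advance the loop below.
            throw new IllegalArgumentException("size must be positive: " + size);
        }
        List<String> parts = new ArrayList<>();
        int length = text.length();
        for (int i = 0; i < length; i += size) {
            // The last chunk may be shorter than size.
            parts.add(text.substring(i, Math.min(length, i + size)));
        }
        return parts;
    }
}
```

An empty input string simply produces an empty list, because the loop body never runs.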

How do I split a string with multiple spaces?

This code snippet shows you how to split a string on runs of white-space characters. To split the string this way we use the "\s+" regular expression, written as "\\s+" in a Java string literal. The white-space characters matched by \s include space, tab, line feed (new line), carriage return, form feed, and vertical tab.

Let’s see the code snippet below:

package org.kodejava.example.lang;

import java.util.Arrays;

public class SplitStringMultiSpaces {
    public static void main(String[] args) {
        String text = "18/08/2012    SHOES      RUNNING RED   99.9 USD";

        // Split the line of text on runs of one or more white-space
        // characters using the \s+ regex.
        String[] items = text.split("\\s+");
        System.out.println("Length = " + items.length);
        System.out.println("Items  = " + Arrays.toString(items));
    }
}

The result of the code snippet is:

Length = 6
Items  = [18/08/2012, SHOES, RUNNING, RED, 99.9, USD]
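One detail worth knowing: when the text starts with white space, split("\\s+") returns an empty string as the first element, while trailing empty strings are dropped. Below is a small sketch of one way to handle this, assuming that trimming the input first is acceptable. The class SplitTrimSketch and the helper splitOnWhitespace() are hypothetical names, not part of the original example.

```java
import java.util.Arrays;

public class SplitTrimSketch {
    public static void main(String[] args) {
        String text = "   SHOES   RUNNING RED  ";

        // Leading white space produces an empty first element:
        // prints [, SHOES, RUNNING, RED]
        System.out.println(Arrays.toString(text.split("\\s+")));

        // Trimming first avoids it: prints [SHOES, RUNNING, RED]
        System.out.println(Arrays.toString(splitOnWhitespace(text)));
    }

    // Hypothetical helper: trims the text, then splits on white-space runs.
    static String[] splitOnWhitespace(String text) {
        String trimmed = text.trim();
        // Splitting an empty string would return one empty element,
        // so return an empty array instead.
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }
}
```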

How do I break a paragraph into sentences?

This example shows you how to use BreakIterator.getSentenceInstance() to break a paragraph into the sentences that compose it. To get the BreakIterator instance we call the getSentenceInstance() factory method and pass it a locale.

In the count(BreakIterator bi, String source) method we iterate over the boundaries to extract the sentences of the paragraph, whose value is stored in the paragraph variable.

package org.kodejava.example.text;

import java.text.BreakIterator;
import java.util.Locale;

public class BreakSentenceExample {
    public static void main(String[] args) {
        String paragraph =
                "Line boundary analysis determines where a text " +
                "string can be broken when line-wrapping. The " +
                "mechanism correctly handles punctuation and " +
                "hyphenated words. Actual line breaking needs to " +
                "also consider the available line width and is " +
                "handled by higher-level software. ";

        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);

        int sentences = count(iterator, paragraph);
        System.out.println("Number of sentences: " + sentences);
    }

    private static int count(BreakIterator bi, String source) {
        int counter = 0;
        bi.setText(source);

        int lastIndex = bi.first();
        while (lastIndex != BreakIterator.DONE) {
            int firstIndex = lastIndex;
            lastIndex = bi.next();

            if (lastIndex != BreakIterator.DONE) {
                String sentence = source.substring(firstIndex, lastIndex);
                System.out.println("sentence = " + sentence);
                counter++;
            }
        }
        return counter;
    }
}

Our program will print the following result on the console screen:

sentence = Line boundary analysis determines where a text string can be broken when line-wrapping. 
sentence = The mechanism correctly handles punctuation and hyphenated words. 
sentence = Actual line breaking needs to also consider the available line width and is handled by higher-level software. 
Number of sentences: 3
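If we need the sentences themselves and not only their count, the same iteration can collect them into a List. This is a minimal sketch under that assumption. The class SentenceListSketch and the method sentences() are hypothetical names, and trimming each sentence is a choice made here to drop the trailing space that the iterator keeps.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceListSketch {
    public static void main(String[] args) {
        String paragraph =
                "Line boundary analysis determines where a text " +
                "string can be broken when line-wrapping. The " +
                "mechanism correctly handles punctuation and " +
                "hyphenated words. ";

        for (String sentence : sentences(paragraph, Locale.US)) {
            System.out.println("sentence = " + sentence);
        }
    }

    // Hypothetical helper: returns the trimmed sentences of the source text.
    static List<String> sentences(String source, Locale locale) {
        BreakIterator bi = BreakIterator.getSentenceInstance(locale);
        bi.setText(source);

        List<String> result = new ArrayList<>();
        int start = bi.first();
        // Walk the boundaries pairwise: [start, end) is one sentence.
        for (int end = bi.next(); end != BreakIterator.DONE;
             start = end, end = bi.next()) {
            String sentence = source.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }
}
```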

How do I break a text or sentence into words?

At first it might look simple. We can just split the text using String.split(), with a space as the delimiter. But what if a word ends with a question mark (?) or an exclamation mark (!) instead? There might be other rules that we also need to take care of.

Using java.text.BreakIterator makes this much simpler. The class’s getWordInstance() factory method creates a BreakIterator instance for word breaks. Passing a locale when creating the BreakIterator makes the iterator break the text or sentence according to the rules of that locale. This is really helpful when we are working with a complex language such as Japanese or Chinese.

Let us see an example of using the BreakIterator below.

package org.kodejava.example.text;

import java.text.BreakIterator;
import java.util.Locale;

public class BreakIteratorExample {
    public static void main(String[] args) {
        String data = "The quick brown fox jumps over the lazy dog.";
        String search = "dog";

        // Gets an instance of BreakIterator for word breaks for the
        // given locale. We can also instantiate a BreakIterator without
        // specifying the locale. The locale is important when we
        // are working with languages like Japanese or Chinese where
        // the word-break rules may differ from those of English.
        BreakIterator bi = BreakIterator.getWordInstance(Locale.US);

        // Set the text string to be scanned.
        bi.setText(data);

        // Iterates the boundary / breaks
        System.out.println("Iterates each word: ");
        int count = 0;
        int lastIndex = bi.first();
        while (lastIndex != BreakIterator.DONE) {
            int firstIndex = lastIndex;
            lastIndex = bi.next();

            if (lastIndex != BreakIterator.DONE
                && Character.isLetterOrDigit(data.charAt(firstIndex))) {
                String word = data.substring(firstIndex, lastIndex);
                System.out.printf("'%s' found at (%s, %s)%n", word, firstIndex, lastIndex);

                // Counts how many times the word dog occurs.
                if (word.equalsIgnoreCase(search)) {
                    count++;
                }
            }
        }

        System.out.println("Number of word '" + search + "' found = " + count);
    }
}

Here is the program output:

Iterates each word: 
'The' found at (0, 3)
'quick' found at (4, 9)
'brown' found at (10, 15)
'fox' found at (16, 19)
'jumps' found at (20, 25)
'over' found at (26, 30)
'the' found at (31, 34)
'lazy' found at (35, 39)
'dog' found at (40, 43)
Number of word 'dog' found = 1
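To make the difference from String.split() concrete, here is a small sketch that runs both approaches on a sentence containing a question mark and an exclamation mark. The class WordListSketch and the method words() are hypothetical names used for illustration. The check with Character.isLetterOrDigit() is the same filter the example above uses to skip spaces and punctuation.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class WordListSketch {
    public static void main(String[] args) {
        String text = "Is the dog lazy? Yes!";

        // Splitting on a space keeps the punctuation attached:
        // prints [Is, the, dog, lazy?, Yes!]
        System.out.println(Arrays.toString(text.split(" ")));

        // The word iterator separates it: prints [Is, the, dog, lazy, Yes]
        System.out.println(words(text, Locale.US));
    }

    // Hypothetical helper: collects the segments that start with a letter
    // or digit, skipping spaces and punctuation.
    static List<String> words(String text, Locale locale) {
        BreakIterator bi = BreakIterator.getWordInstance(locale);
        bi.setText(text);

        List<String> result = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE;
             start = end, end = bi.next()) {
            if (Character.isLetterOrDigit(text.charAt(start))) {
                result.add(text.substring(start, end));
            }
        }
        return result;
    }
}
```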

How do I split up a string using a regular expression?

This example uses the java.util.regex.Pattern.split() method to split up an input string whose values are separated by commas or whitespace.

package org.kodejava.example.regex;

import java.util.regex.Pattern;

public class RegexSplitExample {
    public static void main(String[] args) {
        // Pattern for finding commas, whitespaces (space, tabs, new line,
        // carriage return, form feed).
        String pattern = "[,\\s]+";
        String colours = "Red,White, Blue   Green        Yellow, Orange";

        Pattern splitter = Pattern.compile(pattern);
        String[] result = splitter.split(colours);

        for (String colour : result) {
            System.out.format("Colour = \"%s\"%n", colour);
        }
    }
}

The result of our code snippet is:

Colour = "Red"
Colour = "White"
Colour = "Blue"
Colour = "Green"
Colour = "Yellow"
Colour = "Orange"
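When the same pattern is applied to many inputs, it can be compiled once and reused. The sketch below assumes Java 8 or later, where Pattern.splitAsStream() is available, and collects the parts into a List. The class RegexSplitStreamSketch and the method splitColours() are hypothetical names. Filtering out empty strings is an extra step added here to cover input that starts with a separator.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class RegexSplitStreamSketch {
    // Compiled once and shared by every call.
    private static final Pattern SEPARATOR = Pattern.compile("[,\\s]+");

    public static void main(String[] args) {
        String colours = "Red,White, Blue   Green        Yellow, Orange";
        // Prints [Red, White, Blue, Green, Yellow, Orange]
        System.out.println(splitColours(colours));
    }

    // Hypothetical helper: splits on commas/whitespace, dropping empty parts.
    static List<String> splitColours(String input) {
        return SEPARATOR.splitAsStream(input)
                .filter(part -> !part.isEmpty())
                .collect(Collectors.toList());
    }
}
```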

How do I split a string using the Scanner class?

Instead of using the StringTokenizer class or the String.split() method we can use the java.util.Scanner class to split a string.

package org.kodejava.example.util;

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class ScannerTokenDemo {
    public static void main(String[] args) {
        // This file contains some data as follows:
        // a, b, c, d
        // e, f, g, h
        // i, j, k, l
        File file = new File("data.txt");
        // try-with-resources closes the file-backed Scanner when we are done.
        try (Scanner scanner = new Scanner(file)) {
            // Here we use the Scanner class to read file content line-by-line.
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine();

                // From the above line of code we got a line from the file
                // content. Now we want to split the line with comma as the
                // character delimiter.
                try (Scanner lineScanner = new Scanner(line)) {
                    lineScanner.useDelimiter(",");
                    while (lineScanner.hasNext()) {
                        // Get each split part from the Scanner object and
                        // print the value.
                        String part = lineScanner.next();
                        System.out.print(part + ", ");
                    }
                }
                System.out.println();
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }
}
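The example above needs a data.txt file to run. As a self-contained variation, the sketch below scans a String directly. It also assumes we want the blanks around each comma removed, so it uses the delimiter pattern "\\s*,\\s*" instead of a plain ",". The class ScannerStringSketch and the method split() are hypothetical names for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerStringSketch {
    public static void main(String[] args) {
        // Prints [a, b, c, d]
        System.out.println(split("a, b, c, d"));
    }

    // Hypothetical helper: splits a line on commas, ignoring the
    // white space around each comma.
    static List<String> split(String line) {
        List<String> parts = new ArrayList<>();
        try (Scanner scanner = new Scanner(line)) {
            // The delimiter is a regular expression.
            scanner.useDelimiter("\\s*,\\s*");
            while (scanner.hasNext()) {
                parts.add(scanner.next());
            }
        }
        return parts;
    }
}
```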

How do I split a string?

Prior to Java 1.4 we used the java.util.StringTokenizer class to split a tokenized string, for example a comma-separated string. Java 1.4 introduced the String.split(String regex) method in the java.lang.String class, which simplifies this process.

Below is a code sample showing how to do it.

package org.kodejava.example.lang;

import java.util.Arrays;

public class StringSplit {
    public static void main(String[] args) {
        String data = "1,Diego Maradona,Footballer,Argentina";
        String[] items = data.split(",");

        // Iterates the array to print it out.
        for (String item : items) {
            System.out.println("item = " + item);
        }

        // Or simply use Arrays.toString() to print it out.
        System.out.println("item = " + Arrays.toString(items));
    }
}

The result of the code snippet:

item = 1
item = Diego Maradona
item = Footballer
item = Argentina
item = [1, Diego Maradona, Footballer, Argentina]
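Because the argument of String.split() is a regular expression, a delimiter such as a dot has to be escaped or quoted. The method also has an overload that takes a limit on the number of parts. The sketch below illustrates both points. The class SplitRegexNotesSketch and the helper splitLiteral() are hypothetical names, and Pattern.quote() is one of several ways to treat the delimiter literally.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class SplitRegexNotesSketch {
    public static void main(String[] args) {
        String data = "1,Diego Maradona,Footballer,Argentina";

        // A limit of 2 splits off only the first field:
        // prints [1, Diego Maradona,Footballer,Argentina]
        System.out.println(Arrays.toString(data.split(",", 2)));

        // An unescaped dot matches every character, so nothing is left:
        // prints 0
        System.out.println("1.8.0".split(".").length);

        // Quoting the delimiter treats it literally: prints [1, 8, 0]
        System.out.println(Arrays.toString(splitLiteral("1.8.0", ".")));
    }

    // Hypothetical helper: splits on the delimiter taken as literal text.
    static String[] splitLiteral(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }
}
```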

How do I use StringTokenizer to split a string?

The code below is an example of using StringTokenizer to split a string. In the current JDK the use of this class is discouraged; use the String.split(...) method or the java.util.regex package instead.

package org.kodejava.example.util;

import java.util.StringTokenizer;

public class StringTokenizerExample {
    public static void main(String[] args) {
        StringTokenizer st =
            new StringTokenizer("A StringTokenizer sample");

        // get how many tokens inside st object
        System.out.println("Tokens count: " + st.countTokens());

        // iterate st object to get more tokens from it
        while (st.hasMoreElements()) {
            String token = st.nextElement().toString();
            System.out.println("Token = " + token);
        }

        // split a date string using a forward slash as delimiter
        st = new StringTokenizer("2017/08/20", "/");
        while (st.hasMoreElements()) {
            String token = st.nextToken();
            System.out.println("Token = " + token);
        }
    }
}

Here is the result of this sample code:

Tokens count: 3
Token = A
Token = StringTokenizer
Token = sample
Token = 2017
Token = 08
Token = 20
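When moving code from StringTokenizer to String.split(), note that the two treat consecutive delimiters differently: StringTokenizer skips the empty field, while split() keeps it. The sketch below shows the difference on a small input. The class EmptyTokenSketch and its two helper methods are hypothetical names used only for this comparison.

```java
import java.util.Arrays;
import java.util.StringTokenizer;

public class EmptyTokenSketch {
    public static void main(String[] args) {
        String data = "A,,C";

        // StringTokenizer ignores the empty field: prints 2
        System.out.println("StringTokenizer count = " + tokenizerCount(data));

        // split() keeps it as an empty string: prints [A, , C]
        System.out.println("split result = " + Arrays.toString(splitFields(data)));
    }

    // Hypothetical helper: number of tokens StringTokenizer reports.
    static int tokenizerCount(String data) {
        return new StringTokenizer(data, ",").countTokens();
    }

    // Hypothetical helper: fields returned by String.split().
    static String[] splitFields(String data) {
        return data.split(",");
    }
}
```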