How to remove non ASCII characters from a string?

The code snippet below remove the characters from a string that is not inside the range of x20 and x7E ASCII code. The regex below strips non-printable and control characters. But it also keeps the linefeed character n (x0A) and the carriage return r (x0D) characters.

package org.kodejava.regex;

public class ReplaceNonAscii {
    public static void main(String[] args) {
        String str = "Thè quïck brøwn føx jumps over the lãzy dôg.";
        System.out.println("str = " + str);

        // Replace all non ascii chars in the string.
        str = str.replaceAll("[^\\x0A\\x0D\\x20-\\x7E]", "");
        System.out.println("str = " + str);
    }
}

Snippet output:

str = Thè quïck brøwn føx jumps over the lãzy dôg.
str = Th quck brwn fx jumps over the lzy dg.

How to split a string by a number of characters?

The following code snippet will show you how to split a string by numbers of characters. We create a method called splitToNChars() that takes two arguments. The first arguments is the string to be split and the second arguments is the split size.

This splitToNChars() method will split the string in a for loop. First we’ll create a List object that will store parts of the split string. Next we do a loop and get the substring for the defined size from the text and store it into the List. After the entire string is read we convert the List object into an array of String by using the List‘s toArray() method.

Let’s see the code snippet below:

package org.kodejava.lang;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitStringForEveryNChar {
    public static void main(String[] args) {
        String text = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

        System.out.println(Arrays.toString(splitToNChar(text, 3)));
        System.out.println(Arrays.toString(splitToNChar(text, 4)));
        System.out.println(Arrays.toString(splitToNChar(text, 5)));
    }

    /**
     * Split text into n number of characters.
     *
     * @param text the text to be split.
     * @param size the split size.
     * @return an array of the split text.
     */
    private static String[] splitToNChar(String text, int size) {
        List<String> parts = new ArrayList<>();

        int length = text.length();
        for (int i = 0; i < length; i += size) {
            parts.add(text.substring(i, Math.min(length, i + size)));
        }
        return parts.toArray(new String[0]);
    }
}

When run the code snippet will output:

[ABC, DEF, GHI, JKL, MNO, PQR, STU, VWX, YZ]
[ABCD, EFGH, IJKL, MNOP, QRST, UVWX, YZ]
[ABCDE, FGHIJ, KLMNO, PQRST, UVWXY, Z]

How do I split a string with multiple spaces?

This code snippet show you how to split string with multiple white-space characters. To split the string this way we use the "\s+" regular expression. The white-space characters include space, tab, line-feed, carriage-return, new line, form-feed.

Let’s see the code snippet below:

package org.kodejava.lang;

import java.util.Arrays;

public class SplitStringMultiSpaces {
    public static void main(String[] args) {
        String text = "04/11/2021    SHOES      RUNNING RED   99.9 USD";

        // Split the string using the \s+ regex to split multi spaces
        // line of text.
        String[] items = text.split("\\s+");
        System.out.println("Length = " + items.length);
        System.out.println("Items  = " + Arrays.toString(items));
    }
}

The result of the code snippet is:

Length = 6
Items  = [04/11/2021, SHOES, RUNNING, RED, 99.9, USD]

How do I use string in switch statement?

Starting from Java 7 release you can now use a String in the switch statement. On the previous version we can only use constants type of byte, char, short, int (and their corresponding reference / wrapper type) or enum constants in the switch statement.

The code below give you a simple example on how the Java 7 extended to allow the use of String in switch statement.

package org.kodejava.basic;

public class StringInSwitchExample {
    public static void main(String[] args) {
        String day = "Sunday";
        switch (day) {
            case "Sunday":
                System.out.println("doSomething");
                break;
            case "Monday":
                System.out.println("doSomethingElse");
                break;
            case "Tuesday":
            case "Wednesday":
                System.out.println("doSomeOtherThings");
                break;
            default:
                System.out.println("doDefault");
                break;
        }
    }
}

How do I break a text or sentence into words?

At first, it might look simple. We can just split the text using the String.split(), the word is split using space. But what if a word ends with questions marks (?) or exclamation marks (!) instead? There might be some other rules that we also need to care.

Using the java.text.BreakIterator makes it much simpler. The class’s getWordInstance() factory method creates a BreakIterator instance for words break. Instantiating a BreakIterator and passing a locale information makes the iterator to breaks the text or sentence according the rule of the locale. This is really helpful when we are working with a complex language such as Japanese or Chinese.

Let us see an example of using the BreakIterator below.

package org.kodejava.text;

import java.text.BreakIterator;
import java.util.Locale;

public class BreakIteratorExample {
    public static void main(String[] args) {
        String data = "The quick brown fox jumps over the lazy dog.";
        String search = "dog";

        // Gets an instance of BreakIterator for word break for the
        // given locale. We can instantiate a BreakIterator without
        // specifying the locale. The locale is important when we
        // are working with languages like Japanese or Chinese where
        // the breaks standard may be different compared to English.
        BreakIterator bi = BreakIterator.getWordInstance(Locale.US);

        // Set the text string to be scanned.
        bi.setText(data);

        // Iterates the boundary / breaks
        System.out.println("Iterates each word: ");
        int count = 0;
        int lastIndex = bi.first();
        while (lastIndex != BreakIterator.DONE) {
            int firstIndex = lastIndex;
            lastIndex = bi.next();

            if (lastIndex != BreakIterator.DONE
                    && Character.isLetterOrDigit(data.charAt(firstIndex))) {
                String word = data.substring(firstIndex, lastIndex);
                System.out.printf("'%s' found at (%s, %s)%n",
                        word, firstIndex, lastIndex);

                // Counts how many times the word dog occurs.
                if (word.equalsIgnoreCase(search)) {
                    count++;
                }
            }
        }

        System.out.println("Number of word '" + search + "' found = " + count);
    }
}

Here are the program output:

Iterates each word: 
'The' found at (0, 3)
'quick' found at (4, 9)
'brown' found at (10, 15)
'fox' found at (16, 19)
'jumps' found at (20, 25)
'over' found at (26, 30)
'the' found at (31, 34)
'lazy' found at (35, 39)
'dog' found at (40, 43)
Number of word 'dog' found = 1