How do I remove non-ASCII characters from a string?

The code snippet below removes every character of a string that falls outside the range \x20 to \x7E, which is the range of printable ASCII characters. The regex therefore strips non-ASCII characters as well as control characters, with two exceptions: it explicitly keeps the line feed \n (\x0A) and the carriage return \r (\x0D).

package org.kodejava.example.regex;

public class ReplaceNonAscii {
    public static void main(String[] args) {
        String str = "Thè quïck brøwn føx jumps over the lãzy dôg.";
        System.out.println("str = " + str);

        // Remove every character except LF, CR and printable ASCII (\x20-\x7E).
        str = str.replaceAll("[^\\x0A\\x0D\\x20-\\x7E]", "");
        System.out.println("str = " + str);
    }
}

Snippet output:

str = Thè quïck brøwn føx jumps over the lãzy dôg.
str = Th quck brwn fx jumps over the lzy dg.
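
If you would rather keep the base letters than drop accented characters entirely, one option is to decompose the string with java.text.Normalizer before stripping. The following is a minimal sketch and not part of the original snippet; the class name FoldToAscii and the method fold are hypothetical names chosen for illustration. Letters such as ø have no canonical decomposition, so they are still removed.

```java
import java.text.Normalizer;

public class FoldToAscii {
    // Hypothetical helper: decompose accented letters into base letter plus
    // combining mark (NFD), then strip everything that is not LF, CR or
    // printable ASCII. The combining marks are removed by the regex.
    public static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\x0A\\x0D\\x20-\\x7E]", "");
    }

    public static void main(String[] args) {
        String str = "Thè quïck brøwn føx jumps over the lãzy dôg.";
        System.out.println("str = " + fold(str));
    }
}
```

With the sample sentence, this should print "str = The quick brwn fx jumps over the lazy dog.", since only the ø characters are lost.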

How do I sort an array of strings using the RuleBasedCollator class?

We can use the java.text.Collator class to sort strings in language-specific order. With a Collator, strings are not simply sorted by the character codes of their characters; they follow the natural ordering of characters in the given language.

If the predefined collation rules do not meet your needs, you can design your own rules and assign them to a RuleBasedCollator object. Customized collation rules are contained in a String object that is passed to the RuleBasedCollator constructor.

package org.kodejava.example.text;

import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;
import java.util.Collections;

public class RuleBasedCollatorDemo {
    public static void main(String[] args) {
        String rule1 = "< a < b < c";
        String rule2 = "< c < b < a";
        String rule3 = "< c < a < b";

        String[] words = {"apple", "banana", "carrot", "apricot", "blueberry", "cabbage"};

        try {
            RuleBasedCollator rb1 = new RuleBasedCollator(rule1);
            RuleBasedCollator rb2 = new RuleBasedCollator(rule2);
            RuleBasedCollator rb3 = new RuleBasedCollator(rule3);

            System.out.println("original: ");
            System.out.println(Arrays.toString(words));

            // Sort based on rule1
            Collections.sort(Arrays.asList(words), rb1);
            System.out.println("rule: " + rb1.getRules());
            System.out.println(Arrays.toString(words));

            // Sort based on rule2
            Collections.sort(Arrays.asList(words), rb2);
            System.out.println("rule: " + rb2.getRules());
            System.out.println(Arrays.toString(words));

            // Sort based on rule3
            Collections.sort(Arrays.asList(words), rb3);
            System.out.println("rule: " + rb3.getRules());
            System.out.println(Arrays.toString(words));
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }
}

Below is the result of sorting the strings with each of the three RuleBasedCollator rules:

original: 
[apple, banana, carrot, apricot, blueberry, cabbage]
rule: < a < b < c
[apple, apricot, banana, blueberry, cabbage, carrot]
rule: < c < b < a
[cabbage, carrot, banana, blueberry, apple, apricot]
rule: < c < a < b
[cabbage, carrot, apple, apricot, banana, blueberry]
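
For comparison with the custom rules above, the predefined rules of java.text.Collator can be used directly when they are sufficient. The sketch below is an illustration only; the class name CollatorSortDemo, the helper sortWithCollator and the choice of Locale.ENGLISH are assumptions, not part of the original example. It contrasts the natural String order with the order produced by a locale's collator.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollatorSortDemo {
    // Hypothetical helper: returns a copy of the array sorted with the
    // predefined English collation rules.
    public static String[] sortWithCollator(String[] words) {
        String[] copy = words.clone();
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        Arrays.sort(copy, collator);
        return copy;
    }

    public static void main(String[] args) {
        String[] words = {"cherry", "Banana", "apple"};

        // Natural String order compares character codes, so the
        // uppercase "Banana" sorts before the lowercase words.
        String[] natural = words.clone();
        Arrays.sort(natural);
        System.out.println("natural:  " + Arrays.toString(natural));

        // The collator orders by letter first and treats case as a
        // lower-priority difference.
        System.out.println("collator: " + Arrays.toString(sortWithCollator(words)));
    }
}
```

The natural order should come out as [Banana, apple, cherry], while the collator should give [apple, Banana, cherry].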

How do I detect non-ASCII characters in a string?

The code below detects whether a given string contains non-ASCII characters. We use the CharsetDecoder class from the java.nio.charset package to decode the string's bytes as US-ASCII; the decoder throws a CharacterCodingException when it meets a byte that is not valid US-ASCII.

package org.kodejava.example.io;

import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;
import java.nio.charset.CharacterCodingException;
import java.nio.CharBuffer;
import java.nio.ByteBuffer;
import java.util.Arrays;

public class NonAsciiValidation {
    public static void main(String[] args) {
        // The first string contains a non-ASCII character (©), so decoding it throws
        // an exception in this program. The second string has only ASCII chars.
        byte[] invalidBytes = "Copyright © 2017 Kode Java Org".getBytes(StandardCharsets.UTF_8);
        byte[] validBytes = "Copyright (c) 2017 Kode Java Org".getBytes(StandardCharsets.UTF_8);

        // Create a US-ASCII decoder; by default it throws on malformed input.
        CharsetDecoder decoder = StandardCharsets.US_ASCII.newDecoder();
        try {
            CharBuffer buffer = decoder.decode(ByteBuffer.wrap(validBytes));
            System.out.println(Arrays.toString(buffer.array()));

            buffer = decoder.decode(ByteBuffer.wrap(invalidBytes));
            System.out.println(Arrays.toString(buffer.array()));
        } catch (CharacterCodingException e) {
            System.err.println("The information contains a non ASCII character(s).");
            e.printStackTrace();
        }
    }
}

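Below is the result of the program. The error message and stack trace are written to System.err, so they may interleave with the System.out line in a different order on your machine: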

The information contains a non ASCII character(s).
[C, o, p, y, r, i, g, h, t,  , (, c, ),  , 2, 0, 1, 7,  , K, o, d, e,  , J, a, v, a,  , O, r, g]
java.nio.charset.MalformedInputException: Input length = 1
    at java.base/java.nio.charset.CoderResult.throwException(CoderResult.java:281)
    at java.base/java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:815)
    at org.kodejava.example.io.NonAsciiValidation.main(NonAsciiValidation.java:23)
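
When you already have a String rather than raw bytes, a shorter check is possible without catching an exception. The sketch below is an alternative to the decoder approach above, not the original method; the class name AsciiCheck and the helper isAscii are hypothetical. It relies on CharsetEncoder.canEncode, which reports whether the whole character sequence can be encoded in the given charset.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class AsciiCheck {
    // Hypothetical helper: true when every character of the text can be
    // encoded as US-ASCII, false otherwise.
    public static boolean isAscii(String text) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        // \u00A9 is the copyright sign, which is outside US-ASCII.
        System.out.println(isAscii("Copyright (c) 2017 Kode Java Org"));
        System.out.println(isAscii("Copyright \u00A9 2017 Kode Java Org"));
    }
}
```

The first call should print true and the second false. A new encoder is created per call here because CharsetEncoder instances are not safe for concurrent use.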