At first, it might look simple: we could just split the text with String.split(), using a space as the delimiter. But what if a word is followed by a question mark (?) or an exclamation mark (!) instead? There may also be other rules that we need to take care of.
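To make the problem concrete, here is a minimal sketch that compares the two approaches on a sentence containing punctuation. The class name SplitVersusBreakIterator, the method names splitOnSpace and wordsOf, and the sample sentence are made up for this illustration; treating a segment as a word only when it starts with a letter or digit is an assumption of the sketch, not a rule imposed by BreakIterator itself.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class SplitVersusBreakIterator {

    // Naive approach: split on single spaces. Punctuation stays
    // attached to the neighboring word.
    public static List<String> splitOnSpace(String text) {
        return Arrays.asList(text.split(" "));
    }

    // BreakIterator approach: walk the word boundaries and keep only
    // the segments that start with a letter or digit (assumption of
    // this sketch), which drops spaces and punctuation.
    public static List<String> wordsOf(String text) {
        List<String> words = new ArrayList<>();
        BreakIterator bi = BreakIterator.getWordInstance(Locale.US);
        bi.setText(text);

        int start = bi.first();
        int end = bi.next();
        while (end != BreakIterator.DONE) {
            if (Character.isLetterOrDigit(text.charAt(start))) {
                words.add(text.substring(start, end));
            }
            start = end;
            end = bi.next();
        }
        return words;
    }

    public static void main(String[] args) {
        String text = "Is it done? Yes, finally!";
        System.out.println(splitOnSpace(text)); // [Is, it, done?, Yes,, finally!]
        System.out.println(wordsOf(text));      // [Is, it, done, Yes, finally]
    }
}
```

With String.split(" ") the punctuation remains part of the tokens ("done?", "Yes,", "finally!"), so a comparison such as equalsIgnoreCase("done") would fail, while the boundary-based version yields the bare words.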
Using java.text.BreakIterator makes this much simpler. The class's getWordInstance() factory method creates a BreakIterator instance for word breaks. Passing a locale when creating the BreakIterator makes the iterator break the text according to the rules of that locale. This is really helpful when we are working with languages such as Japanese or Chinese, where words are not separated by spaces.
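As a rough illustration of the locale parameter, the sketch below wraps the iteration in a helper that accepts any Locale. The class name LocaleWordBreaks, the method name words, and both sample sentences are hypothetical. The Japanese sentence is written with Unicode escapes so the source compiles regardless of file encoding. How it is segmented depends on the break rules shipped with the JDK in use, so no particular output is claimed for it here.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LocaleWordBreaks {

    // Returns the segments between word boundaries that start with a
    // letter or digit, using the break rules of the given locale.
    public static List<String> words(String text, Locale locale) {
        List<String> result = new ArrayList<>();
        BreakIterator bi = BreakIterator.getWordInstance(locale);
        bi.setText(text);

        int start = bi.first();
        int end = bi.next();
        while (end != BreakIterator.DONE) {
            if (Character.isLetterOrDigit(text.charAt(start))) {
                result.add(text.substring(start, end));
            }
            start = end;
            end = bi.next();
        }
        return result;
    }

    public static void main(String[] args) {
        // English: spaces and punctuation mark the boundaries.
        System.out.println(words("Hello, world! How are you?", Locale.US));

        // Japanese ("Kyou wa ii tenki desu ne."): no spaces between
        // words, so the result depends entirely on the iterator's rules.
        String japanese = "\u4ECA\u65E5\u306F\u3044\u3044\u5929\u6C17\u3067\u3059\u306D\u3002";
        System.out.println(words(japanese, Locale.JAPAN));
    }
}
```

Running this on different JDK versions is a quick way to check how a particular runtime segments text that has no whitespace between words.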
Let us see a complete example of using BreakIterator below.
package org.kodejava.text;

import java.text.BreakIterator;
import java.util.Locale;

public class BreakIteratorExample {
    public static void main(String[] args) {
        String data = "The quick brown fox jumps over the lazy dog.";
        String search = "dog";

        // Get a BreakIterator instance for word breaks for the given
        // locale. We can also create a BreakIterator without
        // specifying a locale. The locale is important when we are
        // working with languages like Japanese or Chinese, where the
        // word-break rules may differ from those of English.
        BreakIterator bi = BreakIterator.getWordInstance(Locale.US);

        // Set the text string to be scanned.
        bi.setText(data);

        // Iterate over the word boundaries. Each pair of consecutive
        // boundaries (firstIndex, lastIndex) delimits one segment.
        System.out.println("Iterates each word: ");
        int count = 0;
        int lastIndex = bi.first();
        while (lastIndex != BreakIterator.DONE) {
            int firstIndex = lastIndex;
            lastIndex = bi.next();

            // Skip segments that are whitespace or punctuation: only
            // segments starting with a letter or digit count as words.
            if (lastIndex != BreakIterator.DONE
                    && Character.isLetterOrDigit(data.charAt(firstIndex))) {
                String word = data.substring(firstIndex, lastIndex);
                System.out.printf("'%s' found at (%s, %s)%n",
                        word, firstIndex, lastIndex);

                // Count how many times the search word occurs.
                if (word.equalsIgnoreCase(search)) {
                    count++;
                }
            }
        }

        System.out.println("Number of word '" + search + "' found = " + count);
    }
}
Here is the program output:
Iterates each word:
'The' found at (0, 3)
'quick' found at (4, 9)
'brown' found at (10, 15)
'fox' found at (16, 19)
'jumps' found at (20, 25)
'over' found at (26, 30)
'the' found at (31, 34)
'lazy' found at (35, 39)
'dog' found at (40, 43)
Number of word 'dog' found = 1