public class TextPreprocessor
extends java.lang.Object
This class represents an "interface" to the different text processing tools available in this package.
The currently available operations are listed in the method summary below.
Note that a copy of the originally set string is stored, and all changes are carried out on a separate working copy.
Constructor and Description |
---|
TextPreprocessor(java.util.Locale lan) |
Modifier and Type | Method and Description |
---|---|
java.lang.String | getOriginalString() |
java.lang.String | getString() |
java.util.List<java.lang.String> | getTokens() |
java.lang.String | normalizeAndDeAposText(java.lang.String text): Normalizes the text by collapsing consecutive white spaces into one and substituting quotation marks, dashes and dots. |
static java.lang.String | normalizeText(java.lang.String text): Normalizes the text by collapsing consecutive white spaces into one and substituting quotation marks, dashes and dots. |
void | removeDiacritics(): Removes the diacritics from the string. |
void | removeEngStopwords(): Eliminates all the English stopwords from the string. |
void | removeNonAlphabetic(int minimumSize): Removes any token which is not in the [:alpha:] character class. |
void | removeNonAlphaNumeric(int minimumSize): Removes any token which is not in the [:alnum:] character class. |
void | removePunctuation(): Removes the punctuation marks from the string. |
void | removeStopwords(): Eliminates all the stopwords from the string. |
void | setString(java.lang.String str): Stores a copy of the original string and generates a tokenized copy. |
void | setStringTokens(java.lang.String str): Stores a copy of the original string and generates a tokenized copy. |
void | stem() |
void | stemLucene() |
void | toLowerCase(): Converts the string to lowercase. |
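The normalization described for normalizeText and normalizeAndDeAposText (collapsing white space and substituting quotation marks, dashes and dots) can be pictured with the standalone sketch below. It is not this class's implementation: the class name NormalizeSketch, the exact set of characters replaced, and the final trim are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the documented package.
final class NormalizeSketch {
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String normalize(String text) {
        String out = text
            .replace('\u201C', '"').replace('\u201D', '"')   // curly double quotes to straight
            .replace('\u2018', '\'').replace('\u2019', '\'') // curly single quotes to straight
            .replace('\u2013', '-').replace('\u2014', '-')   // en and em dashes to hyphen
            .replace("\u2026", "...");                       // ellipsis character to three dots
        // Collapse every run of white space into a single space (assumed behaviour).
        return WHITESPACE.matcher(out).replaceAll(" ").trim();
    }
}
```

Which typographic characters the real methods substitute, and what the "DeApos" step does to apostrophes, is not stated in this page, so the mapping above should be read as a guess.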
public void toLowerCase()
public void removeDiacritics()
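A common way to remove diacritics in Java is to decompose the text with java.text.Normalizer and then drop the combining marks. The sketch below shows that technique on a plain string; whether removeDiacritics() works this way is an assumption, and DiacriticsSketch is a hypothetical name.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of the documented package.
final class DiacriticsSketch {
    static String strip(String s) {
        // NFD decomposition splits an accented letter into base letter plus combining mark.
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // \p{M} matches Unicode combining marks, which are then removed.
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

Letters that have no canonical decomposition (for example a stroked "o") pass through this approach unchanged.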
public void removePunctuation()
public void removeStopwords()
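Stopword removal amounts to dropping every token that appears in a fixed word list. The following sketch illustrates the idea on a token list; the word list, the case-insensitive comparison and the name StopwordSketch are assumptions, since this page does not say which lists removeStopwords() and removeEngStopwords() use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the documented package.
final class StopwordSketch {
    // Tiny illustrative list; a real stopword list is much longer.
    private static final Set<String> STOPWORDS = Set.of("the", "a", "an", "of", "and", "is");

    static List<String> removeStopwords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            // Compare in lower case so "The" and "the" are treated alike (assumed behaviour).
            if (!STOPWORDS.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```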
public void removeEngStopwords()
public void removeNonAlphaNumeric(int minimumSize)
minimumSize - Minimum size of accepted tokens. If it equals 0, all the tokens will be accepted.
public void removeNonAlphabetic(int minimumSize)
minimumSize - Minimum size of accepted tokens. If it equals 0, all the tokens will be accepted.
public static java.lang.String normalizeText(java.lang.String text)
text -
public java.lang.String normalizeAndDeAposText(java.lang.String text)
text -
public void stem()
public void stemLucene()
public void setString(java.lang.String str)
str -
public void setStringTokens(java.lang.String str)
str -
public java.lang.String getString()
public java.util.List<java.lang.String> getTokens()
public java.lang.String getOriginalString()