public class DomainVocabulary
extends java.lang.Object
implements java.io.Serializable
A DomainVocabulary instance stores a set of terms together with their frequencies: each entry records the number of times the term has been added to the vocabulary.
Terms can be inserted from a text or from a collection of strings. Before new terms are added, they are preprocessed by the preprocess(String) function.
It also provides functions to export its content. On the one hand, it can be saved to a binary file with the serialize(File) function or to a text file with the toFile(File) function. On the other hand, it can be converted into a list of TermFrequencyTuple instances with the toList() function.
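
The counting behaviour described above can be illustrated with a minimal, self-contained sketch. TermCounterSketch is a hypothetical stand-in and not the DomainVocabulary source: the tokenization (lowercasing and splitting on non-letter characters), the treatment of the minimum size as a character count, and the value 0 for unknown terms are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for illustration only; not the DomainVocabulary source.
final class TermCounterSketch {
    private final Map<String, Integer> terms = new LinkedHashMap<>();
    private final Locale language;
    private final int minimumSize;

    TermCounterSketch(Locale language, int minimumSize) {
        this.language = language;
        this.minimumSize = minimumSize;
    }

    // Assumed preprocessing: lowercase, then split on runs of non-letter characters.
    void addTerms(String text) {
        for (String token : text.toLowerCase(language).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && token.length() >= minimumSize) {
                // Insert the term with frequency 1, or increment its frequency.
                terms.merge(token, 1, Integer::sum);
            }
        }
    }

    boolean contains(String term) {
        return terms.containsKey(term);
    }

    // Assumption: an unknown term is reported with frequency 0.
    int getFrequency(String term) {
        return terms.getOrDefault(term, 0);
    }

    public static void main(String[] args) {
        TermCounterSketch sketch = new TermCounterSketch(Locale.ENGLISH, 3);
        sketch.addTerms("Cells divide; cells grow in a dish.");
        System.out.println(sketch.getFrequency("cells"));
    }
}
```

Each repeated token increments the stored count, which is the sense in which an entry's frequency is the number of times it has been added.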
See Also:
TermFrequencyTuple, Serialized Form

Constructor | Description |
---|---|
DomainVocabulary(DomainVocabulary vocabulary) | Creates a new vocabulary identical to the given one. |
DomainVocabulary(java.util.Locale language) | Creates an empty vocabulary. |
DomainVocabulary(java.util.Locale language, java.util.Collection<TermFrequencyTuple> terms) | Creates a vocabulary that includes some initial terms with their initial frequencies. |

Modifier and Type | Method | Description |
---|---|---|
void | addTerms(java.util.Collection<java.lang.String> terms) | Adds new terms from a collection of words. |
void | addTerms(java.lang.String text) | Adds new terms, or updates those already present, with the tokens obtained by preprocessing the given text. |
boolean | contains(java.lang.String term) | Checks whether a term is contained in the vocabulary. |
int | getFrequency(java.lang.String term) | Gives the frequency of the given term. |
java.util.Locale | getLanguage() | |
int | getMinimumSize() | |
java.util.Map<java.lang.String,java.lang.Integer> | getTerms() | |
DomainVocabulary | getTop(float percentage) | Extracts the most frequent terms of the vocabulary. |
DomainVocabulary | getTop(int qtt) | Extracts the qtt most frequent terms of the vocabulary. |
void | insertFromFile(java.io.File input) | Inserts the terms stored in the given file into the domain vocabulary. |
static DomainVocabulary | loadfromFile(java.io.File input) | Creates a DomainVocabulary instance by reading a binary file which contains a domain vocabulary. |
protected java.util.Collection<java.lang.String> | preprocess(java.lang.String text) | Preprocesses a text to extract its terms, as defined in TermExtractor.getTerms(). |
void | serialize(java.io.File output) | Saves the vocabulary in a binary file. |
void | setMinimumSize(int size) | Changes the minimum size of the tokens to be accepted as terms of the vocabulary. |
void | setTerms(java.util.Collection<TermFrequencyTuple> terms) | Sets a collection of tuples, each formed by a term and its frequency, as the vocabulary. |
void | toFile(java.io.File output) | Exports the vocabulary to a text file. |
java.util.List<TermFrequencyTuple> | toList() | Exports the vocabulary as a list of tuples, each containing a term and its frequency. |

public DomainVocabulary(java.util.Locale language)
Parameters:
language - Language of the vocabulary.

public DomainVocabulary(java.util.Locale language, java.util.Collection<TermFrequencyTuple> terms)
Parameters:
language - Language of the vocabulary.
terms - Initial terms to include in the vocabulary.

public DomainVocabulary(DomainVocabulary vocabulary)
Parameters:
vocabulary - The reference vocabulary.

public void addTerms(java.util.Collection<java.lang.String> terms)
Parameters:
terms - Collection of terms to add.

public void addTerms(java.lang.String text)
Parameters:
text - The text with the new terms.

public boolean contains(java.lang.String term)
Parameters:
term - The term to check.
Returns:
true if the term is contained in the vocabulary, false otherwise.

public int getFrequency(java.lang.String term)
Parameters:
term - The term whose frequency is requested.

public DomainVocabulary getTop(float percentage)
Parameters:
percentage - The fraction that defines the top to extract, given as a value between 0 and 1.
Returns:
A DomainVocabulary instance which contains the extracted terms.

public DomainVocabulary getTop(int qtt)
Extracts the qtt most frequent terms of the vocabulary.
Parameters:
qtt - Number of terms to extract.
Returns:
A DomainVocabulary instance which contains the extracted terms.

public java.util.List<TermFrequencyTuple> toList()
Exports the vocabulary as a list of TermFrequencyTuple instances, each containing a term and its frequency.
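
The selection performed by the two getTop variants documented above might look roughly like the following sketch. It works on a plain map rather than on DomainVocabulary, and the class and method names are hypothetical. Two points are assumptions, since the documentation does not state them: ties keep their original order, and the percentage is applied to the number of distinct terms and rounded up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the documented API.
final class TopTermsSketch {

    // Keeps the qtt entries with the highest frequency, most frequent first.
    static Map<String, Integer> topByCount(Map<String, Integer> terms, int qtt) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(terms.entrySet());
        // List.sort is stable, so equal frequencies keep their input order.
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());

        int limit = Math.min(Math.max(qtt, 0), entries.size());
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : entries.subList(0, limit)) {
            result.put(entry.getKey(), entry.getValue());
        }
        return result;
    }

    // Assumption: the fraction (0 to 1) refers to the number of distinct terms.
    static Map<String, Integer> topByFraction(Map<String, Integer> terms, float percentage) {
        int qtt = (int) Math.ceil(percentage * terms.size());
        return topByCount(terms, qtt);
    }

    public static void main(String[] args) {
        Map<String, Integer> terms = new LinkedHashMap<>();
        terms.put("cell", 5);
        terms.put("dish", 1);
        terms.put("gene", 3);
        terms.put("assay", 2);
        System.out.println(topByCount(terms, 2).keySet());
        System.out.println(topByFraction(terms, 0.5f).keySet());
    }
}
```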

public void toFile(java.io.File output) throws java.io.IOException
Exports the vocabulary to a text file of N lines, each containing one vocabulary entry. DomainVocabulary instances saved in this way can be loaded using the insertFromFile(File) function. Each line has the following format: "term\tfrequency".
Parameters:
output - Path to the output file. If it doesn't exist, it will be created. If it already exists, it will be overwritten.
Throws:
java.io.IOException
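
To make the "term\tfrequency" line format concrete, the sketch below writes a plain map in that layout and reads it back. It is an illustration of the file format only, not the toFile or insertFromFile implementation. The class name, the UTF-8 encoding, and the choice to add loaded frequencies to existing ones are assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the documented API.
final class TextFormatSketch {

    // Writes one "term<TAB>frequency" line per entry. Files.write creates the
    // file if it is missing and truncates it if it already exists.
    static void write(Map<String, Integer> terms, Path output) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : terms.entrySet()) {
            lines.add(entry.getKey() + "\t" + entry.getValue());
        }
        Files.write(output, lines, StandardCharsets.UTF_8);
    }

    // Reads "term<TAB>frequency" lines into the given map.
    // Assumption: frequencies of terms already present are summed.
    static void insertInto(Map<String, Integer> terms, Path input) throws IOException {
        for (String line : Files.readAllLines(input, StandardCharsets.UTF_8)) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue; // skip lines that do not match the format
            }
            String term = line.substring(0, tab);
            int frequency = Integer.parseInt(line.substring(tab + 1).trim());
            terms.merge(term, frequency, Integer::sum);
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> terms = new LinkedHashMap<>();
        terms.put("cell", 5);
        terms.put("gene", 3);
        Path file = Files.createTempFile("vocabulary", ".tsv");
        write(terms, file);

        Map<String, Integer> loaded = new LinkedHashMap<>();
        insertInto(loaded, file);
        System.out.println(loaded);
        Files.delete(file);
    }
}
```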

public void serialize(java.io.File output)
Saves the vocabulary in a binary file. DomainVocabulary instances saved in this way can be loaded using the loadfromFile(File) function.
Parameters:
output - Path to the output file. If it doesn't exist, it will be created. If it already exists, it will be overwritten.

public static DomainVocabulary loadfromFile(java.io.File input)
Creates a DomainVocabulary instance by reading a binary file which contains a domain vocabulary.
Parameters:
input - Path to the binary file which contains a domain vocabulary.
Returns:
A DomainVocabulary instance which is equal to the one stored in the input file.

public void insertFromFile(java.io.File input)
Parameters:
input - Path to the readable file which contains a set of terms stored with the format used by the toFile(File) function.

protected java.util.Collection<java.lang.String> preprocess(java.lang.String text)
Parameters:
text - The text to preprocess.

public void setMinimumSize(int size)
Parameters:
size - The new size.

public void setTerms(java.util.Collection<TermFrequencyTuple> terms)
Parameters:
terms - The new terms of the vocabulary.

public java.util.Locale getLanguage()

public int getMinimumSize()

public java.util.Map<java.lang.String,java.lang.Integer> getTerms()
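
Because the class implements java.io.Serializable, the binary round trip described for serialize(File) and loadfromFile(File) can plausibly be pictured as standard Java object serialization. The sketch below shows that mechanism on an arbitrary Serializable object. Whether DomainVocabulary uses exactly this approach is an assumption, and the class and method names here are hypothetical.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;

// Hypothetical helper for illustration only; not part of the documented API.
final class BinaryRoundTripSketch {

    // Writes the object graph to a binary file, creating or overwriting it.
    static void serialize(Serializable vocabulary, File output) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(output))) {
            out.writeObject(vocabulary);
        }
    }

    // Reads back whatever object was stored in the binary file.
    static Object load(File input) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(input))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<String, Integer> terms = new HashMap<>();
        terms.put("cell", 5);
        File file = File.createTempFile("vocabulary", ".bin");
        serialize(terms, file);
        System.out.println(terms.equals(load(file)));
        file.delete();
    }
}
```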