public class ArticleTextExtractor
extends java.lang.Object
| Constructor and Description |
|---|
| ArticleTextExtractor(java.util.Locale language, int year): Creates a preprocessor without any page to preprocess. |
| ArticleTextExtractor(java.util.Locale language, int year, java.io.File listOfPages): Creates a preprocessor with the pages listed in listOfPages. |
| Modifier and Type | Method and Description |
|---|---|
| void | addPage(int id): Adds a page ID to the set of page IDs. |
| void | addPages(java.util.Collection<java.lang.Integer> ids): Adds a collection of page IDs to the set of page IDs. |
| void | addPreprocess(TypePreprocess preprocess): Adds a new preprocess to the available ones. |
| static void | extractEntireWikipedia(java.util.Locale locale, int year, java.io.File directory): Extracts the entire Wikipedia for the given language and year. |
| static void | extractSpecificArticles(java.util.Locale locale, int year, java.io.File pagesFile, java.io.File directory): Extracts only the articles specified in pagesFile. |
| static void | extractSpecificArticles(java.util.Locale locale, int year, java.lang.Integer[] articleIDs, java.io.File directory) |
| java.util.Set<TypePreprocess> | getAvailablePreprocesses() |
| java.io.File | getRootDirectory() |
| boolean | isAvailablePreprocess(TypePreprocess preprocess): Checks whether the preprocess is among the available preprocesses. |
| void | loadPages(java.io.File list): Loads all the page IDs of the file list. |
| void | loadPages(java.lang.Integer[] list) |
| static void | main(java.lang.String[] args): Extracts texts from Wikipedia articles and saves them into text files, after some given preprocessing. |
| void | preprocess(TypePreprocess preprocess): Preprocesses all the pages with the preprocess method identified by preprocess. |
| void | preprocessAll(): Applies all the available preprocessing steps to the pages. |
| void | removePage(int id) |
| boolean | removePreprocess(TypePreprocess preprocess): Removes the preprocess if it belongs to the set of available preprocesses. |
| void | setRootDirectory(java.io.File directory): Changes the root directory of the preprocessor. |
public ArticleTextExtractor(java.util.Locale language,
int year)
Parameters:
language - Language of the Wikipedia
year - Year of the Wikipedia dump
public ArticleTextExtractor(java.util.Locale language,
int year,
java.io.File listOfPages)
throws java.io.IOException
Creates a preprocessor with the pages listed in listOfPages. This file must have one ID per line. The root directory is set to the user's home.
Parameters:
language - Language of the Wikipedia
year - Year of the Wikipedia dump
listOfPages - File with one page ID per line.
Throws:
java.io.IOException
public void preprocessAll()
throws java.lang.InterruptedException,
WikiApiException
Throws:
java.lang.InterruptedException
WikiApiException
public void preprocess(TypePreprocess preprocess)
throws java.lang.InterruptedException,
WikiApiException
Parameters:
preprocess - Identifier of the preprocess
Throws:
java.lang.InterruptedException
WikiApiException
public void loadPages(java.io.File list)
throws java.io.IOException
Loads all the page IDs of the file list. The previously loaded pages are kept in the preprocessor.
Parameters:
list - File with one page ID per line.
Throws:
java.io.IOException
public void loadPages(java.lang.Integer[] list)
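The "one page ID per line" file format and the note that previously loaded pages are kept can be illustrated with a small JDK-only sketch. This is not the library's implementation: `PageIdLoader` and `loadInto` are hypothetical names, and skipping blank lines and rejecting malformed lines with a `NumberFormatException` are assumptions, since the documentation does not say how such lines are handled.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of ArticleTextExtractor. */
final class PageIdLoader {

    /** Reads one integer ID per line and adds each to ids; existing entries are kept. */
    static void loadInto(File list, Set<Integer> ids) throws IOException {
        try (BufferedReader reader =
                 Files.newBufferedReader(list.toPath(), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue; // assumption: blank lines are skipped
                }
                // assumption: a malformed line fails fast with NumberFormatException
                ids.add(Integer.valueOf(trimmed));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File list = File.createTempFile("pages", ".txt");
        list.deleteOnExit();
        Files.write(list.toPath(), Arrays.asList("12", "25", "", "12"), StandardCharsets.UTF_8);

        Set<Integer> ids = new TreeSet<>();
        ids.add(7);          // a page ID loaded earlier
        loadInto(list, ids); // the earlier ID is kept, duplicates collapse
        System.out.println(ids); // prints [7, 12, 25]
    }
}
```

Using a `Set` means that loading the same ID twice has no further effect, which matches the summary's description of a "set of page IDs".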
public void addPage(int id)
Parameters:
id - ID of the page
public void removePage(int id)
public void addPages(java.util.Collection<java.lang.Integer> ids)
Parameters:
ids - Collection of page IDs
public void addPreprocess(TypePreprocess preprocess)
Adds a new preprocess to the available ones. If preprocess is already added, nothing is done.
Parameters:
preprocess - The new preprocess. It can't be a null object.
public boolean removePreprocess(TypePreprocess preprocess)
Parameters:
preprocess - Preprocess to remove.
public void setRootDirectory(java.io.File directory)
Changes the root directory of the preprocessor. If directory isn't a valid directory (with write permissions), no change is made.
Parameters:
directory - A valid directory.
public boolean isAvailablePreprocess(TypePreprocess preprocess)
Parameters:
preprocess - Desired preprocess.
public static void extractSpecificArticles(java.util.Locale locale,
int year,
java.lang.Integer[] articleIDs,
java.io.File directory)
public static void extractSpecificArticles(java.util.Locale locale,
int year,
java.io.File pagesFile,
java.io.File directory)
Parameters:
locale -
year -
pagesFile -
directory -
public java.util.Set<TypePreprocess> getAvailablePreprocesses()
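The set semantics documented for addPreprocess, removePreprocess, isAvailablePreprocess and getAvailablePreprocesses can be sketched as follows. This is only a stand-in under stated assumptions: `PreprocessRegistry` and the `Step` enum (with its constants) are hypothetical, `TypePreprocess` may not be an enum at all, and the meaning of the boolean returned by removal (true when something was removed) is a guess based on the method description.

```java
import java.util.Collections;
import java.util.EnumSet;
import java.util.Objects;
import java.util.Set;

/** Hypothetical stand-in showing set-like bookkeeping of preprocesses. */
final class PreprocessRegistry {

    /** Hypothetical substitute for TypePreprocess; the constants are invented. */
    enum Step { PLAIN_TEXT, TOKENS }

    private final Set<Step> available = EnumSet.noneOf(Step.class);

    /** Adds a step; a step that is already present is left as-is. Null is rejected. */
    void add(Step step) {
        Objects.requireNonNull(step, "step");
        available.add(step);
    }

    /** Assumption: returns true only when the step was present and got removed. */
    boolean remove(Step step) {
        return available.remove(step);
    }

    boolean isAvailable(Step step) {
        return available.contains(step);
    }

    /** Read-only view, so callers cannot bypass add/remove. */
    Set<Step> getAvailable() {
        return Collections.unmodifiableSet(available);
    }

    public static void main(String[] args) {
        PreprocessRegistry registry = new PreprocessRegistry();
        registry.add(Step.TOKENS);
        registry.add(Step.TOKENS); // second add changes nothing
        System.out.println(registry.getAvailable());      // prints [TOKENS]
        System.out.println(registry.remove(Step.TOKENS)); // prints true
        System.out.println(registry.remove(Step.TOKENS)); // prints false
    }
}
```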
public java.io.File getRootDirectory()
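The root directory contract (default to the user's home, ignore a directory that is not valid and writable) could be approximated with plain `java.io.File` checks. The sketch below is an assumption about how such a check might look, not the class's actual code; `RootDirectoryHolder` is a hypothetical name, and treating null as invalid is also an assumption.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

/** Hypothetical holder mimicking the documented root directory behaviour. */
final class RootDirectoryHolder {

    // The documentation states that the root directory defaults to the user's home.
    private File rootDirectory = new File(System.getProperty("user.home"));

    /** Replaces the root only when the candidate is an existing, writable directory. */
    void setRootDirectory(File directory) {
        if (directory != null && directory.isDirectory() && directory.canWrite()) {
            rootDirectory = directory;
        }
        // otherwise the previous root is silently kept
    }

    File getRootDirectory() {
        return rootDirectory;
    }

    public static void main(String[] args) throws IOException {
        RootDirectoryHolder holder = new RootDirectoryHolder();
        File before = holder.getRootDirectory();

        // A path that does not exist is ignored.
        holder.setRootDirectory(new File(before, "missing-" + System.nanoTime()));
        System.out.println(before.equals(holder.getRootDirectory()));

        // A freshly created temporary directory is accepted.
        File tmp = Files.createTempDirectory("root").toFile();
        tmp.deleteOnExit();
        holder.setRootDirectory(tmp);
        System.out.println(tmp.equals(holder.getRootDirectory()));
    }
}
```

Because an invalid directory is ignored without an exception, callers who need to know whether the change took effect would compare getRootDirectory() with the value they passed.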
public static void extractEntireWikipedia(java.util.Locale locale,
int year,
java.io.File directory)
public static void main(java.lang.String[] args)
Parameters:
args - language, year [file with ids]
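The argument layout "language, year [file with ids]" can be parsed along these lines. This is a sketch under assumptions, not the actual main method: `ExtractorArgs` is a hypothetical class, the language is assumed to be a BCP 47 tag such as "en", and what main does with the parsed values is not stated in the documentation.

```java
import java.io.File;
import java.util.Locale;

/** Hypothetical value object for the arguments: language, year [file with ids]. */
final class ExtractorArgs {
    final Locale language;
    final int year;
    final File pagesFile; // null when the optional third argument is omitted

    private ExtractorArgs(Locale language, int year, File pagesFile) {
        this.language = language;
        this.year = year;
        this.pagesFile = pagesFile;
    }

    static ExtractorArgs parse(String[] args) {
        if (args.length < 2 || args.length > 3) {
            throw new IllegalArgumentException("usage: language year [file with ids]");
        }
        // assumption: the language is given as a BCP 47 tag, e.g. "en" or "es"
        Locale language = Locale.forLanguageTag(args[0]);
        int year = Integer.parseInt(args[1]); // NumberFormatException on a non-numeric year
        File pagesFile = args.length == 3 ? new File(args[2]) : null;
        return new ExtractorArgs(language, year, pagesFile);
    }

    public static void main(String[] args) {
        ExtractorArgs parsed = parse(new String[] {"es", "2013", "ids.txt"});
        System.out.println(parsed.language.getLanguage()); // prints es
        System.out.println(parsed.year);                   // prints 2013
        System.out.println(parsed.pagesFile.getName());    // prints ids.txt
    }
}
```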