public class NumWordsRulesClassifier extends java.lang.Object implements BoilerpipeFilter
Classifies TextBlocks as content/not-content through rules that have
been determined using the C4.8 machine learning algorithm, as described in
the paper "Boilerplate Detection using Shallow Text Features" (WSDM 2010),
particularly using number of words per block and link density per block.
Modifier and Type | Field and Description |
---|---|
static NumWordsRulesClassifier | INSTANCE |
Constructor and Description |
---|
NumWordsRulesClassifier() |
Modifier and Type | Method and Description |
---|---|
protected boolean | classify(TextBlock prev, TextBlock curr, TextBlock next) |
static NumWordsRulesClassifier | getInstance() Returns the singleton instance for NumWordsRulesClassifier. |
boolean | process(TextDocument doc) Processes the given document doc. |
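The following is a minimal sketch of the kind of rule that classify(prev, curr, next) applies, based only on the two features named in the class description (number of words and link density per block). The `Block` type, the `isContent` helper, and both thresholds are hypothetical and chosen for illustration; they are not the library's `TextBlock` API or its learned decision-tree values.

```java
class ShallowFeatureRulesSketch {

    /** Hypothetical stand-in for a text block; not the library's TextBlock. */
    public static final class Block {
        public final int numWords;
        public final double linkDensity; // fraction of words inside links, 0.0 to 1.0

        public Block(int numWords, double linkDensity) {
            this.numWords = numWords;
            this.linkDensity = linkDensity;
        }
    }

    // Illustrative thresholds only (assumptions, not the learned values).
    static final double MAX_LINK_DENSITY = 0.33;
    static final int MIN_WORDS = 15;

    /** Decides content/not-content for curr, using its neighbours as context. */
    public static boolean isContent(Block prev, Block curr, Block next) {
        if (curr.linkDensity > MAX_LINK_DENSITY) {
            return false; // mostly link text: likely navigation
        }
        if (curr.numWords >= MIN_WORDS) {
            return true; // long block with few links: likely body text
        }
        // Short block: keep it only when a neighbour looks like body text.
        return prev.numWords >= MIN_WORDS || next.numWords >= MIN_WORDS;
    }

    public static void main(String[] args) {
        Block nav = new Block(5, 0.9);
        Block body = new Block(40, 0.05);
        Block caption = new Block(3, 0.0);
        System.out.println(isContent(body, nav, body));
        System.out.println(isContent(nav, body, nav));
        System.out.println(isContent(body, caption, nav));
    }
}
```

Passing the previous and next blocks alongside the current one lets a short block inherit context from its surroundings, which is why the method signature takes three blocks rather than one.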
public static final NumWordsRulesClassifier INSTANCE
public static NumWordsRulesClassifier getInstance()
public boolean process(TextDocument doc) throws BoilerpipeProcessingException
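Description copied from interface: BoilerpipeFilter
Processes the given document doc.
Specified by: process in interface BoilerpipeFilter
Parameters: doc - The TextDocument that is to be processed.
Returns: true if changes have been made to the TextDocument.
Throws: BoilerpipeProcessingException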
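The return contract of process (true only if the document was changed) can be illustrated with a self-contained sketch. Everything here is an assumption made for the example: the `Block` class, the `EMPTY` sentinel used at the document boundaries, and the placeholder word-count rule. It does not reproduce the library's implementation.

```java
import java.util.ArrayList;
import java.util.List;

class SlidingWindowFilterSketch {

    /** Hypothetical mutable block; only loosely modelled on a text block. */
    public static final class Block {
        public final int numWords;
        private boolean content;

        public Block(int numWords) {
            this.numWords = numWords;
        }

        public boolean isContent() {
            return content;
        }

        /** Sets the label and reports whether it actually changed. */
        public boolean setContent(boolean value) {
            boolean changed = (content != value);
            content = value;
            return changed;
        }
    }

    /** Assumed sentinel used where a block has no real neighbour. */
    private static final Block EMPTY = new Block(0);

    /** Placeholder rule (an assumption); it ignores the neighbours for brevity. */
    public static boolean classify(Block prev, Block curr, Block next) {
        return curr.setContent(curr.numWords >= 10);
    }

    /** Classifies every block and returns true if any label changed. */
    public static boolean process(List<Block> blocks) {
        boolean hasChanges = false;
        for (int i = 0; i < blocks.size(); i++) {
            Block prev = (i == 0) ? EMPTY : blocks.get(i - 1);
            Block next = (i == blocks.size() - 1) ? EMPTY : blocks.get(i + 1);
            // Compound OR always evaluates classify, so no block is skipped.
            hasChanges |= classify(prev, blocks.get(i), next);
        }
        return hasChanges;
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block(3));
        blocks.add(new Block(12));
        System.out.println(process(blocks)); // first pass relabels the 12-word block
        System.out.println(process(blocks)); // second pass finds nothing to change
    }
}
```

Reporting whether anything changed lets a caller chain several filters and detect when a pass has had no effect.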