Class | Description |
---|---|
DensityRulesClassifier | Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features", particularly using text densities and link densities. |
IgnoreBlocksAfterContentFilter | Marks as "non-content" all blocks that occur after a block marked DefaultLabels.INDICATES_END_OF_TEXT. |
IgnoreBlocksAfterContentFromEndFilter | Marks as "non-content" all blocks that occur after a block marked DefaultLabels.INDICATES_END_OF_TEXT, and after any content block. |
KeepLargestFulltextBlockFilter | Keeps only the largest TextBlock (by number of words). |
MinFulltextWordsFilter | Keeps only those content blocks that contain at least k full-text words (measured by HeuristicFilterBase.getNumFullTextWords(TextBlock)). |
NumWordsRulesClassifier | Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features" (WSDM 2010), particularly using the number of words per block and the link density per block. |
TerminatingBlocksFinder | Finds blocks that potentially indicate the end of an article's text and marks them with DefaultLabels.INDICATES_END_OF_TEXT. |
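
To make two of the heuristics in the table concrete, here is a minimal, self-contained sketch of the "ignore blocks after the end-of-text marker" and "keep the largest block" ideas. It does not use the boilerpipe classes: `SketchBlock`, `HeuristicSketch`, the `endOfText` flag and both method names are hypothetical stand-ins invented for this illustration, and the behavior is simplified (for example, the block carrying the marker is left untouched here, which is an assumption of the sketch rather than a statement about the library).

```java
import java.util.List;

/** Hypothetical stand-in for a text block; NOT the boilerpipe TextBlock class. */
final class SketchBlock {
    final String text;
    boolean isContent;
    final boolean endOfText; // plays the role of the INDICATES_END_OF_TEXT label

    SketchBlock(String text, boolean isContent, boolean endOfText) {
        this.text = text;
        this.isContent = isContent;
        this.endOfText = endOfText;
    }

    /** Counts whitespace-separated words. */
    int numWords() {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }
}

/** Hypothetical helpers illustrating two heuristics; names are assumptions. */
final class HeuristicSketch {
    /** Marks every block strictly after the first end-of-text marker as non-content. */
    static boolean ignoreBlocksAfterEndOfText(List<SketchBlock> blocks) {
        boolean changed = false;
        boolean seenEnd = false;
        for (SketchBlock b : blocks) {
            if (seenEnd && b.isContent) {
                b.isContent = false;
                changed = true;
            }
            if (b.endOfText) {
                seenEnd = true;
            }
        }
        return changed;
    }

    /** Keeps only the content block with the most words; all others become non-content. */
    static boolean keepLargestBlock(List<SketchBlock> blocks) {
        SketchBlock largest = null;
        for (SketchBlock b : blocks) {
            if (b.isContent && (largest == null || b.numWords() > largest.numWords())) {
                largest = b;
            }
        }
        boolean changed = false;
        for (SketchBlock b : blocks) {
            if (b != largest && b.isContent) {
                b.isContent = false;
                changed = true;
            }
        }
        return changed;
    }
}
```

Both helpers return whether they changed any block, so a caller can tell if a pass had any effect. That return convention is a design choice of this sketch.
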
The BoilerpipeFilters in this package have only been tested on English text.
They will probably work with other Western languages, but may need some parameter tuning to perform well.
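
As an illustration of what parameter tuning can mean for a threshold-based filter, the following sketch exposes a minimum word count as a constructor parameter. `MinWordsSketch`, `countWords` and `keep` are hypothetical names, and the sketch uses a plain whitespace word count in place of the library's full-text word measure, so it shows only the general idea of a tunable threshold k.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical threshold filter; NOT the boilerpipe MinFulltextWordsFilter. */
final class MinWordsSketch {
    private final int minWords; // the tunable threshold k

    MinWordsSketch(int minWords) {
        this.minWords = minWords;
    }

    /** Counts whitespace-separated words (an assumption; not the library's measure). */
    static int countWords(String text) {
        String t = text.trim();
        return t.isEmpty() ? 0 : t.split("\\s+").length;
    }

    /** Returns only those block texts with at least minWords words. */
    List<String> keep(List<String> blockTexts) {
        List<String> kept = new ArrayList<>();
        for (String text : blockTexts) {
            if (countWords(text) >= minWords) {
                kept.add(text);
            }
        }
        return kept;
    }
}
```

A threshold chosen for English may be too strict or too lenient for a language with longer compound words or different sentence lengths, so one would typically try several values of k against a small hand-labeled sample.
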