Package | Description |
---|---|
de.l3s.boilerpipe | The Boilerpipe top-level package. |
de.l3s.boilerpipe.extractors | This package contains some standard extractors (i.e., complete pipelines of BoilerpipeFilters). |
de.l3s.boilerpipe.filters.english | The BoilerpipeFilters in this package have only been tested on English text. |
de.l3s.boilerpipe.filters.heuristics | The BoilerpipeFilters in this package are pure heuristics. |
de.l3s.boilerpipe.filters.simple | The BoilerpipeFilters in this package are straightforward and probably not really specific to English. |
Modifier and Type | Interface and Description |
---|---|
interface | BoilerpipeExtractor: Describes a complete filter pipeline. |
Modifier and Type | Class and Description |
---|---|
class | ArticleExtractor: A full-text extractor which is tuned towards news articles. |
class | ArticleSentencesExtractor: A full-text extractor which is tuned towards extracting sentences from news articles. |
class | CanolaExtractor |
class | DefaultExtractor: A quite generic full-text extractor. |
class | ExtractorBase: The base class of Extractors. |
class | KeepEverythingExtractor: Marks everything as content. |
class | KeepEverythingWithMinKWordsExtractor: A full-text extractor which marks everything as content and then keeps only those blocks which contain at least k words. |
class | LargestContentExtractor: A full-text extractor which extracts the largest text component of a page. |
class | NumWordsRulesExtractor: A quite generic full-text extractor based solely upon the number of words per block (the current, the previous and the next block). |
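
The descriptions above characterize an extractor as a complete pipeline of filters. The following is a minimal, self-contained sketch of that idea and not Boilerpipe's own code: `Block`, `Filter`, `pipeline`, `minWords` and `extract` are hypothetical stand-ins introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ExtractorPipelineSketch {

    // Hypothetical stand-in for a text block: its text plus an "is content" flag.
    static final class Block {
        final String text;
        boolean isContent;

        Block(String text, boolean isContent) {
            this.text = text;
            this.isContent = isContent;
        }

        int numWords() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }
    }

    // Hypothetical stand-in for a filter: mutates the blocks and reports whether anything changed.
    interface Filter {
        boolean process(List<Block> blocks);
    }

    // Marks every block as content.
    static final Filter MARK_EVERYTHING_CONTENT = blocks -> {
        boolean changed = false;
        for (Block b : blocks) {
            if (!b.isContent) {
                b.isContent = true;
                changed = true;
            }
        }
        return changed;
    };

    // Demotes content blocks that have fewer than k words.
    static Filter minWords(int k) {
        return blocks -> {
            boolean changed = false;
            for (Block b : blocks) {
                if (b.isContent && b.numWords() < k) {
                    b.isContent = false;
                    changed = true;
                }
            }
            return changed;
        };
    }

    // An "extractor" in this sketch is simply the given filters applied in order.
    static Filter pipeline(Filter... filters) {
        return blocks -> {
            boolean changed = false;
            for (Filter f : filters) {
                changed |= f.process(blocks); // non-short-circuit: every filter runs
            }
            return changed;
        };
    }

    // Concatenates the text of all blocks still marked as content.
    static String getText(List<Block> blocks) {
        StringBuilder sb = new StringBuilder();
        for (Block b : blocks) {
            if (b.isContent) {
                sb.append(b.text).append('\n');
            }
        }
        return sb.toString();
    }

    // Runs "mark everything, then require at least k words" over the given block texts.
    static String extract(int k, String... texts) {
        List<Block> blocks = new ArrayList<>();
        for (String t : texts) {
            blocks.add(new Block(t, false));
        }
        pipeline(MARK_EVERYTHING_CONTENT, minWords(k)).process(blocks);
        return getText(blocks);
    }

    public static void main(String[] args) {
        System.out.print(extract(3, "Home", "A longer block of text", "Login"));
    }
}
```

Composing small filters this way keeps each rule independently testable; the order of the filters matters, since later filters see the flags set by earlier ones.
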
Modifier and Type | Field and Description |
---|---|
static BoilerpipeFilter | CanolaExtractor.CLASSIFIER: The actual classifier, exposed. |
Modifier and Type | Class and Description |
---|---|
class | DensityRulesClassifier: Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features", particularly using text densities and link densities. |
class | IgnoreBlocksAfterContentFilter: Marks all blocks as "non-content" that occur after blocks that have been marked with DefaultLabels.INDICATES_END_OF_TEXT. |
class | IgnoreBlocksAfterContentFromEndFilter: Marks all blocks as "non-content" that occur after blocks that have been marked with DefaultLabels.INDICATES_END_OF_TEXT, and after any content block. |
class | KeepLargestFulltextBlockFilter: Keeps the largest TextBlock only (by the number of words). |
class | MinFulltextWordsFilter: Keeps only those content blocks which contain at least k full-text words (measured by HeuristicFilterBase.getNumFullTextWords(TextBlock)). |
class | NumWordsRulesClassifier: Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features" (WSDM 2010), particularly using number of words per block and link density per block. |
class | TerminatingBlocksFinder: Finds blocks which potentially indicate the end of an article text and marks them with DefaultLabels.INDICATES_END_OF_TEXT. |
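
The "ignore blocks after content" rule described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the library's implementation: `Block`, the `END_OF_TEXT` label string and `ignoreBlocksAfterEndOfText` are hypothetical, and the sketch assumes the labelled block itself is also demoted to non-content.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class EndOfTextSketch {

    // Hypothetical label string standing in for DefaultLabels.INDICATES_END_OF_TEXT.
    static final String END_OF_TEXT = "INDICATES_END_OF_TEXT";

    // Hypothetical stand-in for a labelled text block.
    static final class Block {
        final String text;
        boolean isContent;
        final Set<String> labels;

        Block(String text, boolean isContent, String... labels) {
            this.text = text;
            this.isContent = isContent;
            this.labels = new HashSet<>(Arrays.asList(labels));
        }
    }

    // Once an end-of-text block is seen, that block and every later block
    // are marked as non-content. Returns true if any flag was changed.
    static boolean ignoreBlocksAfterEndOfText(List<Block> blocks) {
        boolean changed = false;
        boolean foundEnd = false;
        for (Block b : blocks) {
            if (b.labels.contains(END_OF_TEXT)) {
                foundEnd = true;
            }
            if (foundEnd && b.isContent) {
                b.isContent = false;
                changed = true;
            }
        }
        return changed;
    }
}
```

A filter like this only makes sense after some other step (such as a terminating-blocks finder) has attached the end-of-text label.
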
Modifier and Type | Class and Description |
---|---|
class | AddPrecedingLabelsFilter: Adds the labels of the preceding block to the current block, optionally adding a prefix. |
class | ArticleMetadataFilter |
class | BlockProximityFusion: Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit. |
class | ContentFusion |
class | DocumentTitleMatchClassifier: Marks TextBlocks which contain parts of the HTML `<TITLE>` tag, using some heuristics which are quite specific to the news domain. |
class | ExpandTitleToContentFilter: Marks as "content" all TextBlocks which lie between the headline and the part that has already been marked content, if they are marked DefaultLabels.MIGHT_BE_CONTENT. |
class | KeepLargestBlockFilter: Keeps the largest TextBlock only (by the number of words). |
class | LabelFusion: Fuses adjacent blocks if their labels are equal. |
class | SimpleBlockFusionProcessor: Merges two subsequent blocks if their text densities are equal. |
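
As an illustration of the "keep the largest block only" heuristic listed above, here is a minimal sketch. It assumes that only blocks already marked as content compete and that ties go to the earliest block; `Block` and `keepLargest` are hypothetical names, not the library's API.

```java
import java.util.List;

final class KeepLargestSketch {

    // Hypothetical stand-in for a text block.
    static final class Block {
        final String text;
        boolean isContent;

        Block(String text, boolean isContent) {
            this.text = text;
            this.isContent = isContent;
        }

        int numWords() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }
    }

    // Keeps the content block with the most words and demotes all other content blocks.
    // Returns true if any flag was changed.
    static boolean keepLargest(List<Block> blocks) {
        Block largest = null;
        for (Block b : blocks) {
            // Strict '>' means the first block wins a tie (an assumption of this sketch).
            if (b.isContent && (largest == null || b.numWords() > largest.numWords())) {
                largest = b;
            }
        }
        boolean changed = false;
        for (Block b : blocks) {
            if (b != largest && b.isContent) {
                b.isContent = false;
                changed = true;
            }
        }
        return changed;
    }
}
```
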
Modifier and Type | Class and Description |
---|---|
class | BoilerplateBlockFilter: Removes TextBlocks which have explicitly been marked as "not content". |
class | InvertedFilter: Inverts the "isContent" flag for all TextBlocks. |
class | LabelToBoilerplateFilter: Marks all blocks that contain a given label as "boilerplate". |
class | LabelToContentFilter: Marks all blocks that contain a given label as "content". |
class | MarkEverythingContentFilter: Marks all blocks as content. |
class | MinClauseWordsFilter: Keeps only blocks that have at least one segment fragment ("clause") with at least k words (default: 5). |
class | MinWordsFilter: Keeps only those content blocks which contain at least k words. |
class | SplitParagraphBlocksFilter: Splits TextBlocks at paragraph boundaries. |
class | SurroundingToContentFilter |
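
Two of the simple filters above, inverting the content flag and removing non-content blocks, combine naturally to extract the boilerplate instead of the content. The sketch below illustrates this with hypothetical `Block`, `invert` and `removeBoilerplate` helpers; it is not the library's code.

```java
import java.util.List;

final class SimpleFiltersSketch {

    // Hypothetical stand-in for a text block.
    static final class Block {
        final String text;
        boolean isContent;

        Block(String text, boolean isContent) {
            this.text = text;
            this.isContent = isContent;
        }
    }

    // Flips the "isContent" flag of every block. Returns true if there was anything to flip.
    static boolean invert(List<Block> blocks) {
        if (blocks.isEmpty()) {
            return false;
        }
        for (Block b : blocks) {
            b.isContent = !b.isContent;
        }
        return true;
    }

    // Removes every block that is not marked as content (the list must be mutable).
    // Returns true if at least one block was removed.
    static boolean removeBoilerplate(List<Block> blocks) {
        return blocks.removeIf(b -> !b.isContent);
    }
}
```

Applying `invert` followed by `removeBoilerplate` leaves only the blocks that were originally classified as boilerplate, which can be useful when inspecting what a classifier discards.
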