A B C D E F G H I K L M N P R S T U 

A

addLabel(String) - Method in class de.l3s.boilerpipe.document.TextBlock
Adds an arbitrary String label to this TextBlock.
addLabelAction(LabelAction) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
 
addLabels(Set<String>) - Method in class de.l3s.boilerpipe.document.TextBlock
Adds a set of labels to this TextBlock.
addLabels(String...) - Method in class de.l3s.boilerpipe.document.TextBlock
Adds a set of labels to this TextBlock.
addLabelsTo(TextBlock) - Method in class de.l3s.boilerpipe.labels.LabelAction
 
AddPrecedingLabelsFilter - Class in de.l3s.boilerpipe.filters.heuristics
Adds the labels of the preceding block to the current block, optionally adding a prefix.
AddPrecedingLabelsFilter(String) - Constructor for class de.l3s.boilerpipe.filters.heuristics.AddPrecedingLabelsFilter
Creates a new AddPrecedingLabelsFilter instance.
addTagAction(String, TagAction) - Method in class de.l3s.boilerpipe.sax.TagActionMap
Adds a particular TagAction for a given tag.
addTextBlock(TextBlock) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
 
addTo(TextBlock) - Method in class de.l3s.boilerpipe.labels.ConditionalLabelAction
 
addTo(TextBlock) - Method in class de.l3s.boilerpipe.labels.LabelAction
 
addWhitespaceIfNecessary() - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
 
ARTICLE_EXTRACTOR - Static variable in class de.l3s.boilerpipe.extractors.CommonExtractors
Works very well for most types of Article-like HTML.
ARTICLE_METADATA - Static variable in class de.l3s.boilerpipe.labels.DefaultLabels
 
ArticleExtractor - Class in de.l3s.boilerpipe.extractors
A full-text extractor which is tuned towards news articles.
ArticleExtractor() - Constructor for class de.l3s.boilerpipe.extractors.ArticleExtractor
 
ArticleMetadataFilter - Class in de.l3s.boilerpipe.filters.heuristics
 
ArticleSentencesExtractor - Class in de.l3s.boilerpipe.extractors
A full-text extractor which is tuned towards extracting sentences from news articles.
ArticleSentencesExtractor() - Constructor for class de.l3s.boilerpipe.extractors.ArticleSentencesExtractor
 
avgNumWords() - Method in class de.l3s.boilerpipe.document.TextDocumentStatistics
Returns the average number of words at block-level (= overall number of words divided by the number of blocks).
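
The block-level average described for avgNumWords() can be illustrated with a small standalone sketch. The class AvgWordsSketch and its list-of-counts input are hypothetical and not part of boilerpipe; returning 0 for an empty document is also an assumption of the sketch.

```java
import java.util.List;

/** Hypothetical helper, not part of boilerpipe. */
final class AvgWordsSketch {
    private AvgWordsSketch() {}

    /**
     * Overall number of words divided by the number of blocks.
     * An empty document yields 0 (an assumption of this sketch).
     */
    static float avgNumWords(List<Integer> wordsPerBlock) {
        if (wordsPerBlock.isEmpty()) {
            return 0f;
        }
        int total = 0;
        for (int words : wordsPerBlock) {
            total += words;
        }
        // Cast before dividing so the result is not truncated to an int.
        return total / (float) wordsPerBlock.size();
    }
}
```

For three blocks with 10, 20 and 60 words, the sketch divides 90 by 3.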

B

BlockProximityFusion - Class in de.l3s.boilerpipe.filters.heuristics
Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit.
BlockProximityFusion(int, boolean, boolean) - Constructor for class de.l3s.boilerpipe.filters.heuristics.BlockProximityFusion
Creates a new BlockProximityFusion instance.
BlockTagLabelAction(LabelAction) - Constructor for class de.l3s.boilerpipe.sax.CommonTagActions.BlockTagLabelAction
 
BoilerpipeDocumentSource - Interface in de.l3s.boilerpipe
Something that can be represented as a TextDocument.
BoilerpipeExtractor - Interface in de.l3s.boilerpipe
Describes a complete filter pipeline.
BoilerpipeFilter - Interface in de.l3s.boilerpipe
A generic BoilerpipeFilter.
BoilerpipeHTMLContentHandler - Class in de.l3s.boilerpipe.sax
A simple SAX ContentHandler, used by BoilerpipeSAXInput.
BoilerpipeHTMLContentHandler() - Constructor for class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
BoilerpipeHTMLContentHandler(TagActionMap) - Constructor for class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Constructs a BoilerpipeHTMLContentHandler using the given TagActionMap.
BoilerpipeHTMLParser - Class in de.l3s.boilerpipe.sax
A simple SAX Parser, used by BoilerpipeSAXInput.
BoilerpipeHTMLParser() - Constructor for class de.l3s.boilerpipe.sax.BoilerpipeHTMLParser
Constructs a BoilerpipeHTMLParser using a default HTML content handler.
BoilerpipeHTMLParser(BoilerpipeHTMLContentHandler) - Constructor for class de.l3s.boilerpipe.sax.BoilerpipeHTMLParser
BoilerpipeHTMLParser(boolean) - Constructor for class de.l3s.boilerpipe.sax.BoilerpipeHTMLParser
 
BoilerpipeInput - Interface in de.l3s.boilerpipe
A source that returns TextDocuments.
BoilerpipeProcessingException - Exception in de.l3s.boilerpipe
Exception for signaling failure in the processing pipeline.
BoilerpipeProcessingException() - Constructor for exception de.l3s.boilerpipe.BoilerpipeProcessingException
 
BoilerpipeProcessingException(String, Throwable) - Constructor for exception de.l3s.boilerpipe.BoilerpipeProcessingException
 
BoilerpipeProcessingException(String) - Constructor for exception de.l3s.boilerpipe.BoilerpipeProcessingException
 
BoilerpipeProcessingException(Throwable) - Constructor for exception de.l3s.boilerpipe.BoilerpipeProcessingException
 
BoilerpipeSAXInput - Class in de.l3s.boilerpipe.sax
Parses an InputSource using SAX and returns a TextDocument.
BoilerpipeSAXInput(InputSource) - Constructor for class de.l3s.boilerpipe.sax.BoilerpipeSAXInput
Creates a new instance of BoilerpipeSAXInput for the given InputSource.
BoilerplateBlockFilter - Class in de.l3s.boilerpipe.filters.simple
Removes TextBlocks which have explicitly been marked as "not content".
BoilerplateBlockFilter() - Constructor for class de.l3s.boilerpipe.filters.simple.BoilerplateBlockFilter
 
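
The idea behind the BlockProximityFusion entry above (fusing adjacent blocks whose distance in blocks does not exceed a limit) might look roughly like the following standalone sketch. The Block type, the way the gap is measured and the newline used to join texts are all assumptions made for illustration; the sketch also ignores the two boolean constructor options, whose meaning the index does not state.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration, not the boilerpipe implementation. */
final class ProximityFusionSketch {
    /** Stand-in for a text block: its index range in the original block sequence, plus text. */
    static final class Block {
        final int start;
        final int end;
        final String text;

        Block(int start, int end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    /** Fuses each block into its predecessor when at most maxDistance blocks lie between them. */
    static List<Block> fuse(List<Block> blocks, int maxDistance) {
        List<Block> out = new ArrayList<>();
        for (Block b : blocks) {
            if (!out.isEmpty()) {
                Block prev = out.get(out.size() - 1);
                // Number of original blocks strictly between prev and b (assumed distance measure).
                int gap = b.start - prev.end - 1;
                if (gap <= maxDistance) {
                    out.set(out.size() - 1, new Block(prev.start, b.end, prev.text + "\n" + b.text));
                    continue;
                }
            }
            out.add(b);
        }
        return out;
    }
}
```

With a limit of 1, blocks at positions 0 and 2 are fused, while a block at position 6 stays separate.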

C

CANOLA_EXTRACTOR - Static variable in class de.l3s.boilerpipe.extractors.CommonExtractors
Trained on krdwrd Canola (different definition of "boilerplate").
CanolaExtractor - Class in de.l3s.boilerpipe.extractors
A full-text extractor trained on krdwrd Canola.
CanolaExtractor() - Constructor for class de.l3s.boilerpipe.extractors.CanolaExtractor
 
Chained(TagAction, TagAction) - Constructor for class de.l3s.boilerpipe.sax.CommonTagActions.Chained
 
changesTagLevel() - Method in class de.l3s.boilerpipe.sax.CommonTagActions.BlockTagLabelAction
 
changesTagLevel() - Method in class de.l3s.boilerpipe.sax.CommonTagActions.Chained
 
changesTagLevel() - Method in class de.l3s.boilerpipe.sax.CommonTagActions.InlineTagLabelAction
 
changesTagLevel() - Method in class de.l3s.boilerpipe.sax.MarkupTagAction
 
changesTagLevel() - Method in interface de.l3s.boilerpipe.sax.TagAction
 
characters(char[], int, int) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
 
CLASSIFIER - Static variable in class de.l3s.boilerpipe.extractors.CanolaExtractor
The actual classifier, exposed.
classify(TextBlock, TextBlock, TextBlock) - Method in class de.l3s.boilerpipe.filters.english.DensityRulesClassifier
 
classify(TextBlock, TextBlock, TextBlock) - Method in class de.l3s.boilerpipe.filters.english.NumWordsRulesClassifier
 
clone() - Method in class de.l3s.boilerpipe.document.TextBlock
 
CommonExtractors - Class in de.l3s.boilerpipe.extractors
Provides quick access to common BoilerpipeExtractors.
CommonTagActions - Class in de.l3s.boilerpipe.sax
Defines an action that is to be performed whenever a particular tag occurs during HTML parsing.
CommonTagActions.BlockTagLabelAction - Class in de.l3s.boilerpipe.sax
CommonTagActions for block-level elements, which triggers some LabelAction on the generated TextBlock.
CommonTagActions.Chained - Class in de.l3s.boilerpipe.sax
 
CommonTagActions.InlineTagLabelAction - Class in de.l3s.boilerpipe.sax
CommonTagActions for inline elements, which triggers some LabelAction on the generated TextBlock.
ConditionalLabelAction - Class in de.l3s.boilerpipe.labels
Adds labels to a TextBlock if the given criteria are met.
ConditionalLabelAction(TextBlockCondition, String...) - Constructor for class de.l3s.boilerpipe.labels.ConditionalLabelAction
 
ContentFusion - Class in de.l3s.boilerpipe.filters.heuristics
 
ContentFusion() - Constructor for class de.l3s.boilerpipe.filters.heuristics.ContentFusion
Creates a new ContentFusion instance.
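
The ConditionalLabelAction entry above describes adding labels only when a condition holds. A minimal standalone sketch of that pattern follows; the Block type and the use of java.util.function.Predicate in place of TextBlockCondition are assumptions for illustration, not the library's types.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical illustration, not the boilerpipe implementation. */
final class ConditionalLabelSketch {
    /** Stand-in for a text block carrying a word count and a set of labels. */
    static final class Block {
        final int numWords;
        final Set<String> labels = new HashSet<>();

        Block(int numWords) {
            this.numWords = numWords;
        }
    }

    /** Adds the labels to the block only if the condition holds; reports whether it did. */
    static boolean addIf(Block block, Predicate<Block> condition, String... labels) {
        if (!condition.test(block)) {
            return false;
        }
        for (String label : labels) {
            block.labels.add(label);
        }
        return true;
    }
}
```

A call such as `addIf(block, b -> b.numWords >= 10, "LONG")` labels only blocks with ten or more words; the label name is made up for the example.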

D

de.l3s.boilerpipe - package de.l3s.boilerpipe
The Boilerpipe top-level package.
de.l3s.boilerpipe.conditions - package de.l3s.boilerpipe.conditions
 
de.l3s.boilerpipe.document - package de.l3s.boilerpipe.document
The classes in this package represent the simple Boilerpipe document model.
de.l3s.boilerpipe.estimators - package de.l3s.boilerpipe.estimators
 
de.l3s.boilerpipe.extractors - package de.l3s.boilerpipe.extractors
This package contains some standard extractors (i.e., completely piped BoilerpipeFilters).
de.l3s.boilerpipe.filters.english - package de.l3s.boilerpipe.filters.english
The BoilerpipeFilters in this package have only been tested on English text.
de.l3s.boilerpipe.filters.heuristics - package de.l3s.boilerpipe.filters.heuristics
The BoilerpipeFilters in this package are pure heuristics.
de.l3s.boilerpipe.filters.simple - package de.l3s.boilerpipe.filters.simple
The BoilerpipeFilters in this package are straightforward and probably not really specific to English.
de.l3s.boilerpipe.labels - package de.l3s.boilerpipe.labels
 
de.l3s.boilerpipe.sax - package de.l3s.boilerpipe.sax
Classes related to parsing and producing HTML from/to Boilerpipe TextDocuments.
de.l3s.boilerpipe.util - package de.l3s.boilerpipe.util
Some helper classes.
debugString() - Method in class de.l3s.boilerpipe.document.TextDocument
Returns detailed debugging information about the contained TextBlocks.
DEFAULT_EXTRACTOR - Static variable in class de.l3s.boilerpipe.extractors.CommonExtractors
Usually worse than ArticleExtractor, but simpler/no heuristics.
DEFAULT_INSTANCE - Static variable in class de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFilter
 
DEFAULT_INSTANCE - Static variable in class de.l3s.boilerpipe.filters.english.MinFulltextWordsFilter
 
DefaultExtractor - Class in de.l3s.boilerpipe.extractors
A quite generic full-text extractor.
DefaultExtractor() - Constructor for class de.l3s.boilerpipe.extractors.DefaultExtractor
 
DefaultLabels - Class in de.l3s.boilerpipe.labels
Some pre-defined labels which can be used in conjunction with TextBlock.addLabel(String) and TextBlock.hasLabel(String).
DefaultLabels() - Constructor for class de.l3s.boilerpipe.labels.DefaultLabels
 
DefaultTagActionMap - Class in de.l3s.boilerpipe.sax
Default TagActions.
DefaultTagActionMap() - Constructor for class de.l3s.boilerpipe.sax.DefaultTagActionMap
 
DensityRulesClassifier - Class in de.l3s.boilerpipe.filters.english
Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features", particularly using text densities and link densities.
DensityRulesClassifier() - Constructor for class de.l3s.boilerpipe.filters.english.DensityRulesClassifier
 
DocumentTitleMatchClassifier - Class in de.l3s.boilerpipe.filters.heuristics
Marks TextBlocks which contain parts of the HTML <TITLE> tag, using some heuristics which are quite specific to the news domain.
DocumentTitleMatchClassifier(String) - Constructor for class de.l3s.boilerpipe.filters.heuristics.DocumentTitleMatchClassifier
 

E

EMPTY_END - Static variable in class de.l3s.boilerpipe.document.TextBlock
 
EMPTY_START - Static variable in class de.l3s.boilerpipe.document.TextBlock
 
end(BoilerpipeHTMLContentHandler, String, String) - Method in class de.l3s.boilerpipe.sax.CommonTagActions.BlockTagLabelAction
 
end(BoilerpipeHTMLContentHandler, String, String) - Method in class de.l3s.boilerpipe.sax.CommonTagActions.Chained
 
end(BoilerpipeHTMLContentHandler, String, String) - Method in class de.l3s.boilerpipe.sax.CommonTagActions.InlineTagLabelAction
 
end(BoilerpipeHTMLContentHandler, String, String) - Method in class de.l3s.boilerpipe.sax.MarkupTagAction
 
end(BoilerpipeHTMLContentHandler, String, String) - Method in interface de.l3s.boilerpipe.sax.TagAction
 
endDocument() - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
 
endElement(String, String, String) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
 
endPrefixMapping(String) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
 
ExpandTitleToContentFilter - Class in de.l3s.boilerpipe.filters.heuristics
Marks as "content" all TextBlocks that lie between the headline and the part that has already been marked content, provided they are marked DefaultLabels.MIGHT_BE_CONTENT.
ExpandTitleToContentFilter() - Constructor for class de.l3s.boilerpipe.filters.heuristics.ExpandTitleToContentFilter
 
ExtractorBase - Class in de.l3s.boilerpipe.extractors
The base class of Extractors.
ExtractorBase() - Constructor for class de.l3s.boilerpipe.extractors.ExtractorBase
 

F

fetch(URL) - Static method in class de.l3s.boilerpipe.sax.HTMLFetcher
Fetches the document at the given URL, using URLConnection.
flushBlock() - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
 

G

getCharset() - Method in class de.l3s.boilerpipe.sax.HTMLDocument
 
getContainedTextElements() - Method in class de.l3s.boilerpipe.document.TextBlock
Returns the containedTextElements BitSet, or null.
getContent() - Method in class de.l3s.boilerpipe.document.TextDocument
Returns the TextDocument's content.
getData() - Method in class de.l3s.boilerpipe.sax.HTMLDocument
 
getDefaultInstance() - Static method in class de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFilter
Returns the singleton instance for IgnoreBlocksAfterContentFilter.
getDefaultInstance() - Static method in class de.l3s.boilerpipe.filters.english.MinFulltextWordsFilter
 
getExtraStyleSheet() - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
Returns the extra stylesheet definition that will be inserted in the HEAD element.
getInstance() - Static method in class de.l3s.boilerpipe.extractors.ArticleExtractor
Returns the singleton instance for ArticleExtractor.
getInstance() - Static method in class de.l3s.boilerpipe.extractors.ArticleSentencesExtractor
Returns the singleton instance for ArticleSentencesExtractor.
getInstance() - Static method in class de.l3s.boilerpipe.extractors.CanolaExtractor
Returns the singleton instance for CanolaExtractor.
getInstance() - Static method in class de.l3s.boilerpipe.extractors.DefaultExtractor
Returns the singleton instance for DefaultExtractor.
getInstance() - Static method in class de.l3s.boilerpipe.extractors.LargestContentExtractor
Returns the singleton instance for LargestContentExtractor.
getInstance() - Static method in class de.l3s.boilerpipe.extractors.NumWordsRulesExtractor
Returns the singleton instance for NumWordsRulesExtractor.
getInstance() - Static method in class de.l3s.boilerpipe.filters.english.DensityRulesClassifier
Returns the singleton instance for DensityRulesClassifier.
getInstance() - Static method in class de.l3s.boilerpipe.filters.english.NumWordsRulesClassifier
Returns the singleton instance for NumWordsRulesClassifier.
getInstance() - Static method in class de.l3s.boilerpipe.filters.english.TerminatingBlocksFinder
Returns the singleton instance for TerminatingBlocksFinder.
getInstance() - Static method in class de.l3s.boilerpipe.filters.heuristics.ExpandTitleToContentFilter
Returns the singleton instance for ExpandTitleToContentFilter.
getInstance() - Static method in class de.l3s.boilerpipe.filters.heuristics.SimpleBlockFusionProcessor
Returns the singleton instance for SimpleBlockFusionProcessor.
getInstance() - Static method in class de.l3s.boilerpipe.filters.simple.BoilerplateBlockFilter
Returns the singleton instance for BoilerplateBlockFilter.
getInstance() - Static method in class de.l3s.boilerpipe.filters.simple.SplitParagraphBlocksFilter
Returns the singleton instance for SplitParagraphBlocksFilter.
getLabels() - Method in class de.l3s.boilerpipe.document.TextBlock
Returns the labels associated with this TextBlock, or null if no such labels exist.
getLinkDensity() - Method in class de.l3s.boilerpipe.document.TextBlock
 
getNumWords() - Method in class de.l3s.boilerpipe.document.TextBlock
 
getNumWords() - Method in class de.l3s.boilerpipe.document.TextDocumentStatistics
Returns the overall number of words in all blocks.
getNumWordsInAnchorText() - Method in class de.l3s.boilerpipe.document.TextBlock
 
getOffsetBlocksEnd() - Method in class de.l3s.boilerpipe.document.TextBlock
 
getOffsetBlocksStart() - Method in class de.l3s.boilerpipe.document.TextBlock
 
getPostHighlight() - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
Returns the string that will be inserted after any highlighted HTML block.
getPotentialTitles() - Method in class de.l3s.boilerpipe.filters.heuristics.DocumentTitleMatchClassifier
 
getPreHighlight() - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
Returns the string that will be inserted before any highlighted HTML block.
getTagLevel() - Method in class de.l3s.boilerpipe.document.TextBlock
 
getText(String) - Method in interface de.l3s.boilerpipe.BoilerpipeExtractor
Extracts text from the HTML code given as a String.
getText(InputSource) - Method in interface de.l3s.boilerpipe.BoilerpipeExtractor
Extracts text from the HTML code available from the given InputSource.
getText(Reader) - Method in interface de.l3s.boilerpipe.BoilerpipeExtractor
Extracts text from the HTML code available from the given Reader.
getText(TextDocument) - Method in interface de.l3s.boilerpipe.BoilerpipeExtractor
Extracts text from the given TextDocument object.
getText() - Method in class de.l3s.boilerpipe.document.TextBlock
 
getText(boolean, boolean) - Method in class de.l3s.boilerpipe.document.TextDocument
Returns the TextDocument's content, non-content or both.
getText(String) - Method in class de.l3s.boilerpipe.extractors.ExtractorBase
Extracts text from the HTML code given as a String.
getText(InputSource) - Method in class de.l3s.boilerpipe.extractors.ExtractorBase
Extracts text from the HTML code available from the given InputSource.
getText(URL) - Method in class de.l3s.boilerpipe.extractors.ExtractorBase
Extracts text from the HTML code available from the given URL.
getText(Reader) - Method in class de.l3s.boilerpipe.extractors.ExtractorBase
Extracts text from the HTML code available from the given Reader.
getText(TextDocument) - Method in class de.l3s.boilerpipe.extractors.ExtractorBase
Extracts text from the given TextDocument object.
getTextBlocks() - Method in class de.l3s.boilerpipe.document.TextDocument
Returns the TextBlocks of this document.
getTextDensity() - Method in class de.l3s.boilerpipe.document.TextBlock
 
getTextDocument() - Method in interface de.l3s.boilerpipe.BoilerpipeInput
Returns (somehow) a TextDocument.
getTextDocument() - Method in class de.l3s.boilerpipe.sax.BoilerpipeSAXInput
Retrieves the TextDocument using a default HTML parser.
getTextDocument(BoilerpipeHTMLParser) - Method in class de.l3s.boilerpipe.sax.BoilerpipeSAXInput
Retrieves the TextDocument using the given HTML parser.
getTitle() - Method in class de.l3s.boilerpipe.document.TextDocument
Returns the "main" title for this document, or null if no such title has been set.
getTitle() - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
 

H

hasLabel(String) - Method in class de.l3s.boilerpipe.document.TextBlock
Checks whether this TextBlock has the given label.
HR - Static variable in class de.l3s.boilerpipe.labels.DefaultLabels
 
HTMLDocument - Class in de.l3s.boilerpipe.sax
HTMLDocument(byte[], Charset) - Constructor for class de.l3s.boilerpipe.sax.HTMLDocument
 
HTMLDocument(String) - Constructor for class de.l3s.boilerpipe.sax.HTMLDocument
 
HTMLFetcher - Class in de.l3s.boilerpipe.sax
A very simple HTTP/HTML fetcher, really just for demo purposes.
HTMLHighlighter - Class in de.l3s.boilerpipe.sax
Highlights text blocks in an HTML document that have been marked as "content" in the corresponding TextDocument.

I

ignorableWhitespace(char[], int, int) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
 
IgnoreBlocksAfterContentFilter - Class in de.l3s.boilerpipe.filters.english
Marks all blocks as "non-content" that occur after blocks that have been marked DefaultLabels.INDICATES_END_OF_TEXT.
IgnoreBlocksAfterContentFilter(int) - Constructor for class de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFilter
 
IgnoreBlocksAfterContentFromEndFilter - Class in de.l3s.boilerpipe.filters.english
Marks all blocks as "non-content" that occur after blocks that have been marked DefaultLabels.INDICATES_END_OF_TEXT, and after any content block.
INDICATES_END_OF_TEXT - Static variable in class de.l3s.boilerpipe.labels.DefaultLabels
 
InlineTagLabelAction(LabelAction) - Constructor for class de.l3s.boilerpipe.sax.CommonTagActions.InlineTagLabelAction
 
InputSourceable - Interface in de.l3s.boilerpipe.sax
An InputSourceable can return an arbitrary number of new InputSources for a given document.
INSTANCE - Static variable in class de.l3s.boilerpipe.estimators.SimpleEstimator
The singleton instance of SimpleEstimator.
INSTANCE - Static variable in class de.l3s.boilerpipe.extractors.ArticleExtractor
 
INSTANCE - Static variable in class de.l3s.boilerpipe.extractors.ArticleSentencesExtractor
 
INSTANCE - Static variable in class de.l3s.boilerpipe.extractors.CanolaExtractor
 
INSTANCE - Static variable in class de.l3s.boilerpipe.extractors.DefaultExtractor
 
INSTANCE - Static variable in class de.l3s.boilerpipe.extractors.KeepEverythingExtractor
 
INSTANCE - Static variable in class de.l3s.boilerpipe.extractors.LargestContentExtractor
 
INSTANCE - Static variable in class de.l3s.boilerpipe.extractors.NumWordsRulesExtractor
 
INSTANCE - Static variable in class de.l3s.boilerpipe.filters.english.DensityRulesClassifier
 
INSTANCE - Static variable in class de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFromEndFilter
 
INSTANCE - Static variable in class de.l3s.boilerpipe.filters.english.KeepLargestFulltextBlockFilter
 
INSTANCE - Static variable in class de.l3s.boilerpipe.filters.english.NumWordsRulesClassifier
 
INSTANCE - Static variable in class de.l3s.boilerpipe.filters.english.TerminatingBlocksFinder
 
INSTANCE - Static variable in class de.l3s.boilerpipe.filters.heuristics.AddPrecedingLabelsFilter
 
INSTANCE - Static variable in class de.l3s.boilerpipe.filters.heuristics.ArticleMetadataFilter
 
INSTANCE - Static variable in class de.l3s.boilerpipe.filters.heuristics.ContentFusion
 
INSTANCE - Static variable in class de.l3s.boilerpipe.filters.heuristics.ExpandTitleToContentFilter
 
INSTANCE - Static variable in class de.l3s.boilerpipe.filters.heuristics.KeepLargestBlockFilter
 
INSTANCE - Static variable in class de.l3s.boilerpipe.filters.heuristics.LabelFusion
 
INSTANCE - Static variable in class de.l3s.boilerpipe.filters.heuristics.SimpleBlockFusionProcessor
 
INSTANCE - Static variable in class de.l3s.boilerpipe.filters.simple.BoilerplateBlockFilter
 
INSTANCE - Static variable in class de.l3s.boilerpipe.filters.simple.InvertedFilter
 
INSTANCE - Static variable in class de.l3s.boilerpipe.filters.simple.MarkEverythingContentFilter
 
INSTANCE - Static variable in class de.l3s.boilerpipe.filters.simple.MinClauseWordsFilter
 
INSTANCE - Static variable in class de.l3s.boilerpipe.filters.simple.SplitParagraphBlocksFilter
 
INSTANCE - Static variable in class de.l3s.boilerpipe.sax.DefaultTagActionMap
 
INSTANCE_200 - Static variable in class de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFilter
 
INSTANCE_EXPAND_TO_SAME_TAGLEVEL - Static variable in class de.l3s.boilerpipe.filters.heuristics.KeepLargestBlockFilter
 
INSTANCE_PRE - Static variable in class de.l3s.boilerpipe.filters.heuristics.AddPrecedingLabelsFilter
 
INSTANCE_STRICTLY_NOT_CONTENT - Static variable in class de.l3s.boilerpipe.filters.simple.LabelToBoilerplateFilter
 
INSTANCE_TEXT - Static variable in class de.l3s.boilerpipe.filters.simple.SurroundingToContentFilter
 
InvertedFilter - Class in de.l3s.boilerpipe.filters.simple
Inverts the "isContent" flag for all TextBlocks.
isContent() - Method in class de.l3s.boilerpipe.document.TextBlock
 
isLowQuality(TextDocumentStatistics, TextDocumentStatistics) - Method in class de.l3s.boilerpipe.estimators.SimpleEstimator
Given the statistics of the document before and after applying the BoilerpipeExtractor, can we regard the extraction quality (too) low? Works well with DefaultExtractor, ArticleExtractor and others.
isOutputHighlightOnly() - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
If true, only HTML enclosed within highlighted content will be returned.
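
The behaviour described for IgnoreBlocksAfterContentFilter above (marking blocks as "non-content" once an end-of-text marker has been seen) can be sketched in isolation as follows. The Block type with its endOfText flag is hypothetical, the int constructor argument listed in the index is not modelled, and treating the marker block itself as non-content is an assumption of the sketch.

```java
import java.util.List;

/** Hypothetical illustration, not the boilerpipe implementation. */
final class IgnoreAfterEndSketch {
    /** Stand-in for a text block with an end-of-text marker and a content flag. */
    static final class Block {
        final boolean endOfText;
        boolean content;

        Block(boolean endOfText, boolean content) {
            this.endOfText = endOfText;
            this.content = content;
        }
    }

    /** Clears the content flag from the first end-of-text block onwards; returns true if anything changed. */
    static boolean ignoreAfterEnd(List<Block> blocks) {
        boolean seenEnd = false;
        boolean changed = false;
        for (Block b : blocks) {
            if (b.endOfText) {
                seenEnd = true;
            }
            // Assumption: the marker block itself is also treated as non-content.
            if (seenEnd && b.content) {
                b.content = false;
                changed = true;
            }
        }
        return changed;
    }
}
```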

K

KEEP_EVERYTHING_EXTRACTOR - Static variable in class de.l3s.boilerpipe.extractors.CommonExtractors
Dummy Extractor; should return the input text.
KeepEverythingExtractor - Class in de.l3s.boilerpipe.extractors
Marks everything as content.
KeepEverythingWithMinKWordsExtractor - Class in de.l3s.boilerpipe.extractors
A full-text extractor which keeps everything that has at least k words.
KeepEverythingWithMinKWordsExtractor(int) - Constructor for class de.l3s.boilerpipe.extractors.KeepEverythingWithMinKWordsExtractor
 
KeepLargestBlockFilter - Class in de.l3s.boilerpipe.filters.heuristics
Keeps the largest TextBlock only (by the number of words).
KeepLargestBlockFilter(boolean) - Constructor for class de.l3s.boilerpipe.filters.heuristics.KeepLargestBlockFilter
 
KeepLargestFulltextBlockFilter - Class in de.l3s.boilerpipe.filters.english
Keeps the largest TextBlock only (by the number of words).
KeepLargestFulltextBlockFilter() - Constructor for class de.l3s.boilerpipe.filters.english.KeepLargestFulltextBlockFilter
 

L

LabelAction - Class in de.l3s.boilerpipe.labels
Helps add labels to TextBlocks.
LabelAction(String...) - Constructor for class de.l3s.boilerpipe.labels.LabelAction
 
LabelFusion - Class in de.l3s.boilerpipe.filters.heuristics
Fuses adjacent blocks if their labels are equal.
LabelFusion(String) - Constructor for class de.l3s.boilerpipe.filters.heuristics.LabelFusion
Creates a new LabelFusion instance.
labels - Variable in class de.l3s.boilerpipe.labels.LabelAction
 
LabelToBoilerplateFilter - Class in de.l3s.boilerpipe.filters.simple
Marks all blocks that contain a given label as "boilerplate".
LabelToBoilerplateFilter(String...) - Constructor for class de.l3s.boilerpipe.filters.simple.LabelToBoilerplateFilter
 
LabelToContentFilter - Class in de.l3s.boilerpipe.filters.simple
Marks all blocks that contain a given label as "content".
LabelToContentFilter(String...) - Constructor for class de.l3s.boilerpipe.filters.simple.LabelToContentFilter
 
LARGEST_CONTENT_EXTRACTOR - Static variable in class de.l3s.boilerpipe.extractors.CommonExtractors
Like DefaultExtractor, but keeps the largest text block only.
LargestContentExtractor - Class in de.l3s.boilerpipe.extractors
A full-text extractor which extracts the largest text component of a page.
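
Keeping only the largest block by number of words, as the KeepLargestBlockFilter and LargestContentExtractor entries describe, can be illustrated with the standalone sketch below. The Block type is hypothetical, and two choices are assumptions of the sketch: only blocks already marked as content are candidates, and the first block wins a tie.

```java
import java.util.List;

/** Hypothetical illustration, not the boilerpipe implementation. */
final class KeepLargestSketch {
    /** Stand-in for a text block with a word count and a content flag. */
    static final class Block {
        final int numWords;
        boolean content;

        Block(int numWords, boolean content) {
            this.numWords = numWords;
            this.content = content;
        }
    }

    /** Leaves only the largest content block marked as content; returns true if any flag changed. */
    static boolean keepLargest(List<Block> blocks) {
        Block largest = null;
        for (Block b : blocks) {
            // Strict comparison, so the first block wins a tie (assumption of this sketch).
            if (b.content && (largest == null || b.numWords > largest.numWords)) {
                largest = b;
            }
        }
        boolean changed = false;
        for (Block b : blocks) {
            if (b.content && b != largest) {
                b.content = false;
                changed = true;
            }
        }
        return changed;
    }
}
```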

M

MarkEverythingContentFilter - Class in de.l3s.boilerpipe.filters.simple
Marks all blocks as content.
MARKUP_PREFIX - Static variable in class de.l3s.boilerpipe.labels.DefaultLabels
 
MarkupTagAction - Class in de.l3s.boilerpipe.sax
Assigns labels for element CSS classes and ids to the corresponding TextBlock.
MarkupTagAction(boolean) - Constructor for class de.l3s.boilerpipe.sax.MarkupTagAction
 
MAX_DISTANCE_1 - Static variable in class de.l3s.boilerpipe.filters.heuristics.BlockProximityFusion
 
MAX_DISTANCE_1_CONTENT_ONLY - Static variable in class de.l3s.boilerpipe.filters.heuristics.BlockProximityFusion
 
MAX_DISTANCE_1_CONTENT_ONLY_SAME_TAGLEVEL - Static variable in class de.l3s.boilerpipe.filters.heuristics.BlockProximityFusion
 
MAX_DISTANCE_1_SAME_TAGLEVEL - Static variable in class de.l3s.boilerpipe.filters.heuristics.BlockProximityFusion
 
meetsCondition(TextBlock) - Method in interface de.l3s.boilerpipe.conditions.TextBlockCondition
Returns true iff the given TextBlock tb meets the defined condition.
mergeNext(TextBlock) - Method in class de.l3s.boilerpipe.document.TextBlock
 
MIGHT_BE_CONTENT - Static variable in class de.l3s.boilerpipe.labels.DefaultLabels
 
MinClauseWordsFilter - Class in de.l3s.boilerpipe.filters.simple
Keeps only blocks that have at least one segment fragment ("clause") with at least k words (default: 5).
MinClauseWordsFilter(int) - Constructor for class de.l3s.boilerpipe.filters.simple.MinClauseWordsFilter
 
MinClauseWordsFilter(int, boolean) - Constructor for class de.l3s.boilerpipe.filters.simple.MinClauseWordsFilter
 
MinFulltextWordsFilter - Class in de.l3s.boilerpipe.filters.english
Keeps only those content blocks which contain at least k full-text words (measured by HeuristicFilterBase.getNumFullTextWords(TextBlock)).
MinFulltextWordsFilter(int) - Constructor for class de.l3s.boilerpipe.filters.english.MinFulltextWordsFilter
 
MinWordsFilter - Class in de.l3s.boilerpipe.filters.simple
Keeps only those content blocks which contain at least k words.
MinWordsFilter(int) - Constructor for class de.l3s.boilerpipe.filters.simple.MinWordsFilter
 
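
The clause test described for MinClauseWordsFilter above (at least one clause with at least k words) could be approximated as in the following standalone sketch. The delimiter set used to split clauses and the whitespace-based word count are assumptions made for illustration; the index does not say how the library defines either.

```java
/** Hypothetical illustration, not the boilerpipe implementation. */
final class MinClauseWordsSketch {
    private MinClauseWordsSketch() {}

    /** Returns true if some clause of the text contains at least minWords whitespace-separated words. */
    static boolean hasLongClause(String text, int minWords) {
        // Assumed clause delimiters: comma, period, semicolon, colon, exclamation and question marks.
        for (String clause : text.split("[,.;:!?]+")) {
            String trimmed = clause.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (trimmed.split("\\s+").length >= minWords) {
                return true;
            }
        }
        return false;
    }
}
```

A navigation-style string such as "Home, News, Sports, Contact us" has no clause of five words, whereas an ordinary sentence usually does, which is why a filter of this kind tends to separate menus from running text.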

N

newExtractingInstance() - Static method in class de.l3s.boilerpipe.sax.HTMLHighlighter
Creates a new HTMLHighlighter, which is set-up to return only the extracted HTML text, including enclosed markup.
newHighlightingInstance() - Static method in class de.l3s.boilerpipe.sax.HTMLHighlighter
Creates a new HTMLHighlighter, which is set-up to return the full HTML text, with the extracted text portion highlighted.
NumWordsRulesClassifier - Class in de.l3s.boilerpipe.filters.english
Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features" (WSDM 2010), particularly using number of words per block and link density per block.
NumWordsRulesClassifier() - Constructor for class de.l3s.boilerpipe.filters.english.NumWordsRulesClassifier
 
NumWordsRulesExtractor - Class in de.l3s.boilerpipe.extractors
A quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).
NumWordsRulesExtractor() - Constructor for class de.l3s.boilerpipe.extractors.NumWordsRulesExtractor
 

P

process(TextDocument) - Method in interface de.l3s.boilerpipe.BoilerpipeFilter
Processes the given document doc.
process(TextDocument) - Method in class de.l3s.boilerpipe.extractors.ArticleExtractor
 
process(TextDocument) - Method in class de.l3s.boilerpipe.extractors.ArticleSentencesExtractor
 
process(TextDocument) - Method in class de.l3s.boilerpipe.extractors.CanolaExtractor
 
process(TextDocument) - Method in class de.l3s.boilerpipe.extractors.DefaultExtractor
 
process(TextDocument) - Method in class de.l3s.boilerpipe.extractors.KeepEverythingExtractor
 
process(TextDocument) - Method in class de.l3s.boilerpipe.extractors.KeepEverythingWithMinKWordsExtractor
 
process(TextDocument) - Method in class de.l3s.boilerpipe.extractors.LargestContentExtractor
 
process(TextDocument) - Method in class de.l3s.boilerpipe.extractors.NumWordsRulesExtractor
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.english.DensityRulesClassifier
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFilter
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFromEndFilter
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.english.KeepLargestFulltextBlockFilter
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.english.MinFulltextWordsFilter
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.english.NumWordsRulesClassifier
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.english.TerminatingBlocksFinder
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.heuristics.AddPrecedingLabelsFilter
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.heuristics.ArticleMetadataFilter
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.heuristics.BlockProximityFusion
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.heuristics.ContentFusion
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.heuristics.DocumentTitleMatchClassifier
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.heuristics.ExpandTitleToContentFilter
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.heuristics.KeepLargestBlockFilter
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.heuristics.LabelFusion
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.heuristics.SimpleBlockFusionProcessor
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.simple.BoilerplateBlockFilter
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.simple.InvertedFilter
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.simple.LabelToBoilerplateFilter
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.simple.LabelToContentFilter
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.simple.MarkEverythingContentFilter
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.simple.MinClauseWordsFilter
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.simple.MinWordsFilter
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.simple.SplitParagraphBlocksFilter
 
process(TextDocument) - Method in class de.l3s.boilerpipe.filters.simple.SurroundingToContentFilter
 
process(TextDocument, String) - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
Processes the given TextDocument and the original HTML text (as a String).
process(TextDocument, InputSource) - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
Processes the given TextDocument and the original HTML text (as an InputSource).
process(URL, BoilerpipeExtractor) - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
 
processingInstruction(String, String) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
 

R

recycle() - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Recycles this instance.
removeLabel(String) - Method in class de.l3s.boilerpipe.document.TextBlock
 

S

setContentHandler(BoilerpipeHTMLContentHandler) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLParser
 
setContentHandler(ContentHandler) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLParser
 
setDocumentLocator(Locator) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
 
setExtraStyleSheet(String) - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
Sets the extra stylesheet definition that will be inserted in the HEAD element.
setIsContent(boolean) - Method in class de.l3s.boilerpipe.document.TextBlock
 
setOutputHighlightOnly(boolean) - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
Sets whether only HTML enclosed within highlighted content will be returned, or the whole HTML document.
setPostHighlight(String) - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
Sets the string that will be inserted after any highlighted HTML block.
setPreHighlight(String) - Method in class de.l3s.boilerpipe.sax.HTMLHighlighter
Sets the string that will be inserted prior to any highlighted HTML block.
setTagAction(String, TagAction) - Method in class de.l3s.boilerpipe.sax.TagActionMap
Sets a particular TagAction for a given tag.
setTagLevel(int) - Method in class de.l3s.boilerpipe.document.TextBlock
 
setTitle(String) - Method in class de.l3s.boilerpipe.document.TextDocument
Updates the "main" title for this document.
setTitle(String) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
 
SimpleBlockFusionProcessor - Class in de.l3s.boilerpipe.filters.heuristics
Merges two subsequent blocks if their text densities are equal.
SimpleBlockFusionProcessor() - Constructor for class de.l3s.boilerpipe.filters.heuristics.SimpleBlockFusionProcessor
 
SimpleEstimator - Class in de.l3s.boilerpipe.estimators
Estimates the "goodness" of a BoilerpipeExtractor on a given document.
skippedEntity(String) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
 
SplitParagraphBlocksFilter - Class in de.l3s.boilerpipe.filters.simple
Splits TextBlocks at paragraph boundaries.
SplitParagraphBlocksFilter() - Constructor for class de.l3s.boilerpipe.filters.simple.SplitParagraphBlocksFilter
 
start(BoilerpipeHTMLContentHandler, String, String, Attributes) - Method in class de.l3s.boilerpipe.sax.CommonTagActions.BlockTagLabelAction
 
start(BoilerpipeHTMLContentHandler, String, String, Attributes) - Method in class de.l3s.boilerpipe.sax.CommonTagActions.Chained
 
start(BoilerpipeHTMLContentHandler, String, String, Attributes) - Method in class de.l3s.boilerpipe.sax.CommonTagActions.InlineTagLabelAction
 
start(BoilerpipeHTMLContentHandler, String, String, Attributes) - Method in class de.l3s.boilerpipe.sax.MarkupTagAction
 
start(BoilerpipeHTMLContentHandler, String, String, Attributes) - Method in interface de.l3s.boilerpipe.sax.TagAction
 
startDocument() - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
 
startElement(String, String, String, Attributes) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
 
startPrefixMapping(String, String) - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
 
STRICTLY_NOT_CONTENT - Static variable in class de.l3s.boilerpipe.labels.DefaultLabels
 
SurroundingToContentFilter - Class in de.l3s.boilerpipe.filters.simple
 
SurroundingToContentFilter(TextBlockCondition) - Constructor for class de.l3s.boilerpipe.filters.simple.SurroundingToContentFilter
 

T

TA_ANCHOR_TEXT - Static variable in class de.l3s.boilerpipe.sax.CommonTagActions
Marks this tag as "anchor" (this should usually only be set for the <A> tag).
TA_BLOCK_LEVEL - Static variable in class de.l3s.boilerpipe.sax.CommonTagActions
Explicitly marks this tag as a simple "block-level" element, which always generates whitespace.
TA_BODY - Static variable in class de.l3s.boilerpipe.sax.CommonTagActions
Marks this tag as the body element (this should usually only be set for the <BODY> tag).
TA_FONT - Static variable in class de.l3s.boilerpipe.sax.CommonTagActions
Special TagAction for the <FONT> tag, which keeps track of the absolute and relative font size.
TA_IGNORABLE_ELEMENT - Static variable in class de.l3s.boilerpipe.sax.CommonTagActions
Marks this tag as "ignorable", i.e.
TA_INLINE - Static variable in class de.l3s.boilerpipe.sax.CommonTagActions
Deprecated.
TA_INLINE_NO_WHITESPACE - Static variable in class de.l3s.boilerpipe.sax.CommonTagActions
Marks this tag as a simple "inline" element, which generates neither whitespace nor a new block.
TA_INLINE_WHITESPACE - Static variable in class de.l3s.boilerpipe.sax.CommonTagActions
Marks this tag as a simple "inline" element, which generates whitespace, but no new block.
TagAction - Interface in de.l3s.boilerpipe.sax
Defines an action that is to be performed whenever a particular tag occurs during HTML parsing.
TagActionMap - Class in de.l3s.boilerpipe.sax
Base class for defining a set of TagActions that are to be used for the HTML parsing process.
TagActionMap() - Constructor for class de.l3s.boilerpipe.sax.TagActionMap
 
TerminatingBlocksFinder - Class in de.l3s.boilerpipe.filters.english
Finds blocks that potentially indicate the end of an article's text and marks them with DefaultLabels.INDICATES_END_OF_TEXT.
TerminatingBlocksFinder() - Constructor for class de.l3s.boilerpipe.filters.english.TerminatingBlocksFinder
 
TextBlock - Class in de.l3s.boilerpipe.document
Describes a block of text.
TextBlock(String) - Constructor for class de.l3s.boilerpipe.document.TextBlock
 
TextBlock(String, BitSet, int, int, int, int, int) - Constructor for class de.l3s.boilerpipe.document.TextBlock
 
TextBlockCondition - Interface in de.l3s.boilerpipe.conditions
Evaluates whether a given TextBlock meets a certain condition.
TextDocument - Class in de.l3s.boilerpipe.document
A text document, consisting of one or more TextBlocks.
TextDocument(List<TextBlock>) - Constructor for class de.l3s.boilerpipe.document.TextDocument
Creates a new TextDocument with given TextBlocks, and no title.
TextDocument(String, List<TextBlock>) - Constructor for class de.l3s.boilerpipe.document.TextDocument
Creates a new TextDocument with given TextBlocks and given title.
TextDocumentStatistics - Class in de.l3s.boilerpipe.document
Provides shallow statistics on a given TextDocument.
TextDocumentStatistics(TextDocument, boolean) - Constructor for class de.l3s.boilerpipe.document.TextDocumentStatistics
Computes statistics on a given TextDocument.
TITLE - Static variable in class de.l3s.boilerpipe.labels.DefaultLabels
 
toInputSource() - Method in class de.l3s.boilerpipe.sax.HTMLDocument
 
toInputSource() - Method in interface de.l3s.boilerpipe.sax.InputSourceable
 
tokenize(CharSequence) - Static method in class de.l3s.boilerpipe.util.UnicodeTokenizer
Tokenizes the text and returns an array of tokens.
toString() - Method in class de.l3s.boilerpipe.document.TextBlock
 
toString() - Method in class de.l3s.boilerpipe.labels.LabelAction
 
toTextDocument() - Method in interface de.l3s.boilerpipe.BoilerpipeDocumentSource
 
toTextDocument() - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler
Returns a TextDocument containing the extracted TextBlocks.
toTextDocument() - Method in class de.l3s.boilerpipe.sax.BoilerpipeHTMLParser
Returns a TextDocument containing the extracted TextBlocks.

U

UnicodeTokenizer - Class in de.l3s.boilerpipe.util
Tokenizes text according to Unicode word boundaries and strips off non-word characters.
UnicodeTokenizer() - Constructor for class de.l3s.boilerpipe.util.UnicodeTokenizer
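
The UnicodeTokenizer entry above describes splitting text at Unicode word boundaries and dropping non-word characters. As a rough illustration of that idea only, the sketch below uses the JDK's java.text.BreakIterator instead of boilerpipe itself. The class WordBoundarySketch and its words method are hypothetical names invented for this example, and the "contains a letter or digit" rule is an assumption about what counts as a word. It is not the library's implementation and may tokenize differently from UnicodeTokenizer.tokenize(CharSequence).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration; not part of boilerpipe. */
public class WordBoundarySketch {

    /** Splits text at word boundaries and keeps only segments with a letter or digit. */
    public static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            // Assumption: a "word" is any segment containing at least one
            // letter or digit; whitespace and punctuation segments are dropped.
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world! 42 times."));
    }
}
```

With boilerpipe on the classpath, the indexed static method tokenize(CharSequence) would be the call to use instead; the index states that it returns an array of tokens.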
 