Package | Description |
---|---|
de.l3s.boilerpipe | The Boilerpipe top-level package. |
de.l3s.boilerpipe.extractors | This package contains some standard extractors (i.e., completely piped BoilerpipeFilters). |
de.l3s.boilerpipe.filters.english | The BoilerpipeFilters in this package have only been tested on English text. |
de.l3s.boilerpipe.filters.heuristics | The BoilerpipeFilters in this package are pure heuristics. |
de.l3s.boilerpipe.filters.simple | The BoilerpipeFilters in this package are straightforward and probably not really specific to English. |
de.l3s.boilerpipe.sax | Classes related to parsing and producing HTML from/to Boilerpipe TextDocuments. |

Modifier and Type | Method and Description |
---|---|
java.lang.String | BoilerpipeExtractor.getText(org.xml.sax.InputSource is): Extracts text from the HTML code available from the given InputSource. |
java.lang.String | BoilerpipeExtractor.getText(java.io.Reader r): Extracts text from the HTML code available from the given Reader. |
java.lang.String | BoilerpipeExtractor.getText(java.lang.String html): Extracts text from the HTML code given as a String. |
java.lang.String | BoilerpipeExtractor.getText(TextDocument doc): Extracts text from the given TextDocument object. |
TextDocument | BoilerpipeInput.getTextDocument(): Returns (somehow) a TextDocument. |
boolean | BoilerpipeFilter.process(TextDocument doc): Processes the given document doc. |
TextDocument | BoilerpipeDocumentSource.toTextDocument() |
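
The getText overloads listed above differ only in how the HTML reaches the extractor: as a String, a Reader, or an InputSource. As a rough illustration, the sketch below uses JDK classes only (it makes no Boilerpipe calls) to show how a String can be wrapped as a Reader and then as an InputSource. The class name `InputForms` and its helper methods are hypothetical and not part of Boilerpipe.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import org.xml.sax.InputSource;

// Hypothetical helper, NOT part of Boilerpipe: it only shows the JDK input
// types named in the getText overloads and how they wrap one another.
public class InputForms {

    // Wrap an HTML string in a Reader (the input type of the Reader overload).
    public static Reader asReader(String html) {
        return new StringReader(html);
    }

    // Wrap a Reader in a SAX InputSource (the input type of the InputSource overload).
    public static InputSource asInputSource(Reader reader) {
        return new InputSource(reader);
    }

    // Drain the character stream of an InputSource back into a String,
    // purely to demonstrate that the wrapping loses no content.
    public static String readAll(InputSource source) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = source.getCharacterStream()) {
            char[] buf = new char[1024];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><p>Hello</p></body></html>";
        InputSource source = asInputSource(asReader(html));
        System.out.println(readAll(source).equals(html)); // prints: true
    }
}
```

Whichever form the caller already holds can be passed to the matching overload directly; the wrapping shown here is only needed when the available form and the desired overload differ.
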

Modifier and Type | Method and Description |
---|---|
java.lang.String | ExtractorBase.getText(org.xml.sax.InputSource is): Extracts text from the HTML code available from the given InputSource. |
java.lang.String | ExtractorBase.getText(java.io.Reader r): Extracts text from the HTML code available from the given Reader. |
java.lang.String | ExtractorBase.getText(java.lang.String html): Extracts text from the HTML code given as a String. |
java.lang.String | ExtractorBase.getText(TextDocument doc): Extracts text from the given TextDocument object. |
java.lang.String | ExtractorBase.getText(java.net.URL url): Extracts text from the HTML code available from the given URL. |
boolean | KeepEverythingWithMinKWordsExtractor.process(TextDocument doc) |
boolean | LargestContentExtractor.process(TextDocument doc) |
boolean | ArticleSentencesExtractor.process(TextDocument doc) |
boolean | DefaultExtractor.process(TextDocument doc) |
boolean | KeepEverythingExtractor.process(TextDocument doc) |
boolean | NumWordsRulesExtractor.process(TextDocument doc) |
boolean | CanolaExtractor.process(TextDocument doc) |
boolean | ArticleExtractor.process(TextDocument doc) |
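
The package summary describes extractors as "completely piped BoilerpipeFilters", and each extractor above exposes both process(TextDocument) and getText(TextDocument). The following is a minimal sketch of that piping pattern using simplified stand-in types, not the real Boilerpipe classes. `Doc`, `Filter`, `minWords`, `keepLargest`, and `pipe` are hypothetical names. Two behaviours are assumptions made for illustration: that the boolean returned by process means "the document was changed", and that getText runs process and then reads the remaining content.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-ins, NOT the real Boilerpipe classes: the shapes mirror the
// listing above, but every body here is an assumption made for illustration.
public class PipedExtractorSketch {

    // Stand-in for TextDocument: a mutable list of text blocks.
    public static class Doc {
        public final List<String> blocks;

        public Doc(String... blocks) {
            this.blocks = new ArrayList<>(Arrays.asList(blocks));
        }

        public String getContent() {
            return String.join("\n", blocks);
        }
    }

    // Stand-in for BoilerpipeFilter; assumed to return true if the document changed.
    public interface Filter {
        boolean process(Doc doc);
    }

    // Hypothetical filter: drops blocks with fewer than minWords words.
    public static Filter minWords(int minWords) {
        return doc -> doc.blocks.removeIf(b -> b.trim().split("\\s+").length < minWords);
    }

    // Hypothetical filter: keeps only the longest block.
    public static Filter keepLargest() {
        return doc -> {
            if (doc.blocks.size() <= 1) {
                return false; // nothing to drop, so nothing changed
            }
            String largest = doc.blocks.get(0);
            for (String b : doc.blocks) {
                if (b.length() > largest.length()) {
                    largest = b;
                }
            }
            doc.blocks.clear();
            doc.blocks.add(largest);
            return true;
        };
    }

    // An "extractor" in this sketch is a filter that runs other filters in order.
    public static Filter pipe(Filter... filters) {
        return doc -> {
            boolean changed = false;
            for (Filter f : filters) {
                changed |= f.process(doc); // every filter runs; results are OR-ed
            }
            return changed;
        };
    }

    // Assumed shape of getText(TextDocument): process first, then read the content.
    public static String getText(Filter extractor, Doc doc) {
        extractor.process(doc);
        return doc.getContent();
    }

    public static void main(String[] args) {
        Doc doc = new Doc(
                "Home | About",
                "A long paragraph of article text here.",
                "Short note here ok");
        // prints: A long paragraph of article text here.
        System.out.println(getText(pipe(minWords(4), keepLargest()), doc));
    }
}
```

Because the piped extractor is itself a `Filter` in this sketch, it can be nested inside another pipe, which is one way to read the fact that every extractor in the table also has a process method.
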

Modifier and Type | Method and Description |
---|---|
boolean | DensityRulesClassifier.process(TextDocument doc) |
boolean | IgnoreBlocksAfterContentFilter.process(TextDocument doc) |
boolean | TerminatingBlocksFinder.process(TextDocument doc) |
boolean | IgnoreBlocksAfterContentFromEndFilter.process(TextDocument doc) |
boolean | MinFulltextWordsFilter.process(TextDocument doc) |
boolean | KeepLargestFulltextBlockFilter.process(TextDocument doc) |
boolean | NumWordsRulesClassifier.process(TextDocument doc) |

Modifier and Type | Method and Description |
---|---|
boolean | BlockProximityFusion.process(TextDocument doc) |
boolean | ExpandTitleToContentFilter.process(TextDocument doc) |
boolean | ArticleMetadataFilter.process(TextDocument doc) |
boolean | SimpleBlockFusionProcessor.process(TextDocument doc) |
boolean | LabelFusion.process(TextDocument doc) |
boolean | ContentFusion.process(TextDocument doc) |
boolean | DocumentTitleMatchClassifier.process(TextDocument doc) |
boolean | KeepLargestBlockFilter.process(TextDocument doc) |
boolean | AddPrecedingLabelsFilter.process(TextDocument doc) |

Modifier and Type | Method and Description |
---|---|
boolean | LabelToContentFilter.process(TextDocument doc) |
boolean | LabelToBoilerplateFilter.process(TextDocument doc) |
boolean | MinWordsFilter.process(TextDocument doc) |
boolean | InvertedFilter.process(TextDocument doc) |
boolean | SurroundingToContentFilter.process(TextDocument doc) |
boolean | BoilerplateBlockFilter.process(TextDocument doc) |
boolean | MarkEverythingContentFilter.process(TextDocument doc) |
boolean | MinClauseWordsFilter.process(TextDocument doc) |
boolean | SplitParagraphBlocksFilter.process(TextDocument doc) |

Modifier and Type | Method and Description |
---|---|
TextDocument | BoilerpipeSAXInput.getTextDocument(): Retrieves the TextDocument using a default HTML parser. |
TextDocument | BoilerpipeSAXInput.getTextDocument(BoilerpipeHTMLParser parser): Retrieves the TextDocument using the given HTML parser. |
java.lang.String | HTMLHighlighter.process(TextDocument doc, org.xml.sax.InputSource is): Processes the given TextDocument and the original HTML text (as an InputSource). |
java.lang.String | HTMLHighlighter.process(TextDocument doc, java.lang.String origHTML): Processes the given TextDocument and the original HTML text (as a String). |
java.lang.String | HTMLHighlighter.process(java.net.URL url, BoilerpipeExtractor extractor) |