Also, lingua really does seem to work completely offline: "It does not need a connection to any external API or service either. Once the library has been downloaded, it can be used completely offline."
Lingua sets the following two values in the result map:

- `MOST_LIKELY_TEXT_LANGUAGE`: the detected language, or `UNKNOWN` if the result is not confident enough
- `TEXT_LANGUAGE_CONFIDENCE_VALUES`: confidence values for the detected languages in descending order, starting with the `MOST_LIKELY_TEXT_LANGUAGE` at a value of 1.0

The feature must be enabled via the environment variable `org.jadice.filetype.matchers.PDFMatcher.languageCheck`.

With the current configuration, all available languages (75) are considered. For performance reasons it would make sense to restrict that set, if possible.
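To make the shape of the two result entries concrete, here is a minimal sketch that uses only the JDK. It is not the code from this PR: the class `LanguageResultSketch`, the methods `isEnabled` and `toResult`, the rule that the switch is on when its value is `true`, and the rule that an empty score set yields `UNKNOWN` are all assumptions for illustration. In the real matcher the scores would come from lingua rather than from a hand-built map.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

public class LanguageResultSketch {
    // Key and switch names as given in the description above.
    public static final String MOST_LIKELY = "MOST_LIKELY_TEXT_LANGUAGE";
    public static final String CONFIDENCES = "TEXT_LANGUAGE_CONFIDENCE_VALUES";
    public static final String FLAG = "org.jadice.filetype.matchers.PDFMatcher.languageCheck";

    // Assumption: the switch is on only when its value is "true" (case-insensitive).
    public static boolean isEnabled(Map<String, String> settings) {
        return Boolean.parseBoolean(settings.get(FLAG));
    }

    // Sorts the scores in descending order and scales them so the top entry is 1.0.
    // Assumption: without any positive score the language is reported as UNKNOWN.
    public static Map<String, Object> toResult(Map<String, Double> scores) {
        Map<String, Double> confidences = new LinkedHashMap<>();
        double top = scores.values().stream()
                .mapToDouble(Double::doubleValue)
                .max()
                .orElse(0.0);
        if (top > 0.0) {
            scores.entrySet().stream()
                    .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                    .forEach(e -> confidences.put(e.getKey(), e.getValue() / top));
        }
        String mostLikely = confidences.isEmpty()
                ? "UNKNOWN"
                : confidences.keySet().iterator().next();

        Map<String, Object> result = new LinkedHashMap<>();
        result.put(MOST_LIKELY, mostLikely);
        result.put(CONFIDENCES, confidences);
        return result;
    }

    public static void main(String[] args) {
        // Hand-made scores, purely for demonstration.
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("ENGLISH", 0.25);
        scores.put("GERMAN", 0.5);
        System.out.println("enabled: " + isEnabled(System.getenv()));
        System.out.println(toResult(scores));
    }
}
```

Passing the settings in as a map keeps the check easy to test. Whether the switch is actually read from the process environment or from a JVM system property is not something this sketch can settle.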
In addition, I have set it up for now so that `LanguageDetectorBuilder.withLowAccuracyMode()` is used for texts of 120 characters or more, since that mode works with smaller data sets. According to the documentation, using it on texts shorter than 120 characters leads to too much inaccuracy.
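The length rule can be sketched as a small JDK-only helper. The class `AccuracyModeSketch`, the constant name, and the choice to measure length with `String.length()` are assumptions for illustration, not the PR's actual code. In the matcher, a `true` result would presumably decide whether `withLowAccuracyMode()` is called on the builder before `build()`.

```java
public class AccuracyModeSketch {
    // Threshold taken from the description above.
    public static final int LOW_ACCURACY_MIN_LENGTH = 120;

    // Returns true when the text is long enough for the low accuracy mode.
    // Assumption: length is measured with String.length(), i.e. in UTF-16 code units.
    public static boolean useLowAccuracyMode(String text) {
        return text != null && text.length() >= LOW_ACCURACY_MIN_LENGTH;
    }

    public static void main(String[] args) {
        String shortText = "Ein kurzer Satz.";
        StringBuilder longText = new StringBuilder();
        for (int i = 0; i < 20; i++) {
            longText.append("Ein etwas längerer Satz. ");
        }
        System.out.println(useLowAccuracyMode(shortText));
        System.out.println(useLowAccuracyMode(longText.toString()));
    }
}
```

Keeping the threshold in a single constant makes it easy to adjust if the accuracy on medium-length texts turns out to be insufficient.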