Feat: add language detection by axherrm · Pull Request #33 · levigo/filetype-analyzer

axherrm · 2023-03-10T13:03:57Z

Using lingua, the following two values are set in the result map:

MOST_LIKELY_TEXT_LANGUAGE: the detected language, or UNKNOWN if the result is not confident enough
TEXT_LANGUAGE_CONFIDENCE_VALUES: confidence values for the detected languages in descending order, starting with MOST_LIKELY_TEXT_LANGUAGE at a value of 1.0
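To illustrate the shape of these two entries, here is a minimal JDK-only sketch. It is not the code from this PR: the class name `LanguageResultSketch`, the string-keyed input map, and the `MIN_DISTANCE` cut-off are assumptions made for illustration, and the rule for falling back to UNKNOWN (too small a gap between the two best candidates) is only one plausible reading of "not confident enough".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LanguageResultSketch {

    // Hypothetical minimum gap between the two best candidates; not taken from the PR.
    static final double MIN_DISTANCE = 0.1;

    // Builds the two result entries from a language -> confidence map.
    static Map<String, Object> toResultEntries(Map<String, Double> confidences) {
        // Sort candidates by confidence, highest first.
        List<Map.Entry<String, Double>> sorted = new ArrayList<>(confidences.entrySet());
        sorted.sort(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()));

        // A LinkedHashMap keeps the descending order when the map is iterated.
        Map<String, Double> ordered = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : sorted) {
            ordered.put(e.getKey(), e.getValue());
        }

        // Assumed rule: report UNKNOWN when there is no candidate at all or
        // when the best candidate does not clearly beat the runner-up.
        String mostLikely = "UNKNOWN";
        if (sorted.size() == 1) {
            mostLikely = sorted.get(0).getKey();
        } else if (sorted.size() > 1
                && sorted.get(0).getValue() - sorted.get(1).getValue() >= MIN_DISTANCE) {
            mostLikely = sorted.get(0).getKey();
        }

        Map<String, Object> result = new LinkedHashMap<>();
        result.put("MOST_LIKELY_TEXT_LANGUAGE", mostLikely);
        result.put("TEXT_LANGUAGE_CONFIDENCE_VALUES", ordered);
        return result;
    }
}
```

Keeping the confidences in an insertion-ordered map means a consumer can read the ranking directly without sorting again.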

The feature must be enabled via the environment variable org.jadice.filetype.matchers.PDFMatcher.languageCheck.
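How the flag is read is not shown in this description. The dotted name looks like a Java system property, while the text calls it an environment variable, so the following sketch checks both. The class `LanguageCheckFlag` and its methods are hypothetical and only illustrate the opt-in behaviour.

```java
final class LanguageCheckFlag {

    // Key as named in the PR description.
    static final String KEY = "org.jadice.filetype.matchers.PDFMatcher.languageCheck";

    // Pure helper: the system property wins, the environment variable is the fallback.
    // Anything other than "true" (case-insensitive), including null, disables the check.
    static boolean isEnabled(String systemPropertyValue, String environmentValue) {
        String value = systemPropertyValue != null ? systemPropertyValue : environmentValue;
        return Boolean.parseBoolean(value);
    }

    // Convenience overload that reads the real process settings.
    static boolean isEnabled() {
        return isEnabled(System.getProperty(KEY), System.getenv(KEY));
    }
}
```

Under this assumption the check stays off by default and could be switched on with, for example, `-Dorg.jadice.filetype.matchers.PDFMatcher.languageCheck=true` on the JVM command line.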
With the current configuration, all available languages (75) are considered. For performance reasons, however, it would make good sense to restrict that set where possible:

// include all languages available in the library
// WARNING: in the worst case this produces high memory 
//          consumption of approximately 3.5GB 
//          and slow runtime performance
//          (in high accuracy mode)
LanguageDetectorBuilder.fromAllLanguages()

// include only languages that are not yet extinct (= currently excludes Latin)
LanguageDetectorBuilder.fromAllSpokenLanguages()

// include only languages written with Cyrillic script
LanguageDetectorBuilder.fromAllLanguagesWithCyrillicScript()

// exclude only the Spanish language from the decision algorithm
LanguageDetectorBuilder.fromAllLanguagesWithout(Language.SPANISH)

// only decide between English and German
LanguageDetectorBuilder.fromLanguages(Language.ENGLISH, Language.GERMAN)

// select languages by ISO 639-1 code
LanguageDetectorBuilder.fromIsoCodes639_1(IsoCode639_1.EN, IsoCode639_1.DE)

// select languages by ISO 639-3 code
LanguageDetectorBuilder.fromIsoCodes639_3(IsoCode639_3.ENG, IsoCode639_3.DEU)

In addition, I have set it up for now so that LanguageDetectorBuilder.withLowAccuracyMode() is used for texts of 120 characters or more, because that mode works with smaller data sets. According to the documentation, using it on texts shorter than 120 characters would lead to too much inaccuracy.
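The length rule described above can be sketched as a small predicate. This is an illustration only: the class `AccuracyModeSketch` and the constant name are made up here, and the actual wiring into the lingua builder in this PR is not shown.

```java
final class AccuracyModeSketch {

    // Threshold from the PR description: low accuracy mode from 120 characters on.
    static final int LOW_ACCURACY_MIN_LENGTH = 120;

    // Returns true when the text is long enough for low accuracy mode.
    // A caller would then invoke withLowAccuracyMode() on the
    // LanguageDetectorBuilder before build(); shorter texts keep the
    // default high accuracy mode.
    static boolean useLowAccuracyMode(String text) {
        return text != null && text.length() >= LOW_ACCURACY_MIN_LENGTH;
    }
}
```

Because the mode is fixed when a detector is built, an implementation following this rule would probably keep two detector instances (one per mode) and pick one per document rather than rebuild a detector for every input.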

axherrm · 2023-03-10T13:29:50Z

Also, lingua really does seem to work completely offline: "It does not need a connection to any external API or service either. Once the library has been downloaded, it can be used completely offline."

axherrm force-pushed the feat/detect-language branch from e0ad60d to 15e0653 March 10, 2023 14:03

Base automatically changed from feat/pdf-contains-text to master March 10, 2023 14:03

axherrm force-pushed the feat/detect-language branch from 15e0653 to 868c7e1 March 10, 2023 14:16

welschsn force-pushed the feat/detect-language branch from 4e4c577 to 1f433c3 February 11, 2026 11:42

feat(JF-466): add language recognition

c8d10b8

welschsn force-pushed the feat/detect-language branch from 1f433c3 to c8d10b8 February 11, 2026 11:45

axherrm wants to merge 1 commit into master from feat/detect-language
