Class: Kiba::Extend::Utils::StringNormalizer

Inherits:
Object
Defined in:
lib/kiba/extend/utils/string_normalizer.rb

Overview

Normalizes the given string according to the given parameters.

The class can be used in two ways. The preferred way, when the same normalization settings will be applied to many strings (for example, in a transform), is to initialize one instance and reuse it:

  # First initialize an instance of the class as an instance variable in
  #   your context
  @normalizer = Kiba::Extend::Utils::StringNormalizer.new(
    xforms: [:blank]
  )

  # for the repetitive part:
  vals.each{ |val| @normalizer.call(val) }

For one-off usage, testing normalization logic, or where the normalization settings vary per normalized value, you can do:

Kiba::Extend::Utils::StringNormalizer.call(
  xforms: [:blank], str: 'Card table'
)
  # => 'Cardtable'

The second way is much slower when repeated, because it initializes a new instance of the class on every call.
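The cost difference between the two call styles can be sketched with a stand-in class. `SketchNormalizer` below is hypothetical and is not the kiba-extend class; it only mirrors the shape of `.new(...).call(str)` versus `.call(str:, ...)` and counts how many instances get built in each style.

```ruby
# Hypothetical stand-in, NOT the kiba-extend class: it mimics the two call
# styles and counts how many instances are created.
class SketchNormalizer
  @instances = 0

  class << self
    attr_accessor :instances

    # One-off style: builds a fresh instance on every call
    def call(str:, xforms: [])
      new(xforms: xforms).call(str)
    end
  end

  def initialize(xforms: [])
    self.class.instances += 1
    @xforms = xforms
  end

  def call(val)
    # Only :blank is implemented in this sketch
    @xforms.include?(:blank) ? val.gsub(/\p{Blank}/, "") : val
  end
end

vals = ["Card table", "Side table", "End table"]

# Reusable style: one instance serves every value
normalizer = SketchNormalizer.new(xforms: [:blank])
reused = vals.map { |val| normalizer.call(val) }
reused_count = SketchNormalizer.instances

# One-off style: one new instance per value
SketchNormalizer.instances = 0
one_off = vals.map { |val| SketchNormalizer.call(str: val, xforms: [:blank]) }
one_off_count = SketchNormalizer.instances
```

Both styles return the same strings; the reusable style builds one instance for the whole list, while the one-off style builds one per value.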

Examples:

Downcase only

util = Kiba::Extend::Utils::StringNormalizer.new(xforms: [:lower])
input = [
  'Oświęcim (Poland)',
  'Oswiecim, Poland',
  'Iași, Romania',
  'Iasi, Romania',
  'Table, café',
  '1,001 Arabian Nights',
  "foo\n\nbar"
]
expected = [
  'oświęcim (poland)',
  'oswiecim, poland',
  'iași, romania',
  'iasi, romania',
  'table, café',
  '1,001 arabian nights',
  "foo\n\nbar"
]
results = input.map{ |str| util.call(str) }
expect(results).to eq(expected)

to_ascii, nonword

util = Kiba::Extend::Utils::StringNormalizer.new(
  xforms: [:to_ascii, :nonword]
)
input = [
  'Oświęcim (Poland)',
  'Oswiecim, Poland',
  'Iași, Romania',
  'Iasi, Romania',
  'Table, café',
  '1,001 Arabian Nights',
  "foo\n\nbar"
]
expected = [
 'OswiecimPoland',
 'OswiecimPoland',
 'IaiRomania',
 'IasiRomania',
 'Tablecafe',
 '1001ArabianNights',
 'foobar'
]
results = input.map{ |str| util.call(str) }
expect(results).to eq(expected)

:cspaceid mode

util = Kiba::Extend::Utils::StringNormalizer.new(mode: :cspaceid)
input = [
  'Oświęcim (Poland)',
  'Oswiecim, Poland',
  'Iași, Romania',
  'Iasi, Romania'
]
expected = [
 'oswiecimpoland',
 'oswiecimpoland',
 'iasiromania',
 'iasiromania'
]
results = input.map{ |str| util.call(str) }
expect(results).to eq(expected)

Punctuation, custom proc

util = Kiba::Extend::Utils::StringNormalizer.new(
  xforms: [:punct, ->(val) { val.upcase }]
)
input = ["Release the bats!!"]
expected = ["RELEASE THE BATS"]
results = input.map{ |str| util.call(str) }
expect(results).to eq(expected)

Since:

  • 3.3.0

Constructor Details

#initialize(mode: nil, replacements: {}, xforms: []) ⇒ StringNormalizer

Defined xforms

  • :nfkc - ON BY DEFAULT: Applies Unicode compatibility decomposition, followed by canonical composition; See https://unicode.org/reports/tr15/ for more details than you want.
  • :replace - ON BY DEFAULT: performs find-and-replace operations specified in replacements parameter
  • :blank - deletes all spaces and tabs, using Ruby /\p{Blank}/ regexp
  • :lower - downcase the string
  • :nonword - removes ALL characters that are not letters, numbers, or underscores
  • :punct - removes all characters matching Ruby /\p{Punct}/ regexp
  • :to_ascii - replaces non-ASCII characters with an ASCII approximation, or if none exists, a replacement character which defaults to “?”.
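Several of the xforms listed above can be approximated in plain Ruby. This is a rough sketch under stated assumptions, not the library's implementation: the `sketch_xforms` name is hypothetical, the regexp for :nonword is assumed to be `/\W/`, and :replace and :to_ascii are left out (the latter depends on a transliteration table this sketch does not reproduce).

```ruby
# Plain-Ruby approximations of some listed xforms. The variable name and the
# exact regexps are assumptions; :replace and :to_ascii are omitted.
sketch_xforms = {
  nfkc: ->(str) { str.unicode_normalize(:nfkc) },
  blank: ->(str) { str.gsub(/\p{Blank}/, "") },
  lower: ->(str) { str.downcase },
  nonword: ->(str) { str.gsub(/\W/, "") },
  punct: ->(str) { str.gsub(/\p{Punct}/, "") }
}

# NFKC folds compatibility characters, e.g. the "fi" ligature (U+FB01)
nfkc_result = sketch_xforms[:nfkc].call("\uFB01le")

blank_result = sketch_xforms[:blank].call("Card table")
lower_result = sketch_xforms[:lower].call("Iasi, Romania")
nonword_result = sketch_xforms[:nonword].call("1,001 Arabian Nights")
punct_result = sketch_xforms[:punct].call("Release the bats!!")
```

Note that :punct leaves spaces in place, whereas :nonword strips spaces along with punctuation.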

Defined modes

  • :cspaceid - replaces characters that do not convert to ASCII properly, then applies :to_ascii, :nonword, and :lower

Parameters:

  • mode (:cspaceid) (defaults to: nil)

    Use an established set of xforms and replacement settings

  • replacements (Hash{Regexp => String}) (defaults to: {})

    simple gsub find/replaces to be applied, in order, to the string being normalized; key is the find/match value; value is the replacement string

  • xforms (Array<Symbol, Proc>) (defaults to: [])

    Each Symbol must match one of the defined xforms. A Proc that takes one String argument and returns a String may also be passed to apply custom normalization logic
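Because replacements are applied in order, the order of the Hash entries affects the result. The sketch below illustrates this in plain Ruby; `apply_replacements` is a hypothetical helper written for this example, not a method of the class, and it assumes each pair is applied with `gsub` as the parameter description suggests.

```ruby
# Hypothetical helper: applies each pattern => replacement pair with gsub,
# in Hash insertion order, feeding each result into the next pair.
def apply_replacements(str, replacements)
  replacements.inject(str) do |res, (pattern, replacement)|
    res.gsub(pattern, replacement)
  end
end

input = "Salt  &  pepper"

# Expand the ampersand first, then collapse runs of whitespace
expand_then_collapse = apply_replacements(
  input, { /&/ => " and ", /\s+/ => " " }
)

# Collapse whitespace first: the later expansion reintroduces double spaces
collapse_then_expand = apply_replacements(
  input, { /\s+/ => " ", /&/ => " and " }
)
```

The first ordering yields single-spaced output; the second leaves doubled spaces around "and", since the whitespace cleanup ran before the ampersand was expanded.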

Since:

  • 3.3.0



# File 'lib/kiba/extend/utils/string_normalizer.rb', line 150

def initialize(mode: nil, replacements: {}, xforms: [])
  @mode = mode
  @replacements = replacements
  # :nfkc and :replace always run first, ahead of any caller-supplied xforms
  @xforms = %i[nfkc replace] + xforms
  apply_mode_settings
end

Class Method Details

.call(str:, mode: nil, replacements: {}, xforms: []) ⇒ Object

Defined xforms

  • :nfkc - ON BY DEFAULT: Applies Unicode compatibility decomposition, followed by canonical composition; See https://unicode.org/reports/tr15/ for more details than you want.
  • :replace - ON BY DEFAULT: performs find-and-replace operations specified in replacements parameter
  • :blank - deletes all spaces and tabs, using Ruby /\p{Blank}/ regexp
  • :lower - downcase the string
  • :nonword - removes ALL characters that are not letters, numbers, or underscores
  • :punct - removes all characters matching Ruby /\p{Punct}/ regexp
  • :to_ascii - replaces non-ASCII characters with an ASCII approximation, or if none exists, a replacement character which defaults to “?”.

Defined modes

  • :cspaceid - replaces characters that do not convert to ASCII properly, then applies :to_ascii, :nonword, and :lower

Parameters:

  • mode (:cspaceid) (defaults to: nil)

    Use an established set of xforms and replacement settings

  • replacements (Hash{Regexp => String}) (defaults to: {})

    simple gsub find/replaces to be applied, in order, to the string being normalized; key is the find/match value; value is the replacement string

  • xforms (Array<Symbol, Proc>) (defaults to: [])

    Each Symbol must match one of the defined xforms. A Proc that takes one String argument and returns a String may also be passed to apply custom normalization logic

  • str (String)

    the string to normalize



# File 'lib/kiba/extend/utils/string_normalizer.rb', line 112

def call(str:, mode: nil, replacements: {}, xforms: [])
  new(
    mode: mode,
    replacements: replacements,
    xforms: xforms
  ).call(str)
end

Instance Method Details

#call(val) ⇒ Object

Since:

  • 3.3.0



# File 'lib/kiba/extend/utils/string_normalizer.rb', line 157

def call(val)
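  # Blank values (nil, empty, or whitespace-only) are returned unchanged
  return val if val.blank?

  # Apply each xform in order, feeding each result into the next
  xforms.inject(val) { |res, nv| do_xform(res, nv) }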
end
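The `inject` call above folds the value through the xforms list, so each step receives the previous step's output. The source of `do_xform` is not shown here, so the following is only a sketch of how such a pipeline might dispatch Symbols and Procs. All names (`SKETCH_STEPS`, `sketch_do_xform`, `sketch_normalize`) are hypothetical, and the blank check is a plain-Ruby approximation of ActiveSupport's `blank?`.

```ruby
# Hypothetical stand-ins for two symbol-named transforms
SKETCH_STEPS = {
  lower: ->(str) { str.downcase },
  punct: ->(str) { str.gsub(/\p{Punct}/, "") }
}.freeze

# Assumed dispatch: Procs are called directly, Symbols are looked up
def sketch_do_xform(val, xform)
  xform.is_a?(Proc) ? xform.call(val) : SKETCH_STEPS.fetch(xform).call(val)
end

def sketch_normalize(val, xforms)
  # Plain-Ruby approximation of the blank? guard for String or nil input
  return val if val.nil? || val.strip.empty?

  # Fold the value through each xform, left to right
  xforms.inject(val) { |res, xform| sketch_do_xform(res, xform) }
end

upcase = ->(val) { val.upcase }

shouted = sketch_normalize("Release the bats!!", [:punct, upcase])
order_matters = sketch_normalize("Release the bats!!", [upcase, :lower])
blank_input = sketch_normalize("   ", [:lower])
```

Since later xforms see the output of earlier ones, placing `upcase` before `:lower` leaves the string lowercased, and a blank input skips the pipeline entirely.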