Implement `from_segments` #32

arubehn · 2024-07-17T11:32:38Z

As discussed in #31, here is my proposal for explicitly handling data that already comes segmented and must not be segmented any further. I also added support for passing custom separators and made whitespace handling more flexible.

xrotwang · 2024-07-17T15:30:20Z

src/linse/typedsequence.py


+    @classmethod
+    def from_segments(cls, s, separator=None):
+        return s.split(separator) if separator else s.split()


I find this more obfuscation than convenience. Isn't

cls.from_segments(s.split('_'))

more explicit than

cls.from_segments(s, separator='_')

? I.e. if the caller already knows what to split on, then they should do it right away.

One aspect that should not be forgotten here is that two situations come up a lot when calling the function:

There is a string, so we want to split by the separator. Normally, in fact ALWAYS, it is a + by which we split, so we have cls.from_segments(string) as the convenient normal case. If we had to write s.split(" + ") here every time, we would end up writing many more lines.

There is a list, as when we read in data from CLDF, where the split has already been done because of the way the data is handled as a list there.

So if we want to say that from_segments deals with segments in LingPy (aka TOKENS) and in CLDF (aka CLDF_Segments), I'd consider it advantageous to check whether the input is a list and then revert it. But I know this may obfuscate it even more.

But the handling with separator as kw is something I consider an urgent convenience, since we have the default here, which we'd otherwise have to invoke ALWAYS via s.split(" + ").
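To make the convenience argument concrete, here is a minimal sketch of a default separator passed as a keyword argument. The helper name `split_morphemes` is hypothetical and is not part of linse; it only mirrors the " + " convention mentioned above.

```python
def split_morphemes(s, separator=" + "):
    # Hypothetical helper (not the linse API): split on the morpheme
    # separator first, then on whitespace inside each morpheme.
    return [morpheme.split() for morpheme in s.split(separator)]


# Common case: the default separator is applied implicitly.
default_case = split_morphemes("p a + t e r")

# Rare case: the caller overrides the separator explicitly.
custom_case = split_morphemes("p a _ t e r", separator=" _ ")
```

With a default like this, the frequent case needs no explicit `s.split(" + ")` at every call site, while unusual separators still have to be spelled out.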

Hm. Maybe we should have even more factory methods? I want to avoid the "seems to work" situations, i.e. situations where you are not forced to think about what your input actually looks like, yet something seems to happen and you just accept the results. Having separate methods that each accept only one datatype as input forces you to think about this, and allows tools like PyCharm to help you with it.

Another advantage of additional methods is that methods have docstrings, so we get a canonical place to document the clever things we might do to manipulate input :)

    @classmethod
    def from_segments(cls, s):
        """ only accepts list! """
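A rough sketch of what "one factory method per input type" could look like. The class below is a hypothetical stand-in (a plain `list` subclass), not the actual linse `Word`; the method name `from_string`, the default separators, and the explicit type checks are all assumptions made for illustration.

```python
class Word(list):
    """Hypothetical stand-in: a list of morphemes, each a list of segments."""

    @classmethod
    def from_string(cls, s, separator=" + "):
        """Only accepts a string such as "p a + t e r"."""
        if not isinstance(s, str):
            raise TypeError("from_string expects a str")
        return cls(morpheme.split() for morpheme in s.split(separator))

    @classmethod
    def from_segments(cls, segments, separator="+"):
        """Only accepts a list such as ["p", "a", "+", "t", "e", "r"]."""
        if not isinstance(segments, list):
            raise TypeError("from_segments expects a list")
        morphemes, current = [], []
        for segment in segments:
            if segment == separator:
                # Separator reached: close the current morpheme.
                morphemes.append(current)
                current = []
            else:
                current.append(segment)
        morphemes.append(current)
        return cls(morphemes)
```

Passing a string to `from_segments` in this sketch fails loudly instead of "seeming to work", which is the behaviour argued for above.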

Is different from __init__.

Here, we have a list like ["p", "a", "+", "t", "e", "r"]. But we want internally [["p", "a"], ["t", "e", "r"]].

One way to address this is " ".join(["p", "a", "+", "t", "e", "r"]).split(" + "), but I guess I would prefer a direct solution by iterating over the list and then splitting.
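For reference, a small sketch of the join-and-split round trip mentioned above. Note that splitting on " + " alone yields strings, so one more whitespace split per morpheme is needed to arrive at the nested list. The variable names are illustrative only.

```python
segments = ["p", "a", "+", "t", "e", "r"]

# Joining and splitting on " + " gives one string per morpheme ...
joined = " ".join(segments).split(" + ")

# ... so each morpheme still has to be split into its segments.
nested = [morpheme.split() for morpheme in joined]
```

The round trip also assumes that no segment contains whitespace, which is one more reason to prefer iterating over the list directly.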

    import itertools

    # using groupby
    split_by = lambda lst: [list(group) for k, group in itertools.groupby(lst, lambda x: x == "+") if not k]

Example:

    >>> split_by("p a t + e r + e r".split())
    [['p', 'a', 't'], ['e', 'r'], ['e', 'r']]

I would not use lambda; it was only there to show how this works. I got the solution after checking itertools again, looking for the opposite of itertools.chain, and then I found this blog demonstrating the solution.
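A lambda-free variant of the same idea might look as follows. This is only a sketch: the `separator` parameter and its default of "+" are assumptions added here, not something settled in this thread.

```python
import itertools


def split_by(segments, separator="+"):
    """Group consecutive non-separator items into sublists."""

    def is_separator(item):
        return item == separator

    return [
        list(group)
        for is_sep, group in itertools.groupby(segments, is_separator)
        if not is_sep  # drop the runs that consist of separators
    ]
```

One property worth knowing: because `itertools.groupby` merges consecutive items with the same key, leading, trailing, or doubled separators do not produce empty morphemes in this version.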

@LinguList you are right, thank you for pointing that out. I had not thought about this case. Given that, it does seem reasonable to me to have three factory methods (all of which require some sort of preprocessing before calling __init__ and allow for a custom separator), as you guys have suggested. I can quickly implement that :)

And I think itertools.groupby is better than using a hand-forged solution.

xrotwang · 2024-08-08T09:50:18Z

src/linse/typedsequence.py


+    def reversed_segments(self):
+        return Word([m[::-1] for m in self[::-1]])
+


Looks good to me.
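To illustrate what the method in the diff returns, here is a minimal sketch with a hypothetical stand-in class. It assumes that a word behaves like a list of morphemes and that each morpheme behaves like a list of segments; the real linse classes may differ.

```python
class Word(list):
    """Hypothetical stand-in: a list of morphemes, each a list of segments."""

    def reversed_segments(self):
        # Reverse the order of the morphemes and of the segments inside
        # each morpheme; the original word is left untouched.
        return Word([m[::-1] for m in self[::-1]])


word = Word([["p", "a"], ["t", "e", "r"]])
reversed_word = word.reversed_segments()
```

Under these assumptions, `reversed_word` holds the morphemes `["r", "e", "t"]` and `["a", "p"]`, in that order.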

What do we do with the other changes in this PR?

That's up to you to decide, of course ;) but since I didn't touch any of the existing methods, I think it would be no harm to keep the from_segments method around -- or do you suggest doing something differently?

arubehn added 3 commits July 17, 2024 12:16

test

0c281e3

test complete

44dc9f3

implement parsing segmented sequences

fa11cde

xrotwang reviewed Jul 17, 2024

implement Word.reversed_segments

93485da

arubehn force-pushed the master branch from 05d87ea to 93485da August 8, 2024 09:06

xrotwang reviewed Aug 8, 2024


3 participants
