
Conversation

@NBIX-Mark-Southern
Contributor

✨ Changes

The data URL regex is made verbose so it is easier to read. It also contains a capture for parameters (attribute=value). In a follow-up PR I'll add the ability to parse and assemble parameters.

Using re.search instead of re.fullmatch allows for URLs embedded within text. A follow-up could potentially iterate through found URLs.

I also check whether the regex actually matched; if not, DataURL.from_url returns None to indicate no match.
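As a rough illustration of that behaviour (a minimal sketch — the pattern below is simplified, and this `DataURL` class is a stand-in, not the library's actual implementation):

```python
import re

# Simplified stand-in for the library's pattern; the real DATA_URL_RE in
# the PR is more complete (verbose mode, parameters capture, etc.).
DATA_URL_RE = re.compile(
    r"data:(?P<MIME>[a-z-]+/[\w.+-]+)?(?P<encoded>;base64)?,(?P<data>[\w.~%=/+-]+)"
)

class DataURL:
    def __init__(self, url):
        self._url = url

    @classmethod
    def from_url(cls, url):
        # Returning None (rather than raising) signals "no match" to the caller.
        match = DATA_URL_RE.fullmatch(url)
        if match is None:
            return None
        return cls(url)

print(DataURL.from_url("not a data url"))  # → None
```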

Owner

@telday telday left a comment


Thanks for contributing. The regex changes look good (thanks for the readability updates, they look great!) but I would like to hold off on the change to re.search in favor of a more layered approach to the functionality.

I saw your other PR as well but will hold off on reviewing that until this one goes through.

r"data:(?P<MIME>([\w-]+\/[\w+\.-]+(;[\w-]+\=[\w-]+)?)?)(?P<encoded>;base64)?,(?P<data>[\w\d.~%\=\/\+-]+)"
r"""
data: # literal data:
(?P<MIME>[\w\-\.+]+/[\w\-\.+]+)? # optional media type
Owner


MIME types are not allowed to have . or + characters in the prefix. They are restricted to lower case letters, digits, and - characters. So the \w match I use here should definitely be updated to only include lower case letters as well.

There is a case for allowing non-standard MIME types to be processed however if this functionality is added it should be controlled by a flag of some sort so the user is aware they are outside RFC specification.

I'm interested to hear what some of the use cases are for this module, specifically how often we are producing/processing non-standard URLs.
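One way to reconcile strictness with real-world permissiveness is a flag-gated fallback, sketched here with assumed names (`is_valid_mime` and `allow_nonstandard` are illustrations, not the module's API):

```python
import re

# The strict pattern follows the restrictions described above (lower-case
# start, limited character sets); the permissive one mirrors a broader
# \w-style match for non-standard types.
STRICT_MIME = re.compile(r"[a-z][a-z0-9-]*/[a-z][a-z0-9.+_-]*")
PERMISSIVE_MIME = re.compile(r"[\w.+-]+/[\w.+-]+")

def is_valid_mime(mime, allow_nonstandard=False):
    # The flag keeps RFC-violating types opt-in, so the user is aware
    # they are outside the specification.
    pattern = PERMISSIVE_MIME if allow_nonstandard else STRICT_MIME
    return pattern.fullmatch(mime) is not None

print(is_valid_mime("text/plain"))                   # → True
print(is_valid_mime("1/2"))                          # → False
print(is_valid_mime("1/2", allow_nonstandard=True))  # → True
```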

Contributor Author


Happy to give something back. This library definitely saved me some time.

I looped through all the MIME types in the Python mimetypes module...

import mimetypes

mimetypes.init()
primary_chars = set()
secondary_chars = set()
for extension, mime_type in mimetypes.types_map.items():
    primary, secondary = mime_type.split("/")
    primary_chars.update(primary)
    secondary_chars.update(secondary)

print("".join(sorted(primary_chars)))
print("".join(sorted(secondary_chars)))
acdefghilmnoprstuvx
+-.0123456789ABCDEFGHIKLMNOPQRSTVWXYZ_abcdefghijklmnopqrstuvwxyz

but I've also seen examples such as x-world/x-vrml and x-music/x-midi. I think there's a case for being permissive and not breaking things for people if the real world gets in the way. Even Python's mimetypes module allows non-standard types without error, e.g.

mimetypes.add_type("1/2", ".12")

I've updated the regex on the basis of that investigation.

  • the type and subtype must now start with a lower-case letter
  • the type can only contain lower-case letters and dashes
  • the subtype can contain +-._0-9a-z

r"""
data: # literal data:
(?P<MIME>[\w\-\.+]+/[\w\-\.+]+)? # optional media type
(?P<parameters>(?:;[\w\-\.+]+=[\w\-\.+%]+)*) # optional attribute=values, value can be url encoded
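The `parameters` capture could then be split into attribute/value pairs along these lines (a sketch — `parse_parameters` is a hypothetical helper, not part of this PR):

```python
from urllib.parse import unquote

def parse_parameters(raw):
    """Split a ';attr=value;attr=value' capture into a dict.

    Values may be percent-encoded, matching the regex comment above.
    """
    params = {}
    for pair in raw.split(";"):
        if not pair:
            continue  # skip the empty chunk before the leading ';'
        attribute, _, value = pair.partition("=")
        params[attribute] = unquote(value)
    return params

print(parse_parameters(";charset=utf-8;filename=my%20file.txt"))
# → {'charset': 'utf-8', 'filename': 'my file.txt'}
```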
Owner


Parameter values are also allowed to be the quoted-string token defined in RFC 822 so if we are going to add parameterization here we should probably accept those values as well.

Contributor Author


I was going off of https://www.rfc-editor.org/rfc/rfc2397.html, didn't see quoted strings there. Do you think it could still be a useful advance without them? It seems like a fringe case of a fringe case to me.

Personally, I'm using the parameters functionality to store a filename for the encoded data... and possibly the charset may be useful in future...

If you can, please look at the follow-up PR, as this one doesn't exist in isolation.

Owner


I'm fine with that for now, just something to keep in mind. That RFC imports the "value" token from RFC 2045, which allows a value to be either a plain token or a quoted string.
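For reference, admitting RFC 2045 quoted-strings alongside plain tokens might look like this (illustrative only — not the pattern merged in this PR):

```python
import re

PARAM_RE = re.compile(
    r"""
    ;(?P<attribute>[\w.+-]+)
    =
    (?P<value>
        "[^"\\]*(?:\\.[^"\\]*)*"  # quoted-string, allowing backslash escapes
        | [\w.+%-]+               # or a plain token
    )
    """,
    re.VERBOSE,
)

match = PARAM_RE.fullmatch(';filename="my file.txt"')
print(match.group("value"))  # → "my file.txt" (quotes included)
```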

self._data = base64.b64decode(raw_data)
else:
self._data = raw_data
match = DATA_URL_RE.search(self._url)
Owner


I'm not huge on the switch to search here. I would prefer that at this level in the code we keep a low-level focus, and instead add a higher-level module alongside this one which works on chunks of text that may have one or more data URLs embedded in them.

I would be more than happy to write the additional functions if you don't have time, just outline your requirements in an issue.
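That layered approach might look roughly like this — a text-scanning helper on top of a strict low-level parser (`find_data_urls` and the simplified pattern are assumptions for illustration):

```python
import re

# Simplified data-URL pattern; the real one in the PR captures the MIME
# type, parameters, and the base64 marker separately.
DATA_URL_RE = re.compile(r"data:(?:[a-z-]+/[\w.+-]+)?(?:;base64)?,[\w.~%=/+-]+")

def find_data_urls(text):
    """Scan free text and return every embedded data URL."""
    return [m.group(0) for m in DATA_URL_RE.finditer(text)]

sample = "see data:text/plain;base64,aGk= and data:,hello for examples"
print(find_data_urls(sample))
# → ['data:text/plain;base64,aGk=', 'data:,hello']
```

This keeps `fullmatch` semantics for the low-level parser while still serving the embedded-URL use case through a separate function.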

Contributor Author


Fine with that. I switched to using fullmatch and the tests still pass :-).

So long as we keep the updated __parse_url logic, it fits my use cases too!

@telday telday merged commit 6887e16 into telday:master Jun 22, 2025
1 check passed