
Conversation

@NBIX-Mark-Southern
Contributor

✨ Changes

The data URL regex is made verbose so it is easier to read. It also contains a capture for parameters (attribute=value). In a follow-up PR I'll add the ability to parse and assemble parameters.

Using re.search instead of re.fullmatch allows for URLs embedded within text. A follow-up could potentially iterate through found URLs.

I also check whether the regex actually matched; if not, DataURL.from_url returns None to indicate no match.
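As a rough illustration of that behaviour (a minimal sketch — the pattern below is simplified, and this `DataURL` class is a stand-in, not the library's actual implementation):

```python
import re

# Simplified stand-in for the library's pattern; the real DATA_URL_RE in
# the PR is more complete (verbose mode, parameters capture, etc.).
DATA_URL_RE = re.compile(
    r"data:(?P<MIME>[a-z-]+/[\w.+-]+)?(?P<encoded>;base64)?,(?P<data>[\w.~%=/+-]+)"
)

class DataURL:
    def __init__(self, url):
        self._url = url

    @classmethod
    def from_url(cls, url):
        # Returning None (rather than raising) signals "no match" to the caller.
        match = DATA_URL_RE.fullmatch(url)
        if match is None:
            return None
        return cls(url)

print(DataURL.from_url("not a data url"))  # → None
```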

Owner

@telday telday left a comment


Thanks for contributing. The regex changes look good (thanks for the readability updates, they look great!) but I would like to hold off on the change to re.search in favor of a more layered approach to the functionality.

I saw your other PR as well but will hold off on reviewing that until this one goes through.

r"data:(?P<MIME>([\w-]+\/[\w+\.-]+(;[\w-]+\=[\w-]+)?)?)(?P<encoded>;base64)?,(?P<data>[\w\d.~%\=\/\+-]+)"
r"""
data: # literal data:
(?P<MIME>[\w\-\.+]+/[\w\-\.+]+)? # optional media type
Owner


MIME types are not allowed to have . or + characters in the prefix. They are restricted to lower case letters, digits, and - characters. So the \w match I use here should definitely be updated to only include lower case letters as well.

There is a case for allowing non-standard MIME types to be processed however if this functionality is added it should be controlled by a flag of some sort so the user is aware they are outside RFC specification.

I'm interested to hear what some of the use cases are for this module, specifically how often we are producing/processing non-standard URLs.
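One way to reconcile strictness with real-world permissiveness is a flag-gated fallback, sketched here with assumed names (`is_valid_mime` and `allow_nonstandard` are illustrations, not the module's API):

```python
import re

# The strict pattern follows the restrictions described above (lower-case
# start, limited character sets); the permissive one mirrors a broader
# \w-style match for non-standard types.
STRICT_MIME = re.compile(r"[a-z][a-z0-9-]*/[a-z][a-z0-9.+_-]*")
PERMISSIVE_MIME = re.compile(r"[\w.+-]+/[\w.+-]+")

def is_valid_mime(mime, allow_nonstandard=False):
    # The flag keeps RFC-violating types opt-in, so the user is aware
    # they are outside the specification.
    pattern = PERMISSIVE_MIME if allow_nonstandard else STRICT_MIME
    return pattern.fullmatch(mime) is not None

print(is_valid_mime("text/plain"))                   # → True
print(is_valid_mime("1/2"))                          # → False
print(is_valid_mime("1/2", allow_nonstandard=True))  # → True
```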

Contributor Author


Happy to give something back. This library definitely saved me some time.

I looped through all the MIME types in the Python mimetypes module...

import mimetypes

mimetypes.init()
primary_chars = set()
secondary_chars = set()
for extension, mime_type in mimetypes.types_map.items():
    primary, secondary = mime_type.split("/")
    primary_chars.update(primary)
    secondary_chars.update(secondary)

print("".join(sorted(primary_chars)))
print("".join(sorted(secondary_chars)))
acdefghilmnoprstuvx
+-.0123456789ABCDEFGHIKLMNOPQRSTVWXYZ_abcdefghijklmnopqrstuvwxyz

but I've also seen examples such as x-world/x-vrml and x-music/x-midi. I think there's a case for being permissive and not breaking things for people if the real world gets in the way. Even Python's mimetypes module allows non-standard types without error, e.g.

mimetypes.add_type("1/2", ".12")

I've updated the regex on the basis of that investigation.

  • the type and subtype must now start with a lower-case letter
  • the type can only contain lower-case letters and dashes
  • the subtype can contain +-._0-9a-z

r"""
data: # literal data:
(?P<MIME>[\w\-\.+]+/[\w\-\.+]+)? # optional media type
(?P<parameters>(?:;[\w\-\.+]+=[\w\-\.+%]+)*) # optional attribute=values, value can be url encoded
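The `parameters` capture could then be split into attribute/value pairs along these lines (a sketch — `parse_parameters` is a hypothetical helper, not part of this PR):

```python
from urllib.parse import unquote

def parse_parameters(raw):
    """Split a ';attr=value;attr=value' capture into a dict.

    Values may be percent-encoded, matching the regex comment above.
    """
    params = {}
    for pair in raw.split(";"):
        if not pair:
            continue  # skip the empty chunk before the leading ';'
        attribute, _, value = pair.partition("=")
        params[attribute] = unquote(value)
    return params

print(parse_parameters(";charset=utf-8;filename=my%20file.txt"))
# → {'charset': 'utf-8', 'filename': 'my file.txt'}
```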
Owner


Parameter values are also allowed to be the quoted-string token defined in RFC 822 so if we are going to add parameterization here we should probably accept those values as well.

Contributor Author


I was going off of https://www.rfc-editor.org/rfc/rfc2397.html, didn't see quoted strings there. Do you think it could still be a useful advance without them? It seems like a fringe case of a fringe case to me.

Personally, I'm using the parameters functionality to store a filename for the encoded data... and possibly the charset may be useful in future...

If you can, please look at the follow-up PR, as this one doesn't exist in isolation.

Owner


I'm fine with that for now, just something to keep in mind. That RFC imports the "value" token from RFC 2045, which allows a value to be either a plain token or a quoted string.
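For reference, admitting RFC 2045 quoted-strings alongside plain tokens might look like this (illustrative only — not the pattern merged in this PR):

```python
import re

PARAM_RE = re.compile(
    r"""
    ;(?P<attribute>[\w.+-]+)
    =
    (?P<value>
        "[^"\\]*(?:\\.[^"\\]*)*"  # quoted-string, allowing backslash escapes
        | [\w.+%-]+               # or a plain token
    )
    """,
    re.VERBOSE,
)

match = PARAM_RE.fullmatch(';filename="my file.txt"')
print(match.group("value"))  # → "my file.txt" (quotes included)
```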

self._data = base64.b64decode(raw_data)
else:
self._data = raw_data
match = DATA_URL_RE.search(self._url)
Owner


I'm not huge on the switch to search here. I would prefer that at this level in the code we keep a low-level focus, and instead add a higher-level module alongside this one which works on chunks of text that may have one or more data URLs embedded in them.

I would be more than happy to write the additional functions if you don't have time, just outline your requirements in an issue.
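That layered approach might look roughly like this — a text-scanning helper on top of a strict low-level parser (`find_data_urls` and the simplified pattern are assumptions for illustration):

```python
import re

# Simplified data-URL pattern; the real one in the PR captures the MIME
# type, parameters, and the base64 marker separately.
DATA_URL_RE = re.compile(r"data:(?:[a-z-]+/[\w.+-]+)?(?:;base64)?,[\w.~%=/+-]+")

def find_data_urls(text):
    """Scan free text and return every embedded data URL."""
    return [m.group(0) for m in DATA_URL_RE.finditer(text)]

sample = "see data:text/plain;base64,aGk= and data:,hello for examples"
print(find_data_urls(sample))
# → ['data:text/plain;base64,aGk=', 'data:,hello']
```

This keeps `fullmatch` semantics for the low-level parser while still serving the embedded-URL use case through a separate function.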

Contributor Author


Fine with that. I switched to using fullmatch and the tests still pass :-).

So long as we keep the updated __parse_url logic, it fits my use cases too!

@telday telday merged commit 6887e16 into telday:master Jun 22, 2025
1 check passed