Implemented JustAnotherArchivist's requested changes to Telegram scraper from PR #2
…tracting a post's view count
…ribute type Channel.
… attribute; fixed video edge cases.
…s didn't have a next page link (added reasonable default)
…se they weren't in a post containing a 'tgme_widget_message_text' div
I got frustrated with how slow the scraping was, so I changed how forwarded channels are handled: the Channel definition now requires only the username, rather than retrieving the full forwarded channel information for every forwarded message. Additional changes:
…edundant outlinks
…t wasn't correctly getting the forwarding information in forwarded posts that contained attachments but no text
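To illustrate the Channel change described above, here is a minimal sketch, not the actual definition in snscrape/modules/telegram.py. It assumes a dataclass in which only `username` is required; the optional fields `title` and `description` and the helper `channel_from_forward_link` are hypothetical names chosen for the example.

```python
import dataclasses
import typing


@dataclasses.dataclass
class Channel:
    # Only the username is required, so a forwarded channel can be
    # recorded without an extra request for its full details.
    username: str
    # Hypothetical optional fields; they stay None for forwarded channels.
    title: typing.Optional[str] = None
    description: typing.Optional[str] = None


def channel_from_forward_link(href):
    """Build a Channel from a t.me link using only the username in the URL."""
    username = href.split('t.me/')[1].split('/')[0]
    return Channel(username=username)


forwarded = channel_from_forward_link('https://t.me/SouthwestOhioPB/17')
print(forwarded.username)
```

With this shape, the cost per forwarded message is a string split rather than a page fetch; the trade-off is that the optional fields are empty unless the channel is scraped separately.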
One thing we need to decide is whether we want to include pinned messages, e.g. https://t.me/s/SouthwestOhioPB/17, where the content is just "[CHANNEL NAME] pinned a [ATTACHMENT TYPE]". Unfortunately, unlike the desktop app, the browser interface doesn't include the link to the message that was pinned, so there's very little information in the scraped post.
if link['href'] == rawUrl or link['href'] == url:
    style = link.attrs.get('style', '')
    # Generic filter of links to the post itself, catches videos, photos, and the date link
    if style != '':
        imageUrls = re.findall('url\(\'(.*?)\'\)', style)
        if len(imageUrls) == 1:
            media.append(Photo(url = imageUrls[0]))
This code is partially duplicated below (lines 152-155). Maybe it could be isolated into a method, or at least the regex extracted into a variable so it stays consistent.
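A minimal sketch of that suggestion, assuming a module-level compiled pattern and a small helper. The names `STYLE_URL_PATTERN` and `extract_style_urls` are hypothetical; the pattern is the one from the diff, written as a raw string.

```python
import re

# Hypothetical name; the expression matches url('...') inside an inline style.
STYLE_URL_PATTERN = re.compile(r"url\('(.*?)'\)")


def extract_style_urls(style):
    """Return every URL found in url('...') declarations of a style string."""
    return STYLE_URL_PATTERN.findall(style)


style = "width:100px;background-image:url('https://example.org/photo.jpg')"
print(extract_style_urls(style))
```

Both call sites (the photo links and the video thumbnail) could then share the helper, so a future change to the pattern happens in one place.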
snscrape/modules/telegram.py
Outdated
forwarded = forward_tag['href'].split('t.me/')[1].split('/')[0]
for voice_player in post.find_all('a', {'class': 'tgme_widget_message_voice_player'}):
    audioUrl = voice_player.find('audio')['src']
    durationStr = voice_player.find('time').text.split(':')
durationStr comes from split(), so it is a list rather than a string. Both calls pass lists, so maybe rename the variables and the durationStrToSeconds method to reflect that.
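One way the rename could look, as a sketch only: the body of durationStrToSeconds is not shown in this thread, so the conversion below is an assumption, and `duration_parts_to_seconds` and `duration_parts` are hypothetical names.

```python
def duration_parts_to_seconds(parts):
    """Convert ['mm', 'ss'] or ['hh', 'mm', 'ss'] into a number of seconds."""
    seconds = 0
    for part in parts:
        # Each step shifts the running total up by one base-60 position.
        seconds = seconds * 60 + int(part)
    return seconds


# The name now says "list of parts" instead of "string".
duration_parts = '1:05'.split(':')
print(duration_parts_to_seconds(duration_parts))  # 65
```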
snscrape/modules/telegram.py
Outdated
    videoThumbnailUrl = None
else:
    style = iTag['style']
    videoThumbnailUrl = re.findall('url\(\'(.*?)\'\)', style)[0]
The regex can be extracted into a variable, since it's also used above.
snscrape/modules/telegram.py
Outdated
if videoTag is None:
    videoUrl = None
else:
    videoUrl = videoTag['src']
Suggested change:
- if videoTag is None:
-     videoUrl = None
- else:
-     videoUrl = videoTag['src']
+ videoUrl = None if videoTag is None else videoTag['src']
else:
    cls = Video
    durationStr = video_player.find('time').text.split(':')
    mKwargs['duration'] = durationStrToSeconds(durationStr)
Same comment as above on list vs. str for durationStrToSeconds.
snscrape/modules/telegram.py
Outdated
if viewsSpan is None:
    views = None
else:
    views = parse_num(viewsSpan.text)
Suggested change:
- if viewsSpan is None:
-     views = None
- else:
-     views = parse_num(viewsSpan.text)
+ views = None if viewsSpan is None else parse_num(viewsSpan.text)
snscrape/modules/telegram.py
Outdated
s = s.replace(' ', '')
if s.endswith('M'):
    return int(float(s[:-1]) * 1e6), 10 ** (6 if '.' not in s else 6 - len(s[:-1].split('.')[1]))
elif s.endswith('K'):
    return int(float(s[:-1]) * 1000), 10 ** (3 if '.' not in s else 3 - len(s[:-1].split('.')[1]))
else:
    return int(s), 1
I did not check this logic. Maybe add a docstring with example inputs and expected outputs.
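A sketch of what such a docstring might look like. It assumes the fragment above is the whole body of a function taking a string `s`; the name `parse_abbreviated_count` is hypothetical, and reading the second return value as a granularity (the smallest step the abbreviation can express) is an interpretation of the code, not something stated in the PR.

```python
def parse_abbreviated_count(s):
    """Parse an abbreviated count such as '1.5K' into (value, granularity).

    The granularity is read here as the smallest step the abbreviated
    form can express (an interpretation, not confirmed by the author).

    >>> parse_abbreviated_count('123')
    (123, 1)
    >>> parse_abbreviated_count('1.5K')
    (1500, 100)
    >>> parse_abbreviated_count('2M')
    (2000000, 1000000)
    """
    s = s.replace(' ', '')
    if s.endswith('M'):
        return int(float(s[:-1]) * 1e6), 10 ** (6 if '.' not in s else 6 - len(s[:-1].split('.')[1]))
    elif s.endswith('K'):
        return int(float(s[:-1]) * 1000), 10 ** (3 if '.' not in s else 3 - len(s[:-1].split('.')[1]))
    else:
        return int(s), 1
```

The examples are in doctest format, so they can be checked with `python -m doctest` if the function lives in a module file.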
snscrape/modules/telegram.py
Outdated
if r.status_code == 200:
    return (True, None)
elif r.status_code // 100 == 5:
    return (False, f'status code: {r.status_code}')
Suggested change:
- return (False, f'status code: {r.status_code}')
+ return (False, f'{r.status_code=}')
I discovered this recently; the = specifier in f-strings is available in Python 3.8+. Just a suggestion.
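For reference, a small demonstration of the `=` specifier. The `SimpleNamespace` object is only a stand-in for the response object `r`; it is an assumption made to keep the snippet self-contained.

```python
from types import SimpleNamespace

# Stand-in for a response object; only the status_code attribute matters here.
r = SimpleNamespace(status_code=503)

# The = specifier prints the expression text, an equals sign, then the repr of the value.
message = f'{r.status_code=}'
print(message)  # r.status_code=503
```

Note that this changes the message text from `status code: 503` to `r.status_code=503`, which matters if anything downstream parses or matches the message.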
snscrape/modules/telegram.py
Outdated
else:
    return (False, None)
Suggested change:
- else:
-     return (False, None)
+ return (False, None)
No need for the else; having a base-level return with the default values is also a good pattern.
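Put together, the pattern might look like the sketch below. The function name `check_status` and its bare `status_code` parameter are hypothetical; the real method in the PR receives a response object `r`.

```python
def check_status(status_code):
    """Return (ok, message) for an HTTP status code."""
    if status_code == 200:
        return (True, None)
    if status_code // 100 == 5:
        return (False, f'status code: {status_code}')
    # Base-level default: reached whenever no earlier branch returned.
    return (False, None)


print(check_status(200))
print(check_status(503))
print(check_status(404))
```

Because every branch returns, the trailing `else` adds nothing, and the final return makes the default outcome easy to spot.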
…TTERN as variable
Implemented requested changes from JustAnotherArchivist#413
Channel
Additional steps that should be done:
Document dataclass for arbitrary attached documents