key Jupyter notebook mods by L4R5m · Pull Request #1 · seanhelvey/IntroToPythonForRUsers

L4R5m · 2021-11-17T08:05:29Z

Changes to key.ipynb:

Add section headings and Table of Contents (visible if you have ToC2 nbextension installed and enabled)
show how to set Pandas options
refactor and bugfix data wrangling code
update matplotlib API example to match R ggplot example
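
For readers without the notebook open, setting Pandas options can look roughly like the sketch below. The option names are real pandas options, but the values are assumptions chosen for illustration, not the ones used in key.ipynb.

```python
import pandas as pd

# Assumed values for illustration; the notebook's actual settings are not shown in this thread.
pd.set_option("display.max_columns", 50)
pd.set_option("display.max_colwidth", 80)

# option_context applies a setting only inside the with-block,
# then restores the previous value on exit.
with pd.option_context("display.max_rows", 5):
    inside = pd.get_option("display.max_rows")

max_cols = pd.get_option("display.max_columns")
```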

Mostly minor refactoring, but making copy of 'data' df is important to avoid SettingWithCopyWarning when changing a column's dtype.
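
A minimal sketch of the copy-before-modify pattern, assuming the subset comes from a boolean filter. Only the `hs_tf` column name appears in this thread; `n_items` and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's 'data' DataFrame.
data = pd.DataFrame({"hs_tf": ["Yes", "No", "Yes"], "n_items": ["1", "2", "3"]})

# An explicit .copy() makes the subset an independent DataFrame, so the
# assignment below is not treated as a write into a slice of 'data'.
subset = data[data["hs_tf"] == "Yes"].copy()
subset["n_items"] = subset["n_items"].astype(int)
```

The original `data` keeps its string column; only `subset` is converted.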

Mutating cols based on their own value is bad practice as the cell is not repeatable... it is better to create a new col w/ the results.
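
As a sketch of the repeatable version, the step below derives a new column and leaves the source column alone. The column names `colors` and `num_colors` and the helper `add_num_colors` are hypothetical.

```python
import pandas as pd

# Hypothetical column; the notebook's real column names may differ.
df = pd.DataFrame({"colors": ["red, blue", "green"]})

def add_num_colors(frame):
    # Derive a new column and leave 'colors' untouched,
    # so running this step again gives the same result.
    return frame.assign(num_colors=frame["colors"].str.split(", ").str.len())

once = add_num_colors(df)
twice = add_num_colors(once)
```

Because `colors` is never overwritten, applying the step a second time produces an identical frame.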

…ample

L4R5m · 2021-11-18T06:04:24Z

I read Sam's presentation and now see that we want to convert the dataframes to 'tidy long form', not just find the quickest / easiest way to calculate 'number of colors' & 'number of description words'. So let me have a look at doing that... your original solution may have been the right approach!

My prior method was a shortcut that works for this problem statement, but isn't a 'data wrangling' best practice (make your dataframes tidy, e.g. 1 measurement per row). Now we show both approaches.
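
To make the difference concrete, here is a rough sketch of the two approaches on made-up data. The `id` and `colors` columns are assumptions, and the notebook's own code may be organized differently.

```python
import pandas as pd

# Hypothetical data; 'id' and 'colors' are illustrative column names.
df = pd.DataFrame({"id": [1, 2], "colors": ["red, blue, green", "black"]})

# Shortcut: count the items without reshaping the frame.
shortcut = df["colors"].str.split(", ").str.len()

# Tidy long form: one color per row, then count rows per id.
tidy = df.assign(color=df["colors"].str.split(", ")).explode("color")[["id", "color"]]
num_colors = tidy.groupby("id")["color"].count()
```

Both give the same counts here; the tidy frame is longer but can be reused for other per-color questions.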

Jupyter autosave... so slow.

seanhelvey · 2021-11-22T16:04:08Z

I'm still happy to merge this if you'd like @L4R5m. I really like the comments, headings, and formatting. The big challenge is that Sam and I were trying to follow a very simple format of 1) read the data 2) wrangle it 3) visualize so that newcomers would be able to follow. I think we need to distill this so that it maps more closely to Sam's solution here https://github.com/samanthacsik/RLadiesSB-RvsPython/blob/main/KEY_Rcode.R. Can we get it down to three simple steps?

Simplify the notebook to only show one method of analyzing the data. Also added a few more comments and cleaned up formatting.

L4R5m · 2021-11-22T20:01:55Z

Sounds good, I updated the Notebook to remove one of the analysis methods so now it is just based on the 'tidy' dataframe method. Feel free to edit further if you want to simplify more.

seanhelvey · 2021-11-26T19:32:18Z

I don't think I can modify the pull request since you forked @L4R5m. I did my best to incorporate your code into key-2. @samanthacsik and @an-bui feel free to give feedback here too. Some notes:

I've gone back and forth between using a lambda and the tidy one-per-row format. Why don't we do colors tidy-style and then use a lambda on the image description so they can see each approach?
We can make it very simple and clear for attendees to get into the notebook and write code by starting with the pandas steps 1-3, and then introduce Matplotlib and NumPy afterwards.
Since we are going to be writing every line of code together for the first few steps at least, we can eliminate the formatting options by not using print statements, so output is nicely formatted by default.
I am conflicted between filtering using df[df['hs_tf'] == 'Yes'] and df.loc[df.hs_tf == 'Yes'], but to me the first seems easier to understand, even though I know the docs say .loc is best.
Maybe we can avoid dot notation and just use square brackets for consistency?
Is the sort=False in the groupby for performance? I'm trying to eliminate anything we can avoid explaining.
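
A small sketch that may help settle a few of the notes above: a lambda on the description, the two filtering styles, and what `sort=False` changes. Only `hs_tf` is a column name from this thread; `species`, `img_desc`, and the data are hypothetical.

```python
import pandas as pd

# Hypothetical data; only the 'hs_tf' column name comes from the thread.
df = pd.DataFrame({
    "hs_tf": ["Yes", "No", "Yes", "Yes"],
    "species": ["owl", "bat", "cat", "bat"],
    "img_desc": ["a small owl", "bat at night", "cat", "two bats hanging upside down"],
})

# Lambda on the description: count whitespace-separated words per row.
num_words = df["img_desc"].apply(lambda text: len(text.split()))

# Both filters select the same rows (square brackets used throughout).
by_brackets = df[df["hs_tf"] == "Yes"]
by_loc = df.loc[df["hs_tf"] == "Yes"]

# sort=True (the default) orders the group keys;
# sort=False keeps them in order of first appearance.
sorted_keys = df.groupby("species").size().index.tolist()
unsorted_keys = df.groupby("species", sort=False).size().index.tolist()
```

On the `sort=False` question: the pandas docs note that turning sorting off can improve performance, but it also changes the order of the groups in the result, so dropping it is only safe if nothing downstream relies on first-appearance order.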

L4R5m added 4 commits November 15, 2021 22:23

Add TOC & sections, set Pandas options, make copy of data df

2c71c1c

Mostly minor refactoring, but making copy of 'data' df is important to avoid SettingWithCopyWarning when changing a column's dtype.

Stop mutating existing cols, and fix counting of colors / num_words

925e945

Mutating cols based on their own value is bad practice as the cell is not repeatable... it is better to create a new col w/ the results.

Update data plots, adding matplotlib API example to match R ggplot ex…

8507b0f

…ample

Display notebook Table of Contents

16950a7

L4R5m added 2 commits November 19, 2021 10:35

Add 'tidy' data analysis approach to key.ipynb

b237812

My prior method was a shortcut that works for this problem statement, but isn't a 'data wrangling' best practice (make your dataframes tidy, e.g. 1 measurement per row). Now we show both approaches.

Jupyter notebook hadn't saved prior to last commit, so fixing

be63e51

Jupyter autosave... so slow.

Removed analysis section based on the non-tidy method

f1e5255

Simplify the notebook to only show one method of analyzing the data. Also added a few more comments and cleaned up formatting.

L4R5m wants to merge 7 commits into seanhelvey:main from L4R5m:key_nb_mods