key Jupyter notebook mods #1

Open
L4R5m wants to merge 7 commits into seanhelvey:main from L4R5m:key_nb_mods

Conversation

L4R5m commented Nov 17, 2021

Changes to key.ipynb:

  • Add section headings and a Table of Contents (visible if you have the ToC2 nbextension installed and enabled)
  • Show how to set Pandas options
  • Refactor and bugfix the data wrangling code
  • Update the matplotlib API example to match the R ggplot example

Mostly minor refactoring, but making a copy of the 'data' df is important
to avoid SettingWithCopyWarning when changing a column's dtype.
Mutating cols based on their own values is bad practice because the cell is
then not repeatable; it is better to create a new col with the results.
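
To illustrate the two points in the commit message above, here is a minimal sketch rather than the notebook's actual code. The `hs_tf` column name comes from this thread; the `n_words` column, its values, and the `subset` name are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's 'data' DataFrame.
data = pd.DataFrame({
    "hs_tf": ["Yes", "No", "Yes"],
    "n_words": ["3", "5", "2"],  # numbers stored as strings
})

# Take an explicit copy of the filtered rows, so later assignments
# act on an independent DataFrame instead of a slice of 'data'.
subset = data[data["hs_tf"] == "Yes"].copy()

# Changing the dtype on the copy does not trigger SettingWithCopyWarning.
subset["n_words"] = subset["n_words"].astype(int)

# Write the derived values to a new column instead of overwriting the
# input column, so re-running this cell always gives the same result.
subset["n_words_doubled"] = subset["n_words"] * 2
```

Had the last line been `subset["n_words"] = subset["n_words"] * 2`, each re-run of the cell would double the values again, which is the repeatability problem described above.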
@L4R5m
Author

L4R5m commented Nov 18, 2021

I read Sam's presentation and now see that we want to convert the dataframes to 'tidy long form', not just find the quickest / easiest way to calculate 'number of colors' & 'number of description words'. So let me have a look at doing that... your original solution may have been the right approach!

My prior method was a shortcut that works for this problem statement, but
it isn't a 'data wrangling' best practice (make your dataframes tidy, e.g.
1 measurement per row). Now we show both approaches.
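
As a rough sketch of what 'tidy long form' means here, assuming a hypothetical frame in which each item's colors are packed into one comma-separated string (the `id`, `colors`, and `color` names and the values are invented for illustration and are not taken from the notebook):

```python
import pandas as pd

# Hypothetical wide data: one row per item, all colors in one string.
wide = pd.DataFrame({
    "id": [1, 2],
    "colors": ["red, blue", "green"],
})

# Split each string into a list, then explode so that every
# (id, color) pair becomes its own row: one measurement per row.
tidy = (
    wide.assign(color=wide["colors"].str.split(", "))
        .explode("color")
        .drop(columns="colors")
        .reset_index(drop=True)
)

# With tidy data, 'number of colors' is just a count per id.
n_colors = tidy.groupby("id")["color"].count()
```

The shortcut approach would compute the count directly from the packed string; the tidy version takes one more step but leaves a frame that is easier to group, filter, and plot.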
@seanhelvey
Owner

I'm still happy to merge this if you'd like @L4R5m. I really like the comments, headings, and formatting. The big challenge is that Sam and I were trying to follow a very simple format of 1) read the data 2) wrangle it 3) visualize so that newcomers would be able to follow. I think we need to distill this so that it maps more closely to Sam's solution here https://github.com/samanthacsik/RLadiesSB-RvsPython/blob/main/KEY_Rcode.R. Can we get it down to three simple steps?

Simplify the notebook to only show one method of analyzing the data.

Also added a few more comments and cleaned up formatting.
@L4R5m
Author

L4R5m commented Nov 22, 2021

Sounds good, I updated the Notebook to remove one of the analysis methods so now it is just based on the 'tidy' dataframe method. Feel free to edit further if you want to simplify more.

@seanhelvey
Owner

I don't think I can modify the pull request since you forked @L4R5m. I did my best to incorporate your code into key-2. @samanthacsik and @an-bui feel free to give feedback here too. Some notes:

  • I've gone back and forth between using a lambda and the tidy format (one value per row). Why don't we do colors tidy style and then use a lambda on the image description so they can see each?
  • We can make it very simple and clear for attendees to get into the notebook and write code by starting with the pandas steps 1-3, and then introduce Matplotlib and NumPy afterwards.
  • Since we are going to be writing every line of code together for the first few steps at least, we can eliminate the formatting options by not using print statements, so output is nicely formatted by default.
  • I am conflicted between filtering using df[df['hs_tf'] == 'Yes'] and df.loc[df.hs_tf == 'Yes'], but to me the first seems easier to understand, even though I know the docs say .loc is best.
  • Maybe we can avoid dot notation and just use square brackets for consistency?
  • Is the sort=False in the groupby for performance? I'm trying to eliminate anything we can avoid explaining.
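
For reference while weighing the notes above, here is a small sketch of the options side by side. Only `hs_tf` and the value `'Yes'` come from this thread; the `description` column, its contents, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data; only the 'hs_tf' column name is from the thread.
df = pd.DataFrame({
    "hs_tf": ["Yes", "No", "Yes", "No"],
    "description": ["a red bird", "blue", "two green frogs here", "one cat"],
})

# Lambda approach: count the words in each description.
df["n_words"] = df["description"].apply(lambda text: len(text.split()))

# The two filtering styles select the same rows when given a boolean mask.
by_brackets = df[df["hs_tf"] == "Yes"]
by_loc = df.loc[df["hs_tf"] == "Yes"]
same_rows = by_brackets.equals(by_loc)

# sort=False keeps groups in order of first appearance ("Yes", then "No");
# the default sort=True orders the group keys ("No", then "Yes").
unsorted_means = df.groupby("hs_tf", sort=False)["n_words"].mean()
sorted_means = df.groupby("hs_tf")["n_words"].mean()
```

The pandas documentation describes turning `sort` off as a way to get better performance, but the aggregated values are the same either way; only the order of the group keys in the result changes, so dropping the argument should be safe if the goal is fewer things to explain.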

2 participants