Applied Metaphors: Learning TRIZ, Complexity, Data/Stats/ML using Metaphors

Change

If you wish to go anywhere, you must run twice as fast as that.

Correlations
Scatter and Bubble Plots
Regression Lines
Published

April 25, 2024

Modified

July 28, 2024

Abstract
How one variable changes with another

What graphs will we see today?

Variable #1 | Variable #2 | Chart Names  | Chart Shape
Quant       | Quant       | Scatter Plot |

What kind of Data Variables will we choose?

No | Pronoun | Answer | Variable/Scale | Example | What Operations?
1 | How Many / Much / Heavy? Few? Seldom? Often? When? | Quantities, with Scale and a Zero Value. Differences and Ratios/Products are meaningful. | Quantitative/Ratio | Length, Height, Temperature in Kelvin, Activity, Dose Amount, Reaction Rate, Flow Rate, Concentration, Pulse, Survival Rate | Correlation

Inspiration

Figure 1: ScatterPlot Inspiration http://www.calamitiesofnature.com/archive/?c=559

Does belief in Evolution depend upon the GDP of the country? Where is the US in all of this? Does the Bible Belt tip the scales here?

And India?

How do these Chart(s) Work?

Scatter Plots take two separate Quant variables as inputs. Each of the variables is mapped to a position, or coordinate: one for the X-axis, and the other for the Y-axis, like an ordered pair. Each pair of observations from the two Quant variables (which would be in one row!) thus gives us a point in the Scatter Plot.

Looking at these clouds of points gives us an intuitive sense of the relationship between the two Quant variables, that is, how one varies with the other. A cloud that slopes upward to the right indicates a positive relationship between the two; a cloud that slopes down to the right indicates a negative one. An amorphous cloud that does not discernibly slope either way would lead us to infer that there is little or no relationship between the variables.

Important: Slope and the Correlation Coefficient are Related

Under the assumption of a linear relationship between the two Quant variables, we plot a straight trend line, or regression line through the cloud of points, as a line that best represents that linear relationship. The slope of the regression line is directly linked to the Pearson Correlation Coefficient between the two variables.
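For readers who want to see this link in numbers: for a least-squares line, the slope equals the Pearson correlation multiplied by the ratio of the two standard deviations, slope = r × (sd of Y / sd of X). The sketch below checks that identity on a small made-up dataset. The data values and the helper names `pearson_r` and `ols_slope` are illustrative assumptions, not part of the course material.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def ols_slope(xs, ys):
    """Slope of the least-squares regression line of y on x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]

r = pearson_r(xs, ys)
slope = ols_slope(xs, ys)
slope_from_r = r * stdev(ys) / stdev(xs)

print(f"r = {r:.4f}")
print(f"slope (least squares)     = {slope:.4f}")
print(f"slope (r * sd_y / sd_x)   = {slope_from_r:.4f}")
```

The two slope values agree, which is why a steeper, tighter cloud and a higher correlation score tend to go together once the scales of the two variables are accounted for.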

Plotting a Scatter Plot

  • Using Orange
  • Using RAWgraphs
  • Using DataWrapper

We can use the now (overly) familiar iris dataset to plot our first scatter plot. Download the workflow file below:

Figure 2: Iris Scatter Plot
  • Try setting shapes and colours, and try plotting a “regression line”. Do you get one line, or several? Why, or why not? How can you switch between the two “methods”?
  • Try other pairs of Quant variables in the dataset.
  • Which plot is the most informative? Why?

We can also add the Correlations widget to evaluate correlations between all pairs of numerical/Quant variables. Then, keeping that widget open alongside the Scatter Plot widget, we can see how the shape of the plot relates to the correlation score.

When we do this, we might get a setup as shown below.

Figure 3: Iris Correlations and Scatter Plot

Here we can choose which correlation score we want to visualize in the correlations widget window and see the plot change in the scatter plot window.
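For the curious, the idea of scoring every pair of Quant columns and listing the strongest pairs first can be sketched in a few lines of Python. This is only a rough illustration of the idea, not Orange's own implementation; the column names and numbers are made up (they are NOT the real iris values), and `rank_correlations` is a hypothetical helper.

```python
from itertools import combinations
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def rank_correlations(table):
    """Return (col_a, col_b, r) for every pair of columns, strongest first."""
    pairs = [
        (a, b, pearson_r(table[a], table[b]))
        for a, b in combinations(table, 2)
    ]
    return sorted(pairs, key=lambda t: abs(t[2]), reverse=True)


# Hypothetical measurements, for illustration only
table = {
    "length": [1.0, 2.0, 3.0, 4.0, 5.0],
    "width": [2.1, 3.9, 6.2, 8.1, 9.8],  # rises with length
    "depth": [5.0, 4.1, 3.2, 1.9, 1.1],  # falls with length
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],  # little relation to the others
}

ranked = rank_correlations(table)
for a, b, r in ranked:
    print(f"{a:>6} vs {b:<6}  r = {r:+.3f}")
```

Sorting by the absolute value keeps strong negative relationships near the top of the list, alongside strong positive ones.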

Can you spot Simpson’s Paradox here? More on that further below.

For DataWrapper, see this tutorial: https://academy.datawrapper.de/article/65-how-to-create-a-scatter-plot

What is the Story here?

  • There are three species of iris flowers and they are “separable” based on combinations of their quantitative measurements.
  • Some pairs of Quant variables create Scatter Plots that are quite disjoint and allow easy identification of the species variable.
  • In an ML model for this dataset, the species variable is most likely to be the target variable while the rest are predictors.

Dataset: Cancer

Let us examine a fairly complex dataset pertaining to cancer, and analyze that with scatter plots.

We can use the same Workflow as before.

Examine the Data

Figure 4: Cancer Data Table

From Figure 4, we see that there is one Qual column Diagnosis, and all the remaining 31 columns seem to be some Quant measurements of a total of 569 tumours. (Not all columns are visible)

Figure 5: Cancer Data Summary Table #1

Figure 5 gives us histograms and statistics of all the 32 columns. Most histograms seem roughly symmetric, but a detailed look must be taken.

Figure 6: Cancer Data Summary Table #2

In Figure 6, we see that there is some imbalance between the counts for the one Qual variable, Diagnosis.

Data Dictionary

Note: Quantitative Data
“Id”
“Radius (mean)” “Texture (mean)”
“Perimeter (mean)” “Area (mean)”
“Smoothness (mean)” “Compactness (mean)”
“Concavity (mean)” “Concave points (mean)”
“Symmetry (mean)” “Fractal dimension (mean)”
“Radius (se)” “Texture (se)”
“Perimeter (se)” “Area (se)”
“Smoothness (se)” “Compactness (se)”
“Concavity (se)” “Concave points (se)”
“Symmetry (se)” “Fractal dimension (se)”
“Radius (worst)” “Texture (worst)”
“Perimeter (worst)” “Area (worst)”
“Smoothness (worst)” “Compactness (worst)”
“Concavity (worst)” “Concave points (worst)”
“Symmetry (worst)” “Fractal dimension (worst)”
  • Many of the Quant variables seem to be mean measurements, with the mean presumably taken over several “sites” within the same tumour.
  • Along with the mean, there are also measurements of se or standard error which is, roughly speaking, a measure of the standard deviation of the multiple measurements made. So for instance, Area(mean) and Area(se) are pairs of measurements created using multiple “sites” or “cross-sections” on one tumour.
  • Some other variables are labelled as worst, which may be either the max or min of such a set of “multi-site” tumour measurements.
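To make the mean/se pairing concrete, here is a minimal sketch that computes both from several measurements of one quantity. It uses the conventional definition of standard error (sample standard deviation divided by the square root of the number of measurements); whether the dataset's se columns were computed exactly this way is an assumption. The site readings and the helper name `mean_and_se` are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_se(measurements):
    """Return (mean, standard error) of repeated measurements.

    Standard error here is the sample standard deviation divided by sqrt(n).
    """
    n = len(measurements)
    m = mean(measurements)
    se = stdev(measurements) / sqrt(n)
    return m, se


# Hypothetical area readings from several "sites" on one tumour
site_areas = [512.0, 498.0, 530.0, 505.0, 520.0]

area_mean, area_se = mean_and_se(site_areas)
print(f"Area (mean) = {area_mean:.1f}")
print(f"Area (se)   = {area_se:.2f}")
```

Each tumour would then contribute one such pair of numbers, which is what allows mean and se to be plotted against each other in a scatter plot.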
Important: May the (data) Source be with you

It is important to note that these are (educated?) guesses; one is best off connecting with the person/agency that provided the data for a precise understanding of variables. This will prevent nonsensical plots/models and inferences from showing up in your work.

Note: Qualitative Data
  • Diagnosis: (text) (B)enign, or (M)alignant

Do use the joint view of correlation score and scatter plot to answer these, and possibly other Research Questions.

Research Questions

Note: Question

Q1. Are the mean and se observations correlated, for a particular variable?

(a) Cancer Scatter Plot #1
(b) Cancer Scatter Plot #2
(c) Cancer Scatter Plot #3
(d) Cancer Scatter Plot #4
Figure 7
Note: Question

Q2. Are the mean and worst observations correlated, for a particular variable?

Figure 8: Cancer Scatter Plot #5
Figure 9: Cancer Scatter Plot #6
Figure 10: Cancer Scatter Plot #7
Figure 11: Cancer Scatter Plot #8

What is the Story Here?

From Figure 7 (a), we see that area(mean) and area(se) are somewhat correlated; moreover, the correlation is slightly higher for the malignant tumours (red dots, appropriately…). This trend also shows up for radius in Figure 7 (b), and for fractal dimension in Figure 7 (d). However, for smoothness, we see much lower correlation (Figure 7 (c)).

For the mean vs worst scatter plots, we see decent correlations all around, with each of the graphs showing clouds tilted upward to the right.

Important: Simpson’s Paradox

Try removing the colours and then plotting a single regression line. This gives the overall (pooled) correlation; comparing it with the per-group lines shows whether you are running into problems such as Simpson’s Paradox:

Figure 12: Simpson’s Paradox
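The paradox is easy to reproduce with numbers. In the sketch below, two made-up groups each show a strong positive correlation on their own, yet the pooled data (colours removed, so to speak) show a negative one. All values are hypothetical and chosen only to make the reversal obvious.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data: (x values, y values) for each group
group_a = ([1.0, 2.0, 3.0, 4.0], [10.0, 11.5, 11.8, 13.2])  # low x, high y
group_b = ([6.0, 7.0, 8.0, 9.0], [1.0, 2.4, 2.9, 4.1])      # high x, low y

r_a = pearson_r(*group_a)
r_b = pearson_r(*group_b)

# Pool the two groups, ignoring the group label
pooled_x = group_a[0] + group_b[0]
pooled_y = group_a[1] + group_b[1]
r_pooled = pearson_r(pooled_x, pooled_y)

print(f"Group A: r = {r_a:+.3f}")
print(f"Group B: r = {r_b:+.3f}")
print(f"Pooled:  r = {r_pooled:+.3f}")
```

Within each group the cloud tilts upward, but the pooled cloud tilts downward, because the group with larger x values sits lower on the Y-axis overall.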

And see also this:

A Variant

Your Turn

  1. Try to play this online Correlation Game.
Note: 2. School Expenditure and Grades.

Note: 3. Gas Prices and Consumption

As described here. Note the log-transformed Quant data…why do you reckon this was done in the data set itself?

Note: 4. Horror Movies (Bah. You awful people…)

Note: 6. Food Delivery Times

Wait, But Why?

  • Scatter Plots, when they show “linear” clouds, tell us that there is some relationship between the two Quant variables we have just plotted.
  • If so, and if one of them is the target variable you are trying to design for, then the other (the independent, or controllable, variable) is something you might want to design with.
Important

Target variables are usually plotted on the Y-axis, while Predictor variables are on the X-Axis, in a Scatter Plot. Why? Because y=mx+c !

  • Correlation scores are good indicators of things that are, well, related. While one variable may not necessarily cause another, a good correlation score may indicate how to choose a good predictor.
  • Always, always, plot and test your data! Both numerical summaries and graphical summaries are necessary! See below!!
Warning: And How about these datasets?
dataset mean_x mean_y std_dev_x std_dev_y corr_x_y
away 54.26610 47.83472 16.76983 26.93974 -0.0641284
bullseye 54.26873 47.83082 16.76924 26.93573 -0.0685864
circle 54.26732 47.83772 16.76001 26.93004 -0.0683434
dino 54.26327 47.83225 16.76514 26.93540 -0.0644719
dots 54.26030 47.83983 16.76774 26.93019 -0.0603414
h_lines 54.26144 47.83025 16.76590 26.93988 -0.0617148
high_lines 54.26881 47.83545 16.76670 26.94000 -0.0685042
slant_down 54.26785 47.83590 16.76676 26.93610 -0.0689797
slant_up 54.26588 47.83150 16.76885 26.93861 -0.0686092
star 54.26734 47.83955 16.76896 26.93027 -0.0629611
v_lines 54.26993 47.83699 16.76996 26.93768 -0.0694456
wide_lines 54.26692 47.83160 16.77000 26.93790 -0.0665752
x_shape 54.26015 47.83972 16.76996 26.93000 -0.0655833

Yes, you did want to plot that cute T-Rex in Orange, didn’t you? Here is the data then!!
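The same effect can be reproduced by hand with a deliberately tiny example. In this sketch, a four-point diamond and a four-point square have the same means, the same standard deviations, and the same (zero) correlation, yet they are plainly different shapes when plotted. The shapes and the helper name `summarise` are illustrative assumptions and have nothing to do with the Datasaurus data itself.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def summarise(points):
    """Return (mean_x, mean_y, std_dev_x, std_dev_y, corr_x_y) for a shape."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return mean(xs), mean(ys), stdev(xs), stdev(ys), pearson_r(xs, ys)


a = 1 / sqrt(2)  # half-side of the square, chosen to match the diamond's spread
shapes = {
    "diamond": [(-1.0, 0.0), (1.0, 0.0), (0.0, -1.0), (0.0, 1.0)],
    "square": [(-a, -a), (-a, a), (a, -a), (a, a)],
}

summaries = {name: summarise(points) for name, points in shapes.items()}
for name, stats in summaries.items():
    print(name, [round(value, 4) for value in stats])
```

The numerical summaries cannot tell the two shapes apart; only the plot can.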

Warning
  • Can selling more ice-cream make people drown?
  • Use your head about pairs of variables. Do not fall into this trap!

Readings

  1. Rohrer JM. Thinking Clearly About Correlations and Causation: Graphical Causal Models for Observational Data. Advances in Methods and Practices in Psychological Science. 2018;1(1):27-42. https://doi.org/10.1177/2515245917745629

  2. Case Study on Horror Movies. (Arvind: Bah.) https://notawfulandboring.blogspot.com/2024/04/using-pulse-rates-to-determine-scariest.html

  3. The Datasaurus Package: https://cran.r-project.org/web/packages/datasauRus/vignettes/Datasaurus.html

  4. A superb web-scrolly on Sustainable Development Goals (SDGs)! Go and see! https://datatopics.worldbank.org/sdgatlas/goal-1-no-poverty?lang=en

  5. Hunter, W. G. (1981). Six Statistical Tales. The Statistician, 30(2), 107. doi:10.2307/2987563. https://sci-hub.ru/10.2307/2987563

  6. A cartoon+interactive explanation of Simpson’s Paradox. Real fun! https://pwacker.com/simpson.html

  7. Futility Closet blog. (December 12, 2014). English by Degrees. https://www.futilitycloset.com/2014/12/12/english-by-degrees/ A short article that seems to speak of LLMs/ChatGPT…in 1948!!


License: CC BY-SA 2.0

Website made with ❤️ and Quarto, by Arvind V.

Hosted by Netlify .