Teaching Mathematics: What, When and Why

An in-depth examination of mathematics education, topic by topic


Exploratory Data Analysis

My daughter no. 1 (D_1) has long covid: lack of energy, brain fog, etc. Luckily her GP had her iron levels tested; they were almost zero. Two separate infusions have restored the levels and her health is much improved. Nurses at her hospital say that low iron levels are common in long covid patients.

D_1 was enrolled in a research program looking at long covid. The doctors involved had not heard of any correlation with iron levels and were not interested in pursuing the matter. It seems the research program has a research plan all worked out, even before getting any data.

All this brought to my mind the great John Tukey, with his emphasis on the “iterative nature of data analysis”. Or, as my former colleague Neil Diamond says, “we first eyeball the data”. This involves keeping an open mind as to which data are relevant to an investigation, and taking a careful look at outliers. Terry Mills has posted on the opportunity of teaching Tukey’s methods in schools under the heading “exploratory data analysis”.

So is this happening already? There must be fun opportunities for investigating issues in secondary schools. As an outsider, the only example I can think of is the chaos at daughter D_2’s school when all the parents were dropping off students simultaneously. There must also be great examples within industry. Suggestions please. What have you tried or witnessed? What might you try, if not buckling under reporting requirements?



14 responses to “Exploratory Data Analysis”

  1. […] has a new post on his Teaching Mathematics blog, on Exploratory Data Analysis. Please check it out, and support […]

  2. I did an assignment pretty much about this idea. It was for a Year 12 Mathematics Standard class. It was based on one I did at uni for a subject called Programming for Data Analysis.

    The structure was as follows:
    – Motivation (reason for exploring a topic)
    – The data (who, what, when, where, why)
    – Data analysis (graphs, statistical measures, etc)
    – Results and insights (what conclusions could be drawn and what it means)
    – Reflection (where to from here and what could have been improved)

    What students got out of it was how to make links between different parts of their data, draw conclusions that are reasonable, and decide what to do next: whether to build upon current insights or whether a change of direction was needed.

    The statistics needed were not intensive (e.g. Pearson’s correlation coefficient). More time was spent on formulating a cohesive report that drew valid conclusions (well, the students attempted that and I gave feedback). For the subject (the lowest level of maths for the HSC) it was just at the right level. The following year another teacher used the same assignment and had the same success.

    1. Interesting, Potii.
      Are you able to give examples of the investigations?

      1. Students were allowed to pick a topic of their choice or use datasets I already had.

        Here is one example:

        One student looked at whether height affects performance in professional basketball players. The performance indicators the student used were points per game, assists per game and rebounds per game. The student found, respectively, no correlation, a weak negative correlation and a weak positive correlation. The student also looked at the spread of the data via the standard deviation, but wasn’t able to use it effectively to explain aspects of the data or to roughly identify potential outliers. The student concluded that, from the performance indicators considered, it was not definitive that height affects performance, as there was a mix of results. Upon reflection, the student made good points about considering a wider variety of performance indicators (to avoid bias from selecting the wrong thing to look at) and also not considering just elite basketball players (comparisons to the general population, to see differences within and between populations of basketball players).

        This was a response I considered to be very good, especially considering the level of mathematics of the course. I felt most students realised that their initial thoughts had limitations, and that further research, better initial questions or a different approach to the analysis was needed to improve their understanding of the data they were exploring. Interestingly, though, no students questioned the validity or quality of the data. Something to work on in the future, I guess.
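
        For concreteness, here is a minimal Python sketch of that style of analysis. The numbers below are invented purely to mimic the pattern the student reported (no correlation, weak negative, weak positive); they are not the student’s dataset:

        ```python
        # Hypothetical data: height vs three per-game performance indicators.
        import numpy as np

        rng = np.random.default_rng(1)
        height = rng.normal(200, 8, 100)                         # heights in cm
        points = rng.normal(12, 5, 100)                          # unrelated to height
        assists = 20 - 0.07 * height + rng.normal(0, 1.5, 100)   # weak negative
        rebounds = 0.08 * height - 10 + rng.normal(0, 2, 100)    # weak positive

        for name, y in [("points", points), ("assists", assists), ("rebounds", rebounds)]:
            r = np.corrcoef(height, y)[0, 1]                     # Pearson correlation
            print(f"height vs {name}: r = {r:+.2f}, sd = {y.std(ddof=1):.2f}")
        ```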

  3. Not sure whether I can add much to the useful comments above, since I’ve not taught at primary or secondary levels, but here is some discussion about the role of exploratory data analysis:
    https://statmodeling.stat.columbia.edu/2022/09/06/exploratory-and-confirmatory-data-analysis/

    By the way, in the above there is a link to an interview with John Chambers, who was the principal developer of the S language and heavily involved in the subsequent R environment for statistical computing and graphics. The video is quite long but he does make reference to the important role that Tukey played.

    When it came out Tukey’s book was a complete revelation, particularly for a young statistician who knew how to determine a sufficient statistic and all those fancy things but had hardly ever seen a statistical graph.

    1. Welcome, Neil.
      I recall teaching statistics to management students, having had zero exposure to the subject myself. I was bored silly by the data representation component, and only started to take it seriously when I realised that histograms “approach” the probability density function as the sample size increases. It was only by working with research groups on complicated data that I learned the importance of data representation and informal exploration. This only took 40 years; better late than never?
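
      That “histograms approach the density” observation is easy to demonstrate. A minimal Python sketch, using a standard normal population purely for illustration:

      ```python
      # Density-scaled histograms of growing samples settle onto the true pdf.
      import numpy as np
      import matplotlib.pyplot as plt

      rng = np.random.default_rng(0)
      x = np.linspace(-4, 4, 200)
      fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
      for ax, n in zip(axes, [50, 500, 50000]):
          sample = rng.standard_normal(n)
          ax.hist(sample, bins=30, density=True, alpha=0.6)   # empirical histogram
          ax.plot(x, np.exp(-x**2 / 2) / np.sqrt(2 * np.pi))  # true N(0,1) density
          ax.set_title(f"n = {n}")
      plt.show()
      ```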

      On this site we don’t need to stick to primary and secondary syllabus work; all levels of teaching are grist. And I find all students are interested in stories from the world of business or research. Stories such as when one includes a supposedly irrelevant factor just to bulk out an experiment, only to find that the factor is actually significant.

  4. A good example of just that was an industry project conducted by two students in the Computer and Mathematical Sciences course at Victoria University. In the project an experiment was carried out to improve the rubber composition of the bush, an important component of the suspension system in a locally manufactured car (those were the days!). Following what they were taught, the students suggested that a 16-run two-level experiment be conducted in two sets of eight, with the hope that only the first eight would be required. They also suggested including some factors that the company did not consider likely to be important: it might turn out that they were, and even if they were not, it would mean the results held over a wider set of conditions. After the first set, involving six factors each at two levels, was completed and the students analysed the results, an unexpected finding arose: one of the additional factors seemed to be the most important. The company was sceptical, but agreed to run the second set of experiments two weeks later. Not only were the conclusions confirmed, but this was in spite of a sizeable “block” effect, where an extraneous change had been made to the machine producing the bush. The example was engaging for subsequent VU students, since students similar to them had conducted the design and analysis, and it has also been used successfully in a number of industry workshops over the years.
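
    For readers curious about the shape of such a design, here is a minimal Python sketch of a 16-run, six-factor, two-level fractional factorial split into two blocks of eight. The generators (E = ABC, F = BCD) and the blocking contrast (ACD) are illustrative assumptions on my part; the post does not record the actual design used:

    ```python
    # A 2^(6-2) fractional factorial: 16 runs, six two-level factors A-F,
    # split into two blocks of eight via the ACD contrast (which is
    # confounded only with three-factor interactions in this design).
    from itertools import product

    runs = []
    for a, b, c, d in product([-1, 1], repeat=4):  # 16 base-factor settings
        e = a * b * c      # generator E = ABC
        f = b * c * d      # generator F = BCD
        block = a * c * d  # blocking contrast: splits the 16 runs into two 8s
        runs.append((block, a, b, c, d, e, f))

    runs.sort()  # group the two blocks of eight together
    print("block  A  B  C  D  E  F")
    for r in runs:
        print("{:5d} {:2d} {:2d} {:2d} {:2d} {:2d} {:2d}".format(*r))
    ```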

    I was involved in a very simple example of EDA yesterday. I doubt it is new, but it was new to me. I have a four-year-old granddaughter who is very keen on drawing and writing. She produced two A4 sheets and told me that she would draw on one while I was supposed to draw on the other. I drew a picture of her and, since my drawing skills are not that good, I wrote her name next to it. I also drew Sadie the dog and wrote Sadie’s name. I then wrote down her baby sister’s name, her parents’ names, my wife’s and my name, and the pet turtle’s name (I probably should have got her to do the names). She had a look and noted that she shared one letter with her Mum and one letter with her Dad. I suggested we count the letters in the seven names. She wrote down the alphabet and together we proceeded to count the A’s, the B’s, and so on, while she wrote down the count opposite each letter. It turns out there were more E’s than any other letter. Coming up with activities like this, where some data is collected, collated and graphed in some way, seems to be an important learning experience as well as fun and interesting for all involved.
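
    The tally itself is a few lines of Python; the names below are placeholders, not the family’s actual names:

    ```python
    # Count letter frequencies across a set of names, as in the activity above.
    from collections import Counter

    names = ["amelie", "sadie", "eve", "peter", "jane", "george", "shelly"]
    counts = Counter(letter for name in names for letter in name)
    for letter, count in sorted(counts.items()):
        print(letter, "#" * count)  # a crude bar chart, one row per letter
    ```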

  5. Agreed. There’s a reason why management consultants are better at exploring things like this than theoretical statisticians. The former will start with the rock simple: graph it and look at it. It’s hilarious to me that the fussy math-rigor, divide-by-zero-hawk, epsilon-delta pushers need John Tukey (!) to tell them to do this. Like…duh! The math types may be “analysts” in terms of Rudin. But they sure aren’t “analysts” in terms of exploring practical issues in the wild.

    Maybe graph it a few different ways (and they will have a couple of ideas about which ones to do). After that, more sophisticated methods (out-of-sample testing, formal measures like ANOVA, normality tests, etc.) can be employed to test relations and hypotheses. But just to start out? Do a fishbone diagram, do a line chart, etc., just to come up with likely hypotheses to check.
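
    A minimal Python sketch of that progression, plotting first and only then reaching for a formal test; the skewed sample and the choice of the Shapiro-Wilk test are just illustrative:

    ```python
    # Step 1: eyeball the data. Step 2: only then run a formal test.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    rng = np.random.default_rng(2)
    data = rng.exponential(2.0, 200)  # skewed data, for illustration

    plt.hist(data, bins=20)           # the rock-simple first look
    plt.show()

    stat, p = stats.shapiro(data)     # formal normality test
    print(f"Shapiro-Wilk: W = {stat:.3f}, p = {p:.4g}")
    ```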

    And Say It with Charts by Zelazny is a good, simple book.

  6. Hi Anonymous
    I’m feeling vertigo; you and I are agreeing on things, again. As long as you don’t include Neil Diamond in your epsilon/delta brigade.

    I must, however, draw the line at management consultants. One example should suffice. They were consulted, at great expense, by the Bendigo C.A.E. to advise on reforms. Their advice was to amalgamate some departments. A decade or so later, some management consultants were again hired at great expense. They recommended splitting those departments.

    Then again, my colleague Rex Hunt occasionally moonlighted as a management consultant. He would have a quiet word with the shop floor workers: “What’s the problem? What should be done?” He would then use the answers as his (expensive) report. It always worked.

    And the big accounting firms are wonderful. Especially Deloitte, where my grandson works.

    1. Reflecting lower-down issues up to leadership is a common thing in consulting studies. But it’s not the only thing. You have to take a multi-mode approach when studying a situation. Look at the data. Do multiple interviews (you will get different points of view, and both right and wrong answers). If there is time, do a test. [I just got done doing a major assessment of a troubled new facility.]

  7. I hadn’t even read this from Diamond, but did just now, based on your remark.

    “When it came out Tukey’s book was a complete revelation, particularly for a young statistician who knew how to determine a sufficient statistic and all those fancy things but had hardly ever seen a statistical graph.”

    I repeat…DUH! He’s EXACTLY the blinded theoretician that needed a two-by-four calibration. Yeah…I 100% throw him in that boiling kettle with the cannibals huddling next to it. And it helped that Tukey was the one to tell him that, given the brainpower of the man. But it shouldn’t have needed that!

    If “duh” is not sufficient, you can just look at the 1940 article “Teaching of Statistics” by Hotelling (evah hoid of him?). And Hotelling was (even then) DEFENDING the development of rigor in statistics research…but he still said the rigor pushers needed some grounding via consulting to governments, ag researchers, etc. They need to realize that “your data is dependent on a tired night watchman filling out a form”, to misquote some English pioneer of stats whose name I can’t recall.

    https://www.jstor.org/stable/2235726

    And it’s intriguing to me that Hotelling was trying to defend the “turf” of schools of statistics against the (already occurring in 1940) pattern of engineering, biology, ag, etc. schools wanting to do their own courses with their own material and their own teachers. Just think about some poor set of nursing students (who have all kinds of other challenges in their regular courses) getting stuck with a blinded rigor pusher giving them their baby stats familiarization! I completely understand why they don’t want some R1 math grad student…and it’s not that they want the hassle…they would be happy to outsource…but they end up compelled to go hire adjuncts and give them the specific tasking…because the rigor pushers are so obtuse.

    https://www.youtube.com/watch?v=I7wREOySaxU

    1. Hi Anonymous
      I have set up the blog to accept all comments, but it seems to have held up your recent one, the one with “DUH”. It was rather intemperate. Perhaps you could take a deep breath and reconsider? Can a recent graduate just learning the ropes be called a blinded theoretician?

      1. The spam algorithm probably tripped because of the multiple links. It’s not yet smart enough to spank me for intemperance.

        “Can a recent graduate just learning the ropes be called a blinded theoretician?”

        Actually, yes. They are (generally) the worst blinded theoreticians. I see them routinely thinking they should teach real analysis before calculus and other insanities of that ilk. And absolutely a recent stats Ph.D. should also have some idea of how to start looking at a real problem in the field. Perhaps not extreme savviness. But absolutely with a curious mind. To need Tukey (!) to tell you to graph the data and look at it…ai yi yi!

  8. I’m reading a great book, “Computer Age Statistical Inference: Algorithms, Evidence, and Data Science” by Efron and Hastie (the inventors of, among other things, the bootstrap and generalised additive models, respectively). In the epilogue, they give a diagram of the development of the statistical discipline since the end of the 19th century. The diagram is a ternary diagram with vertices “Applications”, “Mathematics” and “Computation”. The path of Statistics begins at the end of the 19th century very close to the Applications vertex, and moves towards the Mathematics vertex over the next 50 years. Efron and Hastie say it is fair to describe the 1950s as the nadir of the influence of statistics on scientific applications. The path changes in the early 1960s towards a balance of Applications and Computation, with much less emphasis on Mathematics. Efron and Hastie comment: “The arrival of electronic computation in the mid-1950s began the process of stirring statistics out of the inward-gazing preoccupations with mathematical structure. Tukey’s paper ‘The future of data analysis’ argued for a more application- and computation-oriented discipline. Mosteller and Tukey later suggested changing the field’s name to data analysis, a prescient hint of today’s data science”.

    It is now much easier to do an appropriate analysis than it used to be. Unfortunately, it is also much easier to do an inappropriate analysis, for instance fitting a model unrelated to reality. Graphs make it possible to sort out the difference, and also to better convey the conclusions. Students would benefit if teachers could spend more time discussing how to plot data and interpret the plots.
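
    As a small illustration of that last point, here is a hedged Python sketch (data and model invented for the purpose): a straight-line fit to curved data can look almost plausible as numbers, but a residual plot exposes the misfit at once:

    ```python
    # Invented data with genuine curvature, fitted with a (wrong) straight line.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    x = np.linspace(0, 10, 80)
    y = 0.5 * x**2 + rng.normal(0, 2, x.size)  # the truth is quadratic

    slope, intercept = np.polyfit(x, y, 1)     # inappropriate linear model
    resid = y - (slope * x + intercept)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
    ax1.scatter(x, y)
    ax1.plot(x, slope * x + intercept, color="C1")
    ax1.set_title("linear fit: looks almost plausible")
    ax2.scatter(x, resid)
    ax2.axhline(0, color="C1")
    ax2.set_title("residual plot: the curvature is obvious")
    plt.show()
    ```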
