
# ASTRONOMY 3130: Guide to Making Proper Scatterplots

This webpage has borrowed figures and content from the following websites, which are also worth visiting: http://www2.selu.edu/Academics/Faculty/dgurney/Math241/StatTopics/ScatGen.htm , http://misterguch.brinkster.net/graph.html and http://tfscientist.hubpages.com/hub/How-to-Draw-a-Scientific-Graph.

In this class, and very often in a scientific career, we need to understand how one measured quantity varies as a function of another. The goal is typically to see whether two properties are correlated, which may reveal a dependency that points to a physical connection. This is best done by plotting the data in what is called a "scatterplot", which shows how one variable varies as a function of the other. Once a pattern emerges that suggests a correlation/dependency, it is typical to fit the trend, which may reveal a fundamental mathematical relationship.

This webpage lays out the standard expectations for making a proper scatterplot. I will expect the conventions laid out here to be followed in your lab reports.

## I. BASIC RULES FOR PLOTTING SCATTERPLOTS

You should ensure that you follow these basic rules when plotting a graph:

1. The independent variable is always plotted along the abscissa and the dependent variable is plotted on the ordinate.

Be clear on the definitions of the above terms:

• The independent variable is the variable that you control and adjust.

For example, in Lab 1 this could be the magnification, which you are changing from high to low as you change eyepieces from short to long focal lengths.

• The dependent variable is the variable that you don't know before the experiment begins and is the one you are measuring.

For example, in Lab 1 this would be the fields of view that you measure for each different eyepiece.

If x is the independent variable, the dependent variable is f(x).

##### Examples of the proper orientation of variables on a plot: The dependent variable is on the ordinate (y-axis). From http://www.narragansett.k12.ri.us/resources/necap%20support/gle_support/Math/resources_functions/dep_indep.htm .

• If you have ordered pairs of points to plot, the abscissa is the element of the ordered pair that is plotted along the horizontal axis (sometimes called the "x-axis").

• When you have ordered pairs of points to plot, the ordinate is the element of the ordered pair that is plotted along the vertical axis (sometimes called the "y-axis").

##### From http://piratesandrevolutionaries.blogspot.com/2008/12/abscissa-and-ordinate-of-cartesian.html.

2. Plot numerical scales along each axis.

Your plot should have regularly spaced tickmarks and numerical scales along each axis. You do not have to give a number for each tickmark, but you should give enough numbers that the reader can easily interpret the plot.

Plots are often neater when the "major" tick mark intervals are subdivided into "minor" tick mark intervals; only the major tick marks should be labeled.

##### Use of major and minor tick marks. From http://www.originlab.com/www/helponline/origin/en/UserGuide/Configuring_Axis_Scales_and_Tick_Marks.html .

3. Make good use of available space.

The range of the numerical scale for each axis should be sensible -- large enough to include all data of interest, but small enough that there aren't large sections of the plot that are empty, serve no purpose, and have the net effect of compressing the available space for the plotted data. The plot itself should be large enough and neatly drawn so as to be easily legible.

##### Three examples of scatterplots making poor use of space and one (the last one) with a good use of space. From http://www.itl.nist.gov/div898/handbook/eda/section3/scattera.htm .

Despite the admonition to make good use of space, it is often helpful to include the origin (0,0) in the plot, if it isn't going to make the plot ridiculous in the use of space.
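If you make your plots with a program rather than by hand, the axis range can be worked out from the data themselves. The following is only a minimal Python sketch of that idea: the function name `axis_limits`, the 5% margin, and the sample values are all illustrative assumptions, not requirements of this guide.

```python
def axis_limits(values, pad_fraction=0.05, include_origin=False):
    """Return (low, high) axis limits that hold all values with a small margin.

    pad_fraction is an assumed default; adjust it to taste.
    """
    low, high = min(values), max(values)
    if include_origin:
        # Stretch the range so that zero is inside it.
        low, high = min(low, 0.0), max(high, 0.0)
    span = high - low
    if span == 0:
        # All values identical: fall back to a window one unit wide.
        span = 1.0
    pad = pad_fraction * span
    return low - pad, high + pad


# Made-up field-of-view values in arcminutes, for illustration only.
fov_arcmin = [12.0, 18.5, 25.0, 41.0]
print(axis_limits(fov_arcmin))
print(axis_limits(fov_arcmin, include_origin=True))
```

Comparing the two printed ranges shows how much plot area is spent on including the origin, which is the trade-off described above.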

Sometimes you will have spurious, outlier points that are very distant from the main group. In the case of ridiculously separated points, rather than squeeze a lot of points into a small area in the plot just so you can include one extreme outlier, it is acceptable to leave the outlier out of the plot, but indicate its presence with an arrow pointing to it just inside the edge of the plot nearest to the point. Generally, any such point is ignored in the estimation of any linear trend from the data (see below).

##### Example of a scatterplot showing an obviously suspicious outlier. From http://www.itl.nist.gov/div898/handbook/eda/section3/scattera.htm .

##### Example of a scatterplot (right hand panel) where some points are so far from the main group of points that their positions are only shown by arrows to indicate the direction they lie (and a label stating the abscissa point corresponding to that point, so the reader can determine how far from the plot the point lies). From Munoz et al. (2005, Astrophysical Journal Letters, ApJ, 631, L137, http://iopscience.iop.org/1538-4357/631/2/L137/fulltext/19470.figures2.html ) .

4. Always label what variable is plotted along each axis. These labels should also make clear what units are being used for the variables being plotted.

For example, in Lab 1 you will be making a plot that should have on the ordinate a label of "field of view (arcminutes)". If you do not put the units it will be impossible for the reader to know whether you are talking about degrees, arcminutes, radians, etc.

In the "Car Wash" plot above, you can see that the revenue is in "\$". In the golfing plots you see the units are speed in mph and distance in yards.

5. Plot your (x, f(x)) pairs on the plot using points or other clear symbols.

If you are plotting multiple data sets on the same plot, make sure each set has a unique and easily distinguishable symbol (or color) for all its points. Include a legend in the plot that explains what the different symbols mean.

##### Example of a scatterplot (showing what relationship I cannot understand!) showing two sets of points and best fit lines. The "legend" in the upper left hand diagram explains the points and the lines associated with each set of points. From http://people.ucsc.edu/~ggilbert/RTransition.html .

6. Do not use a bar graph!

Many commercial plotting packages are made with business, not science, applications in mind, and their default is to plot vertical bar graphs, not point plots. Unless you are intentionally trying to make a histogram you should not use a bar graph.

##### Use of a bar graph when a scatterplot is appropriate. Do NOT do this!! From http://4dpiecharts.com/2010/08/ .

7. If you know the uncertainties in your measurement, plot the points with appropriate error bars showing the uncertainties in both the downward and upward directions.

The error bars are made from the center of the point with a length equal to the estimated uncertainty in each direction. For example, in the figure below, the uncertainty in the ordinate is 0.2 units, so there is an upward error bar of 0.2 units length and a downward error bar of the same length.

##### Error bar construction from http://www.upscale.utoronto.ca/PVB/Harrison/ErrorAnalysis/Graphical.html .

##### Example of a scatterplot with the uncertainties in the ordinate shown as errorbars. Drawing only vertical error bars is typically done when you very accurately know, or set, the abscissa values, so that their uncertainty is negligible. Note in this case that each point has a different associated uncertainty. From http://www.omatrix.com/spmanual/serrorbars2.gif .

##### Example of a scatterplot with the uncertainties in both directions shown as error bars. From http://www.grapl.com/desktop/whyuse/education/science.html .

8. Do NOT connect the dots!!

Many plotting programs have a feature that connects adjacent points with lines. Do not play "connect the dots"! This rarely conveys useful information.

The plot below is NOT what to do:

##### Example of a BAD scatterplot, which, among numerous other problems (discussed below) is connecting the dots. From http://misterguch.brinkster.net/graph.html .

In most cases your data are not precise enough to be considered "absolute" -- that is, the measurements always have uncertainties (random errors) which scatter them from any actual perfect global trend. This means that any trend from one individual point to the next one is not very meaningful and we are only concerned with estimating from the entire ensemble of points what the global trend is. So it is totally worthless and meaningless to "connect the dots" on a graph. Instead...

9. Draw a line of best fit.

Most commonly (but not always) we will be seeing a linear correlation between dependent and independent variable. You should estimate this linear trend by plotting a line that passes through or near as many data points as you can. (If you suspect the correlation is not linear, you can draw a curve of the shape that you expect for the correlation.)

##### Example of a GOOD scatterplot of the same data as shown just above, which, among numerous other positive aspects (discussed below) fits a mean trend line through the points. From http://misterguch.brinkster.net/graph.html .

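If you would rather compute the line than draw it by eye, one common way to make "passes through or near as many data points as you can" quantitative is ordinary least squares, which minimizes the sum of the squared vertical distances between the points and the line. The sketch below is a minimal pure-Python version of that technique; the function name `fit_line` and the data values are made up for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations in x, and of x-y cross deviations.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative values only (not from any lab).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(x, y)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

To draw the fitted line you need only two points on it, for example the values of `slope * x + intercept` at the left and right edges of the plot.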
10. Beware of outliers and influential observations when drawing a best fit line and interpreting your results.

Often you will have outlier measurements -- anomalous results that are simply way off the trend of the rest of the points (even if their formal uncertainty is small). An example of this was already shown above. Generally these happen for unknown reasons, but typically it means that something just went plain wrong in the experiment (e.g., a human error of some kind). It is generally considered acceptable to ignore truly suspicious points in your linear fit; but you should clearly mark such ignored points (for example, by circling them).

##### Example of a scatterplot showing an obviously suspicious outlier, which is circled and ignored in the fitting of the best fit (regression) line. From http://www2.selu.edu/Academics/Faculty/dgurney/Math241/StatTopics/ScatGen.htm .

In many plotting routines this line can be more accurately calculated using linear regression. Such routines not only more accurately calculate a best fitting line, but they can automatically and objectively reject outliers (e.g., through "iterative n-sigma rejection") and can weight the points that influence the fit by their uncertainties.
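To make "iterative n-sigma rejection" and weighting by uncertainties concrete, here is a minimal pure-Python sketch of one simple variant. It is an illustration under stated assumptions, not the algorithm of any particular plotting package: points are weighted by 1/sigma², the scatter is taken as the RMS of the residuals about the current fit, and the function names, the cut of 2 sigma, and the data are all hypothetical.

```python
import math


def weighted_fit(x, y, sigma):
    """Weighted least-squares line y = slope * x + intercept (weights 1/sigma^2)."""
    w = [1.0 / s ** 2 for s in sigma]
    sum_w = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sum_w
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def clipped_fit(x, y, sigma, n_sigma=2.0, max_iter=10):
    """Refit repeatedly, dropping points farther than n_sigma * scatter from the line.

    Needs at least three points. Returns (slope, intercept, rejected_indices).
    """
    keep = list(range(len(x)))
    for _ in range(max_iter):
        slope, intercept = weighted_fit(
            [x[i] for i in keep], [y[i] for i in keep], [sigma[i] for i in keep]
        )
        resid = [y[i] - (slope * x[i] + intercept) for i in keep]
        # RMS scatter about the current line (two fitted parameters).
        scatter = math.sqrt(sum(r * r for r in resid) / (len(keep) - 2))
        survivors = [i for i, r in zip(keep, resid) if abs(r) <= n_sigma * scatter]
        # Stop when nothing more is rejected or too few points would remain.
        if len(survivors) == len(keep) or len(survivors) < 3:
            break
        keep = survivors
    # Final fit on the surviving points only.
    slope, intercept = weighted_fit(
        [x[i] for i in keep], [y[i] for i in keep], [sigma[i] for i in keep]
    )
    rejected = [i for i in range(len(x)) if i not in keep]
    return slope, intercept, rejected


# Made-up data following roughly y = 2x + 1, with one bad point at x = 5.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [3.1, 4.8, 7.15, 8.9, 25.0, 13.05, 14.85, 17.2, 18.95, 21.1]
sigma = [0.5] * len(x)
slope, intercept, rejected = clipped_fit(x, y, sigma)
print(slope, intercept, rejected)
```

A caution on the choice of cut: with only ten points, a single outlier inflates the scatter so much that no residual can reach 3 times the scatter, which is why this example uses 2. Any point rejected this way should still be marked on the plot, as described above.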

Also be aware of influential observations in a scatter diagram. These are isolated points that lie far, in the horizontal direction, from the majority of other points. These isolated points may have no better precision than the other points but, by virtue of their placement in the plot, can have undue influence on a fit or on the interpretation of the overall trend.

##### Example of a scatterplot showing several points on the right that have a great influence on the overall position of the best fit line. From http://www2.selu.edu/Academics/Faculty/dgurney/Math241/StatTopics/ScatGen.htm .

A point is an "influential observation" if removing it from the plot, or relocating it, substantially changes the position of the best fit line.
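The removal test just described is easy to automate: refit the line with each point left out in turn and see how much the slope moves. The sketch below does this in pure Python with an ordinary least-squares fit; the function names and the data (five clustered points plus one isolated point far to the right) are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def slope_changes(x, y):
    """Change in the fitted slope when each point, in turn, is left out."""
    full_slope, _ = fit_line(x, y)
    changes = []
    for i in range(len(x)):
        # Refit using every point except point i.
        slope, _ = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        changes.append(slope - full_slope)
    return changes


# Made-up data: the last point sits far to the right of the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [1.2, 1.9, 3.1, 4.2, 4.8, 8.0]
changes = slope_changes(x, y)
most_influential = max(range(len(x)), key=lambda i: abs(changes[i]))
print(most_influential, changes[most_influential])
```

In this example the isolated point dominates: leaving it out changes the slope far more than leaving out any of the clustered points does.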

11. Put a title above the graph or make a descriptive caption for it (beneath the figure).

For example, a plot in Lab 1 should have the title "Field of view as a function of magnification."
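The rules above can be collected into one short plotting script. The sketch below uses Python with the numpy and matplotlib libraries, which is an assumed choice of tools (nothing in this guide requires them), and the measurements are made-up values for a generic distance-versus-time experiment rather than data from any lab.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import AutoMinorLocator

# Hypothetical measurements (illustrative values only).
time_s = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])          # independent variable
distance_m = np.array([2.3, 3.8, 6.4, 7.9, 10.2, 11.8])    # dependent variable
distance_err_m = np.array([0.4, 0.4, 0.5, 0.5, 0.6, 0.6])  # uncertainties

# Rule 9: straight-line least-squares fit (degree-1 polynomial).
slope, intercept = np.polyfit(time_s, distance_m, 1)

fig, ax = plt.subplots(figsize=(6, 4))

# Rules 5-8: points with error bars; fmt="o" draws markers only,
# so the dots are not connected and no bars are used.
ax.errorbar(time_s, distance_m, yerr=distance_err_m,
            fmt="o", capsize=3, label="measurements")

# Rule 9: draw the fitted line across the plotted range.
fit_x = np.array([0.0, 6.5])
ax.plot(fit_x, slope * fit_x + intercept, "-", label="least-squares line")

# Rule 1 and rule 4: independent variable on the abscissa, labels with units.
ax.set_xlabel("time (s)")
ax.set_ylabel("distance (m)")

# Rule 3: ranges that include the origin without wasting space.
ax.set_xlim(0, 6.5)
ax.set_ylim(0, 14)

# Rule 2: unlabeled minor ticks between the labeled major ticks.
ax.xaxis.set_minor_locator(AutoMinorLocator())
ax.yaxis.set_minor_locator(AutoMinorLocator())

# Rule 5 (legend) and rule 11 (title).
ax.legend(loc="upper left")
ax.set_title("Distance as a function of time")

fig.savefig("scatterplot.png", dpi=150)
```

The output file name `scatterplot.png` is arbitrary. If you are working interactively you could drop the `matplotlib.use("Agg")` line and call `plt.show()` instead of saving.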

## II. EXAMPLES OF GOOD AND BAD PLOTS

Below we have two plots of the same data, with one showing a BAD graph with a number of problems we already discussed, and the other showing how to do things properly.

##### From http://misterguch.brinkster.net/graph.html .

The above plot has violated at least four of the rules listed above, and even if the data points are correctly plotted, you can't interpret what they mean:

• There is no title or caption. What is this a plot of?

• There are no axis labels. What are those numbers representing?

• There are also no units on either axis. Again, what are these numbers and on what scale?

• There is no reason to "connect the dots" -- these lines have no meaning and their slopes are clearly dictated by the random errors that are scattering the points away from the main trend.

• No error bars plotted.

Here is a better version of the same plot (though, without error bars still):

##### From http://misterguch.brinkster.net/graph.html .

Here is another example of a bad/good pair of plots of the same data. Can you identify the problems in the "bad" one?

## III. HOW TO PLOT GOOD SCIENCE GRAPHS IN EXCEL

It's perfectly acceptable to hand in neatly hand-drawn plots for this class (but do them in pencil!).

On the other hand, there are many commercially made plotting programs one can use to make good scatterplots, but often students are most familiar with Excel.

I found this video, which shows how to use Excel to make reasonably good plots.