labs:lab13

# Differences

This shows you the differences between two versions of the page.

labs:lab13 [2012/04/21 08:01]
neu
labs:lab13 [2018/04/26 09:50] (current)
neu
Line 1: Line 1:
-====== Lab 13 Linked Lists ======+====== Lab 13 Fitting Data -- LAST LAB! ======

-{{labs:​13.jpg ​ }}Find a new partner and login at one of the Ubuntu machines. You can either make two terminals, for logging into each partner’s **galileo** account, or agree to work from one of your accounts. At the end of each lab, make sure that each partner has a copy of the code you wrote that day. Don’t delete your programs from the labs. It is important to have familiar and working code examples to refer to as you write new programs!+Find a new partner and login at one of the lab computers. You can either make two terminals, for logging into each partner’s **galileo** account, or agree to work from one of your accounts. At the end of each lab, make sure that each partner has a copy of the code you wrote that day. Don’t delete your programs from the labs. It is important to have familiar and working code examples to refer to as you write new programs!
+

Create a working directory called //lab13// for this lab session. ​ Create a working directory called //lab13// for this lab session. ​

+For the first 20 min of class, look at [[@../​html/​notes/​pdf/​L13_fits.pdf|these slides]] which cover three important topics:
+  * Comparing a model to some data: the chi^2
+  * Degrees of freedom and the reduced chi^2
+  * Tuning a Model to Best Match Some Data -- **fitting**

+You should get through slide 45 in this review. Do this thoughtfully -- it will help you understand what is going on below!

-====== A Linked List Example ======+===== 0. Intro to fitting data =====

-In this lab we will explore simple examples of linked list +To begin, simply run the following ​program: **//​lab13data//​** \\
-data structures. ​ Begin by studying ​the program ​below This program  +[[@../​html/​labs/​lab13/​lab13data.cpp|(source code)]]
-implements a STACK type of list Can you see why?+

-<code c> +The program creates three data files in your working area: h_200.dat, h_10k.dat, h_1M.dat. Verify the .dat files are created using the '​ls ​-lrt' command in your working area.
-// stack.cpp +
-// A simple linked-list example: STACK +
-// +

-#include <stdio.h> +These data files contain data from histograms randomly drawn from a Gaussian distribution with mean of 100 and sigma of 10The data files are created from histogram dump_entries function: the 1st column contains bin numbers, the 2nd column contains x-coordinates,​ the 3rd column contains y-coordinates,​ and the 4th column contains errors on the y-values ​dump_entries only prints out bins w/ non-zero entries.
-#include <stdlib.h> +
-#include <unistd.h>+

-const int NDATA=2000;+Three histograms should appear on your screen. To plot one of these by hand your self, in gnuplot view your histogram as follows:

-struct Node{              // a data element +  gnuplot> plot '​h_200.dat'​ using 2:3:4 with errorbars
-  int nbytes; +
-  char buffer[NDATA]; ​    // some data +
-  struct Node *next; ​     // a pointer to another element of this type +
-};+

-typedef struct Node Node;+{{labs:​200_event_hist.jpg?​200|Click to enlarge}}

+Make sure that you understand the plot command above. In gnuplot use **//help plot//** for more details and look at the histogram data files to see the data.  Ask your TA if you need assistance.

-int main(int argc,​char ​*argv[]){ +Next we will setup **gnuplot** to do a **fit** to the histograms in these files. **Fitting** is a common term for extracting parameters from some data source ​(for more information on the fitting tool you can always type in **gnuplot:** //**help fit**//). The task is to
-  ​if (argc<​2){ +  ​- Come up with some model that has tunable parameeters
-    ​printf("​Usage:​ stack <​datafile>​\n"​);​ +  - Vary those parameters
-    ​return 1; +  - Calculate a figure of merit such as the chi^2 between the data and the model under those parameter choices
-  ​}+  ​- obtain the best representation of the data with that model by iterating through different choices of the tunable parameters while minimizing the chi^2

-  Nodehead=NULL; ​   // pointer to the head of our list +The actual work of minimization will be accomplished by **gnuplot**. You will learn about steering ​the fits towards the chi^2 minimum and interpreting the results in this lab.
-                      // initially it points to nothing+

-  // open a data file in binary mode +Note: There are many modern tools to perform fitting exercises such as we are pursuing ​in today'​s lab. We start here with gnuplot since it is familiar to us and is really simpleAfter you understand the basics, you are encouraged to try other modern tools that accomplish these tasks.
-  // The structure of our data file is+
-  // An Integer: nbytes +
-  // char buffer[nbytes] +
-  // An Integer: nbytes +
-  // char buffer[nbytes] +
-  // ..+
-  FILE *input=fopen(argv[1],"​rb"​);​+

-  int nnodes=0; +First define ​model -- here an analytical function -- that you want to use to fit to the data in the histogram. We'll use a Gaussian ​to begin:
-  while (!feof(input)){ +
-    // first we allocate memory and create ​new data node +
-    Node *newNode; +
-    newNode = (Node*)malloc(sizeof(Node));​ +
-    int nbytes; ​         +
-    newNode->​nbytes=nbytes;​ +
-    nnodes++; ​                 // number of nodes created from file +
-  } +
-  printf("​created %d nodes\n",​nnodes);​+

-  // print the nodes +  ​gnuplot> g(x) = a*exp(-(x-m)**2/2/s**2)
-  ​system("​clear"​) // ​clear the screen +
-  Node* handle; +We have defined ​the functional form of a Gaussian, with normalization,​ a mean m, and sigma s.
-  ​handle=head;​ +To check your function, define some values for the parameters as follows:
-  ​while ​(handle !=){ +
-    for (int i=0; i<​handle->​nbytes;​ i+++  ​gnuplot> a=50
-      ​putchar(handle->buffer[i]); +  gnuplot> m=25
-    ​handle = handle->next; +  gnuplot> s=1
-    ​usleep(200000);  ​//0.2 seconds delay +  gnuplot> plot [0:100] g(x)
-    ​system("​clear"​); +
-  ​}+You should see a Gaussian shaped curve plotted in the range [0:100] with a mean of 25, a height of 50, and a sigma of 1.0
+
+Try to fit your 1st histogram using the fit command as follows:
+
+  gnuplot> fit [50:150] g(x) '​h_200.dat'​ using 2:3:4 via a,m,s
+
+Let's break this command down...\\
+[50:150] is a range option, it tells the fitting function to ignore any data with an x-value outside of this range.\\
+g(x) is the name of the function we will fit to the data\\
+'​h_200.dat'​ using 2:3:4 specifies the data file and the columns used in the fit with values x:​y:​sigma_y\\
+via a,m,s specifies the parameters of the function that gnuplot may vary in order to minimize chi^2\\
+
+It it is possible that your first attempt to fit the data failed. You may have seen an error message like the following: ​//“Undefined value during function evaluation”//,​ //​“Singular matrix”//,​ ...
+To get an idea of why your fit failed view your function and the data:
+
+  ​gnuplot> plot [0:150] '​h_200.dat'​ using 2:3:4 with errorbars, g(x)
+
+From this plot we see that our first choice of parameters for the model (a=50, m=25, s=1is VERY VERY different from the data we seek to fit (which, as stated above, has a mean of 100 and sigma of 10). It is no wonder that our attempt to fit the data using these **poorly chosen initial conditions** failed.
+
+The moral of this lesson is that fitting requires some finesse. While some programs are “smarter” than others, you can rarely plug in your data, push GO, and expect to get a meaningful fit as a result! You must find a starting point for your parameters that puts your function somewhere near the data.
+
+(note: when using **gnuplot** to fit data you must be careful not to set the errors of any point to 0, otherwise you'll see lots of NaN values... can you figure out why this would be the case?
+
+Now look at your histogram and estimate reasonable starting values to use for your parameters. These estimates can be very rough, the program will do most of the work. In this exercise you know the parameters, but let's go through it assuming they are unknown. Look at the plot for your data. For example: We can easily read off some starting values for our parameters by looking at the data plot. For example a~25 from the highest point, m~96, because that's kind of in the middle of the distribution,​ and s~15, because that range of x-values that contains most of the data.
+Now try the fit again and it should converge:
+
+<code>
+gnuplot> a=25
+gnuplot> m=96
+gnuplot> s=15
+gnuplot> fit [50:150g(x'​h_200.dat'​ using 2:3:4 via a,m,s
+
+</code
+
+Your output should look something like this after many messages pass by the screen:
+
+<​code>​
+After 6 iterations the fit converged.
+final sum of squares of residuals : 17.7616
+rel. change during last iteration : -2.42676e-06
+
+degrees of freedom ​(ndf: 21
+rms of residuals ​     (stdfit) = sqrt(WSSR/ndf)      : 0.919669
+variance of residuals (reduced chisquare) = WSSR/ndf : 0.845791 ​ Reduced chi-square
+Final set of parameters ​           Asymptotic Standard Error
+======================= ​           ==========================
+
+a               = 16.0944 ​         +/- 1.507        ​(9.366%  Final parameter values
+m               = 99.0838 ​         +/- 0.9412 ​      ​(0.9499%) ​ and parameter uncertainty
+s               = 9.65835 ​         +/- 0.7679 ​      ​(7.951%)
+
+
+correlation matrix of the fit parameters:

-  return ​0; +                a      m      s
-}+a               ​1.000
+m              -0.101 ​ 1.000
+s              -0.529 -0.335 ​ 1.000
</​code>​ </​code>​

+Compare your results to the chi-square probability contours [[@../​html/​notes/​pdf/​L13_fits.pdf|on pages 22-24 of this week's lecture slides]]. For your reduced chi-square and your number of degrees of freedom. What is the probability of observing a larger chi-square than the one you measured? Repeat your fit using the other data files. ​

-===== (A) FIFO =====+Plot your fit on top of your data using the command:

-Compile ​the above program ​[stack.cpp].  ​Also download ​the following ​data file:+  gnuplot> plot [50:150] '​h_200.dat'​ using 2:3:4 with errorbars, g(x)
+
+Rerun the //​**lab13data**// ​program ​to regenerate the histograms two more timesAnd redo your fit to the h_200 data each time. Observe how the reduced chi-square fluctuates and where these results lie on the probability contours. Repeat your fit on the files h_10K.dat and h_1M.dat.
+
+Notice the uncertainties that gnuplot prints out beside the best-fit parameter values.  ​How do the uncertainties on the parameters change for the higher statistics samples?
+
+
+
+===== 1. Excluding a model for a data set =====
+
+Define a new fit function:
+  f(x) = a*exp(-abs(x-m)/​s**2)
+Use this function to fit h_200.dat. If your fit fails and gnuplot returns an error message, you should **re-adjust the initial values** of your parameters a, m, s and try the fit again. Remember, you may need to "poke around"​ for reasonable starting parameters before you can start fitting. ​ Try guessing at parameter values and plotting the data and the function together to see if the function is near the data... Look at the plot of your data and fit together.
+
+What is your reduced chi-squared for fitting this function to h_200.dat? In this case would you say that your data are consistent or inconsistent with the functional form. If we were handed these data without knowing how they were generated, could we argue convincingly that they (probably) did not come from a parent distribution modeled by f(x)? Repeat this test for h_10K.dat. What can you conclude now?
+
+
+
+===== (2) A fit with a large background component =====
+
+This exercise will give you some more practice in doing a fit to a data distribution. ​ **Recommendation:​** open a second window on **galileo**,​ this will allow you to rerun the program to generate data without leaving **gnuplot**. ​ This is useful for the last part of this exercise. ​
+
+Here you will perform an experiment where protons are colliding with anti-protons at very high energies. From these collisions, you choose to study cases where an electron and a positron are produced in the debris that come out of the collision. Assuming that this electron/​positron pair is the result of a very massive particle (The [[Wp>​Z_boson]]) that decays in your detector, you can use the energies of these very light particles to measure the mass of the //Z//. If the //Z// particle is being produced in your experiment, you should see a peak or grouping of the electron+positron energies that corresponds to the mass of the Z.
+
+{{lab11:​particle.png}}
+
+Run the program: //​**lab11data**//​ to produce the files:
+  h_1k.dat
+  h_100k.dat
+
+These histograms show the rate of collisions observed (cross section) for various values of the electron+positron energy. The histograms have 100 and 100,000 entries respectively.
+
+Using gnuplot plot the histograms to your screen (for example):
<​code>​ <​code>​
-cp $LAB13/​anim1.dat .+plot "h_1k.dat" using 2:3:4 with errorbars </​code>​ </​code>​ -Run the program with this data file as input. ​ You should see an animation ​of the world in motion. ​The problem is that it's spinning ​the wrong way. Copy this program ​to [fifo.cpp] and modify the code to process the data nodes in FIFO structure, so the world spins in the correct direction.+The energies here are expressed in units of GeV, or 10^9 electron Volts. ​In the lowest statistics histogram you may or may not be able to discern ​peak representing ​the Z-boson resonance ​in the cross section Run the program several times, you may see a peak if the statistics favor you. ​ -===== (B) Circular Buffer =====+Looking at the histogram with 100K entries, you will notice what looks like an excess of entries as compared to the trend just above 90 GeV. Perhaps something is going on here... -Modify ​the program a second time [cbuff.cpp] ​to implement a circular buffer ​and make the world rotate 5 times by continually following ​the //next// pointers in your data structures You should be able to do this WITHOUT using the count of the number of nodes you read in from the data file!+**We will fit the data to determine the mass, width, ​and production rate of the //Z//. ** -**Have your TA verify ​your results**+First notice that two things are going on in your histogram. There is a background distribution in the cross section plus there is a clear peak around the mass for the //Z//. In order to fit these data you must account for both the background and signal components. Use the h_100k.dat file to perform your fit. + +Use a Lorenzian as the functional form for the peak itself: \\ <​math>​\large A\frac{\Gamma}{2\pi((x-E_0)^2+(\Gamma/​2)^2)}\mbox{~~}</​math>​. ​ \\ The parameter "**//​A//​**"​ is a measure of how often the Z is produced, "​**//​E0//​**"​ measures its mass, "<​math>​\large\Gamma</​math>"​ measures the observed width. ​ The width can be related to the lifetime of the particle via the [[Wp>​Uncertainty_principle]] for Energy and Time, or it could be dominated by imperfect energy resolution/​noise in your detector (in this case, a Gaussian peak would often be more appropriate). + +For the background distribution,​ use a simple 2nd-order polynomial: p1 + p2*x + p3*x^2. + +To fit these functions to the data, try the following:​ + + - In gnuplot, define the function f(x) as the Lorentzian function above, with parameters a, e0 and gamma. + - Define the function g(x) as the polynomial for the background function. + - Start out by asking gnuplot to fit g(x) to the data: <​code>​fit g(x) "​h_100k.dat"​ using 2:3:4 via p1,​p2,​p3</​code>​ + - After fitting, plot the data and g(x) together, to see how well it fits:<​code>​plot "​h_100k.dat"​ using 2:3:4 with errorbars, g(x)</​code>​ + - Now try fitting the peak. Tell gnuplot to just look at points around the peak, by specifying a narrow range:<​code>​fit [85:95] f(x) "​h_100k.dat"​ using 2:3:4 via a,​e0,​gamma</​code>​ + - Now plot the sum of both fits together with the data:<​code>​plot "​h_100k.dat"​ using 2:3:4 with errorbars, f(x)+g(x)</​code>​ + - Modifying the values of a, e0 and gamma by hand, and replot. ​ Repeat this until you have a pretty good fit. + - Now define a new function:<​code>​h(x) = f(x)+g(x)</​code>​ + - Ask gnuplot to try fitting h(x) to the data:<​code>​fit h(x) "​h_100k.dat"​ using 2:3:4 via a,​e0,​gamma,​p1,​p2,​p3</​code>​ + - Does your best fit show significant deviations from the data? + - Once you have a good fit, plot the following:<​code>​plot "​h_100k.dat"​ using 2:​($3-g($2)):​4 with erorbars,​f(x)</​code>​This shows the peak, with the background subtracted, and your best-fit Lorentzian curve superimposed. + +What are the values and errors on the values for "​**//​E0//​**"​ , the mass of the //Z//, and "<​math>​\large\Gamma</​math>"​ , the width of the Z from your fit, and **//A//** the quantity related to the production rate? + +Is your chi-square reasonable? How close can you get to a reduced chi^2 of 1.0? + +**Write down your best values and the errors from the fit for your three signal parameters.** + +Rerun the program to generate the histograms until you see a hint of a peak in the h_1k distribution. +Try the your fit again and extract the parameters for your signal. ​ Do a relative measure of the rate of //Z// production from the two samples (e.g. compare A_h_100k/​100,​000 to A_h_1k/​1000). ​ Which one gives you a greater estimate of the rate of //Z// production? ​ Remake several more histograms and redo the fits. Do you notice a trend in the comparison? ​ This is a common problem/​bias of low statistics! ​ What is the nature of this bias? + +**Discuss ​your results ​w/ your TA.** + + + + +====== Course Evaluations ​ ====== + +If you have time at the end of lab or just afterwards, submit a course evaluation through UVaCollab -- thus fulfilling the extra credit assignment for [[hw:​hw14|HW 14]]. Do this independently since the extra credit will be given to individuals not lab partners. You may also fill this out after the lab if you prefer. + +Please be thoughtful and provide substantive insight towards the goal of improving this course for future semesters. + + +!- +[[@../​html/​labs/​lab11/​example|Link]] to an example fit. Copy the data file and the plot file. You can cut/paste +commands from the plot file to run the demo. +-! + + +!- + +===== (2) Trying another fitting program ===== + +Now let's take a quick look at another fitting program. ​ This is one that's been used in particle physics for a couple of decades. ​ Start out by typing: + + cp$P2660LIB/​pawfit.kumac .
+
+Then start up the program by typing "​paw",​ which stands for "​Physics Analysis Workstation"​. ​ You'll see something like this:

-Next try your fifo or circular buffer program with the following data file.
<​code>​ <​code>​
-cp $LAB13/anim2.dat .+ paw + ​****************************************************** + ​* ​ * + ​* ​ W E L C O M E to P A W * + ​* ​ * + ​* ​ ​Version 2.14/04 12 January 2004 * + ​* ​ * + ​****************************************************** + ​Workstation type (?=HELP) <​CR>​=1 : </​code>​ </​code>​ -===== (C) A Little Challenge ===== +At the prompt, just press return, and you should see a new, blank window pop up. Now, at the "PAW >" command prompt, type: -8-) This one is pretty cool :!:+ -The following file is in a different format from the others: + exec pawfit h_10k.dat - ​cp$LAB13/​anim3.dat ​.+  ​
+Note that you didn't need to enter any guesses for the parameters Try it again by typing:

-Can you modify your program to read the frames ​and play the file? +  exec pawfit h_200.dat
-Note to make things easiereach frame is separated by a single "#"​ character ​and the "​%"​ marks the end of the data.  View the file in **emacs** to see the structure.+
+Paw will print out the best-fit parameters ​and other information in your terminal window, and also display ​the best-fit parameter values, ​the chi-squared value and the number of degrees of freedom ​in the upper corner of the graphics window.

-Note: the lines are very long in this file.  ​To run your program open a second terminal to connect to **galileo**then set the terminal font to a small size (I recommend a 5 or 6 point font) and stretch the window out so it is fairly wide and tall.  You'll need to do this to prevent the lines in the image from wrapping in your terminal. ​ Use your first terminal to do your compiling and run the program in this second window.+To exit from pawtype "​quit" ​or "​exit"​ at the command prompt.

-Hints:  +If you're curious, take a look inside ​the file pawfit.kumac,​ using emacs, to see what it does.  This is paw macro file that executes a few commands for you.
-  * increase your buffer size to 20000 characters to handle ​the large frames in this file or you will SEG FAULT +
-  * if the frames are very choppy due to sending the data over the nextworkyou can copy the files to your LAB PC and work locally +
-  * remember EVERY line except the last ends w/ LINE FEED, '​\n',​ character, even the frame spacer lines +

+
+==== A useful tip/​reminder - try this before you leave ====
+
+In **gunplot** it is easy to turn any plot on your screen into a graphic file.  This is very useful if you are manipulating a plot by hand and then want to save a file for printing. ​ When you complete the fit above simply to the following when you have a plot to save.
+<​code>​
+set term pdf
+set output "​myplot.pdf"​
+replot
+unset output
+set term x11
+  to view/check your pdf file from without leaving gnuplot, you may type
+!xpdf myplot.pdf
+</​code>​
+
+The //unset output// command closes the graphics file.  And //set term wxt// resets the default display to be your monitor.

-====== Course Evaluations ​ ======

-If you have time at the end of lab or just afterwards, please submit a course evaluation through UVaCollab. Open two browser windows (or tabs) so each partner can complete the questionnaire. ​ Or spread out to the free computers as necessary. You may also fill this out after the lab if you prefer.  +-!
-Thanks+