labs:lab13

# Differences

This shows you the differences between two versions of the page.

labs:lab13 [2012/04/21 08:01]
neu
labs:lab13 [2017/04/27 11:30] (current)
neu
Line 1: Line 1:
-====== Lab 13 Linked Lists ======+====== Lab 13 Fitting Data -- LAST LAB! ======

-{{labs:​13.jpg ​ }}Find a new partner and login at one of the Ubuntu machines. You can either make two terminals, for logging into each partner’s **galileo** account, or agree to work from one of your accounts. At the end of each lab, make sure that each partner has a copy of the code you wrote that day. Don’t delete your programs from the labs. It is important to have familiar and working code examples to refer to as you write new programs!+Find a new partner and login at one of the lab computers. You can either make two terminals, for logging into each partner’s **galileo** account, or agree to work from one of your accounts. At the end of each lab, make sure that each partner has a copy of the code you wrote that day. Don’t delete your programs from the labs. It is important to have familiar and working code examples to refer to as you write new programs!
+

Create a working directory called //lab13// for this lab session. ​ Create a working directory called //lab13// for this lab session. ​

+===== 0. Intro to fitting data =====

+To begin, simply run the following program: **//​lab13data//​** \\
+[[@../​html/​labs/​lab13/​lab13data.cpp|(source code)]]

-====== A Linked List Example ======+The program creates three data files in your working area: h_200.dat, h_10k.dat, h_1M.dat. These data files contain data from histograms randomly drawn from a Gaussian distribution with mean of 100 and sigma of 10. The data files are created from histogram dump_entries function: the 1st column contains bin numbers, the 2nd column contains x-coordinates,​ the 3rd column contains y-coordinates,​ and the 4th column contains errors on the y-values. ​ dump_entries only prints out bins w/ non-zero entries.

-In this lab we will explore simple examples of linked list +In gnuplot view your histogram as follows:
-data structures. ​ Begin by studying the program below. ​ This program  +
-implements a STACK type of list.  Can you see why?+

-<code c> +  gnuplotplot 'h_200.dat' using 2:3:4 with errorbars
-// stack.cpp +
-// A simple linked-list exampleSTACK +
-// +

-#include <stdio.h> +{{labs:​200_event_hist.jpg?​200|Click to enlarge}}
-#include <​stdlib.h>​ +
-#include <​unistd.h>​+

-const int NDATA=2000;+Make sure that you understand the plot command above. In gnuplot use **//help plot//** for more details and look at the histogram data files to see the data.  Ask your TA if you need assistance.

-struct Node{              // a data element +Next we will setup **gnuplot** to do fit to the data. **Fitting** is a common term for extracting parameters from data (for more information on the fitting tool you can always type in **gnuplot:​** ​//**help fit**//). The task is to vary parameters in some functional model to obtain the best representation ​of the data with that model. This is accomplished by choosing parameters that minimize the chi^2 between the data and the model. The actual work of minimization will be accomplished by **gnuplot**. You will learn about steering the fits towards the chi^2 minimum and interpreting the results in this lab.
-  int nbytes; +
-  char buffer[NDATA]; ​    // some data +
-  struct Node *next;      ​// a pointer ​to another element ​of this type +
-};+

-typedef struct Node Node;+First define the function you want to use to fit the data, your model for the data. We'll use a Gaussian to begin:

+  gnuplot> g(x) = a*exp(-(x-m)**2/​2/​s**2)

-int main(int argc,char *argv[]){ +We have defined the functional form of a Gaussianwith normalization,​ a mean m, and sigma s.
-  if (argc<​2){ +To check your function, define some values for the parameters as follows:
-    printf("​Usagestack <​datafile>​\n"​);​ +
-    return 1; +
-  }+

-  ​Node* head=NULL;    // pointer to the head of our list +  ​gnuplot> a=50
-                      // initially it points to nothing+  ​gnuplot>​ m=25
+  gnuplot> s=1
+  gnuplot> plot [0:100] g(x)

-  // open data file in binary mode +You should see Gaussian shaped curve plotted ​in the range [0:100with a mean of 25, a height of 50, and a sigma of 1.0
-  // The structure of our data file is: +
-  // An Integer: nbytes +
-  // char buffer[nbytes] +
-  // An Integernbytes +
-  // char buffer[nbytes] +
-  // ... +
-  FILE *input=fopen(argv[1],"​rb"​);​+

-  int nnodes=0; +Try to fit your 1st histogram using the fit command as follows:
-  while (!feof(input)){ +
-    // first we allocate memory and create a new data node +
-    Node *newNode; +
-    newNode = (Node*)malloc(sizeof(Node));​ +
-    int nbytes; ​         +
-    newNode->​nbytes=nbytes;​ +
-    nnodes++; ​                 // number of nodes created from file +
-  } +
-  printf("​created %d nodes\n",​nnodes);​+

-  ​// print the nodes +  ​gnuplot> fit [50:150] g(x) '​h_200.dat'​ using 2:3:4 via a,m,s
-  ​system("​clear"​);  ​// clear the screen +
-  Node* handle; +Let's break this command down...\\
-  ​handle=head;​ +[50:150] is a range option, it tells the fitting function to ignore any data with an x-value outside of this range.\\
-  ​while (handle ​!=0 ){ +g(xis the name of the function we will fit to the data\\
-    for (int i=0; i<handle->nbytes; i++) +'​h_200.dat'​ using 2:3:4 specifies the data file and the columns used in the fit with values x:​y:​sigma_y\\
-      ​putchar(handle->buffer[i]); +via a,m,s specifies the parameters of the function that gnuplot may vary in order to minimize chi^2\\
-    ​handle = handle->next; +
-    ​usleep(200000);  ​//0.2 seconds delay +It it is possible that your first attempt to fit the data failed. You may have seen an error message like the following: ​//“Undefined value during function evaluation”//,​ //​“Singular matrix”//,​ ...
-    ​system("​clear"​); +To get an idea of why your fit failed view your function and the data:
-  ​}+
+  ​gnuplot> plot [0:150] '​h_200.dat'​ using 2:3:4 with errorbars, g(x)
+
+The moral of this lesson is that fitting requires some finesse. While some programs are “smarter” than others, you can rarely plug in your data, push GO, and expect to get a meaningful fit as a resultYou must find a starting point for your parameters that puts your function somewhere near the data.
+
+(note: when using **gnuplot** to fit data you must be careful not to set the errors of any point to 0, otherwise you'll see lots of NaN values... can you figure out why this would be the case?
+
+Now look at your histogram and estimate reasonable starting values to use for your parameters. These estimates can be very rough, the program will do most of the work. In this exercise you know the parameters, but let's go through it assuming they are unknown. Look at the plot for your data. For example: We can easily read off some starting values for our parameters by looking at the data plot. For example a~25 from the highest point, m~96, because that's kind of in the middle of the distribution,​ and s~15, because that range of x-values that contains most of the data.
+Now try the fit again and it should converge:
+
+<code
+gnuplota=25
+gnuplot> m=96
+gnuplot> s=15
+gnuplot> fit [50:150g(x'​h_200.dat'​ using 2:3:4 via a,m,s
+
+Your output should look something like this after many messages pass by the screen:
+After 6 iterations the fit converged.
+final sum of squares of residuals : 17.7616
+rel. change during last iteration : -2.42676e-06
+
+degrees of freedom ​(ndf: 21
+rms of residuals ​     (stdfit) = sqrt(WSSR/ndf)      : 0.919669
+variance of residuals (reduced chisquare) = WSSR/ndf : 0.845791 ​ Reduced chi-square
+Final set of parameters ​           Asymptotic Standard Error
+======================= ​           ==========================
+
+a               = 16.0944 ​         +/- 1.507        ​(9.366%  Final parameter values
+m               = 99.0838 ​         +/- 0.9412 ​      ​(0.9499%) ​ and parameter uncertainty
+s               = 9.65835 ​         +/- 0.7679 ​      ​(7.951%)
+
+
+correlation matrix of the fit parameters:

-  return ​0; +                a      m      s
-}+a               ​1.000
+m              -0.101 ​ 1.000
+s              -0.529 -0.335 ​ 1.000
</​code>​ </​code>​

+Compare your results to the chi-square probability contours above. For your reduced chi-square and your number of degrees of freedom. What is the probability of observing a larger chi-square than the one you measured? Repeat your fit using the other data files. ​

-===== (AFIFO =====+Plot your fit on top of your data using the command:
+
+  gnuplot> plot [50:150] '​h_200.dat'​ using 2:3:4 with errorbars, g(x)
+
+Rerun the //​**lab13data**//​ program to regenerate the histograms two more times. And redo your fit to the h_200 data each time. Observe how the reduced chi-square fluctuates and where these results lie on the probability contours. Repeat your fit on the files h_10K.dat and h_1M.dat.
+
+Notice the uncertainties that gnuplot prints out beside the best-fit parameter values. ​ How do the uncertainties on the parameters change for the higher statistics samples?
+
+
+
+===== (1) Excluding a model for a data set =====
+
+Define a new fit function:
+  f(x) = a*exp(-abs(x-m)/​s**2)
+Use this function to fit h_200.dat. If your fit fails and gnuplot returns an error message, you should re-adjust the initial values of your parameters a, m, s and try the fit again. Remember, you may need to "poke around"​ for reasonable starting parameters before you can start fitting. ​ Try guessing at parameter values and plotting the data and the function together to see if the function is near the data... Look at the plot of your data and fit together.
+
+What is your reduced chi-squared for fitting this function to h_200.dat? In this case would you say that your data are consistent or inconsistent with the functional form. If we were handed these data without knowing how they were generated, could we argue convincingly that they (probably) did not come from a parent distribution modeled by f(x)? Repeat this test for h_10K.dat. What can you conclude now?
+
+
+
+===== (2Trying another fitting program ​=====
+
+Now let's take a quick look at another fitting program. ​ This is one that's been used in particle physics for a couple of decades. ​ Start out by typing:
+
+  cp $P2660LIB/​pawfit.kumac . + +Then start up the program by typing "​paw",​ which stands for "​Physics Analysis Workstation"​. ​ You'll see something like this: -Compile the above program [stack.cpp]. ​ Also download the following data file: <​code>​ <​code>​ -cp$LAB13/anim1.dat .+ paw
+ ​******************************************************
+ ​* ​                                                   *
+ ​* ​           W E L C O M E    to   P A W             *
+ ​* ​                                                   *
+ ​* ​      ​Version 2.14/04      12 January 2004         *
+ ​* ​                                                   *
+ ​******************************************************
+ ​Workstation type (?=HELP) <​CR>​=1 :
</​code>​ </​code>​

-Run the program with this data file as input. ​ You should see an animation of the world in motion.  ​The problem is that it's spinning ​the wrong way.  Copy this program to [fifo.cpp] and modify the code to process the data nodes in a FIFO structureso the world spins in the correct direction.+At the prompt, just press return, and you should see a new, blank window pop up.  ​Now, at the "PAW >" command prompttype:

-===== (B) Circular Buffer =====+  exec pawfit h_10k.dat
+
+Note that you didn't need to enter any guesses for the parameters. ​ Try it again by typing:

-Modify the program a second time [cbuff.cpp] to implement a circular buffer and make the world rotate 5 times by continually following the //next// pointers ​in your data structures. ​ You should be able to do this WITHOUT using the count of the number of nodes you read in from the data file!+  exec pawfit h_200.dat
+
+Paw will print out the best-fit parameters and other information ​in your terminal window, and also display ​the best-fit parameter values, the chi-squared value and the number of degrees of freedom ​in the upper corner of the graphics window.

-**Have your TA verify your results**+To exit from paw, type "​quit"​ or "​exit"​ at the command prompt.

-Next try your fifo or circular buffer program ​with the following ​data file.+If you're curious, take a look inside the file pawfit.kumac,​ using emacs, to see what it does.  This is a paw macro file that executes a few commands for you.
+
+**Discuss ​your results ​with your TA/​Instructor**
+
+==== A useful tip/​reminder - try this before you leave ====
+
+In **gunplot** it is easy to turn any plot on your screen into a graphic file.  This is very useful if you are manipulating a plot by hand and then want to save a file for printing. ​ When you complete the fit above simply to the following ​when you have a plot to save.
<​code>​ <​code>​
-cp $LAB13/​anim2.dat .+set term pdf +set output "​myplot.pdf" +replot +unset output +set term x11 + to view/check your pdf file from without leaving gnuplot, you may type +!xpdf myplot.pdf </​code>​ </​code>​ -===== (C) A Little Challenge ===== +The //unset output// command closes the graphics file. And //set term wxt// resets the default display to be your monitor. -8-) This one is pretty cool :!:+ -The following file is in a different format from the others: - cp$LAB13/​anim3.dat .

-Can you modify your program to read the frames and play the file?
-Note to make things easier, each frame is separated by a single "#"​ character and the "​%"​ marks the end of the data.  View the file in **emacs** to see the structure.

-Note: the lines are very long in this file.  ​To run your program ​open second terminal to connect ​to **galileo**, then set the terminal font to a small size (I recommend ​or 6 point font) and stretch ​the window ​out so it is fairly wide and tall.  ​You'll need to do this to prevent ​the lines in the image from wrapping in your terminal ​Use ​your first terminal ​to do your compiling ​and run the program in this second window.+===== (3) A fit with a large background component =====
+
+This exercise will give you some more practice in doing a fit to a data distribution. ​ **Recommendation:** open a second window on **galileo**,​ this will allow you to rerun the program to generate data without leaving **gnuplot**. ​ This is useful for the last part of this exercise. ​
+
+!-
+[[@../​html/​labs/​lab11/​example|Link]] to an example fit.  Copy the data file and the plot file.  You can cut/paste
+commands from the plot file to run the demo.
+-!
+
+Here you will perform an experiment where protons ​are colliding with anti-protons at very high energies. From these collisions, you choose to study cases where an electron and a positron are produced ​in the debris that come out of the collision. Assuming that this electron/​positron pair is the result of a very massive particle (The [[Wp>​Z_boson]]) that decays in your detector, you can use the energies of these very light particles to measure the mass of the //Z//If the //Z// particle is being produced in your experiment, you should see a peak or grouping of the electron+positron energies that corresponds to the mass of the Z.
+
+{{lab11:​particle.png}}
+
+Run the program: //​**lab11data**//​ to produce the files:
+  h_1k.dat
+  h_100k.dat
+
+These histograms show the rate of collisions observed (cross section) for various values of the electron+positron energy. The histograms have 100 and 100,000 entries respectively.
+
+Using gnuplot plot the histograms to your screen (for example):
+<​code>​
+plot "​h_1k.dat"​ using 2:3:4 with errorbars
+</​code>​
+Refer to last week's lab for more comments on data fitting.
+
+The energies here are expressed in units of GeV, or 10^9 electron Volts. ​ In the lowest statistics histogram you may or may not be able to discern a peak representing the Z-boson resonance in the cross section. ​  Run the program ​several times, you may see peak if the statistics favor you.
+
+Looking at the histogram with 100K entries, you will notice what looks like an excess of entries as compared ​to the trend just above 90 GeV. Perhaps something is going on here...
+
+**We will fit the data to determine the mass, width, and production rate of the //​Z//​. ​**
+
+First notice that two things are going on in your histogram. There is a background distribution in the cross section plus there is a clear peak around the mass for the //Z//. In order to fit these data you must account for both the background and signal components. Use the h_100k.dat file to perform your fit.
+
+Use Lorenzian as the functional form for the peak itself: \\ <​math>​\large A\frac{\Gamma}{2\pi((x-E_0)^2+(\Gamma/​2)^2)}\mbox{~~}</​math>​. ​ \\ The parameter "​**//​A//​**"​ is measure of how often the Z is produced, "​**//​E0//​**"​ measures its mass, "<​math>​\large\Gamma</​math>"​ measures the observed width. ​ The width can be related to the lifetime of the particle via the [[Wp>​Uncertainty_principle]] for Energy and Time, or it could be dominated by imperfect energy resolution/​noise in your detector (in this case, a Gaussian peak would often be more appropriate)
+
+For the background distribution,​ use a simple 2nd-order polynomial: p1 + p2*x + p3*x^2.
+
+To fit these functions to the data, try the following:​
+
+  - In gnuplot, define the function f(x) as the Lorentzian function above, with parameters a, e0 and gamma.
+  - Define ​the function g(x) as the polynomial for the background function.
+  - Start out by asking gnuplot to fit g(x) to the data: <​code>​fit g(x) "​h_100k.dat"​ using 2:3:4 via p1,​p2,​p3</​code>​
+  - After fitting, plot the data and g(x) together, to see how well it fits:<​code>​plot "​h_100k.dat" using 2:3:4 with errorbars, g(x)</​code>​
+  - Now try fitting the peak.  Tell gnuplot ​to just look at points around the peak, by specifying a narrow range:<​code>​fit [85:95] f(x) "​h_100k.dat"​ using 2:3:4 via a,​e0,​gamma</​code>​
+  - Now plot the sum of both fits together with the data:<​code>​plot "​h_100k.dat"​ using 2:3:4 with errorbars, f(x)+g(x)</​code>​
+  - Modifying the values of a, e0 and gamma by hand, and replot. ​ Repeat ​this until you have a pretty good fit.
+  - Now define a new function:<​code>​h(x) = f(x)+g(x)</​code>​
+  - Ask gnuplot to try fitting h(x) to the data:<​code>​fit h(x) "​h_100k.dat"​ using 2:3:4 via a,​e0,​gamma,​p1,​p2,​p3</​code>​
+  - Does your best fit show significant deviations ​from the data?
+  - Once you have a good fit, plot the following:<​code>​plot "​h_100k.dat"​ using 2:​($3-g($2)):​4 with erorbars,​f(x)</​code>​This shows the peak, with the background subtracted, and your best-fit Lorentzian curve superimposed.
+
+What are the values and errors on the values for "​**//​E0//​**"​ , the mass of the //Z//, and "<​math>​\large\Gamma</​math>"​ , the width of the Z from your fit, and **//A//** the quantity related ​to the production rate?
+
+Is your chi-square reasonable? How close can you get to a reduced chi^2 of 1.0?
+
+**Write down your best values ​and the errors from the fit for your three signal parameters.**
+
+Rerun the program ​to generate the histograms until you see a hint of a peak in the h_1k distribution.
+Try the your fit again and extract the parameters for your signal. ​ Do a relative measure of the rate of //Z// production from the two samples (e.g. compare A_h_100k/​100,​000 to A_h_1k/​1000). ​ Which one gives you a greater estimate of the rate of //Z// production? ​ Remake several more histograms and redo the fits.  Do you notice a trend in the comparison? ​ This is a common problem/​bias of low statistics! ​ What is the nature of this bias?
+

-Hints: ​
-  * increase your buffer size to 20000 characters to handle the large frames in this file or you will SEG FAULT
-  * if the frames are very choppy due to sending the data over the nextwork, you can copy the files to your LAB PC and work locally
-  * remember EVERY line except the last ends w/ a LINE FEED, '​\n',​ character, even the frame spacer lines

-[cbuff2.cpp]

====== Course Evaluations ​ ====== ====== Course Evaluations ​ ======

-If you have time at the end of lab or just afterwards, ​please ​submit a course evaluation through UVaCollab. Open two browser windows (or tabs) so each partner can complete ​the questionnaire Or spread out to the free computers as necessary. You may also fill this out after the lab if you prefer.  +If you have time at the end of lab or just afterwards, submit a course evaluation through UVaCollab ​-- thus fulfilling ​the extra credit assignment for [[hw:​hw14|HW 14]]Do this independently since the extra credit will be given to individuals not lab partners. You may also fill this out after the lab if you prefer.
-Thanks! ​+
+Please be thoughtful and provide substantive insight towards the goal of improving this course for future semesters.