Regression analysis

Regression analysis examines the dependence of one quantity on another quantity or on several other quantities. It is used mainly in medium- and long-term forecasting, since medium- and long-term periods make it possible to establish changes in the business environment and take into account the impact of these changes on the indicator under study.

To carry out regression analysis, the following are required:

    annual data on the indicators under study;

    one-off forecasts, i.e., forecasts that are not refined as new data arrive.

Regression analysis is usually carried out for objects of a complex, multifactorial nature, such as the volume of investment, profit, sales volume, etc.

The normative forecasting method determines the ways and time frames for achieving the possible states of the phenomenon that are taken as the goal. It is a matter of predicting the achievement of desired states of the phenomenon on the basis of predetermined norms, ideals, incentives and goals. Such a forecast answers the question: by what ways can the desired be achieved? The normative method is more often used for programmatic or targeted forecasts. Both a quantitative expression of the standard and a certain scale of the possibilities of the evaluation function are used.

When a quantitative expression is used, for example, physiological and rational norms for the consumption of certain food and non-food products developed by specialists for various groups of the population, it is possible to determine the level of consumption of these goods for the years preceding the attainment of the specified norm. Such calculations are called interpolation. Interpolation is a way of calculating indicators that are missing in the time series of a phenomenon on the basis of an established relationship. Taking the actual value of the indicator and the value of its standard as the extreme members of the dynamic series, it is possible to determine the values within this series; for this reason interpolation is considered a normative method. Formula (4), given earlier and used in extrapolation, can also be used in interpolation, where y_n will characterize not the actual data but the standard of the indicator (a sketch of such a calculation follows).
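As an illustration, the sketch below linearly interpolates the intermediate annual levels between a current actual level and a normative target; the figures and variable names are hypothetical, not taken from the text.

    import numpy as np

    # Hypothetical data: actual consumption today and the norm to be reached in 5 years.
    y_actual = 62.0   # current actual level (e.g., kg per capita)
    y_norm = 75.0     # normative target level
    years_ahead = 5   # lead time until the norm must be attained

    # Linear interpolation between the two extreme members of the dynamic series.
    years = np.arange(0, years_ahead + 1)
    levels = y_actual + (y_norm - y_actual) * years / years_ahead

    for t, level in zip(years, levels):
        print(f"year +{t}: {level:.1f}")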

When a scale (field, spectrum) of the possibilities of the evaluation function, i.e., a preference distribution function, is used, the normative method indicates approximately the following gradation: undesirable - less desirable - more desirable - most desirable - optimal (standard).

The normative forecasting method helps to develop recommendations for increasing the objectivity, and hence the effectiveness, of decisions.

Modeling is perhaps the most difficult forecasting method. Mathematical modeling means describing an economic phenomenon through mathematical formulas, equations and inequalities. The mathematical apparatus should accurately reflect the forecast background, although it is quite difficult to fully capture the entire depth and complexity of the predicted object. The term "model" is derived from the Latin word modulus, meaning "measure". For this reason it would be more correct to consider modeling not as a forecasting method, but as a method of studying a similar phenomenon on a model.

In a broad sense, models are substitutes for the object of study that are in such similarity with it that new knowledge about the object can be obtained. The model should be considered a mathematical description of the object. In this case, the model is defined as a phenomenon (object, installation) that is in some correspondence with the object under study and can replace it in the research process, presenting information about the object.

In a narrower understanding, the model itself is considered the object of forecasting, and its study yields information about the possible states of the object in the future and the ways of achieving those states. In this case, the purpose of the predictive model is to obtain information not about the object in general, but only about its future states. When building such a model, it may be impossible to check directly its correspondence to the object, since the model represents only the object's future state, while the object itself may currently be absent or exist in a different form.

Models can be material and ideal.

Ideal models are the ones used in economics. The most advanced ideal model for the quantitative description of a socio-economic (economic) phenomenon is a mathematical model that uses numbers, formulas, equations, algorithms or a graphical representation. Economic models are used to determine:

    the relationship between various economic indicators;

    various kinds of restrictions imposed on indicators;

    criteria for optimizing the process.

A meaningful description of an object can be represented in the form of a formalized scheme indicating which parameters and initial information must be collected in order to calculate the desired values. A mathematical model, unlike a formalized scheme, contains specific numerical data characterizing the object. The development of a mathematical model largely depends on the forecaster's idea of the essence of the process being modeled. On the basis of this idea, a working hypothesis is put forward, with the help of which an analytical record of the model is created in the form of formulas, equations and inequalities. By solving the resulting system of equations, specific parameters of the function are obtained that describe the change in the desired variables over time.

The order and sequence of work as an element of the organization of forecasting is determined depending on the forecasting method used. Usually this work is carried out in several stages.

Stage 1 - predictive retrospection, i.e., the establishment of the object of forecasting and the forecast background. The work at the first stage is performed in the following sequence:

    formation of a description of the object in the past, including a pre-forecast analysis of the object, an assessment of its parameters, their significance and mutual relationships;

    identification and evaluation of sources of information, the procedure and organization of work with them, the collection and placement of retrospective information;

    setting research objectives.

Performing the tasks of predictive retrospection, forecasters study the history of the development of the object and the forecast background in order to obtain their systematic description.

Stage 2 - predictive diagnosis, during which a systematic description of the object of forecasting and the forecast background is studied in order to identify trends in their development and select models and methods of forecasting. The work is performed in the following sequence:

    development of a forecast object model, including a formalized description of the object, checking the degree of adequacy of the model to the object;

    selection of forecasting methods (main and auxiliary), development of an algorithm and work programs.

Stage 3 - prospection, i.e., the process of extensive development of the forecast, including: 1) calculation of the predicted parameters for a given lead period; 2) synthesis of the individual components of the forecast.

Stage 4 - assessment of the forecast, including its verification, i.e., determination of its degree of reliability, accuracy and validity.

During prospection and assessment, the forecasting problems are solved on the basis of the results of the previous stages.

The indicated phasing is approximate and depends on the main forecasting method.

The results of the forecast are drawn up in the form of a certificate, report or other material and are presented to the customer.

In forecasting, the deviation of the forecast from the actual state of the object can be indicated; this is called the forecast error and is calculated as

    e_t = y_t - ŷ_t (absolute error);
    δ_t = |y_t - ŷ_t| / y_t · 100% (relative error). (9.3)
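A small sketch of these error measures on hypothetical data (the actual series and the forecasts below are illustrative):

    import numpy as np

    # Hypothetical actual values and forecasts for the same periods.
    y_actual = np.array([100.0, 104.0, 110.0, 115.0])
    y_forecast = np.array([98.0, 105.0, 108.0, 118.0])

    abs_error = y_actual - y_forecast               # e_t
    rel_error = np.abs(abs_error) / y_actual * 100  # delta_t, %
    mape = rel_error.mean()                         # mean absolute percentage error

    print("absolute errors:", abs_error)
    print("relative errors, %:", rel_error.round(2))
    print(f"MAPE: {mape:.2f}%")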

Sources of errors in forecasting

The main sources can be:

1. Simple transfer (extrapolation) of data from the past to the future (for example, the company does not have other forecast options, except for a 10% increase in sales).

2. The inability to accurately determine the probability of an event and its impact on the object under study.

3. Unforeseen difficulties (disruptive events) affecting the implementation of the plan, for example, the sudden dismissal of the head of the sales department.

In general, the accuracy of forecasting increases with the accumulation of experience in forecasting and the development of its methods.

Regression analysis

Regression (linear) analysis is a statistical method for studying the influence of one or more independent variables on a dependent variable. Independent variables are otherwise called regressors or predictors, and dependent variables are called criteria. The terminology of "dependent" and "independent" variables reflects only the mathematical dependence of the variables (see Spurious correlation), not a causal relationship.

Goals of regression analysis

  1. Determining the degree to which the variation of the criterion (dependent) variable is determined by the predictors (independent variables)
  2. Predicting the value of the dependent variable using the independent variable(s)
  3. Determining the contribution of individual independent variables to the variation of the dependent variable

Regression analysis cannot be used to determine whether there is a relationship between variables, since the existence of such a relationship is a prerequisite for applying the analysis.

Mathematical definition of regression

Strictly, a regression dependence can be defined as follows. Let Y, X_1, X_2, ..., X_p be random variables with a given joint probability distribution. If for each set of values x_1, x_2, ..., x_p a conditional expectation

    y(x_1, x_2, ..., x_p) = E(Y | X_1 = x_1, X_2 = x_2, ..., X_p = x_p) (general regression equation)

is defined, then the function y(x_1, x_2, ..., x_p) is called the regression of Y on X_1, X_2, ..., X_p, and its graph the regression line of Y on X_1, X_2, ..., X_p, or the regression equation.

The dependence of Y on X_1, X_2, ..., X_p is manifested in the change of the mean values of Y as X_1, X_2, ..., X_p change, although for each fixed set of values X_1 = x_1, X_2 = x_2, ..., X_p = x_p the quantity Y remains a random variable with a certain dispersion.

To clarify how accurately regression analysis estimates the change in Y as X_1, X_2, ..., X_p change, the average value of the variance of Y over different sets of values x_1, x_2, ..., x_p is used (in fact, this is a measure of the dispersion of the dependent variable around the regression line).

Least squares method (calculation of coefficients)

In practice, the regression line is most often sought in the form of a linear function Y = b_0 + b_1 X_1 + b_2 X_2 + ... + b_N X_N (linear regression) that best approximates the desired curve. This is done by the least squares method, in which the sum of the squared deviations of the actually observed Y from their estimates Ŷ is minimized (meaning estimates obtained from a straight line claiming to represent the desired regression dependence):

    Σ (Y_k - Ŷ_k)² → min (the sum is over k = 1, ..., M, where M is the sample size).

This approach is based on the well-known fact that the sum appearing in this expression takes its minimum value precisely when Y = y(x_1, ..., x_N).

To solve the regression analysis problem by the least squares method, the residual function is introduced:

    σ(b_0, ..., b_N) = Σ_k (y_k - b_0 - b_1 x_1k - ... - b_N x_Nk)².

The condition for the minimum of the residual function:

    ∂σ/∂b_i = 0, i = 0, 1, ..., N.

The resulting system is a system of N + 1 linear equations with N + 1 unknowns b_0, b_1, ..., b_N (the normal equations).

If the free terms of the left-hand sides of the equations are represented by the matrix A and the coefficients of the unknowns on the right-hand sides by the matrix B, the matrix equation A = B × b is obtained, which is easily solved by the Gauss method. The resulting matrix b contains the coefficients of the regression line equation:

    b = B⁻¹ × A.

To obtain the best estimates, it is necessary to fulfill the LSM prerequisites (Gauss–Markov conditions). In the English literature, such estimates are called BLUE (Best Linear Unbiased Estimators) - the best linear unbiased estimates.
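As an illustration of this procedure, the sketch below forms the normal equations for a hypothetical two-predictor sample and solves them with a direct solver (numpy's solve performs Gaussian elimination via LU factorization):

    import numpy as np

    # Hypothetical sample: M = 6 observations of two predictors and a response.
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
    y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

    # Design matrix with a column of ones for the intercept b0.
    X = np.column_stack([np.ones_like(x1), x1, x2])

    # Normal equations (X'X) b = X'y.
    B = X.T @ X
    A = X.T @ y
    b = np.linalg.solve(B, A)

    print("coefficients b0, b1, b2:", b.round(4))
    print("fitted values:", (X @ b).round(2))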

Interpreting Regression Parameters

The parameters b_i are partial correlation coefficients; b_i² is interpreted as the proportion of the variance of Y explained by X_i while the influence of the remaining predictors is fixed, that is, it measures the individual contribution of X_i to the explanation of Y. In the case of correlated predictors, there is a problem of uncertainty in the estimates, which become dependent on the order in which the predictors are included in the model. In such cases, it is necessary to apply correlation analysis and stepwise regression analysis methods.

When speaking of non-linear models of regression analysis, it is important to note whether the non-linearity is in the independent variables (from the formal point of view, this is easily reduced to linear regression) or in the estimated parameters (which causes serious computational difficulties). With non-linearity of the first type, from the substantive point of view it is important to single out the appearance in the model of terms of the form X_i², X_i X_j, indicating the presence of interactions between the features X_i, X_j, etc. (see Multicollinearity).


Links

  • www.kgafk.ru - Lecture on "Regression Analysis"
  • www.basegroup.ru - methods for selecting variables in regression models



What is regression?

Consider two continuous variables x = (x_1, x_2, ..., x_n), y = (y_1, y_2, ..., y_n).

Let us place the points on a two-dimensional scatterplot and say that we have a linear relationship if the data are approximated by a straight line.

If we believe that y depends on x, with the changes in y caused by changes in x, we can determine the regression line (the regression of y on x) that best describes the straight-line relationship between these two variables.

The statistical use of the word "regression" comes from a phenomenon known as regression to the mean, attributed to Sir Francis Galton (1889).

He showed that although tall fathers tend to have tall sons, the average height of the sons is smaller than that of their tall fathers. The average height of the sons "regressed" or "moved back" toward the average height of all fathers in the population. Thus, on average, tall fathers have shorter (but still tall) sons, and short fathers have taller (but still rather short) sons.

Regression line

The mathematical equation that estimates a simple (pairwise) linear regression line is:

    Y = a + bx

x is called the independent variable or predictor.

Y is the dependent or response variable. This is the value we expect for Y (on average) if we know the value of x, i.e., the predicted value of y.

  • a is the intercept of the estimated line; it is the value of Y when x = 0 (Fig. 1).
  • b is the slope or gradient of the estimated line; it is the amount by which Y increases on average if we increase x by one unit.
  • a and b are called the regression coefficients of the estimated line, although this term is often used only for b.

Pairwise linear regression can be extended to include more than one independent variable; in this case it is known as multiple regression.

Fig. 1. Linear regression line showing the intercept a and the slope b (the amount by which Y increases when x increases by one unit)

Least squares method

We perform regression analysis using a sample of observations, where a and b are sample estimates of the true (population) parameters α and β, which determine the linear regression line in the population.

The simplest method for determining the coefficients a and b is the least squares method (OLS).

The fit is evaluated by considering the residuals (the vertical distance of each point from the line, e.g., residual = observed y - predicted y, Fig. 2).

The line of best fit is chosen so that the sum of the squares of the residuals is minimal.

Fig. 2. Linear regression line with the residuals (vertical dashed lines) shown for each point.

Linear regression assumptions

So, for each observed value, the residual is equal to the difference between the observed value and the corresponding predicted value. Each residual can be positive or negative.

The residuals can be used to test the following assumptions behind linear regression:

  • the relationship between x and y is linear;
  • the residuals are normally distributed with zero mean;
  • the variance of the residuals is constant (does not depend on x).

If the assumptions of linearity, normality and/or constant variance are questionable, we can transform x or y and calculate a new regression line for which these assumptions are satisfied (e.g., use a logarithmic transformation).

Anomalous values (outliers) and influential points

An "influential" observation is one which, if omitted, changes one or more model parameter estimates (i.e., the slope or the intercept).

An outlier (an observation that contradicts most of the values in the dataset) can be an "influential" observation, and it can often be detected visually on a two-dimensional scatterplot or a plot of the residuals.

For both outliers and "influential" observations (points), models are fitted both with and without them, and attention is paid to the change in the estimates (regression coefficients).

When doing an analysis, do not automatically discard outliers or influential points, since simply ignoring them can affect the results. Always study the causes of these outliers and analyze them.

Linear regression hypothesis

When constructing a linear regression, the null hypothesis that the population slope of the regression line β is equal to zero is checked.

If the slope of the line is zero, there is no linear relationship between x and y: a change in x does not affect y.

To test the null hypothesis that the true slope is zero, the following algorithm can be used:

Calculate the test statistic T = b / SE(b), which follows a t distribution with n - 2 degrees of freedom, where the standard error of the slope coefficient is

    SE(b) = s / √(Σ(x_i - x̄)²),

and s² = Σ(y_i - ŷ_i)² / (n - 2) is the estimate of the variance of the residuals.

Usually, if the significance level reached is p < 0.05, the null hypothesis is rejected.

A 95% confidence interval for the population slope is

    b ± t₀.₀₂₅ × SE(b),

where t₀.₀₂₅ is the percentage point of the t distribution with n - 2 degrees of freedom that gives the probability of a two-sided test of 0.05. This is the interval that contains the population slope with a probability of 95%.

For large samples, t₀.₀₂₅ can be approximated by 1.96 (that is, the test statistic will tend to be normally distributed).
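A sketch of this algorithm on hypothetical data, using scipy for the t distribution:

    import numpy as np
    from scipy import stats

    # Hypothetical paired sample.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.9])

    n = len(x)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)  # slope
    a = y.mean() - b * x.mean()                                                # intercept

    resid = y - (a + b * x)
    s2 = np.sum(resid ** 2) / (n - 2)                 # residual variance estimate
    se_b = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))  # standard error of the slope

    t_stat = b / se_b
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided test of beta = 0
    t_crit = stats.t.ppf(0.975, df=n - 2)
    ci = (b - t_crit * se_b, b + t_crit * se_b)       # 95% CI for the slope

    print(f"b = {b:.3f}, t = {t_stat:.2f}, p = {p_value:.4f}, "
          f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")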

Evaluation of the quality of linear regression: the coefficient of determination R²

Because of the linear relationship between x and y, we expect y to change as x changes, and we call this the variation that is due to, or explained by, the regression. The residual variation should be as small as possible.

If so, most of the variation in y will be explained by the regression, and the points will lie close to the regression line, i.e., the line fits the data well.

The proportion of the total variance of y that is explained by the regression is called the coefficient of determination, usually expressed as a percentage and denoted R² (in pairwise linear regression this is the value r², the square of the correlation coefficient); it allows a subjective assessment of the quality of the regression equation.

The difference 1 - R² is the proportion of the variance that cannot be explained by the regression.

With no formal test available, we are forced to rely on subjective judgment to determine the quality of the fit of the regression line.
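A minimal sketch of computing R² from the residual and total sums of squares (the same kind of hypothetical sample):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.9])

    # Fit the pairwise line by least squares (polyfit returns slope first).
    b, a = np.polyfit(x, y, 1)
    y_hat = a + b * x

    ss_res = np.sum((y - y_hat) ** 2)     # residual (unexplained) variation
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation
    r2 = 1 - ss_res / ss_tot              # coefficient of determination

    print(f"R^2 = {r2:.3f} ({r2:.1%} of the variance is explained)")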

Applying a regression line to forecasting

A regression line can be used to predict a value of y from a value of x within the observed range (never extrapolate beyond these limits).

We predict the mean of y for observations that have a certain value of x by substituting that value of x into the equation of the regression line.

So, if we predict y as ŷ = a + bx₀, we use this predicted value and its standard error to estimate a confidence interval for the true population mean.

Repeating this procedure for different values of x allows confidence limits to be built for the whole line. This is the band or area that contains the true line, for example, with a 95% confidence level.
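A sketch of a confidence interval for the predicted mean at a given x₀, using the usual OLS standard error of the mean response; the data are hypothetical:

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.9])

    n = len(x)
    b, a = np.polyfit(x, y, 1)
    resid = y - (a + b * x)
    s = np.sqrt(np.sum(resid ** 2) / (n - 2))

    x0 = 4.5                    # a value within the observed range
    y0 = a + b * x0             # predicted mean of y at x0
    # Standard error of the estimated mean response at x0.
    se_mean = s * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
    t_crit = stats.t.ppf(0.975, df=n - 2)

    print(f"predicted mean at x0={x0}: {y0:.2f}, "
          f"95% CI: ({y0 - t_crit * se_mean:.2f}, {y0 + t_crit * se_mean:.2f})")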

Simple regression designs

Simple regression designs contain one continuous predictor. If there are 3 cases with values of the predictor P of, say, 7, 4 and 9, and the design includes the first-order effect of P, then the design matrix X will be

    X = [[1, 7],
         [1, 4],
         [1, 9]]

and the regression equation using P for X1 looks like

    Y = b0 + b1*P

If a simple regression design contains a higher-order effect of P, such as a quadratic effect, then the values in column X1 of the design matrix are raised to the second power:

    X = [[1, 49],
         [1, 16],
         [1, 81]]

and the equation takes the form

    Y = b0 + b1*P²

Sigma-restricted and overparameterized coding methods do not apply to simple regression designs and other designs containing only continuous predictors (since there are simply no categorical predictors). Regardless of the coding method chosen, the values of the continuous variables are raised to the appropriate power and used as the values of the X variables; no recoding is performed. In addition, when describing regression designs, one can omit the design matrix X and work only with the regression equation.
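A sketch of building these two design matrices and fitting both models by least squares (the response values are hypothetical):

    import numpy as np

    P = np.array([7.0, 4.0, 9.0])
    y = np.array([14.2, 8.1, 18.3])   # hypothetical responses

    # First-order design: a column of ones (intercept) and P itself.
    X1 = np.column_stack([np.ones_like(P), P])
    # Quadratic design: the predictor column is raised to the second power.
    X2 = np.column_stack([np.ones_like(P), P ** 2])

    # Least squares fit for each design.
    b_lin, *_ = np.linalg.lstsq(X1, y, rcond=None)
    b_quad, *_ = np.linalg.lstsq(X2, y, rcond=None)

    print("first-order design:\n", X1)
    print("Y = %.3f + %.3f*P" % (b_lin[0], b_lin[1]))
    print("quadratic design:\n", X2)
    print("Y = %.3f + %.3f*P^2" % (b_quad[0], b_quad[1]))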

Example: Simple Regression Analysis

This example uses the data provided in the table:

Fig. 3. Table of initial data.

The data is based on a comparison of the 1960 and 1970 censuses in 30 randomly selected counties. County names are represented as observation names. Information regarding each variable is presented below:

Fig. 4. Variable specification table.

Research objective

For this example, we will analyze the correlates of poverty, that is, the variables that best predict the percentage of families below the poverty line. Therefore, we will treat variable 3 (Pt_Poor) as the dependent variable.

One can put forward a hypothesis: the change in population and the percentage of families below the poverty line are related. It seems reasonable to expect that poverty leads to an outflow of population, so there should be a negative correlation between the percentage of people below the poverty line and population change. Therefore, we will treat variable 1 (Pop_Chng) as the predictor variable.

View Results

Regression coefficients

Fig. 5. Regression coefficients of Pt_Poor on Pop_Chng.

At the intersection of the Pop_Chng row and the Param. column, the non-standardized coefficient for the regression of Pt_Poor on Pop_Chng is -0.40374. This means that for every unit decrease in population change there is an increase in the poverty rate of .40374. The upper and lower (default) 95% confidence limits for this non-standardized coefficient do not include zero, so the regression coefficient is significant at the p < .05 level. Note the standardized coefficient, which is also the Pearson correlation coefficient for simple regression designs: it equals -.65, which means that for every standard-deviation decrease in population change there is an increase of .65 standard deviations in the poverty rate.

Distribution of variables

Correlation coefficients can become significantly overestimated or underestimated if there are large outliers in the data. Let us examine the distribution of the dependent variable Pt_Poor by county. To do this, we will build a histogram of the Pt_Poor variable.

Fig. 6. Histogram of the Pt_Poor variable.

As you can see, the distribution of this variable differs markedly from the normal distribution. However, although two counties (the two right-hand columns) have a higher percentage of families below the poverty line than expected under a normal distribution, they appear to be "inside the range."

Fig. 7. Histogram of the Pt_Poor variable.

This judgment is somewhat subjective. The rule of thumb is that outliers should be taken into account if an observation (or observations) does not fall within the interval (mean ± 3 standard deviations). In this case, it is worth repeating the analysis with and without the outliers to make sure that they do not seriously affect the correlation between the members of the population.

Scatterplot

If one has an a priori hypothesis about the relationship between the given variables, it is useful to check it on the corresponding scatterplot.

Fig. 8. Scatterplot.

The scatterplot shows a clear negative correlation (-.65) between the two variables. It also shows the 95% confidence interval for the regression line, i.e., with 95% probability the regression line passes between the two dashed curves.

Significance criteria

Fig. 9. Table containing the significance criteria.

The test for the Pop_Chng regression coefficient confirms that Pop_Chng is strongly related to Pt_Poor, p < .001.

Outcome

This example showed how to analyze a simple regression design. Interpretations of the non-standardized and standardized regression coefficients were also presented. The importance of studying the distribution of the dependent variable was discussed, and a technique for determining the direction and strength of the relationship between the predictor and the dependent variable was demonstrated.

Regression and correlation analysis are statistical research methods. These are the most common ways to show the dependence of a parameter on one or more independent variables.

Below, using concrete practical examples, we will consider these two analyses, which are very popular among economists. We will also give an example of obtaining results when they are combined.

Regression Analysis in Excel

Regression analysis shows the influence of some values (independent ones) on the dependent variable. For example, how does the number of economically active population depend on the number of enterprises, wages and other parameters? Or: how do foreign investment, energy prices, etc. affect the level of GDP?

The results of the analysis make it possible to set priorities and, based on the main factors, to predict and plan the development of priority areas and make management decisions.

Regression can take the following forms (a fitting sketch follows the list):

  • linear (y = a + bx);
  • parabolic (y = a + bx + cx²);
  • exponential (y = a * exp(bx));
  • power (y = a * x^b);
  • hyperbolic (y = b/x + a);
  • logarithmic (y = b * ln(x) + a);
  • exponential with base b (y = a * b^x).
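A sketch of fitting two of the listed forms to hypothetical data with scipy's curve_fit; the model functions and figures are illustrative:

    import numpy as np
    from scipy.optimize import curve_fit

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.7, 4.8, 9.2, 16.5, 30.1, 54.8])  # hypothetical, roughly exponential

    def linear(x, a, b):
        return a + b * x

    def exponential(x, a, b):
        return a * np.exp(b * x)

    (a_lin, b_lin), _ = curve_fit(linear, x, y)
    (a_exp, b_exp), _ = curve_fit(exponential, x, y, p0=(1.0, 0.5))

    # Compare the fits by residual sum of squares.
    rss_lin = np.sum((y - linear(x, a_lin, b_lin)) ** 2)
    rss_exp = np.sum((y - exponential(x, a_exp, b_exp)) ** 2)
    print(f"linear:      y = {a_lin:.2f} + {b_lin:.2f}x, RSS = {rss_lin:.1f}")
    print(f"exponential: y = {a_exp:.2f}*exp({b_exp:.2f}x), RSS = {rss_exp:.1f}")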

Consider the example of building a regression model in Excel and interpreting the results. Let's take a linear type of regression.

Task. At 6 enterprises, the average monthly salary and the number of employees who left were analyzed. It is necessary to determine the dependence of the number of retired employees on the average salary.

The linear regression model has the following form:

Y = a₀ + a₁x₁ + ... + a_k x_k,

where a₀, ..., a_k are the regression coefficients, x₁, ..., x_k are the influencing variables, and k is the number of factors.

In our example, Y is the indicator of employees who quit. The influencing factor is the wage (x).

Excel has built-in functions that can be used to calculate the parameters of a linear regression model, but the Analysis ToolPak add-in does it faster.

Activate this powerful analytical tool: File → Options → Add-ins → in the Manage box select Excel Add-ins → Go → check Analysis ToolPak → OK.

Once activated, the add-in is available on the Data tab (the Data Analysis button).

Now we can deal directly with the regression analysis: on the Data tab, click Data Analysis, choose Regression, specify the input ranges for Y and X, and click OK.



First of all, pay attention to the R-square and the coefficients.

R-square is the coefficient of determination. In our example it is 0.755, or 75.5%. This means that the calculated parameters of the model explain 75.5% of the relationship between the parameters under study. The higher the coefficient of determination, the better the model: above 0.8 is good; below 0.5 is poor (such an analysis can hardly be considered reasonable). In our example it is "not bad".

The coefficient 64.1428 shows what Y will be if all the variables in the model under consideration equal 0. That is, the value of the analyzed parameter is also affected by factors not described in the model.

The coefficient -0.16285 shows the weight of variable X on Y. That is, within this model the average monthly wage affects the number of quitters with a weight of -0.16285 (a small degree of influence). The "-" sign indicates a negative impact: the higher the wage, the fewer quit. Which is fair.
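The same kind of analysis can be reproduced outside Excel. The sketch below fits the wage-attrition model with statsmodels on hypothetical figures for the 6 enterprises (the original worksheet values are not reproduced in the text):

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical data for 6 enterprises: average monthly wage and number of quitters.
    wage = np.array([25000.0, 30000.0, 32000.0, 40000.0, 45000.0, 50000.0])
    quit = np.array([60.0, 59.0, 58.0, 57.0, 56.0, 55.0])

    X = sm.add_constant(wage)      # adds the intercept column
    model = sm.OLS(quit, X).fit()

    print(model.params)            # intercept and slope
    print(f"R-squared: {model.rsquared:.3f}")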



Correlation analysis in Excel

Correlation analysis helps to establish whether there is a relationship between indicators in one or two samples. For example, between the operating time of the machine and the cost of repairs, the price of equipment and the duration of operation, the height and weight of children, etc.

If there is a relationship, then whether an increase in one parameter leads to an increase (positive correlation) or a decrease (negative) in the other. Correlation analysis helps the analyst determine whether the value of one indicator can be used to predict the possible value of another.

The correlation coefficient is denoted r and varies from +1 to -1. The classification of correlations differs for different fields. When the coefficient value is 0, there is no linear relationship between the samples.

Consider how to use Excel to find the correlation coefficient.

The CORREL function is used to find the paired coefficients.

Task: Determine if there is a relationship between the operating time of a lathe and the cost of its maintenance.

Put the cursor in any cell and press the fx button.

  1. In the "Statistical" category, select the CORREL function.
  2. Argument "Array 1" - the first range of values ​​- the time of the machine: A2: A14.
  3. Argument "Array 2" - the second range of values ​​- the cost of repairs: B2:B14. Click OK.

To determine the type of connection, you need to look at the absolute number of the coefficient (each field of activity has its own scale).

For correlation analysis of several parameters (more than 2), it is more convenient to use "Data Analysis" (the Analysis ToolPak add-in). Select Correlation in the list and designate the input array; that is all.

The resulting coefficients will be displayed in the correlation matrix. Like this one:
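The same pairwise coefficient and a correlation matrix can be computed in Python; the machine-time and repair-cost figures below are hypothetical:

    import numpy as np

    # Hypothetical samples: machine operating time (hours) and repair cost.
    time_h = np.array([10, 25, 30, 45, 50, 60, 70, 85, 90, 100, 110, 120, 130])
    cost = np.array([150, 210, 250, 300, 350, 370, 420, 480, 500, 540, 590, 610, 660])

    # Pairwise coefficient (the CORREL analogue).
    r = np.corrcoef(time_h, cost)[0, 1]
    print(f"r = {r:.3f}")

    # Correlation matrix for several parameters at once (rows are variables).
    wear = np.array([1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7])  # hypothetical third indicator
    print(np.corrcoef(np.vstack([time_h, cost, wear])))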

Correlation-regression analysis

In practice, these two techniques are often used together.


1. The term "regression" was first introduced by the founder of biometrics, F. Galton (19th century), whose ideas were developed by his follower K. Pearson.

Regression analysis is a method of statistical data processing that makes it possible to measure the relationship between one or several causes (factor signs) and a consequence (effective sign).

A sign is the main distinguishing feature or characteristic of the phenomenon or process being studied.

The effective sign is the indicator under study.

A factor sign is an indicator that affects the value of the effective sign.

The purpose of regression analysis is to evaluate the functional dependence of the average value of the effective sign (y) on the factor signs (x₁, x₂, ..., xₙ), expressed as the regression equation

    y = f(x₁, x₂, ..., xₙ). (6.1)

There are two types of regression: paired and multiple.

Paired (simple) regression is an equation of the form:

    y = f(x). (6.2)

In pairwise regression, the effective sign is considered as a function of one argument, i.e., of one factor sign.

Regression analysis includes the following steps:

determining the type of the function;

determining the regression coefficients;

calculating the theoretical values of the effective sign;

checking the statistical significance of the regression coefficients;

checking the statistical significance of the regression equation.

Multiple regression is an equation of the form:

    y = f(x₁, x₂, ..., xₙ). (6.3)

The effective sign is considered as a function of several arguments, i.e., of many factor signs.

2. To determine the type of the function correctly, it is necessary to find the direction of the relationship on the basis of theoretical data.

According to the direction of the relationship, regression is divided into:

· direct regression, which arises when, as the independent quantity "x" increases or decreases, the values of the dependent quantity "y" correspondingly increase or decrease as well;

· reverse regression, which arises when, as the independent quantity "x" increases or decreases, the dependent quantity "y" correspondingly decreases or increases.

To characterize the relationship, the following types of paired regression equations are used:

· y = a + bx - linear;

· y = e^(ax + b) - exponential;

· y = a + b/x - hyperbolic;

· y = a + b₁x + b₂x² - parabolic;

· y = a * b^x - exponential (with base b), etc.,

where a, b, b₁, b₂ are the coefficients (parameters) of the equation; y is the effective sign; x is the factor sign.

3. The construction of the regression equation reduces to estimating its coefficients (parameters); for this the least squares method (OLS) is used.

The least squares method makes it possible to obtain parameter estimates for which the sum of the squared deviations of the actual values of the effective sign "y" from the theoretical values "y_x" is minimal, that is

    Σ(y - y_x)² → min.

The parameters of the regression equation y = a + bx are estimated by the least squares method using the formulas:

    b = (n·Σxy - Σx·Σy) / (n·Σx² - (Σx)²), a = ȳ - b·x̄,

where a is the free coefficient and b is the regression coefficient, which shows by how much the effective sign "y" changes when the factor sign "x" changes by one unit of measurement.
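A sketch implementing these estimation formulas directly on a hypothetical sample:

    import numpy as np

    x = np.array([2.0, 3.0, 5.0, 6.0, 8.0, 9.0])
    y = np.array([25.4, 25.5, 25.6, 25.7, 25.9, 26.0])  # hypothetical effective sign

    n = len(x)
    b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
    a = y.mean() - b * x.mean()

    print(f"y_x = {a:.2f} + {b:.3f}x")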

4. To assess the statistical significance of the regression coefficients, Student's t-test is used.

Scheme for checking the significance of the regression coefficients:

1) H₀: a = 0, b = 0 - the regression coefficients do not differ significantly from zero.

H₁: a ≠ 0, b ≠ 0 - the regression coefficients differ significantly from zero.

2) α = 0.05 - significance level.

3) t_a = |a| / m_a, t_b = |b| / m_b,

where m_b, m_a are the random errors:

    m_b = √( Σ(y - y_x)² / (n - 2) / Σ(x - x̄)² ); m_a = m_b · √( Σx² / n ). (6.7)

4) t_table(α; f),

where f = n - k - 1 is the number of degrees of freedom (a table value), n is the number of observations, and k is the number of factor signs "x" in the equation.

5) If t_calc > t_table, then H₀ is rejected, i.e., the coefficient is significant.

If t_calc < t_table, then H₀ is accepted, i.e., the coefficient is insignificant.
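A sketch of this significance check, with the table value t_table taken from scipy (data hypothetical):

    import numpy as np
    from scipy import stats

    x = np.array([2.0, 3.0, 5.0, 6.0, 8.0, 9.0])
    y = np.array([25.4, 25.5, 25.6, 25.7, 25.9, 26.0])

    n, k = len(x), 1
    b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x ** 2) - np.sum(x) ** 2)
    a = y.mean() - b * x.mean()
    y_x = a + b * x

    # Random errors of the coefficients (formula 6.7).
    m_b = np.sqrt(np.sum((y - y_x) ** 2) / (n - 2) / np.sum((x - x.mean()) ** 2))
    m_a = m_b * np.sqrt(np.sum(x ** 2) / n)

    t_a, t_b = abs(a) / m_a, abs(b) / m_b
    t_table = stats.t.ppf(1 - 0.05 / 2, df=n - k - 1)

    print(f"t_a = {t_a:.2f}, t_b = {t_b:.2f}, t_table = {t_table:.2f}")
    print("a significant" if t_a > t_table else "a insignificant")
    print("b significant" if t_b > t_table else "b insignificant")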

5. To check the adequacy of the constructed regression equation, Fisher's F-test is used.

Scheme for checking the significance of the regression equation:

1) H₀: the regression equation is not significant.

H₁: the regression equation is significant.

2) α = 0.05 - significance level.

3) F_calc = r²_xy / (1 - r²_xy) · (n - k - 1) / k, (6.8)

where n is the number of observations; k is the number of parameters in the equation with the variables "x"; y is the actual value of the effective sign; y_x is the theoretical value of the effective sign; r_xy is the pair correlation coefficient.

4) F_table(α; f₁; f₂),

where f₁ = k and f₂ = n - k - 1 are the numbers of degrees of freedom (table values).

5) If F_calc > F_table, the regression equation is chosen correctly and can be applied in practice.

If F_calc < F_table, the regression equation is chosen incorrectly.
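A sketch of this check using formula (6.8), with F_table taken from scipy (the same hypothetical sample):

    import numpy as np
    from scipy import stats

    x = np.array([2.0, 3.0, 5.0, 6.0, 8.0, 9.0])
    y = np.array([25.4, 25.5, 25.6, 25.7, 25.9, 26.0])

    n, k = len(x), 1
    r_xy = np.corrcoef(x, y)[0, 1]

    F_calc = r_xy ** 2 / (1 - r_xy ** 2) * (n - k - 1) / k
    F_table = stats.f.ppf(1 - 0.05, dfn=k, dfd=n - k - 1)

    print(f"F_calc = {F_calc:.2f}, F_table = {F_table:.2f}")
    print("equation is significant" if F_calc > F_table else "equation is not significant")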

6. The main indicator reflecting the quality of regression analysis is the coefficient of determination (R²).

The coefficient of determination shows what proportion of the variation of the dependent variable "y" is taken into account in the analysis and is caused by the influence of the factors included in the analysis.

The coefficient of determination (R²) takes values in the range [0; 1]. The regression equation is of good quality if R² ≥ 0.8.

The coefficient of determination is equal to the square of the correlation coefficient, i.e.

    R² = (r_xy)².

Example 6.1. Based on the following data, construct and analyze the regression equation:

Solution.

1) Calculate the correlation coefficient: r_xy = 0.47. The relationship between the signs is direct and moderate.

2) Build a paired linear regression equation.

2.1) Make a calculation table.

x    y    xy    x²    y_x    (y - y_x)²
55.89    47.54    65.70
45.07    15.42    222.83
54.85    34.19    8.11
51.36    5.55    11.27
42.28    45.16    13.84
47.69    1.71    44.77
45.86    9.87    192.05
Sum    159.45    558.55
Average    77519.6    22.78    79.79    2990.6

The paired linear regression equation: y_x = 25.17 + 0.087x.

3) Find the theoretical values "y_x" by substituting the actual values of "x" into the regression equation.

4) Plot the actual values "y" and the theoretical values "y_x" of the effective sign (Figure 6.1); the points lie far from the line, which is consistent with the moderate correlation (r_xy = 0.47) and the small number of observations.

7) Calculate the coefficient of determination: R² = (0.47)² = 0.22. The constructed equation is of poor quality.

Since the calculations in regression analysis are quite voluminous, it is recommended to use specialized programs ("Statistica 10", SPSS, etc.).

Figure 6.2 shows a table with the results of a regression analysis carried out using the program "Statistica 10".

Figure 6.2. Results of the regression analysis carried out using the program "Statistica 10".
