Data for regression. Fundamentals of data analysis

As a result of studying the material of chapter 4, the student should:

know

  • basic concepts of regression analysis;
  • methods of least squares estimation and the properties of least squares estimates;
  • basic rules for significance testing and interval estimation of the equation and regression coefficients;

be able to

  • find estimates of the parameters of two-dimensional and multiple regression models from sample data, and analyze their properties;
  • check the significance of the equation and regression coefficients;
  • find interval estimates of significant parameters;

possess

  • the skills of statistical estimation of the parameters of two-dimensional and multiple regression equations, and the skills of checking the adequacy of regression models;
  • skills in obtaining a regression equation with all significant coefficients using analytical software.

Basic concepts

After conducting a correlation analysis, once the presence of statistically significant relationships between variables has been identified and the degree of their closeness has been assessed, one usually proceeds to a mathematical description of the form of the dependencies using regression analysis methods. For this purpose, a class of functions is selected that links the response indicator y and the arguments x1, x2, ..., xk; estimates of the parameters of the relationship equation are calculated, and the accuracy of the resulting equation is analyzed.

A function f(x1, x2, ..., xk) describing the dependence of the conditional average value of the response feature y on the given values of the arguments is called a regression equation.

The term "regression" (from lat. regression- retreat, return to something) was introduced by the English psychologist and anthropologist F. Galton and is associated with one of his first examples, in which Galton, processing statistical data related to the question of the heredity of growth, found that if the height of fathers deviates from the average height of all fathers by X inches, then the height of their sons deviates from the average height of all sons by less than x inches The identified trend was named regression to the mean.

The term "regression" is widely used in the statistical literature, although in many cases it does not accurately characterize the statistical dependence.

For an accurate description of the regression equation, it is necessary to know the conditional distribution law of the response indicator y. In statistical practice it is usually impossible to obtain such information; therefore, one limits oneself to finding suitable approximations for the function f(x1, x2, ..., xk), based on a preliminary substantive analysis of the phenomenon or on the original statistical data.

Within the framework of particular model assumptions about the type of distribution of the vector of indicators (y, x1, x2, ..., xk), a general form of the regression equation can be obtained. For example, under the assumption that the studied set of indicators obeys the (k + 1)-dimensional normal distribution law with the vector of mathematical expectations

μ = (μy, μx1, ..., μxk)

and the covariance matrix Σ, where σy² is the variance of y, the regression equation (conditional expectation) has the form

M(y | x1, ..., xk) = μy + Σyx Σxx⁻¹ (x − μx).

Thus, if the multivariate random variable (y, x1, x2, ..., xk) obeys the (k + 1)-dimensional normal distribution law, then the regression equation of the response indicator y in the explanatory variables has a form linear in x.

However, in statistical practice one usually has to limit oneself to finding suitable approximations for the unknown true regression function f(x), since the researcher does not have exact knowledge of the conditional law of the probability distribution of the analyzed response indicator y for the given values of the arguments x.

Consider the relationship between the true, the model, and the estimated regression functions. Let the response indicator y be related to the argument x by the relation

y = f(x) + ε,

where ε is a random variable with a normal distribution law and M(ε) = 0. The true regression function in this case is f(x) = M(y | x).

Suppose that we do not know the exact form of the true regression equation, but have nine observations of the two-dimensional random variable related as shown in Fig. 4.1.

Fig. 4.1. The relative position of the true f(x) and the theoretical ŷ(x) regression models

The location of the points in Fig. 4.1 allows us to confine ourselves to the class of linear dependencies of the form ŷ(x) = b0 + b1x.

Using the least squares method, we find an estimate of the regression equation.

For comparison, Fig. 4.1 shows the graphs of the true regression function and the theoretical approximating regression function ŷ(x). The estimate of the regression equation converges in probability to the latter as the sample size increases without bound (n → ∞).

Since we mistakenly chose a linear regression function instead of the true regression function (which, unfortunately, is quite common in the practice of statistical research), our statistical conclusions and estimates will not have the consistency property: no matter how much we increase the number of observations, our sample estimate will not converge to the true regression function.

If we had chosen the class of regression functions correctly, then the inaccuracy of the description using ŷ(x) would be explained only by the limited size of the sample and, therefore, could be made arbitrarily small as n → ∞.

In order to best restore the conditional value of the response indicator and the unknown regression function from the initial statistical data, the following adequacy criteria (loss functions) are most often used.

1. The least squares method, according to which the sum of squared deviations of the observed values of the response indicator yi from the model values ŷi = f(xi, b) is minimized:

Σi (yi − f(xi, b))² → min over b,

where b is the vector of coefficients of the regression equation and xi is the value of the vector of arguments in the i-th observation.

The problem of finding an estimate of the vector b is solved. The resulting regression is called mean square.

2. The method of least modules, according to which the sum of absolute deviations of the observed values of the response indicator from the model values is minimized, i.e.

Σi |yi − f(xi, b)| → min over b.

The resulting regression is called mean absolute (median).

3. The minimax method, which reduces to minimizing the maximum absolute deviation of the observed value of the response indicator yi from the model value, i.e.

maxi |yi − f(xi, b)| → min over b.

The resulting regression is called minimax.
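
The three criteria can be compared directly in code. Below is a minimal sketch (in Python; the sample data are invented for illustration and are not taken from the text) that fits the same linear model under all three loss functions:

```python
# A minimal sketch comparing the three adequacy criteria on invented data.
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0])

def fit(loss):
    # b[0] is the intercept, b[1] the slope of the model f(x, b) = b[0] + b[1]*x.
    # Nelder-Mead is used because the least-modules and minimax criteria
    # are not differentiable everywhere.
    result = minimize(lambda b: loss(y - (b[0] + b[1] * x)),
                      x0=np.zeros(2), method="Nelder-Mead")
    return result.x

b_mean_square   = fit(lambda e: np.sum(e ** 2))      # 1. least squares
b_mean_absolute = fit(lambda e: np.sum(np.abs(e)))   # 2. least modules (median)
b_minimax       = fit(lambda e: np.max(np.abs(e)))   # 3. minimax

print(b_mean_square, b_mean_absolute, b_minimax)
```

On clean data the three fits are close; they diverge when the data contain outliers, which the least-modules criterion tolerates best and the minimax criterion tolerates worst.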

In practical applications, there are often problems in which a random variable y is studied as a function of some set of variables x1, ..., xk and unknown parameters. We will consider (y, x1, ..., xk) as a (k + 1)-dimensional general population, from which a random sample of volume n is drawn, where (yi, xi1, ..., xik) is the result of the i-th observation, i = 1, 2, ..., n. It is required to estimate the unknown parameters from the results of the observations. The task described above belongs to the tasks of regression analysis.

Regression analysis is the name given to the method of statistical analysis of the dependence of a random variable y on variables x1, ..., xk, which are treated in regression analysis as non-random variables, regardless of the true distribution law of y.


Modern political science proceeds from the premise that all phenomena and processes in society are interrelated. It is impossible to understand events and processes, or to predict and manage the phenomena of political life, without studying the connections and dependencies that exist in the political sphere of society. One of the most common tasks of policy research is to study the relationship between some observable variables. A whole class of statistical methods of analysis, united by the common name "regression analysis" (or, as it is also called, "correlation-regression analysis"), helps to solve this problem. However, while correlation analysis makes it possible to assess the strength of the relationship between two variables, regression analysis makes it possible to determine the type of this relationship and to predict the value of one variable from the value of another.

First, let us recall what a correlation is. A correlation is the most important special case of statistical relationship, which consists in the fact that different values of one variable correspond to different average values of another. With a change in the value of the attribute x, the average value of the attribute y changes in a regular way, while in each individual case the attribute y (with different probabilities) can take on many different values.

The appearance of the term "correlation" in statistics (whose achievements political science draws on for solving its problems, which makes statistics a discipline related to political science) is associated with the name of the English biologist and statistician Francis Galton, who proposed the theoretical foundations of correlation-regression analysis in the 19th century. The term "correlation" was known in science before that. In particular, in paleontology back in the 18th century it was applied by the French scientist Georges Cuvier. He introduced the so-called correlation law, with the help of which it was possible to restore the appearance of animals from remains found during excavations.

There is a well-known story associated with the name of this scientist and his law of correlation. On the days of a university holiday, students who decided to play a trick on the famous professor pulled a goat skin with horns and hooves over one of them. He climbed into the window of Cuvier's bedroom and shouted: "I'll eat you!" The professor woke up, looked at the silhouette and replied: "If you have horns and hooves, then you are a herbivore and cannot eat me. And for ignorance of the law of correlation you will get a failing grade." Then he turned over and fell asleep. A joke is a joke, but in this example we see a special case of the use of multiple correlation-regression analysis. Here the professor, based on knowledge of the values of two observed traits (the presence of horns and hooves), derived, on the basis of the law of correlation, the average value of a third trait (the class to which this animal belongs: herbivore). In this case, we are not talking about the specific value of this variable (i.e., on a nominal scale this animal could take on different values: it could be a goat, a ram, or a bull...).

Now let us turn to the term "regression". Strictly speaking, it is not connected with the meaning of the statistical problems that are solved with the help of this method. An explanation of the term can be given only on the basis of the history of the development of methods for studying relationships between features. One of the first examples of studies of this kind was the work of the statisticians F. Galton and K. Pearson, who tried to find a pattern between the heights of fathers and their children according to two observable features (where X is the father's height and Y is the child's height). In their study, they confirmed the initial hypothesis that, on average, tall fathers have tall children. The same principle applies to short fathers and their children. However, if the scientists had stopped there, their works would never have been mentioned in textbooks on statistics. Within the already confirmed hypothesis, the researchers found another pattern. They showed that very tall fathers have children who are tall on average but do not differ much in height from children whose fathers, though above average, are not very different from the average height. The same is true for fathers of very small stature (deviating strongly from the average of the short group): their children, on average, did not differ in height from peers whose fathers were simply short. The function describing this regularity they called a regression function. After this study, all equations describing similar functions and constructed in a similar way began to be called regression equations.

Regression analysis is one of the methods of multivariate statistical data analysis, combining a set of statistical techniques designed to study or model relationships between one dependent and several (or one) independent variables. The dependent variable, according to the tradition accepted in statistics, is called the response and is denoted Y. The independent variables are called predictors and are denoted x. In the course of the analysis, some variables will turn out to be weakly related to the response and will eventually be excluded from the analysis. The remaining variables associated with the dependent one may also be called factors.

Regression analysis makes it possible to predict the values of one variable depending on another variable (for example, the propensity for unconventional political behavior depending on the level of education) or on several variables. It is calculated on a PC. Compiling a regression equation that allows one to measure the degree of dependence of the controlled feature on the factor features requires the involvement of professional mathematician-programmers. Regression analysis can provide an invaluable service in building predictive models of the development of a political situation, in assessing the causes of social tension, and in conducting theoretical experiments. Regression analysis is actively used to study the impact on the electoral behavior of citizens of a number of socio-demographic parameters: gender, age, profession, place of residence, nationality, level and nature of income.

In relation to regression analysis, the concepts of independent and dependent variables are used. An independent variable is a variable that explains or causes a change in another variable. A dependent variable is a variable whose value is explained by the influence of the first variable. For example, in the presidential elections of 2004, the determining factors, i.e. the independent variables, were indicators such as the stabilization of the financial situation of the country's population, the level of popularity of the candidates, and the incumbency factor. The percentage of votes cast for the candidates can be considered the dependent variable. Similarly, in the pair of variables "age of the voter" and "level of electoral activity", the first is independent and the second dependent.

Regression analysis allows you to solve the following problems:

  • 1) establish the very fact of the presence or absence of a statistically significant relationship between y and x;
  • 2) build the best (in the statistical sense) estimates of the regression function;
  • 3) for given values of x, build a prediction for the unknown y;
  • 4) evaluate the specific weight of the influence of each factor x on y and, accordingly, exclude insignificant features from the model;
  • 5) by identifying causal relationships between the variables, partially manage the values of y by adjusting the values of the explanatory variables x.

Regression analysis is associated with the need to select mutually independent variables that affect the value of the indicator under study, to determine the form of the regression equation, and to evaluate the parameters using statistical methods for processing primary sociological data. This type of analysis is based on the idea of the form, direction and closeness (density) of the relationship. Paired and multiple regression are distinguished depending on the number of features studied. In practice, regression analysis is usually performed in conjunction with correlation analysis. A regression equation describes a numerical relationship between quantities, expressed as a tendency for one variable to increase or decrease when another increases or decreases. Linear and non-linear regression are also distinguished. When describing political processes, both variants of regression are found equally often.

A scatterplot for the distribution of the interdependence of interest in political articles (Y) and the education of respondents (X) represents a linear regression (Fig. 30).

Fig. 30.

A scatterplot for the distribution of the level of electoral activity (Y) and the age of the respondent (X) (conditional example) represents a non-linear regression (Fig. 31).

Fig. 31.

To describe the relationship between the two features (X and Y), a paired regression model uses the linear equation

y = a + bx + e,

where e is a random error term of the equation reflecting the variation of the features, i.e. the deviation of the equation from "linearity".

To estimate the coefficients a and b, the least squares method is used, which requires that the sum of the squared deviations of each point on the scatter plot from the regression line be minimal. The coefficients a and b can be calculated from the system of normal equations:

n·a + b·Σxi = Σyi,
a·Σxi + b·Σxi² = Σxi·yi.

The least squares method gives estimates of the coefficients a and b for which the line passes through the point with coordinates (x̄, ȳ), i.e. the relation ȳ = a + b·x̄ holds. The graphical representation of the regression equation is called the theoretical regression line. With a linear dependence, the regression coefficient is represented on the graph by the tangent of the angle of inclination of the theoretical regression line to the x-axis. The sign of the coefficient shows the direction of the relationship: if it is greater than zero, the relationship is direct; if less, inverse.
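
As a sketch of this computation (Python; the data points are invented for illustration), the system of normal equations can be solved directly:

```python
# Solving the normal equations of the least squares method for y = a + b*x.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.9, 5.1, 7.0, 9.2, 10.9])
n = len(x)

# n*a      + b*sum(x)   = sum(y)
# a*sum(x) + b*sum(x^2) = sum(x*y)
A = np.array([[n,         np.sum(x)],
              [np.sum(x), np.sum(x ** 2)]])
rhs = np.array([np.sum(y), np.sum(x * y)])
a, b = np.linalg.solve(A, rhs)

# The fitted line passes through the point of means (x-bar, y-bar):
assert np.isclose(np.mean(y), a + b * np.mean(x))
print(f"y = {a:.4f} + {b:.4f}*x")
```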

The following example from the study "Political Petersburg-2006" (Table 56) shows a linear relationship between citizens' perceptions of the degree of satisfaction with their lives in the present and their expectations of changes in the quality of life in the future. The connection is direct and linear (the standardized regression coefficient is 0.233, the significance level is 0.000). In this case, the regression coefficient is not high, but it exceeds the lower limit of the statistically significant indicator (the lower limit of the square of a statistically significant Pearson coefficient).

Table 56

The impact of the quality of life of citizens in the present on expectations

(St. Petersburg, 2006)

* Dependent variable: "How do you think your life will change in the next 2-3 years?"

In political life, the value of the variable under study most often depends on several features simultaneously. For example, the level and nature of political activity are simultaneously influenced by the political regime of the state, political traditions, the peculiarities of the political behavior of people in a given area and in the respondent's social microgroup, his age, education, income level, political orientation, etc. In this case, one needs to use a multiple regression equation, which has the following form:

y = b0 + b1·x1 + b2·x2 + ... + bk·xk + e,

where the coefficient bi is a partial regression coefficient. It shows the contribution of each independent variable to determining the values of the dependent (outcome) variable. If a partial regression coefficient is close to 0, then we can conclude that there is no direct relationship between the corresponding independent variable and the dependent variable.

The calculation of such a model can be performed on a PC using matrix algebra. Multiple regression makes it possible to reflect the multifactorial nature of social relationships and to clarify the degree of influence of each factor individually and of all of them together on the resulting feature.
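
A sketch of such a matrix calculation (Python; the predictor names and the simulated data are assumptions for illustration, not survey data):

```python
# Multiple regression "by matrix algebra": b = (X^T X)^(-1) X^T y.
import numpy as np

rng = np.random.default_rng(0)
n = 200
age = rng.uniform(18, 80, n)         # hypothetical predictor
education = rng.uniform(8, 20, n)    # hypothetical predictor (years)
income = rng.uniform(10, 100, n)     # hypothetical predictor

# Simulated "political activity" with known partial coefficients plus noise.
activity = (0.5 + 0.03 * age + 0.10 * education + 0.01 * income
            + rng.normal(0, 0.5, n))

X = np.column_stack([np.ones(n), age, education, income])  # design matrix
b = np.linalg.solve(X.T @ X, X.T @ activity)               # partial coefficients
print(b)  # approx. [0.5, 0.03, 0.10, 0.01]
```

Each entry of b is a partial regression coefficient: the contribution of one predictor with the others held fixed.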

The coefficient denoted b is called the linear regression coefficient and shows the strength of the relationship between the variation of the factor feature X and the variation of the effective feature Y. This coefficient measures the strength of the relationship in the absolute units of measurement of the features. However, the closeness of the correlation between features can also be expressed in terms of the standard deviation of the resulting feature (such a coefficient is called the correlation coefficient). Unlike the regression coefficient b, the correlation coefficient does not depend on the accepted units of measurement of the features and is therefore comparable across any features. Usually, the relationship is considered strong at r > 0.7, of medium closeness at 0.5 < r < 0.7, and weak at r < 0.5.

As is known, the closest connection is a functional one, when each individual value of x can be uniquely assigned a value of Y. Thus, the closer the correlation coefficient is to 1, the closer the relationship is to a functional one. The significance level for regression analysis should not exceed 0.001.

The correlation coefficient was long regarded as the main indicator of the closeness of the relationship between features. Later, however, the coefficient of determination became such an indicator. The meaning of this coefficient is as follows: it reflects the share of the total variance of the resulting feature Y that is explained by the variance of the feature x. It is found by simply squaring the correlation coefficient (which varies from 0 to 1) and, for a linear relationship, reflects the share, from 0 (0%) to 1 (100%), of the values of the feature Y determined by the values of the feature x. It is written as R², and in the regression output tables of the SPSS package it appears without the exponent, as R Square.

Let us outline the main problems of constructing a multiple regression equation.

  • 1. Choice of the factors included in the regression equation. At this stage, the researcher first compiles a general list of the main causes that, according to theory, determine the phenomenon under study. Then he must select the features for the regression equation. The main selection rule is that the factors included in the analysis should correlate with each other as little as possible; only in this case can a quantitative measure of influence be attributed to a particular factor-feature.
  • 2. Choice of the form of the multiple regression equation (in practice, the linear or linear-logarithmic form is used more often). To use multiple regression, the researcher must first build a hypothetical model of the influence of several independent variables on the resulting one. For the obtained results to be reliable, the model must exactly match the real process, i.e. the relationship between the variables must be linear, not a single significant independent variable may be ignored, and, likewise, not a single variable that is not directly related to the process under study may be included in the analysis. In addition, all measurements of the variables must be extremely accurate.

From the above description follows a number of conditions for the application of this method, without which it is impossible to proceed to the procedure of multiple regression analysis (MRA). Only compliance with all of the following points allows you to correctly carry out regression analysis.

In statistical modeling, regression analysis is a set of methods used to estimate the relationships between variables. This mathematical toolkit includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes if one of the independent variables changes while the other independent variables remain fixed.

In all cases, the estimated target is a function of the independent variables and is called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described using a probability distribution.

Tasks of regression analysis

This statistical research method is widely used for forecasting, where its use has a significant advantage, but it can sometimes lead to illusory or spurious relationships, so it is recommended to use it carefully in this matter, since, for example, correlation does not imply causation.

A large number of methods have been developed for performing regression analysis, such as linear and ordinary least squares regression, which are parametric. Their essence is that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression allows the regression function to lie in a certain set of functions, which can be infinite-dimensional.

As a statistical research method, regression analysis in practice depends on the form of the data-generating process and on how it relates to the regression approach. Since the true form of the data-generating process is typically unknown, regression analysis often depends to some extent on assumptions about this process. These assumptions are sometimes testable if enough data are available. Regression models are often useful even when the assumptions are moderately violated, although they may not perform at their best.

In a narrower sense, regression can refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. The case of a continuous output variable is also called metric regression to distinguish it from related problems.

History

The earliest form of regression is the well-known method of least squares. It was published by Legendre in 1805 and by Gauss in 1809. Legendre and Gauss applied the method to the problem of determining, from astronomical observations, the orbits of bodies around the Sun (mainly comets, but later also the newly discovered minor planets). Gauss published a further development of the theory of least squares in 1821, including a variant of the Gauss-Markov theorem.

The term "regression" was coined by Francis Galton in the 19th century to describe a biological phenomenon. The bottom line was that the growth of descendants from the growth of ancestors, as a rule, regresses down to the normal average. For Galton, regression had only this biological meaning, but later his work was taken up by Udni Yoley and Karl Pearson and taken to a more general statistical context. In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is considered to be Gaussian. This assumption was rejected by Fischer in the papers of 1922 and 1925. Fisher suggested that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this regard, Fisher's suggestion is closer to Gauss's 1821 formulation. Prior to 1970, it sometimes took up to 24 hours to get the result of a regression analysis.

Regression analysis methods continue to be an area of ​​active research. In recent decades, new methods have been developed for robust regression; regressions involving correlated responses; regression methods that accommodate various types of missing data; nonparametric regression; Bayesian regression methods; regressions in which predictor variables are measured with error; regressions with more predictors than observations; and causal inferences with regression.

Regression Models

Regression analysis models include the following variables:

  • Unknown parameters, denoted as beta, which can be a scalar or a vector.
  • Independent variables, X.
  • Dependent variables, Y.

In various areas of science where regression analysis is applied, different terms are used instead of dependent and independent variables, but in all cases the regression model relates Y to a function of X and β.

The approximation is usually written in the form E(Y | X) = F(X, β). To perform regression analysis, the form of the function F must be determined. Sometimes it is based on knowledge about the relationship between Y and X that does not rely on the data. If such knowledge is not available, a flexible or convenient form of F is chosen.

Dependent variable Y

Let us now assume that the vector of unknown parameters β has length k. To perform a regression analysis, the user must provide information about the dependent variable Y:

  • If N data points of the form (Y, X) are observed, where N < k, most classical approaches to regression analysis cannot be carried out, since the system of equations defining the regression model is underdetermined and there are not enough data to recover β.
  • If exactly N = k points are observed and the function F is linear, then the equation Y = F(X, β) can be solved exactly rather than approximately. This reduces to solving a set of N equations with N unknowns (the elements of β) that has a unique solution as long as the components of X are linearly independent. If F is non-linear, a solution may not exist, or there may be many solutions.
  • The most common situation is N > k data points. In this case, there is enough information in the data to estimate a unique value of β that best fits the data, and the regression model applied to the data can be seen as an overdetermined system in β.

In the latter case, regression analysis provides tools for:

  • Finding a solution for the unknown parameters β that will, for example, minimize the distance between the measured and predicted values of Y.
  • Under certain statistical assumptions, using the excess information to provide statistical information about the unknown parameters β and the predicted values of the dependent variable Y.

Required number of independent measurements

Consider a regression model that has three unknown parameters: β0, β1 and β2. Suppose that the experimenter makes 10 measurements, all at the same value of the independent variable vector X. In this case, regression analysis does not yield a unique set of estimates. The best that can be done is to estimate the mean and standard deviation of the dependent variable Y. Similarly, by measuring at two different values of X, one can get enough data for a regression with two unknowns, but not for three or more unknowns.

If the experimenter's measurements are taken at three different values of the independent variable vector X, then regression analysis provides a unique set of estimates for the three unknown parameters in β.

In the case of general linear regression, the above statement is equivalent to the requirement that the matrix XᵀX is invertible.
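
The requirement can be checked numerically. The following sketch (Python; the measurement values are invented for illustration) shows how the rank of XᵀX grows with the number of distinct measurement points:

```python
# Identifiability of beta_0, beta_1, beta_2 depends on the number of
# distinct values of X at which measurements are taken.
import numpy as np

def rank_of_normal_matrix(x_values):
    x = np.asarray(x_values, dtype=float)
    X = np.column_stack([np.ones_like(x), x, x ** 2])  # three parameters
    return np.linalg.matrix_rank(X.T @ X)

print(rank_of_normal_matrix([5.0] * 10))        # 1: only the mean is estimable
print(rank_of_normal_matrix([5.0, 7.0] * 5))    # 2: two parameters estimable
print(rank_of_normal_matrix([5.0, 7.0, 9.0]))   # 3: X^T X invertible, unique fit
```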

Statistical Assumptions

When the number of measurements N is greater than the number of unknown parameters k, and the measurements contain errors εi, then, as a rule, the excess information contained in the measurements is used for statistical predictions about the unknown parameters. This excess of information is called the degrees of freedom of the regression.

Underlying Assumptions

Classic assumptions for regression analysis include:

  • The sample is representative of the population for which inference and prediction are made.
  • The error is a random variable with a mean of zero conditional on the explanatory variables.
  • The independent variables are measured without error.
  • The independent variables (predictors) are linearly independent, i.e. no predictor can be expressed as a linear combination of the others.
  • The errors are uncorrelated, i.e. the error covariance matrix is diagonal, and each non-zero element is the variance of the error.
  • The error variance is constant across observations (homoscedasticity). If not, weighted least squares or other methods can be used.

These are sufficient conditions for the least squares estimates to possess the required properties; in particular, these assumptions mean that the parameter estimates will be unbiased, consistent and efficient, especially within the class of linear estimates. It is important to note that real data rarely satisfy all the conditions; the method is used even when the assumptions are not exactly correct. Deviation from the assumptions can sometimes be used as a measure of how useful the model is. Many of these assumptions can be relaxed in more advanced methods. Reports of statistical analyses typically include tests of the assumptions against the sample data and an assessment of the usefulness of the model.

In addition, variables in some cases refer to values measured at point locations. There may be spatial trends and spatial autocorrelation in the variables that violate the statistical assumptions. Geographically weighted regression is one method that deals with such data.

In linear regression, the defining feature is that the dependent variable Yi is a linear combination of the parameters. For example, simple linear regression models n points using one independent variable xi and two parameters, β0 and β1:

yi = β0 + β1·xi + εi, i = 1, ..., n.

In multiple linear regression, there are several independent variables or their functions.

Random sampling from the population makes it possible to estimate the parameters of the sample linear regression model.

In this respect, the least squares method is the most popular. It provides parameter estimates that minimize the sum of the squared residuals. This kind of minimization (which is typical of linear regression) of this function leads to a set of normal equations, a set of linear equations in the parameters, which are solved to obtain the parameter estimates.

Assuming further that the population error term is normally distributed, the researcher can use these estimates of the standard errors to construct confidence intervals and carry out hypothesis tests about the parameters.
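
A sketch of this procedure under the classical assumptions above (Python; the data are simulated for illustration):

```python
# OLS estimates, their standard errors, and 95% confidence intervals.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, n)
y = 1.5 + 2.0 * x + rng.normal(0, 1, n)   # true parameters: 1.5 and 2.0

X = np.column_stack([np.ones(n), x])
beta = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta
k = X.shape[1]
s2 = residuals @ residuals / (n - k)       # error variance, df = n - k
cov = s2 * np.linalg.inv(X.T @ X)          # covariance matrix of the estimates
se = np.sqrt(np.diag(cov))                 # standard errors

t_crit = stats.t.ppf(0.975, df=n - k)
for b, s in zip(beta, se):
    print(f"{b:.3f} +/- {t_crit * s:.3f}") # 95% confidence interval
```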

Nonlinear Regression Analysis

When the function is not linear in the parameters, the sum of squares must be minimized by an iterative procedure. This introduces many complications, which define the differences between linear and non-linear least squares methods. Consequently, the results of regression analysis when using a non-linear method are sometimes unpredictable.
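
A sketch of such an iterative fit (Python, using scipy's curve_fit, which minimizes the sum of squares iteratively; the model and data are illustrative assumptions):

```python
# Non-linear least squares: the model is non-linear in the parameter b.
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c):
    return a * np.exp(-b * x) + c

rng = np.random.default_rng(2)
x = np.linspace(0, 4, 40)
y = model(x, 2.5, 1.3, 0.5) + rng.normal(0, 0.05, x.size)

# The fit is iterative and needs a starting point p0; a poor choice can
# lead to a different local minimum, which is one source of the
# unpredictability mentioned above.
params, cov = curve_fit(model, x, y, p0=[1.0, 1.0, 0.0])
print(params)  # estimates of a, b, c
```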

Calculation of power and sample size

Here, as a rule, there are no consistent methods for determining the number of observations relative to the number of independent variables in the model. The first rule of thumb was proposed by Good and Hardin and has the form N = m^n, where N is the sample size, n is the number of explanatory variables, and m is the number of observations needed to achieve the desired accuracy if the model had only one explanatory variable. For example, a researcher builds a linear regression model using a dataset that contains 1000 patients (N). If the researcher decides that five observations are needed to accurately determine the line (m = 5), then the maximum number of explanatory variables the model can support is 4, since 5^4 = 625 ≤ 1000 < 5^5 = 3125.

Other Methods

Although the parameters of a regression model are usually estimated using the least squares method, there are other methods that are used much less often. For example, these are the following methods:

  • Bayesian methods (for example, Bayesian linear regression).
  • Percentage regression, used in situations where reducing percentage errors is considered more appropriate.
  • Least absolute deviations, which is more robust in the presence of outliers and leads to quantile regression.
  • Nonparametric regression, which requires a large number of observations and calculations.
  • Distance metric learning, where a meaningful distance metric is learned for the given input space.

Software

All major statistical software packages perform least squares regression analysis. Simple linear regression and multiple regression analysis can be carried out in some spreadsheet applications as well as on some calculators. While many statistical software packages can perform various types of nonparametric and robust regression, these methods are less standardized; different software packages implement different methods. Specialized regression software has been developed for use in areas such as survey analysis and neuroimaging.

The main feature of regression analysis is that it can be used to obtain specific information about the form and nature of the relationship between the variables under study.

The sequence of stages of regression analysis

Let us briefly consider the stages of regression analysis.

    Task formulation. At this stage, preliminary hypotheses about the dependence of the studied phenomena are formed.

    Definition of dependent and independent (explanatory) variables.

    Collection of statistical data. Data must be collected for each of the variables included in the regression model.

    Formulation of a hypothesis about the form of connection (simple or multiple, linear or non-linear).

    Determination of the regression function (consists in calculating the numerical values of the parameters of the regression equation).

    Evaluation of the accuracy of regression analysis.

    Interpretation of the obtained results. The results of the regression analysis are compared with preliminary hypotheses. The correctness and plausibility of the obtained results are evaluated.

    Prediction of unknown values ​​of the dependent variable.

With the help of regression analysis, one can solve problems of forecasting and classification. Predicted values are calculated by substituting the values of the explanatory variables into the regression equation. The classification problem is solved as follows: the regression line divides the entire set of objects into two classes; the part of the set where the value of the function is greater than zero belongs to one class, and the part where it is less than zero belongs to the other.
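
Both uses can be expressed in a few lines. The sketch below (Python; the coefficients and new values are invented for illustration) predicts by substitution and classifies by the sign of the regression function:

```python
import numpy as np

# Fitted equation y = a + b*x with assumed, already-estimated coefficients.
a, b = -4.0, 0.8

x_new = np.array([2.0, 5.0, 7.5, 10.0])
y_pred = a + b * x_new                 # forecasting: substitute into the equation
classes = np.where(y_pred > 0, 1, 0)   # classification: sign of the function
print(y_pred, classes)
```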

Tasks of regression analysis

Consider the main tasks of regression analysis: establishing the form of the dependence, determining the regression function, and estimating the unknown values of the dependent variable.

Establishing the form of dependence.

The nature and form of the relationship between variables can form the following types of regression:

    positive linear regression (expressed as a uniform growth of the function);

    positive uniformly accelerating regression;

    positive uniformly increasing regression;

    negative linear regression (expressed as a uniform drop in function);

    negative uniformly accelerated decreasing regression;

    negative uniformly decreasing regression.

However, the varieties described are usually not found in pure form, but in combination with each other. In this case, one speaks of combined forms of regression.

Definition of the regression function.

The second task is to determine the effect on the dependent variable of the main factors or causes, all other things being equal, and subject to the exclusion of the impact of random elements on the dependent variable. The regression function is defined as a mathematical equation of one type or another.

Estimation of unknown values ​​of the dependent variable.

The solution of this problem is reduced to solving a problem of one of the following types:

    Estimation of the values ​​of the dependent variable within the considered interval of the initial data, i.e. missing values; this solves the problem of interpolation.

    Estimating the future values ​​of the dependent variable, i.e. finding values ​​outside the given interval of the initial data; this solves the problem of extrapolation.

Both problems are solved by substituting the found parameter estimates and the values of the independent variables into the regression equation. The result of solving the equation is an estimate of the value of the target (dependent) variable.

Let's look at some of the assumptions that regression analysis relies on.

The linearity assumption: it is assumed that the relationship between the variables under consideration is linear. So, in this example, we built a scatterplot and were able to see a clear linear relationship. If the scatterplot of the variables shows a clear absence of a linear relationship, i.e. the relationship is non-linear, non-linear methods of analysis should be used.

The assumption of normality of residuals: it is assumed that the distribution of the differences between the predicted and observed values is normal. To visually assess the nature of the distribution, one can use histograms of the residuals.
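
A sketch of such a visual check (Python; the fitted model and data are illustrative):

```python
# Visual normality check: histogram of the residuals of a fitted line.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 3.0 + 1.2 * x + rng.normal(0, 0.7, 200)

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)   # observed minus predicted

plt.hist(residuals, bins=20)              # roughly bell-shaped if normal
plt.title("Histogram of residuals")
plt.show()
```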

When using regression analysis, one should take into account its main limitation. It consists in the fact that regression analysis allows you to detect only dependencies, and not the relationships that underlie these dependencies.

Regression analysis makes it possible to assess the degree of association between variables by calculating the expected value of a variable based on several known values.

Regression equation.

The regression equation looks like this: Y=a+b*X

Using this equation, the variable Y is expressed in terms of the constant a and the slope of the line b multiplied by the value of the variable X. The constant a is also called the intercept, and the slope b is called the regression coefficient, or B coefficient.

In most cases (if not always) there is a certain scatter of observations about the regression line.

A residual is the deviation of an individual point (observation) from the regression line (the predicted value).

To solve a regression analysis problem in MS Excel, select Tools → Data Analysis from the menu and choose the Regression analysis tool. Specify the X and Y input intervals. The Y input interval is the range of the dependent data being analyzed, and it must consist of a single column. The X input interval is the range of the independent data to be analyzed. The number of input ranges must not exceed 16.

At the output of the procedure, in the output range, we obtain the report given in Tables 8.3a-8.3c.

RESULTS

Table 8.3a. Regression statistics

Regression statistics
Multiple R            0.998364
R-square              0.99673
Normalized R-square   0.996321
Standard error        0.42405
Observations          10

Let us first consider the upper part of the calculations presented in Table 8.3a, the regression statistics.

The R-square value, also called the measure of certainty, characterizes the quality of the resulting regression line. This quality is expressed by the degree of correspondence between the original data and the regression model (the calculated data). The measure of certainty always lies within the interval [0; 1].

In most cases, the R-square value lies strictly between these values, called the extremes, i.e. between zero and one.

If the R-square value is close to one, this means that the constructed model explains almost all of the variability of the corresponding variables. Conversely, an R-square value close to zero means poor quality of the constructed model.

In our example, the measure of certainty is 0.99673, which indicates a very good fit of the regression line to the original data.

Multiple R, the coefficient of multiple correlation R, expresses the degree of dependence between the independent variables (X) and the dependent variable (Y).

Multiple R is equal to the square root of the coefficient of determination; this quantity takes values in the range from zero to one.

In simple linear regression analysis, multiple R is equal to the Pearson correlation coefficient. Indeed, multiple R in our case is equal to the Pearson correlation coefficient from the previous example (0.998364).

Table 8.3b. Regression coefficients*

               Coefficients   Standard error   t-statistic
Y-intercept    2.694545455    0.33176878       8.121757129
Variable X 1   2.305454545    0.04668634       49.38177965

* A truncated version of the calculations is given

Now consider the middle part of the calculations, presented in Table 8.3b. Here the regression coefficient b (2.305454545) and the offset along the y-axis, i.e. the constant a (2.694545455), are given.

Based on the calculations, we can write the regression equation as follows:

Y= x*2.305454545+2.694545455

The direction of the relationship between the variables is determined based on the signs (negative or positive) of the regression coefficients (coefficient b).

If the sign of the regression coefficient is positive, the relationship between the dependent variable and the independent variable will be positive. In our case, the sign of the regression coefficient is positive, therefore, the relationship is also positive.

If the sign of the regression coefficient is negative, the relationship between the dependent variable and the independent variable is negative (inverse).
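
The same coefficients can be reproduced outside Excel. In the sketch below (Python), the x and y values are reconstructed from the report itself (y as predicted Y plus residual from Table 8.3c, x from the fitted equation), so they should be treated as a reconstruction rather than the original worksheet data:

```python
import numpy as np

x = np.array([3, 2, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float)
y = np.array([9.0, 7.0, 12.0, 15.0, 17.0, 19.0, 21.0, 23.4, 25.6, 27.8])

slope, intercept = np.polyfit(x, y, deg=1)
print(intercept, slope)   # approx. 2.694545455 and 2.305454545 (Table 8.3b)

r = np.corrcoef(x, y)[0, 1]
print(r, r ** 2)          # approx. 0.998364 (Multiple R) and 0.99673 (R-square)
```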

Table 8.3c presents the residual output. For these results to appear in the report, the "Residuals" checkbox must be activated when launching the "Regression" tool.

RESIDUAL OUTPUT

Table 8.3c. Residuals

Observation   Predicted Y    Residuals       Standard residuals
1             9.610909091    -0.610909091    -1.528044662
2             7.305454545    -0.305454545    -0.764022331
3             11.91636364    0.083636364     0.209196591
4             14.22181818    0.778181818     1.946437843
5             16.52727273    0.472727273     1.182415512
6             18.83272727    0.167272727     0.418393181
7             21.13818182    -0.138181818    -0.34562915
8             23.44363636    -0.043636364    -0.109146047
9             25.74909091    -0.149090909    -0.372915662
10            28.05454545    -0.254545455    -0.636685276

Using this part of the report, we can see the deviation of each point from the constructed regression line. The largest absolute value of a residual in our case is 0.778, the smallest is 0.043. For a better interpretation of these data, we use the graph of the original data and the constructed regression line presented in Fig. 8.3. As can be seen, the regression line is fitted quite accurately to the values of the original data.

It should be borne in mind that the example under consideration is quite simple, and it is far from always possible to construct a high-quality linear regression line.

Fig. 8.3. Initial data and regression line

The problem of estimating unknown future values of the dependent variable from known values of the independent variable, i.e. the forecasting problem, remains to be considered.

Given the regression equation, the forecasting problem reduces to evaluating the equation Y = x*2.305454545 + 2.694545455 at known values of x. The results of predicting the dependent variable Y six steps ahead are presented in Table 8.4.

Table 8.4. Y variable prediction results

Y(predicted)
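
Since the report does not show which x values were used for the forecast, the sketch below (Python) assumes that the x values simply continue the reconstructed data series (x = 12, ..., 17); that assumption is for illustration only:

```python
import numpy as np

a, b = 2.694545455, 2.305454545     # coefficients from Table 8.3b
x_future = np.arange(12, 18)        # assumed future x values (not in the report)
y_forecast = a + b * x_future       # substitute into Y = a + b*x
print(np.round(y_forecast, 3))
```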

Thus, as a result of using regression analysis in the Microsoft Excel package, we:

    built a regression equation;

    established the form of dependence and the direction of the relationship between the variables - a positive linear regression, which is expressed in a uniform growth of the function;

    established the direction of the relationship between the variables;

    assessed the quality of the resulting regression line;

    were able to see the deviations of the calculated data from the data of the original set;

    predicted the future values ​​of the dependent variable.

If the regression function is defined, interpreted and justified, and the assessment of the accuracy of the regression analysis meets the requirements, then the constructed model and the predicted values can be considered sufficiently reliable.

The predicted values ​​obtained in this way are the average values ​​that can be expected.

In this paper, we reviewed the main characteristics of descriptive statistics, among them such concepts as the mean, the median, the maximum, the minimum and other characteristics of data variation.

There was also a brief discussion of the concept of outliers. The characteristics considered belong to so-called exploratory data analysis; its conclusions may apply not to the general population but only to the data sample. Exploratory data analysis is used to draw preliminary conclusions and form hypotheses about the population.

The basics of correlation and regression analysis, their tasks and possibilities of practical use were also considered.