Daily update September 17 2023

In the last class, I learned about:

Overfitting: When a model learns the training data too well but struggles to generalize to new data due to its excessive complexity.

Underfitting: When a model is too simple to capture the patterns in the training data and performs poorly both on the training and new data.

Overfitting and underfitting guide the selection of appropriate model complexity when training a model on any data. Data scientists use techniques like cross-validation and regularization to prevent overfitting (excessive model complexity) or underfitting (overly simplistic models). Hyperparameter tuning is essential to fine-tune models and strike the right balance between these extremes. Regular model evaluation and maintenance ensure that the chosen model continues to generalize well and perform effectively on new data throughout the project.
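To make this concrete, here is a minimal sketch (my own example, using scikit-learn and NumPy, which the class did not specify) that uses 5-fold cross-validation to compare polynomial models of different complexity, with a small ridge penalty as the regularizer:

```python
# Compare an underfit, a reasonable, and an overfit model with cross-validation.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.3, size=100)  # noisy nonlinear data

for degree in (1, 4, 15):  # too simple, about right, too complex
    model = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1e-3))
    scores = cross_val_score(model, X, y, cv=5)  # default scoring for regressors is R^2
    print(f"degree={degree:2d}  mean CV R^2 = {scores.mean():.3f}")
```

The underfit degree-1 model and the overfit degree-15 model should both score worse on held-out folds than the moderate model, which is exactly the balance that hyperparameter tuning aims for.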


Daily update (online class) September 15 2023

Instead of the Breusch-Pagan test, I intend to apply the White test and Goldfeld-Quandt test, identify any limits of each test when determining the heteroscedasticity for the diabetes and inactivity data models, and compare the results to the Breusch-Pagan test. This aids in my decision-making during analysis.
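As a sketch of how that comparison might look in code (assuming statsmodels as the tool, with synthetic placeholder data standing in for the actual diabetes and inactivity columns):

```python
# Compare Breusch-Pagan, White, and Goldfeld-Quandt heteroscedasticity tests.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white, het_goldfeldquandt

# Placeholder data: the variance of the "diabetes" values grows with "inactivity",
# so all three tests should flag heteroscedasticity.
rng = np.random.default_rng(1)
inactivity = rng.uniform(10, 35, size=200)
diabetes = 2 + 0.2 * inactivity + rng.normal(scale=0.05 * inactivity)

X = sm.add_constant(inactivity)
resid = sm.OLS(diabetes, X).fit().resid

bp_lm, bp_pval, _, _ = het_breuschpagan(resid, X)
w_lm, w_pval, _, _ = het_white(resid, X)
gq_f, gq_pval, _ = het_goldfeldquandt(diabetes, X)

print(f"Breusch-Pagan   p = {bp_pval:.4f}")
print(f"White           p = {w_pval:.4f}")
print(f"Goldfeld-Quandt p = {gq_pval:.4f}")
```

One limitation worth noting during the comparison: the Goldfeld-Quandt test assumes the variance changes with the ordering variable (it splits the sample into two groups), whereas White's test can pick up more general patterns but uses more degrees of freedom in its auxiliary regression.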

In the upcoming class, I will surely ask the instructor for clarification on any areas that were confusing to me throughout my analysis.

Daily Update 09/13/23

As I mentioned in my previous blog post, I have looked into and comprehended the ideas of Kurtosis and the least squares linear regression method.

The least squares linear regression approach represents the relationship between the variables in a scatterplot. The method fits a line to the data points by minimizing the sum of the squared vertical distances between the line and the points. It is also referred to as a trendline or the line of best fit.

The linear equation: y = b + mx

where y = dependent variable

x = independent variable

b = y-intercept

m = slope of the line

 

To find the value of m, the following formula is used:

m = [NΣ(xy) − ΣxΣy] / [NΣ(x²) − (Σx)²]

To find the value of b, the following formula is used:

b = (Σy − mΣx) / N

where N = number of observations
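Applying these formulas to a handful of made-up points is straightforward; the small NumPy sketch below (my own example, not from class) computes m and b directly and cross-checks them against np.polyfit:

```python
# Fit the line of best fit from the summation formulas, then verify with polyfit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.4, 10.1])
N = len(x)

m = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x**2) - np.sum(x)**2)
b = (np.sum(y) - m * np.sum(x)) / N
print(f"line of best fit: y = {b:.3f} + {m:.3f}x")

m_np, b_np = np.polyfit(x, y, 1)  # returns [slope, intercept]
print(f"np.polyfit check: y = {b_np:.3f} + {m_np:.3f}x")
```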

  • Kurtosis is a statistical term that characterizes the shape of a probability distribution, providing information about its peakedness and tails. It helps in analyzing a data set's outcomes: the degree to which data values cluster around the mean determines how peaked the distribution is.
  • Positive kurtosis denotes heavier tails and a sharper peak, meaning the kurtosis is greater than that of the normal distribution, which has a kurtosis of 3.
  • Negative kurtosis indicates flatter tails and a lower peak, i.e., a kurtosis value below 3.
  • A kurtosis equal to that of the normal distribution (a value of 3, or zero excess kurtosis) suggests moderate tails and a medium-height peak.
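A quick way to see these three cases is to compute the sample kurtosis of a few distributions; the sketch below assumes SciPy and NumPy (my choice of tools, not specified in class):

```python
# Raw kurtosis (normal = 3) and excess kurtosis (normal = 0) for three samples.
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(2)
samples = {
    "normal":   rng.normal(size=10_000),            # kurtosis near 3, excess near 0
    "t (df=6)": rng.standard_t(df=6, size=10_000),  # heavy tails: positive excess
    "uniform":  rng.uniform(-1, 1, size=10_000),    # flat tails: negative excess
}

for name, data in samples.items():
    raw = kurtosis(data, fisher=False)    # scale where the normal distribution is 3
    excess = kurtosis(data, fisher=True)  # scale where the normal distribution is 0
    print(f"{name:9s} kurtosis = {raw:5.2f}   excess = {excess:5.2f}")
```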

 

In regression analysis, the Breusch-Pagan test is used to check for heteroscedasticity.

As reviewed in class today, the test can be illustrated with a coin-toss example. Suppose we toss a coin 100 times and note the results; the question is whether the variance of the outcomes (heads or tails) stays constant across the tosses.

There are two types of hypotheses, as follows.

Null Hypothesis (H0): The coin toss data do not exhibit heteroscedasticity, indicating that the variance of the results (heads or tails) is constant over all tosses.

Alternative Hypothesis (Ha): The coin toss data exhibit heteroscedasticity, indicating that the variance of the results (heads or tails) may not be constant and may change from toss to toss.

  • If the p-value is greater than the significance level (e.g., 0.05), we do not have enough evidence to reject the null hypothesis. This suggests that the variance of the coin toss outcomes remains relatively constant, and there is no significant heteroscedasticity.

 

  • If the p-value is less than the significance level, we may reject the null hypothesis in favor of the alternative hypothesis. This implies that there is evidence of heteroscedasticity, indicating that the variance of coin toss outcomes may not be constant, and there could be variations in the coin's behavior across the tosses.
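The sketch below runs this coin-toss version of the test end to end, assuming statsmodels (the class example was described only in words, so the data and code here are my own illustration):

```python
# Breusch-Pagan test on 100 simulated coin tosses.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
toss_number = np.arange(1, 101)                        # tosses 1..100
outcomes = rng.integers(0, 2, size=100).astype(float)  # 1 = heads, 0 = tails

# Regress the outcomes on the toss number, then test whether the residual
# variance is constant across tosses (H0) or changes with the toss number (Ha).
X = sm.add_constant(toss_number)
resid = sm.OLS(outcomes, X).fit().resid
lm_stat, p_value, _, _ = het_breuschpagan(resid, X)

if p_value > 0.05:
    print(f"p = {p_value:.3f} > 0.05: fail to reject H0 (no significant heteroscedasticity)")
else:
    print(f"p = {p_value:.3f} <= 0.05: reject H0 (evidence of heteroscedasticity)")
```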

Daily Update Monday 11 Sept

This is Aakash Kothapally, and I am providing an update regarding the two most recent sessions of the MTH 522 course.

In these sessions, our respected professor presented us with a dataset of immense importance – the CDC 2018 Diabetes Data. This dataset comprises three crucial variables, namely data pertaining to diabetes, physical inactivity, and obesity. The main goal, as outlined by our professor, is to conduct an extensive analysis of this dataset. Following that, we will strive to build predictive models that will allow us to forecast different outcomes based on the provided data.