During our class discussion, we examined a troubling revelation about fatal police shootings in the US reported by The Washington Post. There have been an astounding 8,770 fatal police shootings since 2015, and 928 of these incidents have been documented in the last 12 months alone. The substantial undercounting problem, with only 33% of these incidents appearing in the FBI database, is especially concerning and stems from inconsistent reporting across law enforcement agencies. The data also highlights a troubling racial disparity: although they make up only 14% of the population, Black Americans are fatally shot at more than double the rate of White Americans, and Hispanic Americans are likewise disproportionately affected. This underscores the urgent need for greater accountability and systemic change to close this stark disparity in police shootings. The Washington Post's database, with its thorough documentation of these incidents, proved to be an invaluable resource for our conversation, illuminating the pressing need for reform and reminding us how persistent this problem remains in our society.
Project 1
September 22 2023 daily post
I conducted a correlation study on three datasets pertaining to diabetes, obesity, and inactivity. The county FIPS code served as the linking key in the analysis, which revealed a substantial relationship among all three datasets. Seeing the need for a more detailed investigation, I integrated the three datasets into a single Excel file.
I wrote some code to combine the three datasets, and it turned out that there were 356 shared data points. After that, the Excel file needed to be cleaned up: the merge produced duplicate columns containing county, state, and year information, so I removed them to make the data easier to work with.
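Here is a minimal sketch in Python of how that merge-and-clean step could look; the file names and column labels below are placeholders, not the exact ones in my Excel files.

import pandas as pd

# Load the three CDC sheets (hypothetical file names).
diabetes = pd.read_excel("diabetes.xlsx")
obesity = pd.read_excel("obesity.xlsx")
inactivity = pd.read_excel("inactivity.xlsx")

# Inner join on the shared FIPS code keeps only the counties that
# appear in all three datasets (356 rows in my case).
merged = diabetes.merge(obesity, on="FIPS", suffixes=("", "_ob"))
merged = merged.merge(inactivity, on="FIPS", suffixes=("", "_in"))

# Drop the duplicated county / state / year columns that came along
# with the second and third files.
dupes = [c for c in merged.columns
         if (c.endswith("_ob") or c.endswith("_in"))
         and any(k in c.lower() for k in ("county", "state", "year"))]
merged = merged.drop(columns=dupes)

merged.to_excel("merged_cdc.xlsx", index=False)
print(len(merged), "shared data points")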
To analyze the merged data, I am researching a variety of tests, including the t-test and the Breusch-Pagan test. I have rough t-test code, but I haven't tried it on the dataset yet.
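Below is the kind of rough code I am experimenting with for the t-test and the Breusch-Pagan test; the column names are placeholders, and I have not yet run this against the real merged file.

import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_excel("merged_cdc.xlsx")

# Two-sample t-test: compare diabetes rates between counties with
# above-median and below-median inactivity.
median_inactivity = df["inactivity"].median()
high = df.loc[df["inactivity"] >= median_inactivity, "diabetes"]
low = df.loc[df["inactivity"] < median_inactivity, "diabetes"]
t_stat, p_value = stats.ttest_ind(high, low, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# Breusch-Pagan test: check the residuals of a simple linear model
# of diabetes on obesity and inactivity for heteroscedasticity.
X = sm.add_constant(df[["obesity", "inactivity"]])
ols = sm.OLS(df["diabetes"], X).fit()
bp_stat, bp_pvalue, _, _ = het_breuschpagan(ols.resid, ols.model.exog)
print(f"Breusch-Pagan LM = {bp_stat:.3f}, p = {bp_pvalue:.4f}")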
October 4th daily post
I learned a lot about the bootstrap method while using it for my project. It gave me the opportunity to examine the Centers for Disease Control and Prevention (CDC) data in detail and gain useful insight. One of the first things I noticed was the bootstrap method's adaptability and flexibility: I could work with my data as it was instead of assuming it followed a particular distribution, and that freedom was wonderful.
The complexity of real-world data struck me as I began the data preprocessing phase, with several different data formats to handle. But I found that the bootstrap approach gave me an efficient way to work through these difficulties.
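As a reference for myself, here is a minimal sketch of the bootstrap idea in Python; the file and column names are placeholders rather than the actual CDC fields.

import numpy as np
import pandas as pd

df = pd.read_excel("merged_cdc.xlsx")
values = df["diabetes"].to_numpy()

rng = np.random.default_rng(0)
n_boot = 10_000
boot_means = np.empty(n_boot)

# Resample with replacement and record the statistic of interest;
# no particular distribution is assumed for the data.
for i in range(n_boot):
    sample = rng.choice(values, size=len(values), replace=True)
    boot_means[i] = sample.mean()

# Percentile 95% confidence interval for the mean diabetes rate.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap 95% CI: [{lo:.2f}, {hi:.2f}]")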
10-02-2023
Using a variety of criteria, including the physical environment, transportation, economics, and food access, I tried to build a model in the current analysis, based on equation (i) below, to predict inactivity and obesity. However, I am running into a problem: the R² values of my models are lower than the R² value of equation (i):
y(diabetes) = β1·X1(obesity) + β2·X2(inactivity)    (i)
Furthermore, I am still debugging some ongoing problems with the code. Separately, because this is an experimental project, I intend to apply Weighted Least Squares (WLS) and investigate other ways to improve the model's accuracy so that I can complete the project.
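As a starting point, here is a sketch of how WLS could be set up with statsmodels; the weighting scheme shown (inverse squared OLS residuals) is just one common choice, not necessarily the one I will settle on, and the column names are placeholders.

import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("merged_cdc.xlsx")
X = sm.add_constant(df[["obesity", "inactivity"]])
y = df["diabetes"]

# Fit ordinary least squares first, then use its residuals to
# down-weight the noisier observations in a second, weighted fit.
ols = sm.OLS(y, X).fit()
weights = 1.0 / (ols.resid ** 2 + 1e-6)

wls = sm.WLS(y, X, weights=weights).fit()
print("OLS R-squared:", round(ols.rsquared, 3))
print("WLS R-squared:", round(wls.rsquared, 3))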
09-29-2023
Introduction
The accuracy of a prediction model is assessed using the Mean Squared Error (MSE), a metric that is frequently used in statistics and machine learning. It is the mean of the squared differences between the actual values in the dataset and the predicted values. The MSE is a useful metric for assessing the quality of regression models, since a lower MSE indicates a better fit of the model to the data.
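In symbols, for n observations with actual values yᵢ and predicted values ŷᵢ, MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²; squaring the errors penalizes large misses more heavily and keeps the result non-negative.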
Example
In this practical example, we assess the precision of a regression model that estimates property values from square footage using the mean squared error (MSE). Using a data set of five data points, each containing the true value and the matching estimate, we determined the squared error for each observation by squaring the difference between the predicted value and the true value. Finally, we obtained the MSE by averaging these squared errors. The resulting MSE of 440,000 provides a quantitative assessment of how well the model tracks actual home prices, with lower values suggesting a better fit. This example demonstrates how MSE can be used to evaluate and compare regression models for precise predictions.
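For reference, here is a small Python version of the same calculation; the five prices and estimates below are made-up stand-ins chosen so that the squared errors also average to 440,000, not the actual numbers from the class example.

actual    = [250_000, 310_000, 275_000, 305_000, 198_000]
predicted = [250_200, 309_600, 275_600, 305_800, 197_000]

# Squared error for each observation: (actual - predicted) ** 2.
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]

# MSE is the average of the squared errors.
mse = sum(squared_errors) / len(squared_errors)
print("squared errors:", squared_errors)
print("MSE:", mse)  # 440000.0 with these illustrative numbers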
09-27-2023
The k-fold method of cross-validation is a way of assessing how well a machine learning model is working. In this method, the dataset is split into 'k' equal-sized subsets, or folds. The model is trained on 'k-1' folds and validated on the remaining one, and the process is repeated 'k' times with a different fold used for validation each time. The results are then averaged to obtain a single performance metric.
For instance, in a 5-fold cross-validation the data is divided into 5 parts. The model is trained on the first four parts and tested on the fifth. The operation is repeated five times so that each part serves as the validation set exactly once. The final performance metric, the average of the metrics obtained in the 5 validation steps, offers a reliable evaluation of the model's generalization capacity.
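A minimal sketch of 5-fold cross-validation with scikit-learn, using synthetic data purely for illustration:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data stands in for a real dataset here.
X, y = make_regression(n_samples=100, n_features=3, noise=10, random_state=0)

model = LinearRegression()

# cv=5 splits the data into 5 folds: train on 4, validate on the 5th,
# rotate so every fold is the validation set once, then average.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print("per-fold MSE:", (-scores).round(2))
print("average MSE:", round((-scores).mean(), 2))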
09-25-2023 daily update
Resampling methods create fresh samples or datasets by redistributing or repeating the original data, and these new samples are then used to analyze the data. These techniques are particularly helpful for making inferences, evaluating model performance, and estimating statistics.
Bootstrapping is a resampling process in which we generate several bootstrap samples from the original dataset by randomly choosing data points with replacement. It is frequently used to estimate population statistics or evaluate the uncertainty of a statistic when we have a small sample size.
Another resampling method that is frequently employed in machine learning is cross-validation. It entails segmenting the dataset into various subsets and using different subsets iteratively for model training and testing. It enables us to assess the model's generalization performance and spot problems like overfitting.
An essential stage in evaluating a predictive model's effectiveness is estimating its prediction error. This helps us understand how well the model is likely to perform on new, unseen data. Depending on our dataset and objectives, there are a variety of ways to estimate prediction error.
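One simple way to estimate prediction error is to hold out a test set and compare training error with test error; the sketch below uses synthetic data for illustration.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=3, noise=15, random_state=1)

# Hold out 25% of the data that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

model = LinearRegression().fit(X_train, y_train)

# Error on the training data is optimistic; error on the held-out
# test data is an estimate of the prediction error on new data.
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"training MSE: {train_mse:.2f}, test MSE: {test_mse:.2f}")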
9/20/23 Discussion
In today’s lesson, we examined an analysis of a dataset that included measurements of the diameters of crab shells before and after molting. The following are the main ideas from today’s session:
Dataset and Linear Model: We looked at a dataset that had pairs of values for the pre- and post-molt sizes of crab shells. Using the Linear Model Fit function, a linear model was developed to forecast pre-molt size based on post-molt size (a small sketch of this fit appears after this list).
Pearson's r-squared: A remarkably high correlation between post-molt and pre-molt sizes was shown by the Pearson's r-squared value of 0.980833, indicating a strong linear association.
Descriptive Statistics: Descriptive statistics, covering central tendency, variability, skewness, and kurtosis, were computed for both post-molt and pre-molt data.
Histograms and Quantile Plots: These were used to illustrate the post-molt and pre-molt data distributions, which revealed negative skewness and high kurtosis, two signs of non-normality.
T-Test: T-tests are an essential statistical tool for comparing means. We specifically discussed:
Two-Sample T-Test: Compares the means of two separate groups.
Paired-Samples T-Test: Compares means between matched measurements taken on the same subjects.
One-Sample T-Test: This test compares the mean of one sample to a predetermined or speculative value.
This thorough investigation has shed important light on the relationship between crab shell sizes and the statistical techniques employed in such data analysis.
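For my own notes, here is a small Python sketch of the same kind of fit and paired comparison; the post-molt and pre-molt numbers below are made-up stand-ins, since I do not have the class crab dataset in front of me.

import numpy as np
from scipy import stats

# Made-up pre- and post-molt shell sizes (mm) for a handful of crabs.
post_molt = np.array([128.1, 134.2, 140.5, 145.8, 150.2, 155.7, 160.3])
pre_molt  = np.array([112.0, 118.5, 124.9, 130.1, 134.8, 140.2, 144.6])

# Least-squares line predicting pre-molt size from post-molt size.
fit = stats.linregress(post_molt, pre_molt)
print(f"pre_molt ~ {fit.intercept:.2f} + {fit.slope:.2f} * post_molt")
print(f"r-squared: {fit.rvalue ** 2:.4f}")

# Paired t-test: do the same crabs differ before and after molting?
t_stat, p_value = stats.ttest_rel(post_molt, pre_molt)
print(f"paired t = {t_stat:.3f}, p = {p_value:.4f}")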
Daily update September 17 2023
In the last class, I learned about:
Overfitting: When a model learns the training data too well but struggles to generalize to new data due to its excessive complexity.
Underfitting: When a model is too simple to capture the patterns in the training data and performs poorly both on the training and new data.
When training on any data, overfitting and underfitting guide the selection of appropriate model complexity. Data scientists use techniques like cross-validation and regularization to prevent overfitting (excessive model complexity) or underfitting (overly simplistic models). Hyperparameter tuning is essential to fine-tune models and strike the right balance between these extremes. Regular model evaluation and maintenance ensure that the chosen model continues to generalize well and perform effectively on new data throughout the project.
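A small sketch of how cross-validation can expose underfitting and overfitting by sweeping model complexity, using synthetic data:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 3, 60)).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(scale=0.2, size=60)

# Degree 1 tends to underfit, degree 15 to overfit; a moderate degree
# usually gives the lowest cross-validated error.
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_squared_error")
    print(f"degree {degree:2d}: cross-validated MSE = {-scores.mean():.3f}")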