This is a statistics R assignment completed in 2017 by a statistics PhD on the EssayPhD writing team for a student at the University of California, Irvine.
Accurately gauging how customers truly feel about a service matters to companies in every industry. The traditional approach is for a company to set up a dedicated survey department that collects customer feedback through questionnaires. With the growth of the Internet, obtaining feedback has become much easier. This statistics R paper uses data analysis to better assess customer satisfaction. The full text follows.
For an airline, accurately judging customer satisfaction is essential to improving service and staying competitive. In this paper, I employ seven different methods, namely a decision tree, a logistic model, a support vector machine, linear discriminant analysis, quadratic discriminant analysis, naïve Bayes, and a weighted k-nearest neighbor classifier, to find a relatively better sentiment classifier for customer opinions. The results indicate that the naïve Bayes model and the weighted k-nearest neighbor classifier have exactly the same overall accuracy, and both perform worse than the other models. The decision tree, LDA, and QDA perform best.
Learning how customers truly feel about a service matters to companies in every industry. Traditionally, some companies set up a dedicated department to survey customer satisfaction through questionnaires, which requires considerable manpower and resources. With the development of the Internet, feedback can now be gathered not only through online questionnaires but also, with modest technical effort, from forums and other social networks. This greatly improves the convenience of collecting feedback and allows a company to improve its entire service chain more effectively and strengthen its own competitiveness.
In this paper, we use data derived from Twitter that relates to United Airlines, the world's third-largest airline. For such a large airline, with customers from all over the world, a more convenient way to collect opinions is needed.
Data
Twitter is a widely used web platform where people post their satisfaction, dissatisfaction, and other emotions, so we collect our data from this website. We first create a Twitter account and follow UA on Twitter, then gather the opinions related to UA. The data are saved in txt format, which is easy to import into a data analysis platform. We use the statistical software R for all of the following analysis. We load the data with the read.delim() function and name the result tweets. The tweets data set contains 2691 rows and 3 columns. The three columns are, respectively, a row number (which is not used), the polarity, and the opinion text; the latter two are central to the following study.
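The loading step above can be sketched as follows; the file name `tweets.txt` and the column names are assumptions, since the text does not state them.

```r
# Load the tweet data exported from Twitter; file name is an assumption.
tweets <- read.delim("tweets.txt", stringsAsFactors = FALSE)

dim(tweets)  # the text reports 2691 rows and 3 columns

# Name the columns as described: row number, polarity label, opinion text.
names(tweets) <- c("number", "polarity", "opinion")
tweets$number <- NULL  # drop the unused row-number column
```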
We use the tm package in R to process the text and obtain a document-term matrix, on which we then build the various models for judging customer emotion. First we use the Corpus() and VectorSource() functions to build a corpus, a collection of documents containing natural-language text, whose two types of metadata (corpus-level and document-level) are stored as label-value pairs. Then we use the tm_map() function to transform the corpus: converting to lower case, removing numbers, removing punctuation, and so on.
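A minimal sketch of this preprocessing pipeline, assuming the `tweets` data frame loaded earlier (the intermediate object names are assumptions):

```r
library(tm)

# Build a corpus from the opinion column.
corpus <- Corpus(VectorSource(tweets$opinion))

# Clean the text step by step, as described above.
corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case
corpus <- tm_map(corpus, removeNumbers)                 # drop digits
corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
corpus <- tm_map(corpus, stripWhitespace)               # collapse spaces

# Document-term matrix on which the models are built.
dtm <- DocumentTermMatrix(corpus)
```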
A decision tree classifies by recursively splitting the data according to the probability of each outcome, building a tree whose leaves give the predicted class. In this paper, we use the rpart() function in the rpart package. Because the response variable is categorical, we set the method parameter to "class".
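A sketch of the fitting call, assuming a data frame `train` that combines the term counts with the polarity label (the variable names are assumptions):

```r
library(rpart)

# 'train' holds the term counts plus a factor column 'polarity';
# method = "class" because the response is categorical.
tree_fit <- rpart(polarity ~ ., data = train, method = "class")

# Class predictions on held-out data.
tree_pred <- predict(tree_fit, newdata = test, type = "class")
```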
The logistic model is a discrete classifier, and the earliest discrete-choice model. One reason for its wide application is that it yields explicit probabilities; the model is also fast to fit and easy to use. In this paper, we use the glm() function, setting the family parameter to binomial and the maximum number of iterations to 100.
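A sketch of the call, assuming the same `train` and `test` data frames; interpreting the "max parameter" in the text as glm()'s `maxit` iteration limit is an assumption.

```r
# Logistic regression; family = binomial for a 0/1 polarity response.
logit_fit <- glm(polarity ~ ., data = train,
                 family = binomial, maxit = 100)

# Probabilities above 0.5 are classified as polarity 1.
logit_prob <- predict(logit_fit, newdata = test, type = "response")
logit_pred <- as.numeric(logit_prob > 0.5)
```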
Support Vector Machine
The support vector machine was first proposed by Corinna Cortes and Vladimir Vapnik in 1995. It has many advantages on small-sample, nonlinear, and high-dimensional data sets. In machine learning, the support vector machine is a supervised learning model with associated learning algorithms that analyze data for classification and regression. In this paper, we use the svm() function in the e1071 package with its default parameters.
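With the default parameters, as stated in the text, the call reduces to (again assuming the `train`/`test` naming):

```r
library(e1071)

# SVM with e1071's defaults (C-classification, radial kernel
# when the response is a factor).
svm_fit <- svm(polarity ~ ., data = train)
svm_pred <- predict(svm_fit, newdata = test)
```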
Linear Discriminant Analysis
Linear discriminant analysis, abbreviated LDA, is a classification algorithm. LDA projects the historical data so that observations of the same class lie as close together as possible after projection while different classes lie as far apart as possible, producing a linear discriminant model that separates the classes and predicts newly arriving data.
Quadratic Discriminant Analysis
Compared with linear discriminant analysis, this method uses a different discriminant rule while sharing similar algorithmic characteristics. The difference is that when the covariance matrices of the classes are equal, the linear discriminant is used, whereas when the covariance matrices of the classes differ, the quadratic discriminant should be used.
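The text does not name the functions used for these two methods; a common choice is lda() and qda() from the MASS package, which is an assumption of this sketch.

```r
library(MASS)  # lda() and qda(); package choice is an assumption

# LDA: one pooled covariance matrix across classes.
lda_fit  <- lda(polarity ~ ., data = train)
lda_pred <- predict(lda_fit, newdata = test)$class

# QDA: identical in use, but one covariance matrix per class.
qda_fit  <- qda(polarity ~ ., data = train)
qda_pred <- predict(qda_fit, newdata = test)$class
```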
The naïve Bayes method is a classification method based on Bayes' theorem and the assumption of conditional independence among features. The model requires few parameters to be estimated, is less sensitive to missing data, and has a simple algorithm. In the analysis, we use the widely used NaiveBayes() function.
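The text does not say which package provides NaiveBayes(); the klaR package is one common source, which is an assumption here.

```r
library(klaR)  # provides NaiveBayes(); package choice is an assumption

nb_fit  <- NaiveBayes(polarity ~ ., data = train)
nb_pred <- predict(nb_fit, newdata = test)$class
```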
Weighted K-Nearest Neighbor Classifier
The k-nearest neighbor (KNN) classifier is a relatively mature method and one of the simplest machine learning models. Its idea is that a sample belongs to a class if the majority of its k nearest neighbors in feature space belong to that class. In this paper, we use the kknn() function in the kknn package.
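kknn() fits and predicts in a single call, taking the training and test sets together (the `train`/`test` names follow the split described in the results section and are assumptions):

```r
library(kknn)

# Distance-weighted KNN; k and the kernel are left at their defaults.
knn_fit  <- kknn(polarity ~ ., train = train, test = test)
knn_pred <- fitted(knn_fit)
```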
Results and analysis
After obtaining the document-term matrix, we use the findFreqTerms() function to extract the fifty highest-frequency words, which are displayed in the table below. Then we use the wordcloud() function to draw a picture in which the size of each word indicates its frequency. The word "flight" has the highest frequency, with a far larger size than the others, followed by "help", "need", "airport", and so on. The table corresponds to the picture below.
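A sketch of this step; the frequency cutoff passed to findFreqTerms() is an assumption, since the text only says fifty terms are kept.

```r
library(wordcloud)

# Terms occurring at least 50 times; the cutoff is an assumption.
freq_terms <- findFreqTerms(dtm, lowfreq = 50)

# Word cloud with size proportional to term frequency.
freqs <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
wordcloud(names(freqs), freqs, max.words = 50)
```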
If we use the full document-term matrix directly, its sparsity is nearly 100%, which affects the accuracy of prediction to some extent. We therefore use the removeSparseTerms() function to obtain a matrix with low sparsity; after this step, only eight terms remain.
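The sparsity threshold is not stated in the text; the value below is an assumption chosen only to illustrate the call.

```r
# Drop terms absent from more than 95% of documents; the 0.95
# threshold is an assumption (the text only says eight terms remain).
dtm_small <- removeSparseTerms(dtm, sparse = 0.95)
```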
In this section, we first use the sample() function to randomly split the tweets data set into a training set and a test set at a ratio of 8:2. We then build the models on the training set, predict the emotion on the test set, and measure the accuracy.
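The split and the error-rate calculation used for every model below can be sketched as follows; the seed and object names are assumptions.

```r
set.seed(1)  # for reproducibility; the seed is an assumption

# Combine the reduced term matrix with the polarity label.
data_all <- data.frame(as.matrix(dtm_small),
                       polarity = factor(tweets$polarity))

# 8:2 random split into training and test sets.
idx   <- sample(nrow(data_all), size = round(0.8 * nrow(data_all)))
train <- data_all[idx, ]
test  <- data_all[-idx, ]

# Error rate: share of misclassified test opinions.
error_rate <- function(pred, truth) mean(pred != truth)
```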
With the decision tree model, on the test set, 584 of the 596 opinions with polarity 0 are predicted correctly and 12 are not; 28 of the 143 opinions with polarity 1 are predicted correctly and 115 are not. The error rate is 0.172.
With the logistic model, 586 of the opinions with polarity 0 are predicted correctly and 10 are not; 24 of the opinions with polarity 1 are predicted correctly and 119 are not. The error rate is 0.175.
With the support vector machine, 585 of the opinions with polarity 0 are predicted correctly and 11 are not; 23 of the opinions with polarity 1 are predicted correctly and 120 are not. The error rate is 0.177.
With the linear discriminant analysis model, 584 of the opinions with polarity 0 are predicted correctly and 12 are not; 28 of the opinions with polarity 1 are predicted correctly and 115 are not. The error rate is 0.172.
With the quadratic discriminant analysis model, 584 of the opinions with polarity 0 are predicted correctly and 12 are not; 28 of the opinions with polarity 1 are predicted correctly and 115 are not. The error rate is 0.172.
With the naïve Bayes model, 547 of the opinions with polarity 0 are predicted correctly and 49 are not; 37 of the opinions with polarity 1 are predicted correctly and 106 are not. The error rate is 0.210.
With the weighted k-nearest neighbor classifier, all 596 opinions with polarity 0 are predicted correctly; only 7 of the opinions with polarity 1 are predicted correctly and 136 are not. The error rate is 0.210.
The seven models have similar performance, with only small differences in accuracy, but they behave quite differently in detail. For example, the naïve Bayes model and the weighted k-nearest neighbor classifier have exactly the same overall accuracy, yet their detailed predictions differ greatly. Overall, these two models perform worse than the others, while the decision tree, LDA, and QDA perform best.
The above is the full text of the statistics R paper. Through data analysis in R, we applied seven different methods to judge customer sentiment and reached the conclusions above. Students who need statistics R writing help are welcome to contact us.
Boyles S, Fajardo D, Waller S T. A naive Bayesian classifier for incident duration prediction[C]//86th Annual Meeting of the Transportation Research Board, Washington, DC. 2007.
Breiman L, Friedman J, Stone C J, et al. Classification and regression trees[M]. CRC press, 1984.
Chang C C, Lin C J. LIBSVM: a library for support vector machines[J]. ACM transactions on intelligent systems and technology (TIST), 2011, 2(3): 27.
Chen P H, Fan R E, Lin C J. Working set selection using the second order information for training svm[J]. Journal of Machine Learning Research, 2005, 6: 1889-1918.
Dudani S A. The distance-weighted k-nearest-neighbor rule[J]. IEEE Transactions on Systems, Man, and Cybernetics, 1976 (4): 325-327.
Izenman A J. Linear discriminant analysis[M]//Modern multivariate statistical techniques. Springer New York, 2013: 237-280.
McCullagh P. Generalized linear models[J]. European Journal of Operational Research, 1984, 16(3): 285-292.
Mika S, Ratsch G, Weston J, et al. Fisher discriminant analysis with kernels[C]//Neural Networks for Signal Processing IX, 1999. Proceedings of the 1999 IEEE Signal Processing Society Workshop. IEEE, 1999: 41-48.