Please use RapidMiner Studio to analyze the data set
Please read the attached Project Outline document to ensure that you are complying with all of the requirements (you are only completing the Written Report section)
Please read the attached Course Concepts documents to ensure that you are using the analytic techniques and RapidMiner operators we learned in class (Linear Regression, Nominal to Numerical, Cross Validation, Apply Model, Performance, k-NN, Decision Tree, Remap Binominals, Logistic Regression, etc.)
Course Project Description
In this project, groups of 4-5 students will research and obtain a data set relevant to a public or private organization or a specific problem domain, and will apply analytics techniques covered in the course to yield actionable insights. Students may form their own groups. Groups will be required to submit a 1-page project proposal and a 2-page outline, deliver a 7-10 minute presentation in class, and submit a final written report of their work, as detailed below.
Please sign up for a project group using the following link: https://tinyurl.com/to4b58t
Projects must incorporate both descriptive and predictive analytics techniques, and must include a clear explanation of how the results of the project will influence decision making (either an individual decision or a specific type of decision). That is, groups are required to state clearly and specifically how their results will help the people or organization(s) involved.
There is no a priori restriction on the setting or domain of the project. Groups are encouraged to choose topics that are of interest to them. If in doubt about whether a topic would make an appropriate project, ask! The quality of the project depends on many factors, including but not limited to: collection of a sufficient quantity of usable data, appropriate application of the tools and methods learned in class, generation of actionable insights, and clear communication of the group’s work both orally and in writing.
The data set is required to have at least 500 usable data points (more is preferable). Smaller data sets are unlikely to yield fruitful results from several methods covered in the course.
The timeline and deliverables for the project are as follows:
Project Proposal (2%):
The project proposal should be a Word document no longer than one page that briefly describes what the group plans to do. It must include:
The names of all of the group members
The organization or domain on which the project will focus
The question or challenge relevant to the organization/domain that the project will address
A brief description of the data the group would like to use
Project Outline (3%):
The project outline should be a Word document no longer than two pages that describes the data set and next steps. It must include:
A description of the data set (source, number of data points, variables, etc.)
Which analytics techniques the group plans to apply
What kinds of results or insights the group expects/hopes that the analysis will reveal
Presentations should be 7-10 minutes in length. The 10-minute upper bound is a hard constraint. No handouts or supporting files other than a single .ppt file are permitted. The presentation will be graded on both the correctness/appropriateness of the content and the clarity with which it is presented. The presentation should include:
A brief description of the organization and the issue being analyzed
A description & snapshot of the data set
An explanation of any data or modeling challenges that the group encountered
An explanation of the methods used to analyze the data set
An explanation of the results and their implications for the organization
All group members must present at least one slide.
Written Report (14%):
The written report should be considered a complete report of the work done by the group. It will be graded based on the correctness and thoroughness of the analysis, as well as the clarity with which it is explained. It should include all of the items listed above for the presentation, incorporating any additional supporting material or detailed explanations that were left out of the presentation. It should also incorporate any changes made since the presentation was delivered. Written reports of more than 10 pages, inclusive of tables and charts, are discouraged. Do not show the entire data set; a snapshot of it is sufficient.
Groups are encouraged to work on a draft of the written report in parallel with the presentation; this saves time and tends to result in higher quality reports.
Recommended format & structure for the written report:
Use double spacing, 11-point Times New Roman font, and 1-inch margins. Be sure to reference any external sources used for quotes or information using a common style guide such as APA (http://www.apastyle.org/manual/index.aspx).
The written report must look professional. Grammatical mistakes, typos, inconsistent formatting, etc., will result in points being deducted.
FREQUENTLY ASKED QUESTIONS:
Q: Can we use data sets from Kaggle?
A: Yes, you can use data sets from Kaggle (or similar ones found through Google dataset searches, etc.). HOWEVER, an important part of the project is answering the “so what?” question. There needs to be a compelling story, and some kind of applicable insight that comes out of your analysis. “The organization should focus more on X3 because that’s what the models said” is not all that useful or intriguing.
Q: Can we do predictive analytics on the stock market?
A: I strongly discourage trying to predict the stock market (or any major financial market). There’s an entire industry devoted to this problem, and the models covered in class have already been thoroughly applied. It’s unlikely that anything new will come out of such a project; seemingly valuable results tend to be either random noise or an already well-known relationship.
Q: Can we collect our own data?
A: Yes. Be aware, though, that obtaining a sufficiently large data set might be extremely time consuming.
Q: Can we use data from our own internship, job, other project, etc.?
A: Working on a project that will have a real-world application is great and encouraged. However, you cannot reuse previous work for this project, nor reuse this project in future academic work.
Q: What are some examples of successful past projects?
A: I am hesitant to give these, because I don’t want groups getting anchored on them and trying to do something similar. Virtually any topic can lead to a successful project; how the group conducts and presents their analysis matters much more than their choice of topic. With that caution, a few successful projects in recent years have been: predicting top 10 songs on Spotify, predicting Olympic medalists, understanding the effects of several factors on Walmart’s sales, and employment growth in DC industries.
Q: Our dataset is huge and overwhelming! What should we do?
A: Don’t worry, that is often a good problem to have. Huge, overwhelming datasets are exactly why we need analytics! If you have more than 10,000 data points, you will need to obtain an educational license for RapidMiner, which is free for students and good for 1 year. If you have so many data points that methods from class won’t run in a reasonable amount of time, consider using only data points within certain categories, or even simply a smaller (but still large) random sample of points. If you have lots of variables, DO NOT eliminate some variables just to make things simpler. Most methods covered in class have no problem dealing with a large number of variables, and you don’t know before doing the analysis which ones will turn out to be important. If your dataset is huge and overwhelming, and ALSO very messy or poorly formatted, cleaning it might require an inordinate amount of work; in that case, consider using a different dataset.
The fine print:
All members of a group will receive the same grades for the project, barring exceptional circumstances.
Late proposals, outlines, or written reports will be docked 20% of the points available per day, or fraction thereof, overdue.
-Predict the dependent variable (Y) using a linear equation containing the independent variables
-Excel & RapidMiner output
-R Square (Excel only): % of variability of Y that’s captured by the model
-Standard error (Excel only): a measure of error associated with the model’s predictions
-Significance F (Excel only): a p-value for the model as a whole
-Coefficients: the coefficients of the regression equation
-Sign indicates direction of that variable’s relationship with Y
-The change in prediction for Y if the variable goes up by 1
–Linear Regression operator
–feature selection parameter: sets how insignificant variables are removed
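As a rough illustration of what the coefficients and R Square mean, here is a minimal Python sketch of a one-variable least-squares fit. It is not RapidMiner or Excel output; the variable names (`ad_spend`, `sales`) and the numbers are made up for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Share of the variability of y captured by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: advertising spend (x) and sales (y).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_line(ad_spend, sales)
r2 = r_squared(ad_spend, sales, intercept, slope)
print(f"sales = {intercept:.2f} + {slope:.2f} * ad_spend, R^2 = {r2:.3f}")
```

The positive slope says sales rise with ad spend, and its size is the change in predicted sales when ad spend goes up by 1.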
-Many predictive methods can’t handle qualitative variables
-Convert a qualitative variable into a set of binary 0-1 “dummy variables”
-Leave out one possible value as the “base case” for that qualitative variable
-Those dummy variables can then be used as independent variables
–Nominal to Numerical operator
-Use comparison groups to set the base case
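The dummy-variable idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how the Nominal to Numerical operator is implemented; the `to_dummies` helper and the `regions` data are hypothetical.

```python
def to_dummies(values, base_case):
    """Return one 0/1 column per category, leaving out the base case."""
    categories = sorted(set(values) - {base_case})
    return {
        f"is_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }


# Hypothetical qualitative variable with three possible values.
regions = ["north", "south", "west", "north", "west"]
dummies = to_dummies(regions, base_case="north")

for name, column in dummies.items():
    print(name, column)
```

A row with 0 in every dummy column belongs to the base case ("north" here), so no separate column is needed for it.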
-A way to compare the accuracies of different predictive methods
-Split the dataset up into chunks (“folds”)
-Run a method many times, each time leaving a fold out of the model and instead making predictions for the outcomes of the data points in that fold
-Then aggregate all of those errors to see how good the method was
-Root mean squared error (RMSE)
–Cross Validation operator
-A nested operator, meaning it has its own process within it
-Set up the predictive method on the left side
-Use Apply Model & Performance operators on the right side to determine how well the method did at predicting outcomes for the fold that was omitted
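The fold-by-fold procedure can be sketched in plain Python as follows. This is a simplified illustration rather than the Cross Validation operator itself: folds are assigned by position instead of at random, and the two methods being compared (`fit_mean` and `fit_line`) and the data are assumptions made for the example.

```python
import math


def fit_mean(train_x, train_y):
    """Baseline method: always predict the average training outcome."""
    mean_y = sum(train_y) / len(train_y)
    return lambda x: mean_y


def fit_line(train_x, train_y):
    """One-variable least-squares regression."""
    n = len(train_x)
    mx, my = sum(train_x) / n, sum(train_y) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(train_x, train_y))
             / sum((x - mx) ** 2 for x in train_x))
    return lambda x: my + slope * (x - mx)


def cross_validated_rmse(xs, ys, fit, n_folds):
    squared_errors = []
    for fold in range(n_folds):
        # Hold out every n_folds-th point, starting at position `fold`.
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        test = [i for i in range(len(xs)) if i % n_folds == fold]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        squared_errors += [(ys[i] - predict(xs[i])) ** 2 for i in test]
    # Aggregate the errors from all folds into a single RMSE.
    return math.sqrt(sum(squared_errors) / len(squared_errors))


xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.2, 3.9, 6.1, 8.0, 9.8, 12.1, 14.2, 15.9, 18.1, 20.0]

rmse_mean = cross_validated_rmse(xs, ys, fit_mean, n_folds=5)
rmse_line = cross_validated_rmse(xs, ys, fit_line, n_folds=5)
print(f"mean-only RMSE: {rmse_mean:.2f}, regression RMSE: {rmse_line:.2f}")
```

The method with the lower cross-validated RMSE is the better predictor on data it has not seen.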
-To make a prediction, find the k most similar points in the dataset
-Use the average outcome of those k points as the prediction
-Low k -> similar points, but subject to more randomness
-High k -> not all similar points, but less affected by random noise
-SQRT(n) is a good value of k to try first
-Can be a good predictive method, but does not include a “model” that gives informative output the way regression does
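A minimal Python sketch of k-NN for a numeric outcome is shown below. It assumes straight-line (Euclidean) distance as the similarity measure and uses made-up points and prices; the k-NN operator in RapidMiner offers its own distance settings.

```python
def knn_predict(train_points, train_outcomes, new_point, k):
    """Average the outcomes of the k training points closest to new_point."""
    def distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    ranked = sorted(zip(train_points, train_outcomes),
                    key=lambda pair: distance(pair[0], new_point))
    nearest = [outcome for _, outcome in ranked[:k]]
    return sum(nearest) / k


# Hypothetical data: two numeric attributes per point, and a price outcome.
points = [(1, 1), (2, 1), (8, 9), (9, 8), (1, 2), (9, 9)]
prices = [10.0, 12.0, 50.0, 54.0, 11.0, 52.0]

prediction = knn_predict(points, prices, new_point=(2, 2), k=3)
print(prediction)
```

In practice the attributes are usually put on a common scale first, so that one variable with large values does not dominate the distance.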
-Build a tree by splitting the data up (“branching”) repeatedly
-Each split is based on whether one variable’s value is above or below a threshold
-Tries to obtain groups of data points that have similar outcomes
-To make a prediction for a new data point, find which endpoint (“leaf”) of the tree it goes into
-Use the average outcome of the data points in that leaf as the prediction
–Decision Tree operator
–criterion parameter: set to least_square if predicting something numeric
–minimal gain parameter: controls the size of the tree (low value -> large tree)
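To make the branching idea concrete, here is a small Python sketch that finds a single split on one variable by least squares and predicts with the leaf average. A real tree repeats this step inside each branch and considers every variable; the `ages` and `spend` data are hypothetical.

```python
def sse(values):
    """Sum of squared differences from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Threshold on x that makes the outcomes in the two groups most similar."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best[0]


def predict(xs, ys, threshold, new_x):
    """Average outcome of the training points in the same leaf as new_x."""
    leaf = [y for x, y in zip(xs, ys)
            if (x <= threshold) == (new_x <= threshold)]
    return sum(leaf) / len(leaf)


ages = [22, 25, 30, 41, 45, 52]
spend = [100.0, 110.0, 105.0, 300.0, 310.0, 290.0]

threshold = best_split(ages, spend)
prediction = predict(ages, spend, threshold, new_x=48)
print(f"split at age <= {threshold}, prediction for age 48: {prediction}")
```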
-When predicting something binary rather than numeric
-Regression, k-NN, and regression trees can all be adapted for classification
–Remap Binominals operator in RapidMiner, to make sure it interprets 0 & 1 correctly
-Evaluate using classification accuracy (% of correct classifications) instead of RMSE
-The Performance operator will give us more detailed information
-Includes a 2×2 table of how often 0/1 classifications were made, and how often the actual results were 0 or 1.
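The accuracy figure and the 2×2 table can be reproduced by hand. The Python sketch below uses made-up actual and predicted values; it is an illustration of the calculation, not the Performance operator's output format.

```python
def confusion_table(actual, predicted):
    """Count each (actual, predicted) combination of 0/1 values."""
    table = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        table[(a, p)] += 1
    return table


def accuracy(actual, predicted):
    """Fraction of classifications that match the actual result."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


# Hypothetical results for eight data points.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

table = confusion_table(actual, predicted)
acc = accuracy(actual, predicted)

print("actual 0:", table[(0, 0)], "predicted 0,", table[(0, 1)], "predicted 1")
print("actual 1:", table[(1, 0)], "predicted 0,", table[(1, 1)], "predicted 1")
print("accuracy:", acc)
```

The off-diagonal counts show which kind of mistake the method makes more often, which a single accuracy number hides.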
–k-NN works almost exactly the same way
-Instead of the average outcome, classify based on the majority vote of the k points
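A minimal sketch of the majority-vote version in Python, again assuming Euclidean distance and hypothetical data:

```python
from collections import Counter


def knn_classify(train_points, train_labels, new_point, k):
    """Classify new_point by majority vote among its k nearest neighbors."""
    def distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: distance(pair[0], new_point))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


points = [(1, 1), (2, 1), (8, 9), (9, 8), (1, 2), (9, 9)]
labels = [0, 0, 1, 1, 1, 1]

label = knn_classify(points, labels, new_point=(2, 2), k=3)
print(label)
```

Using an odd k avoids tied votes when there are only two classes.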
-Use a “classification tree” instead of a regression tree
-Same operator in RapidMiner (Decision Tree)
–criterion parameter should be set to gain_ratio (which it is by default)
-Each leaf of the tree will have a 0/1 classification based on a majority vote among the data points in that leaf
-Logistic regression is a modification of linear regression used to estimate a probability
-Gives output that looks similar to linear regression output
-p-values have the same meaning
-Only the sign of the coefficient is meaningful, not the actual number
-When doing classifications, typically classify a new point as 1 if P(1) ≥ 0.5
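The link between the regression-style output and a probability can be sketched as follows. The standard logistic function turns the linear score into a value between 0 and 1; the intercept and coefficients below are invented for illustration and do not come from any fitted model.

```python
import math


def probability_of_one(intercept, coefficients, values):
    """Logistic function applied to the linear score."""
    score = intercept + sum(c * v for c, v in zip(coefficients, values))
    return 1 / (1 + math.exp(-score))


def classify(probability, cutoff=0.5):
    """Classify as 1 when the estimated P(1) reaches the cutoff."""
    return 1 if probability >= cutoff else 0


# Hypothetical fitted model with two independent variables.
intercept = -3.0
coefficients = [0.8, -0.5]

p_first = probability_of_one(intercept, coefficients, [5, 1])
p_second = probability_of_one(intercept, coefficients, [2, 2])
print(round(p_first, 3), classify(p_first))
print(round(p_second, 3), classify(p_second))
```

The positive coefficient pushes P(1) up as its variable increases and the negative one pushes it down, which is why the sign is the part that can be read directly.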
Advanced Predictive Methods
-“Ensemble methods” create lots of weaker predictive models and aggregate their predictions
-Work well if the weaker models are independent and sufficiently different from one another
-They are harder to interpret and explain than, e.g., a regression model
-Can’t easily look at model output and know which variables matter
-Can be used for numeric predictions or binary classifications
-Create a model many times, each with a different random sample from the dataset
-Can be used for any type of predictive model
-Particularly useful if there are any extreme outliers in the data
-Nested operator in RapidMiner called Bagging
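A small Python sketch of the bagging idea follows. It is not the Bagging operator; the weak model used here (predict the outcome of the single nearest sampled point) and the data, which include one deliberate outlier at x = 5, are assumptions chosen to keep the example short.

```python
import random


def fit_nearest_neighbor(sample):
    """Weak model: predict the outcome of the closest sampled point."""
    return lambda x: min(sample, key=lambda point: abs(point[0] - x))[1]


def bagged_models(data, n_models, seed):
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Random sample of the dataset, drawn with replacement.
        sample = [rng.choice(data) for _ in data]
        models.append(fit_nearest_neighbor(sample))
    return models


def bagged_predict(models, x):
    """Aggregate the weak models by averaging their predictions."""
    return sum(model(x) for model in models) / len(models)


# Hypothetical (x, y) data with an extreme outlier at x = 5.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1),
        (5, 30.0), (6, 12.2), (7, 13.8), (8, 16.1)]

models = bagged_models(data, n_models=25, seed=0)
prediction = bagged_predict(models, x=5)
print(prediction)
```

Because the outlier is missing from some of the random samples, the averaged prediction at x = 5 is pulled toward the neighboring points instead of always reproducing the outlier.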
-Specific to decision trees
-Create lots of trees, but each time, weight the data points that had large errors in the previous tree more heavily
-Leads to trees paying more attention to the points that are hard to predict
-Tends to perform well across a wide range of datasets
–Gradient Boosted Trees operator in RapidMiner
–learning rate parameter controls how fast the trees will adapt
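The boosting loop can be sketched in Python as below. Note the substitution: instead of explicit point weights, this sketch uses the residual-fitting form of boosting, where each new tree is fit to the errors left by the previous trees, so points with large errors dominate the next split. The one-split trees, the data, and the helper names are assumptions for illustration, not the Gradient Boosted Trees operator.

```python
def fit_stump(xs, residuals):
    """One-split tree fit to the current errors by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees, learning_rate):
    base = sum(ys) / len(ys)  # start from the average outcome
    trees = []
    predictions = [base] * len(xs)
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # The learning rate scales how far each tree moves the predictions.
        predictions = [p + learning_rate * tree(x)
                       for x, p in zip(xs, predictions)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)


def training_rmse(model, xs, ys):
    return (sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 1.2, 5.0, 5.5, 5.2, 9.0, 9.4]

few = boost(xs, ys, n_trees=2, learning_rate=0.3)
many = boost(xs, ys, n_trees=50, learning_rate=0.3)
print(training_rmse(few, xs, ys), training_rmse(many, xs, ys))
```

The error here is measured on the training data only; comparing settings fairly still calls for cross validation.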
-Specific to decision trees
-Create lots of trees, but each time, use only a random sample of the attributes
-Tends to perform well when the number of relevant attributes is very large
–Random Forest operator in RapidMiner
–subset ratio parameter sets the proportion of attributes used in each tree
-If the outcomes are numeric, set criterion to least_square
-Check “apply prepruning” to access several of the Decision Tree parameters
Written Communication Assessment Rubric: ITEC 320, Spring 2020
|Writing demonstrates sophisticated understanding of content through strong analysis of the key issues and concepts.||Writing demonstrates understanding through solid analysis of key issues and concepts with only minor lapses.||Writing conveys basic but incomplete or unsophisticated understanding of key issues and concepts. There is some elementary analysis.||Writing is unfocused, contains errors in logic, or fails to demonstrate awareness of important issues.|
|Conclusions or Recommendations (when applicable)||Writing demonstrates thoughtful, creative, and well-developed recommendations based on the key issues identified.||Writing demonstrates solid recommendations based on the key issues identified.||Writing has recommendations but they may lack explanation, or be incomplete or inadequate based on the key issues identified.||Writing has vague or no recommendations based on the key issues identified, or the recommendations are unrelated to the key issues.|
|Writing Style||Writing demonstrates sophisticated language that is clear, concise, and correct. The tone persuades the audience, has error-free grammar, and the text is precise and dynamic.||Writing demonstrates a solid level of clarity, conciseness, and correctness. The tone is appropriate for the audience, and the grammar has few errors.||Writing conveys basic clarity, but may be wordy or vague, and contain some errors in grammar and usage. The tone may be too informal or less appropriate for the audience in places.||Writing lacks clarity, is wordy or vague, or contains numerous errors. Tone and word choices are inappropriate or arbitrary for the audience.|
|Organization||The text is extremely well-organized. Paragraphs are led by effective topic sentences, and sentence structure is varied and dynamic. Document is extremely easy to follow and understand.||The text is generally well-organized. Paragraph and sentence structure is effective in conveying information. Document is generally easy to follow.||The text has some basic organization overall, but lacks consistency. Paragraphs may lack effective topic sentences or be long and unfocused; sentences may ramble. Document is understandable, but requires substantial effort.||The text is mostly or entirely unorganized. Paragraphs lack effective topic sentences and/or may be too long and unfocused. Sentences fail to convey ideas concisely. Document is very difficult or impossible to follow.|
|Technical Level||Writing demonstrates a clear understanding of the technical level of the audience, and expresses concepts and results appropriate for that level.||Writing demonstrates a general understanding of the technical level of the audience, and mostly expresses results and concepts at the appropriate level.||Writing demonstrates a basic grasp of the audience’s technical level, but expresses some concepts and results at a clearly inappropriate level.||Writing demonstrates little to no understanding of the audience’s technical level, and expresses concepts and results in a way that is clearly inappropriate.|