No amount of theory can replace hands-on practice.
Textbooks and courses can lull you into a false sense of mastery because the material is right in front of you. But when you try to apply it, you may find it harder than it looks. Projects help you improve your applied ML skills quickly while giving you the chance to explore an interesting topic.
Besides, you can add projects to your portfolio, which makes it easier to land a job, find cool career opportunities, and even negotiate a higher salary.
In this article, we introduce 8 fun machine learning projects for beginners. You can complete any of them in a single weekend, or expand them into longer projects if you enjoy them.
We affectionately call the first project the "Machine Learning Gladiator," but it's not new. It is one of the fastest ways to build practical intuition around machine learning.
The goal is to take out-of-the-box models and apply them to different datasets. This project is great for 3 main reasons:
First, you'll build intuition for how well a model fits a problem. Which models are robust to missing data? Which models handle categorical features well? Yes, you can dig through textbooks to find the answers, but you'll learn better by seeing them in action.
Second, this project will teach you the valuable skill of rapid prototyping. In the real world, it's often difficult to know which model will perform best without simply trying them.
Finally, this exercise helps you master the workflow of model building. For example, you'll get to practice:
• Importing data
• Cleaning data
• Splitting it into train/test or cross-validation sets
• Pre-processing
• Transformations
• Feature engineering
Because you'll use out-of-the-box models, you'll have the chance to focus on honing these critical steps.
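The workflow above can be sketched in a few lines of scikit-learn. This is a minimal illustration rather than a recommended setup: the built-in wine dataset stands in for your own data, and the three models, the 25% test split, and the `compare_models` helper name are all assumptions made for the example.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_models(random_state=0):
    """Fit several default models on one dataset; return test accuracy per model."""
    # Import data: a small built-in dataset stands in for your own CSV file.
    X, y = load_wine(return_X_y=True)

    # Split into training and test sets, keeping class proportions the same.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )

    # Out-of-the-box models; feature scaling is the only pre-processing step.
    models = {
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "k_nearest_neighbors": make_pipeline(
            StandardScaler(), KNeighborsClassifier()
        ),
        "random_forest": RandomForestClassifier(random_state=random_state),
    }

    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)  # accuracy on held-out data
    return scores


if __name__ == "__main__":
    for name, accuracy in compare_models().items():
        print(f"{name}: {accuracy:.3f}")
```

Swapping in a different dataset or adding a model to the dictionary is the whole point of the exercise: the surrounding workflow stays the same while you watch how each model reacts.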
See the sklearn (Python) or caret (R) documentation pages for instructions. You should practice regression, classification, and clustering algorithms.
• Python: sklearn – The official tutorial for the sklearn package
• Predicting wine quality with Scikit-Learn – A step-by-step tutorial for training a machine learning model
• R: caret – A webinar given by the author of the caret package
• UCI Machine Learning Repository – More than 350 searchable datasets covering almost every topic. You are likely to find datasets that interest you.
• Kaggle Datasets – More than 100 datasets uploaded by the Kaggle community. Some of them are really fun, including PokemonGo spawn locations and burritos in San Diego.
• data.gov – Open datasets released by the U.S. government. A good place to look if you are interested in the social sciences.
In the book Moneyball, the Oakland A's revolutionized baseball through analytical player scouting. They built a competitive team while spending only about 1/3 of what large-market teams like the Yankees paid in salaries.
First, if you haven't read the book yet, you should. It's one of our favorites!
Fortunately, the sports world has a ton of data to play with. Data on teams, games, scores, and players is tracked and freely available online.
There are plenty of fun machine learning projects for beginners here. For example, you could try:
• Sports betting: Predict box scores from the data available before each new game.
• Talent scouting: Use college statistics to predict which players will have the best professional careers.
• General management: Create clusters of players based on their strengths in order to build a well-rounded team.
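The player clustering idea in the last bullet might look like the sketch below. The statistics are synthetic: `make_players` is a hypothetical helper that invents points, rebounds, and assists per game for three player types, so the example runs without downloading anything. With real data you would replace it with a table scraped or exported from one of the sports databases.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def make_players(n_per_group=30, seed=0):
    """Generate hypothetical per-game stats: points, rebounds, assists."""
    rng = np.random.default_rng(seed)
    centers = np.array([
        [22.0, 4.0, 3.0],   # scorers
        [9.0, 11.0, 1.5],   # rebounders
        [11.0, 3.0, 9.0],   # playmakers
    ])
    groups = [c + rng.normal(scale=1.0, size=(n_per_group, 3)) for c in centers]
    return np.vstack(groups)


def cluster_players(stats, n_clusters=3, seed=0):
    """Standardize the stats, then assign each player to a k-means cluster."""
    # Scaling keeps a high-variance column (points) from dominating distances.
    scaled = StandardScaler().fit_transform(stats)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(scaled)


if __name__ == "__main__":
    labels = cluster_players(make_players())
    print(np.bincount(labels))  # number of players assigned to each cluster
```

The number of clusters is itself an assumption here. On real rosters you would try several values and inspect which groupings make sense to someone who knows the sport.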
Sports is also an excellent domain for practicing data visualization and exploratory analysis. You can use these skills to decide which types of data to include in your analyses.
• Sports Statistics Database – Sports statistics and historical data covering many professional sports and several college ones. The clean interface makes web scraping easier.
• Sports Reference – Another database of sports statistics. The interface is more cluttered, but individual tables can be exported as CSV files.
• cricsheet.org – Ball-by-ball data for international and IPL cricket matches. CSV files are available for IPL and T20 international matches.
For any data scientist interested in finance, the stock market is like a candy store.
First, you have many types of data to choose from: prices, fundamentals, global macroeconomic indicators, volatility indices, and more.
Second, the data can be very granular. You can easily get time series data by day (or even by minute) for each company, which lets you think creatively about trading strategies.
Finally, financial markets generally have short feedback cycles, so you can quickly validate your predictions on new data.
Some examples of beginner-friendly machine learning projects you could try include:
• Quantitative value investing: Predict 6-month price movements based on fundamental indicators from companies' quarterly reports.
• Forecasting: Build time series models, or even recurrent neural networks, on the difference between implied and actual volatility.
• Statistical arbitrage: Find similar stocks based on their price movements and other factors, and look for periods when their prices diverge.
An obvious disclaimer: building trading models to practice machine learning is simple. Making them profitable is extremely difficult. Nothing here is financial advice, and we do not recommend trading real money.
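As a rough illustration of the statistical arbitrage idea, the sketch below computes a rolling z-score of the log-price spread between two stocks and flags days when the spread moves unusually far from its recent average. The prices are simulated: `make_pair` is a hypothetical helper that builds two series sharing one random-walk factor and injects a single artificial divergence. The 20-day window and the threshold of 3 are arbitrary choices for the example, not trading parameters.

```python
import numpy as np


def spread_zscore(prices_a, prices_b, window=20):
    """Rolling z-score of the log-price spread between two stocks."""
    spread = np.log(prices_a) - np.log(prices_b)
    z = np.full(spread.shape, np.nan)
    for t in range(window, len(spread)):
        history = spread[t - window:t]  # past values only, so no look-ahead
        std = history.std()
        if std > 0:
            z[t] = (spread[t] - history.mean()) / std
    return z


def make_pair(n_days=250, seed=0):
    """Simulate two prices driven by one shared random-walk factor."""
    rng = np.random.default_rng(seed)
    common = np.cumsum(rng.normal(scale=0.01, size=n_days))
    a = 100 * np.exp(common + rng.normal(scale=0.002, size=n_days))
    b = 50 * np.exp(common + rng.normal(scale=0.002, size=n_days))
    a[200] *= 1.05  # inject a one-day divergence purely for illustration
    return a, b


if __name__ == "__main__":
    a, b = make_pair()
    z = spread_zscore(a, b)
    print(np.flatnonzero(np.abs(z) > 3))  # indices of flagged days
```

Using only past values inside the window matters: computing the mean and standard deviation over the full series would leak future information into every signal.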
• Python: sklearn for Investing – A YouTube video series on applying machine learning to investing.
• R: Quantitative Trading with R – Detailed class notes for quantitative finance using R.
• Quandl – A data marketplace that provides free (and premium) financial and economic data. For example, you can bulk-download end-of-day stock prices for more than 3000 U.S. companies, or economic data from the Federal Reserve.
• Quantopian – A quantitative finance community that offers a free platform for developing trading algorithms. Datasets are included.
• US Fundamentals Archive – 5 years of fundamentals data for more than 5000 U.S. companies.
Neural networks and deep learning are two success stories of modern artificial intelligence. They have led to major advances in image recognition, automatic text generation, and even self-driving cars.
To get involved in this exciting field, you should start with a manageable dataset.
The MNIST handwritten digit classification challenge is the classic entry point. Image data is generally harder to work with than "flat" relational data. The MNIST data is beginner-friendly and small enough to fit on one computer.
Handwriting recognition will challenge you, but it doesn't need high computational power.
To start, we recommend the first chapter of the tutorial below. It will teach you how to build a neural network from scratch that solves the MNIST challenge with high accuracy.
• Neural Networks and Deep Learning (online book) – Chapter 1 walks through how to write a neural network from scratch in Python to classify digits from MNIST. The author also gives a very good explanation of the intuition behind neural networks.
• MNIST – MNIST is a modified subset of two datasets collected by the U.S. National Institute of Standards and Technology. It contains 70,000 labeled images of handwritten digits.
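To give a feel for what "from scratch" involves, here is a small one-hidden-layer network written with NumPy only. It is not the code from the book above, and it does not use MNIST itself: to stay self-contained it trains on scikit-learn's built-in 8x8 digits dataset, a much smaller stand-in. The layer size, learning rate, batch size, and epoch count are untuned assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split


def train_network(X, y, hidden=32, epochs=30, batch_size=32, lr=0.1, seed=0):
    """Train a one-hidden-layer network with mini-batch gradient descent."""
    rng = np.random.default_rng(seed)
    n_features, n_classes = X.shape[1], int(y.max()) + 1
    W1 = rng.normal(scale=np.sqrt(2 / n_features), size=(n_features, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=np.sqrt(2 / hidden), size=(hidden, n_classes))
    b2 = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets

    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            xb, yb = X[idx], Y[idx]

            # Forward pass: ReLU hidden layer, softmax output.
            H = np.maximum(0, xb @ W1 + b1)
            logits = H @ W2 + b2
            logits -= logits.max(axis=1, keepdims=True)  # numerical stability
            P = np.exp(logits)
            P /= P.sum(axis=1, keepdims=True)

            # Backward pass: gradients of the mean cross-entropy loss.
            d_logits = (P - yb) / len(xb)
            dW2 = H.T @ d_logits
            db2 = d_logits.sum(axis=0)
            dH = d_logits @ W2.T
            dH[H <= 0] = 0  # ReLU passes gradient only where it was active
            dW1 = xb.T @ dH
            db1 = dH.sum(axis=0)

            # Gradient descent step.
            W1 -= lr * dW1
            b1 -= lr * db1
            W2 -= lr * dW2
            b2 -= lr * db2
    return W1, b1, W2, b2


def predict_digits(params, X):
    """Return the most likely class for each row of X."""
    W1, b1, W2, b2 = params
    return np.argmax(np.maximum(0, X @ W1 + b1) @ W2 + b2, axis=1)


def run_demo(seed=0):
    """Train on the small digits dataset and return test accuracy."""
    X, y = load_digits(return_X_y=True)
    X = X / 16.0  # pixel values in this dataset range from 0 to 16
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    params = train_network(X_train, y_train, seed=seed)
    return float(np.mean(predict_digits(params, X_test) == y_test))


if __name__ == "__main__":
    print(f"test accuracy: {run_demo():.3f}")
```

Moving from this stand-in to the real MNIST images mostly changes the input size (784 pixels instead of 64) and the training time; the forward and backward passes stay the same.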
The Enron scandal and bankruptcy was one of the largest corporate collapses in history.
In 2000, Enron was one of the largest energy companies in the United States. Then, after being exposed for fraud, it spiraled into bankruptcy within a year.
Fortunately for us, we have the Enron email database. It contains about 500,000 emails between 150 former Enron employees, mostly senior executives. It is also the only large public database of real emails, which makes it even more valuable.
In fact, data scientists have been using this dataset for education and research for years.
Examples of beginner machine learning projects you could try include:
• Anomaly detection: Map the distribution of emails sent and received by hour, and try to detect abnormal behavior leading up to the public scandal.
• Social network analysis: Build network graph models between employees to find key influencers.
• Natural language processing: Analyze the body text together with email metadata to classify emails by their purpose.
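A first pass at the anomaly detection idea could look like the sketch below, which uses only the Python standard library. It parses the Date header of each raw message, counts messages per hour, and flags hours whose volume is far above average. The messages are generated by `make_messages`, a hypothetical helper with an artificial spike, and the z-score threshold of 3 is an assumption. Hours with no mail at all never appear in the counts, which a more careful analysis would need to handle.

```python
import statistics
from collections import Counter
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import format_datetime, parsedate_to_datetime


def hourly_counts(raw_messages):
    """Count messages per (date, hour) bucket using each Date header."""
    counts = Counter()
    for raw in raw_messages:
        date_header = message_from_string(raw)["Date"]
        if date_header is None:
            continue  # skip messages without a usable timestamp
        sent = parsedate_to_datetime(date_header)
        counts[sent.replace(minute=0, second=0, microsecond=0)] += 1
    return counts


def flag_busy_hours(counts, threshold=3.0):
    """Return hours whose count is more than `threshold` std devs above the mean."""
    values = list(counts.values())
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return sorted(h for h, n in counts.items() if (n - mean) / std > threshold)


def make_messages():
    """Build synthetic raw emails: 2 per hour for 48 hours, with one spike."""
    start = datetime(2001, 5, 14, tzinfo=timezone.utc)
    messages = []
    for hour in range(48):
        n = 40 if hour == 30 else 2  # artificial burst of activity
        stamp = format_datetime(start + timedelta(hours=hour))
        raw = f"From: a@example.com\nTo: b@example.com\nDate: {stamp}\n\nbody"
        messages.extend([raw] * n)
    return messages


if __name__ == "__main__":
    for hour in flag_busy_hours(hourly_counts(make_messages())):
        print(hour.isoformat())
```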
• Enron Email Dataset – The Enron email archive hosted by CMU.
• Description of Enron Data (PDF) – An exploratory analysis of the Enron email data that can help you get your bearings.
Writing machine learning algorithms from scratch is an excellent learning tool for two main reasons.
First, there is no better way to build a true understanding of their mechanics. You will be forced to think about every step, and that leads to real mastery.
Second, you will learn how to translate mathematical instructions into working code. You will need this skill when adapting algorithms from academic research.
We suggest picking an algorithm that isn't too complex. Even the simplest algorithms require many subtle decisions. Once you're comfortable building simple algorithms, try extending them with more functionality. For example, try extending a vanilla logistic regression algorithm into lasso or ridge regression by adding regularization parameters.
Finally, here's a tip every beginner should know: don't be discouraged if your algorithm is not as fast or as fancy as those in existing packages. Those packages are the fruits of years of development!
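The logistic regression extension mentioned above can be sketched as follows. This version uses plain gradient descent and covers only the ridge (L2) penalty; a lasso (L1) penalty needs a different update rule and is left out. The data comes from `make_data`, a hypothetical generator, and the learning rate and iteration count are untuned assumptions.

```python
import numpy as np


def sigmoid(z):
    """Map real-valued scores to probabilities between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, l2=0.0, lr=0.1, n_iter=2000):
    """Minimize mean log loss + (l2 / 2) * ||w||^2 by gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        error = sigmoid(X @ w + b) - y  # gradient of log loss w.r.t. the scores
        w -= lr * (X.T @ error / n_samples + l2 * w)  # penalty shrinks weights
        b -= lr * error.mean()  # the bias term is left unpenalized
    return w, b


def predict_labels(X, w, b):
    """Predict class 1 when the estimated probability is at least 0.5."""
    return (sigmoid(X @ w + b) >= 0.5).astype(int)


def make_data(n=200, seed=0):
    """Generate a noisy, roughly linearly separable two-feature problem."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    noise = rng.normal(scale=0.3, size=n)
    y = (X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)
    return X, y


if __name__ == "__main__":
    X, y = make_data()
    for l2 in (0.0, 1.0):
        w, b = fit_logistic(X, y, l2=l2)
        accuracy = (predict_labels(X, w, b) == y).mean()
        print(f"l2={l2}: |w|={np.linalg.norm(w):.3f}, accuracy={accuracy:.3f}")
```

Setting `l2=0.0` recovers the vanilla algorithm, so the regularized version is a one-line change to the weight update. Comparing the weight norms for the two settings shows the shrinkage directly.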
• Python: Logistic regression from scratch
• Python: k-nearest neighbors from scratch
• R: Logistic regression from scratch
Social media has become almost synonymous with "big data" because of the sheer amount of user-generated content.
Mining this rich data can give you an unprecedented way to keep a pulse on opinions, trends, and public sentiment. Facebook, Twitter, YouTube, WeChat, WhatsApp, Reddit... the list goes on.
Furthermore, each generation spends more time on social media than its predecessors. This means social media data will become even more relevant to marketing, branding, and business as a whole.
While there are many popular social media platforms, Twitter is the classic entry point for practicing machine learning.
With Twitter data, you get an interesting blend of data (tweet contents) and metadata (location, hashtags, users, retweets, etc.) that opens up nearly endless paths for analysis.
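One common starting point for tweet contents is sentiment classification. The sketch below trains a bag-of-words naive Bayes model with scikit-learn. The eight labeled "tweets" are invented for illustration (1 = positive, 0 = negative), and a real project would need thousands of labeled examples collected through an API or taken from an existing dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labeled examples: (text, label) with 1 = positive.
TRAIN_TWEETS = [
    ("I love this phone, the camera is great", 1),
    ("What a great game, love this team", 1),
    ("Best coffee ever, so happy", 1),
    ("Happy with the fast delivery, great service", 1),
    ("I hate this update, the app is awful", 0),
    ("Terrible service, so angry", 0),
    ("Worst flight ever, awful delay", 0),
    ("Angry about the terrible battery, hate it", 0),
]


def train_sentiment_model(examples=TRAIN_TWEETS):
    """Fit a word-count naive Bayes classifier on labeled texts."""
    texts, labels = zip(*examples)
    # CountVectorizer lowercases and tokenizes; MultinomialNB models the counts.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(texts, labels)
    return model


if __name__ == "__main__":
    model = train_sentiment_model()
    new_tweets = ["love the great camera", "awful and terrible delay"]
    for text, label in zip(new_tweets, model.predict(new_tweets)):
        print(label, text)
```

Because the model is a pipeline, the same object handles both tokenizing and predicting, which keeps training and inference consistent when you later add metadata features such as hashtags.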
• Python: Mining Twitter Data – How to perform sentiment analysis on Twitter data
• R: Sentiment analysis with machine learning – A short and sweet sentiment analysis tutorial
• Twitter API – The Twitter API is a classic source of streaming data. You can track tweets, hashtags, and more.
• StockTwits API – StockTwits is like Twitter for traders and investors. You can extend this dataset in many interesting ways by joining it to time series datasets using the timestamp and ticker symbol.
Another industry undergoing rapid change thanks to machine learning is global health and health care.
In most countries, becoming a doctor requires many years of education. It is a demanding field with long hours, high stakes, and an even higher barrier to entry.
As a result, there has recently been a significant effort to use machine learning to alleviate doctors' workload and improve the overall efficiency of the health care system.
Use cases include:
• Preventive care: Predict disease outbreaks at both the individual and the community level.
• Diagnostic care: Automatically classify image data, such as scans, X-rays, etc.
• Insurance: Adjust insurance premiums based on publicly available risk factors.
As hospitals continue to modernize patient records, and as we collect more granular health data, data scientists will have plenty of opportunities within easy reach.
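For a small taste of the diagnostic use case, the sketch below evaluates a classifier on scikit-learn's built-in breast cancer dataset, whose features were computed from digitized images of tissue samples. It is tabular data rather than raw scans, so treat it as a stand-in. The example reports recall on the malignant class because missing a malignant case is usually the costlier error. The choice of model and of 5-fold cross-validation are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_diagnostic_model():
    """Cross-validate a simple classifier and report malignant-class recall."""
    # In this dataset the target is coded 0 = malignant, 1 = benign.
    X, y = load_breast_cancer(return_X_y=True)
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

    # Every sample is predicted by a model that never saw it during training.
    predictions = cross_val_predict(model, X, y, cv=5)
    return {
        "confusion_matrix": confusion_matrix(y, predictions, labels=[0, 1]),
        "malignant_recall": recall_score(y, predictions, pos_label=0),
    }


if __name__ == "__main__":
    result = evaluate_diagnostic_model()
    print(result["confusion_matrix"])
    print(f"malignant recall: {result['malignant_recall']:.3f}")
```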
• R: Building meaningful machine learning models for disease prediction
• Machine Learning for Health Care – An excellent presentation from Microsoft Research
• Large Health Data Sets – A collection of large health-related datasets
• data.gov/health – Health and health care related datasets provided by the U.S. government.
• Health, nutrition, and population statistics – Global health, nutrition, and population data provided by the World Bank.