Friday, May 12, 2017

Data Scientists' Notes on Machine Learning

Note: I am currently working on some notes on biological image analysis (April 19, 2019).

The goal of this note is to introduce the “science” of data science.  It explains the core machine learning theory for data scientists who have already learned some machine learning techniques.  Understanding the origins of machine learning concepts such as over-fitting, regularization and cross validation prepares data scientists to quickly master newly emerging machine learning techniques, as their theoretical foundations are likely to remain unchanged.  A grasp of the learning theory also enables data scientists to evaluate, modify and invent appropriate new machine learning techniques as needed, with confidence.

Part I.  Machine Learning



Part I of the note contains six chapters.  The first chapter explains the basic concepts of machine learning, followed by the core learning theory in chapters 2 and 3.  Chapters 4 and 5 focus on key machine learning techniques, i.e., applications of the learning theory.  Chapter 6 clarifies some additional concepts and suggests a few areas for future learning.

The goals of each chapter are outlined below:


Chapter 1. What is Machine Learning?

  • Introduction
  • Is Machine Learning Really a Hype?
  • What is Learning?
  • Learning Resource
  • The Learning Problem
This chapter explains what machine learning is and formulates the learning problem discussed in this note.  We try to clarify three commonly confused concepts: artificial intelligence, machine learning and deep learning.  The chapter also provides pointers to valuable learning resources.

Chapter 2. Learning Theory


  • Evaluating a Single Hypothesis - Hoeffding Inequality
  • Hypothesis Set and Learning Algorithm
  • VC Dimension
  • Where Does Over-fitting Come From?
  • Regularization
  • The Learning Curve

This chapter explains the core of machine learning theory from the frequentist point of view: why machines can learn.  It introduces the intuition behind VC theory to explain why our typical models contain only a finite number of effective hypotheses.  The multiple-testing problem originates from the multiple hypotheses we consider during learning, which directly leads to a regularization error.  It is important that we do not simply minimize the in-sample error, which leads to over-fitting; instead, the best model should be identified by minimizing the sum of the in-sample error and the regularization error.
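As a rough illustration of the Hoeffding bound named in the chapter outline, the sketch below simulates one fixed hypothesis whose out-of-sample error is `mu` and counts how often the in-sample error on `n` points deviates from it by more than `eps`.  It compares that frequency with the bound $2M e^{-2\epsilon^2 N}$, where the factor $M$ is the union-bound penalty for choosing among $M$ hypotheses.  The function names and the numeric values are illustrative assumptions, not taken from the chapter.

```python
import math
import random


def hoeffding_bound(eps, n, m=1):
    """Upper bound on P(|E_in - E_out| > eps); m > 1 applies the union bound over m hypotheses."""
    return 2.0 * m * math.exp(-2.0 * eps * eps * n)


def deviation_frequency(mu, n, eps, trials, seed=0):
    """Empirical frequency of |E_in - mu| > eps over repeated samples of size n."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        # Each sample point is misclassified with probability mu (the out-of-sample error).
        errors = sum(1 for _ in range(n) if rng.random() < mu)
        if abs(errors / n - mu) > eps:
            bad += 1
    return bad / trials


# Assumed illustrative values: true error 0.3, 100 samples, tolerance 0.1.
print("empirical frequency:", deviation_frequency(0.3, 100, 0.1, trials=2000))
print("single-hypothesis bound:", hoeffding_bound(0.1, 100))
print("bound with 10 hypotheses:", hoeffding_bound(0.1, 100, m=10))
```

The bound grows linearly with the number of hypotheses, which is one way to see why a richer hypothesis set needs an extra penalty beyond the in-sample error.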

Chapter 3. Regularization and Cross Validation


  • Bayesian Interpretation of Regularization
  • Bias and Variance
  • Learning Theory Review
  • Unbiased $E_{out}$ Estimation with Leave-One-Out Cross Validation
  • Hyper-parameter Selection and Model Selection

This chapter first completes the learning theory and continues to emphasize the necessity of the regularization error term.  Model hyper-parameters, such as the $\lambda$ within the regularization error term, should be determined using a validation data set.  To make the most effective use of our limited training data, the Leave-One-Out Cross Validation scheme is used to provide the performance estimate that is closest to unbiased for our final hypothesis.  In practice, we may need to use two separate workflows, one to identify the best hypothesis, another to estimate its performance.
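A minimal sketch of selecting $\lambda$ by Leave-One-Out Cross Validation follows.  It assumes a toy one-dimensional ridge regression with no intercept, chosen only because it has a closed-form fit; the function names, the data and the candidate values are hypothetical.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form minimizer of sum((w*x - y)^2) + lam * w^2 for a scalar weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def loo_cv_error(xs, ys, lam):
    """Mean squared validation error over all leave-one-out splits."""
    total = 0.0
    for i in range(len(xs)):
        # Train on every point except i, then validate on point i alone.
        w = fit_ridge_1d(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], lam)
        total += (w * xs[i] - ys[i]) ** 2
    return total / len(xs)


def select_lambda(xs, ys, candidates):
    """Return the candidate lambda with the smallest leave-one-out error."""
    return min(candidates, key=lambda lam: loo_cv_error(xs, ys, lam))


# Hypothetical noisy data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
best = select_lambda(xs, ys, [0.0, 0.1, 1.0, 10.0])
print("selected lambda:", best)
print("leave-one-out error at that lambda:", loo_cv_error(xs, ys, best))
```

Because the printed error was itself used to pick $\lambda$, it is an optimistic estimate of the final performance, which is the reason a second, separate workflow may be needed for performance estimation.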

Chapter 4. Linear Models


  • Linear Classification - Accuracy
  • Logistic Regression - Cross Entropy Loss Function
  • Linear Regression - Maximum Likelihood
  • Feature Selection Capability of Linear Models
  • Non-linear Classifier
  • Regularization Again
This chapter discusses three popular applications of linear models: classification, probabilistic classification, and regression.  We discuss why linear models have a built-in feature selection capability and how they provide a general framework that can be extended to non-linear models through a mapping function.
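The idea of extending a linear model through a mapping function can be sketched with a perceptron.  In the hypothetical data below, the label is +1 only at the two ends of a line, so no threshold on $x$ alone separates the classes, but the same linear algorithm succeeds after mapping $x$ to $(1, x, x^2)$.  The mapping `phi`, the data and the function names are assumptions made for illustration.

```python
def phi(x):
    """Map a scalar input to the feature vector (1, x, x^2)."""
    return (1.0, x, x * x)


def predict(w, z):
    """Linear classifier in feature space: the sign of the dot product w . z."""
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) > 0 else -1


def train_perceptron(features, labels, max_epochs=1000):
    """Perceptron learning; stops early once an epoch makes no mistakes."""
    w = [0.0] * len(features[0])
    for _ in range(max_epochs):
        mistakes = 0
        for z, y in zip(features, labels):
            if predict(w, z) != y:
                # Move the weights toward the misclassified point.
                w = [wi + y * zi for wi, zi in zip(w, z)]
                mistakes += 1
        if mistakes == 0:
            break
    return w


xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
labels = [1, -1, -1, -1, 1]
w = train_perceptron([phi(x) for x in xs], labels)
print("weights in the mapped space:", w)
print("predictions:", [predict(w, phi(x)) for x in xs])
```

The classifier is still linear in the weights; only the inputs were transformed, so the learning algorithm itself is unchanged.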

Extra: Multiclass Classifier


Chapter 5. Support Vector Machines


  • Maximum Margin Linear Classifier
  • SVM Engine
  • Cross Validation of SVM
  • Non-linear SVM and Kernel
  • RBF Kernel - One Kernel for All
  • RBF Kernel for Higher-dimensional Input
  • Classification with Probability
This chapter discusses how SVM works.  We analyze the popular RBF kernel in great detail.  Through the construction of the infinite-dimensional linear space for the RBF kernel, we explain why the RBF kernel can represent both linear and polynomial models, and why it is one kernel for all.
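One common way to see the infinite-dimensional space behind the RBF kernel, which may differ from the construction used in the chapter, is the Taylor expansion of $e^{2\gamma xy}$.  For scalar inputs, $e^{-\gamma (x-y)^2} = \sum_{k \ge 0} \phi_k(x)\,\phi_k(y)$ with $\phi_k(x) = e^{-\gamma x^2} \sqrt{(2\gamma)^k / k!}\; x^k$, so every polynomial degree appears as one coordinate.  The sketch below truncates that feature map and compares its dot product with the kernel value; the function names and the truncation degree are assumptions.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel for scalar inputs: exp(-gamma * (x - y)^2)."""
    return math.exp(-gamma * (x - y) ** 2)


def rbf_features(x, gamma=1.0, degree=20):
    """First degree+1 coordinates of an explicit feature map for the 1-D RBF kernel."""
    scale = math.exp(-gamma * x * x)
    return [
        scale * math.sqrt((2.0 * gamma) ** k / math.factorial(k)) * x ** k
        for k in range(degree + 1)
    ]


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


x, y = 0.5, -0.3
print("kernel value:      ", rbf_kernel(x, y))
print("truncated features:", dot(rbf_features(x), rbf_features(y)))
```

The coordinates with $k = 1$ behave like a linear model and those with higher $k$ like polynomial terms, each damped by $e^{-\gamma x^2}$.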

Chapter 6. The Path Forward


  • Review
  • Bayesian and Graphical Models
  • Deep Learning and Feature Learning
  • Data Engineering: Big Data and Cloud
  • Summary
This chapter outlines the ML topics that are not covered by this note.  In particular, it explains three different paths for further development: Bayesian graphical models, deep learning and data engineering.  We also try to explain why three often-confused concepts, data engineering, big data and the cloud, tend to go hand-in-hand.



Part II.  Deep Learning



To a large extent, the recent data science revolution is the deep learning revolution.  Deep learning offers a few frameworks that can be applied to very complicated problems such as image classification, language translation and game playing.  Most traditional machine learning techniques are either inapplicable to such problems or perform poorly on them.

We will gradually cover the basics of this new field.  Instead of discussing implementation techniques (e.g., TensorFlow), we focus on explaining the concepts related to why and how.  Because its theoretical foundation is still immature, the so-called explanations are often other people's or my own "arguments".  Be warned that I am just a new learner, so my opinions could be quite wrong.  Nevertheless, I hope my gut feelings can help make it look a bit less mysterious.


Chapter 7.  Introduction to Deep Learning


  • Resource
  • Feature Learning
  • GO Game as a Function
  • Deep Learning Implementation
  • A Blackbox DNN Solution

This chapter explains why deep learning is also known as feature learning.  Deep learning is basically a function approximation technique; in fact, games as complicated as Go can be modeled as a function.  We outline some implementation details and propose that one can write a generic blackbox deep learning solution capable of solving traditional machine learning problems.

Bioinformatics Application: Gene Expression Inference with Deep Learning.

Chapter 8.  Convolutional Neural Network (CNN)


  • Introduction
  • CNN for Face Recognition
  • VGGNet for ImageNet
  • Miscellaneous Topics

This chapter explains how a convolutional neural network (CNN) constructs a hierarchy of features that recognize spatial patterns, starting from simple ones (such as lines or corners) and moving to more complex ones (such as eyes or wheels), which addresses the translation-invariance requirement in computer vision.  The concept of different CNN neurons responding to their own specific patterns is very interesting, as it seems to shed light on how our biological visual recognition system might function.
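The translation property can be sketched in one dimension.  Below, a single hand-picked filter (an assumption; real CNNs learn many two-dimensional filters) slides over two signals that contain the same edge pattern at different positions.  The location of the peak response moves with the pattern, while a global max pooling step reports the same detection strength for both.

```python
def correlate_valid(signal, kernel):
    """Slide the kernel over the signal and return the response at each valid position."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Keep positive responses and zero out the rest."""
    return [max(0.0, v) for v in values]


def global_max_pool(values):
    """Summarize a response map by its strongest activation."""
    return max(values)


edge_filter = [1.0, -1.0]  # responds to a step down from 1 to -1
signal_a = [0.0, 0.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0]  # pattern at index 2
signal_b = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, -1.0, 0.0]  # same pattern at index 5

map_a = relu(correlate_valid(signal_a, edge_filter))
map_b = relu(correlate_valid(signal_b, edge_filter))
print("peak positions:", map_a.index(max(map_a)), map_b.index(max(map_b)))
print("pooled responses:", global_max_pool(map_a), global_max_pool(map_b))
```

Sharing one filter across all positions is what lets the same feature be recognized wherever it appears.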

Bioinformatics Application: Motif Discovery for DNA- and RNA-binding Proteins by Deep Learning.


Chapter 9. Recurrent Neural Network (RNN)



  • Introduction
  • Architecture
  • Memory
  • Running Average and Derivative
  • RNN Applications
  • LSTM


This chapter explains how to handle variable-length input sequences, where the response of a system depends not only on the current input but also on the historical inputs.  An RNN maintains a state vector that encodes the trajectory of all past inputs.  This memory capability enables RNNs to handle many of the most exciting deep learning applications, such as translation, video classification and sentiment analysis.
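A minimal sketch of the state idea, assuming a scalar state and a linear update with no activation function (a real RNN uses vectors, learned weight matrices and a non-linearity): with the weights below, the recurrent cell computes an exponential running average of its inputs, and the same loop handles sequences of any length.  The names and weight values are illustrative assumptions, and the chapter's own running-average construction may differ.

```python
def rnn_step(state, x, w_state, w_input):
    """One linear recurrent update: new_state = w_state * state + w_input * x."""
    return w_state * state + w_input * x


def run_rnn(inputs, w_state=0.5, w_input=0.5, initial_state=0.0):
    """Feed a sequence of any length through the cell and return every intermediate state."""
    state = initial_state
    states = []
    for x in inputs:
        # The state carries a summary of all earlier inputs into this step.
        state = rnn_step(state, x, w_state, w_input)
        states.append(state)
    return states


print(run_rnn([1.0, 1.0, 1.0, 1.0]))  # [0.5, 0.75, 0.875, 0.9375]
print(run_rnn([4.0, 0.0]))            # [2.0, 1.0]
```

In the second call the state is still non-zero after a zero input, which shows the memory of earlier inputs decaying rather than vanishing.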

Bioinformatics Application: LSTM Networks for Predicting Subcellular Localization of Proteins
