# Sentiment Classification on Large Movie Reviews with BigDL

## Sentiment Classification on Large Movie Reviews

Sentiment Analysis is understood as a classic natural language processing problem. In this example, a large moview review dataset was chosen from IMDB to do a sentiment classification task with some deep learning approaches. The labeled data set consists of 50,000 IMDB movie reviews (good or bad), in which 25000 highly polar movie reviews for training, and 25,000 for testing. The dataset is originally collected by Stanford researchers and was used in a 2011 paper, and the highest accuray of 88.33% was achieved without using the unbalanced data. This example illustrates some deep learning approaches to do the sentiment classification with BigDL python API.

The IMDB dataset need to be loaded into BigDL, note that the dataset has been pre-processed, and each review was encoded as a sequence of integers. Each integer represents the index of the overall frequency of dataset, for instance, ‘5’ means the 5-th most frequent words occured in the data. It is very convinient to filter the words by some conditions, for example, to filter only the top 5,000 most common word and/or eliminate the top 30 most common words. Let’s define functions to load the pre-processed data.

Processing text dataset
finished processing text


In order to set a proper max sequence length, we need to go througth the property of the data and see the length distribution of each sentence in the dataset. A box and whisker plot is shown below for reviewing the length distribution in words.

Review length:
Mean 233.76 words (172.911495)


Looking the box and whisker plot, the max length of a sample in words is 500, and the mean and median are below 250. According to the plot, we can probably cover the mass of the distribution with a clipped length of 400 to 500. Here we set the max sequence length of each sample as 500.

The corresponding vocabulary sorted by frequency is also required, for further embedding the words with pre-trained vectors. The downloaded vocabulary is in {word: index}, where each word as a key and the index as a value. It needs to be transformed into {index: word} format.

Let’s define a function to obtain the vocabulary.

Processing vocabulary
finished processing vocabulary


### Text pre-processing

Before we train the network, some pre-processing steps need to be applied to the dataset.

Next let’s go through the mechanisms that used to be applied to the data.

• We insert a start_char at the beginning of each sentence to mark the start point. We set it as 2 here, and each other word index will plus a constant index_from to differentiate some ‘helper index’ (eg. start_char, oov_char, etc.).

• A max_words variable is defined as the maximum index number (the least frequent word) included in the sequence. If the word index number is larger than max_words, it will be replaced by a out-of-vocabulary number oov_char, which is 3 here.

• Each word index sequence is restricted to the same length. We used left-padding here, which means the right (end) of the sequence will be keep as many as possible and drop the left (head) of the sequence if its length is more than pre-defined sequence_len, or padding the left (head) of the sequence with padding_value.

start transformation
finish transformation


### Word Embedding

Word embedding is a recent breakthrough in natural language field. The key idea is to encode words and phrases into distributed representations in the format of word vectors, which means each word is represented as a vector. There are two widely used word vector training alogirhms, one is published by Google called word to vector, the other is published by Standford called Glove. In this example, pre-trained glove is loaded into a lookup table and will be fine-tuned during the training process. BigDL provides a method to download and load glove in news20 package.

loading glove


For each word whose index less than the max_word should try to match its embedding and store in an array.

With regard to those words which can not be found in glove, we randomly sample it from a [-0.05, 0.05] uniform distribution.

BigDL usually use a LookupTable layer to do word embedding, so the matrix will be loaded to the LookupTable by seting the weight.

processing glove
finish processing glove


### Build models

Next, let’s build some deep learning models for the sentiment classification.

As an example, several deep learning models are illustrated for tutorial, comparison and demonstration.

LSTM, GRU, Bi-LSTM, CNN and CNN + LSTM models are implemented as options. To decide which model to use, just assign model_type the corresponding string.

### Optimization

Optimizer need to be created to optimise the model.

Here we use the CNN model.

creating: createSequential
creating: createLookupTable
creating: createRecurrent
creating: createGRU
creating: createSelect
creating: createLinear
creating: createDropout
creating: createReLU
creating: createLinear
creating: createSigmoid
creating: createBCECriterion
creating: createMaxEpoch
creating: createOptimizer
creating: createEveryEpoch


To make the training process be visualized by TensorBoard, training summaries should be saved as a format of logs.

With regard to the usage of TensorBoard in BigDL, please refer to BigDL Wiki.

creating: createTrainSummary
creating: createSeveralIteration
creating: createValidationSummary



Now, let’s start training!

Optimization Done.
CPU times: user 196 ms, sys: 24 ms, total: 220 ms
Wall time: 51min 52s


### Test

Validation accuracy is shown in the training log, here let’s get the accuracy on validation set by hand.

Predict the test_rdd (validation set data), and obtain the predicted label and ground truth label in the list.

Then let’s see the prediction accuracy on validation set.

Prediction accuracy on validation set is:  0.89312


Show the confusion matrix

<matplotlib.axes._subplots.AxesSubplot at 0x7fa643fe0c50>


Because of the limitation of ariticle length, not all the results of optional models can be shown respectively. Please try other provided optional models to see the results. If you are interested in optimizing the results, try different training parameters which may make inpacts on the result, such as the max sequence length, batch size, training epochs, preprocessing schemes, optimization methods and so on. Among the models, CNN training would be much quicker. Note that the LSTM and it variants (eg. GRU) are difficult to train, even a unsuitable batch size may cause the model not converge. In addition it is prone to overfitting, please try different dropout threshold and/or add regularizers (abouth how to add regularizers in BigDL please see BigDL Wiki).

### Summary

In this example, you learned how to use BigDL to develop deep learning models for sentiment analysis including:

• How to load and review the IMDB dataset
• How to do word embedding with Glove
• How to build a CNN model for NLP with BigDL
• How to build a LSTM model for NLP with BigDL
• How to build a GRU model for NLP with BigDL
• How to build a Bi-LSTM model for NLP with BigDL
• How to build a CNN-LSTM model for NLP with BigDL
• How to train deep learning models with BigDL

# A Small Program generating Fractals with Haskell

This blog is about how to use the functional programming language Haskell to write a small program to generate Fractals pictures.
If you have never seen a fractal before, there are two famous examples shown in the following pictures (Mandelbrot and Julia):

You can think of the fractals in this exercise as the graph of a complicated mathematical function. You may be familiar with graphs of functions such as $f(x) = x^2$ or $g(x) = a * x + b$. A fractal is a graph of a more complicated function over more dimensions. One interesting property of fractals is that they are self-similar if you zoom close enough, you will see copies of the fractal again. This pattern arises in nature as well: every cauliflower floret has the same shape as an entire cauliflower.

The code

# Deep Belief Network

One of the most popular unsupervised learning models is Restricted Boltzmann Machines (RBM), which has been used for many types of data including images [Hinton06], speech representation [Mohamed11] and moving ratings [Salakhutdinov07] . It was another success in building a generative model for bags of words that represents documents [31], which indicate its potential capacity to build a model to predict the stock with financial documents. The most widely usage of RBM is composed to build a Deep Belief Network (DBN). The learning process of RBM is usually contrastive divergence. Experiences are needed to decide the setting of numerical parameters, such the batch size, the learning rate, the momentum, the number of epochs to iterate, the layers of hidden units, and how many units in each hidden layer, the update methods being deterministically or stochastically and the initialization of weights and bias. It also needs to be considered the way to monitor the learning process and stop criteria. This section will introduce the key components and training method of our DBN model, based on the guide in [Hinton10].

## Restricted Boltzmann Machines

Energy based model assigns an energy value to each possible configuration of units by an energy function. The function will be trained to make it has some properties such as making the energy of desirable configuration has energy as low as possible. Boltzmann Machines (BMs) are a type of energy model in a form of Markov Ran- dom Field whose energy function is linear for its parameters. In order to make it sufficiently powerful to express complex functions, some invisible variables as hid- den units are added. The Restricted Boltzmann Machines is a restricted version of BMs, which constrain the BMs without the connections between different vision units or connections between di↵erent hidden units. A example graph is shown in figure below.

Figure 1.1: A Restricted Boltzmann Machine

The energy function is shown below:
$$E(v,h) = - \sum_{j \in hidden}a_jhj - \sum{i \in visible} b_ivi - \sum{i,j}h_jviw{ij}$$

where $h_j$ $v_i$ represent the value of the hidden and visible units. $a_j$ is the offset of the visible layer and $bi$ is the offsets of the hidden layer. $w{ij}$ is the weights connecting the hidden and visible units. The energy model defines the probability distribution via the energy function:

$$p(\pmb{v,h}) = \frac{1}{Z}e^{-E(\pmb{v,h})}$$

where $Z$ called partition function is a sum of all possible combinations of visible and hidden vectors:

$$Z=\sum_{\pmb{v,h}} e^{-E(\pmb{v,h})}$$

The probability of a given vector can be calculated by summing all hidden units:

$$p(\pmb{v}) = \frac{1}{Z}\sum_{\pmb{h}}e^{-E(\pmb{v,h})}$$

According to the free energy function of Energy-Based Models $G(\pmb{v}) = -\log\sum_{\pmb{h}}e^{-E(\pmb{v,h})}$, we can get the free energy of RBM model:

$$G(v) = - bv - \sum{i} \log\sum{h_i} e^{h_i(a_i + W_i v)}$$

The probability assigned to a training input can be increased via adjusting $W$, $\pmb{a}$ and $\pmb{b}$ to make the energy of the training examples lower and other examples in the input space higher. Both of the visible units $\pmb{v}$ and hidden units are conditionally independent of one anther, by reason of its specific structure. It can be mathematically expressed as:

$$P(h_i = 1|\pmb{v}) = sigmoid(W_i\pmb{v} + a_i)$$
$$P(v_j = 1|\pmb{h}) = sigmoid(W_j\pmb{h} + b_j)$$

Then the free energy of a RBM can be simplified to the following equation:
$$G(\pmb{v}) = -\pmb{bv} -\sum_{i}\log{1 + e^{a_i + W_i\pmb{v}}}$$

Figure 1.2: A Restricted Boltzmann Machine

##Deep Belief Networks
Deep Belief Networks (DBNs)[Hinton06] is a greedy layer-wise form network consists of stacked RBMs, and its hierachical architecture is shown in Figure~\ref{DBN}. It is also a graphical model which learns a multi-layer representation of the training examples.

Figure 1.3: A Deep Belief Network

As a stacking autoencoder, an DBN is stacked with RBMs by greedy layer wise unsupervised feature learning[Bengio07] , and each layer is a building block trained by RBM. After the hidden layer $\pmb{h^{i}}$ is trained, in next train step it will be as the visible layer, and the successive layer $h^{i + 1}$ will be the hidden layer. The joint distribution between input $\pmb{x}$ and all hidden layers $\pmb{h^{i}} (i\in [1,n])$ can be expressed in Equation~\ref{joint} below. $n$ is the number of RBM layrs. $x$ is the first layer so it can also be viewed as $\pmb{h^{0}}$. $P(\pmb{h^{n-1},h^{n}})$ is the joint probability between visible layer $\pmb{h^{n-1}}$ and hidden layer $\pmb{h^{n}}$ which is on the top of the DBN. $P(\pmb{h^{i-1}|h^i})$ is the distribution for the visible layer $\pmb{h^{i-1}}$ conditioned on layer $\pmb{h^i}$ at level $i$.

$$P(\pmb{x, h^1,..,h^n}) = P(\pmb{h^{n-1}, h^n})\left(\prod_{i=1}^{n-1}P(\pmb{h^{i-1}|h^{i}})\right)$$
The training process of DBN is shown in Algorithm:

For the fine-tuning part in this project, a logistic regression classifier was extended to the last layer. A training label will be assigned to each input and the parameters of the DBN will be tuned by gradient descent algorithm based on the negative log likelihood cost function.

[Hinton06] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algo- rithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.

[Mohamed11] Abdel-rahman Mohamed, Tara N Sainath, George Dahl, Bhuvana Ramabhad- ran, Geo↵rey E Hinton, Michael Picheny, et al. Deep belief networks using discriminative features for phone recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5060– 5063. IEEE, 2011.

[Salakhutdinov07] Ruslan Salakhutdinov, Andriy Mnih, and Geo↵rey Hinton. Restricted boltz- mann machines for collaborative filtering. In Proceedings of the 24th interna- tional conference on Machine learning, pages 791–798. ACM, 2007.

[Hinton10] Geoffrey Hinton. A practical guide to training restricted boltzmann machines. Momentum, 9(1):926, 2010.

[Bengio07] Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. Greedy layer-wise training of deep networks. Advances in neural information processing systems, 19:153, 2007.

# Rotate String

Given a string and an offset, rotate string by offset. (rotate from left to right)

Example
Given “abcdefg”.

offset=0 => "abcdefg"
offset=1 => "gabcdef"
offset=2 => "fgabcde"
offset=3 => "efgabcd"


## Solution:

For the example offset=3 => “efgabcd”.
First reverse “abcd” in “abcdefg”, we can get “dcbaefg”.
Then reverse “efg” in “dcbaefg”, we can get “dcbagfe”.
Finally, reverse the whole string “dcbagfe”, we can get “efgabcd”.

# Ugly Number II

Ugly number is a number that only have factors 2, 3 and 5.

Design an algorithm to find the nth ugly number. The first 10 ugly numbers are 1, 2, 3, 4, 5, 6, 8, 9, 10, 12…

# Digit Count

Count the number of k’s between 0 and n. k can be 0 - 9.\

Example
if n=12, k=1 in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], we have FIVE 1’s (1, 10, 11, 12)

Solution:

# Java 8 Lambda Expressions

There is a general usage in Swing library: we need a action listener to response a user operation.
Before Java 8, we have to create a new inner class which implements ActionListener interface and overide its unique method actionPerformed like the following code:

There are two problem:

1. So many template codes
2. The codes are not easy to read without expressing the purpose of the programmers


With Java 8, we can create a lambda expression:

Instead of passing in an object that implements an interface, we are passing in a bock of code : a function without a name.
In this code, we do not need to explicitly declare the type of event. In Lambda Expression, java compiler can infer the type of variable event from the context.

Note: For the sake of readability and familiarity, you have the option to include the type declarations, and sometimes the compiler
just can’t work it out!

### How to Spot a Lambda in a Haystack

The following code are five forms of lambda expressions as an example:

### Using Values, Not Varialbes

When you want to pass a varible outside the anonymous method into the method, you should declared the variable as final. Making a variable final means that you can’t reassign to that variable.

For lambda expression, although you do not need to set the outside variable as final, but you can still not change the value. If you changed, there will be an complie error.

### Default Methods for Interfaces

Java 8 enables us to add non-abstract method implementations to interfaces by utilizing the default keyword. This feature is also known as Extension Methods. Here is our first example:

### Functional Interfaces

A functional interface is an interface with a single abstract method that is used as the type of a lambda expression.

How does lambda expressions fit into Javas type system? Each lambda corresponds to a given type, specifibd by an interface. A so called functional interface must contain exactly one abstract method declaration. Each lambda expression of that type will be matched to this abstract method. Since default methods are not abstract you’re free to add default methods to your functional interface.

To ensure that your interface meet the requirements, you should add the @FunctionalInterface annotation. The compiler is aware of this annotation and throws a compiler error as soon as you try to add a second abstract method declaration to the interface.

Example:

Keep in mind that the code is also valid if the @FunctionalInterface annotation would be ommited.

Interface name Arguments Returns Example
Predicate T boolean Has this album been released yet?
Consumer T void Printing out a value
Function T R Get the name from an Artist object
Supplier None T A factory method
UnaryOperator T T Logical not (!)
BinaryOperator (T, T) T Multiplying two numbers (*)

# Java 8 new properties

Java 8 Lambdas: Functional Programming For The Masses by Richard Warburton gives a short, concise and surprisingly good coverage of how Java 8 lambdas (or “closures”) work and intended to be used, with focus on simplified and safer parallelism. In later context, also gives nice overview of emerging “reactive programming” and event-driven frameworks like Vert.x.

This blog summerizes some notes about the new properties in Java 8. It is a reading note of the book ‘Java 8 Lambdas: Functional Programming for Masses’. The content links are shown below.

1. [Lambda Expressions](./)
2. [Stream]()
3. [Libraries]()
5. [Data Parallelism]()
6. [Test, Debug and Refactor]()
7. [The Principle of Design and Architect]()


# Longest Substring Without Repeating Characters

Given a string, find the length of the longest substring without repeating characters. For example, the longest substring without repeating letters for “abcabcbb” is “abc”, which the length is 3. For “bbbbb” the longest substring is “b”, with the length of 1.



# Markdown Tutorial

When $a \ne 0$, there are two solutions to (ax^2 + bx + c = 0) and they are
$$x = {-b \pm \sqrt{b^2-4ac} \over 2a}.$$

# Welcome to md

This is a overview of how and what you can do with the markdown language.

After every blank line there will be a paragraph

Another paragraph.

Italic, bold, and monospace.

Basic lists:

• first item
• second item

This is a quote block
and is amazing.

A new line can be created with a empty line like the one above

Unicode is amazing. ☺😄😎😙

## This is a subtitle

Let’s create a ordered list:

1. first item
2. second item
3. third item

### This is a slighter smaller title

Nested lists? Of course!

1. Some recipe ingredients:

• carrots
• celery
• lentils

Let’s check tables:

size material color
9 leather brown
10 hemp canvas natural
11 glass transparent

A horizontal rule

And note that you can backslash-escape any punctuation characters
which you wish to be displayed literally, ex.: foo, *bar*, etc.

Latex plugin:
mathjax

$$a_1 + a_2 = a_3 + a_4$$
$a_1$

# Mac Configurationd

When we use a new Mac OS, we need to do the following configuration:

• Homebrew
A package manager for OS X

• NodeJS
Node.js® is a JavaScript runtime built on Chrome’s V8 JavaScript engine. Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient. Node.js’ package ecosystem, npm, is the largest ecosystem of open source libraries in the world.

• Git
Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.

• Hexo
Hexo is a fast, simple and powerful blog framework. You write posts in Markdown (or other languages) and Hexo generates static files with a beautiful theme in seconds.

• Vimrc
A good running configuartion file for vim environment.

• ANACONDA
Anaconda is a completely free Python distribution (including for commercial use and redistribution). It includes more than 300 of the most popular Python packages for science, math, engineering, and data analysis. See the packages included with Anaconda and the Anaconda changelog.

• MongoDB
MongoDB 3.2 is a giant leap forward that helps organizations standardize on a single, modern database for their new, mission-critical applications.

• How to Set \$JAVA_HOME environment variable on Mac OS X

• Installing Apache Maven
Jenkins Alternote

• Keycastr
KeyCastr, an open-source keystroke visualizer

• MySQL

• Mac Terminal

# Hexo

Check documentation for more info. If you get any problems when using Hexo, you can find the answer in troubleshooting or you can ask questions on GitHub.

# CPP Review note

C version:

C++ version:

## ￼Sizes of types

• The size of types (in bits/bytes) can vary in C/C++
• For different compilers/operating systems
• In Java, sizes are standardised, across O/Ss
• An int changes size more than other types!
– Used for speed (not portability), but VERY popular! (fast) – Uses the most efficient size for the platform
– 16 bit operating systems usually use 16 bit int
– 32 bit operating systems usually use 32 bit int
– 64 bit operating systems usually use 64 bit int
• sizeof() operator exists to tell us the size

# Chinese interview questions about technical basic knowledge

## Java 基础：

1. equals 与 == 的区别
2. HashMap, HashTable, ConcurrentHashMap 三者的区别
4. Collection 包结构与 Collections 的区别
• java.util.Collection 是一个集合接口。它提供了对集合对象进行基本操作的通用接口方法。Collection接口在Java 类库中有很多具体的实现。Collection接口的意义是为各种具体的集合提供了最大化的统一操作方式。
• java.util.Collections 是一个包装类。它包含有各种有关集合操作的静态多态方法。此类不能实例化，就像一个工具类，服务于Java的Collection框架。
5. Java 面向对象的三个特征及其含义
7. Interface 和 absctract class 的区别
8. java 多态的实现原理
9. 锁的等级： 方法锁， 对象锁， 类锁
10. wait（）和 sleep（） 的区别
11. java 反射 (reflection) 的作用和原理
*
12. Concurrent 包里
13. for each 和 for 循环的效率比较
14. 设计模式：单例， 工厂， 适配器， 责任链， 观察者等
17. hashcode and equals
18. 什么情况下一个object不会被回收
19. 写一个服务器log程序
20. 内存溢出(out of memory)和内存泄漏(memory leak):
21. 找公共父节点
22. Java Object 自带方法
23. String Buffer 和 String Builder 的区别

## 多线程

1. Synchronize 的机制
2. volatile 关键字
3. Synchronized 适用情况
4. 写出单例线程安全的懒汉模式
5. 写出生产者消费者模型
6. 解决死锁

1. ## 数据库

2. SQL 索引
MySQL索引原理及慢查询优化 MySQL索引类型总结和使用技巧以及注意事项
理解MySQL——索引与优化 MySQL引擎
*mysql索引总结—-mysql 索引类型以及创建
3. mySQL事务关系, 数据库事务

2. python 面试题

1. ## JVM

2. 内存溢出(out of memory)和内存泄漏(memory leak):
3. 类加载器：
4. JVM调优

clone 方法的实现，

##

1. ## Data Structure and algorithm

2. Priority Queue
3. TreeMap and TreeSet
4. 海量数据处理问题

1. 线程与进程的区别
2. 进程之间的通讯方式
3. 函数调用过程栈帧变化
4. 线程之间访问共享资源

## 网络协议：

1. 三次握手
参考
2. 从客户端发送信息到服务器端的过程模拟
3. get, put, post 的区别
GET，POST，PUT，DELETE的区别

# LRU Cache

Design and implement a data structure for Least Recently Used (LRU) cache. It should support the following operations: get and set.

get(key) - Get the value (will always be positive) of the key if the key exists in the cache, otherwise return -1.
set(key, value) - Set or insert the value if the key is not already present. When the cache reached its capacity, it should invalidate the least recently used item before inserting a new item.

## Solution:

To increase the efficiency, use a hash map to record each key to the corresponding node.
Use a double linked the list to store the key value pair, because it is quick for inserting and deleting nodes.
For each get operation, move node which contains the key to the head of the list.
For each set operation, if it is a new key, and the size will be greater than the capacity,
then delete the last node, and insert the new node to the head.

# Reorder List

Given a singly linked list L: L0→L1→…→Ln-1→Ln,
reorder it to: L0→Ln→L1→Ln-1→L2→Ln-2→…

You must do this in-place without altering the nodes’ values.

For example,
Given {1,2,3,4}, reorder it to {1,4,2,3}.


## Solution:

First, we can find the middle point of the list.
Then reverse the list behind the middle point.
Finally merge them together.

Given a linked list, return the node where the cycle begins. If there is no cycle, return null.

Note: Do not modify the linked list.

Can you solve it without using extra space?

## Solution:

We can use two pointers to go through the list.
The fast pointer has the twice speed as the slow pointer.
If their is a circle, they will meet eventrually.
Assume the length of the head to the loop start point is k, and the loop size is n.
When the slow pointer get the beginning loop node, it has gone k nodes, and the fast pointer go 2k nodes.
The distance between the two pointers is k % n, in other sise is (n - k % n).
When the fisrt meet on the circle, they will meet at the point which is (n - k % n) nodes far from the beginning loop point.
Then let the slow pointer go back to the head, and they start to move with the same speed.
When the slow pointer goes k nodes distance, it get to the beginning loop point.
The fast pointer goes further (k % n) distance, which is (n - k % n ＋ k % n) = n, so it is also the beginning loop point, and also they meet up.

Given a linked list, determine if it has a cycle in it.
Example
Given -21->10->4->5, tail connects to node index 1, return true

Challenge
Can you solve it without using extra space?

## Solution:

This is an easy problem. Just use two pointers to point to head.
One pointer go twice speed of the second pointer.
If they meet before the fast touch the end of the list, return true;
Else return false;

# Copy List with Random Pointer

A linked list is given such that each node contains an additional random pointer which could point to any node in the list or null.

Return a deep copy of the list.

## Solution:

Use a hash to record each node in the origin list as a key
The value is the deep copy of corresponding node.
Go through the list, for each node and random reference, check the map,
if the key was included in the map, get the value, or create a new node with the same label, and put it into the map.

# Reverse Nodes in k-Group

Given a linked list, reverse the nodes of a linked list k at a time and return its modified list.

If the number of nodes is not a multiple of k then left-out nodes in the end should remain as it is.

You may not alter the values in the nodes, only nodes itself may be changed. Only constant memory is allowed.

Example

For k = 2, you should return: 2->1->4->3->5

For k = 3, you should return: 3->2->1->4->5


## Solution:

Go through the list to obtain the length of the list.
Reverse the successive k nodes if the lenght of the right side is greater than k.

# Swap Nodes in Pairs

Example
Given 1->2->3->4, you should return the list as 2->1->4->3.

Challenge
Your algorithm should use only constant space. You may not modify the values in the list, only nodes itself can be changed.

## Solution:

This is an easay question.
If there are more than two nodes behind, exchange the order of the following two nodes,
and then move the http://www.lintcode.com/en/problem/swap-nodes-in-pairs/pointer 2 places.

# Rotate List

Given a list, rotate the list to the right by k places, where k is non-negative.

Have you met this question in a real interview? Yes

Example
Given 1->2->3->4->5 and k = 2, return 4->5->1->2->3.


## Solution:

Because k may be greater the length of the list, we need to caculate the length of the list first.
Then we can say rotate the list by (k % length) places.
Next, we Link the head and tail of the list as a circle.
Continuely, go through (length - k % length) places, and break the connection at that node.

# Remove Duplicates from Sorted List II

Given a sorted linked list, delete all nodes that have duplicate numbers, leaving only distinct numbers from the original list.

Have you met this question in a real interview? Yes
Example
Given 1->2->3->3->4->4->5, return 1->2->5.
Given 1->1->1->2->3, return 2->3.

## Solution:

Go through the list, use a variable ‘prevPre’ as a ponter to point to the node before previous node.
If the middle two nodes have the same value, delete the two nodes and record the value. Continue on the list, if the following nodes have the same value as this value, delete the following nodes.
Iterate the previous steps untils there are no node behind.

# String to Integer(atoi)

Implement function atoi to convert a string to an integer.

If no valid conversion could be performed, a zero value is returned.

If the correct value is out of the range of representable values, INT_MAX (2147483647) or INT_MIN (-2147483648) is returned.

Example
"10" => 10

"-1" => -1

"123123123123123" => 2147483647

"1.0" => 1


## Solution:

Detailed problem.
Three derails should be noted:

1. valid char c : c - '0',
2. not valid char: stop
3. '+' or '-' sign at the start, record it
4. Overflow: INT_MAX (2147483647) or INT_MIN (-2147483648) is returned


# Remove Duplicates from Sorted List

### Problem:

Given a sorted linked list, delete all duplicates such that each element appear only once.

Example
Given 1->1->2, return 1->2.
Given 1->1->2->3->3, return 1->2->3.

## Solution:

This is an eassy proble, just go through the list, and remove the node whose value equals the value of previous node.

# Partition List

## Problem:

Given a linked list and a value x, partition it such that all nodes less than x come before nodes greater than or equal to x.

You should preserve the original relative order of the nodes in each of the two partitions.

For example
Given 1->4->3->2->5->2 and x = 3,
return 1->2->2->4->3->5.

## Solution:

It is an easy problem.
Finally, join them together.

## Problem:s

Reverse a linked list from position m to n.

*Example*
Given 1->2->3->4->5->NULL, m = 2 and n = 4, return 1->4->3->2->5->NULL.


Note
Given m, n satisfy the following condition: 1 ≤ m ≤ n ≤ length of list.

Challenge
Reverse it in-place and in one-pass

## Solution:

This problem is complex which has many edge cases so that it is not easy to write a bug free program within a short time.
The idea of the solution is to go through the list, to the m, store the previous node before m, and start reversing from m to n.
Then link the non-reversed and reversed part together.

## Problem:

You have two numbers represented by a linked list, where each node contains a single digit. The digits are stored in forward order, such that the 1’s digit is at the head of the list. Write a function that adds the two numbers and returns the sum as a linked list.

Example
Given 6->1->7 + 2->9->5. That is, 617 + 295.

Return 9->1->2. That is, 912.


## Solution:

Revese the given two linked list.
Then use the solution presented in the solution of “Add Two Numbers” to add then together.
Finally, reverse the result linked list back.

## Problem:

You have two numbers represented by a linked list, where each node contains a single digit. The digits are stored in reverse order, such that the 1’s digit is at the head of the list. Write a function that adds the two numbers and returns the sum as a linked list.

Have you met this question in a real interview? Yes

Example
Given 7->1->6 + 5->9->2. That is, 617 + 295.

Return 2->1->9. That is 912.

Given 3->1->5 and 5->9->2, return 8->0->8.


## Solution:

This is an easay problem. Use a carry varible to store the carray and go through the two list from right to left doing addtion.

# Single Number III

### Problem:

Given 2*n + 2 numbers, every numbers occurs twice except two, find them.

Have you met this question in a real interview? Yes

Example
Given [1,2,2,3,4,4,5,3] return 1 and 5


Note:

1. The order of the result is not important. So in the above example, [5, 1] is also correct.
2. Your algorithm should run in linear runtime complexity. Could you implement it using only constant space complexity?

Challenge
O(n) time, O(1) extra space.

### Solution:

This problem is also about bit operation.
First we xor all the numbers in the array.
Then we can get the last digits which equals 1.
The bit represent the last different bit of the two exceptional numbers.
Next, we use the bit to separate the numbers in the array into two group, and use xor to link the numbers in each group.
After the iteration stop, each number left in each group should be the two results.

# Single Number II

## Problems:

Given 3*n + 1 numbers, every numbers occurs triple times except one, find it.

Example
Given [1,1,2,3,3,3,2,2,4,1] return 4


Challenge
One-pass, constant extra space.

## Solution:

This problem examines the ability of applying bit operation.
First, we create an integer array with size of 32, each int represent a bit of an integer.
Then for each bit position of an integer, we go through the input numbers, and accululate the value of the correpsonding bit.
As we know, the sum of each bit shold be dividable by 3 except one number.
So we mod 3 for each bit, then the left number is the result.

# Single number

## Problem:

Given 2*n + 1 numbers, every numbers occurs twice except one, find it.

Have you met this question in a real interview? Yes
Example
Given [1,2,2,1,3,4,3], return 4

Challenge
One-pass, constant extra space.

## Solution:

As we know, each two same integer after xor operation will lead to 0x00000000.
And if x is an integer, 0x00000000 ^ x = x.
So we can take the adantage of xor property to solve this problem.

# Candy

## Problem:

There are N children standing in a line. Each child is assigned a rating value.

You are giving candies to these children subjected to the following requirements:

Each child must have at least one candy.

Children with a higher rating get more candies than their neighbors.

What is the minimum candies you must give?

Have you met this question in a real interview? Yes

Example
Given ratings = [1, 2], return 3.

given ratings = [1, 1, 1], return 3.

Given ratings = [1, 2, 2], return 4. ([1,2,1]).


## Solution:

Initialize the candy array to 1s.
First go through from left to right;
Second go through form right to left with the following condition:

• If the current child’s rating is greater than the previous one, the current candies should be at least 1 more than the previous one

# Gas Station

## Problem:

There are N gas stations along a circular route, where the amount of gas at station i is gas[i].

You have a car with an unlimited gas tank and it costs cost[i] of gas to travel from station i to its next station (i+1). You begin the journey with an empty tank at one of the gas stations.

Return the starting gas station’s index if you can travel around the circuit once, otherwise return -1.

Have you met this question in a real interview? Yes
Example
Given 4 gas stations with gas[i]=[1,1,3,1], and the cost[i]=[2,2,1,1]. The starting gas station’s index is 2.

Note
The solution is guaranteed to be unique.

Challenge
O(n) time and O(1) extra space

## Solution:

Use a ‘total’ variable to record the total amount of gas left through the circus.
Use a ‘sum’ variable to record the current gas left in the tank.
Use a pointer to record the starting label.
Once the there is no gass left in the car before next gas station available, set the label next gas station as the new starting gas station.

# Set Matrix Zeroes

## Problem:

Given a m x n matrix, if an element is 0, set its entire row and column to 0. Do it in place.

Have you met this question in a real interview? Yes
Example
Given a matrix

[
[1,2],
[0,3]
],
return [ [0,2], [0,0] ]


Challenge
Did you use extra space? A straight forward solution using O(mn) space is probably a bad idea. A simple improvement uses O(m + n) space, but still not the best solution. Could you devise a constant space solution?

## Solution:

The first method used two boolean arrays to store the columns or rows should be set to zero.
To make it use constant space, the second method use the first row and column to store the information that rows and columns should be set to zero.

# Gray Code

## Problem:

The gray code is a binary numeral system where two successive values differ in only one bit.

Given a non-negative integer n representing the total number of bits in the code, find the sequence of gray code. A gray code sequence must begin with 0 and with cover all 2n integers.

Have you met this question in a real interview? Yes
Example
Given n = 2, return [0,1,3,2]. Its gray code sequence is:

00 - 0
01 - 1
11 - 3
10 - 2


Note
For a given n, a gray code sequence is not uniquely defined.

[0,2,3,1] is also a valid gray code sequence according to the above definition.

Challenge
O(2n) time.

## Solution:

Use recursion to generate a new gray code from the (n-1) as the picture shown in the Figure below.


/*************************************************************************
> File Name: GrayCode.java
> Author: Yao Zhang
> Mail: psyyz10@163.com
> Created Time: Fri 13 Nov 17:22:56 2015
************************************************************************/

public class GrayCode{
List<Integer> grayCode(int n){
ArrayList<Integer> result = new ArrayList<Integer>();
if (n == 0)
return result;
if (n == 1){
return result;
}
result = (ArrayList<Integer>) grayCode(n - 1);
ArrayList<Integer> temp = reverse(result);
for (int i = 0; i < temp.size(); i++){
temp.set(i,(1 << (n - 1)) + temp.get(i));
}

return result;
}

public ArrayList<Integer> reverse(ArrayList<Integer> result){
ArrayList<Integer> newResult = new ArrayList<Integer>();
for (int i = result.size() - 1; i >= 0; i--){
}
return newResult;
}
}


# Climbing Stairs

## Problem:

You are climbing a stair case. It takes n steps to reach to the top.

Each time you can either climb 1 or 2 steps. In how many distinct ways can you climb to the top?

## Solution:

The idea behind this problem is the same as Fibonacci which is f(n) = f(n-1) + f(n-2).
There are two solutions making the time complexity as O(N).

1. The first one is an iteration. Time complexity O(n), space complexity O(1).
2. The second one is recurstive version. But the normal recurtion is very expensive, roughly O(2^n).
So we used dynamic programming by take a buffer stores the value has gone.
The time complexity is O(n), space complexity is O(n) too.

There are two solutions.
/*

> File Name: ClimbingStairs.java
> Author: Yao Zhang
> Mail: psyyz10@163.com
> Created Time: Fri 13 Nov 16:12:24 2015


**/

public class ClimbingStairs{
// Solution 1
public int climbStairs(int n){
if (n <= 1)
return n;

   int prev = 1;
int last = 1;
int current = 0;
for (int i = 2; i <= n; i++){
current = prev + last;
last = prev;
prev = current;
}

return current;
}

//Solution2
public int climbStairs(int n){
if (n < 0)
return 0;

int[] buffer = new int[n];

return climbStairsHelper(n, buffer);
}
public int climbStairsHelper(int n, int[] buffer){
if (n < 0){
return 0;
}else if (n == 0){
return 1;
} else if (buffer[n - 1] != 0){
return buffer[n - 1];
} else {
buffer[n - 1] = climbStairsHelper(n - 1, buffer) + climbStairsHelper(n - 2, buffer);
return buffer[n - 1];
}
}


}

# Plus One

## Problem:

Given a non-negative number represented as an array of digits, plus one to the number.

The digits are stored such that the most significant digit is at the head of the list.

Example
Given [1,2,3] which represents 123, return [1,2,4].

Given [9,9,9] which represents 999, return [1,0,0,0].


## Solution:

It is a essay problem, just add carry though the list, until the carry equals 0.

# Rotate Image

## Problem:

You are given an n x n 2D matrix representing an image. Rotate the image by 90 degrees (clockwise).

Example
Given a matrix

[
[1,2],
[3,4]
]


rotate it by 90 degrees (clockwise), return

[
[3,1],
[4,2]
]


## Solution:

There are two solutions.
The idea of the first method is to divide the matrix into four smaller matrix, then move each sub-matrix to the rotated position
The idea of the second method is to rotate layer by layer, from the outmost layer to the inner most layer.

# Trapping Rain Water

## Problem:

Given n non-negative integers representing an elevation map where the width of each bar is 1, compute how much water it is able to trap after raining.

For example,
Given [0,1,0,2,1,0,1,3,2,1,2,1], return 6.


The above elevation map is represented by array [0,1,0,2,1,0,1,3,2,1,2,1]. In this case, 6 units of rain water (blue section) are being trapped. Thanks Marcos for contributing this image!

## My solution:

For each bar, find the highest bars of the left side and right side respectively.
The trapping water for the bar is min(leftMax,rightmax) - its height.

There are two solutions.

• Find the highest bar
• Process the left and right of the highest bar respectively
• Go through the bars from left to right, calculate the left highest bar for each bar
• Go through the bars form right to left, calculate the left highest bar for each bar
• Calculate the min(leftMax,rightmax) - its height for each bar.

# Valid Sudoku

## Problem:

Determine if a Sudoku is valid, according to: Sudoku Puzzles - The Rules.

The Sudoku board could be partially filled, where empty cells are filled with the character ‘.’.

A partially filled sudoku which is valid.

Note:
A valid Sudoku board (partially filled) is not necessarily solvable. Only the filled cells need to be validated.

## My Solution:

This question is straight, no tricky ideas needed.
But it reviews the capacity of detailed implementation.

Related Problems:

# Permutation Sequence

## Problem:

Given n and k, return the k-th permutation sequence.

Example
For n = 3, all permutations are listed as follows:

"123"
"132"
"213"
"231"
"312"
"321"
If k = 4, the fourth permutation is "231"


Note
n will be between 1 and 9 inclusive.

Challenge
O(n*k) in time complexity is easy, can you do it in O(n^2) or less?

## My Solution:

The most intuitive way is calling nextPermuation method for k times. It takes O(k * n);
Assume there are n non-repeated numbers,whose k-th permuation is a1,a2,a3, …, an.
For each increassing of a1, there are (n-1)! possibilities for a2 to an. So a1 = k / (n-1)!;
Similarly we can get the following as following:

k_2 = k % (n-1)!;
a_2 = k_2 / (n-2)!;
...
k_{n-1} = k_{n-2} / 2!;
a_{n-1} = k-{n-1} / 1!;
a_n = 1;


# Next Permutation

## Problem:

Implement next permutation, which rearranges numbers into the lexicographically next greater permutation of numbers.

If such arrangement is not possible, it must rearrange it as the lowest possible order (ie, sorted in ascending order).

The replacement must be in-place, do not allocate extra memory.

Here are some examples. Inputs are in the left-hand column and its corresponding outputs are in the right-hand column.
1,2,3 → 1,3,2
3,2,1 → 1,2,3
1,1,5 → 1,5,1


## Solution:

Image from http://fisherlei.blogspot.com/2012/12/leetcode-next-permutation.html

Related Problems:

# Remove Element

## Problem:

Given an array and a value, remove all instances of that value in place and return the new length.

The order of elements can be changed. It doesn’t matter what you leave beyond the new length.

Subscribe to see which companies asked this question

## Solution:

Use a ponter to point to the end index of the array. Go through the array, if the current element equals to the target value, swap the current value with the value at the end index, and then move the end pointer a unit towards the start direction. Then repeat.

# 4Sum

## Problem:

Given an array S of n integers, are there elements a, b, c, and d in S such that a + b + c + d = target? Find all unique quadruplets in the array which gives the sum of target.

Note:
Elements in a quadruplet (a,b,c,d) must be in non-descending order. (ie, a ≤ b ≤ c ≤ d)
The solution set must not contain duplicate quadruplets.
For example, given array S = {1 0 -1 0 -2 2}, and target = 0.

A solution set is:
(-1,  0, 0, 1)
(-2, -1, 1, 2)
(-2,  0, 0, 2)


## My solution:

The solution is nearly the same as 3sum

Related Problem:
Two Sum
3Sum
3Sum Closest

# 3Sum Closest

## Problem:

Given an array S of n integers, find three integers in S such that the sum is closest to a given number, target. Return the sum of the three integers. You may assume that each input would have exactly one solution.

For example, given array S = {-1 2 1 -4}, and target = 1.

The sum that is closest to the target is 2. (-1 + 2 + 1 = 2).


## Solution:

Sort the list first, which takes O(Nlog N). Then go through the sorted list， for each integer “nums[i]” in the list,
narrow the range from the integers at index i and the end of the list. It takes O(n^2)

Use a ‘min’ variable to record the minimum difference, and a ‘result’ variable to record the sum with the minimum difference.

# 3Sum

## Problem:

Given an array S of n integers, are there elements a, b, c in S such that a + b + c = 0? Find all unique triplets in the array which gives the sum of zero.

Note:
Elements in a triplet (a,b,c) must be in non-descending order. (ie, a ≤ b ≤ c)
The solution set must not contain duplicate triplets.
For example, given array S = {-1 0 1 2 -1 -4},

A solution set is:
(-1, 0, 1)
(-1, -1, 2)


## My Solution:

Sort the list first, which takes O(Nlog N). Then go through the sorted list， for each integer “nums[i]” in the list,
narrow the range from the integers at index i and the end of the list. It takes O(n^2)

Related Problem:
Two Sum
4Sum
3Sum Closest

# Two Sum

## Problem:

Given an array of integers, find two numbers such that they add up to a specific target number.

The function twoSum should return indices of the two numbers such that they add up to the target, where index1 must be less than index2. Please note that your returned answers (both index1 and index2) are not zero-based.

You may assume that each input would have exactly one solution.

Input: numbers={2, 7, 11, 15}, target=9
Output: index1=1, index2=2


## Solution

We can use a hash map to record the occurrence and index of integers.

When we go through the list, the key (target - current integer) exits in the hash map, then we can return current index and index corresponding to (target - current integer).

Related Problem:
3Sum
4Sum
3Sum Closest

# Median of Two Sorted Arrays

## Problem:

There are two sorted arrays nums1 and nums2 of size m and n respectively. Find the median of the two sorted arrays. The overall run time complexity should be O(log (m+n)).

Example
Given A=[1,2,3,4,5,6] and B=[2,3,4,5], the median is 3.5.

Given A=[1,2,3] and B=[4,5], the median is 3.


Challenge
The overall run time complexity should be O(log (m+n)).

## My Solution:

This is a classic problem, which can be viewed as another form: “Given two sorted array, how to find the k-th largest number among both arrays?”.

The k in this problem is “total length / 2” for odd number. If the total length is even, we need to find “total length / 2” and “total length / 2 + 1” number, and then get the average value.

The most intuitive solution is to merge them and find the k-th number, which takes O(M + N);

There is a better solution. Assume both amount of elements in A and B are greater than k / 2. If we compare the (k / 2)-th value in A (A[k / 2 - 1]) and the (k / 2)-th value in B (B[k / 2 - 1]), we have three possilbe cases:

• If A[k / 2 - 1] == B[k / 2 - 1], we can say the value is the k-th value we want.
• If A[k / 2 - 1] < B[k / 2 - 1], it means integer from A[0] to A[k / 2 - 1] should be smaller than the k-th value, so we can drop these values.
• If A[k / 2 - 1] > B[k / 2 - 1], it means integer from B[0] to B[k / 2 - 1] should be smaller than the k-th value, so we can drop these values.

To drop the values in A and B, we can use two pointers to point to the start index of A and B respectively.
We can build a recursion whose stop points are:

• If the start index pointer of A or B exceed the length, return B[BStart + k - 1] or A[AStart + k - 1];
• If k == 1, return the minimum of A[AStart] or B[BStart]
• If A[AStart + k / 2 -1] == B[BStart + k / 2 -1], return the value.

# Search in Rotated Sorted Array II

## Problem:

Follow up for “Search in Rotated Sorted Array”:
What if duplicates are allowed?

Would this affect the run-time complexity? How and why?

Write a function to determine if a given target is in the array.

## My Solution:

Compared to the “Search in Rotated Sorted Array”, the elements can be duplicate, so we can not know the start point is on the left part or right part, even if we have “nums[left] <= nums[mid]”.

We can only judge when “nums[left] < nums[mid]”. So if “nums[left] == nums[mid]”, we can add left lie “left++” until “nums[left]” and “nums[right]” are different. Code implemented in ‘search’ method.

The time complexity is O(n).

Solution 1:

Solution 1 is better to return an index. If we just need to return boolean, a brute force method is ok, like Solution 2.

Related Problem:
Search in Rotated Sorted Array

# Search in Rotated Sorted Array

## Problem:

Suppose a sorted array is rotated at some pivot unknown to you beforehand.

(i.e., 0 1 2 4 5 6 7 might become 4 5 6 7 0 1 2).

You are given a target value to search. If found in the array return its index, otherwise return -1.

You may assume no duplicate exists in the array.

Example
For [4, 5, 1, 2, 3] and target=1, return 2.

For [4, 5, 1, 2, 3] and target=0, return -1.

Challenge
O(logN) time

## My Solution:

Use binary search to find the match.
There are some notable tips:

• Check the start point first (at the left or right side of mid), then assume the target on the consecutive segment.
• While loop condition: left != right
• After the while loop, we should add “if (nums[left] == target) return left;”
• The if condition “nums[mid] >= nums[left]” should contain the equal sign

The time complexity is O(log n).

Related Problem:
Search in Rotated Sorted Array II

# Remove Duplicates from Sorted Array II

## Problem:

What if duplicates are allowed at most twice?

For example,
Given sorted array nums = [1,1,1,2,2,3],

Your function should return length = 5, with the first five elements of nums being 1, 1, 2, 2 and 3. It doesn’t matter what you leave beyond the new length.

## Solution:

Compared to Remove Duplicates from Sorted Array, We onlt need to add an another variable to record the number of times that the same elements occur, named as ‘same’;
This is a sorted map, so we just need to one variable to solve the problem, if not sorted, we need a Hashmap to record the occurrence of each elements.

Remove Duplicates from Sorted Array

# Remove Duplicates from Sorted Array

## Problem:

Given a sorted array, remove the duplicates in place such that each element appear only once and return the new length.

Do not allocate extra space for another array, you must do this in place with constant memory.

Example
Given input array A = [1,1,2],

Your function should return length = 2, and A is now [1,2].

Tags Expand

Related Problems Expand

## My Solution:

In this question, a variable named ‘last’ is used to record the index of last element in the result array.
Go through the array, if the current element nums[i] is not equal to the last element of result array ‘nums[last]’,
add the current element to end of result array.

Time complexity is O(n), space complexity is O(1)

Remove Duplicates from Sorted Array II

# Deep Learning Overview

Nowadays, the performances of machine learning models heavily rely on the representation of data or feature selection steps rather than just the choice of machine learning algorithms. Thus much effort is applied on preprocessing pipelines such as feature selection. Even though some specific domain knowledge can be used to help design the representation of data, the motivation of Artificial Intelligence needs more powerful representations of features. Deep Learning also called unsupervised learning, is a relatively new research field of machine learning which can learn multi- ple levels of abstraction and representation of features directly from data. It aims at learning feature hierarchies with higher level features formed by the composition of lower ones. The multiple levels structures allow to build complex functions which take data as input and output the result directly without depending on features crafted by humans [Bengio09].

Deep learning achieved many successful results on some problems, such as im- age classification [Ciresan12] [Krizhevsky12], semantic parsing [Bordes12] and speech recognition [Dahl12]. Deep architecture may express the complex distributions more efficiently with better per- formance on challenging tasks [Bengio07][Bengio09]. The hypothesis that the composition of addi- tional functional levels can give more powerful modeling capacity has already been proposed for a long time [Hinton90][Rumelhart86]. However, the training process of deep architec- ture was proven to be very difficult, until some successful approaches of [Bengio007][Hinton06][Hinton006] for training stacked autoencoder and DBN occurred. One key idea behind them is to train the deep architecture layer by layer by unsupervised learning, which is also called unsupervised feature learning. Each layer will generate a more abstract representation of the observed layer by doing a local optimization. Unsupervised feature learning can learn useful features automatically and directly from the data set by unsupervised learning without given specific features defined by human. The unsupervised learned features are more natural with less information lost [Bengio15]. Some deep learning models also have a potential powerful capacity of solving time-series problems [La14], which is another reason that makes deep learning suitable for stock trend prediction. Therefore deep learning can provide a new potential and powerful approach to improve stock prediction.

## Building Deep Representations

Experimental results show that it is much harder to train a deep architecture than training a shallow one [Bengio07][Erhan09]. The rapid recent growth and success of deep learning owe to a breakthrough initiated by Geoff Hinton in 2006 and quickly followed up by some papers [Hinton006][Bengio94][Poultney06]. Greedy layer wise unsupervised pre-training as a central idea was proposed. It means just one layer of the hierarchy is trained at one time by unsupervised feature learning to learn a new transformation of data with the previous layer output. Finally, the set of pre-trained layers is combined with a stan- dard supervised classifier such as Support Vector Machine, Logistic Regression or as initialization for a deep learning model such as a Stacked Denoising Auto-encoder or Deep Belief Network. Experiments show that the layer-wise stacking can attain a better feature representation in most time [Larochelle09]. Although it is not difficult to combine single layers pretrained by unsupervised learning into a supervised model, it is not very clear how the single layers should combine to form a better unsuper- vised model [Bengio15]. One approach is to stack pre-trained RBMs (section 4.2.1) into DBN (section 4.2.2).

### Stacked Denoising Autoencoders

[Bengio09] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009. Also published as a book. Now Publishers, 2009.

[Ciresan12] Dan Ciresan, Ueli Meier, and Ju ̈rgen Schmidhuber. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3642–3649. IEEE, 2012.

[Krizhevsky12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[Bordes12] Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. Joint learn- ing of words and meaning representations for open-text semantic parsing. In International Conference on Artificial Intelligence and Statistics, pages 127– 135, 2012.

[Dahl12] George E Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre- trained deep neural networks for large-vocabulary speech recognition. Audio, Speech, and Language Processing, IEEE Transactions on, 20(1):30–42, 2012.

[Bengio07] Yoshua Bengio, Yann LeCun, et al. Scaling learning algorithms towards ai. Large-scale kernel machines, 34(5), 2007.

[Hinton90] Geoffrey E Hinton. Connectionist learning procedures. artificial intelligence, 40 1-3: 185 234, 1989. reprinted in j. carbonell, editor,”. Machine Learning: Paradigms and Methods”, MIT Press, 1990.

[Rumelhart86] David E Rumelhart and James L McClelland. The pdp research group: Par- allel distributed processing: Explorations in the microstructure of cognition. Foundations, 1, 1986.

[Bengio007] Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. Greedy layer-wise training of deep networks. Advances in neural information processing systems, 19:153, 2007.

[Hinton06] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[Hinton006] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algo- rithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.

[Erhan09] Dumitru Erhan, Pierre-Antoine Manzagol, Yoshua Bengio, Samy Bengio, and Pascal Vincent. The difficulty of training deep architectures and the effect of unsupervised pre-training. In International Conference on artificial intelligence and statistics, pages 153–160, 2009.

[Bengio15] Yoshua Bengio, Ian J. Goodfellow, and Aaron Courville. Deep learning. Book in preparation for MIT Press, 2015.

[La14] Martin La ̈ngkvist, Lars Karlsson, and Amy Loutfi. A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recogni- tion Letters, 42:11–24, 2014.

[Bengio94] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term depen- dencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166, 1994.

[Poultney06] Christopher Poultney, Sumit Chopra, Yann L Cun, et al. Efficient learning of sparse representations with an energy-based model. In Advances in neural information processing systems, pages 1137–1144, 2006.

[Larochelle09] Hugo Larochelle, Yoshua Bengio, J ́eroˆme Louradour, and Pascal Lamblin. Ex- ploring strategies for training deep neural networks. The Journal of Machine Learning Research, 10:1–40, 2009.

[Bengio15] Yoshua Bengio, Ian J. Goodfellow, and Aaron Courville. Deep learning. Book in preparation for MIT Press, 2015.

# Stacked Denoising Autoencoders

## Autoencoders

An autoencoder [Bengio09] is a network whose graphical structure is shown in Figure 4.1, which has the same dimension for both input and output. It takes an unlabeled training examples in set where is a single input and encodes it to the hidden layer by linear combination with weight matrix and then through a non-linear activation function. It can be mathematically expressed as , where is the bias vector.

Figure 4.1: An Autoencoder

After that the hidden layer representation will be reconstructed to the output layer 􏰌 through a decoding function, in which 􏰌 has a same shape as . Hence the decoding function can be mathematically expressed as 􏰌, where can be called tried weights. In this project, tied weights were used. The aim of the model is to optimize the weight matrices, so that the reconstruction error between input and output can be minimized. It can be seen that the Autoencoder can be viewed as an unsupervised learning process of encoding-decoding: the encoder encodes the input through multi-layer encoder and then the decoder will decode it back with the lowest error [Hinton06].
To measure the reconstruction error, traditional squared error can be used. One of the most widely used way to measure that is the cross entropy if the input can be represented as bit vector or bit possibilities. The cross entropy error is shown in Equation 4.1:

Equation:4.1

The hidden layer code can capture the information of input examples along the main dimensions of variant coordinates via minimizing the reconstruction error. It is similar to the principle component analysis (PCA) which project data on the main component that captures the main information of the data. h can be viewed as a compression of input data with some lost, which hopefully not contain much information about the data. It is optimized to compress well the training data and have a small reconstruction error for the test data, but not for the data randomly chosen from input space.

## Denoising Autoencoders

In order to prevent the Autoencoder from just learning the identity of the input and make the learnt representation more robust, it is better to reconstruct a corrupted version of the input. The Autoencoder with a corrupted version of input is called a Denoising Autoencoder. Its structure is shown in Figure 4.2. This method was proposed in [Vincent08], and it showed an advantage of corrupting the input by comparative experiments. Hence we will use denoising autoencoders instead of classic autoencoders in this thesis.

Figure 4.2: A graphical figure of Denoising Autoencoder. An input x is corrupted to . After that the autoencoder maps it to the hidden representation and attempts to reconstruct .

A Denoising Autoencoder can be seen as a stochastic version with adding a stochastic corruption process to Autoencoder. For the raw inputs , some of them will be randomly set to 0 as corrupted inputs . Next the corrupted input will be en- coded to the hidden code and then reconstructed to the ouput. But now 􏰌 is a deterministic function of rather than . As Autoencoder, the reconstruction is considered and calculated between 􏰌 and noted as . The parameters of the model are initialized randomly and then optimized by stochastic gradient descent algorithms. The difference is that the input of the encoding process is a corrupted version , hence it forces a much more clever mapping than just the identity, which can denoise and extract useful features in a noise condition.

The training algorithm of a denoising autoencoder is summarized in Algorithm 2.

## Stacked Autoencoder

Unsupervised pre-training
A Stacked Autoencoder is a multi-layer neural network which consists of Autoencoders in each layer. Each layer’s input is from previous layer’s output. The greedy layer wise pre-training is an unsupervised approach that trains only one layer each time. Every layer is trained as a denoising autoencoder via minimising the cross entropy in reconstruction. Once the first layer has been trained, it can train the layer by using the previous layer’s hidden representation as input. An example is shown below. Figure 4.3 shows the first step of a stacked autoencoder. It trains an autoencoder on raw input to learn by minimizing the reconstruction error .

Figure 4.3: Step 1 in Stacked Autoencoders

Next step shown in Figure 4.4. The hidden representation would be as ”raw input” to train another autoencoder by minimizing the reconstruction error . Note that the error is calculated between previous latent feature representation and the output . Parameters and will be optimized by the gradient descent algorithm. The new hidden representation h2 will be the ’raw input’ of the next layer.

Figure 4.4: Step 2 in Stacked Autoencoders

The pre-training algorithm of stacked denoising autoencoder is summarized in algorithm 3.

## Supervised fine-tuning

At last once all the layers has been pre-trained, the next step called fine-tuning is performed. A supervised predictor will be extended to the last layer, such as support vector machine or a logistic regression layer. In this project, we chose a logistic regression layer. After that the network will be trained. A sample graph is shown in Figure 4.5. It can be seen that for each layer of the network, only the encoding hidden representation are considered. The fine-tuning step will train the whole network by back-propagation like training an Artificial Neural Network. A stacked denoising autoencoder is just replace each layer’s autoencoder with denoising autoencoder whilst keeping other things the same.

Figure 4.5: A complete architecture of stacked autoencoder

The supervised fine-tuning algorithm of stacked denoising auto-encoder is summa- rized in Algorithm 4.

[Bengio09] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009. Also published as a book. Now Publishers, 2009.

[Hinton06] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006

[Vincent08] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Man- zagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103. ACM, 2008.

￼￼￼￼