For a training set X of $$m$$ samples, each sample have $$n$$ features. $$y$$ is the responder. Both X and y can be written in matrix form as:

$X = \begin{bmatrix} x^{1}_{1}&x^{1}_{2}&x^{1}_{3}&\cdots&x^{1}_{n}\\ x^{2}_{1}&x^{2}_{2}&x^{2}_{3}&\cdots&x^{2}_{n}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ x^{m}_{1}&x^{m}_{2}&x^{m}_{3}&\cdots&x^{m}_{n}\\ \end{bmatrix} = \begin{bmatrix} x^{(1)^T}\\ x^{(2)^T}\\ x^{(3)^T}\\ \vdots\\ x^{(m)^T} \end{bmatrix} \ where\ x^i_j \in \mathbf{R} \\ y = \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_m \end{bmatrix} \ where\ y_i \in \left\{ {0, 1}\right\}$

We have log likelihood of logistic regression:

$l(\theta) = \sum\limits_{i=1}^{m}[y^{(i)}ln\ h(x^{(i)}) + (1 - y^{(i)})ln(1 - h(x^{(i)})]$

Where $$h(x^{(i)})$$ is sigmoid function of $$x^{(i)}$$

$h(x^{(i)}) = \frac{1}{1 + e^{-\theta^Tx^{(i)}}}$

In order to maximize $$l(\theta)$$, I use Newton method to find $$\theta$$. We have:

$\theta := \theta - H^{-1} \nabla_{\theta} l(\theta)$

Where H is Hessian matrix:

$H_{i,j} = \frac{\partial^2l(\theta)}{\partial \theta_i \partial \theta_j}$

and $$\nabla_{\theta}l(\theta)$$ is the vector of partial derivatives of $$l(\theta)$$ with respect to the $$\theta$$

$\nabla_{\theta}l(\theta) = \begin{bmatrix} \frac{\partial l(\theta)}{\partial \theta_1}\\ \frac{\partial l(\theta)}{\partial \theta_2}\\ \cdots\\ \frac{\partial l(\theta)}{\partial \theta_m}\\ \end{bmatrix}$

The matrix form of $$H_{i,j}$$ is:

$H = X'SX$

here S is the matrix of:

$S = \begin{bmatrix} h(x^{(1)})(1 - h(x^{(1)})&0&\cdots&0 \\ 0&h(x^{(2)})(1 - h(x^{(2)})&\cdots&0 \\ \vdots&\vdots&\ddots&\vdots \\ 0&0&\cdots&h(x^{(m)})(1 - h(x^{(m)})) \end{bmatrix}$

Having all the thing I need, let implement it! For simplicity, I have used Matlab to calculate the above equation

P/S: using MathJax for Latex is so convenience!