Scipy, a very well-known Python library, has some basic but powerful tools for optimization. Based on my observation, these methods work fine when the number of independent variables is small. But in applications with tens or hundreds of parameters, the library becomes impractical: it runs very slowly, and defining the objective function becomes cumbersome.
Here, I will not introduce all the available functions, only some of the better-known ones. Also, these are my own impressions; other researchers may get better results from this library with more tuning.
The experiments were done in a single JupyterLab notebook and run on a server node with 12 CPU cores.
The main question this experiment tries to answer is:
"How can we make a good regression model for a function with a complex shape?"
As discussed in this post, the problem can be viewed from several angles. Here we will use several Scipy tools to get more familiar with them. The JupyterLab notebook is available on GitHub.
First, a non-linear function is defined to produce the training data. Then, the non-linearity of the function is increased step by step. For example, the function x*sinc(2*pi*f*x) is highly non-linear, especially as the frequency of the sinc increases.
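A minimal sketch of how such training data could be generated; the sample count and the frequency f below are arbitrary choices, and note that np.sinc is the normalized sinc, so np.sinc(2*f*x) corresponds to sinc(2*pi*f*x) in the unnormalized convention:

import numpy as np

f = 2.0                              # sinc frequency; higher values make the target more non-linear
t = np.linspace(-1.0, 1.0, 400)      # input samples
ft = t * np.sinc(2 * f * t)          # target: x * sinc(2*pi*f*x), training outputs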
Then, three non-linear models were proposed. The first is an activation function applied after a linear transformation. The second is a small perceptron with three nodes: two nodes feeding a single output node. It therefore has 7 parameters, four of which are weights and three of which are biases.
The last model is a polynomial whose degree can be chosen.
All the above models need an activation function. If the activation function is from the same family as the original function, the estimate is often very good. If it is different, the estimator needs more parameters or a higher degree. Some activation functions work well in general, such as polynomials (on inputs between -1 and 1), tanh, and sigmoid.
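The exact model definitions are in the notebook; the following is only my reading of the description above, with tanh chosen as the activation function:

import numpy as np

def model_1(x, p):
    # model 1: activation after a linear transformation, 2 parameters
    w, b = p
    return np.tanh(w * x + b)

def model_2(x, p):
    # model 2: tiny perceptron, two nodes feeding one output node,
    # 7 parameters = 4 weights + 3 biases
    w1, w2, w3, w4, b1, b2, b3 = p
    h1 = np.tanh(w1 * x + b1)
    h2 = np.tanh(w2 * x + b2)
    return w3 * h1 + w4 * h2 + b3

def model_3(x, p):
    # model 3: polynomial with (degree + 1) coefficients
    return np.polyval(p, x)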
Each model has parameters that must be estimated: the first model has two parameters, the second has 7, and the last has (degree+1). The Scipy functions are used to estimate these parameters for the best fit to the non-linear function. The following Scipy optimization functions have been used:
- curve_fit
- minimize
- brute
- basinhopping
- differential_evolution
- dual_annealing
For each function, two quantities were measured: running time and mean squared error. The estimated curve is also plotted for each function.
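One way to collect these two numbers, reusing the data and perceptron model from the sketches above (the helper name evaluate is mine):

import time
import numpy as np

def evaluate(optimize_once):
    # time a single optimizer run and report the MSE of the resulting fit
    start = time.time()
    p_est = optimize_once()          # any callable returning the estimated parameter vector
    elapsed = time.time() - start
    mse = np.mean((ft - model_2(t, p_est)) ** 2)
    return elapsed, mse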
1. curve_fit:
sample: curve_fit(opt_func, t, ft, p0=np.random.randn(40), maxfev=10000)
This function takes the function to be fitted to the data, the input and output values, an initial guess, and the maximum number of function evaluations. The number of variables that opt_func has to optimize is determined by the initial guess. In the sample above, for instance, 40 parameters are estimated, so the search space has 40 dimensions.
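As a hedged sketch, opt_func could be the polynomial model written in the variadic form that curve_fit expects; the 40-dimensional p0 then corresponds to a degree-39 polynomial, and t, ft come from the data sketch above:

import numpy as np
from scipy.optimize import curve_fit

def opt_func(x, *p):
    # curve_fit infers the number of parameters from the length of p0
    return np.polyval(p, x)

p_est, _ = curve_fit(opt_func, t, ft, p0=np.random.randn(40), maxfev=10000)
fit = opt_func(t, *p_est)            # fitted curve on the training inputs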
2. minimize
sample: minimize(mse_fit_func, x0=x_init, method='Nelder-Mead', tol=None, options={'disp': False, 'xatol': 1e-6, 'fatol': 1e-6,'maxiter': 3000, 'maxfev': 3000})
As the name suggests, this function minimizes another function. For curve fitting, it is therefore enough to define the cost metric to be minimized; once the cost metric is minimized, the result is a fitted function. Here, the metric is the mean squared error, so mse_fit_func is defined separately. The initial condition is chosen randomly, and its dimension equals the number of parameters to be estimated. There are many methods, some designed for specific applications, but Nelder-Mead is one of the best known. The other arguments set the stopping criteria.
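mse_fit_func is defined in the notebook; one plausible version, using the 7-parameter perceptron sketch and the training data from above, is:

import numpy as np

def mse_fit_func(p):
    # mean squared error between the training data and the perceptron model
    return np.mean((ft - model_2(t, p)) ** 2)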
We found that minimization is very sensitive to the initial condition. Hence, we wrapped it in a loop that draws random initial conditions; with a bit of luck, the solution or at least a good local minimum will be found, as sketched below. Interestingly, this approach works well in many situations and usually runs fast.
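A sketch of such a random-restart loop (the number of restarts is arbitrary):

import numpy as np
from scipy.optimize import minimize

best = None
for _ in range(20):                  # 20 random initial conditions
    x_init = np.random.randn(7)      # one value per model parameter
    res = minimize(mse_fit_func, x0=x_init, method='Nelder-Mead',
                   options={'xatol': 1e-6, 'fatol': 1e-6,
                            'maxiter': 3000, 'maxfev': 3000})
    if best is None or res.fun < best.fun:
        best = res                   # keep the lowest-cost solution found so far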
3. brute
sample: brute(mse_fit_func, rranges, full_output=True, finish=opt.fmin, workers=4)
Brute search is fine as long as the dimension of the problem stays low, say fewer than 4 or 5 variables. As the dimension grows, the running time increases exponentially and the method becomes infeasible for many applications. This function minimizes the objective over the given range of each variable; the ranges are provided as a sequence with one (min, max) range per variable and passed as the second argument.
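A sketch of how the ranges could look; since brute only scales to a few variables, it is shown here with a low-dimensional cost based on the 2-parameter model from the earlier sketch, and the range limits are assumptions:

import numpy as np
from scipy import optimize as opt

def mse_2param(p):
    # low-dimensional cost: MSE of the 2-parameter model
    return np.mean((ft - model_1(t, p)) ** 2)

rranges = ((-2, 2), (-2, 2))         # one (min, max) range per variable
x0, fval, grid, jout = opt.brute(mse_2param, rranges, Ns=50,
                                 full_output=True, finish=opt.fmin, workers=4)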
4. basinhopping
sample: basinhopping(mse_fit_func, x0=x_init, niter=500, niter_success=None, interval=30, T=0.1, stepsize=1, disp=False)
This method is used for minimization, but it does not only search locally. It has an argument that controls how often the minimizer jumps from one basin to another; by hopping between regions, there is a chance of landing in a good local minimum. In practice it needs several parameters, and tuning all of them is application dependent and sometimes exhausting. On the plus side, this method is fast.
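Using the sample call above with the cost function sketched earlier (T and stepsize in particular usually need per-problem tuning):

import numpy as np
from scipy.optimize import basinhopping

x_init = np.random.randn(7)          # random start, one value per parameter
res = basinhopping(mse_fit_func, x0=x_init, niter=500, niter_success=None,
                   interval=30, T=0.1, stepsize=1, disp=False)
p_est, best_cost = res.x, res.fun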
5. differential_evolution
sample: differential_evolution(mse_fit_func, bounds)
This method works like a simple evolutionary algorithm. It needs fewer arguments than the other methods: just the function to be minimized and the bounds of each variable. It is fast, and its accuracy is acceptable. It also has arguments for the stopping criteria, which may help improve accuracy.
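The bounds are simply one (min, max) pair per parameter; the limits and the stopping-criteria values below are assumptions:

from scipy.optimize import differential_evolution

bounds = [(-5, 5)] * 7               # one (min, max) bound per parameter
res = differential_evolution(mse_fit_func, bounds, tol=1e-7, maxiter=2000)
p_est, best_cost = res.x, res.fun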
6. dual_annealing
sample: dual_annealing(mse_fit_func, bounds, maxiter=1000)
This function works like the simulated annealing algorithm. It is fast and its accuracy is acceptable.
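With the same kind of bounds (again an assumption), the sample call looks like:

from scipy.optimize import dual_annealing

bounds = [(-5, 5)] * 7               # one (min, max) bound per parameter
res = dual_annealing(mse_fit_func, bounds, maxiter=1000)
p_est, best_cost = res.x, res.fun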
Some hints (these are not scientific facts, just personal observations):
- Don't use brute search; it is extremely slow and its accuracy is not acceptable.
- The minimize function with random initialization is very fast and works well in many situations.
- curve_fit is fast and works well when the activation function can follow the data closely enough.
- basinhopping results are no better than minimize with random restarts, while it runs slower. It also needs too many tuning parameters, which makes it unattractive.
- Both differential_evolution and dual_annealing belong to the family of stochastic optimizers; they are fast and usually find good results, especially when the activation function is flexible enough to track the data.