Cannot finish fitting model

Greetings!

I sent a request to my endpoint to fit a model on my dataset. It starts fitting but never finishes.

I have been waiting for more than an hour and it is still in progress. Is it normal that fitting takes this much time?

It seems like there is a mistake somewhere, but I cannot figure out where to look.

Comments

  • I created a getinfo.txt file. Maybe it helps.


  • I found some info in the documentation. Following this chapter, I managed to run fit locally. After calling:

    $ om shell

    I wanted to see whether I could do something with fit. So I called:

    In [1]: from sklearn.linear_model import LinearRegression
    
    In [2]: import pandas as pd
    
    In [3]: # train a linear regression model
       ...: df = pd.DataFrame(dict(x=range(10), y=range(20,30)))
       ...: clf = LinearRegression()
       ...: clf.fit(df[['x']], df[['y']])
    Out[3]: LinearRegression()
    
    In [4]: # store the trained model
       ...: om.models.put(clf, 'lrmodel')
    Out[4]: <Metadata: Metadata(name=lrmodel,bucket=omegaml,prefix=models/,kind=sklearn.joblib,created=2020-11-02 17:00:03.568000)>
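
    As a sanity check, the stored model can also be read back with om.models.get and used locally, e.g.:

    In [5]: clf2 = om.models.get('lrmodel')
       ...: clf2.predict([[5]])  # the data follow y = x + 20, so this should predict ~25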
    

    Then I found this chapter and did the following:

    In [1]: import pandas as pd
    
    In [2]: # create a dataframe and store it
       ...: df = pd.DataFrame(dict(x=range(10), y=range(20,30)))
       ...: om.datasets.put(df, 'test_lr_model')
    Out[2]: <Metadata: Metadata(name=test_lr_model,bucket=omegaml,prefix=data/,kind=pandas.dfrows,created=2020-11-02 17:39:55.660962)>
    
    In [3]: # use it to fit the model
       ...: result = om.runtime.model('lrmodel').fit('test_lr_model[x]', 'test_lr_model[y]')
    
    

    I waited for about 15 minutes but got no response.

    I interrupted the process and got these tracebacks:

    AttributeError: 'ChannelPromise' object has no attribute '__value__'
    ...
    
    gaierror: [Errno -2] Name or service not known
    ...
    
    OSError: failed to resolve broker hostname
    ...
    


  • edited November 2020

    Greetings!

    I fixed the issue.

    It was related to insufficient configuration in the docker-compose file. I changed it, rebuilt everything, and now it works.
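
    For reference, after editing the compose file the effective configuration can be verified and the stack rebuilt with the standard docker-compose commands, e.g.:

    # validate and print the effective compose configuration
    $ docker-compose config

    # rebuild the images and restart the services
    $ docker-compose up -d --build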

  • edited November 2020

    I have been waiting for more than an hour and it is still in progress. Is it normal that fitting takes this much time?

    omega|ml does not add substantial time to the fitting process of a model - essentially the fit process takes the same amount of time on the runtime as it does locally (it runs the same code).

    In general, the example you give should work out of the box. One potential problem could be a mismatch between the pandas or scikit-learn versions on the client where you submit the task and the versions in the runtime worker. However, that would typically raise an exception, not cause an endless wait.
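
    A quick way to check the client-side versions (to compare with the versions installed in the worker image):

    # print the client's pandas and scikit-learn versions
    $ python -c "import pandas, sklearn; print(pandas.__version__, sklearn.__version__)"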

    got these tracebacks:

    The tracebacks are from the networking stack. This looks like the runtime did not respond, indicating a problem either with the broker (rabbitmq) or in the runtime worker itself.
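
    To test the broker connection directly, here is a minimal sketch using kombu (celery's messaging library); the URL below is an assumption, substitute your OMEGA_BROKER value:

    from kombu import Connection
    # hypothetical broker URL -- replace with your OMEGA_BROKER setting
    with Connection('amqp://guest:guest@localhost:5672//', connect_timeout=5) as conn:
        conn.connect()  # raises an error (e.g. gaierror) if the broker is unreachable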

    From the getinfo logs I can see that you are running the open source stack in local docker-compose. This means you can peek into the worker logs directly. If the rabbitmq broker is working, the logs will show the reception of the omega_fit task message and any errors that result from it. If the broker is not working, they will show errors related to the broker connection.

    $ docker-compose logs worker
    

    To debug, a few more things to try:

    • run the runtime task locally:
    # fit the model using the runtime, but run it locally
    # -- this shows whether there is a problem with the model or the dataset:
    #    the model and the dataset are still retrieved from the database, but the execution
    #    uses the same runtime code as the worker, running in your local python process
    $ om shell
    In [1]: om.runtime.mode(local=True).model('lrmodel').fit('test_lr_model[x]', 'test_lr_model[y]')
    
    • check that the worker responds to ping ok
    # this should return a response in less than a second
    $ om runtime ping
    
    • monitor events flowing to the worker, check that the worker runs ok
    # show events flowing into celery
    $ celery -A omegaml.celeryapp events
    
    # show active workers
    $ celery -A omegaml.celeryapp inspect active_queues
    
    # show active tasks
    $ celery -A omegaml.celeryapp inspect active
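
    Note also that runtime calls like om.runtime.model('lrmodel').fit(...) return an asynchronous result; to avoid waiting indefinitely you can pass a timeout when collecting it, e.g.:

    # wait at most 60 seconds for the worker's reply;
    # raises an exception if no result arrives in time
    result = om.runtime.model('lrmodel').fit('test_lr_model[x]', 'test_lr_model[y]')
    result.get(timeout=60)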
    


  • edited November 2020

    Thank you for the response!

  • edited November 2020

    Glad you were able to fix it!

    It was related to insufficient configuration in the docker-compose file. I changed it, rebuilt everything, and now it works.

    Could you share an example of the config here, or perhaps file an issue at https://github.com/omegaml/omegaml/issues? We are very interested in improving resilience to such problems and in providing better guidance to users. Thank you!
