Cannot finish fitting model

Greetings!

I sent a request to my endpoint to fit a model on my dataset. It starts fitting but never finishes.

I have been waiting for more than an hour and it is still in progress. Is it normal that fitting takes this much time?

It seems like there is a mistake somewhere, but I cannot figure out where to look.

Comments

  • I created a getinfo.txt file. Maybe it helps.


  • I found some info in the documentation. Following this chapter, I managed to run fit locally. After calling:

    $ om shell

    I wanted to see whether I could do something with fit. So I called:

    In [1]: from sklearn.linear_model import LinearRegression
    
    In [2]: import pandas as pd
    
    In [3]: # train a linear regression model
       ...: df = pd.DataFrame(dict(x=range(10), y=range(20,30)))
       ...: clf = LinearRegression()
       ...: clf.fit(df[['x']], df[['y']])
    Out[3]: LinearRegression()
    
    In [4]: # store the trained model
       ...: om.models.put(clf, 'lrmodel')
    Out[4]: <Metadata: Metadata(name=lrmodel,bucket=omegaml,prefix=models/,kind=sklearn.joblib,created=2020-11-02 17:00:03.568000)>
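
    As a sanity check, the stored model can also be read back with om.models.get and used locally, e.g.:

    In [5]: clf2 = om.models.get('lrmodel')
       ...: clf2.predict([[5]])  # the data follow y = x + 20, so this should predict ~25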
    

    Then I found this chapter and did the following:

    In [1]: import pandas as pd
    
    In [2]: # create a dataframe and store it
       ...: df = pd.DataFrame(dict(x=range(10), y=range(20,30)))
       ...: om.datasets.put(df, 'test_lr_model')
    Out[2]: <Metadata: Metadata(name=test_lr_model,bucket=omegaml,prefix=data/,kind=pandas.dfrows,created=2020-11-02 17:39:55.660962)>
    
    In [3]: # use it to fit the model
       ...: result = om.runtime.model('lrmodel').fit('test_lr_model[x]', 'test_lr_model[y]')
    
    

    I waited for about 15 minutes but got no response.

    I interrupted the process and got these tracebacks:

    AttributeError: 'ChannelPromise' object has no attribute '__value__'
    ...
    
    gaierror: [Errno -2] Name or service not known
    ...
    
    OSError: failed to resolve broker hostname
    ...
    


  • edited November 2020

    Greetings!

    I fixed the issue.

    It was related to insufficient configuration in the docker-compose file. I changed it, rebuilt everything, and now it works.
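
    For reference, after editing the compose file the effective configuration can be verified and the stack rebuilt with the standard docker-compose commands, e.g.:

    # validate and print the effective compose configuration
    $ docker-compose config

    # rebuild the images and restart the services
    $ docker-compose up -d --build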

  • edited November 2020

    I have been waiting for more than an hour and it is still in progress. Is it normal that fitting takes this much time?

    omega|ml does not add substantial time to the fitting process of a model - essentially the fit process takes the same amount of time on the runtime as it does locally (it runs the same code).

    In general, the example you give should work out of the box. One potential problem could be a mismatch between the pandas or scikit-learn versions on the client where you submit the task and the versions in the runtime worker. However, that would typically raise an exception, not cause an endless wait.
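
    A quick way to check the client-side versions (to compare with the versions installed in the worker image):

    # print the client's pandas and scikit-learn versions
    $ python -c "import pandas, sklearn; print(pandas.__version__, sklearn.__version__)"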

    got these tracebacks:

    The tracebacks are from the networking stack. This looks like the runtime did not respond, indicating a problem either with the broker (rabbitmq) or in the runtime worker itself.
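
    To test the broker connection directly, here is a minimal sketch using kombu (celery's messaging library); the URL below is an assumption, substitute your OMEGA_BROKER value:

    from kombu import Connection
    # hypothetical broker URL -- replace with your OMEGA_BROKER setting
    with Connection('amqp://guest:guest@localhost:5672//', connect_timeout=5) as conn:
        conn.connect()  # raises an error (e.g. gaierror) if the broker is unreachable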

    From the getinfo logs I can see that you are running the open source stack in local docker-compose. This means you can peek into the worker logs directly. If the rabbitmq broker is working, the logs will show the reception of the omega_fit task message and any errors that result from it. If the broker is not working, they will show errors related to the broker connection.

    $ docker-compose logs worker
    

    To debug, a few more things to try:

    • run the runtime task locally:
    # fit the model using the runtime, but run it locally
    # -- this shows whether there is a problem with the model or the dataset:
    #    the model and the dataset are still retrieved from the database, but the execution
    #    uses the same runtime code as the worker, running in your local python process
    $ om shell
    In [1]: om.runtime.mode(local=True).model('lrmodel').fit('test_lr_model[x]', 'test_lr_model[y]')
    
    • check that the worker responds to ping ok
    # this should return a response in less than a second
    $ om runtime ping
    
    • monitor events flowing to the worker, check that the worker runs ok
    # show events flowing into celery
    $ celery -A omegaml.celeryapp events
    
    # show active workers
    $ celery -A omegaml.celeryapp inspect active_queues
    
    # show active tasks
    $ celery -A omegaml.celeryapp inspect active
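
    Note also that runtime calls like om.runtime.model('lrmodel').fit(...) return an asynchronous result; to avoid waiting indefinitely you can pass a timeout when collecting it, e.g.:

    # wait at most 60 seconds for the worker's reply;
    # raises an exception if no result arrives in time
    result = om.runtime.model('lrmodel').fit('test_lr_model[x]', 'test_lr_model[y]')
    result.get(timeout=60)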
    


  • edited November 2020

    Thank you for the response!

  • edited November 2020

    Glad you were able to fix it!

    It was related to insufficient configuration in the docker-compose file. I changed it, rebuilt everything, and now it works.

    Could you share an example of the config here, or perhaps file an issue at https://github.com/omegaml/omegaml/issues? We are very interested in improving resilience to such problems and in providing better guidance to users. Thank you!
