Generalised Linear Models: Binary data























I am currently working on a GLM problem.



My response variable is binary, as are some of my explanatory variables; others are categorical, e.g. 1 = 1 day, 2 = 2-3 days, 3 = 5+ days and so forth.
I have coded them as factors.



My question is: I have used the step function and I am left with a model containing many insignificant variables. In this case, do I simply drop these variables? If not, what do I do?
I also tried doing the model selection manually, using the anova function to test whether the differences in deviance were significant, and this gives an answer that is somewhat different from the automatic model selection. Is this to be expected?



How do I go about my model selection, and how can I test whether the functional form of my variables is correct?



Thanks for any help! :)
































  • You might want to do a search for related terms on stats.stackexchange.com
    – shadowtalker
    2 days ago















statistics statistical-inference binary logistic-regression
















asked 2 days ago









odesinit











1 Answer













Model selection is something of an art that draws on a range of statistical skills and analysis techniques. In general, if you have the right model form and select variables sensibly, the coefficients will be interpretable and the model will predict the target variable more accurately. You can check this by splitting the data into a training set, a validation set and a test set.
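As a rough illustration of the split mentioned above, here is a minimal sketch in plain Python. The function name `split_indices`, the 60/20/20 proportions and the fixed seed are assumptions made for the example, not something the answer prescribes.

```python
import random


def split_indices(n_rows, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle row indices and cut them into train / validation / test parts."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_rows))
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_train = round(fractions[0] * n_rows)
    n_valid = round(fractions[1] * n_rows)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever is left over
    return train, valid, test


train, valid, test = split_indices(100)
print(len(train), len(valid), len(test))
```

The model would be fitted on the training rows, candidate models compared on the validation rows, and the test rows kept aside for one final check.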



A GLM has the general form $g(\mathrm{E}[y_i])=\beta_0+\sum_{j=1}^{p}\beta_j x_{ij}$, where $g$ is the link function, so we focus mostly on choosing the right distribution for the random component $Y$ and on how best to transform the predictors.
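For the binary response in the question, the usual concrete case is logistic regression (the question is tagged logistic-regression, but the answer itself does not name a family, so treat this as an assumption). Writing $\pi_i$ for the probability that $y_i=1$ and $x_{ij}$ for the $j$-th of $k$ predictor columns, the model would read:

```latex
y_i \sim \mathrm{Bernoulli}(\pi_i), \qquad
\log\frac{\pi_i}{1-\pi_i} = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}
```

Here the random component is the Bernoulli distribution and the link function is the logit, so there is no additive error term on the right-hand side.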





Predictions will be more reliable if you have the right distribution for the target variable. For example, you can explore the distribution using the Tweedie family, whose power parameter selects the type of distribution: discrete (Poisson), continuous (Normal, Gamma) or mixed (compound Poisson). You can then apply shrinkage methods suited to each type of distribution.



For the predictors $X$, rather than simply removing an insignificant feature, first try to improve it by detecting anomalies or dropping outliers. A common approach is to plot the covariance (or correlation) matrix to see how strongly the features are related to one another. For continuous features you can examine boxplots and adjust the outlier threshold, and categorical features can be expanded into a matrix of dummy variables.



After that, fit the model and analyse the result. Use several diagnostics to see how well the features explain the target variable: R-squared and adjusted R-squared, p-values, ANOVA tests, likelihood-based criteria such as AIC, and so on. Use the validation set (or cross-validation) to improve the model.
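The criteria listed above do not always agree, which may explain why an AIC-based search and a manual deviance test give different models. The sketch below compares two nested models that differ by a single parameter; the deviances are made-up numbers, and the helper names are hypothetical. It relies on two standard facts: AIC charges a penalty of 2 per parameter, and for one degree of freedom the chi-square tail probability can be written with the complementary error function.

```python
import math


def lrt_p_value_1df(deviance_drop):
    """P(chi-square with 1 df > deviance_drop), via the normal tail."""
    return math.erfc(math.sqrt(deviance_drop / 2.0))


def compare_nested(deviance_without, deviance_with):
    """Compare two nested models that differ by exactly one parameter."""
    drop = deviance_without - deviance_with
    # AIC(with) - AIC(without): the extra parameter costs 2, the fit gains `drop`.
    aic_change = 2.0 - drop
    return {
        "deviance_drop": drop,
        "aic_prefers_larger": aic_change < 0,
        "lrt_p_value": lrt_p_value_1df(drop),
    }


# Made-up deviances, for illustration only.
result = compare_nested(deviance_without=412.6, deviance_with=409.8)
print(result)
```

With a deviance drop of about 2.8, AIC keeps the extra term (the drop exceeds 2) while the likelihood-ratio test gives a p-value of roughly 0.09, so the term looks insignificant at the 5% level. Any term whose deviance drop falls between 2 and about 3.84 behaves this way. In R, `step` selects by AIC by default, whereas `anova(..., test = "Chisq")` performs the deviance test, so some disagreement between the two is to be expected.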





Apply the results and the tests, then repeat the model selection steps until the model performs as expected.



My resources: Non-Life Insurance Pricing with Generalized Linear Models by Esbjörn Ohlsson and Björn Johansson, and other papers on specific topics.



I hope this is helpful.

























































        edited 2 days ago






























        answered 2 days ago









        AnNg































             







