Term Project, INLS 625 Spring 2019

For this project, I’m investigating bills from the 112th through 115th Congresses that passed the House. Data for the project were sourced from three places: ProPublica’s Congress API, govtrack.us, and voteview.com (for DW-NOMINATE scores). I collected and (for the most part) pre-processed the data in Python (see the script here), then turned to R to understand and analyze the data.

Descriptives

To begin, I’ve included the pre-processed data in tabular form below. Although I did most recoding in Python before importing the data into R, I did some additional preprocessing once the data were loaded, including parsing dates so they are recognized as dates, converting between nominal and numeric attributes, and normalizing the case of some strings.

require(tidyverse)
data <- read.csv("C:/Users/johnr/OneDrive/Spring 2019/INLS 625/Project/Processed with No Text.csv")
data <- subset(data, select = -c(1,7,9,11,17))   # drop unneeded columns
# Bills with no cosponsors from a party came through as NA; treat those as 0
data$cospons_r[is.na(data$cospons_r)] <- 0
data$cospons_d[is.na(data$cospons_d)] <- 0
data$cospons_i[is.na(data$cospons_i)] <- 0
data$introduced_date <- as.Date(data$introduced_date)   # parse dates as dates
data$primary_subject <- tolower(data$primary_subject)   # normalize string case...
data$primary_subject <- str_to_title(data$primary_subject, locale = "en")   # ...to Title Case
names(data) <- c("bill_id","bill_slug","bill_type","committees",
                 "cosponsors","introduced_date","primary_subject",
                 "sponsor_id","sponsor_name","sponsor_party",
                 "sponsor_state","sponsor_title",
                 "congress","dw_nom_1",
                 "dw_nom_2","sponsor_gender","sponsor_twitter",
                 "sponsor_leadership_role","sponsor_seniority",
                 "sponsor_party_loyalty","sponsor_district",
                 "sponsor_age","cosponsors_r","cosponsors_d",
                 "cosponsors_i","bill_len","bill_avg_word_len",
                 "bill_num_stopwords","bill_num_numerics",
                 "bill_num_usc_refs","result")
data$sponsor_party <- as.character(data$sponsor_party)
# Numeric party coding: Republican = -1, Independent = 0, Democratic = 1
data$sponsor_party_n[data$sponsor_party == "R"] <- -1
data$sponsor_party_n[data$sponsor_party == "I"] <- 0
data$sponsor_party_n[data$sponsor_party == "D"] <- 1
data$sponsor_gender_n <- as.numeric(data$sponsor_gender)
data$sponsor_leadership <- !(data$sponsor_leadership_role == "")   # TRUE if any leadership role
# Collapse the six result categories into a two-level outcome
data$result_simplified <- NA
data$result_simplified[as.character(data$result) %in% c("Became law","Passed; not law (e.g. CR)","Vetoed")] <- "Made it through"
data$result_simplified[as.character(data$result) %in% c("Went to senate","Other","Didn't leave Congress")] <- "Languished in Congress"
data

Uni- or Bi-variate Distributions

Next, I’ve provided a few visualizations so we can get to know the data. First: what do the bills deal with? Each bill was assigned a primary subject, using categories established by the Congressional Research Service. Below, you can see that certain categories of bills are far more common than others. Bills in just two categories - those dealing with Congress and with “Government Operations and Politics” - make up over 30% of bills that passed the House in the 112th - 115th Congresses. Those categories cover general government oversight, operations, administration, elections, and ethics.

require(questionr)
# freq() tabulates primary_subject; keep one copy with subject levels ordered by
# frequency so the bars plot from least to most common
subject_data_ordered <- subject_data <- freq(subset(data, select = c(7))$primary_subject)
subject_data_ordered$subject <- subject_data$subject <- rownames(subject_data)
subject_data_ordered$subject <- factor(subject_data$subject, levels = subject_data[order(subject_data$n), "subject"])
ggplot(subject_data_ordered, aes(x = subject, y = n)) +
  geom_bar(stat = "identity", fill = "black") +
  coord_flip() +
  labs(title = "    Figure 1: Frequencies of Bill Primary Subjects",
       caption = "Categorized into 32 bins used by the Congressional Research Service,\nmore information at https://www.congress.gov/help/field-values/policy-area\n",
       y = "", x = "")

Next: who is putting these bills forward? I collected a variety of characteristics about bill sponsors, which I review below. First, in the figure below, the sponsors’ ages, seniority ranks, and ideologies are plotted. Seniority measures the number of years a member has served. dw_nom_1 and dw_nom_2 measure member ideology and are calculated from roll-call vote records; the first dimension captures the member’s position on government intervention in the economy, and the second captures the member’s position on the salient social issues of the day, e.g. slavery in the early-to-mid 19th century and LGBTQ rights today.

As the figure shows, sponsors of bills in the relevant timeframe are polarized on the first (economic) dimension but generally share similar positions on the second (social) dimension. They also tend to be older; while the overall US population’s age distribution is bimodal (with peaks near ~30 and ~60), the younger population is underrepresented in this sample. That is to be expected, though, as Congress as a whole is older than the US population.

subset(data, select = c(14,15,19,22)) %>%
  gather() %>%                             # Convert to key-value pairs
  ggplot(aes(value)) +                     
    facet_wrap(~ key, scales = "free") +   
    geom_density() +
    theme_bw() +
    labs(title = "Figure 2: Densities of Selected Continuous Attributes")

Next, I plotted a few discrete characteristics of the data; as the figure below shows, the number of bills that passed the House steadily increased from the 112th through the 115th Congress. These bills were overwhelmingly sponsored by men, and overwhelmingly sponsored by Republicans. The gender disparity is to be expected given the overall gender disparity in the House, and it may be amplified here by the party disparity, as the Republican House caucus is more heavily male than its Democratic counterpart.

The disparity in sponsorships by party is further explored in the second figure below, which shows that the ratio of Democratic- to Republican-sponsored bills was roughly constant across the four Congresses at hand. This stark imbalance in sponsorships by party is to be expected; the Republican Party held the majority in the House for each of these four Congresses. That consistency makes comparisons within and among the four Congresses more sound; if the majority and leadership had swapped partway through, many facets might have changed as a result. However, it should also be noted as a limit on the generalizability of any findings from this project: they only apply to Republican-held Houses, and really only to these specific Congresses, as so much in politics depends on temporal context.

subset(data, select = c(13,16,10)) %>%
  gather() %>%                             # Convert to key-value pairs
  ggplot(aes(value)) +  
    facet_wrap(~ key, scales = "free") +
    geom_bar(fill = "black") +
    theme_bw() +
    labs(title = "Figure 3: Distributions of Selected Categorical Attributes")

summarise(group_by(data, congress, sponsor_party), bill_count = n()) %>%
  ggplot(aes(fill = sponsor_party, x = congress, y = bill_count)) +
    geom_bar(stat = "identity", position = "fill") +
    scale_fill_manual(name = "Sponsor Party",
                      values = c("#3487BD","black","#D63E50")) +
    labs(title = "Figure 4: Bills Sponsored by Congress and Sponsor Party",
         y = "Proportion of Bills Sponsored\n",
         x = "Congress") 

Finally, I examined the characteristics of the bill texts themselves, which I calculated through text processing in Python. The distributions of those characteristics are plotted below, with logarithmic scales on the x axes due to the extreme right skew of the distributions. This makes the bill_avg_word_len panel slightly unorthodox, but for speed of coding, I used a simple facet_wrap() call that applies the same scale type to all panels - a decent compromise, since you can still understand what the bill_avg_word_len panel is getting across. (An alternative is sketched after the plot code below.)

In order, the characteristics plotted below are:

  • Average word length in the bill
  • Total bill text length
  • Number of numbers in the bill (e.g. “4670” counts as one number)
  • Number of English stopwords in the bill
  • Number of references to the US Code in the bill (“40 USC 5670”, “section 15 of title 40 of United States Code”)
subset(data, select = c(26:30)) %>%
  gather() %>%                             # Convert to key-value pairs
  ggplot(aes(value)) +                     # Plot the values
    facet_wrap(~ key, scales = "free") +   # In separate panels
    geom_density()  +                      # as density
    scale_x_log10() +
    theme_bw() +
    labs(title = "Figure 5: Densities of Selected Characteristics: Bill Text")

Correlations

With an understanding of each variable itself, I next turned to understanding the relationships between variables. In the figure below, I have plotted the correlation between each pair of variables in the dataset; the size and color of each circle represent the magnitude and direction of the correlation between the relevant two variables. The deeper the color and the larger the circle, the stronger the relationship; if the circle is blue, the correlation is positive/direct, and if the circle is red, the correlation is negative/inverse.

Several top-line conclusions can be drawn from this figure:

  • The cosponsors* variables, except cosponsors_i, are strongly related to each other - even cosponsors_d and cosponsors_r are strongly, positively correlated. As the total number of cosponsors increases, the number of cosponsors from each party also generally increases. The near-zero correlation of cosponsors_i with the other cosponsors* variables is likely because there are barely any independent cosponsors in the dataset at all (54 cosponsorships out of more than 75,000 total; verified in the snippet after this list).
  • Sponsor age and seniority are positively correlated. This makes sense; older people are more likely to have spent longer in the House.
  • All but one of the bill-text attributes are also strongly, positively correlated; the one outlier is average word length. This makes sense - each of the other variables (bill_num_stopwords, bill_num_usc_refs, and bill_num_numerics) is a raw count, which should increase as the overall bill text (bill_len) increases in length. Average word length, however, should have little relationship to how long the document is.
  • Sponsor party is strongly, negatively correlated with the economic ideological dimension (dw_nom_1). A decrease in dw_nom_1 is associated with supporting greater government intervention in the economy, i.e. supporting traditionally Democratic proposals, while an increase in sponsor_party_n represents moving toward the Democratic end of the coding (Republican = -1, Independent = 0, Democratic = 1). Therefore, this inverse relationship makes sense.
  • There are several other interesting relationships involving partisanship/ideology:
    • Sponsor party is negatively correlated with sponsor gender; because an “increase” in gender means going from female to male and a “decrease” in party means going from Democratic to Independent to Republican, this negative correlation suggests that being male is frequently paired with being Republican, and being female is frequently paired with being Democratic.
    • Economic ideology is negatively correlated with sponsor age and seniority. Older, more-senior members are predicted to have lower dw_nom_1 scores, i.e. to prefer more government intervention in the economy. This contrasts with the pattern in the wider US population, where getting older usually predicts more conservative/libertarian economic positions.
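The cosponsor counts cited in the first bullet are easy to verify directly:

# Verifying the cosponsor totals cited above
sum(data$cosponsors_i)    # independent cosponsorships (54 per the text)
sum(data$cosponsors)      # total cosponsorships across all bills (75,000+ per the text)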
# Include every numeric attribute discussed above (per-party cosponsor counts, sponsor age, etc.)
cor.mat <- cor(subset(data, select = c(5,13:15,19,22:30,32:33)))
cor.mat.rounded <- round(cor.mat, 2)
require(corrplot)
# corrplot maps the first palette color to -1, so the ramp runs red (negative)
# through white to blue (positive), matching the color key described above
corrplot(cor.mat.rounded, type = "lower", number.cex = .7, order = "AOE", tl.cex = 0.8, tl.srt = .01, tl.col = "black", title = "Figure 6: Matrix of Correlations Between Numeric Attributes", col = colorRampPalette(c("#D63E50","white","#3487BD"))(200))

Target: Bill Fates

The final attribute in the data, which will prove crucial in later analysis, is the ultimate fate of each bill: did it make it through Congress, get signed by the President, and become law? Of course, reality is a bit more complex than that, and there are more possible outcomes than either dying in the House or fully becoming law. These are the six outcomes into which I recoded the bills, listed in descending order of frequency:

  • Went to Senate (n = 1723): passed the House (as did all bills in this dataset) and went to the Senate, but no further
  • Became law (n = 1212): the simplest category
  • Didn’t leave Congress (n = 792): mostly died in some committee
  • Other (n = 193): truly a grab-bag; e.g. a bill that entered SCOTUS litigation
  • Passed, but are not law (n = 33): bills that passed Congress but did not become law, such as continuing resolutions
  • Vetoed (n = 4): self-explanatory; a very small category, but qualitatively very distinct and worth keeping separate

In addition to this six-level categorical variable, I coded a dichotomous version, result_simplified, which combines the six categories listed above into the following two:

  • Made it through (n = 1249): composed of Became law, Passed, but are not law, and Vetoed
  • Languished in Congress (n = 2708): composed of Went to Senate, Didn’t leave Congress, and Other

Clearly, there are distinct differences between bills that became law and those that were vetoed, but in this simple dichotomous split, the four bills that were vetoed are more like the other bills that also made it all the way through Congress than like those that never made it out at all.
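As a quick sanity check that the recode matches the counts above:

# Cross-tabulate the detailed and simplified outcomes; each detailed category
# should fall entirely within one simplified column, with the counts listed above
table(data$result, data$result_simplified)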

In the figure below, I show the breakdown of bill fates within each of the 32 primary subject categories I reviewed above.

summarise(group_by(data, result, primary_subject), bill_count = n())  %>%
  ggplot(aes(fill = result, x = primary_subject, y = bill_count)) +
    geom_bar(stat = "identity", position = "fill") +
    coord_flip() +
    scale_fill_brewer(palette = "Spectral", guide = guide_legend(reverse = TRUE), name = "Result") +
    labs(y = "Percent of Bills within Subject Area",
         title = "Figure 7: Fates of Bills that Passed the House\nin the 112th - 115th Congresses",
         x = "") +
    theme(plot.title = element_text(hjust = 0.5))

write.csv(data, "C:/Users/johnr/OneDrive/Spring 2019/INLS 625/Project/R_Processed_Data.csv")

Analytics

With a firm understanding of the data, I could now turn to attempting predictions. For that, I turned to KNIME; I attempted to use both R and Weka at different points, but encountered more obstacles with both of those platforms’ machine learning tools than with KNIME’s.

k-Means Clustering

With the data I have, both supervised and unsupervised learning methods could yield interesting results. First, I undertook unsupervised learning, specifically clustering, as I had seen (as reviewed above) that there were certain groupings in the data that might form nice clusters. For that cluster analysis, I used k-Means clustering in KNIME. To prep the data for that algorithm, several steps had to be taken:

  • Completely non-informative columns (i.e. those with unique values for each instance) were dropped (but they can be re-joined here).
  • Three string columns were recoded to boolean columns:
    • sponsor_party => sponsor_democrat
    • sponsor_gender => sponsor_female
    • sponsor_leadership_role => sponsor_leadership
  • Almost perfectly collinear columns were dropped (determination based mostly on the correlation matrix above):
    • cosponsors_d, cosponsors_r, and cosponsors_i were dropped, leaving cosponsors
    • bill_num_stopwords and bill_num_numerics were dropped, leaving bill_len and bill_num_usc_refs
  • Rows with missing values were dropped.

In KNIME, I specified that I wanted 6 clusters to be identified; I chose this number because it’s the number of unique values in my result variable, and I thought it might be interesting to see whether clusters appear that resemble (or predict) the ultimate fates of the bills. (I undertake that task more directly with the supervised learning models below.)
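To make the prep steps above concrete, here is a rough R equivalent of the prep-plus-clustering pipeline (a sketch, not the exact KNIME workflow; the 12 attributes are my reconstruction from the steps listed above):

km_input <- data %>%
  mutate(sponsor_democrat   = (sponsor_party == "D"),
         sponsor_female     = (sponsor_gender == "F"),    # assumes gender is coded "F"/"M"
         sponsor_leadership = (sponsor_leadership_role != "")) %>%
  select(cosponsors, congress, dw_nom_1, dw_nom_2, sponsor_seniority,
         sponsor_party_loyalty, sponsor_age, bill_len, bill_num_usc_refs,
         sponsor_democrat, sponsor_female, sponsor_leadership) %>%
  drop_na()                                               # drop rows with missing values

set.seed(625)                                             # arbitrary seed, for reproducibility
km_fit <- kmeans(scale(data.matrix(km_input)), centers = 6)
table(km_fit$cluster)                                     # cluster sizes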

After performing the cluster analysis in KNIME, I ported the data back to R to make tables and plots. The following table contains each cluster’s mean value on each attribute used in the analysis; together, the rows represent the six centroids of the clusters in 12-dimensional space. From this table, we can see that the clusters differ more on some attributes than on others. For example, there is a lot of variation between the clusters on bill_len, but not much on congress.

km_1_clusters <- read.csv("C:/Users/johnr/OneDrive/Spring 2019/INLS 625/Project/K-Means-Clusters1.csv")
km_1_clusters

In fact, when some of the most informative attributes from that cluster analysis are plotted below, it becomes clear that bill_len is an incredibly strong driver of the clustering.

require(plotly)
km_1_data <- read.csv("C:/Users/johnr/OneDrive/Spring 2019/INLS 625/Project/K-Means-Output1.csv")
plot_ly(km_1_data, type = "scatter3d", mode = "markers", x = ~bill_len, y = ~dw_nom_2,
        z = ~sponsor_seniority, color = ~Cluster, hoverinfo = 'text', text = ~row.ID, colors = "Spectral") %>%
  layout(title = "Figure 8: k-Means Clustered Bills Passed in the House,\n112th - 115th Congresses")

So, what happens if we drop that attribute (and bill_num_usc_refs, which is highly collinear with it)?

km_2_clusters <- read.csv("C:/Users/johnr/OneDrive/Spring 2019/INLS 625/Project/K-Means-Clusters2.csv")
km_2_clusters

Many of the same variables are informative, though now it’s a bit easier to tell which ones drive the clustering. In particular, the number of cosponsors, the sponsor’s ideology, and the sponsor’s seniority/leadership status are major drivers. In the plot below, we see that the number of cosponsors is a “primary” driver, distinguishing bills with many cosponsors from those without. Within bills with few cosponsors, dw_nom_1 (the sponsor’s economic ideology) divides the data into two groups, but neither it nor dw_nom_2 (the sponsor’s social positions) yields clear clusters.

km_2_data <- read.csv("C:/Users/johnr/OneDrive/Spring 2019/INLS 625/Project/K-Means-Output2.csv")
plot_ly(km_2_data, type = "scatter3d", mode = "markers", x = ~cosponsors, y = ~dw_nom_1,
        z = ~dw_nom_2, color = ~Cluster, hoverinfo = 'text', text = ~row.ID, colors = "Spectral") %>%
  layout(title = "Figure 9: k-Means Clustered Bills Passed in the House,\n112th - 115th Congresses (2)")

If we limit the plot to only those three clusters (those with low cosponsor numbers), we can easily see how the final clusters are distinguished: mostly along sponsor age, seniority, and social positions. One cluster is made up of bills sponsored by “younger” members (those 55 and under); the remaining two clusters are both made up of bills sponsored by older members, and differ on member seniority.

filter(km_2_data, Cluster %in% c("cluster_0", "cluster_3", "cluster_4")) %>%
  plot_ly(type = "scatter3d", mode = "markers", x = ~sponsor_seniority, y = ~sponsor_age, z = ~dw_nom_2, 
          color = ~Cluster, hoverinfo = "text", text = ~row.ID, colors = "Spectral") %>%
  layout(title = "Figure 10: k-Means Clustered Bills Passed in the House,\n112th - 115th Congresses (3)")

This is all very interesting, but what we’re really trying to do here is predict what’ll happen to a bill in the long run; do these six clusters have anything to do with our target variable?

km_2_data %>%
  group_by(result, Cluster) %>%
    summarise(count_em = n()) %>%
      ggplot(aes(fill = result, x = Cluster, y = count_em)) +
        geom_bar(stat = "identity", position = "fill") +
        scale_fill_brewer(palette = "Spectral") +
        labs(x = "",
             y = "Proportion of Bills within Cluster\n",
             title = "Figure 11: Bill Fates by k-Means Cluster")

Nope. Sad!

Random Forest DT Modeling

The next models I developed for these data were supervised: specifically, random forest decision trees, naïve Bayesian modelling, and logistic regression. Hopefully, these will be better at achieving my end goal.

I again built most of these models in KNIME. In contrast to the additional data processing that was necessary before the cluster analysis above, no changes were made to the data between export from R, import into KNIME, and running of the models. For this model set, I asked KNIME to predict the six-level result column - the more detailed fates of bills in the data set.

I built two RandomForest node chains in KNIME: a primary one based on a simple 85/15 partition and a secondary one with an X-Partitioner/X-Aggregator loop for 10-fold CV. The model I discuss below is the one developed with the simple 85/15 partition, as that chain produced more detailed output: not only the consensus prediction, but also the number of models that agreed on that prediction and the proportion of models that predicted each possible outcome for each instance (think probabilistic modeling). I include that data, with those columns, below. I’ve also coded a match column, simplifying the six-level prediction column into a two-level boolean column.
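For reference, here is a rough R analogue of that primary 85/15 chain - a sketch using the randomForest package, which is not what KNIME runs internally, and with a predictor list that is my best reconstruction (100 trees matches the 100 individual models noted below):

require(randomForest)
rf_df <- na.omit(subset(data, select = c("result","cosponsors","primary_subject",
                                         "sponsor_party","sponsor_gender","congress",
                                         "dw_nom_1","dw_nom_2","sponsor_age",
                                         "bill_len","bill_num_usc_refs")))
rf_df$primary_subject <- factor(rf_df$primary_subject)   # randomForest needs factors,
rf_df$sponsor_party   <- factor(rf_df$sponsor_party)     # not character columns
set.seed(625)                                            # arbitrary seed
train_idx <- sample(nrow(rf_df), size = floor(0.85 * nrow(rf_df)))
rf_model  <- randomForest(result ~ ., data = rf_df[train_idx, ], ntree = 100)
rf_preds  <- predict(rf_model, newdata = rf_df[-train_idx, ])
mean(rf_preds == rf_df$result[-train_idx])               # overall test accuracy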

rf_1_data <- read.csv("C:/Users/johnr/OneDrive/Spring 2019/INLS 625/Project/RF_Data1.csv")
# match is TRUE when the RF consensus (out-of-bag) prediction equals the true result
rf_1_data['match'] <- (as.character(rf_1_data$result) == as.character(rf_1_data$result..Out.of.bag.))

rf_1_data
rf_1_acc_stats <- read.csv("C:/Users/johnr/OneDrive/Spring 2019/INLS 625/Project/RF_AccStats1.csv")

Overall, the partitioned-RF model set had an accuracy of 75.42%, i.e. an error rate of 24.58%. Running the 10-fold CV yielded a similar average error rate of 23.35%, confirming the stability of the RF models. The models’ performance differed by result category: RandomForest was generally better at predicting the Passed, but not law and Went to Senate results and not as good at predicting the Became law and Didn’t leave Congress results. I base that conclusion on the F-scores for each result category, as reported by KNIME and included below. Stats for the Vetoed and Other categories were not reported.

rf_1_acc_stats

The RandomForest model’s performance can also be visualized graphically, in addition to through tables. Below, I’ve graphed a simpler version of the stats from the table above: for each category, I calculated the percentage of bills for which the RF model’s predicted category (the most common prediction from the 100 individual models) matched the true value. This shows the same broad conclusions drawn above; Went to Senate and Didn’t leave Congress had the greatest within-group accuracy, Other and Became law were predicted correctly less often - though still over 50% of the time - and the remaining two categories were predicted incorrectly 100% of the time, almost certainly due to their relatively minuscule sample sizes.

summarise(group_by(rf_1_data, result, match), bill_count = n()) %>%
  ggplot(aes(fill = match, x = result, y = bill_count)) +
    geom_bar(stat = "identity", position = "fill") +
    coord_flip() +
    scale_fill_manual(guide = guide_legend(reverse = TRUE),
                      name = "Prediction",
                      labels = c("Incorrect","Correct"),
                      values = c("#D63E50","#3487BD")) +
    labs(y = "Proportion of Bills within Result Category",
         title = "Figure 12: RandomForest Performance By Bill Fate,\nBills that Passed the House in the 112th - 115th Congresses",
         x = "") +
    theme(plot.title = element_text(hjust = 0.5))

Another way to visualize the model’s performance is to examine its accuracy by primary subject area: was the model particularly better or worse at predicting the result of bills in certain subjects? The answer to that question is plotted below. Overall, the model performed relatively similarly across almost all subject areas. There are three outlier categories in which the model was either 100% correct or 100% incorrect in its predictions:

  • 100% Correct: Private Legislation and Civil Rights and Liberties, Minority Issues
  • 100% Incorrect: Social Sciences and History

An additional outlier, Arts, Culture, and Religion, was not 100% one way or the other, but it had markedly lower accuracy than the remaining subject areas.

As we saw in Figure 1, each of these four subject areas is among the least frequent (together, they make up 4 of the bottom 6 categories by frequency). As such, it again follows that the model would perform worse in these categories, simply due to the small sample sizes it had to work with.

summarise(group_by(rf_1_data, primary_subject, match), bill_count = n()) %>%
  ggplot(aes(fill = match, x = primary_subject, y = bill_count)) +
    geom_bar(stat = "identity", position = "fill") +
    coord_flip() +
    scale_fill_manual(guide = guide_legend(reverse = TRUE),
                      name = "Prediction",
                      labels = c("Incorrect","Correct"),
                      values = c("#D63E50","#3487BD")) +
    labs(y = "Proportion of Bills within Subject Area",
         title = "Figure 13: RandomForest Performance By Subject Area,\nBills that Passed the House in the 112th - 115th Congresses",
         x = "") +
    theme(plot.title = element_text(hjust = 0.5))

Naïve Bayesian Modelling (NB)

For this next model type, I again turned to KNIME. Again building the model with an 85/15 random split into training and test data, I applied a Naïve Bayes learner with result as the target column. This model is markedly less accurate than the RandomForest model, with test accuracy of only around 50% (48.65%). Similarly to the RandomForest model set, the NB model performed differently by result value; it was best at predicting the Went to Senate result and markedly worse at predicting all other categories. More detailed accuracy statistics are below. Because of this pretty horrible performance, I won’t go any further into the NB model here.

nb_acc_stats_1 <- read.csv("C:/Users/johnr/OneDrive/Spring 2019/INLS 625/Project/NB_AccStats1.csv")
nb_acc_stats_1
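That said, for completeness, here is a rough R analogue of the KNIME Naïve Bayes run (a sketch using e1071::naiveBayes, which is an assumption - KNIME’s learner differs in its details). It reuses rf_df and train_idx from the RandomForest sketch above:

require(e1071)
nb_model <- naiveBayes(result ~ ., data = rf_df[train_idx, ])   # same 85/15 split as before
nb_preds <- predict(nb_model, newdata = rf_df[-train_idx, ])
mean(nb_preds == rf_df$result[-train_idx])                      # overall test accuracy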

Logistic Regression

# Dichotomous outcome for the logit model: 1 = made it through, 0 = languished
data$result_simplified <- as.factor(data$result_simplified)
data$made_through[data$result_simplified == "Made it through"] <- 1
data$made_through[data$result_simplified == "Languished in Congress"] <- 0

# Set Congress (the most frequent subject) as the reference category
data$primary_subject <- as.factor(data$primary_subject)
data <- within(data, primary_subject <- relevel(primary_subject, ref = 8))

data$sponsor_gender <- as.factor(data$sponsor_gender)
data$sponsor_party <- as.factor(data$sponsor_party)
data$sponsor_leadership <- as.factor(data$sponsor_leadership)
data$congress <- as.numeric(data$congress)

# Binomial family with the (default) logit link, i.e. logistic regression
logit_reg <- glm(made_through ~ cosponsors + primary_subject + sponsor_party +  sponsor_gender + congress + dw_nom_1 + dw_nom_2 + sponsor_age + bill_len + bill_num_usc_refs + sponsor_leadership, data = data, family = "binomial")

Finally, I performed a logistic regression on the data. For this method, I had to use made_through as the outcome: I attempted to build a multinomial logit model so that I could regress on the result outcome, but encountered errors with R’s mlogit package. I first attempted to do this section in KNIME, as I did for the previous sections, but KNIME’s logistic regression node would never converge, no matter what data I fed it or whether I standardized the numeric attributes. In R, I was able to get the model to converge.
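For reference, a multinomial fit could also be attempted with nnet::multinom instead of mlogit (a sketch only - an assumption on my part, and not a route validated here):

require(nnet)
# Multinomial logit over the full six-level result outcome
multi_reg <- multinom(result ~ cosponsors + primary_subject + sponsor_party +
                        sponsor_gender + congress + dw_nom_1 + dw_nom_2 +
                        sponsor_age + bill_len + bill_num_usc_refs + sponsor_leadership,
                      data = data, maxit = 500)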

In the logit model, I included all of the variables we’ve seen to be at all predictive or correlated, in the correlation matrix and in the other analytical methods above. I used a binomial distribution with a logit link to estimate a logit model. The formula I used is: made_through ~ cosponsors + primary_subject + sponsor_party + sponsor_gender + congress + dw_nom_1 + dw_nom_2 + sponsor_age + bill_len + bill_num_usc_refs + sponsor_leadership.

In the model that resulted, some variables were highly predictive of the outcome; the table below lists each variable in the regression whose effect was significant to at least the p < 0.05 level, along with its corresponding Odds Ratio. Two notes: first, some variable names are poorly formatted - these are the dummy variables R creates when a categorical (factor) variable, such as primary_subject, is used in a regression. Second, an Odds Ratio is somewhat complicated to interpret (though less so than a raw logit coefficient) - essentially, an Odds Ratio greater than 1 means the predictor is directly correlated with the outcome, and an Odds Ratio less than 1 means the predictor is inversely correlated with the outcome.

Lastly, it’s important for interpretation to understand how the outcome is coded; in this case, the outcome is made_through, where 1 corresponds to “Made it through” in result_simplified and 0 corresponds to “Languished in Congress”. Therefore, when the Odds Ratios are interpreted, an increase in the outcome means an increase in the likelihood of getting through Congress.
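As a quick worked example of that interpretation (assuming the roughly-1.028 per-year Odds Ratio on sponsor_age discussed further below): each additional year of sponsor age multiplies the odds of making it through by about 1.028, so a ten-year age difference compounds to roughly a one-third increase in those odds.

1.028^10    # ~1.32: a decade of sponsor age raises the odds by about a third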

# Pull the coefficient table, compute Odds Ratios (exponentiated coefficients),
# and keep only the predictors significant at p < 0.05
logit_reg_coef_info <- as.data.frame(summary(logit_reg)$coefficients)
logit_reg_coef_info['Var'] <- rownames(logit_reg_coef_info)

ors <- as.data.frame(exp(coef(logit_reg)))
ors['Var'] <- rownames(ors)
signif_preds_0.05 <- logit_reg_coef_info[logit_reg_coef_info$`Pr(>|z|)` < 0.05,c(1,5)]
signif_preds_0.05_with_ors <- merge(signif_preds_0.05, ors, by = "Var")
names(signif_preds_0.05_with_ors) <- c("Predictor","Estimate","Odds Ratio")
signif_preds_0.05_with_ors$`Odds Ratio` <- as.character(signif_preds_0.05_with_ors$`Odds Ratio`)

subset(signif_preds_0.05_with_ors, select = c(1,3))

As we can see from the table above, a bill being in certain subject areas greatly increases the likelihood that it will make it all the way through Congress. In the regression, the base (reference) category in primary_subject was Congress, the most frequent category. Therefore, these >1 Odds Ratios mean that the odds that bills in those categories make it through Congress are greater than the odds that a bill in the Congress category does the same. This, in a pattern similar to the very infrequent categories in previous models, may be due to the fact that the Congress category is very frequent; with more bills in the area, there may simply be more bills “cluttering” the category that didn’t and were never going to make it through.

The remaining significant predictors in this regression are congress, sponsor_age, and sponsor_partyR. Although sponsor_age is highly significant, with a p-value of 1.95e-13 (!!), its effect is practically fairly small: a sponsor being one year older increases the odds that a bill makes it through Congress by only about 2.8%. In contrast, a bill having been sponsored by a Republican halved the odds of it making its way through Congress, compared to bills sponsored by Democrats (Democrats being the reference category in sponsor_party). Finally, bills became less likely to make it all the way through Congress as time passed from the 112th to the 115th Congress. These effects, for both congress and sponsor_party, may be influenced by simple differences in sample sizes between values; there were more bills in each successive Congress, yet there may be some limit on the number of bills that can make it all the way through, so the “likelihood” that any given bill gets all the way through decreases simply because there are more bills present. The same may be true for Republican-sponsored bills; as seen in Figure 4 above, there are over 3 times as many Republican-sponsored bills as Democratic-sponsored bills, which may lead to a similar phenomenon as with the Congresses.

One additional way to depict this model’s results is through the fitted values; the plot below compares the proportion of bills that made it through Congress (i.e. outcome == 1 in the logit regression) in reality versus in the regression’s predictions. As the figure shows, the model’s predictions follow a similar pattern to reality, though the model predicts that more bills get through across the board.

data['logit_fit_val'] <- fitted(logit_reg)
# Note the direction: logit_pred is TRUE when the fitted probability is below 0.5,
# i.e. when the model predicts the bill languished; (1 - mean(logit_pred)) in the
# summaries below therefore gives the predicted proportion that made it through
data['logit_pred'] <- (data$logit_fit_val < 0.5)

data$congress <- as.factor(data$congress)

require(reshape2)
summarise(group_by(data, congress), mean_real = mean(made_through), mean_pred = (1-mean(logit_pred))) %>%
  melt(id.vars = c("congress")) %>%
    ggplot(aes(x = congress, y = value, fill = variable)) +
      geom_bar(position = "dodge", stat = "identity") +
      scale_fill_manual(name = "mean(made_through)",
                        labels = c("True","Predicted"),
                        values = c("#3487BD","#D63E50")) +
      labs(title = "Figure 14: Real vs. Fitted Means of Target Variable\nby Congress, Logit Regression",
           x = "Congress", y = "")

Similarly, the logit model correctly predicts that a smaller proportion* of Republican bills made it through Congress than Democratic bills, though (in contrast to the previous figure) it underestimated the means across the board.

* It’s important to note that these are proportions; in raw numbers, far more Republican bills made it through than Democratic ones. But the underlying reason for that phenomenon - it being a Republican-controlled Congress - also underlies this one: Republicans simply proposed so many bills that they couldn’t keep up with all of them and get them all through.

filter(data, sponsor_party != "I") %>%
  group_by(sponsor_party) %>%
    summarise(mean_real = mean(made_through), mean_pred = (1-mean(logit_pred))) %>%
      melt(id.vars = c("sponsor_party")) %>%
        ggplot(aes(x = sponsor_party, y = value, fill = variable)) +
          geom_bar(position = "dodge", stat = "identity") +
          scale_fill_manual(name = "mean(made_through)",
                            labels = c("True","Predicted"),
                            values = c("#3487BD","#D63E50")) +
          labs(title = "Figure 15:  Real vs. Fitted Means of Target Variable\nby Sponsor Party, Logit Regression",
               x = "Sponsor Party", y = "")

Conclusions

  • Best model: RandomForest or Logit Regression
  • Main limitation: so much about politics can’t be captured (see the intercept from the logit regression)
  • Going forward: