Take-home_Ex03

Author

LIANG YAO

Published

June 17, 2023

Modified

June 17, 2023

1. Task and Questions:

Objectives:

FishEye International, a non-profit focused on countering illegal, unreported, and unregulated (IUU) fishing, has been given access to an international finance corporation’s database on fishing related companies. In the past, FishEye has determined that companies with anomalous structures are far more likely to be involved in IUU (or other “fishy” business). FishEye has transformed the database into a knowledge graph. It includes information about companies, owners, workers, and financial status. FishEye is aiming to use this graph to identify anomalies that could indicate a company is involved in IUU.

FishEye analysts have attempted to use traditional node-link visualizations and standard graph analyses, but these were found to be ineffective because the scale and detail in the data can obscure a business’s true structure. Can you help FishEye develop a new visual analytics approach to better understand fishing business anomalies?

Questions:

Use visual analytics to understand patterns of groups in the knowledge graph and highlight anomalous groups.

Use visual analytics to identify anomalies in the business groups present in the knowledge graph. Limit your response to 400 words and 5 images.
Develop a visual analytics process to find similar businesses and group them. This analysis should focus on a business’s most important features and present those features clearly to the user. Limit your response to 400 words and 5 images.
Measure similarity of businesses that you group in the previous question. Express confidence in your groupings visually. Limit your response to 400 words and 4 images.
Based on your visualizations, provide evidence for or against the case that anomalous companies are involved in illegal fishing. Which business groups should FishEye investigate further? Limit your response to 600 words and 6 images.

Please be noticed:

In this exercise, only question 1 and question 2 would be explored.

2. Load packages and data:

Show the code

pacman::p_load(jsonlite, tidygraph, ggraph, 
               visNetwork, graphlayouts, 
               skimr, tidytext, tidyverse)

Show the code

mc3 <- jsonlite::fromJSON("data/MC3.json")

2.1 Find the nodes and edges:

2.1.1 Nodes

Show the code

#view(mc2[["nodes"]])
mc3_nodes <- as_tibble(mc3$nodes) %>%
  mutate(country=as.character(country),
         id=as.character(id),
         product_services=as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)),
         type=as.character(type)) %>%
  select(id,country, type, revenue_omu, product_services) 
#  group_by(id,country, type, product_services) %>%
#  summarise(count=n(),revenue=sum(revenue_omu))

Show the code

skim(mc3_nodes)

Data summary
Name	mc3_nodes
Number of rows	27622
Number of columns	5
_______________________
Column type frequency:
character	4
numeric	1
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
id	1	6	64	22929
country	1	2	15	100
type	1	7	16	3
product_services	1	4	1737	3244

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
revenue_omu	21515	0.22	1822155	18184433	3652.23	7676.36	16210.68	48327.66	310612303	▇▁▁▁▁

2.1.2 Edges:

Show the code

#view(mc2[["links"]])
mc3_edges <- as_tibble(mc3$links) %>%
  distinct() %>%
  mutate(source=as.character(source),
         target=as.character(target),
         type=as.character(type)) %>%
  mutate(source=as.character(source)) %>%
  group_by(source, target, type) %>%
  summarise(weight=n()) %>%
  filter(source!=target) %>%
  ungroup()

Show the code

mc3_edges <- rbind(mc3_edges %>%
  mutate(source = ifelse(grepl("^c\\(.*\\)$", source), source, NA)) %>%
  drop_na() %>%
  mutate(source = strsplit(gsub("^c\\(\"|\"\\)$|\"", "", source), ",\\s*")) %>%
  unnest(source),
  mc3_edges[!grepl("^c\\(.*\\)$", mc3_edges$source), ])

Show the code

skim(mc3_edges)

Data summary
Name	mc3_edges
Number of rows	27169
Number of columns	4
_______________________
Column type frequency:
character	3
numeric	1
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
source	1	6	64	13158
target	1	6	28	21265
type	1	16	16	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
weight	0	1	1	0	1	1	1	1	1	▁▁▇▁▁

Caution Here:

There are plenty of rows in “source” column are comma separated characters, I broke them into individual rows by deliminator of “,”, remove all those “c(”“)”, and then rebind with other normal columns.

3. Initial Data Exploring:

3.1 Exploring the edge data frame.

Check the type distribution of edges:

Show the code

mc3_edges %>%
  group_by(type) %>%
  summarise(count=n())

# A tibble: 2 × 2
  type             count
  <chr>            <int>
1 Beneficial Owner 18691
2 Company Contacts  8478

3.2 Exploring the nodes data frame.

Check products & services of each kind of nodes:

Show the code

product_check <- left_join(mc3_nodes %>%
  group_by(type) %>%
  summarise(nodes=n()),
  mc3_nodes %>%
  mutate(n_0 = str_count(product_services, "character")) %>%
  filter(n_0>0) %>%
  group_by(type) %>%
  summarise(empty_product=n()),
  by=join_by(type),
  keep=FALSE)%>%
  mutate(nodes_with_product=nodes-empty_product)

product_check

# A tibble: 3 × 4
  type             nodes empty_product nodes_with_product
  <chr>            <int>         <int>              <int>
1 Beneficial Owner 11949         11926                 23
2 Company           8639             1               8638
3 Company Contacts  7034          7033                  1

Caution Here:

Most of Beneficial Owner and Company Contacts nodes’ product_services are indicated as “character(0)”, so only “Company” type of nodes got specific product and services description.

4. Pattern Analysis & Visualization

4.1 Check “Company” type of nodes with most “Company Contacts” type of edges.

Find those “Company Contacts” edges of “Company” nodes.

Show the code

company_contacts <- mc3_nodes%>%
  select(id,type) %>%
  filter(type=='Company') %>%
  distinct() %>%
  inner_join(mc3_edges%>%
              filter(type=='Company Contacts'),
            by=join_by(id==source),
            keep=TRUE,
            multiple="all",
            suffix=c('_nodes','_edges')) %>% 
  group_by(source,target) %>%
  summarise(weight=sum(weight)) %>%
  arrange(desc(weight)) %>%
  ungroup()

Extract and Visualize the companies with most company contacts edges.

Show the code

top10_contacts<- pull(head(company_contacts %>%
  group_by(source) %>%
  summarise(count=n(),weight=sum(weight)) %>%
  arrange(desc(count)), 10),source)

ggplot(data = company_contacts %>%
  group_by(source) %>%
  summarise(count=n(), weight=sum(weight)) %>%
  arrange(desc(count)) %>%
  head(10),
       aes(x = reorder(source, weight),y=weight)) +
  geom_col()+
  xlab(NULL) +
  coord_flip() +
      labs(x = "Company",
      y = "Weight",
      title = "Top 10 Companies with most company contacts")

4.1.1 Network of source “Aqua Aura SE Marine life”

Filter edges by source

Show the code

edges1 <- mc3_edges %>% 
  filter(grepl(top10_contacts[1],source)) %>%
  group_by(source,target,type) %>%
  summarise(weight=sum(weight)) %>%
  drop_na(weight) %>%
  ungroup()

Extract nodes.

Show the code

nodes1_extract <- rbind(edges1 %>%
                           rename(id=source)%>%
                             mutate(group='company') %>% 
                             select(id,group)%>%
                             distinct(),
                         edges1 %>%
                           rename(id=target) %>%
                           rename(group=type) %>%
                           select(id,group) %>%
                           distinct())

Check any duplication of nodes:

Show the code

nodes1_extract%>%
  group_by(id) %>%
  summarise(count=n())%>%
  filter(count>1)

# A tibble: 0 × 2
# ℹ 2 variables: id <chr>, count <int>

Building network model

Show the code

graph1 <- tbl_graph(nodes = nodes1_extract,
                       edges = edges1,
                       directed = FALSE) %>%
  mutate(between = centrality_betweenness(),
         close = centrality_closeness(),
         degree = centrality_degree())

Show the code

graph1 %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(width=weight,color=type), 
                 alpha=0.8) +
  scale_edge_width(range = c(0.5,3)) +
  geom_node_point(aes(alpha=0.5,
                  size = degree,
                  color = group)) + 
  scale_size_continuous(range=c(1,5))+
  geom_node_text(aes(filter=degree >= 3, label=id), show.legend = FALSE, size=4) +
  theme_graph()

Notes:

Here we can see this is a normal network which connected with relatively more beneficial owner and some company contacts. Hence, the heaviest weight is 3 with majority of weight is 1.

4.1.2 Network of “Irish Mackerel S.A.”

Filter edges and then Extract nodes

Show the code

edges2 <- mc3_edges %>% 
  filter(grepl(top10_contacts[2],source)) %>%
  group_by(source,target,type) %>%
  summarise(weight=sum(weight)) %>%
  drop_na(weight) %>%
  ungroup()

Show the code

nodes2_extract <- rbind(edges2 %>%
                           rename(id=source)%>%
                             mutate(group='company') %>% 
                             select(id,group)%>%
                             distinct(),
                         edges2 %>%
                           rename(id=target) %>%
                           rename(group=type) %>%
                           select(id,group) %>%
                           distinct())

Building network model

Show the code

graph2 <- tbl_graph(nodes = nodes2_extract,
                       edges = edges2,
                       directed = FALSE) %>%
  mutate(between = centrality_betweenness(),
         close = centrality_closeness(),
         degree = centrality_degree())

Show the code

graph2 %>%
    ggraph(layout = "fr") +
  geom_edge_link(aes(width=weight,color=type)) +
  scale_edge_width(range = c(0.5,3)) +
  geom_node_point(aes(size = degree,
                  color = group)) + 
  scale_size_continuous(range=c(1,5))+
  geom_node_text(aes(filter=degree >= 3, label=id), show.legend = FALSE, size=4) +
  theme_graph()

Caution Here:

Here we can see an abnormal network in which most of edges are connected with it’s company contacts, and the edge with highest weight is connected with it’s company contracts.

4.1.3 Network of “Kerala S.A.”

Filter edges and then Extract nodes

Show the code

edges3 <- mc3_edges %>% 
  filter(grepl(top10_contacts[3],source)) %>%
  group_by(source,target,type) %>%
  summarise(weight=sum(weight)) %>%
  drop_na(weight) %>%
  ungroup()

Show the code

nodes3_extract <- rbind(edges3 %>%
                           rename(id=source)%>%
                             mutate(group='company') %>% 
                             select(id,group)%>%
                             distinct(),
                         edges3 %>%
                           rename(id=target) %>%
                           rename(group=type) %>%
                           select(id,group) %>%
                           distinct())

Building network model

Show the code

graph3 <- tbl_graph(nodes = nodes3_extract,
                       edges = edges3,
                       directed = FALSE) %>%
  mutate(between = centrality_betweenness(),
         close = centrality_closeness(),
         degree = centrality_degree())

Show the code

graph3 %>%
    ggraph(layout = "kk") +
  geom_edge_link(aes(width=weight,color=type)) +
  scale_edge_width(range = c(0.5,3)) +
  geom_node_point(aes(size = degree,
                  color = group)) + 
  scale_size_continuous(range=c(1,5))+
  geom_node_text(aes(filter=degree >= 3, label=id), show.legend = FALSE, size=4) +
  theme_graph()

Caution Here:

Similar anomalies as previous one.

4.2 Nodes categorization by company ID:

4.2.1 Extract id text and preprocessing.

Word tokenization with punctuation and no lowercasing

Show the code

tidy_nodes <- mc3_nodes %>%
  unnest_tokens(word, id, to_lower = TRUE, strip_punct = TRUE,drop = FALSE)

Customize stop words and Remove stop words

Show the code

# Create a list of custom stopwords that should be added
word <- c("llc","plc","ltd","inc","de","del","company","corporation","liability")
lexicon <-  rep("custom", times=length(word))

# Create a dataframe from the two vectors above
mystopwords <- data.frame(word, lexicon)
names(mystopwords) <- c("word", "lexicon")

# Add the dataframe to stop_words df that exists in the library stopwords
stop_words <-  dplyr::bind_rows(stop_words, mystopwords)

Show the code

stopwords_removed <- tidy_nodes %>% 
  anti_join(stop_words)

Notes:

Here need to customize stop words to include some company suffixes.

4.2.2 Visualize the id words with most counts

Show the code

stopwords_removed %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
      labs(x = "Unique words of ID",
      y = "Count",
      title = "Top 15 Count of unique words found in company ID")

4.2.3 Check “company” ID highest frequency word:

Filter nodes with “Sons” in their ID text:

Show the code

nodes_sons <- mc3_nodes %>%
  filter(type=='Company') %>%
  select("id","country","type","product_services") %>%
  mutate(n_sons = str_count(id, "Sons")) %>%
  filter(n_sons>0) %>%
  distinct()

Notes:

Here I only took out the “company” type of nodes with “Sons” in their ID.

Filter edges by those ID we found with “Sons” and then Extract nodes

Show the code

edges_sons <- nodes_sons%>%
  select(id,country) %>%
  distinct() %>%
  left_join(mc3_edges,by=join_by(id==source),keep=TRUE,multiple="all") %>% 
  group_by(source,target,type) %>%
  summarise(weight=sum(weight)) %>%
  drop_na(weight) %>%
  ungroup()

Show the code

nodes_sons_extract <- rbind(edges_sons %>%
                           rename(id=source)%>%
                             mutate(group='company') %>% 
                             select(id,group)%>%
                             distinct(),
                         edges_sons %>%
                           rename(id=target) %>%
                           rename(group=type) %>%
                           select(id,group) %>%
                           distinct()) %>% 
  left_join(mc3_nodes%>%select("id"), 
             by = 'id', unmatched="drop", keep = FALSE, multiple= 'first')

Building network model

Show the code

visNetwork(nodes_sons_extract,edges_sons%>%
             rename(from=source)%>%
             rename(to=target)) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visEdges(smooth = list(enabled = TRUE, 
                         type = "curvedCW")) %>%
  visOptions(highlightNearest = TRUE,
             nodesIdSelection = TRUE) %>%
  visLegend() %>%
  visLayout(randomSeed = 123)

Notes:

Those nodes are individually connected with different set of nodes, and no obvious cluster of company contacts, indicating there’s no intensely close connected group of companies.

4.2.4 Check “company” ID 2nd highest frequency word:

Filter nodes with “Smith” in their ID text:

Show the code

nodes_smith <- mc3_nodes %>%
  filter(type=='Company') %>%
  select("id","country","type","product_services") %>%
  mutate(n_sons = str_count(id, "Smith")) %>%
  filter(n_sons>0) %>%
  distinct()

Filter edges by those ID we found and then Extract nodes

Show the code

edges_smith <- nodes_smith%>%
  select(id,country) %>%
  distinct() %>%
  left_join(mc3_edges,by=join_by(id==source),keep=TRUE,multiple="all") %>% 
  group_by(source,target,type) %>%
  summarise(weight=sum(weight)) %>%
  drop_na(weight) %>%
  ungroup()

Show the code

nodes_smith_extract <- rbind(edges_smith %>%
                           rename(id=source)%>%
                             mutate(group='company') %>% 
                             select(id,group)%>%
                             distinct(),
                         edges_smith %>%
                           rename(id=target) %>%
                           rename(group=type) %>%
                           select(id,group) %>%
                           distinct()) %>% 
  left_join(mc3_nodes%>%select("id"), 
             by = 'id', unmatched="drop", keep = FALSE, multiple= 'first')

Building network model

Show the code

visNetwork(nodes_smith_extract,edges_smith%>%
             rename(from=source)%>%
             rename(to=target)) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visEdges(smooth = list(enabled = TRUE, 
                         type = "curvedCW")) %>%
  visOptions(highlightNearest = TRUE,
             nodesIdSelection = TRUE) %>%
  visLegend() %>%
  visLayout(randomSeed = 123)

Notes:

Similar pattern with previous network with several central nodes which are connected with a set of beneficial owners and company contacts.