Who Runs Our Country? Analyzing National People's Congress Delegate Data with R

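The news media have recently been saturated with coverage of the Two Sessions, so we are joining in. We could not find data on this year's National People's Congress (NPC) delegates, but the 2019 delegate data is available from this site: https://news.cgtn.com/event/2019/whorunschina/index.html. We have already scraped the data and attached it to this post.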

Reading the data

First, read the data:

library(tidyverse)
library(hrbrthemes)
# Set the ggplot2 theme. `cnfont` is a font family defined in the author's R profile file;
# if R reports that `cnfont` does not exist, delete the `base_family = cnfont` argument.
theme_set(theme_ipsum(base_family = cnfont))

# Read the data
df <- read_csv('NPC.csv')
# Take a quick look at the data:
glimpse(df)

#> Rows: 2,975
#> Columns: 25
#> $ Delegation <chr> "Anhui", "Hunan", "Shaanxi", "Fujian", "Zheji…
#> $ Partisan <chr> "CPC(Communist Party of China)", "CPC(Communi…
#> $ 党派 <chr> "中共", "中共", "中共", "无", "农工党", "民主同盟", "民主同盟", "…
#> $ Name <chr> "Ding Shiqi", "Ding Xiaobing", "Ding Yunxiang…
#> $ 姓名 <chr> "丁士启", "丁小兵", "丁云祥", "丁世忠", "丁列明", "丁光宏", "丁仲礼"…
#> $ Gender <chr> "Male", "Male", "Male", "Male", "Male", "Male…
#> $ 性别 <chr> "男", "男", "男", "男", "男", "男", "男", "男", "女", "男…
#> $ `Birth year` <dbl> 1966, 1960, 1963, 1970, 1963, 1963, 1957, 195…
#> $ Age <dbl> 53, 59, 56, 49, 56, 56, 62, 62, 52, 49, 40, 3…
#> $ Generation <chr> "1960s", "1960s", "1960s", "1970s", "1960s", …
#> $ 年代 <chr> "60后", "60后", "60后", "70后", "60后", "60后", "50后"…
#> $ Ethnicity <chr> "Han", "Han", "Han", "Hui", "Han", "Han", "Ha…
#> $ 民族 <chr> "汉族", "汉族", "汉族", "回族", "汉族", "汉族", "汉族", "汉族",…
#> $ Birthplace <chr> "Anhui", "Gansu", "Shaanxi", "Fujian", "Zheji…
#> $ 籍贯 <chr> "安徽", "甘肃", "陕西", "福建", "浙江", "安徽", "浙江", "浙江",…
#> $ Region <chr> "Central China", "Western China", "Western Ch…
#> $ 区域 <chr> "中部", "西部", "西部", "东部", "东部", "中部", "东部", "东部",…
#> $ `Subject Department` <chr> "Natrual Sciences", "Unknown", "Natrual Scien…
#> $ 专业分类 <chr> "自然科学", "未知", "自然科学", "社会科学", "自然科学", "自然科学", "自然…
#> $ Major <chr> "Engineering", "Unknown", "Engineering", "Man…
#> $ 人文社科拆后专业 <chr> "工学", "未知", "工学", "管理学", "医学", "理学", "理学", "未知", "理学"…
#> $ `Educational background` <chr> "PhD", "Unknown", "PhD", "Bachelor", "PhD", "…
#> $ 学历 <chr> "博士研究生", "未知", "博士研究生", "本科", "博士研究生", "硕士研究生",…
#> $ `Ever studied abroad` <chr> "No", "Unknown", "No", "No", "Yes", "No", "No…
#> $ 海外留学经验 <chr> "无", "未知", "无", "无", "有", "无", "无", "无", "无", "无", …
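
The `cnfont` object used above is not created anywhere in this post; it comes from the author's R profile. The following is a minimal sketch of a fallback, not the author's actual setup. It assumes that an empty family string (which leaves the font choice to the graphics device) is acceptable, and the font name "STSong" in the comment is only a placeholder for a CJK font installed on your machine.

```r
# Hypothetical fallback: define cnfont only if the R profile did not already do so.
# "" leaves the font choice to the graphics device; replace it with the name of a
# CJK font you have installed if Chinese labels do not render.
if (!exists("cnfont")) {
  cnfont <- ""   # for example: cnfont <- "STSong" (placeholder name)
}
print(cnfont)
```

Running this before `theme_set()` avoids the "object not found" error without editing every later call that passes `cnfont`.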

Total number of delegates

df %>%
  count() %>%
  pull()

#> [1] 2975

Count the delegates in each delegation:

df %>%
  count(Delegation)

#> # A tibble: 35 x 2
#> Delegation n
#> * <chr> <int>
#> 1 Anhui 111
#> 2 Beijing 54
#> 3 Chongqing 60
#> 4 Fujian 69
#> 5 Gansu 52
#> 6 Guangdong 163
#> 7 Guangxi 89
#> 8 Guizhou 73
#> 9 Hainan 25
#> 10 Hebei 123
#> # … with 25 more rows

I translated the delegation names in this result into Chinese by hand and saved them in delegation2.csv:

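# Read the hand-translated table and keep the Chinese delegation name and the head count
read_csv("delegation2.csv") %>%
  transmute(
    delegation = `代表团`,
    count = n
  ) %>%
  mutate(
    # Spell out the PLA and Armed Police delegation's name and break it across two lines
    delegation = if_else(delegation == "中国人民解放军武警部队", "中国人民解放军\n武装警察部队", delegation)
  ) -> delegation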

Next we draw a waffle chart of this result. The waffle package returns a gg object, so ggplot2 layers can be added to it directly:

library(waffle)
colors <- paletteer::paletteer_d("ggsci::default_igv", 35)
# waffle() expects a named vector, so convert delegation first
dv <- delegation$count
names(dv) <- delegation$delegation
waffle(dv, rows = 40, size = 0.5, colors = colors) +
  labs(title = "NPC delegates: 2,975",
       subtitle = "35 delegations in total",
       caption = "Source: Who runs China?\n<https://news.cgtn.com/event/2019/whorunschina/>") +
  theme_ipsum(base_family = cnfont) +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        panel.grid.major = element_blank(),
        legend.key.size = unit(0.5, "cm"))
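
The named-vector conversion can also be written in one step with base R's `setNames()`. The sketch below uses three counts copied from the `count(Delegation)` output shown earlier (the object names `delegation_toy` and `dv_toy` are made up for this example) and works out how wide the waffle grid becomes when `rows = 40`.

```r
# Toy subset: three rows copied from the count(Delegation) output above
delegation_toy <- data.frame(
  delegation = c("Anhui", "Beijing", "Chongqing"),
  count = c(111, 54, 60)
)

# setNames() attaches the names and returns the named vector in one call
dv_toy <- setNames(delegation_toy$count, delegation_toy$delegation)
print(dv_toy)

# With 40 rows, 2975 squares need ceiling(2975 / 40) columns
n_cols <- ceiling(2975 / 40)
print(n_cols)  # 75
```

A grid of 40 rows by 75 columns has 3,000 cells, so the last column of the chart is only partly filled.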

We can also pull out the nine largest delegations and compare them:

library(forcats)
library(ggchicklet)
# Installing ggchicklet from a local archive:
# devtools::install_local('ggchicklet_0.5.2.tar.gz')
# Total head count of the nine largest delegations:
delegation %>%
  arrange(desc(count)) %>%
  slice(1:9) %>%
  summarise(total = sum(count)) %>%
  pull()

#> [1] 1454

# Draw a bar chart:
simcolors <- c("#fed439", "#709ae1", "#8a9197", "#d2af81", "#fd7446", "#d5e4a2", "#197ec0", "#f05c3b", "#46732e", "#71d0f5", "#370335", "#075149", "#c80813", "#91331f", "#1a9993", "#fd8cc1")
delegation %>%
  arrange(desc(count)) %>%
  slice(1:9) %>%
  mutate(
    # Order the bars by head count
    delegation = fct_reorder(delegation, count)
  ) %>%
  ggplot() +
  geom_chicklet(aes(x = delegation, y = count,
                    fill = delegation, color = delegation)) +
  scale_color_manual(values = simcolors) +
  scale_fill_manual(values = simcolors) +
  labs(x = "", y = "Delegates",
       title = "The nine largest NPC delegations",
       subtitle = "Together these nine delegations have 1,454 delegates",
       caption = "Source: Who runs China?\n<https://news.cgtn.com/event/2019/whorunschina/>") +
  coord_flip() +
  theme(legend.position = "none")
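
To put the 1,454 figure in context, here is a small base R sketch. The first part repeats the "sort, then take the top rows" step on the ten delegations visible in the earlier `count(Delegation)` output (so its top three are not necessarily the overall top three); the second part converts the nine-delegation total into a share of all 2,975 delegates. The object names are made up for this example.

```r
# Counts for the ten delegations visible in the count(Delegation) output
visible <- c(Anhui = 111, Beijing = 54, Chongqing = 60, Fujian = 69, Gansu = 52,
             Guangdong = 163, Guangxi = 89, Guizhou = 73, Hainan = 25, Hebei = 123)

# Base R counterpart of arrange(desc(count)) %>% slice(1:3)
top3 <- head(sort(visible, decreasing = TRUE), 3)
print(top3)
print(sum(top3))  # 397

# Share of all delegates held by the nine largest delegations (figures from the text)
top9_share <- round(100 * 1454 / 2975, 1)
print(top9_share)  # 48.9
```

In other words, nine of the 35 delegations account for just under half of all delegates.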

Gender and age distribution

We can look at gender and age together:

df %>%
  count(年代, 性别) %>%
  ggplot() +
  geom_col(aes(x = 年代, y = n,
               fill = 性别, color = 性别)) +
  labs(x = "Generation", y = "Delegates", title = "Age distribution of NPC delegates",
       # Mean age, ignoring missing values
       subtitle = paste("Mean age:", round(mean(df$Age, na.rm = TRUE), 2), "years"),
       caption = "Source: Who runs China?\n<https://news.cgtn.com/event/2019/whorunschina/>") +
  scale_color_manual(values = c("男" = "#709ae1", "女" = "#fed439"), name = "Gender") +
  scale_fill_manual(values = c("男" = "#709ae1", "女" = "#fed439"), name = "Gender") +
  theme(legend.position = c(0.15, 0.8))

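The chart shows that delegates born in the 1960s form the core of the NPC, and that among those born in the 1990s women outnumber men. The mean age of the delegates is 53.77. Of the 2,975 delegates, 1,672 were born in the 1960s, more than half of the total. The chart also shows that the younger the cohort, the more balanced its gender ratio.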
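
The "more than half" claim and the per-generation gender ratio can be checked with a few lines of base R. This is only a sketch: the first figure uses the numbers quoted in the text, while the `toy` data frame is invented for illustration and merely borrows the `Generation` and `Gender` column names from `NPC.csv` (the value "Female" is assumed to be the counterpart of the "Male" value seen in the glimpse output).

```r
# Claim check: 1672 of 2975 delegates were born in the 1960s
share_1960s <- round(100 * 1672 / 2975, 1)
print(share_1960s)  # 56.2

# Toy rows, invented for illustration (not taken from NPC.csv)
toy <- data.frame(
  Generation = c("1960s", "1960s", "1960s", "1960s", "1990s", "1990s", "1990s"),
  Gender     = c("Male", "Male", "Male", "Female", "Female", "Female", "Male")
)

# The mean of a logical vector is a proportion, so this gives the
# share of women within each generation
female_share <- tapply(toy$Gender == "Female", toy$Generation, mean)
print(round(female_share, 2))
```

Applied to the real `df`, the same `tapply()` call would give the share of women for every generation at once.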

Does the share of women look low? In fact, the proportion of women delegates has risen steadily over the last few congresses:

library(scatterpie)
# Share of women delegates in the 10th to 13th NPC
wm <- tribble(
  ~year, ~woman,
  10, 0.202,
  11, 0.213,
  12, 0.234,
  13, 0.249
) %>%
  mutate(
    man = 1 - woman
  )

# One pie per congress, placed along the x axis
ggplot() +
  geom_scatterpie(aes(x = year, y = 5,
                      group = year),
                  cols = c("woman", "man"),
                  data = wm,
                  pie_scale = 5,
                  color = "#FFFEEA") +
  coord_equal() +
  scale_x_continuous(breaks = 10:13,
                     labels = paste0(10:13, "th")) +
  scale_fill_manual(name = "Gender",
                    values = c("man" = "#709ae1",
                               "woman" = "#fed439"),
                    breaks = c("man", "woman"),
                    labels = c("Men", "Women")) +
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  labs(x = "National People's Congress\n\n",
       title = "Share of women delegates across successive NPCs",
       subtitle = "Women make up 24.9% of the 13th NPC, 4.7 percentage points more than in the 10th\n\n\n",
       caption = "Source: Who runs China?\n<https://news.cgtn.com/event/2019/whorunschina/>")
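
The size of the increase follows directly from the four shares in the `tribble()` above. A short base R check (the object names are made up for this example):

```r
# Share of women delegates, as entered in the tribble above
share <- c("10th" = 0.202, "11th" = 0.213, "12th" = 0.234, "13th" = 0.249)

# Change between consecutive congresses, in percentage points
pp_change <- round(diff(share) * 100, 1)
print(pp_change)  # 1.1, 2.1, 1.5

# Change from the 10th and from the 11th congress to the 13th
since_10th <- round((share[["13th"]] - share[["10th"]]) * 100, 1)
since_11th <- round((share[["13th"]] - share[["11th"]]) * 100, 1)
print(since_10th)  # 4.7
print(since_11th)  # 3.6
```

So the 4.7-point gain is measured from the 10th congress; measured from the 11th, the gain is 3.6 points.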

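Ethnic distribution

China is a multi-ethnic country with 56 ethnic groups: the Han plus 55 minority groups. How many of them are represented among the delegates?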

df %>%
  count(民族)

#> # A tibble: 56 x 2
#> 民族 n
#> * <chr> <int>
#> 1 阿昌族 1
#> 2 白族 8
#> 3 保安族 1
#> 4 布朗族 1
#> 5 布依族 7
#> 6 藏族 33
#> 7 朝鲜族 12
#> 8 达斡尔族 1
#> 9 傣族 5
#> 10 德昂族 1
#> # … with 46 more rows

The result has 56 rows, so every ethnic group has at least one delegate. A treemap can show each group's share of the delegates:

library(treemapify)
df %>%
  count(民族) %>%
  mutate(
    # 1 for Han, 0 for every other group; used only to pick the fill color
    ishan = (民族 == "汉族") * 1
  ) %>%
  ggplot(aes(area = n, fill = factor(ishan))) +
  geom_treemap() +
  geom_treemap_text(aes(label = 民族), family = cnfont,
                    size = 12, color = "black") +
  scale_fill_manual(values = simcolors) +
  labs(title = "Ethnic distribution of NPC delegates",
       subtitle = "China is a multi-ethnic country",
       caption = "Source: Who runs China?\n<https://news.cgtn.com/event/2019/whorunschina/>") +
  theme(legend.position = "none")

Han delegates number 2,538, about 85% of the total.

Educational background

Unsurprisingly, the delegates are highly educated:

df %>%
  count(性别, 学历) %>%
  mutate(
    学历 = fct_reorder(学历, n)
  ) %>%
  ggplot() +
  geom_chicklet(aes(x = 学历, y = n, fill = 性别)) +
  scale_color_manual(values = c("男" = "#709ae1", "女" = "#fed439"), name = "Gender") +
  scale_fill_manual(values = c("男" = "#709ae1", "女" = "#fed439"), name = "Gender") +
  theme(legend.position = c(0.15, 0.8)) +
  labs(x = "", y = "Delegates",
       title = "Educational background of NPC delegates",
       subtitle = "Nine in ten delegates (88.5%) hold a bachelor's degree or higher.\nMaster's degrees account for the largest share (836 delegates),\nfollowed by doctorates (584 delegates).",
       caption = "Source: Who runs China?\n<https://news.cgtn.com/event/2019/whorunschina/>")

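If we care only about the gender ratio within each education level, and not about how many delegates each level has, we can use 100% fill mode: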

df %>%
  count(性别, 学历) %>%
  mutate(
    学历 = fct_reorder(学历, n)
  ) %>%
  ggplot() +
  # position_fill() rescales every bar to a total of 1, so y shows shares
  geom_chicklet(aes(x = 学历, y = n, fill = 性别),
                position = position_fill()) +
  scale_color_manual(values = c("男" = "#709ae1", "女" = "#fed439"), name = "Gender") +
  scale_fill_manual(values = c("男" = "#709ae1", "女" = "#fed439"), name = "Gender") +
  theme(legend.position = "right") +
  labs(x = "", y = "Share",
       title = "Educational background of NPC delegates",
       subtitle = "Nine in ten delegates (88.5%) hold a bachelor's degree or higher.\nMaster's degrees account for the largest share (836 delegates),\nfollowed by doctorates (584 delegates).",
       caption = "Source: Who runs China?\n<https://news.cgtn.com/event/2019/whorunschina/>")
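
What 100% fill does can be reproduced by hand: each count is divided by the total of its own bar. The sketch below uses invented counts (the `toy` data frame is hypothetical, not taken from `NPC.csv`) and base R's `ave()` to compute the within-level totals.

```r
# Invented counts for two education levels, for illustration only
toy <- data.frame(
  education = c("PhD", "PhD", "Bachelor", "Bachelor"),
  gender    = c("Male", "Female", "Male", "Female"),
  n         = c(30, 10, 45, 15)
)

# ave() returns, for every row, the sum of n within that row's education level;
# dividing by it gives the share that position_fill() would draw
toy$share <- toy$n / ave(toy$n, toy$education, FUN = sum)
print(toy)
```

Both toy levels end up with a 75/25 split even though their head counts differ, which is exactly the information a filled bar keeps and a stacked bar hides.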

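Academic disciplines

What are the delegates' academic backgrounds? Under the Chinese Ministry of Education's classification of majors, management, philosophy, literature, history, education, art, economics, law, and military science belong to the humanities and social sciences, while science, engineering, agriculture, and medicine are natural sciences.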

library(ggpol)
# Discipline names, in the order used for the factor levels and the legend
c("法学", "工学", "管理学", "教育学", "经济学",
  "军事", "理学", "历史学", "农学", "未知",
  "文学", "医学", "艺术", "哲学") -> labels
df %>%
  count(人文社科拆后专业) %>%
  mutate(
    人文社科拆后专业 = factor(
      人文社科拆后专业,
      levels = labels,
      labels = labels)
  ) %>%
  ggplot() +
  # One dot per delegate, arranged as a parliament (hemicycle) chart
  geom_parliament(aes(seats = n, fill = 人文社科拆后专业,
                      color = 人文社科拆后专业)) +
  coord_fixed() +
  scale_fill_manual("Discipline",
                    values = simcolors,
                    breaks = 1:14,
                    labels = labels) +
  scale_color_manual("Discipline",
                     values = simcolors,
                     breaks = 1:14,
                     labels = labels) +
  guides(color = guide_legend(ncol = 2),
         fill = guide_legend(ncol = 2)) +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank()) +
  labs(title = "What are the delegates' academic backgrounds?",
       subtitle = "Under the Ministry of Education's classification, management, philosophy,\nliterature, history, education, art, economics, law, and military science are\nhumanities and social sciences; science, engineering, agriculture, and medicine\nare natural sciences.",
       caption = "Source: Who runs China?\n<https://news.cgtn.com/event/2019/whorunschina/>")
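
The two-way classification described above can be written as a small lookup. This is a sketch under assumptions: the mapping of the fourteen Chinese discipline labels to the two groups follows the sentence above, the helper name `classify_major()` is made up for this example, and the dataset's own `专业分类` column may have been derived differently.

```r
# Discipline labels grouped according to the classification quoted in the text
humanities <- c("管理学", "哲学", "文学", "历史学", "教育学",
                "艺术", "经济学", "法学", "军事")
natural    <- c("理学", "工学", "农学", "医学")

# Hypothetical helper: map a discipline label to its broad category
classify_major <- function(major) {
  ifelse(major %in% humanities, "Humanities and social sciences",
         ifelse(major %in% natural, "Natural sciences", "Unknown"))
}

print(classify_major(c("工学", "管理学", "未知")))
```

Anything outside the two lists, including the `未知` (unknown) label, falls through to "Unknown".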

Party affiliation

df %>%
  count(党派) %>%
  rename(人数 = n) %>%
  knitr::kable(align = "c")

| 党派 | 人数 |
|:---:|:---:|
| 九三学社 | 64 |
| 民革 | 44 |
| 民主建国会 | 57 |
| 民主同盟 | 58 |
| 农工党 | 54 |
| 台盟 | 11 |
| 无 | 423 |
| 致公党 | 38 |
| 中共 | 2172 |
| 中国民主促进会 | 54 |
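
As a quick consistency check, the ten counts should add up to the 2,975 delegates found earlier. The base R sketch below re-enters the counts by hand; the 423 entry is taken to be `无` (no party affiliation), a value that appears in the `党派` column of the glimpse output.

```r
# Party counts re-entered from the table above
party <- c("九三学社" = 64, "民革" = 44, "民主建国会" = 57, "民主同盟" = 58,
           "农工党" = 54, "台盟" = 11, "无" = 423, "致公党" = 38,
           "中共" = 2172, "中国民主促进会" = 54)

# The counts cover every delegate
print(sum(party))  # 2975

# Share held by the CPC (中共) and by everyone else, in percent
cpc_share <- round(100 * party[["中共"]] / sum(party), 1)
print(cpc_share)        # 73
print(100 - cpc_share)  # 27
```

Roughly three in four delegates are CPC members; the remaining quarter is split between the eight other parties and the unaffiliated.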

An alluvial diagram can show the party composition of each delegation:

library(ggalluvial)
df %>%
  # Attach the Chinese delegation names, then count delegates per delegation and party
  left_join(read_csv("delegation2.csv"), by = "Delegation") %>%
  count(代表团, 党派) %>%
  ggplot(aes(axis1 = 代表团, axis2 = 党派, y = n)) +
  scale_x_discrete(limits = c("代表团", "党派"),
                   expand = c(0.1, 0.05)) +
  geom_stratum() +
  geom_alluvium(aes(fill = 党派)) +
  geom_text(aes(label = after_stat(stratum)),
            stat = "stratum",
            family = cnfont,
            size = 3) +
  scale_fill_manual(values = simcolors) +
  guides(fill = "none") +
  theme(axis.text.y = element_blank(),
        axis.title.y = element_blank()) +
  labs(y = "",
       title = "NPC delegates by delegation and party",
       subtitle = "The CPC has the most delegates: 2,172",
       caption = "Source: Who runs China?\n<https://news.cgtn.com/event/2019/whorunschina/>")

Map of the delegates' birthplaces

This can be drawn with ggplot2 and sf:

library(ggplot2)
library(sf)
# Count delegates by birthplace; the first two characters of the name serve as the join key
df %>%
  count(籍贯) %>%
  mutate(prov = str_sub(籍贯, 1, 2)) -> provdata

read_sf("2019省级行政区划/省.shp") %>%
  mutate(prov = str_sub(省, 1, 2)) %>%
  left_join(provdata, by = "prov") -> mapdata

mapborder <- read_sf("九段线.geojson")

# Bin the counts and draw the map
library(ggspatial)
mapdata %>%
  # Provinces with no matching delegates get a count of 0
  mutate(n = as.numeric(n),
         n = if_else(is.na(n), 0, n)) %>%
  mutate(group = cut(n,
                     breaks = c(0, 50, 100,
                                150, 200, 250,
                                300, 350),
                     include.lowest = TRUE,
                     labels = c("0~50", "50~100",
                                "100~150", "150~200",
                                "200~250", "250~300", "300~350"))) %>%
  ggplot() +
  geom_sf(aes(geometry = geometry,
              fill = group),
          size = 0.05) +
  geom_sf(aes(geometry = geometry),
          data = mapborder,
          size = 0.2) +
  # Seven colors, so the scale still works if all seven bins are occupied
  scale_fill_manual(values = c("#e0f2f1", "#b2dfdb", "#80cbc4", "#4db6ac", "#26a69a", "#009688", "#00897b")) +
  theme(legend.position = c(0.1, 0.3),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank()) +
  labs(title = "Birthplaces of NPC delegates",
       subtitle = "Shandong is the most common birthplace, with 320 delegates",
       caption = "Source: Who runs China?\n<https://news.cgtn.com/event/2019/whorunschina/>",
       fill = "Delegates") +
  theme(plot.margin = grid::unit(c(0.5, 1.5, 0.5, 1), "cm")) +
  coord_sf(crs = "+proj=lcc +lat_1=30 +lat_2=62 +lat_0=0 +lon_0=105 +x_0=0 +y_0=0 +ellps=krass +units=m +no_defs",
           xlim = c(-3500000, 3090000)) +
  annotation_scale(
    width_hint = 0.2,
    text_family = cnfont
  ) +
  annotation_north_arrow(
    location = "tr", which_north = "false",
    width = unit(1.6, "cm"),
    height = unit(2, "cm"),
    style = north_arrow_fancy_orienteering(
      text_family = cnfont
    )
  )
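
The binning step is easier to follow on a few hand-picked values. The sketch below applies the same breaks and labels with base R's `cut()`; the input vector is invented for illustration, apart from 320, which is the Shandong count quoted in the subtitle.

```r
# Same bin edges and labels as in the map code
bin_labels <- c("0~50", "50~100", "100~150", "150~200",
                "200~250", "250~300", "300~350")
n <- c(0, 50, 51, 320)

# Intervals are closed on the right, e.g. (50, 100]; include.lowest = TRUE
# additionally closes the first interval on the left so that 0 falls in "0~50"
bins <- cut(n, breaks = seq(0, 350, by = 50),
            include.lowest = TRUE, labels = bin_labels)

print(as.character(bins))  # "0~50" "0~50" "50~100" "300~350"
print(nlevels(bins))       # 7
```

A count of exactly 50 lands in "0~50", not "50~100", and without `include.lowest = TRUE` a province with no delegates would become `NA`. The resulting factor has seven levels, so a manual fill scale needs up to seven colors if every bin is occupied.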

Read more: https://mp.weixin.qq.com/s/35P5yoZtL1ENKRQ0z9J5MQ
Video walkthrough: https://rstata.duanshu.com/#/brief/course/c150d3474c8c435aa19f3caec6d2be2a

https://tidyfriday.cn/posts/50106/

Author: Painter
Published: 2021-03-10
Updated: 2021-03-14
