1 Introduction

Resumes play an important role in a person's career. Job seekers use them to find jobs, and human resources (HR) staff use them to select candidates. A resume records almost all key career events together with demographic information, so many valuable patterns can be mined from resumes. However, because resumes are treated as commercial secrets, complete resumes are difficult to obtain.

Visualization techniques, which are good at storytelling, have already been used to assist the analysis of humanities data [1]. Previous work has presented different aspects of humanities visualization, such as visualizing traffic data [2], table tennis data [3], and e-commerce data [4]. However, as far as we know, almost no visualization work is dedicated to resumes.

To obtain resumes with complete attributes, we collected resumes through offline cooperation with as many companies as possible. Over five years, we collected 372,829 Chinese resumes with complete attributes, such as income and Chinese ID number. A data distribution analysis verifies the diversity and validity of the data set. We then developed an interactive visualization system, called ResumeVis, to explore the correlations among the attributes. One attribute of a resume, such as education experiences or work experiences, may hold multiple values. Taking this characteristic into account, we propose a new correlation representation: parallel coordinates with multi-valued attributes. In addition, user-friendly interactions, such as filtering elements, reordering attributes, and brushing and linking, are integrated to provide an easy-to-use interface. The system provides a personal perspective and a correlation perspective. Finally, three case studies from different aspects are presented to validate the usability of the system.

The main contributions include:

  • A large resume data set with complete attributes is constructed, which contains 372,829 Chinese resumes and valuable attributes, such as income and Chinese ID number.

  • A complete interactive visualization system is developed to explore the correlations of resumes.

  • We propose a new correlation representation, parallel coordinates with multi-valued attributes, which supports multiple values for one attribute of an item.

2 Related Work

2.1 The Visualization of Humanities

Visualization techniques are good at storytelling and are increasingly used to visualize humanities data [5], a field known as digital humanities [6]. Compared with scientific data, humanities data is multi-dimensional, and its attributes are more complicated and diverse [7]. Previous work has presented different aspects of humanities visualization. For example, Xiaoying et al. visualized traffic data to find flow patterns [2]. Yingcai et al. presented a system called iTTVis to visualize table tennis data [3]. Wang developed GIS software called “MeteoInfo” for meteorological data visualization [8]. However, as far as we know, almost no visualization work is dedicated to resumes.

Interactions play an important role in information visualization because they help users mine the information hidden in the data [9]. There are different kinds of interactions, such as brushing [10], filtering [11], and focusing and linking [12, 13]. In this paper, flexible and rich interactions are integrated into the proposed system.

2.2 The Visualization of Correlations

There are many techniques for visualizing correlations, such as scatterplot matrices [14], graphs [15], maps [16] and parallel coordinates [17]. Among these methods, parallel coordinates are a common way of visualizing high-dimensional data and analyzing multivariate data [18]. A parallel coordinates plot consists of n parallel axes, typically vertical and equally spaced. Each data item in the data set is represented as a polyline with vertices on the parallel axes.

Several modified versions of parallel coordinates have been proposed for different visualization applications. For example, Xiaoying et al. presented parallel coordinates with line and set, depicting attributes by lines and rectangular bars [2]. Kosara et al. proposed “parallel sets” for interactively exploring categorical data [19]. Alexander et al. proposed “VisBricks” for multiform visualization of large, inhomogeneous data by extending the idea of “parallel sets” [20]. Graham et al. used curves to enhance parallel coordinate visualisations [21]. However, none of the above methods is suitable for resume data, in which one attribute of an item can have multiple values. In this paper, we therefore propose improved parallel coordinates with multi-valued attributes to adapt to the characteristics of resumes.

3 The Resume Data Set

3.1 Data Collection

Resumes have three crucial aspects: the demographics, the education experiences and the work experiences. Because resumes involve personal privacy and commercial secrets, online resumes contain very limited information. We cannot collect resumes through web crawlers, since the important information is hidden from the public, and pattern mining on incomplete resume information is easily biased. To get complete information, we collected resumes through offline cooperation with as many companies as possible.

Over five years of collection, we gathered 372,829 Chinese resumes in total. We converted the various sources into a unified JSON format. In the JSON file, each item is a resume with dimensions in four aspects. The dimensions of the resumes are shown in Table 1, including the demographics, the education experiences, the work experiences and the training experiences.

Table 1. Dimensions of the resumes

We use dummy coding to represent attribute values as numbers. An employee may have multiple education experiences, so the education category contains multiple education items. Each item contains the school name, the major name, the start time, the end time, and the academic qualification obtained at this school. Similarly, an employee may have multiple work experiences. Each work item contains the company name, the department name, the job name, the start time and the end time. In addition, the training experiences, which are directly related to the work experiences, are taken into account.
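The record layout described above can be sketched as a small data structure. This is only an illustration: the class names, field names and example values below are hypothetical, and the actual JSON keys of the data set are not given in the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EducationItem:
    # One education experience; field names are illustrative assumptions.
    school: str
    major: str
    start_year: int
    end_year: int
    qualification: str


@dataclass
class WorkItem:
    # One work experience; field names are illustrative assumptions.
    company: str
    department: str
    job: str
    start_year: int
    end_year: int


@dataclass
class Resume:
    # Single-valued demographics plus multi-valued experience lists.
    gender: Optional[str] = None
    year_of_birth: Optional[int] = None
    education: List[EducationItem] = field(default_factory=list)
    work: List[WorkItem] = field(default_factory=list)


# A made-up resume with two education items and no work items.
resume = Resume(
    gender="female",
    year_of_birth=1979,
    education=[
        EducationItem("School A", "economics", 1999, 2001, "college"),
        EducationItem("School B", "economics", 2002, 2005, "undergraduate"),
    ],
)
print(len(resume.education))  # number of education items in this resume
```

Keeping the experiences as lists makes the multi-valued nature of these attributes explicit, which is the property the visualization later has to handle.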

3.2 Data Distribution

Before exploring the correlations of resumes, we perform a statistical analysis of the data set to verify the diversity of its distribution and its validity. Since the resumes come from different sources, our collection algorithms perform a large amount of data integration and classification. The details are described below.

Gender. There are 206,904 resumes with the “gender” dimension, including 104,316 females and 102,588 males. It can be seen that the proportions of men and women in the data set are balanced.

Date of Birth. Some resumes state the date of birth explicitly, while others record the Chinese ID number. In the Chinese ID number, the 7th to 14th digits are the date of birth. If the explicitly recorded date of birth conflicts with the date extracted from the Chinese ID number, our data collection algorithm takes the date from the Chinese ID number, because a manually entered date of birth is more likely to contain errors.
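The rule above can be sketched in a few lines. This is a minimal sketch rather than the authors' collection algorithm: it assumes an 18-character ID whose 7th to 14th characters are the birth date in YYYYMMDD form, the function names are hypothetical, and the example ID is made up and not checksum-validated.

```python
from datetime import date
from typing import Optional


def birth_from_id(id_number: str) -> Optional[date]:
    """Return the date encoded in characters 7-14, or None if it cannot be read."""
    digits = id_number[6:14]
    if len(id_number) != 18 or not (digits.isascii() and digits.isdigit()):
        return None
    try:
        return date(int(digits[0:4]), int(digits[4:6]), int(digits[6:8]))
    except ValueError:  # e.g. month 13 or day 32
        return None


def resolve_birth(explicit: Optional[date], id_number: Optional[str]) -> Optional[date]:
    """Prefer the ID-derived date; fall back to the explicitly recorded one."""
    from_id = birth_from_id(id_number) if id_number else None
    return from_id if from_id is not None else explicit


# Made-up ID used only for illustration.
print(resolve_birth(date(1980, 1, 1), "110101197903150018"))
```

When both sources are present and disagree, the ID-derived date wins, matching the preference stated above; the explicit date is used only when no readable ID is available.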

We focus on the resumes with a date of birth between 1950 and 1999; in total, 208,831 resumes meet this condition. There are 1,155 persons born in the 1950s (0.55% of this population), 17,595 born in the 1960s (8.43%), 110,849 born in the 1970s (53.08%), 76,663 born in the 1980s (36.71%), and 2,569 born in the 1990s (1.23%). The distribution shows that people born in the 1970s and 1980s are the main force of the workforce, making up nearly 90% of the total. This is consistent with the distribution of the social workforce.
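The percentages above follow directly from the reported counts, as the following short check shows. The counts are taken from the text; the dictionary keys and variable names are only labels chosen for this sketch.

```python
# Resume counts per decade of birth, as reported in the text.
counts = {
    "1950s": 1155,
    "1960s": 17595,
    "1970s": 110849,
    "1980s": 76663,
    "1990s": 2569,
}

total = sum(counts.values())
shares = {decade: round(100 * n / total, 2) for decade, n in counts.items()}
main_force = round(100 * (counts["1970s"] + counts["1980s"]) / total, 2)

print(total)       # 208831
print(shares)      # share of each decade in percent
print(main_force)  # combined share of the 1970s and 1980s in percent
```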

Marital Status. There are 187,224 resumes with the “marital status” dimension. Of these, 44,824 (23.94%) are single, 65,148 (34.8%) are married, 516 (0.27%) are divorced, and 76,736 (40.99%) keep their marital status confidential.

Email. There are 196,609 resumes with the “email” dimension, and 5,981 different email domains appear in these resumes. Most of these domains appear in only one resume. The nine domains with the most users are 163.com (33,596), hotmail.com (30,003), sina.com (18,941), 126.com (17,988), yahoo.com.cn (14,579), qq.com (14,317), gmail.com (10,643), 263.net (9,367) and sohu.com (9,136), which together account for 80.65% of the population.
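A domain count of this kind can be sketched as follows. This is an illustrative sketch and not the authors' collection algorithm; the helper names and the sample addresses are hypothetical, while the nine counts used in the final check come from the text.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def email_domain(address: str) -> Optional[str]:
    """Return the lower-cased part after the last '@', or None if malformed."""
    local, sep, domain = address.strip().lower().rpartition("@")
    return domain if sep and local and domain else None


def top_domains(addresses: Iterable[str], n: int) -> List[Tuple[str, int]]:
    """Count resumes per email domain and return the n most common domains."""
    counts = Counter(d for d in map(email_domain, addresses) if d)
    return counts.most_common(n)


# Hypothetical addresses, for illustration only.
sample = ["a@163.com", "B@163.COM", "c@qq.com", "not-an-email", "d@163.com"]
print(top_domains(sample, 2))

# Check of the reported share of the nine largest domains.
top9 = [33596, 30003, 18941, 17988, 14579, 14317, 10643, 9367, 9136]
share = round(100 * sum(top9) / 196609, 2)
print(share)
```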

Highest Education. There are 203,026 resumes with the “highest education” dimension. The distribution is shown in Fig. 1. Undergraduates make up 61.5% of the population, and holders of a master's degree make up 17.78%. To some extent, this distribution reflects the education composition of the population.

Fig. 1. The distribution of the highest education.

Language. There are 127,435 resumes with the “language” dimension. Some employees master two or more foreign languages. So there are 165,727 language items. 74.47% of the population can speak English, 13.11% of the population can speak Japanese, followed by French (4.34%), German (3.15%), Korean (1.76%), Russian (0.86%), Spanish (0.54%) and Italian (0.11%). Few people can speak other languages.

Skill. There are 59,968 resumes with the “skill” dimension, and 25,840 different skill items appear in these resumes. Some employees have two or more skills. Most of these skills are related to computers. The ten skills that appear in the most resumes are Visual Basic (7,694), Access (4,621), TCP/IP (4,357), FoxPro (4,328), Visual C++ (4,235), Management Information System (3,800), Java (3,479), MS-SqlServer (3,425), HTML (3,210) and Oracle (3,136).

Monthly Income. The current salary is very valuable information and an important quantitative indicator of an employee’s ability. There are 149,601 resumes with salary-related information. Our data collection algorithm divides the monthly income into 9 categories and classifies each resume based on its salary-related information. The distribution of the monthly incomes is shown in Fig. 2. This distribution is broadly consistent with the officially released statistics of Beijing residents’ income.

Fig. 2. The distribution of the monthly incomes.

University. The education experiences have a direct impact on the work experiences. There are 259,969 resumes with the “education” dimension. Some employees give the Chinese names of their universities, while others give the English names, so the data collection algorithm merges each Chinese university name with the corresponding English name. After this integration, there are 79,433 different universities in these resumes. The ten universities with the most graduates are University of International Business and Economics (6,175 graduates), Renmin University of China (5,481), Beijing International Studies University (5,393), Beijing University of Technology (5,217), Beijing Foreign Studies University (4,710), Capital University of Economics and Business (4,661), Peking University (4,408), Beijing Institute of Technology (3,654), Tsinghua University (3,317) and University of Science and Technology Beijing (3,116).

All ten of these universities are in Beijing, because the resumes in our data set were collected from companies in Beijing and most people choose to work in the city where they studied. Beijing hosts many different types of universities and companies, so we consider the bias introduced by collecting resumes from Beijing companies to be limited. In addition, the data set is large and the resumes are of high quality.
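The merging of Chinese and English university names described above can be sketched with an alias table. The paper does not describe how the matching is actually done, so the lookup below is an assumption, and the table contents and function names are hypothetical.

```python
from collections import Counter
from typing import Iterable

# Hypothetical alias table: each known spelling maps to one canonical name.
ALIASES = {
    "清华大学": "Tsinghua University",
    "tsinghua university": "Tsinghua University",
    "北京大学": "Peking University",
    "peking university": "Peking University",
}


def canonical_university(name: str) -> str:
    """Map a raw school name to its canonical form; unknown names pass through."""
    key = name.strip()
    # lower() only affects Latin letters, so Chinese keys match unchanged.
    return ALIASES.get(key.lower(), key)


def graduates_per_university(names: Iterable[str]) -> Counter:
    """Count education items per canonical university name."""
    return Counter(canonical_university(n) for n in names)


raw = ["清华大学", "Tsinghua University", " tsinghua university ", "北京大学"]
print(graduates_per_university(raw))
```

Counting after normalization keeps one entry per institution regardless of the language in which the employee wrote the name.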

Major. There are 71 different majors in the 259,969 resumes with the “education” dimension. Six majors have more than ten thousand graduates each, as shown in Fig. 3. The six majors are foreign languages, business administration (e.g., marketing/international trade/tourism/logistics), economics (e.g., financial/accounting), information science (e.g., electrical/computer/software/network/communications), mechanical (e.g., automation/industrial design) and law.

Fig. 3. The distribution of top six majors with most graduates.

Work Experiences. Work experiences are a crucial part of a resume. There are 174,833 resumes with the “work” dimension, which provides rich resources for further correlation exploration. Many employees have worked in more than one company, and 341,720 different companies appear in total. The companies with the most employees are Lenovo (574), IBM (501), Huawei (348), Hewlett Packard (HP) China (346), Siemens China (288), CCTV (275), Bank of China (269), Beijing Organising Committee for Olympic (246) and Industrial and Commercial Bank of China (233). These statistics show that the share of employees at any single company is very low; for example, IBM accounts for about 0.3%. This verifies the diversity of the work experiences in our data set and is helpful for correlation exploration.

There are 111,419 different departments in the 174,833 resumes. The ten departments with the most employees are sales (25,418), marketing (19,368), finance (17,477), administration (11,765), human resources (8,699), technical department (5,411), business (3,163), office (2,820), engineering department (2,610) and manager office (2,130). Each of the top four departments has more than ten thousand employees. Pattern mining on a single department is therefore already valuable, and even more so on the entire data set.

We define several job levels and, within each level, job categories based on human resources expertise. For example, the levels include sales, marketing, IT, research and finances, and the sales level contains categories such as sales manager, sales assistant and sales representative. There are 1,385 different level-category combinations in the 174,833 resumes with work experiences. The jobs with the most employees are administration - administrative assistant (20,138), other - other (18,464), administration - manager assistant/secretary/clerk (16,514), translation - English translator (10,536), sales - sales representative (10,276), sales - sales manager (10,041), finances - accountant (9,707), human resources - HR specialist/assistant (7,750), marketing - specialist/assistant (7,572) and sales - sales assistant (7,270). The abundant job categories and the large number of resumes with job information provide solid data support for the proposed system.

Training Experiences. There are 70,859 resumes with the “training” dimension, and 83,706 different training organizations appear in these resumes. Some employees participate in two or more training courses. Most of them are language training. The most popular training institution is New Oriental School. 3,802 employees attended the courses in New Oriental School, accounting for 5% of the population. The other two of the three most popular training institutions are Sunlands (1,447) and Beijing Foreign Studies University (1,047).

4 System Design

A screenshot of ResumeVis is shown in Fig. 4. The interface has six parts: title bar, improved parallel coordinates, introduction and controls, attribute filter, group selection and displayed attribute selection.

Fig. 4. The screenshot of ResumeVis.

Title Bar. The title bar clearly shows the system name “ResumeVis: interactive correlation explorer of resumes”.

A button named “keep” is used to filter and keep the items based on the attribute values. If you want to focus on some values in an attribute, you select these values by dragging vertically along the attribute axis first. Then after pressing the keep button, the items which do not meet the condition will be removed. More importantly, multiple attributes can be filtered collaboratively to support various analytical tasks. After filtering by one attribute, you can repeat the same operation to filter items by another attribute. In this way, the items are filtered by two attributes at the same time.

At the right of the title bar, there are two numbers separated by a slash. The right one is the total number of items, while the left one is the number of items that have been drawn in the parallel coordinates so far. Because of the large amount of data, the left number also serves as a loading progress indicator. Figure 4 shows the result after removing the items with group value “0”, which leaves 149,601 items in total.

Improved Parallel Coordinates. This part is the core of the system. We improve the traditional parallel coordinates to adapt to resume data. Convenient interactions are integrated in the improved parallel coordinates for flexibility, which will be explained in detail in the “interaction” subsection.

Introduction and Controls. This part shows a brief introduction of the system and the interactions for the improved parallel coordinates.

Group Selection. In the system, one attribute can be set as the group, and a different color is assigned to each possible value of that attribute, which helps users distinguish between values. Only one attribute can be selected as the group at a time, using the dropdown box. However, too many colors are easily confused, so attributes with too many values, such as “school”, “company” and “job”, cannot be set as the group. In this system, the maximum number of possible values in the group is 71, which is the number of different majors. The 71 majors are encoded from 1 to 71 based on the number of employees in each major.

The color encodings for all the possible values in the group attribute and the meanings of these values are shown under the dropdown box. For example, in Fig. 4, we set the attribute “monthly income” as the group, and the color encodings and meanings of ten possible values are shown, in which “0” indicates that there is no income information.
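The frequency-based encoding of group values mentioned above can be sketched as a rank encoding. The paper does not state the direction of the ranking, so assigning code 1 to the most frequent value is an assumption here, as are the function name and the alphabetical tie-break.

```python
from collections import Counter
from typing import Dict, Iterable


def rank_encode(values: Iterable[str]) -> Dict[str, int]:
    """Assign codes 1..k to distinct values, most frequent first (assumed order)."""
    counts = Counter(values)
    # Ties are broken alphabetically so that the encoding is deterministic.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: code for code, value in enumerate(ordered, start=1)}


# Hypothetical major labels, for illustration only.
majors = ["law", "economics", "law", "art", "law", "economics"]
print(rank_encode(majors))
```

With such an encoding, small codes correspond to the most populated values, so the axis and the color legend list the largest groups first.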

Attribute Filter. This part is used to filter items by attribute value, especially for attributes with too many values. First, select an attribute in the dropdown box, then input a value and press the filter button. For example, select the attribute “school”, then input the value “Tsinghua University”. After the filter button is pressed, the parallel coordinates display only the items whose school is Tsinghua University. This part compensates for the limitation of group selection.

Fig. 5. An example of parallel coordinates with multi-valued attributes.

Displayed Attribute Selection. There are 21 attributes in our data set, and displaying all of them in the parallel coordinates makes the view too crowded. The attribute selection part at the bottom right of the interface therefore allows users to choose which attributes to display by checking the corresponding check boxes. The parallel coordinates change dynamically based on the user selection.

The system uses the improved parallel coordinates to depict resumes from the following two perspectives.

Personal Perspective: Sometimes we focus on one employee’s resume, such as that of a successful person. The personal perspective is designed for this purpose. “Parallel coordinates with multi-valued attributes” are used to represent a resume in which one attribute has multiple values. To protect personal privacy, identifying information such as names, Chinese ID numbers and emails is not visible to users.

Correlation Perspective: Parallel coordinates are naturally suited to correlation exploration. Combined with flexible and diverse interactions, such as brushing, changing axes, reordering axes and group setting, the system provides rich methods of correlation exploration and helps users obtain a more comprehensive understanding of resumes.

4.1 Parallel Coordinates with Multi-valued Attributes

In a resume, some attributes have multiple values. For example, the education experiences of an employee may include three periods, undergraduate, graduate and PhD, and the work experiences may contain multiple companies and jobs. In the standard parallel coordinates, an item is a polyline, and one attribute has only one value. So the standard parallel coordinates are not suitable for resume data. In this paper, we propose the parallel coordinates with multi-valued attributes, which can show multiple values in one attribute to meet the characteristics of resumes.

An example of parallel coordinates with multi-valued attributes is shown in Fig. 5. The employee is female and was born in 1979; each of her demographic attributes has a single value. She has two foreign languages, English and one minority language. She studied at a college from 1999 to 2001 and was an undergraduate from 2002 to 2005. The major in both stages is economics, whose major code is “3”. The example shows that parallel coordinates with multi-valued attributes display all the attributes in a resume clearly.
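One possible way to turn such an item into drawable line segments is sketched below. The paper does not specify how the values on neighboring axes are connected, so linking every value on one axis to every value on the next axis is purely an assumption of this sketch, and the function name and attribute keys are hypothetical.

```python
from itertools import product
from typing import Any, Dict, List, Sequence, Tuple

Vertex = Tuple[str, Any]          # (axis name, value on that axis)
Segment = Tuple[Vertex, Vertex]   # line between two adjacent axes


def item_segments(item: Dict[str, Any], axes: Sequence[str]) -> List[Segment]:
    """Return the segments of one item; multi-valued attributes fan out."""

    def values(axis: str) -> List[Any]:
        value = item.get(axis)
        if value is None:
            return []  # a missing attribute contributes no vertex
        return list(value) if isinstance(value, (list, tuple)) else [value]

    segments: List[Segment] = []
    for left, right in zip(axes, axes[1:]):
        # Assumed linking rule: all pairs of values on neighboring axes.
        for a, b in product(values(left), values(right)):
            segments.append(((left, a), (right, b)))
    return segments


item = {
    "gender": "female",
    "year_of_birth": 1979,
    "language": ["English", "minority language"],
    "qualification": ["college", "undergraduate"],
}
axes = ["gender", "year_of_birth", "language", "qualification"]
print(len(item_segments(item, axes)))
```

A single-valued item degenerates to the ordinary polyline of standard parallel coordinates, with one segment between each pair of adjacent axes.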

4.2 Interaction

There are flexible and rich interactions in the system to support various analytical tasks.

Brush. The brush interaction filters items by attribute value when the user drags vertically along an attribute axis. The brush can be used together with the keep button. The brush and the keep button differ in two ways: (1) The brush filters only one attribute at a time; brushing a second attribute cancels the first brush. With the keep button, multiple attributes can be filtered simultaneously. (2) With the brush, the shapes and axes of the parallel coordinates stay the same, except that the items which do not meet the condition are hidden. After the keep button is pressed, however, the axes adapt to the selected values and the parallel coordinates change. To remove the brush, tap the axis background.
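The difference between the two interactions can be sketched with a small state object. This is a simplified model under stated assumptions, not the D3.js implementation of ResumeVis: the class and method names are hypothetical, attributes are treated as single numeric values, and an axis extent is modeled as the minimum and maximum over the retained items.

```python
from typing import Any, Dict, List, Optional, Tuple

Item = Dict[str, Any]


class ParallelCoordinatesState:
    """Toy model of brush and keep; all names are illustrative assumptions."""

    def __init__(self, items: List[Item]) -> None:
        self.items = list(items)
        self.brush: Optional[Tuple[str, float, float]] = None

    def set_brush(self, attribute: str, low: float, high: float) -> None:
        # A new brush replaces any previous one: one attribute at a time.
        self.brush = (attribute, low, high)

    def clear_brush(self) -> None:
        self.brush = None

    def visible(self) -> List[Item]:
        """Items currently drawn; the brush hides items but deletes nothing."""
        if self.brush is None:
            return list(self.items)
        attribute, low, high = self.brush
        return [i for i in self.items if attribute in i and low <= i[attribute] <= high]

    def keep(self) -> None:
        """Permanently retain the brushed items, so later filters can stack."""
        self.items = self.visible()
        self.brush = None

    def axis_extent(self, attribute: str) -> Optional[Tuple[float, float]]:
        """Axis range over retained items; it changes only after keep."""
        values = [i[attribute] for i in self.items if attribute in i]
        return (min(values), max(values)) if values else None


# Hypothetical items with an income category and a year of birth.
state = ParallelCoordinatesState([
    {"income": 1, "birth": 1985},
    {"income": 5, "birth": 1962},
    {"income": 5, "birth": 1978},
    {"income": 9, "birth": 1955},
])
state.set_brush("income", 5, 5)
print(len(state.visible()), state.axis_extent("income"))  # brushed, axis unchanged
state.keep()
print(state.axis_extent("income"))  # axis adapted to the kept items
state.set_brush("birth", 1970, 1999)
state.keep()
print(state.items)  # filtered by two attributes
```

Repeating brush and keep on different attributes is what allows items to be filtered on several attributes at once, while a brush alone always replaces the previous selection.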

Change Axis. Any combination of the 21 attributes in the resumes can be displayed by checking the check boxes at the bottom right of the interface. In addition, users can drag an axis label to the left edge to remove that attribute.

Reorder Axis. The axes in the parallel coordinates can be arranged in any order by dragging one attribute label horizontally at a time. To see the direct correlation between two attributes, drag one attribute label horizontally until the two axes are next to each other.

Invert Axis. The values in each axis can be inverted upside down by tapping the axis label.

Group Settings. Any attribute with a small number of possible values can be set as the group. The group can be seen as the focus of the current data analysis.

Attribute Filter. The items can be filtered by each value in any attribute.

5 Experiments

The parallel coordinates with multi-valued attributes in the system are implemented with D3.js [22] and are based on open source code [23]. The following case studies are presented as examples to validate the usability of the system.

5.1 Case Study 1: Which Attributes Are Related to Income?

Job seekers are most concerned about income, so we use all the items in the data set to look for the attributes related to income. It is commonly believed that the higher the education, the higher the income, and that income increases with age. Is that right? In this case study, we explore the correlations among income, year of birth and highest education.

First, the attribute “monthly income” is set as the group, so that all 372,829 items are displayed in different colors. The color encodings and meanings of all the monthly income types are shown at the bottom of the group selection part. Second, the attributes “year of birth” and “highest education” are placed next to the attribute “monthly income” by using the axis reordering interaction. Finally, the brush and the keep button are used alternately to filter items by focusing on one value at a time. Specifically, we select one value of “monthly income” with the brush, press the keep button to retain only the items with that value, and then select all the values of “year of birth” into which the filtered items fall. The results are shown in Fig. 6.

Fig. 6. Correlations among income, highest education and year of birth with monthly income as the group. The income in the left one is below 1000 yuan; the income in the middle one is between 6000 and 7999 yuan; the income in the right one is over 25000 yuan.

Figure 6 shows the correlations among income, highest education and year of birth for three values of monthly income. The left panel is the result when the monthly income is below 1000 yuan. The highest education is concentrated in secondary school, high school, college and undergraduate, and there are very few highly educated employees. No item has a year of birth before 1970. This is why we use the brush and the keep button to filter items on two attributes: we want the items to spread along the axes as far as possible so that they are displayed more clearly. The middle panel is the result when the monthly income is between 6000 and 7999 yuan. The distribution of highest education is relatively uniform. There are few items with EMBA as the highest education, because few resumes meet this condition, as shown in Fig. 1. The year of birth ranges from 1950 to 2000. The right panel is the result when the monthly income is over 25000 yuan. The highest education is concentrated in the higher degrees, and there are few resumes with the values “secondary school” and “high school”. In addition, there are many items with EMBA as the highest education, and more older employees.

To cross-validate the correlation between monthly income and highest education from another point of view, we set the attribute “highest education” as the group. The results are shown in Fig. 7. The left panel is the result when the highest education is “high school”. The monthly income is concentrated in the low income categories, and few items have a monthly income above 1000 yuan. The right panel is the result when the highest education is “Ph.D”. There are more items with high income and few items with low income. The case study therefore supports the common belief that higher education is associated with higher income, and it also shows a certain correlation between income and age.

Fig. 7. Examples of correlations between highest education and income with highest education as the group. The highest education in the left one is high school; the highest education in the right one is Ph.D.

Fig. 8. Correlations among some attributes with major as the group. The major in the left one is foreign language; the major in the middle one is information science; the major in the right one is art.

5.2 Case Study 2: Are There Differences Between Different Majors?

First, the attribute “major” is set as the group. To see the correlation between an attribute and the attribute “major”, drag that attribute label horizontally until its axis is next to major. Then press the keep button to keep the items with a specific major value.

Correlations among major, language and monthly income are shown in Fig. 8(a). The distribution of language in the left image (foreign language major) is more even than the distributions in the other two images (information science and art), which is in line with our expectation. In addition, there are no significant differences in the distribution of monthly income. Correlations between major and highest education are shown in Fig. 8(b). Three majors are selected: foreign language, information science and art. The number of doctors in art is clearly smaller than in the other two majors. In fact, compared with other majors, fewer departments in Chinese colleges and universities award doctoral degrees in art.

6 Conclusion

In this paper, we propose an interactive visualization system, called ResumeVis, to explore the correlations of resumes. First, we construct a large resume data set with complete attributes, which contains 372,829 Chinese resumes and valuable attributes. The data distribution analysis verifies the diversity and validity of the data set. Then, improved parallel coordinates with multi-valued attributes are proposed to adapt to the characteristics of resume data. Combined with flexible and rich interactions, the system supports various analytical tasks. Finally, the case studies validate the usability of the system.