
1 Motivation and Background

Improving internet penetration and digital literacy are driving the growth of crowdsourcing (image recognition, language translation, responses to questions, and other micro and macro tasks). Because workers respond for quick money or with incomplete or incorrect knowledge, responses suffer in quality. Current quality control processes use majority voting, peer reviews, data mining, fault-tolerant sub-tasks, game theory and other hybrid modes [6]. Crowdsourcing Q&A platforms such as StackExchange and Quora use majority voting to assess the quality of responses, which implies manual intervention and latency until viewers rate or vote. We extracted 135 responses to a sample of 50 questions on ‘Phishing’ from Quora. More than 77% of the responses were not related answers. Among the related and relevant responses, more than 90% received few or no votes from the crowd even though they were semantically equivalent to a relevant, higher-rated answer. Lack of recognition (viewer rating) could de-motivate intrinsic contributors. We also conducted an online survey in September 2016 to understand the quality concerns in crowdsourcing; social media sites, including the CrowdsourcingWeek LinkedIn group, were used to recruit participants. The majority of the survey respondents were IT-savvy working professionals from Asian countries, and 76% of these 212 respondents stated that crowdsourced responses have poor quality. The quality gaps in crowdsourced Q&A are the motivation for this research.

We propose a Completeness, Consistency and Correctness (3Cs) approach, adopted from Software Requirements Engineering (RE), for quality control of crowdsourced questions and answers. Just as software products are built from stakeholders’ requirements (their understanding and knowledge level), responses in crowdsourcing reflect workers’ knowledge level. Hence, we hypothesize that the rigor of the 3Cs will differentiate good responses from bad ones, thus leading to quality control. Though there are many other RE quality characteristics, such as traceability, modifiability and unambiguity, the importance of the 3Cs is unequivocally stated in research publications [1, 10, 14], the ISO/IEC 25010:2011 standard and Gartner market research.

An example of a complete response to the question ‘what are the key characteristics of Information Security’ is ‘Confidentiality, Integrity and Availability’. Completeness has a puritan view, with many forms such as functional, syntactic and semantic completeness. Taking direction from Gabriel’s comments in ‘The rise of worse is better’, we measure completeness as the degree of coverage of real-world situations in the response(s), ensuring unnecessary or irrelevant features are not captured. Obtaining complete information for a domain is a never-ending problem [3]. Hence, our completeness measure for a response is computed with reference to an extracted knowledge base (KB) and termed Adequate Completeness (ACP).
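As a rough illustration (hypothetical function and term names, not our final ACP definition), a coverage-based completeness measure could score a response by the fraction of topic-relevant KB terms it mentions:

```python
import re

def adequate_completeness(response, kb_terms):
    """Fraction of topic-relevant KB terms covered by the response (token overlap).

    Illustrative only: the actual measure may rely on semantic similarity
    rather than exact token matches.
    """
    tokens = set(re.findall(r"[a-z0-9]+", response.lower()))
    if not kb_terms:
        return 0.0
    covered = {t for t in kb_terms if t.lower() in tokens}
    return len(covered) / len(kb_terms)

# KB terms for 'key characteristics of information security'
kb_terms = {"confidentiality", "integrity", "availability"}
print(adequate_completeness("Confidentiality, Integrity and Availability", kb_terms))  # 1.0
print(adequate_completeness("Integrity only", kb_terms))                               # ~0.33
```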

Consistency is the measure of conflict-free sentences in the response with respect to the objective (question). An example of consistency in a response, as extracted from a crowdsourcing platform for the question ‘What are the security features of Amex credit card’, is: ‘Amex credit card has 2 levels of security: they have the normal CVV (Card Verification Value) and the 3 digits are a CID (Customer Card Identity). CVV is a calculated highly secure 4 digit code based on your card number that is not contained in the card magnetic strip’. Based on an evolving ontology with increasing instances in the KB, our consistency measure ACN counts conflict-free tuples (Concept + Relationship + Concept) in the response. This also means that the response is not just a bag of words but a set of sentences that are cohesive and conflict-free. The history of past contributions (credibility) of a worker in the topic (question-answer) is also a factor in our consistency measure. We relate credibility to consistency rather than correctness, as both are measures of trust rather than rightness.
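A minimal sketch of the intended tuple-level conflict check follows; tuple extraction from free text (via the ontology) is assumed to happen elsewhere, and the KB facts shown are purely illustrative:

```python
def adequate_consistency(response_tuples, kb_tuples):
    """Fraction of response tuples that do not conflict with the KB.

    Simplified rule: a response tuple (s, r, o) conflicts when the KB holds
    the same subject and relation but a different object.
    """
    if not response_tuples:
        return 0.0
    kb_index = {}
    for s, r, o in kb_tuples:
        kb_index.setdefault((s, r), set()).add(o)
    consistent = 0
    for s, r, o in response_tuples:
        known = kb_index.get((s, r))
        if known is None or o in known:
            consistent += 1
    return consistent / len(response_tuples)

# Illustrative KB fact and a conflicting response tuple
kb = [("password", "stored_as", "salted hash")]
resp = [("password", "stored_as", "plain text")]
print(adequate_consistency(resp, kb))  # -> 0.0 (conflicts with the KB)
```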

Correctness is the degree to which a response contains the conditions and limitations for the desired capability (question). Hence, the correctness of a response is not necessarily binary (Yes/No or True/False) but a degree of match/similarity. An example of a correct response is ‘Authentication is used for providing an access entry into the system’. Like completeness, our correctness measure [3] is computed with respect to the extracted KB. The Adequate Correctness (ACR) of a crowdsourced response is based on the occurrences of semantically similar content in the extracted KB and the relation to the question type (What, Why, When, Where, Who and How - the 5W & 1H).
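As a simple illustration of the question-type component of ACR, a rule-based first pass over the 5W & 1H keywords could look like the following (names are illustrative; the actual approach may use the ML classifiers discussed in Sect. 4.4):

```python
# Minimal sketch of mapping a question to its 5W & 1H type for the ACR measure.
QUESTION_TYPES = ("what", "why", "when", "where", "who", "how")

def question_type(question):
    """Return the 5W1H type of a question, or 'other' if none is found."""
    for word in question.lower().split():
        if word in QUESTION_TYPES:
            return word
    return "other"

print(question_type("What are the key characteristics of Information Security?"))  # what
print(question_type("How does authentication control access to a system?"))        # how
```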

We propose the CSQuaRE score, based on ACP, ACN and ACR, for assessing the quality of CrowdSourced responses to information-security-related questions using a Requirements Engineering approach. Our proposed approach would be demonstrated on ‘Information Security’ crowdsourced responses. Our past experiences and the existence of security-related information-exchange platforms such as StackExchange, AlienVault, etc. give us confidence that individuals are comfortable seeking and responding to security-related questions on public platforms. Crowdsourcing Week, a leading website on crowdsourcing, has identified security information exchange as one of the top emerging trends.

2 Research Questions

Addressing the following research questions would provide a quantitative measure, CSQuaRE, for assessing the 3Cs in a response.

  • (Q1) What are the dimensions of completeness, correctness and consistency of a response that can be measured automatically?

    Increasing the KB for completeness can lead to inconsistency, while completeness and consistency of a response enhance correctness. The interplay among these 3Cs has to be identified to avoid double-counting or negation in the CSQuaRE calculation; this includes determining the degree of the relationship (linear or polynomial).

  • (Q2) What is the credibility of a worker in the past while responding to questions in a specific topic/domain?

    Most existing crowdsourcing platforms limit credibility assessment to the platform and/or task level. A crowd worker may not be active on the crowdsourcing platform but may have deep knowledge in the question domain and could be a prolific contributor on other internet sites. This research includes identifying the person and his/her credibility in the specific question domain from the obtained KB.

  • (Q3) What is the temporal effect on the response of a question with respect to completeness, consistency and correctness?

    As the KB grows with time, a response that had a certain CSQuaRE may change over a period. As an example, the strength of cryptographic hash algorithms has improved from SHA-1 to SHA-2 and so on. Hence, a response’s CSQuaRE requires re-calibration to maintain the 3Cs.

As part of our research, we also plan to crawl the internet to extract security-related information, conduct a study on the importance of text cohesion in crowdsourced responses, and develop an evolving ontology (KB) based on new instances of extracted information.

3 Related Work

The related work covers quality control in crowdsourcing, the 3Cs and their attributes for quality control, credibility assessment, and Q&A platforms.

3.1 Quality in Crowdsourcing

Afra et al. [8] used credibility based on past contributions and contributors’ mobility patterns for quality control. Aroyo et al. [11] performed quality assessments of Q&A postings using disagreement-based metrics to harness human interpretation. In a recent study, Bernstein et al. [12] discuss the reputation of crowd workers and the importance of peer reviews. The existing quality control mechanisms are hard-wired and not multi-dimensional. Other related literature discusses the use of game theory (e.g., multi-armed bandits), better task clarity, cascade-model effects, ground truth, and worker experience and language nativity for evaluating the quality of workers/tasks. We plan to use the reputation/credibility of crowd workers based on past contributions, and ground truth in the form of a KB, in our quality control approach.

3.2 Completeness, Consistency and Correctness

Siegemund et al. [9] use an ontology model to identify consistency and completeness of evolving requirements. Lami et al. [7] presented a methodology and tool for evaluating natural-language requirements for consistency and completeness. McCall’s quality model and the work of Zowghi et al. [15] identified the interplay among the 3Cs, which we plan to extend in our CSQuaRE measurement. Behavioural aspects of the worker, such as credibility based on past contributions and profile, contribute to consistency in quality. We plan to extend the work of Kumaraguru et al. [4] on Twitter tweets to compute a credibility score for responses specific to the question domain.

3.3 Question and Answers

Question answering systems have transformed considerably over the last four decades, in step with natural language processing (NLP) techniques. In 1978, the first classic Q&A book, based on Lehnert’s thesis, provided a fundamental basis for research. The availability of the TREC corpus and research in the biomedical domain gave impetus to Q&A platforms. Hirschman et al. [5] factored in the importance of completeness and correctness in Q&A platforms. The articles and publications on AnswerBus [13], START from MIT, etc. describe the use of advances in NLP and AI for providing these services. Publicly available literature on IBM Watson, Apple Siri, etc. states the usefulness of Q&A and the importance of continuous evolution/training based on a KB. While none of these platforms involves crowd workers for quality control, their reliance on an evolving KB and on cohesion in responses aligns with our approach.

As evident from the reviewed literature, there is no comprehensive approach for quality control of crowdsourced responses that uses the credibility of a worker on the internet, the temporal effect on the quality of a response, and domain knowledge for measuring completeness, consistency and correctness.

4 Proposed Approach

To demonstrate the 3Cs approach for quality control, we plan to build a crowdsourcing platform prototype using available open-source Q&A software after a technical and functional evaluation. The following sections describe the progress of work, marked as ‘In-progress’ or ‘Yet to begin’, for addressing the research questions. The schematic in Fig. 1 depicts the approach for implementing quality control of crowdsourced responses.

Fig. 1. Approach for quality control in crowdsourcing

4.1 Building Domain Repository: In-Progress

The measurement of CSQuaRE is based on the extracted domain content. More than 934,000 security-related URLs have been obtained from Wikipedia and Twitter. These URLs are categorized into the 14 groups and 114 controls of ISO/IEC 27001:2013 to ensure representation across sub-domains. The content crawled from the seed URLs is cleansed (stop-word removal and stemming) and classified into security sub-domains. As there is no prevalent security search engine, we plan to provide an interface (search engine) to the extracted domain content by June 2017. This would also provide a user base and an opportunity to seek feedback on data relevance from security experts associated with the banking community and DSCI, a voluntary organization with security experts as its members.
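A minimal sketch of the cleansing step, assuming NLTK’s English stop-word list and the Porter stemmer (the actual pipeline may differ):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)  # one-time corpus download

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def cleanse(text):
    """Lower-case, tokenize, drop stop words and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [STEMMER.stem(t) for t in tokens if t not in STOP_WORDS]

print(cleanse("Phishing emails are designed to steal user credentials"))
# -> stemmed tokens such as 'phish', 'email', 'design', 'steal', 'user', ...
```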

4.2 Ontology Evolution: In-Progress

The extracted domain content would be represented as an ordered pair of TBox (concepts and relationships) and ABox (assertions) using an existing security ontology [2] and Word2Vec. We use Word2Vec, trained on roughly 100 billion words from the Google News archives, for similarity mapping between ontology terms and the extracted internet content. This ordered pair (TBox and ABox) of the ontology would be treated as the knowledge base (KB), which would be used to evaluate the ACP, ACN and ACR of crowdsourced responses. However, the KB needs updating over time as the concepts and relationships of a domain evolve or change, which in turn requires re-calibration of past responses. Automatically updating the ontology based on the increased domain content may not be acceptable to ontologists, so an observable pattern in the KB would be identified for ontologists to update the security ontology. The observable pattern of extracted domain content would be assessed on text cohesion, relevance of the text to the 114 security controls, and credibility of the information source. The text of the responses will also be used for ontology evolution, as the responses may contain information that is not available in the crawled content, and this text could be used for quality control of related future questions.
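The following sketch shows the intended Word2Vec-based similarity mapping between a crawled term and ontology concepts, using gensim and the pre-trained Google News vectors; the file name, threshold and function name are illustrative assumptions:

```python
from gensim.models import KeyedVectors

# Pre-trained Google News vectors (~100B-word corpus); path is an assumption.
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def maps_to_concept(term, ontology_concepts, threshold=0.6):
    """Return the ontology concept most similar to a crawled term, if above threshold."""
    best, best_sim = None, threshold
    for concept in ontology_concepts:
        if term in model and concept in model:
            sim = model.similarity(term, concept)
            if sim > best_sim:
                best, best_sim = concept, sim
    return best

# Which security concept does a crawled term map to (if any)?
print(maps_to_concept("login", ["authentication", "encryption", "availability"]))
```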

4.3 Credibility: Yet to Begin

As stated earlier, credibility in a question domain is part of our consistency measure. We plan to have users log in to our crowdsourcing platform with their Twitter ID. Crowd worker credibility would be based on past contributions on the crowdsourcing platform and the credibility score on Twitter in the question-answer related topic. We are also in the process of evaluating the credibility of websites containing information-security content, to ensure that not every available piece of content is used for ontology evolution.

$$Credibility = \lbrace TwitterCred,\ Site,\ Contributions, \ldots \rbrace$$
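A minimal sketch of how these signals could be combined, with placeholder weights and inputs assumed normalized to [0, 1] (not the final credibility model):

```python
def credibility(twitter_cred, site_cred, past_contributions, weights=(0.4, 0.3, 0.3)):
    """Weighted combination of normalized credibility signals into a [0, 1] score.

    Signal names and weights are illustrative placeholders.
    """
    signals = (twitter_cred, site_cred, past_contributions)
    return sum(w * s for w, s in zip(weights, signals))

print(credibility(twitter_cred=0.8, site_cred=0.5, past_contributions=0.9))  # 0.74
```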

4.4 Assignment of CSQuaRE: Yet to Begin

A question posted on the crowdsourcing platform may have one or more responses, and every response would be assigned a CSQuaRE score based on ACP, ACN and ACR. Information retrieval evaluation metrics such as recall and Latent Semantic Indexing are being explored for measuring Completeness (ACP). NLP techniques such as cohesion analysis (part of discourse analysis, to ensure responses are not just a bag of words but related sentences), together with the individual’s credibility in the domain based on past contributions, would form the Consistency (ACN) measure. Users of our proposed crowdsourcing platform will also be able to vote on responses; this voting would act as a feedback loop for improving the credibility of the crowd worker. For the Correctness (ACR) measure, we are exploring machine learning approaches (Decision Tree and SVM) for matching the question type against the response, and FrameNet to obtain the semantic similarity of the response with reference to the KB.

The initial weights for each of the components (ACP, ACN and ACR) of the CSQuaRE score would be equal, the score would be scaled to 10 (0 being unrelated and 10 being highest), and the weights would be refined based on the feedback loop. Some of the factors in calculating the scores are:

$$ACP = \lbrace TermCoverage,\ OntologyDepth, \ldots \rbrace$$
$$ACN = \lbrace DLMatch,\ IndividualCredibility, \ldots \rbrace$$
$$ACR = \lbrace QuestionType,\ ResponseSimilarity, \ldots \rbrace$$
$$CSQuaRE = \lbrace ACP,\ ACN,\ ACR \rbrace$$
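Putting these together, a minimal sketch of the initial, equally weighted CSQuaRE scoring (to be refined via the feedback loop; component scores are assumed normalized to [0, 1]):

```python
def csquare(acp, acn, acr, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine component scores (each in [0, 1]) into a 0-10 CSQuaRE score.

    Equal initial weights, as planned; they would later be refined by the
    feedback loop from viewer voting and expert evaluation.
    """
    score = sum(w * c for w, c in zip(weights, (acp, acn, acr)))
    return round(10 * score, 2)

print(csquare(acp=0.9, acn=0.7, acr=0.8))  # -> 8.0
```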

5 Evaluation Plan

An empirical approach would be used to validate the solutions to the research problems. We plan to extract questions and responses from StackExchange that relate to the 114 control groups of ISO 27001, have more than three respondents, and are rated by viewers. We will provide these responses to security experts and ask them to evaluate the relevance of each response on a scale of 0–10 (0 being unrelated and 10 being highest). We will then assess the CSQuaRE of these responses on our crowdsourcing platform, holding the credibility score constant. We hypothesize that the CSQuaRE score should be similar to the score assigned by the security experts; the viewer ratings of the StackExchange responses may also be high for responses that are scored high by our approach. We also plan to perform a controlled data experiment to assess the applicability of CSQuaRE. As part of the evaluation, each of the measures ACP, ACN and ACR would in turn be held constant to measure its effectiveness for quality control.
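A minimal sketch of the planned comparison between CSQuaRE scores and expert ratings using rank correlation (the numbers below are illustrative, not experimental results):

```python
from scipy.stats import spearmanr

# Both rating sets are on a 0-10 scale, one value per response (illustrative data).
expert_ratings = [9, 7, 2, 8, 5, 1]
csquare_scores = [8.5, 6.0, 3.0, 8.0, 5.5, 2.0]

rho, p_value = spearmanr(expert_ratings, csquare_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```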

Also, a survey would be conducted to obtain feedback on CSQuaRE from security experts. This survey would guide us in identifying gaps and the scope for further refinement of the approach.