
Deep learning, deep change? Mapping the evolution and geography of a general purpose technology


Abstract

General purpose technologies that can be applied in many industries are an important driver of economic growth and of national and regional competitiveness, but there is little research about their geographic dynamics and the role of industrial ecosystems in spurring their development. We address this with an analysis of Deep Learning, a core technique of artificial intelligence systems increasingly recognized as the latest example of a transformational general purpose technology. We identify Deep Learning papers through a semantic analysis of a novel dataset from arXiv, a popular preprints website, and use CrunchBase, a technology business directory, to map business capabilities. After showing that Deep Learning conforms to the definition of a general purpose technology, we study changes in its geography and their drivers, revealing China’s rise in Deep Learning research. We also find that initial volatility in the geography of Deep Learning has been followed by consolidation, suggesting that the window of opportunity for new entrants might be closing. We study the regional drivers of Deep Learning competitive advantage, finding that strong research clusters tend to appear in regions that specialise in research and industrial activities related to Deep Learning, underscoring the importance of supportive innovation ecosystems for the development of general purpose technologies.


Notes

  1. Code and data for reproducing this analysis are available for review at https://github.com/nestauk/arxiv_ai.

  2. The beginning of the automobile industry is a paradigmatic example of this phase, with inventors and entrepreneurs exploring in parallel various energy sources for the automobile, from the combustion engine to electrical and steam-powered motors (Klepper, 1996).

  3. At the same time, there might be some dislocation of activity as standardized parts of the production process are outsourced or off-shored to other locations with cheaper costs.

  4. It should however be noted that in our analysis we exclude papers authored by researchers in multinational corporations since it is not possible to reliably assign their institutions to a single location, an important element of our analysis.

  5. We identify DL papers in the MAG data by selecting fields of study (the keywords that MAG uses to label its data) that appeared as highly salient terms in our topic modeling.

  6. We also create a more restrictive DL category containing only those papers where both topics are present with a topic weight above 0.5, resulting in a total of 1,604 papers. A visual inspection of a random sample of papers in both groups suggests that their outputs are similarly relevant, implying lower recall in the restricted set. This leads us to focus on the more expansively defined set. This is further motivated by our interest in understanding the diffusion of DL methods in various computer science subjects where DL may not be the key ingredient.

  7. Our fuzzy matching strategy, which is described in detail in the "Appendix", achieves a match rate of 90%.

  8. Here we rely on boundary (shapefile) data from the Natural Earth public map dataset.

  9. As before, we geocode CB companies using a point-in-polygon approach with boundaries from Natural Earth.

  10. We focus on the observations with the longest and most informative descriptions, comprising around 200,000 companies.

  11. By setting a high threshold for classification of arXiv papers into CrunchBase categories we seek to remove noise in the transference of the model across corpora with different languages.

  12. We note that they classify computer vision papers and patents outside of DL. This contrasts with our finding that Computer Vision is one of the main application areas for DL, underscoring the value of unsupervised approaches for the analysis of fast moving technology fields.

  13. The results are similar if we focus on the most highly cited papers every year.

  14. Most papers are labelled with multiple arXiv subjects. We allocate a paper to a subject if that subject appears among its labels at least once.

  15. Highly cited papers are those in the top citation quartile for each year.

  16. The measures of related activity weight levels of regional specialization in research subjects and industrial activities by the DL similarity vectors described in the "Appendix".

  17. Our results are robust to changes in these thresholds.

  18. During our robustness tests we have removed some of these interaction terms without substantial changes in our results.

  19. The exception to this, Information Theory (cs.IT), is a catch-all subject present in almost 20% of the computer science arXiv corpus.

  20. This result is also driven by the overlaps between DL and Computer Vision outlined in "GPT aspects of DL research in arXiv" section.

  21. It should however be noted that in our analysis we exclude papers authored by researchers in multinational corporations since it is not possible to reliably match their institutions to a single location, an important element of our analysis.

  22. ‘Fuzzy-matching’ refers to the process of finding a likely match for a set of text (such as a word or sentence) amongst a choice of texts. A naive example would be comparing the ratio of the number of characters between texts, and identifying the texts with the highest ratio as a match.

  23. We do this with a point-in-polygon approach using boundary (shapefile) data from the Natural Earth public map dataset.

  24. We also create a more restrictive DL category containing only those papers where both topics are present with a γ above 0.5, resulting in a total of 1,604 papers. A visual inspection of a random sample of papers in both groups suggests that their outputs are similarly relevant, implying lower recall in the restricted set. This leads us to focus on the more expansively defined set. This is further motivated by our interest in understanding the diffusion of DL methods in various computer science subjects where DL may not be the key ingredient.

  25. We note that London, a region that might have been expected to rank high in this list, is split into its Boroughs (local administrative boundaries) in our analysis.

  26. Researchers who submit their papers to arXiv label them with a set of relevant research categories. We focus our analysis on Computer Science (cs) subjects as well as the stat.ML subject.

  27. These results underscore the importance of triangulating our results against other data sources in future research.

  28. We focus on the observations with the longest and most informative descriptions, comprising around 200,000 companies.

  29. By setting a high threshold for classification of arXiv papers into CrunchBase categories we seek to remove noise in the transference of the model across corpora with different languages.

References

  • Abernathy, W. J., & Utterback, J. M. (1978). Patterns of industrial innovation. Technology Review, 80(7), 40–47.


  • Adner, R. (2017). Ecosystem as structure: An actionable construct for strategy. Journal of Management, 43(1), 39–58.


  • Aghion, P., David, P. A., & Foray, D. (2009). Science, technology and innovation for economic growth: Linking policy research and practice in ‘STIG Systems’. Research Policy, 38(4), 681–693.


  • Agrawal, A., Gans, J., & Goldfarb, A. (2018a). Prediction machines: The simple economics of artificial intelligence. Boston: Harvard Business Press.


  • Agrawal, A., McHale, J., & Oettl, A. (2018b). Finding needles in haystacks: Artificial intelligence and recombinant growth. Tech. rep., National Bureau of Economic Research.

  • Agrawal, A. K., Gans, J. S., & Goldfarb, A. (2018c). Economic policy for artificial intelligence. Working Paper 24690, National Bureau of Economic Research. 10.3386/w24690. http://www.nber.org/papers/w24690

  • Anderson, P., & Tushman, M. L. (1990). Technological discontinuities and dominant designs: A cyclical model of technological change. Administrative Science Quarterly, 35(4), 604–633.

  • Audretsch, D. B., & Feldman, M.P. (1996). R&D Spillovers and the geography of innovation and production. The American Economic Review 86(3):630–640. http://www.jstor.org/stable/2118216

  • Autio, E., & Thomas, L. (2014). Innovation ecosystems. The Oxford handbook of innovation management (pp. 204–288). Oxford: Oxford University Press.


  • Balland, P. A., & Rigby, D. (2017). The geography of complex knowledge. Economic geography, 93(1), 1–23.


  • Boschma, R. (2005). Proximity and innovation: a critical assessment. Regional Studies, 39(1), 61–74.


  • Bostrom, N. (2017). Strategic implications of openness in AI development. Global Policy, 8(2), 135–148.


  • Breschi, S., Lassébie, J., & Menon, C. (2018). A portrait of innovative start-ups across countries. Paris: OECD.


  • Bresnahan, T., & Yin, P. L. (2010). Reallocating innovative resources around growth bottlenecks. Industrial and Corporate Change, 19(5), 1589–1627.


  • Bresnahan, T. F., & Trajtenberg, M. (1995). General purpose technologies ‘Engines of growth’? Journal of econometrics, 65(1), 83–108.


  • Brundage, M. (2016). Modeling progress in AI. In Workshops at the thirtieth AAAI conference on artificial intelligence

  • Brynjolfsson, E., Rock, D., & Syverson, C. (2017). Artificial intelligence and the modern productivity paradox: A clash of expectations and statistics. Tech. rep., National Bureau of Economic Research.

  • Börner, K., Scrivner, O., Gallant, M., Ma, S., Liu, X., Chewning, K., et al. (2018). Skill discrepancies between research, education, and jobs reveal the critical need to supply soft skills for the data economy. Proceedings of the National Academy of Sciences, 115(50), 12630–12637.


  • Cockburn, I. M., Henderson, R., & Stern, S. (2018). The impact of artificial intelligence on innovation. Tech. rep., National Bureau of Economic Research.

  • Dalle, J. M., Den Besten, M., & Menon, C. (2017). Using Crunchbase for economic and managerial research. Paris: OECD.


  • David, P. A. (1990). The dynamo and the computer: an historical perspective on the modern productivity paradox. The American Economic Review, 80(2), 355–361.


  • De Fauw, J., Ledsam, J. R., Romera-Paredes, B., Nikolov, S., Tomasev, N., Blackwell, S., et al. (2018). Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine, 24(9), 1342–1350.


  • Frenken, K., Van Oort, F., & Verburg, T. (2007). Related variety, unrelated variety and regional economic growth. Regional Studies, 41(5), 685–697.


  • Furman, J., & Seamans, R. (2018). AI and the Economy. SSRN Scholarly Paper ID 3186591, Social Science Research Network, Rochester, NY. https://papers.ssrn.com/abstract=3186591

  • Gofman, M., & Jin, Z. (2019). Artificial intelligence, human capital, and innovation. (August 20, 2019).

  • Goldfarb, A., & Trefler, D. (2018). AI and international trade. Tech. rep., National Bureau of Economic Research.

  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge: MIT Press.


  • Hall, B. H., & Trajtenberg, M. (2004). Uncovering GPTs with patent data. Tech. rep., National Bureau of Economic Research.

  • Helpman, E., & Trajtenberg, M. (1994). A time to sow and a time to reap: Growth based on general purpose technologies. Tech. rep., National Bureau of Economic Research.

  • Hidalgo, C. A., & Hausmann, R. (2009). The building blocks of economic complexity. Proceedings of the national academy of sciences, 106(26), 10570–10575.


  • Hidalgo, C. A., Balland, P. A., Boschma, R., Delgado, M., Feldman, M., Frenken, K., Glaeser, E., He, C., Kogler, D. F., & Morrison, A. (2018). The principle of relatedness. In International conference on complex systems (pp 451–457). Springer

  • AI Index (2017). The Artificial Intelligence Index: 2017 annual report. Tech. rep.

  • Karpathy, A. (2015). The unreasonable effectiveness of recurrent neural networks. Andrej Karpathy Blog, 21, 23.


  • Klepper, S. (1996). Entry, exit, growth, and innovation over the product life cycle. The American Economic Review, 86, 562–583.


  • Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in Neural Information Processing Systems 25, Curran Associates, Inc., pp 1097–1105. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf

  • Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. Boston: Houghton Mifflin Harcourt.


  • Mokyr, J. (2002). The gifts of Athena: Historical origins of the knowledge economy. Princeton: Princeton University Press.


  • Owens, J. D., Houston, M., Luebke, D., Green, S., Stone, J. E., & Phillips, J. C. (2008). GPU computing. Proceedings of the IEEE, 96(5), 879–899.


  • Porter, M. E. (1998). Clusters and the new economics of competition (Vol. 76). Boston: Harvard Business Review.


  • Sample, I. (2018). Scientists plan huge European AI hub to compete with US. The Guardian. https://www.theguardian.com/science/2018/apr/23/scientists-plan-huge-european-ai-hub-to-compete-with-us

  • Scott, A., & Storper, M. (2003). Regions, globalization, development. Regional Studies, 37(6–7), 579–593.


  • Taddy, M. (2018). The technological elements of artificial intelligence. Tech. rep., National Bureau of Economic Research.

  • Ver Steeg, G., & Galstyan, A. (2014). Discovering structure in high-dimensional data through correlation explanation. In Advances in Neural Information Processing Systems (pp 577–585)

  • Wang, K., Shen, Z., Huang, C., Wu, C. H., Dong, Y., & Kanakia, A. (2020). Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1), 396–413.


  • Williams, G. (2018). Why China will win the global race for complete AI dominance. Wired UK https://www.wired.co.uk/article/why-china-will-win-the-global-battle-for-ai-dominance

  • Wooldridge, M. (2020). The road to conscious machines. London: Penguin Books.


  • Xu, G., Wu, Y., Minshall, T., & Zhou, Y. (2018). Exploring innovation ecosystems across science, technology, and business: A case of 3D printing in China. Technological Forecasting and Social Change, 136, 208–221.



Acknowledgements

Previous versions of this paper received valuable comments from attendees at the 2019 ZEW Conference on the Economics of Innovation and Patenting, the SPRU Friday Seminar Series, and the 2019 NBER Economics of AI Conference. We would like to thank Tommaso Ciarli, W. E. Steinmueller and Jann Kinne in particular for their feedback. The analysis presented in this paper was supported through a European Commission Horizon 2020 Research and Innovation Action: CO-CREATION-08-2016-2017 - Better integration of evidence on the impact of research and innovation in policy making. We also thank two anonymous reviewers for valuable comments that have substantially improved the quality of the paper.

Funding

Funding was provided by H2020 Society (Grant No: 770420).

Author information


Corresponding author

Correspondence to Juan Mateos-Garcia.

Appendices

Appendix A: Data collection and processing

In this paper, we:

  1. Ascertain the ‘GPT-ness’ of Deep Learning by analysing its evolution, diffusion and impact in the field of Computer Science research.

  2. Analyze the evolution of its geography and, more specifically, whether it has been subject to the turbulence (‘deep change’) we may have expected as a consequence of its disruptive, transformational nature.

  3. Study the drivers of the geography of DL research and the extent to which it is driven by the co-location of DL researchers with other researchers and businesses with relevant (related) capabilities that may be conducive to the development of strong DL ecosystems.

This requires the creation of a complex data collection and processing pipeline involving the following activities:

  1. We combine data from arXiv, GRID (the Global Research Identifier Database) and MAG (Microsoft Academic Graph) to create a geocoded dataset of research activity in computer science disciplines.

  2. We identify DL papers with CorEx, a topic modeling algorithm.

  3. We measure the relatedness between computer science subjects and DL research through their co-occurrence in arXiv papers.

  4. We use CrunchBase, a business directory, to map industrial activities that might be relevant for the development of DL clusters. We measure relatedness between those industries and DL using a machine learning model trained on company descriptions to predict the industrial orientation of arXiv research papers.

We go through these two streams of data collection and classification in turn.

Code and data for reproducing this analysis are available for review at https://github.com/nestauk/arxiv_ai.

Identifying and mapping DL papers in arXiv data

We generate the DL dataset for our analysis by matching three non-proprietary open data sources; arXiv, Microsoft Academic Graph (MAG), and the Global Research Identifier Database (GRID). The data sources are matched in the following order, according to the procedure described in the following section:

$$ \left\{ \text{arXiv} \xrightarrow{\text{matched to}} \text{MAG} \right\} \xrightarrow{\text{matched to}} \text{GRID} $$

By following this pipeline of data collection, we create a dataset with the features described in Table 3 for further processing as described in this "Appendix".

Table 3 Features extracted in the data collection procedure.

arXiv

arXiv is an open archive of academic preprints widely used by researchers in quantitative, physical and computational science fields, with 1.7 million papers at the time of writing. Data from these papers can be accessed programmatically via the arXiv API. As arXiv papers are self-registered, we ensure that papers are not simply ‘junk’ articles by requiring that all papers are matched to a journal publication or conference proceeding, as described in the following section. We also have anecdotal evidence that the archive contains many high-quality papers: a short study of the proceedings of the prestigious AI conference on Neural Information Processing Systems (NeurIPS) in 2017 reveals that over 55% of its papers were published on arXiv.

Is arXiv a suitable data source for the analysis of applied R&D? We believe that this is the case. The AI research community has a strong culture of openness in its publication of research findings, software and benchmark datasets, which are perceived as a way to attract scientific talent (Bostrom, 2017). Some of the most active DL institutions in our corpus include corporations such as Google, Microsoft, IBM, Baidu or Huawei.Footnote 21

From the initial set of over 1.3 million papers, we select approximately 134,000 for analysis because they fall under the broad category of ‘Computer Science’ (cs) or the specific category of ‘Statistics - Machine Learning’ (stat.ML).

Microsoft academic graph (MAG)

Microsoft Academic Graph (MAG) is an open API offering access to 140 million academic papers and documents compiled by Microsoft and available as part of its Cognitive Services (Wang et al., 2020). For the purpose of this paper, MAG helps to ensure that articles retrieved from arXiv have been published in a journal or conference proceeding, and also provides citation counts, publication dates and author affiliations. The matching of the arXiv dataset mentioned in the previous section is performed in two steps.

We begin by matching the publication title from arXiv to the MAG database. The database can be queried by paper title, although fuzzy-matchingFootnote 22 or near-matches are not possible with this service. Furthermore, since paper titles in MAG have been preprocessed, we need to apply a similar preprocessing prior to querying the MAG database. There is no public formula for achieving this, so we explicitly describe the following steps to emulate the MAG preprocessing (a code sketch follows the list):

  1. Identify any ‘foreign’ characters (for example, Greek or accented letters) as non-symbolic;

  2. Replace all symbolic characters with spaces; and

  3. Ensure that no more than one space separates characters.
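
As an illustration, a minimal Python sketch of these steps (the helper name emulate_mag_title, the lower-casing and the Unicode normalisation are our own assumptions; the exact MAG normalisation is not documented):

```python
import re
import unicodedata


def emulate_mag_title(title: str) -> str:
    """Approximate the MAG title preprocessing described above (our reconstruction).

    1. Treat 'foreign' characters (accented, Greek, ...) as non-symbolic and keep them.
    2. Replace symbolic characters (punctuation, maths symbols) with spaces.
    3. Collapse runs of whitespace into a single space.
    """
    cleaned = []
    for ch in unicodedata.normalize("NFKC", title):
        if ch.isalnum() or ch.isspace():
            cleaned.append(ch)          # step 1: letters and digits from any script are kept
        else:
            cleaned.append(" ")         # step 2: symbols become spaces
    # Step 3: single spaces only; lower-casing is an assumption on our part.
    return re.sub(r"\s+", " ", "".join(cleaned)).strip().lower()


print(emulate_mag_title("Mask R-CNN: étude de cas"))  # -> "mask r cnn étude de cas"
```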

This procedure leads to a match rate of 90% for the set of arXiv articles used in this paper. We speculate that papers could be missing for several reasons: the titles on arXiv could be significantly different from those on MAG; the above procedure may be insufficient for some titles; the arXiv paper may not have been published in a journal or conference proceeding; and MAG may not otherwise contain the publication. It may be possible to recover some of these papers; however, this is currently not a limiting factor in our analysis.

Global Research Identifier Database (GRID)

We use the Global Research Identifier Database (GRID) to enrich the dataset with geographical information, specifically a latitude and longitude coordinate for each affiliation that we can then reverse geocode into countries and regions.Footnote 23 The GRID data is particularly useful since it provides institute names and aliases (for example, the institute name in foreign languages). Each institute name from MAG is matched to the comprehensive list from GRID as follows:

  1. If there is an exact match amongst the institute names or aliases, then extract the coordinates of this match. Assign a ‘score’ of 1 to this match (see step 3 for the definition of ‘score’).

  2. Otherwise, check whether a match has previously been found. If so, extract the coordinates and score of this previous match.

  3. Otherwise, find the GRID institute name with the highest matching score, by combining the scores from various fuzzy-matching algorithms in the following manner:

     $$ \frac{1}{\sqrt{N}}\sqrt{\sum\limits_{n = 0}^{N} F_{n}\left(m_{\text{MAG}}, M_{\text{GRID}}\right)^{2}} \quad (3) $$

     where N is the number of fuzzy-matching algorithms used, \(F_{n}\) returns a fuzzy-matching score (in the range 0 to 1) from the nth algorithm, \(m_{\text{MAG}}\) is the name from MAG to be matched and \(M_{\text{GRID}}\) is the comprehensive list of institutes in the GRID data.

The form of Equation 3 ensures that the effect of a single poor fuzzy-matching score is to vastly reduce the preference for a given match. Therefore, good matches are defined according to Equation 3 as having multiple good fuzzy-matching scores, as measured by different algorithms. We use a prepackaged set of fuzzy-matching algorithms implementing the Levenshtein distance metric, specifically two algorithms applying a token-sort-ratio and a partial-ratio respectively.
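
As an illustration, a minimal sketch of this scoring step in Python, assuming the rapidfuzz implementations of these two scorers (the function names combined_score and best_grid_match and the example institute names are our own):

```python
from math import sqrt

from rapidfuzz import fuzz  # the same scorers exist in the older fuzzywuzzy package

# The two Levenshtein-based scorers named in the text, rescaled from 0-100 to 0-1.
SCORERS = [fuzz.token_sort_ratio, fuzz.partial_ratio]


def combined_score(mag_name: str, grid_name: str) -> float:
    """Combine the individual fuzzy scores as in Equation 3 (a quadratic mean)."""
    scores = [f(mag_name, grid_name) / 100.0 for f in SCORERS]
    return sqrt(sum(s ** 2 for s in scores)) / sqrt(len(scores))


def best_grid_match(mag_name: str, grid_names: list[str]) -> tuple[str, float]:
    """Return the GRID institute name with the highest combined score."""
    return max(((g, combined_score(mag_name, g)) for g in grid_names),
               key=lambda pair: pair[1])


# Hypothetical example.
print(best_grid_match("massachusetts inst of technology",
                      ["Massachusetts Institute of Technology",
                       "Michigan Technological University"]))
```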

After this stage of data matching, we are left with approximately 240,000 unique institute-publication matches with at least one computer science subject in their arXiv categories.
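
The reverse geocoding mentioned above (Footnote 23) can be sketched as a point-in-polygon spatial join. This is an illustration only, assuming geopandas (0.10 or later) and the Natural Earth admin-1 shapefile; the file path and the 'latitude'/'longitude' column names are assumptions:

```python
import geopandas as gpd
from shapely.geometry import Point

# Natural Earth admin-1 boundaries (states/provinces); the path is illustrative.
regions = gpd.read_file("ne_10m_admin_1_states_provinces.shp")


def reverse_geocode(df):
    """Attach a region to each institute via a point-in-polygon spatial join.

    `df` is assumed to carry 'latitude' and 'longitude' columns taken from GRID.
    """
    points = gpd.GeoDataFrame(
        df,
        geometry=[Point(lon, lat) for lon, lat in zip(df["longitude"], df["latitude"])],
        crs=regions.crs,
    )
    # Each point inherits the attributes of the polygon that contains it.
    return gpd.sjoin(points, regions[["name", "admin", "geometry"]],
                     how="left", predicate="within")
```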

Topic modeling

We use the Correlation Explanation (CorEx) algorithm (Ver Steeg and Galstyan, 2014), which takes an information-theoretic approach to generate n combinations of features that maximally describe correlations in the dataset.

Before doing this, we pre-process our text by tokenizing the abstracts and removing common stop-words, very rare words and punctuation. We lemmatize the tokens based on their part-of-speech tags, and create bi-grams and tri-grams. Documents with fewer than twenty tokens are removed from the sample. After these steps, there are over 168,000 features (unique unigrams, bi-grams or tri-grams) in the dataset.
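
A minimal sketch of this preprocessing, assuming spaCy for POS-aware lemmatisation and gensim’s Phrases for n-gram detection; the model name, thresholds and the abstracts variable are illustrative rather than the exact settings used:

```python
import spacy
from gensim.models.phrases import Phrases

# POS-aware lemmatisation and stop-word removal with spaCy
# (the model name is illustrative; any English pipeline with a lemmatizer works).
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])


def tokenize(abstract: str) -> list[str]:
    doc = nlp(abstract.lower())
    return [tok.lemma_ for tok in doc
            if not (tok.is_stop or tok.is_punct or tok.is_space)]


docs = [tokenize(a) for a in abstracts]        # `abstracts` assumed loaded elsewhere

# Join frequently co-occurring tokens into bi-grams, then tri-grams.
bigrams = Phrases(docs, min_count=5, threshold=10)
trigrams = Phrases(bigrams[docs], min_count=5, threshold=10)
docs = [trigrams[bigrams[d]] for d in docs]

# Drop very short documents, mirroring the twenty-token cut-off described above.
docs = [d for d in docs if len(d) >= 20]
```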

Using a one-hot bag-of-words representation, we find an optimal n = 28 topics by tuning n with respect to the ‘total correlation’ variable, as advised by the CorEx authors. The words in each generated topic are sorted by their contribution to total correlation. We assign a score \(S^{j}\) for each topic j (containing \(N^{j}\) words \(w_{i}\) with topic weights \(T_{i}^{j}\)) to each document W such that:

$$ S^{j} = \sum\limits_{i = 0}^{N^{j}} T_{i}^{j}\, \delta(w_{i}, W) \quad (4) $$

where:

$$ \delta(w_{i}, W) = \begin{cases} 1 & \text{if } w_{i} \in W \\ 0 & \text{otherwise} \end{cases} \quad (5) $$

Topics are then assigned to each document only if the following condition is satisfied:

$$ S^{j} \ge \gamma\, T_{\max}^{j} \quad (6) $$

where γ is a threshold parameter that we assign below, and \(T_{\max }^{j}\) is the maximum topic weight. The form of the above asserts that documents must contain a sufficient number of the components of a topic to be assigned to it. Clearly, a larger choice of γ leads to fewer documents being assigned to the topic whilst improving the overall precision.
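
A minimal sketch of this assignment rule (Equations 4-6), assuming topic-word weights have already been extracted from a fitted CorEx model; the function names and toy topics are illustrative:

```python
# Assignment rule of Equations 4-6. `topic_words` is assumed to have been extracted
# from a fitted CorEx model (corextopic package): for each topic j, a mapping
# word -> topic weight T_i^j.

def topic_score(topic_words_j: dict[str, float], doc_tokens: set[str]) -> float:
    """Equation 4: sum the weights of the topic words present in the document."""
    return sum(w for word, w in topic_words_j.items() if word in doc_tokens)


def assign_topics(topic_words: list[dict[str, float]],
                  doc_tokens: set[str], gamma: float = 0.5) -> list[int]:
    """Equations 5-6: keep topic j if S^j >= gamma * max_i T_i^j."""
    return [
        j for j, words_j in enumerate(topic_words)
        if topic_score(words_j, doc_tokens) >= gamma * max(words_j.values())
    ]


# Hypothetical example with two toy topics.
topics = [
    {"neural_network": 0.9, "deep_learning": 0.8, "convolutional": 0.6},
    {"logic": 0.7, "proof": 0.5},
]
print(assign_topics(topics, {"deep_learning", "image", "convolutional"}))  # -> [0]
```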

After inspecting the model outputs, we identify two topics related to DL, containing keywords such as ‘neural network’, ‘deep learning’, or ‘convolutional neural networks’. We label as ‘Deep Learning’ those papers where either of these topics is present with a γ above 0.5, giving us a set of 15,062 DL papers (11% of the total unique papers).Footnote 24

Descriptive results

Figures 10, 11 and 12 present the distribution of DL and non-DL activity over arXiv computer science subjects, countries and regions for the top categories in each variable.

Fig. 10 Distribution of DL/non-DL papers by arXiv category

Fig. 11 Distribution of DL/non-DL papers by country (top 20 countries)

Fig. 12 Distribution of DL/non-DL papers by region (top 35 regions)

Some observations:

  1. DL papers are highly concentrated in a small number of arXiv subjects: Computer Vision (cs.CV), Learning (cs.LG), Machine Learning (stat.ML), Artificial Intelligence (cs.AI) and Neural Networks (cs.NE). The set of DL-intensive subjects includes fields that rely on unstructured datasets, where DL has achieved important breakthroughs, and fields that specialize in the development of ML and AI methods.

  2. The United States has the largest share of DL and non-DL papers, with around a third of all publications in both categories. China is over-represented in DL: its share of DL papers is more than double its share of non-DL papers. By contrast, France is under-represented in DL.

  3. North American regions dominate the global rankings of DL activity. California, Massachusetts, New York, Maryland, Illinois and Texas rank highly by volume of DL activity. Ontario and Quebec in Canada also have high levels of activity, consistent with Canada’s strong research base in AI. Beijing, the South West Development Corporation in Singapore, Maryland and Quebec are over-represented in DL, with substantially higher shares of activity in DL than in the rest of the corpus. Notably, only one EU region (Bavaria) appears in the top ten of global DL research in arXiv.Footnote 25

Research relatedness

As part of our analysis of the drivers of the development of strong DL ecosystems in Section “Drivers of DL ecosystem emergence”, we want to measure whether a location contains substantial research capabilities related to Deep Learning - in other words, whether there are complementarities between the local research base and DL which may lead to collaborations and knowledge flows that support the development of the DL ecosystem.

We measure relatedness between research subjects and DL through their co-occurrence in arXiv papers.Footnote 26 More specifically, we calculate the cosine similarity between vectors representing the subjects that appear in each paper in the corpus.
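
A minimal sketch of this calculation, assuming a binary paper-by-subject incidence table; the toy values are purely illustrative:

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# `paper_subjects` is assumed: one row per paper, one column per arXiv subject
# (plus a 'deep_learning' column), with 1 when a paper carries that label.
paper_subjects = pd.DataFrame({
    "cs.CV": [1, 0, 1],
    "cs.LG": [1, 1, 0],
    "deep_learning": [1, 1, 1],
})  # toy incidence table

# Cosine similarity between subject columns, i.e. between their co-occurrence
# vectors across papers.
sims = pd.DataFrame(
    cosine_similarity(paper_subjects.T.values),
    index=paper_subjects.columns,
    columns=paper_subjects.columns,
)

# Subjects ordered by their proximity to the DL category, as plotted in Fig. 13.
print(sims["deep_learning"].sort_values(ascending=False))
```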

Figure 13 displays a heatmap of the proximities (calculated as the cosine similarity) between different arXiv subjects (as well as the DL category) based on their co-occurrence in papers, sorted by their proximity to the DL category.

Fig. 13 Cosine similarities between arXiv subjects based on co-occurrence in papers. Darker colours indicate a stronger propensity for subjects to appear in the same papers.

Consistent with Figure 10, DL papers are closer to computer science subjects involving unstructured data and subjects that research ML, AI and neural networks. These subjects also tend to co-occur with each other, forming what seems to be a ‘cluster’ of data analytics research in arXiv. Our analysis also reveals intuitive connections between other arXiv subjects, such as Computers and Society (cs.CY) and Human Computer Interaction (cs.HC), or Logic (cs.LO) and Programming Languages (cs.PL), supporting the idea that our proximities are a meaningful measure of thematic similarity and relatedness between computer science subjects in arXiv.

CrunchBase data

Descriptive results

Figure 14 presents the regional distribution of activity in CrunchBase. California is again the top region by number of organizations. Technology company activity in CB is more concentrated than research in arXiv (California accounts for 15% of all activity in CrunchBase, but only 7% of the activity in arXiv). US states and Indian regions have a stronger presence here than they did in arXiv. Chinese provinces are, by contrast, less visible.

Fig. 14 Share of CrunchBase activity by region

arXiv and CrunchBase coverage

Figure 15 compares levels of activity in arXiv and CrunchBase. Although there is a strong correlation between the two datasets (ρ = 0.67), we note some divergences. For example, there are several UK counties around London with a strong presence in CrunchBase but low activity in arXiv. Conversely, some Japanese prefectures display high levels of arXiv activity but few organizations in CrunchBase.Footnote 27

Fig. 15 Association between arXiv and CrunchBase activity (logged)

Research subject: industrial category relatedness

In order to estimate the relatedness between research and industry, we take the following steps (a code sketch of steps 1 and 2 follows the list):

  1. We train a supervised machine learning model that predicts the sector of companies in CrunchBase based on their vectorized descriptions.Footnote 28 We perform a grid search to select the best performing model, a logistic regression classifier with L1 regularization.

  2. We then transfer this model to arXiv papers in order to predict their closest CrunchBase categories based on the text of their abstracts. Specifically, we label arXiv papers with CrunchBase categories where the prediction probability is at least 0.99.Footnote 29

  3. We then calculate the share of papers in each arXiv subject (and in the DL category) predicted to be in a CrunchBase category to measure the relatedness between CrunchBase categories and arXiv subjects.
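
A minimal sketch of steps 1 and 2, assuming a tf-idf vectorisation of company descriptions and scikit-learn; the variable names (cb_descriptions, cb_sectors, arxiv_abstracts), the hyper-parameter grid and the vectoriser choice are assumptions rather than the exact configuration used:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Step 1: predict CrunchBase sectors from vectorised company descriptions.
# `cb_descriptions` (list of str) and `cb_sectors` (list of labels) are assumed
# to come from the CrunchBase data prepared earlier.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(min_df=5, stop_words="english")),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000)),
])
search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=3)
search.fit(cb_descriptions, cb_sectors)

# Step 2: transfer the trained model to arXiv abstracts and keep only the
# labels predicted with probability of at least 0.99, as in the text.
probs = search.predict_proba(arxiv_abstracts)   # shape: (n_papers, n_sectors)
sectors = search.best_estimator_.named_steps["clf"].classes_
paper_sectors = [
    [sectors[j] for j in range(len(sectors)) if probs[i, j] >= 0.99]
    for i in range(probs.shape[0])
]
```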

The heatmap in Figure 16 presents the share of all papers in an arXiv subject (and DL) that were labeled with a CrunchBase category. It shows that DL papers were classified most often in the Data Analytics, Artificial Intelligence and Software CrunchBase sectors - these are, unsurprisingly, the industries that DL is thematically closest to. We also detect intuitive relations between other arXiv categories and CrunchBase sectors: for example, Robotics (cs.RO) is related to Science and Engineering, Sound (cs.SD) is related to Music and Audio, and Cryptography (cs.CR) is related to Privacy and Security. It is, however, worth noting that some of the similarities we identify could be linguistic rather than semantic (for example, our model detects a strong similarity between Game Theory (cs.GT) and Gaming, which could be partly explained by their use of similar language rather than a shared knowledge base).

Fig. 16 Similarity between arXiv disciplines and CrunchBase sectors: the colour of a cell represents the share of papers with a given subject (vertical axis) that were classified in a CrunchBase sector (horizontal axis)


Cite this article

Klinger, J., Mateos-Garcia, J. & Stathoulopoulos, K. Deep learning, deep change? Mapping the evolution and geography of a general purpose technology. Scientometrics 126, 5589–5621 (2021). https://doi.org/10.1007/s11192-021-03936-9
