Hurdle-QAP models overcome dependency and sparsity in scientific collaboration count networks

ABSTRACT

Spatial proximity may facilitate scientific collaboration. We regress its impact within two German research institutions, defining collaboration strength and proximity by the number of joint publications and spatial distance between work places. The methodological focus lies on accounting for (i) the dependency structure in network data and (ii) excess zeros in the sparse target matrix. The former can be addressed by a quadratic assignment procedure (QAP), the second by a hurdle model. To offer a joint solution, we combine the methods to novel parametric and non-parametric hurdle-QAP models. The analysis reveals that proximity can facilitate collaboration, but significant effects get lost within building structures. Outcomes of this study may inform about how to target the promotion of interdisciplinary research.

1. Introduction

Interdisciplinary research offers numerous opportunities in terms of both academic output and personal development of researchers. According to Feng and Kirkley (2020), researchers with a diverse collaborative neighborhood tend to have better academic performance and longer working years. However, the journal Nature (Nature, 2015) claims that interdisciplinary research does not yet receive the support it needs to be successful. Feng and Kirkley (2020) also stress the importance of encouraging the development of interdisciplinary programmes. Therefore, it is essential to identify factors that influence interdisciplinary research (Feng & Kirkley, 2020).

Bielefeld University, Germany, characterizes its research by the guiding principle of interdisciplinarity. The special structure of its main building allows researchers from different faculties to meet each other without going outdoors. The spatial design clearly differs at Helmholtz Munich, a German research center for environmental health. Although many institutes are united on one main campus, they are distributed over individual buildings. Here, too, interdisciplinary research plays an important role. Based on this observation and inspired by the work of Claudel et al. (2017), we aim to assess the role of building and campus structures for interdisciplinary research: How does spatial proximity between two scientists influence their collaboration? We measure the strength of interdisciplinary collaboration by the number of joint publications between two researchers from two different institutes or faculties. We then address the questions of whether researchers collaborate at all and, if so, how much they publish jointly, both depending on their spatial proximity. In doing so, we restrict the analysis to a few variables to make it transferable to other institutions. This is in contrast to Claudel et al. (2017), who analyze detailed time-resolved data (down to floor and office level per person) from the Massachusetts Institute of Technology to also reveal the impact of heterogeneity within and between buildings and within research teams, and also in contrast to Salazar Miranda and Claudel (2021), who investigate the effects of shifts of locations of researchers’ offices.

In Section 2.1, we give an overview of the two research institutions that play a key role in this work. Further, we describe the data collection and preprocessing, considering both publication and distance data. In Section 2.3, we explain how collaboration networks are constructed and introduce two novel combinations of the quadratic assignment procedure (QAP) and the hurdle model: the parametric and non-parametric hurdle-QAP. In Section 3, we present the results obtained by applying these methods to our data. Finally, we discuss strengths and limitations of this work in Section 4.

2. Materials and methods

2.1. Data collection and preprocessing

2.1.1. Institutions

Helmholtz Munich is a research center for environmental health. It was founded in 1960 and is a member of the Helmholtz Association of German Research Centers (Helmholtz Association of German Research Centers, 2020). The main research focus of the center is on chronic lung diseases, allergies and diabetes mellitus. It consists of 57 institutes and departments in 12 different locations (main campus in Neuherberg, several locations in Munich and further locations in other parts of Germany). In 2018, the center had 2,546 employees (Helmholtz Zentrum München, 2020).

For this work, we use publication and location data from the whole center (Germany-wide) and concentrate later on the main campus, which is located in Neuherberg, Germany. The Neuherberg campus (see ground plan in Figure 1) is an area of approximately 400,000 square meters with more than 60 buildings in which research institutes and departments or technical, administrative and management facilities are located.

Figure 1. Node net of the main campus in Neuherberg. Red crosses on the ground plan display node points to create a node distance net between all buildings on the campus. Letters on the crosses (e.g. a or b) represent several entrance doors of one building.

Bielefeld University is located in Bielefeld in the North-West of Germany. It was founded in 1969 and today comprises 14 faculties in the areas of humanities, natural and technical sciences and medicine. In 2018, the university had about 25,000 students and 1,622 scientific staff (University, 2020). The whole university campus is located on an area of circa 600,000 square meters. The campus contains one central main building and various smaller buildings around it (see Figure A3a). We will primarily concentrate on the main building, which consists of different building parts (wings) and 13 floors. As demonstrated in Figure A3b in the supplement, all building wings are connected to the main hall so that one does not have to leave the building to get from one part to another. In this way, one building accommodates many institutes under one roof, which is also one of the university’s statements for interdisciplinary research and an important part of our research question.

In summary, we cover in our analysis two institutions with potentially different collaboration schemes: one research center, Helmholtz Munich, with rather homogeneous research topics and a university, Bielefeld University, with researchers from a more heterogeneous pool of disciplines. Based on publication outcome, we will investigate if the different building structures (campus with single buildings vs. main building) and the resulting distances have an effect on the collaboration.

2.1.2. Publication data

We used openly available publication data for our analysis rather than information from local libraries to avoid biases and to ensure expandability to further institutions. We downloaded data from the publication databases Web of Science (Web of Science, 2021), Scopus (Scopus, 2021) and PubMed (Pubmed, 2021). Each of the three databases was searched for publications of all document types from years 2015 to 2019 which were published by researchers from Helmholtz Munich or Bielefeld University, respectively (search date 04/06/2021). Query details can be found in Supplement Section A.1. We used the R package bibliometrix (Aria & Cuccurullo, 2017) to get a clean, merged data set from the three sources with unique publications. The final data set contains, among others, authors, author affiliations and publication year for each publication. Because of different spellings (including typos and abbreviations) for many affiliations (including departments and working groups), we performed a cluster analysis and string matching to assign the authors to their main faculties and institutes. For each author, we only chose the main affiliation (the one with most publications) and did not consider double affiliations. For our network analysis, only authors with affiliation from Helmholtz Munich or Bielefeld University are considered. Further, only authors who have at least one joint publication with another author from the same institution are included. In the end, we have data from 18 faculties (or centers) for Bielefeld and 61 institutes (or core facilities) for Helmholtz Munich. The institutes at both institutions are additionally categorized into research fields. Table 1 summarizes this data for both Munich and Bielefeld. Further information such as the number of publications per institute and corresponding graphics can be found in Supplement Section A.1.
In the following, we generally write about institutes and include in this term other units such as faculties or core facilities.

Table 1. Summary of employed publication data (from 2015 to 2019) after filtering as described in the main text.


2.2. Distance data

We aimed to measure the physical distance for all author pairs. Detailed and time-dependent information on the researchers’ exact office locations was unavailable. We defined the location of authors by their most recent main institute building. Authors from the same institute thus have a distance of zero.

The main building of Bielefeld University accommodates many institutes under one roof. No GPS-based measuring was possible inside the university building. This is why we decided to conduct the distance measurement “by foot.” To be consistent, we used the same method of distance measuring for both research institutions. We connected the institutes by a node net (see Figure 1 for the Helmholtz Munich net and Supplemental Figure A4 for the Bielefeld node net, exemplarily for the third floor of the main building), and we walked through the Bielefeld University building and the Bielefeld and Munich campus, respectively, to count the required walking steps between nodes. We positioned nodes on building doors or at the center of building parts (such as the building wings in the Bielefeld University main building). Based on this, we calculated a distance matrix with shortest connections between all node pairs. Assigning institutes to their corresponding node points (one or more), we calculated a distance matrix containing all distances from each institute to one another. For institutes which consisted of more than one node or were located in different building parts, we calculated the minimum over all possible distances from this institute to another. The matrix diagonal was set to zero to achieve a zero distance for an institute with itself. Finally, we transformed the distances from steps into meters using the corresponding step length ($0.75$ m for Bielefeld and $0.72$ m for Munich).
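The distance construction described above can be illustrated with a small sketch. The article does not say which shortest-path algorithm or software was used, so the Floyd-Warshall routine, the function names and the toy node net below are assumptions made purely for illustration (in Python, although the article itself works in R).

```python
import math

def shortest_paths(n_nodes, edges):
    """All-pairs shortest paths (Floyd-Warshall) on an undirected node net.

    edges: iterable of (node_u, node_v, walking_steps).
    """
    dist = [[math.inf] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        dist[i][i] = 0.0
    for u, v, steps in edges:
        dist[u][v] = min(dist[u][v], steps)
        dist[v][u] = min(dist[v][u], steps)
    for k in range(n_nodes):
        for i in range(n_nodes):
            for j in range(n_nodes):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def institute_distances(node_dist, institute_nodes, step_length_m):
    """Minimum node-to-node distance per institute pair, converted to meters."""
    result = {}
    for a, nodes_a in institute_nodes.items():
        for b, nodes_b in institute_nodes.items():
            if a == b:
                result[(a, b)] = 0.0  # diagonal is set to zero
            else:
                steps = min(node_dist[p][q] for p in nodes_a for q in nodes_b)
                result[(a, b)] = steps * step_length_m
    return result

# Hypothetical toy net: four nodes with counted walking steps between them.
edges = [(0, 1, 100), (1, 2, 50), (2, 3, 200), (0, 3, 400)]
node_dist = shortest_paths(4, edges)
institutes = {"A": [0], "B": [2, 3]}  # institute B is spread over two nodes
dist_m = institute_distances(node_dist, institutes, step_length_m=0.75)
print(dist_m[("A", "B")])  # shortest A-B connection in meters
```

In this toy net the direct link between nodes 0 and 3 (400 steps) is longer than the detour via nodes 1 and 2 (350 steps), and institute A is closest to the node of institute B that lies 150 steps away.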

For Helmholtz Munich, the distances within and to all locations outside Neuherberg campus were measured using Google Maps (Google, 2019). Within buildings, we did not distinguish between single floors. Building doors served as node points. Additional node points were inserted on street crossings to design a net with all possible combinations. We show details on exact distance measuring and further information on building distances in Supplement Section A.2.

In the following, we differentiate between all distances and interdisciplinary distances, where the latter only includes distances between different institutes. With this distinction, we aim to concentrate on interdisciplinary collaboration and to avoid an overly large impact of people working in the same building – and thus having a distance of zero – and publishing together because they are from the same institute.

2.2.1. Network data

For network analysis, we transform the publication data into edge vectors which consist of author or institute pairs. We use these edge vectors to build author and institute collaboration networks in which every node represents one author or institute, respectively. Two nodes are connected by an edge if they have at least one joint publication. We treat each connected pair of nodes (e.g. a pair of two authors who publish together) as one observation. Such data is called dyadic and is typically displayed by an adjacency matrix, where row and column names are identical and every combination of each two nodes is considered. Finally, we define three symmetric data matrices where each matrix contains author names or institutes on both columns and rows.

• $Y$ : count target variable, containing the number of joint publications of two publishing authors/institutes.

• $X_{dist}$ : independent numeric variable, containing the distance between the author/institute pairs in metres.

• $X_{numPub}$ (only for author networks): independent numeric variable, containing the mean number of publications of each two authors’ institutes.

By vectorizing the data matrices of the dependent variable and the covariates, network data can be analyzed e.g. by regression. The main diagonal of $Y$ is set to NA since we are only interested in the collaboration of authors and not in single-author publications. Vectorizing all matrices yields the vectors $y$ , $x_{dist}$ and $x_{numPub}$ , in which each observation contains the pair data of two authors/institutes.
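As a minimal sketch of the vectorization step, the following Python snippet stacks the off-diagonal entries of each matrix column by column, so that $y$ and $x_{dist}$ stay aligned. The function name and the toy matrices are hypothetical. The text does not state whether the full matrix or only one triangle is used, so the sketch simply keeps all off-diagonal entries.

```python
def vectorize(matrix):
    """Stack the columns of a square dyadic matrix, skipping the diagonal."""
    n = len(matrix)
    return [matrix[i][j] for j in range(n) for i in range(n) if i != j]

# Hypothetical toy data for three authors; None marks the NA diagonal of Y.
Y = [[None, 2, 0],
     [2, None, 1],
     [0, 1, None]]
X_dist = [[0, 120, 300],
          [120, 0, 80],
          [300, 80, 0]]

y = vectorize(Y)            # joint publication counts per ordered pair
x_dist = vectorize(X_dist)  # distances in the same pair order
print(list(zip(y, x_dist)))
```

Because both matrices are traversed in the same order, the k-th entry of every vector refers to the same author pair.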

2.3. Modeling and estimation

In this section, we describe the representation of the network data and explain in detail the applied descriptive and modeling methods. In particular, we introduce two new combinations of estimation and testing methods which meet the characteristics of our data.

2.3.1. Data representation and descriptive analysis

Networks can be represented by various network graphs. To encode information, nodes and edges can be drawn with different colors and line types. The node size could describe for example the node degree (number of connections to other nodes) or the total number of publications of the corresponding author/institute. Color and thickness of edges can represent the number of joint publications, either in total or (in author networks) adjusted to the total number of publications of the authors’ institutes. For an example, see the author network in Figure 2. Additionally, one can define edge weights which change the form of a network (see the author networks in Section A.4). In our networks, they could be defined e.g. as number of publications or physical distance between two nodes (author/institute) or a co-authorship index (see Claudel et al. (2017) and Section A.3). The co-authorship index takes into account how many joint publications one pair of authors/institutes has and how many coauthors contributed to each of those publications. In our work, we use edge weights for network representation but not for inference.

Figure 2. Exemplary author collaboration network. Node size can represent e.g. the node degree; edge thickness or color is used to display further information, e.g. darker color for smaller and brighter color for larger distance between two authors.

For a first descriptive analysis, we calculate network properties such as density, characteristic path lengths, mean degrees and connected components, among others, for each network. Bar plots and cumulative distribution functions of distance provide an overview of the number of publications per institution or research field and the distribution of publications separately for interdisciplinary and institution-wide work. For all descriptive network analyses we use the R package igraph (Csardi & Nepusz, 2006).

2.3.2. Network modeling

We use regression models to describe the effect of spatial distance on the number of joint publications. However, two challenges occur. First, ordinary least squares regression (OLS) requires independent observations, but one important aspect of social network data is that the data structure often induces row- or column-wise correlated observations in the adjacency matrix: In the example of the collaboration network, author A might generally publish more than author B. Then, coauthors of author A are likely to have more publications than coauthors of author B, independently of any other covariates. Such a positive correlation of observations within a row or column might lead to standard errors that are too small and p-values that are too optimistic for estimated regression coefficients (Simpson, 2001). The second challenge lies in excess zeros in the publication data, which occur because each author collaborates only with some (up to $\sim 30$ ) other authors. In an adjacency matrix for several thousands of authors, most of the matrix entries equal zero, i.e. the matrix is sparse. This leads to more zero observations than a Poisson model would allow (Zeileis et al., 2008).

Using a hurdle or zero-inflation model, one could account for excess zeros, but would not take the network structure into account. The R package sna (Butts, 2019), on the other hand, makes it possible to analyze network data while considering the dependency structure by using a quadratic assignment procedure (QAP) test for linear and logistic regression models, but it is inappropriate for count models. Our approach, therefore, combines a two-step hurdle model with the concept of the QAP test to both model count data and account for excess zeros in the data while adjusting for network dependencies.

We choose hurdle-QAP over other network models such as exponential random graph models (ERGM; Robins et al., 2007) or latent space models (LSM; P. D. Hoff et al., 2002) for several reasons: First, in our research question we are not directly interested in the network structure and dependencies but include them as an accompanying circumstance. Second, QAP is comparatively simple to specify (low theoretical cost for the user) and easy to interpret, as explained in Cranmer et al. (2017).

The R package amen (P. Hoff et al., 2020) offers the possibility to analyze network data using additive and multiplicative random effects models within a Bayesian framework. The estimation of random effects and the use of Bayesian inference lead to high computational costs for large networks. Analysis examples of the package include networks with 18 to 71 nodes (P. D. Hoff, 2015) whereas our collaboration network contains more than 2400 nodes. In hurdle-QAP the permutations run independently of each other and can therefore be parallelized to reduce computation time. QAP is generally suited for large data sets, whilst other models require a higher computational effort. Whilst amen does not yet provide a ready-made implementation for count data, Choi et al. (2017) present a method for network analysis for such data even with excess zeros. They, however, address graph structure learning which is not required in our case.

For the above reasons, QAP is widely used for analysis, e.g. in social, environmental and political sciences (Cai et al., 2018; Gui et al., 2019; Hwayoon et al., 2021; Justin & Johnson, 2021; Lee, 2019; Liu et al., 2017; Sidorov et al., 2018). If one is mainly interested in the effect of one or more covariates on the target variable rather than the network structure itself, our hurdle-QAP method is a suitable and fast approach for regression analysis in case of count network data. The approach also offers the opportunity to analyze non-parametric effects. If required, other model specifications can also be flexibly adapted without much effort. To our knowledge a comparable approach has not been published before.

In the following, we introduce the quadratic assignment procedure (QAP), the hurdle model and our novel approaches of combining parametric and non-parametric hurdle models with QAP. The latter provide solutions to account for the mentioned challenges.

2.3.2.1. Quadratic assignment procedure (QAP)

To account for observation intercorrelation in the dyadic data, we use the quadratic assignment procedure (QAP), which was first proposed by Hubert (1987). It performs a non-parametric permutation-based test for the null hypothesis that a test statistic of a chosen association between variables (e.g. correlation, regression coefficient) equals the expected value of the test statistic under random permutations. In this manner, we examine whether a given pattern was created by chance, and we reject the null hypothesis in favor of the alternative if the association test statistic is at an extreme percentile. To this end, we draw a random sample of all permutations to generate a reference distribution of random parameters. A data set with the same structure as the non-permuted data set could have induced this reference distribution. This property of not making assumptions about the distribution of parameters is one of the major advantages of QAP (Dekker et al., 2003).

Using this method on network data, we manage to adjust the received results for the natural row- or column-correlation resulting from the network structure. For a network with $n$ nodes, there are $n!$ possible permutations. We permute only the dependent variable (edge information) $Y$ . Rows and columns of its adjacency matrix are permuted in the same way so that we keep the overall structure of one node (e.g. author), i.e. values are switched but node pairs are not separated (see Figure 3). This way, we conserve the dependence among elements of the same row or column but remove any relationship between the dependent and the independent variable. Consequently, the permuted data sets follow the null hypothesis (Simpson, 2001).

Figure 3. Example of QAP permutation scheme with a random simultaneous row/column permutation. In a symmetric matrix with identical row/column names, row names are reordered randomly and the same order is used for column ordering.

Krackhardt (1988) explains the use of QAP for regression models in the context of network data where the regression coefficient is compared to the reference distribution of estimated coefficients resulting from the permutation models. By using QAP we adjust for network correlation and resolve the problem that OLS regression requires independent observations. Thus, e.g. in the case of binary networks (binary edge status: $1 =$ “existing connection,” $0 =$ “no connection”), we use a univariate logistic regression model with $Y$ and $X$ as adjacency matrices of the dependent and independent variable, respectively. Considering the model regression coefficient $β_{X}$ as the association that we aim to test, our analysis structure is as follows:

(1) Vectorise adjacency matrices of considered variables to obtain $y$ and $x$ .

(2) Estimate a regression model with $y$ as dependent and $x$ as independent variable, resulting in the estimated model coefficient $\hat{β}_{x}$ .

(3) Randomly permute rows/columns of the dependent matrix and vectorise it to obtain $y_{perm}$ .

(4) Estimate the model from Step 2 with $y_{perm}$ as dependent variable, resulting in the estimated model coefficient $\hat{β}_{perm,x}$ .

(5) Repeat Steps 3 and 4 $N$ times (e.g. $N = 1000$ ).

(6) Compare the original estimate $\hat{β}_{x}$ from Step 2 to the empirical sampling distribution of estimates $\hat{β}_{perm,x}$ from Step 4. Specifically, obtain a p-value by calculating the proportion of coefficients which are smaller/greater than $\hat{β}_{x}$ (depending on the direction of the alternative hypothesis). For a two-sided hypothesis test, consider the direction which yields a p-value less than or equal to $0.5$ and multiply this value by two.

(7) If $\hat{β}_{x}$ is at an extremely high or low percentile of the empirical distribution, i.e. the p-value is below a previously defined threshold, reject the null hypothesis that an observed association between $X$ and $Y$ has simply occurred by chance.
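The seven steps can be sketched in a few lines of Python. This is an illustrative stand-in, not the sna implementation: it uses a simple OLS slope as the test statistic instead of a logistic regression coefficient, and all function names, the toy matrices and the seed are assumptions.

```python
import random

def permute_rows_cols(matrix, order):
    """Apply the same reordering to rows and columns (Step 3)."""
    return [[matrix[i][j] for j in order] for i in order]

def off_diagonal(matrix):
    """Vectorise a square matrix without its diagonal (Step 1)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(n) if i != j]

def slope(x, y):
    """OLS slope of y on x, used here as the test statistic (Steps 2 and 4)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def qap_test(Y, X, n_perm=1000, seed=1):
    rng = random.Random(seed)
    x = off_diagonal(X)
    beta_hat = slope(x, off_diagonal(Y))
    perm_betas = []
    for _ in range(n_perm):                      # Step 5
        order = list(range(len(Y)))
        rng.shuffle(order)
        y_perm = off_diagonal(permute_rows_cols(Y, order))
        perm_betas.append(slope(x, y_perm))
    # Step 6: one-sided proportions, doubled for the two-sided test.
    lower = sum(b <= beta_hat for b in perm_betas) / n_perm
    upper = sum(b >= beta_hat for b in perm_betas) / n_perm
    return beta_hat, perm_betas, min(1.0, 2 * min(lower, upper))

# Hypothetical toy network: four nodes on a line, counts fall with distance.
X = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
Y = [[0, 2, 1, 0], [2, 0, 2, 1], [1, 2, 0, 2], [0, 1, 2, 0]]
beta_hat, perm_betas, p_value = qap_test(Y, X, n_perm=200)
print(beta_hat, p_value)
```

With only four nodes there are just $4! = 24$ distinct permutations, so the toy p-value is coarse. The point of the sketch is the mechanics: only $Y$ is permuted, rows and columns move together, and the covariate vector stays fixed.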

Krackhardt (1988) compares the QAP approach to OLS regression and demonstrates in a simulation study that QAP results in model coefficients with smaller bias when structural autocorrelation exists in the data.

In case of more than one independent variable, the multiple regression quadratic assignment procedure (MRQAP) (Krackhardt, 1988) can be used. Applications of QAP and MRQAP can be found for example in Kulahci et al. (2016) and Ngugi (2018). In the remainder of this paper, we will use the name QAP also in case of more than one covariate.

For permutation, we use a simultaneous row/column permutation using the function rmperm() from the R package sna (Butts, 2019). Further, the sna package contains the function qaptest() to perform a QAP hypothesis test on a user-defined statistic. For linear and logistic regression models, the functions netlm() and netlogit() from the same package can be used. To perform the QAP test, choose “qap” for the nullhyp argument. For MRQAP modeling, the R package asnipe (Farine, 2013) provides functions for custom permutation networks and MRQAP with the double-semi-partialing approach, which is suggested by Dekker et al. (2007) in case of multicollinearity.

2.3.2.2. Hurdle model

The dependent variable $Y$ in our data is a count matrix (number of joint publications). Therefore, we apply a count data regression model. In addition, we aim to account for excess zeros (i.e. sparsity) in the adjacency matrix $Y$ which occur as described above. Thus, we employ a (multivariate) hurdle model (Mullahy, 1986) to regress the effect of spatial distance on the number of publications. Here, we include the mean number of publications of an institute as covariate. This enables us to address two questions: whether two authors publish jointly (binomial model part) and, if so, how many joint publications they have (count model part).

A hurdle model is a two-step model which in the first step reduces to binomial regression by dichotomizing the outcome count variable into zero and positive realizations. For positive realizations, the “hurdle” is considered as crossed and only the positive observations are analyzed in the second step with a zero-truncated count model (Mullahy, 1986). The hurdle model is formalized in a two-step process by Cameron and Trivedi (2013) as

$$f(y) = \begin{cases} f_{1}(0) & \text{if } y = 0, \\ \dfrac{1 - f_{1}(0)}{1 - f_{2}(0)} \cdot f_{2}(y) & \text{if } y > 0, \end{cases} \tag{1}$$

where the probability for a zero observation equals $f_{1}(0)$ , and positive counts follow the truncated density $f_{2}(y) / (1 - f_{2}(0))$ multiplied by $1 - f_{1}(0)$ . A binomial distribution can be chosen for the binomial model whereas e.g. Poisson, geometric or negative binomial distributions are possible for the count component. Parameters of the hurdle model are typically estimated by maximum likelihood where the parts of the likelihood for zero and positive realizations can be maximized separately (Cameron & Trivedi, 2013).
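A quick numerical check can make the hurdle density concrete. The sketch below assumes a Poisson distribution for the count component $f_{2}$ and uses made-up values for the zero probability and the Poisson rate. It verifies that the resulting probabilities sum to one.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability mass function, used here as f_2."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def hurdle_pmf(y, p_zero, lam):
    """Hurdle density: f_1(0) = p_zero for y = 0, rescaled truncated f_2 otherwise."""
    if y == 0:
        return p_zero
    return (1 - p_zero) / (1 - poisson_pmf(0, lam)) * poisson_pmf(y, lam)

# Hypothetical values: 90% zeros, Poisson rate 1.5 for the positive part.
p_zero, lam = 0.9, 1.5
total = sum(hurdle_pmf(y, p_zero, lam) for y in range(60))
print(total)                                             # close to 1
print(poisson_pmf(0, lam), hurdle_pmf(0, p_zero, lam))   # Poisson vs. hurdle zero mass
```

A plain Poisson model with rate 1.5 would put only about 22% of its mass on zero, whereas the hurdle model can assign any zero probability independently of the count part. This is what makes it suitable for sparse adjacency matrices.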

Due to the larger number of parameters, the hurdle model is of increased complexity in comparison to e.g. a simple Poisson model, both with respect to computational effort and interpretability. However, in case of excess zeros it prevents inconsistent estimates (Cameron & Trivedi, 2013). The hurdle model is implemented in the R packages pscl (Zeileis et al., 2008) and countreg (Zeileis & Kleiber, 2020). For more flexibility (e.g. further distribution families, more convenient interpretation), both parts of the hurdle model can be fitted separately with a generalized linear model (GLM) for each part (Kleiber & Zeileis, 2015). Following the suggestions of Kleiber and Zeileis (2015), in practice our hurdle model combines a GLM with binomial family and a complementary log-log (cloglog) link for the binomial part and one with a zero-truncated Poisson (ZTP) distribution (restricted to positive numbers) for the count part.

2.3.2.3. Hurdle model with QAP (parametric hurdle-QAP)

In a novel approach we take advantage of the properties of the previously described methods by combining them to a (parametric) hurdle model with QAP test.

For the hurdle part, we first build the binomial and the ZTP model for the original data matrices that have been introduced as network data in Section 2.1. We use $Y$ as target variable and $X_{dist}$ and $X_{numPub}$ as covariates. With the latter we aim to adjust for the fact that each discipline has its own publication practices. By vectorizing all matrices, we get the vectors $y$ , $x_{dist}$ and $x_{numPub}$ . In the binomial model, we use the binary variable $y_{bin}$ as target variable, with entries

$$y_{bin,ij} = \begin{cases} 0 & \text{if } y_{ij} = 0, \text{ i.e. no joint publication,} \\ 1 & \text{if } y_{ij} > 0, \text{ i.e. at least one joint publication,} \end{cases} \tag{2}$$

for authors $i \neq j$ . In the ZTP model, we consider the number of joint publications of two publishing authors, which leads to a count target variable $y_{pois}$ , a sub-vector of $y$ which considers only those observations from $y$ for which one has $y_{bin} = 1$ . We consider a binomial model with cloglog link, which is defined in Fahrmeir et al. (2013) as

$$P(Y_{bin} = 1 \mid X = x) = 1 - \exp(-\exp(η))$$

with

$$η = β_{bin0} + β_{binDist} \cdot x_{dist} + β_{binPub} \cdot x_{numPub}, \tag{3}$$

and a ZTP model according to Cameron and Trivedi (2013),

$$E(Y_{pois} \mid Y_{pois} > 0, X^{*} = x^{*}) = \frac{\exp(θ)}{1 - \exp(-\exp(θ))}$$

with

$$θ = β_{pois0} + β_{poisDist} \cdot x_{dist}^{*} + β_{poisPub} \cdot x_{numPub}^{*} . \tag{4}$$

Here, $x$ represents the vector of covariates and $x^{*}$ a sub-vector of $x$ containing only observations of author pairs with at least one joint publication.
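To illustrate how the two linear predictors translate into interpretable quantities, the following sketch evaluates the cloglog probability and the zero-truncated Poisson mean for given coefficients. The coefficient values are invented for illustration and are not estimates from the article.

```python
import math

def prob_joint_publication(dist, num_pub, b0, b_dist, b_pub):
    """Binomial part: P(Y_bin = 1 | x) under the complementary log-log link."""
    eta = b0 + b_dist * dist + b_pub * num_pub
    return 1 - math.exp(-math.exp(eta))

def expected_joint_publications(dist, num_pub, b0, b_dist, b_pub):
    """Count part: mean of a zero-truncated Poisson with rate exp(theta)."""
    theta = b0 + b_dist * dist + b_pub * num_pub
    lam = math.exp(theta)
    return lam / (1 - math.exp(-lam))

# Hypothetical coefficients: collaboration becomes less likely with distance.
bin_coef = dict(b0=-4.0, b_dist=-0.002, b_pub=0.01)
pois_coef = dict(b0=0.2, b_dist=-0.001, b_pub=0.005)

p_near = prob_joint_publication(50, 100, **bin_coef)
p_far = prob_joint_publication(500, 100, **bin_coef)
mean_near = expected_joint_publications(50, 100, **pois_coef)
print(p_near, p_far, mean_near)
```

Two properties are visible directly: a negative distance coefficient lowers the collaboration probability, and the conditional mean of the count part can never fall below one because zeros are excluded by construction.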

For QAP, we obtain the matrix $Y_{perm}$ by permuting $Y$ and then build the vectorized target variables $y_{binPerm}$ and $y_{poisPerm}$ from $Y_{perm}$ in the same manner as described above. The covariates $x_{dist}$ and $x_{numPub}$ remain unaltered. We repeat the permutation of $Y$ $N$ times and estimate the binomial and the ZTP model each time. Finally, we compare the estimated model coefficients $\hat{β}_{binDist}$ , $\hat{β}_{poisDist}$ , $\hat{β}_{binPub}$ and $\hat{β}_{poisPub}$ from the original data with the empirical distributions of estimated coefficients $\hat{β}_{binDistPerm_{i}}$ , $\hat{β}_{poisDistPerm_{i}}$ , $\hat{β}_{binPubPerm_{i}}$ and $\hat{β}_{poisPubPerm_{i}}$ , $i \in \{1, \dots, N\}$ , from the permuted models. This yields empirical p-values for the null hypothesis that a coefficient estimated on the original data stems from the empirical distribution of coefficients estimated on the permuted data, i.e. the null hypothesis that there is no effect of the considered covariable. Importantly, the p-values are adjusted for excess zeros and network-induced inter-observation dependence.
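The construction of the two target vectors, for the original or for a permuted $y$, can be sketched as follows. The helper name and the toy vectors are hypothetical. The sketch also assumes that the covariate sub-vectors $x^{*}$ are re-selected for every permutation according to the positive entries of the permuted target, which follows from the model definition but is not spelled out in the text.

```python
def hurdle_targets(y, covariates):
    """Split a dyad vector into the binomial target and the zero-truncated part.

    y: list of counts (original or permuted).
    covariates: dict mapping a covariate name to a list aligned with y.
    """
    y_bin = [1 if value > 0 else 0 for value in y]
    keep = [k for k, value in enumerate(y) if value > 0]
    y_pois = [y[k] for k in keep]
    x_star = {name: [values[k] for k in keep]
              for name, values in covariates.items()}
    return y_bin, y_pois, x_star

# Hypothetical vectorised data for four author pairs.
y = [0, 2, 0, 1]
covariates = {"dist": [10, 20, 30, 40], "numPub": [5, 6, 7, 8]}
y_bin, y_pois, x_star = hurdle_targets(y, covariates)
print(y_bin, y_pois, x_star)
```

The binomial model is then fitted on `y_bin` with the full covariate vectors, and the ZTP model on `y_pois` with the selected sub-vectors.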

2.3.2.4. Non-parametric hurdle-QAP

So far, we considered a linear predictor with covariables $X_{dist}$ and $X_{numPub}$ as part of the hurdle model. However, the linear assumption may be inappropriate for the distance data: Here, numeric values are widespread, and some ranges do not occur because they do not arise from the considered spatial setting. This can be seen, for example, in where the rugs on the x-axis indicate distance observations. The impact of these different ranges may scale non-linearly and may reflect the need to change buildings rather than pure spatial distance. For this reason, we take a potential non-linear effect of distance into account by replacing the linear term for the distance variable $x_{dist}$ by a non-parametric spline $s$ ; the second covariate, $x_{numPub}$ , is considered as before in a linear term. In this novel approach, we estimate two generalized additive models (GAM): one with a binomial distribution family and complementary log-log link, and one with a ZTP distribution family. Eqs. (3) and (4) therefore change to

η = β_{b i n 0} + s (x_{d i s t}) + β_{b i n P u b} \cdot x_{n u m P u b}

and

θ = β_{p o i s 0} + s (x_{d i s t}^{*}) + β_{p o i s P u b} \cdot x_{n u m P u b}^{*} .

In a subsequent QAP, we permute the target variable as described before and estimate the non-parametric binomial and ZTP models on each permutation data set.

In detail, we choose P-splines for the smooth term $s$ with $k = 8$ basis functions. For our data, this number offers a good compromise between an overly wiggly curve and information loss. The smoothing parameters in the models with original data are estimated by the fREML method, which performs fast restricted maximum likelihood computation. The R function bam() from the mgcv package (Wood et al., Citation2015) is able to estimate the GAMs for the large number of observations (more than 6 million possible author pairs).

For better comparability between the parametric and non-parametric hurdle-QAP, we employ the same $N$ permuted data sets as in the parametric approach above. For the model estimation on permuted data, the smoothing parameter is fixed at the value estimated in the corresponding original model to obtain comparable curves for all permutations. To compare the original and the permutation splines, we compute empirical pointwise 95% confidence bands. We consider an effect significant in those areas where the estimated curve from the original model lies outside the confidence band of the permutation spline functions.
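The band construction and the significance rule can be sketched as follows. This is a simplified illustration rather than the authors' code: it assumes that all curves have already been evaluated on a common grid of distance values, and it uses a linear-interpolation quantile, which is one of several possible quantile definitions. All function names are hypothetical.

```python
def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an ascending list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def pointwise_band(perm_curves, level=0.95):
    """Lower and upper band limits per grid point across permutation curves."""
    alpha = (1 - level) / 2
    lower, upper = [], []
    for k in range(len(perm_curves[0])):
        vals = sorted(curve[k] for curve in perm_curves)
        lower.append(quantile(vals, alpha))
        upper.append(quantile(vals, 1 - alpha))
    return lower, upper


def outside_band(orig_curve, lower, upper):
    """True at grid points where the original curve leaves the band."""
    return [y < lo or y > hi for y, lo, hi in zip(orig_curve, lower, upper)]
```

Since the band is pointwise, the resulting flags indicate significance separately at each grid value; they do not control an error rate jointly over the whole distance range.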

2.3.3. Simulation Study

We additionally conduct a simulation study to investigate the performance and robustness of our method using synthetic data. Section A.5.1 in the supplement explains the simulation setup and presents and interprets results. In summary, the study shows that our hurdle-QAP approach performs reliable and robust parameter estimation where data shows a clear (zero-inflated) count pattern. Precision and power increase for growing sample size. The assumed significance level is met. The application of other methods than hurdle-QAP demonstrates the added value of our new method: Using a hurdle rather than a Poisson model leads to more precise point estimates and larger power of the tests for significance of effects. The power achieved with hurdle-QAP is comparable to the power in a hurdle model with p-values calculated based on normal and Student t distributions, with hurdle-QAP ensuring to adequately account for occurring dependencies within the covariance matrices.

2.3.4. Implementation

To enable usage of our novel approach in practice, we developed the R package hurdleqap, which can be downloaded from GitHub at https://github.com/fuchslab/hurdleqap. For analysis, R version 4.0.4 (2021-02-15) [local] and R version 3.6.0 (2019-04-26) [cluster] (R Core Team, Citation2021) were used. All analysis code for data preparation and method application, including publication and distance data, can be found at https://github.com/fuchslab/HurdleQap_Application.

3. Results

3.1. Descriptive analyses and network representations

For interpretation and computational reasons, we excluded the less than 10% of publications with more than 35 authors from the analysis. At Helmholtz Munich (2,451 authors), the remaining publications had on average 3.27 authors (median of 2). 34.2% of the publications involved at least two different institutes, with a maximum of 11 institutes on one paper. At Bielefeld University (2,456 authors), the remaining publications had on average 2.3 authors (median of 2). 12.6% of the publications involved at least two faculties, with a maximum of 13 faculties on one publication. Table A1 shows further properties of the author networks.

Figure 4 compares the empirical cumulative distribution functions (eCDFs) of distances between each collaborating (interdisciplinary) pair of researchers (blue) with those of distances between all (interdisciplinary) author pairs, independently of whether they published together or not (grey). If there was no association between spatial distance and the number of joint publications, both curves would arise from the same distribution. The two-sided Kolmogorov–Smirnov test yields p-values less than $2 \cdot 10^{-16}$ for all comparisons shown in Figure 4. The null hypothesis of samples being drawn from the same distribution can thus be rejected at the $5\%$ level, indicating an effect of distance.
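For readers unfamiliar with the test, the two-sample Kolmogorov–Smirnov statistic is the largest vertical gap between two eCDFs. The following Python sketch computes only this statistic, not the p-value, and uses hypothetical function names; it is meant as an illustration of the comparison and is not taken from the analysis code.

```python
import bisect


def ecdf(sample):
    """Return the empirical cumulative distribution function of a sample."""
    xs = sorted(sample)
    n = len(xs)
    return lambda t: bisect.bisect_right(xs, t) / n


def ks_statistic(a, b):
    """Largest absolute difference between the two eCDFs.

    Both eCDFs are step functions that only jump at observed values,
    so it suffices to evaluate them at the pooled sample points.
    """
    fa, fb = ecdf(a), ecdf(b)
    return max(abs(fa(t) - fb(t)) for t in set(a) | set(b))
```

A value of 0 means that the two eCDFs coincide at all observed points, and a value of 1 means that the two samples do not overlap at all.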

Figure 4. Empirical cumulative distribution functions (eCDFs) of distances between author pairs. Blue: Distances between authors with joint publications; for multiple publications, the distance is considered only once. Grey: Distances between all author pairs, also those without joint publications. A: All author pairs for the Neuherberg campus at Helmholtz Munich; $n_{grey} = 3{,}108{,}774$, $n_{blue} = 25{,}144$. B: Interdisciplinary author pairs, i.e. two authors coming from different institutes, for the Neuherberg campus at Helmholtz Munich; $n_{grey} = 2{,}896{,}176$, $n_{blue} = 14{,}168$. C: All author pairs for Bielefeld University; $n_{grey} = 3{,}286{,}675$, $n_{blue} = 8{,}561$. D: Interdisciplinary author pairs for Bielefeld University; $n_{grey} = 2{,}742{,}805$, $n_{blue} = 2{,}334$. For each panel, the two-sided Kolmogorov–Smirnov test yields p-values less than $2 \cdot 10^{-16}$ regarding the null hypothesis that the two eCDFs arise from the same distribution.

Figure 5 shows the institute networks for Helmholtz Munich and Bielefeld University. Nodes represent institutes and the node size is proportional to the number of collaborative publications per institute. The network weight is defined for each pair of institutes as the maximum number of joint publications between all institutes divided by the number of joint publications of the particular pair. Lower weights, i.e. smaller distances in the graph, indicate more joint publications. The networks show high connectivity between nodes for both institutions. The same applies to the author networks, which are displayed in Supplement Figures A5 to A7: The nodes (authors) have on average 20.77 (Helmholtz Munich) and 7.75 (Bielefeld University) connections to other nodes. In the Helmholtz network, there are only six connected components, whereas at Bielefeld University there are 44, namely one main component containing the majority of nodes and several disconnected smaller networks. Researchers from the same institute tend to work together but are also connected to authors from other institutes. For Bielefeld University, the effect of institute clusters is more pronounced than for Helmholtz Munich, presumably due to the more homogeneous research focus at Helmholtz Munich than at a university.

Figure 5. Institute networks for Helmholtz Munich (left) and Bielefeld University (right). The node size is proportional to the number of collaborative publications per institute. Connectivity between all institutes is apparent, as also reflected by the mean degree of 20.77 (Helmholtz Munich) and 7.75 (Bielefeld University). Colors correspond to research categories as in Figure A1 (Helmholtz Munich), and to institutes as in Figure A2 (Bielefeld University).
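The edge weight used in the institute networks can be illustrated with a short Python sketch. The institute labels and counts below are made up for illustration, and the choice to draw no edge for pairs without joint publications is an assumption of this sketch.

```python
def edge_weights(joint_pubs):
    """Map each institute pair to (maximum joint count) / (pair's joint count).

    joint_pubs: dict from institute pairs to numbers of joint publications.
    Pairs without joint publications receive no edge (an assumption here).
    """
    positive = {pair: n for pair, n in joint_pubs.items() if n > 0}
    max_count = max(positive.values())
    return {pair: max_count / n for pair, n in positive.items()}


# Hypothetical counts for three institutes A, B and C
example = {("A", "B"): 10, ("A", "C"): 2, ("B", "C"): 0}
weights = edge_weights(example)
```

The most productive pair receives the smallest possible weight of 1, so that it is drawn closest together in the graph, and less productive pairs receive proportionally larger weights.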

3.2. Network modeling

We estimate the effect of spatial distance on the number of joint publications using the hurdle-QAP models described before. To reduce sparsity in the distance distribution for Helmholtz Munich, we concentrate on its main campus in Neuherberg rather than including all locations across Germany. For Bielefeld University, we perform the analyses once for the main building and once for all campus buildings (including distances within the main building) to also account for the comfort of staying inside. We consider both the author and the institute networks. For author networks, we further differentiate between interdisciplinary collaboration (i.e. we only consider publications between authors from different institutes) and overall collaboration, because two authors can still belong to different institutes even if their distance from each other is zero. Reported p-values are with respect to a two-sided hypothesis test (see Step 6 of the procedure described in Section 2.3.2.1), i.e. we investigate whether distance has a positive or negative effect on collaboration. For all results we consider a significance level of $α = 0.05$ .

3.2.1. Parametric hurdle-QAP

Estimated coefficients for the effect of distance and publication strength in the parametric hurdle-QAP model are reported in Table 2 (Helmholtz Munich, main campus), Table 3 (Bielefeld University, main building) and Table 4 (Bielefeld University, overall campus), respectively. The estimated intercepts are shown in Tables A2 and A3. Additionally, Figures 6–8 display the empirical density functions of the estimated coefficients across the QAP permutations, including the estimated model coefficient from the original data (dashed line). In the following, we first describe the estimated models for networks between authors, then for networks between institutes.

Figure 6. Helmholtz Munich: empirical density functions of estimated model coefficients across the $N = 1000$ QAP permutations. The estimated coefficients based on the non-permuted data are indicated by the dashed line.

Figure 7. Bielefeld University, main building: density functions as described in Figure 6.

Figure 8. Bielefeld University, overall campus: density functions as described in Figure 6.

Table 2. Helmholtz Munich: estimated hurdle-QAP model coefficients and corresponding QAP p-values. Effects with significant QAP p-values are marked with $*$. ${\hat{β}}_{binDist}$ and ${\hat{β}}_{poisDist}$ represent the coefficients from the binomial and the ZTP model for the distance variable, ${\hat{β}}_{binPub}$ and ${\hat{β}}_{poisPub}$ those for the variable of publication strength.

Display Table

Table 3. Bielefeld University, main building: estimated hurdle-QAP coefficients as described in Table 2.

Display Table

Table 4. Bielefeld University, overall campus: estimated hurdle-QAP coefficients as described in Table 2.

Display Table

For the overall author network at Helmholtz Munich (including all author distances and publications, even if non-interdisciplinary), we obtain significant negative coefficients for the distance effect in both the binomial part (coefficient $-0.0053$, p-value $0.000$) and the ZTP part (coefficient $-0.0017$, p-value $0.000$). Consequently, at the mean distance of $245$ meters and the mean number of averaged joint publications of the two authors’ institutes of $390$, the probability for an author pair to collaborate is $0.0065$. The expected number of publications for a collaborating author pair with these covariate values is $2.17$. With an increase of distance to $600$ meters, the collaboration probability decreases to $0.0010$ and the expected number of publications for a collaborating author pair to $1.57$. The effect of the mean number of publications of the two authors’ institutes is negative for the binomial part and positive for the ZTP part, but neither is significant. For the interdisciplinary author network, we estimate a significant negative effect of distance on the collaboration probability (coefficient $-0.0013$, p-value $0.000$) but a non-significant positive effect of distance on the (conditional) expected number of publications (coefficient $0.0003$, p-value $0.468$). Thus, the probability of active collaboration decreases whilst the (conditional) expected number of interdisciplinary publications increases with increasing distance.
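The predicted values quoted for the overall Helmholtz Munich author network can be approximately retraced from the rounded coefficients. The Python sketch below assumes a complementary log-log link for the binomial part and a log link for the rate parameter $λ$ of the ZTP part, so that the conditional mean is $λ / (1 - e^{-λ})$; it holds publication strength fixed and only shifts distance from 245 to 600 meters. Because the inputs are the rounded values from the text, the results agree with the reported $0.0010$ and $1.57$ only up to rounding.

```python
import math


def cloglog(p):
    """Complementary log-log link."""
    return math.log(-math.log(1.0 - p))


def cloglog_inverse(eta):
    """Inverse of the complementary log-log link."""
    return 1.0 - math.exp(-math.exp(eta))


def ztp_mean(lam):
    """Mean of a zero-truncated Poisson distribution with rate lam."""
    return lam / -math.expm1(-lam)


def ztp_rate_from_mean(mean, lo=1e-9, hi=50.0):
    """Invert ztp_mean by bisection (the mean is increasing in lam)."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if ztp_mean(mid) < mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Rounded values as quoted in the text
beta_bin_dist, beta_pois_dist = -0.0053, -0.0017
shift = 600 - 245  # change in distance in meters

# Binomial part: move the linear predictor, then map back to a probability
p_600 = cloglog_inverse(cloglog(0.0065) + beta_bin_dist * shift)

# ZTP part: recover the rate at 245 m, scale it, then take the truncated mean
lam_245 = ztp_rate_from_mean(2.17)
mean_600 = ztp_mean(lam_245 * math.exp(beta_pois_dist * shift))
```

The same two steps (shift the linear predictor, then apply the inverse link) apply to any other pair of covariate values, provided the assumed links match those of the fitted model.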

For the main building of Bielefeld University, the overall author network reveals significant effects of distance for overall collaboration, both for the model’s binomial (coefficient $- 0.0354$ , p-value $0.000$ ) and ZTP part (coefficient $- 0.0097$ , p-value $0.000$ ). In contrast, in the interdisciplinary case, the effect of distance is non-significantly positive both for the binomial part (coefficient $0.0004$ , p-value $0.854$ ) and the ZTP part (coefficient $0.0021$ , p-value $0.492$ ), meaning an increase in collaboration probability and the expected number of joint publications with increasing distance between author pairs.

For the overall campus of Bielefeld University, we estimate the same effect directions of distance in (overall and interdisciplinary) author networks as for the case where we concentrate on collaboration within the university’s main building. However, the effect of distance is significant only for the overall author network, both in the binomial and ZTP part (coefficients $- 0.0030$ and $- 0.0007$ , p-values $0.000$ ).

Considering the institute networks, effect directions are mixed across Helmholtz Munich, Bielefeld University main building and Bielefeld University overall campus. Only the effect of the involved institutes’ publication strengths on the ZTP part is always positive and in case of Bielefeld University’s overall campus even significant (coefficient $0.0023$ , p-value $0.048$ ). There is no significant effect of distance on either of the models’ binomial or ZTP part.

3.2.2. Non-parametric hurdle-QAP model

We next describe the non-parametric estimates of the effect of distance and institutes’ publication strength on collaboration probability and joint publications. Figures 9–11 show smoothed curves for the effect of distance on the models’ binomial and ZTP parts. They compare the curves based on the original non-permuted data (thick red lines) to those resulting from QAP-permuted data (thin grey lines, $N = 1000$) and their empirical pointwise $95\%$ confidence band (blue-shaded areas). We consider an effect significant on an interval of distance values if the red curve lies outside this confidence band. The red-shaded area indicates the $95\%$ confidence interval of the original curve. For convenient interpretation, the effect curves are displayed on the response scale, keeping the covariate of publication strength at its mean value. The values of the curves can thus be interpreted as the predicted outcome, i.e. the probability of collaboration for the binomial part and, conditional on collaboration taking place, the number of publications for the ZTP part. For comparison, the dotted lines indicate the prediction from the parametric model with linear effect. For the linear effect of publication strength, we show empirical density functions of estimated model coefficients across the QAP permutations as in Figures 6–8. The original spline functions without prediction on the response scale are displayed in Section A.5. Tables 5–7 show effective degrees of freedom (edf) of the distance spline functions and the linear coefficients for effects of publication strength for the original binomial and ZTP part, each for the overall and the interdisciplinary author network and the institute network.

Figure 9. Helmholtz Munich: The first two columns show the predicted response (column one: collaboration probability; column two: expected number of publications) in the non-parametric model as a function of distance. Thick red lines show the predicted response based on the original non-permuted data. Thin grey lines result from QAP-permuted data ($N = 1000$). The red-shaded area indicates the $95\%$ confidence interval of the original predicted response. The bold blue lines represent the lower and upper bounds of the pointwise $95\%$ confidence bands (shaded areas in light blue) over the permuted estimates. Effect curves are displayed on the predicted response scale (probability of collaboration for the binomial and expected number of publications for the ZTP part), keeping the covariate of publication strength at its mean value. Values of observed distances are shown as black ticks on the x-axis to visualise the sparsity of the data basis. The dotted line represents the predicted response of the parametric model with linear effect. The third and fourth columns display the parametric hurdle-QAP estimates for the effect of publication strengths as described in Figure 6.

Figure 10. Bielefeld University, main building: Non-parametric hurdle-QAP estimates as described in Figure 9.

Figure 11. Bielefeld University, overall campus: Non-parametric hurdle-QAP estimates as described in Figure 9.

Table 5. Helmholtz Munich: effective degrees of freedom (edf) of non-parametric spline functions and estimated coefficients as described in Table 2. edf indicate how wiggly a curve is. Values of one or close to it describe an (almost) linear effect. The higher the value, the wigglier the curve.

Display Table

Table 6. Bielefeld University, main building: edf of non-parametric spline functions and estimated coefficients as described in Table 5.

Display Table

Table 7. Bielefeld University, overall campus: edf of non-parametric spline functions and estimated coefficients as described in Table 5.

Display Table

As before, we first describe the estimated effects for author networks, then for institute networks. The non-parametric hurdle-QAP models reveal smoothed effects of distance for both author networks at Helmholtz Munich. In the overall network, the predicted responses of both the binomial and the ZTP part decrease for values up to approximately 100 meters and lie significantly below the response predicted from permuted data. Up to about 400 meters, the predicted response is below the one from the parametric model. Towards the right (distances of more than 700 meters), observations become sparser and estimates thus less reliable. For the interdisciplinary network, the collaboration probability goes up and down with increasing distance, with a general decreasing trend. The predicted response of the ZTP part decreases up to about $200$ meters, followed by an increase, most of the time within the bounds of the confidence band. For both author networks, for distances of up to 100 meters, both the probability to collaborate at all and the number of joint publications decrease with increasing distance between authors. After that, the effect becomes less pronounced but, for the overall author network, still points in the same direction. For the overall author network, the predicted response curves lie mostly (and clearly) outside the $95\%$ confidence band, while the curves for the interdisciplinary author network fall to a large extent into the band.

For the overall author networks at Bielefeld University, we observe similar patterns, both for the main building and the overall campus: After a steep decline of the predicted response for distances of up to 20–50 meters, there is an increase and another (partly slight) decrease/increase wave afterwards. The predicted responses lie almost everywhere outside the $95\%$ confidence band, indicating significant effects of distance on the binomial and ZTP part. For the interdisciplinary author network in the main building, on the other hand, the predicted responses in the binomial and ZTP part increase directly from the beginning. The binomial part shows two peaks whereas the curve for the ZTP part has its maximum in between these two. Those peaks are significant both for the binomial and the ZTP part. The predicted response as a function of distance in the interdisciplinary author network regarding the overall campus differs between the binomial and the ZTP parts. The former, after a small initial increase, decreases significantly from about 100 to 500 meters and then increases to its maximum at 750 meters. The predicted response of the ZTP part is monotonically increasing and not significant.

For the institute networks, the predicted response curves lie mostly within the $95\%$ confidence band. The probability of collaboration, however, differs from the one in the parametric model for Helmholtz Munich and the main campus of Bielefeld University. The (conditional) expected number of publications evolves similarly as in the parametric model when regarded as a function of distance.

In all datasets (Munich and Bielefeld) and networks (overall/interdisciplinary authors, institutes), the estimated effects of publication strength in the non-parametric model resemble those in the parametric models in terms of direction (except for ${\hat{β}}_{binPub}$ in the overall author network of the Bielefeld University overall campus), approximate value and significance.

4. Discussion

In an age in which interdisciplinary research is becoming more and more important, we have been looking at the extent to which this can be promoted by spatial structures. This led us to the statistical challenge of a multivariate network regression model where both the target variable and the covariates are in matrix form; and where these matrices have both row and column dependencies to be accounted for, and the target matrix is sparse.

To tackle this methodological problem, we combined the strengths of two established approaches: the quadratic assignment procedure (QAP) to correct for dependencies, and the hurdle model to consider excess zeros. Our proposed parametric and non-parametric hurdle-QAP models enable us to appropriately analyze the impact of spatial distance within research institutions on interdisciplinary collaboration.

We compiled data about institute locations and publications at Helmholtz Munich and Bielefeld University. For this, we used publicly available information and extracted relevant data such as author affiliations in an automated manner. We made our text-mining pipeline available online such that the analysis can be directly transferred to other institutions.

We considered the spatial distance and the general publication strength as independent variables. The effect of distance was of primary interest; publication strength was included in the model as a covariate in order not to distort the results by subject-specific publication habits. In the overall author networks, we consistently see significant effects of distance on publication behavior: A lower distance increases both the probability of joint publication and, in the case of collaboration, the expected number of such joint publications. This holds for both institutions considered and results from both the parametric and the non-parametric model. However, this statement changes as soon as we restrict our consideration to interdisciplinary author networks: At Helmholtz Munich, the direction of the effect remains only for the probability of collaboration and is still significant. At Bielefeld University, on the other hand, the effect is even reversed: According to the parametric estimation, a larger distance increases both the probability of collaboration and the then resulting expected number of publications. These effects, however, are not significant. The non-parametric model provides a somewhat more differentiated picture: According to it, the effects fluctuate; they support the previously described results for the overall author networks, especially for very small or very large distances. For institute networks, we hardly find any significant effect of spatial distance on joint publication activity.

Overall, the negative effect of distance (i.e. larger distance leading to less collaboration) is mainly driven by coauthors from the same institute, that is, non-interdisciplinary collaboration. For interdisciplinary collaboration, distance matters less. Especially in the case of Bielefeld University, we suspect that working under one roof is more important than precise office locations within the building. In addition, we show that at the university with its broad spectrum of studies, authors are less connected across institutes than at the considered research center with its more focused research program.

When interpreting associations, care should be taken to avoid false conclusions regarding causality: It would be possible that collaboration arises from spatial proximity; conversely, however, offices could also have been assigned because of professional proximity. This could be ruled out by means of suitable further information. In the case considered here, we have included a further covariate matrix that describes professional proximity (see supplement), leading to the same conclusions as described above.

Beyond the considered use case, our proposed hurdle-QAP models may reveal insights in any network regression model with falsifying dependencies and excess zeros: for example, social networks (Lewis et al., Citation2012), power systems (Levron & Belikov, Citation2017) or genome-wide gene expression analyses (Shen et al., Citation2014). The application will determine whether the content motivates rather the parametric or the non-parametric approach. In the present consideration, values of distance are irregularly distributed, and a non-linear effect appears obvious. Still, both approaches deliver compatible results, where the computationally more demanding non-parametric estimation draws a more detailed picture and provides more information about the uncertainty of estimates.

In the context of more than one covariable matrix, Dekker et al. (Citation2007) proposed residual permutation methods rather than permuting the entries of the target matrix to appropriately account for correlation between covariables. In the application considered here, the two covariable matrices $X_{dist}$ and $X_{numPub}$ are deterministically determined by the authors’ institutes and thus linked though neither correlated nor causally related (see also Section A.5.2 in the supplement). We thus consider residual permutation dispensable. Beyond the considered use case, however, we see it as an interesting extension of our work to transfer the idea of residual permutation to the described combination of a binomial and a zero-truncated Poisson model.

Our research question about the relationship between spatial proximity and scientific (interdisciplinary) collaboration gained sudden relevance due to the changed working conditions since the onset of the COVID-19 pandemic. The abrupt switch to primarily digital communication overcomes spatial distance and might change established networks: On the one hand, experience in handling digital communication tools may facilitate interdisciplinary and international collaboration. On the other hand, informal exchange and the building of personal relationships become more difficult. The data in our study come from a time when spatial proximity served both professional exchange and social relationship maintenance. The question is to what extent these aspects are influenced by changed work situations and how they in turn affect scientific (interdisciplinary) collaboration. This is of ongoing importance as the establishment of digital communication methods seems to be persistent: They continue to be used even when meetings in person are again possible (Remmel, Citation2021). As already emphasized by Koopmann et al. (Citation2021), the changed situation leads to a greater importance of social proximity, also in the professional context. The impact may become apparent in a few years when sufficient publication data are available on research conducted since the onset of the pandemic. A follow-up study could analyze changes in publication networks and put them in context, across spatial and disciplinary boundaries.

Acknowledgement(s)

We thank Irina Janzen, Nina Langius and Minh Viet Tran for help in data collection. Further, we thank Elmar Spiegel and Edoardo Marchi for reviewing and methodological discussions.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Supplementary material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/0022250X.2023.2180000

Additional information

Funding

We acknowledge support for the publication costs by the Open Access Publication Fund of Bielefeld University and the Deutsche Forschungsgemeinschaft (DFG).

References

Aria, M., & Cuccurullo, C. (2017). Bibliometrix: An R-tool for comprehensive science mapping analysis. Journal of Informetrics, 11(4), 959–975.
Web of Science ®Google Scholar
Butts, C. T. (2019). Sna: Tools for social network analysis. R package version 2.7. https://CRAN.R-project.org/package=sna .
Google Scholar
Cai, M., Wang, W., Cui, Y., & Stanley, H. E. (2018). Multiplex network analysis of employee performance and employee social relationships. Physica A: Statistical Mechanics and Its Applications, 490(C), 1–12.
Google Scholar
Cameron, A. C., & Trivedi, P. K. (2013). Regression analysis of count data (Vol. 53, Second edition edn ed.). Cambridge University Press. Econometric Society monographs.
Google Scholar
Choi, H., Gim, J., Won, S., Kim, Y. J., Kwon, S., & Park, C. (2017). Network analysis for count data with excess zeros. BMC Genetics, 18(1), 93.
PubMedGoogle Scholar
Claudel, M., Massaro, E., Santi, P., Murray, F., & Ratti, C. (2017). An exploration of collaborative scientific production at MIT through spatial organization and institutional affiliation. PLoS One, 12(6), e0179334.
PubMed Web of Science ®Google Scholar
Cranmer, S. J., Leifeld, P., McClurg, S. D., & Rolfe, M. (2017). Navigating the range of statistical tools for inferential network analysis. American Journal of Political Science, 61(1), 237–251.
Web of Science ®Google Scholar
Csardi, G., & Nepusz, T. (2006). The igraph software package for complex network research. Interjournal, Complex Systems, 1695(5), 1–9.
Google Scholar
Dekker, D., Krackhardt, D., & Snijders, T. (2003). Multicollinearity robust QAP for multiple-regression. NAACSOS Conference 2003. Pittsburgh, PA. http://www.casos.cs.cmu.edu/publications/papers/dekker_2003_multicollinearity.pdf.
Google Scholar
Dekker, D., Krackhardt, D., & Snijders, T. A. B. (2007). Sensitivity of MRQAP tests to collinearity and autocorrelation conditions. Psychometrika, 72(4), 563–581.
PubMed Web of Science ®Google Scholar
Fahrmeir, L., Kneib, T., Lang, S., & Marx, B. (2013). Regression: Models, methods and applications. Springer.
Google Scholar
Farine, D. (2013). Animal social network inference and permutations for ecologists in R using asnipe. Methods in Ecology and Evolution, 4(12), 1187–1194.
Web of Science ®Google Scholar
Feng, S., & Kirkley, A. (2020). Mixing patterns in interdisciplinary co-authorship networks at multiple scales. Scientific Reports, 10(1), 7731.
PubMed Web of Science ®Google Scholar
Google. (2019). Google Maps: https://www.google.de/maps.
Google Scholar
Gui, Q., Liu, C., & Debin, D. (2019). Globalization of science and international scientific collaboration: A network perspective. Geoforum, 105, 1–12. https://www.sciencedirect.com/science/article/abs/pii/S0016718519302040.
Web of Science ®Google Scholar
Helmholtz Association of German Research Centers. (2020). https://www.helmholtz.de/en/; 09/14/2020.
Google Scholar
Helmholtz Zentrum München. (2020). https://www.helmholtz-muenchen.de/en/about-us/profile/facts-at-glance/index.html; 09/14/2020.
Google Scholar
Hoff, P. D. (2015). Dyadic data analysis with amen. arXiv article: arXiv:1506.08237.
Google Scholar
Hoff, P., Fosdick, B., & Volfovsky, A. (2020). Amen: Additive and multiplicative effects models for networks and relational data. R Package Version 1.4.4. https://cran.r-project.org/web/packages/amen/.
Google Scholar
Hoff, P. D., Raftery, A. E., & Handcock, M. S. (2002). Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460), 1090–1098.
Web of Science ®Google Scholar
Hubert, L. J. (1987). Assignment methods in combinatorial data analysis. Statistics, (Vol. 73). Dekker.
Google Scholar
Hwayoon, S., Barnett, G. A., & Nam, Y. (2021). A social network analysis of international tourism flow. Quality & Quantity, 55(2), 419–439.
Google Scholar
Justin, S., & Johnson, J. C. (2021). How inter-state amity and animosity complement migration networks to drive refugee flows: A multi-layer network analysis, 1991-2016. PLoS One, 16(1), e0245712.
PubMed Web of Science ®Google Scholar
Kleiber, C., & Zeileis, A. (2015). Count data regression with excess zeros: A flexible framework using the GLM toolbox. Workshop of the ERCIM Working Group on Computational and Methodological Statistics 2015. London, United Kingdom. https://www.zeileis.org/papers/ERCIM-2015.pdf.
Koopmann, T., Stubbemann, M., Kapa, M., Paris, M., Buenstorf, G., Hanika, T., Hotho, A., Jäschke, R., & Stumme, G. (2021). Proximity dimensions and the emergence of collaboration: A HypTrails study on German AI research. Scientometrics, 126, 9847–9868. https://link.springer.com/article/10.1007/s11192-021-03922-1.
Krackhardt, D. (1988). Predicting with networks: Nonparametric multiple regression analysis of dyadic data. Social Networks, 10(4), 359–381.
Kulahci, I. G., Rubenstein, D. I., Bugnyar, T., Hoppitt, W., Mikus, N., & Schwab, C. (2016). Social networks predict selective observation and information spread in ravens. Royal Society Open Science, 3(7), 160256.
Lee, T. (2019). Network comparison of socialization, learning and collaboration in the C40 cities climate group. Journal of Environmental Policy & Planning, 21(1), 104–115.
Levron, Y., & Belikov, J. (2017). Reduction of power system dynamic models using sparse representations. IEEE Transactions on Power Systems, 32(5), 3893–3900.
Lewis, K., Gonzalez, M., & Kaufman, J. (2012). Social selection and peer influence in an online social network. Proceedings of the National Academy of Sciences, 109(1), 68–72.
Liu, B., Huang, S., & Hui, F. (2017). An application of network analysis on tourist attractions: The case of Xinjiang, China. Tourism Management, 58, 132–141.
Mullahy, J. (1986). Specification and testing of some modified count data models. Journal of Econometrics, 33(3), 341–365.
Nature. (2015). Mind meld. Nature, 525(7569), 289–290.
Ngugi, K. J. (2018). Social networks stimuli: A double Dekker semi-partialling multiple regression quadratic assignment procedure (MRQAP) approach: Case of small-scale farmers in Kenya. Biomedical Journal of Scientific & Technical Research, 11(4), 8716–8719.
PubMed. (2021). https://pubmed.ncbi.nlm.nih.gov; 03/05/2021: Citation database.
R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing.
Remmel, A. (2021). Scientists want virtual meetings to stay after the COVID pandemic. Nature, 591(7849), 185–186.
Robins, G., Pattison, P., Kalish, Y., & Lusher, D. (2007). An introduction to exponential random graph (p*) models for social networks. Social Networks, 29(2), 173–191.
Salazar Miranda, A., & Claudel, M. (2021). Spatial proximity matters: A study on collaboration. PLoS One, 16(12), e0259965.
Scopus. (2021). https://www.scopus.com; 03/05/2021: Citation database.
Shen, N., Jing, L., Jin, C., & Zhou, P. (2014). Sparse gene expression data analysis based on truncated power. In 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (pp. 39–44). Belfast, United Kingdom.
Sidorov, S. P., Faizliev, A. R., Balash, V. A., Gudkov, A. A., Chekmareva, A. Z., Levshunov, M., & Mironov, S. V. (2018). QAP analysis of company co-mention network. In A. Bonato, P. Prałat, & A. Raigorodskii (Eds.), Algorithms and models for the web graph. Lecture Notes in Computer Science (Vol. 10836, pp. 83–98). Springer.
Simpson, W. (2001). The quadratic assignment procedure (QAP). North American Stata Users’ Group Meetings 2001 1.2, Stata Users Group. Boston, Massachusetts.
University, B. (2020). Daten und Zahlen - Universität Bielefeld; https://www.uni-bielefeld.de/uni/profil/daten-zahlen/; 09/14/2020.
Web of Science. (2021). https://www.webofknowledge.com; 03/05/2021: Citation database.
Wood, S. N., Goude, Y., & Shaw, S. (2015). Generalized additive models for large data sets. Journal of the Royal Statistical Society: Series C (Applied Statistics), 64(1), 139–155.
Zeileis, A., & Kleiber, C. (2020). Countreg: Count data regression. R package, version 0.2-1. https://rdrr.io/rforge/countreg/.
Zeileis, A., Kleiber, C., & Jackman, S. (2008). Regression models for count data in R. Journal of Statistical Software, 27(8), 1–25.