The impact of Big Data on the industry

I met Professor Kopal [1] a few years ago at a conference about the impact of Big Data on the industry. His lecture about social networks and their impact on business sectors such as the telecom and financial industries was brilliant. This is an excellent opportunity to present this scientist and high-ranking officer of the Ministry of the Interior of the Republic of Croatia to the readers of WSI magazine.

Prof. Kopal, your impressive career continued at the Ministry of the Interior of the Republic of Croatia, where your current position is State Secretary. What kinds of challenges do you meet in your work?

I will try to answer your question briefly by addressing five key challenges. The first challenge is a shift from a reactive to a proactive mode, which requires law enforcement to move from reporting (what happened?), analysis (why did it happen?) and monitoring (what is happening now?) to forecasting (what might happen?), prediction (what is likely to happen?) and prescription (what should be done?). This also includes full implementation of the intelligence-led policing model, as well as Ratcliffe's 3i format: Interpret crime environment > Influence decision makers > Impact crime environment.

The second challenge is information and data resources, and their effective and efficient use. In this regard, it is necessary to change the direction of the intelligence process. If hypotheses are based exclusively on data, you cannot “conceive the inconceivable” and you cannot answer the question “what do we not know that we do not know”. But if, for example, morphological analysis is first applied to generate hypotheses that your brain “objects to” (because they go against your logic and your experience), and a system of indicators is then built, the result is a proactive early warning system (with an appropriate set of indicators). A hypothesis stands until we confirm or reject it. It is important to understand that absence of evidence is not evidence of absence.

The third challenge is an appropriate exchange and integration of data with all other components of the security system at the national and international levels. Setting up a unique intelligence taxonomy at the national level is a special challenge.

The fourth challenge is human resources and their training, including their formal, informal and non-formal knowledge and skills. Here we can apply a quote by Maslow: “I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail.” Some of the challenges in this area are the “extinct by instinct” issue, the expert blindness phenomenon and, consequently, intelligence failures. We cannot solve our problems with the same thinking we used when we created them. The fifth challenge also refers to human resources, in particular to their material and financial status, working conditions, working resources, etc.

The Republic of Croatia is a full member of the EU and NATO. During the past decade, European countries have been encountering challenges such as terrorism and different kinds of organized crime (human trafficking, drugs and goods smuggling, etc.). What can you say about the impact of social networks as a means of communication among terrorists and other criminal groups?

(Social) Network Analysis (SNA) is becoming an increasingly significant methodology in security and intelligence structures dealing with the detection and combating of so-called dark networks, which comprise (but are not limited to) members of terrorist organizations. At this point I would like to emphasize the intensive interconnectedness between organized crime groups and terrorists; in this respect, any network analysis of one of them needs to keep both in focus. It is also important to disambiguate the term (Social) Network Analysis: it does not refer only to Facebook, Twitter, LinkedIn and other “social” networks, but includes all other types of networks, animate or inanimate.

Social Network Analysis is a group of analytical methods used to provide an overview of, and to measure, the connections and flows (e.g. of transactions, influence, information or goods) between persons, groups, organizations and other entities. It helps identify hidden links and degrees of influence between nodes (entities), and it is also the most complex of the three levels of network analysis.

A network can be formally defined as a group of nodes (network members) connected by different types of relationships, formally called links. SNA treats those links as the foundation of the social world and examines relationship attributes from a significantly different perspective than traditional data analysis methods. I will give you a very specific example of its use (the terrorist attack of 11 September 2001, using only open source data).
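Before turning to that example, the formal definition can be made concrete with a minimal sketch. The node names and links below are hypothetical placeholders, not data from any real network; the sketch only shows how a list of links becomes a structure in which each node's direct neighbors can be looked up.

```python
from collections import defaultdict

# Hypothetical links: each pair is one undirected relationship between two nodes.
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

def build_network(links):
    """Return an adjacency map: node -> set of directly linked nodes."""
    adjacency = defaultdict(set)
    for u, v in links:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)

network = build_network(links)

# The number of links attached to each node (its degree).
degree = {node: len(neighbors) for node, neighbors in network.items()}

print(sorted(network))  # the nodes of the network
print(degree)           # how many links each node has
```

Everything that follows (path lengths, centrality, components) can be computed from an adjacency map of this kind.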

The analysis of open source data, such as the data gathered on members of the terrorist group responsible for the 9/11 attacks, can reveal certain network regularities and behavioral patterns, and thus the methods generally used by Al-Qa’ida in its organization; it can also indicate how useful and relevant a particular analysis technique is.

Although the analysis of terrorist groups is a relatively new area, Valdis E. Krebs, a consultant and researcher of organizational and other networks, collected and analyzed publicly available data published in major newspapers such as The New York Times, The Wall Street Journal, The Washington Post and the Los Angeles Times. Krebs was the first to attempt to explain the roles of the participants in the attack, their importance in relation to the entire terrorist network that took part in organizing and carrying out the attack, and the channels for information exchange. I should point out that Krebs took care to explain the provenance of the data, given that it came from open sources; although those were all renowned papers, their resources and data sources are limited and, therefore, we must consider the fact that those investigating the attacks did not publish all the relevant information, and that even certain misinformation was published.

Furthermore, it is interesting that the results of the criminal investigation, such as the number of hijackers, who they were, which planes they hijacked and which nations' passports they had used to get into the US, were available to the public only a week after the attacks. Using the open source data and following the recommendations provided by Krebs, my associates and I produced a map in which the links are divided into three degrees, with tie strength governed by the amount of time a pair of hijackers spent together. Thus, those living together, attending the same school or the same training had strong ties. Those travelling together and participating in meetings together had ties of moderate strength, and those having only one financial transaction together, or an occasional meeting and no other ties, were classified as weak ties within the network. It should be noted that certain data were not published in real time; with such data, one should be very cautious when adding new entities (nodes) and links to the network, to ensure the authenticity of the information. This is because a certain amount of published data is often later refuted.
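The three-degree coding of ties described above could be sketched as follows. The numeric weights (3, 2, 1), the contact-type labels and the rule of taking the strongest recorded contact for a pair are all assumptions made for illustration; the interview does not specify how the map was encoded, and the node names are placeholders.

```python
STRONG, MODERATE, WEAK = 3, 2, 1  # assumed numeric coding of the three degrees

# Assumed mapping of contact types to tie strength, following the description above.
TIE_STRENGTH = {
    "lived_together": STRONG,
    "same_school": STRONG,
    "same_training": STRONG,
    "travelled_together": MODERATE,
    "joint_meeting": MODERATE,
    "single_transaction": WEAK,
    "occasional_meeting": WEAK,
}

def tie_weight(contacts):
    """Strength of a tie, taken here as the strongest contact type recorded for the pair."""
    return max(TIE_STRENGTH[contact] for contact in contacts)

# Hypothetical observations for three pairs of placeholder nodes.
observations = {
    ("P1", "P2"): ["lived_together", "travelled_together"],
    ("P2", "P3"): ["joint_meeting"],
    ("P3", "P4"): ["single_transaction"],
}

weighted_links = {pair: tie_weight(contacts) for pair, contacts in observations.items()}
print(weighted_links)
```

Keeping the raw contact types next to the derived weight makes it easier to revise a tie later if a published report is refuted.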

The analysis of the network, classified as a small network (fewer than 20 nodes), has shown that the terrorist network is very sparse (density 0.1871) and that an average path length of 2.49 steps between hijackers is extremely long for such a small network, particularly as some hijackers were as many as 6 steps away from each other. Even hijackers on the same flight could be more than two steps apart. The distance between the hijackers indicates a strategy of deliberately keeping cell members distant from each other, which minimizes damage to the network if one of the cell members is captured or compromised. Osama Bin Laden even confirmed this network structure when he described his plan in a videotape found in Afghanistan. In the transcript he stated: “Those who were trained to fly didn't know the others. One group of people did not know the other group”. The resulting SNA metrics and Bin Laden’s comments indicate that this covert network traded efficiency for secrecy. The question here is how such a covert network accomplishes its goals. The answer lies in forming shortcuts in the network. Thus, the analyzed terrorist network held meetings and in this way connected the distant parts of the network and coordinated tasks. Once coordination was accomplished, the cross-ties were put on hold or, to use the terminology of terrorist networks, the cells went “dormant”. There is information that one such hijacker meeting was held in Las Vegas. Adding six new ties to the map, representing contacts established at the meeting, reduced the average path length between hijackers by 43%, thus improving the information flow in the network.
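The two metrics used above, density and average path length, and the effect of shortcuts can be illustrated with a small sketch. It uses a hypothetical chain of eight nodes, not the hijacker data. Density is taken as the number of ties divided by the number of possible pairs, n(n-1)/2; under that definition, and assuming 19 nodes, the reported value of 0.1871 would correspond to 32 ties out of 171 possible pairs (a count inferred from the arithmetic, not stated in the interview).

```python
from collections import deque

def build_adjacency(nodes, links):
    """Adjacency map for an undirected network."""
    adj = {n: set() for n in nodes}
    for u, v in links:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def density(node_count, link_count):
    """Share of all possible pairs that are actually linked."""
    return 2 * link_count / (node_count * (node_count - 1))

def average_path_length(adj):
    """Mean number of steps between all pairs of mutually reachable nodes."""
    total = 0
    pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from the source node
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# Hypothetical sparse network: a chain of eight nodes.
nodes = list(range(8))
chain = [(i, i + 1) for i in range(7)]
# Hypothetical shortcuts created by a coordination meeting.
shortcuts = [(0, 7), (2, 5)]

before = average_path_length(build_adjacency(nodes, chain))
after = average_path_length(build_adjacency(nodes, chain + shortcuts))
reduction = (before - after) / before

print(round(density(19, 32), 4))  # assumed counts, see above → 0.1871
print(before, after, f"{reduction:.0%}")
```

A few cross-ties between distant parts of the chain shorten the average path noticeably, which is the mechanism the Las Vegas meeting illustrates.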

An analysis of the terrorist network by centrality metrics might suggest that the individuals who participated in the Pentagon attack scored the highest on the SNA metrics. This leads to the conclusion that those were persons of confidence and that the Pentagon was the primary target of the attack. Although the fact that the hijackers were previously acquainted and linked makes this terrorist network extremely strong, the analysis has shown certain weaknesses of the network that were not recognized at the time. In particular, although not all ties between hijackers were known, it appeared that a large number of ties were concentrated precisely around the pilots, which is very risky for this type of covert network because concentrating both unique skills and connectivity in one node makes the network easier to disrupt once it is discovered. Detecting and eliminating those network nodes which have unique skills may cause maximum damage to the accomplishment of the task. In this case, pilots would be the most desirable targets for removal from the network. As they were an important link, and given that their pilot training required contact with the “outside world”, this somewhat weakened the network's secrecy. However, we can assume that the planning and preparation of the attacks were not limited to the efforts of 19 people. Therefore, a new, larger map was produced with an additional 43 persons who did not board the planes but were assisting in raising funds for carrying out the task, and who had additional knowledge and skills needed to commit such a demanding terrorist attack.

The centrality metrics analysis of the new, larger network singles out Mohamed Atta as a node with a large number of ties, while the remaining centrality metrics for Mohamed Atta (closeness and betweenness) showed almost the highest values. Based on the centrality metrics, the following conclusion can be reached: the number of ties shows Atta’s activity in the network; closeness measures his ability to access others in the network, and with it the possibility to control other members; and betweenness shows that it was Atta who controlled the flow of information in the network, playing the role of a mediator. All the above has confirmed the important role he played. The second node according to the number of ties and other centrality metrics is Marwan Al-Shehi, the pilot of the plane that attacked the WTC South Tower. Given that the two most important entities participated in the WTC attack, the analysis of the larger network shifts its focus from the Pentagon as a primary target to the WTC.
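The three centrality metrics mentioned above can be sketched in plain Python. The network below is a hypothetical toy example with placeholder names, not the 9/11 data. Betweenness is computed with the accumulation scheme commonly attributed to Brandes; closeness is taken as the number of reachable nodes divided by the sum of distances to them, which is one of several conventions in use.

```python
from collections import deque

def build_adjacency(links):
    adj = {}
    for u, v in links:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def bfs(adj, source):
    """Visit order, distances, shortest-path counts and predecessors from source."""
    dist = {source: 0}
    sigma = {v: 0 for v in adj}
    sigma[source] = 1
    preds = {v: [] for v in adj}
    order = []
    queue = deque([source])
    while queue:
        u = queue.popleft()
        order.append(u)
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
            if dist[w] == dist[u] + 1:  # u lies on a shortest path to w
                sigma[w] += sigma[u]
                preds[w].append(u)
    return order, dist, sigma, preds

def degree_centrality(adj):
    """Number of ties per node: a measure of activity."""
    return {v: len(adj[v]) for v in adj}

def closeness_centrality(adj):
    """Reachable nodes divided by total distance to them: ease of access to others."""
    result = {}
    for s in adj:
        _, dist, _, _ = bfs(adj, s)
        total = sum(dist.values())
        result[s] = (len(dist) - 1) / total if total else 0.0
    return result

def betweenness_centrality(adj):
    """Number of shortest paths between other pairs that pass through each node."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        order, _, sigma, preds = bfs(adj, s)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # In an undirected network every pair is counted from both ends.
    return {v: b / 2 for v, b in score.items()}

# Hypothetical network: a hub H with three contacts, one of which (C) leads on to D.
adj = build_adjacency([("H", "A"), ("H", "B"), ("H", "C"), ("C", "D")])
print(degree_centrality(adj))
print(closeness_centrality(adj))
print(betweenness_centrality(adj))
```

In this toy network the hub H leads on all three measures, while C scores on betweenness only because it is the sole route to D, the kind of mediator position the interview attributes to Atta.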

We should also take into consideration the fact that there are some nodes and subnetworks missing in this network. As the centrality metrics are very sensitive to even minor changes in network connectivity, information about one or two new associates, but also detection of new ties in the existing network, may lead to a change in the values of centrality metrics. One must, therefore, be especially careful with incomplete data. As we can assume that intelligence analysts rarely have access to complete data on a particular terrorist organization, network analysis methods often intertwine with data mining. A common name for such analysis is Criminal Network Analysis.

Given that we are talking about a terrorist network, which by its very definition is not a classic social network, the entities in the network are first and foremost focused on secrecy. They avoid forming relationships outside of the network and minimize contact with persons who were part of their terrorist training. Such ties remain latent and are invisible to persons who are not part of their terrorist cell. However, completing the task requires an exchange of information between terrorists and cells within the terrorist organization. Given that the entities must reconcile the need for secrecy with the need for intensive communication, investigators of such networks must take advantage of the moment of intensive communication necessary for carrying out the task in order to identify its members. To ensure both information exchange and secrecy, members of covert networks use various tactics and methods of covert communication, ranging from simply having one person act on behalf of a particular cell, thus sparing its members the use of modern means of communication, to exchanging messages in online games such as “World of Warcraft”.

The example of the analyzed group shows that past acquaintances and ties made this terrorist network extremely strong. Even on the basis of such questionable and incomplete information, one can conclude that a large number of ties were concentrated around the pilots, which is very risky for this type of covert network because concentrating both unique skills and connectivity in the same nodes makes the network easier to disrupt once it is discovered. Detecting and eliminating those network nodes which have unique skills may cause maximum damage to the accomplishment of the task. In this case, pilots would be the most desirable targets for removal from the network. Of course, such a conclusion could only be reached once the terrorist attack had already occurred and information that would have been difficult to collect during the covert planning phase had been revealed (however, one could have acted proactively by using SNA and other types of analysis). Data analysis shows that upon their arrival in America, the terrorists suppressed their communication, which made it very difficult to discover their true intentions.

The data indicated that preventive action was possible in the preparation phase, during the time when the terrorists were staying in Germany, where their social life was much more active. Thus, their mutual ties confirm that most of them were connected to the Al-Qa’ida cell in Hamburg. The 9/11 terrorist attack shows at the same time the power arising from dense connections and the resilience of the network. The example demonstrates the potential of network analysis to detect and understand the activities of terrorist groups. Of course, terrorist groups evolve and adapt to the global, regional and local state of play, which makes their detection and monitoring more difficult. However, the development of terrorist groups is accompanied by the development of algorithms and network analysis techniques, which depend on the groups themselves, the type of data and the way agencies collect and keep data. The latest data science research shows the enormous potential of so-called Laplacian centrality.

To put it simply, the significance of a node in the network is determined by the drop in the energy of the network following a simulated removal of that node. In other words, what happens to the energy in the network once we remove a particular node? A non-weighted network may be considered a special type of weighted network, where each edge weight is set to 1. It was interesting to see the results of applying the Laplacian method to a non-weighted 9/11 terrorist network (Qi, Fuller, Wu, Wu and Zhang). This is a large network including a total of 62 persons, but the centrality metrics were calculated for the 19 hijackers. The values are normalized by dividing by the highest value for each measure and are shown in a table for the first 5 persons according to the Laplacian centrality criterion. The first four persons according to Laplacian centrality differ from the persons singled out by the other metrics. When the identity of those four persons was checked, it was discovered that they were the very pilots of the four 9/11 flights (AA 11, AA 77, UA 175 and UA 93). Thus, the pilot Ziad Jarrah ranked fourth according to Laplacian centrality, fifth by degree, also fifth by closeness, and only ninth by betweenness.
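The idea of measuring a node by the energy lost when it is removed can be sketched for the non-weighted case. The sketch assumes the usual definition of Laplacian energy for a weighted network, the sum of squared node degrees plus twice the sum of squared edge weights, which with all weights set to 1 becomes the sum of squared degrees plus twice the number of links. The network and node names are hypothetical, not the 62-person data set.

```python
def build_adjacency(links):
    adj = {}
    for u, v in links:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def laplacian_energy(adj):
    """Sum of squared degrees plus 2 per link (every edge weight is 1)."""
    degree_term = sum(len(neighbors) ** 2 for neighbors in adj.values())
    link_count = sum(len(neighbors) for neighbors in adj.values()) // 2
    return degree_term + 2 * link_count

def remove_node(adj, node):
    """Copy of the network with one node and all of its links deleted."""
    return {v: neighbors - {node} for v, neighbors in adj.items() if v != node}

def laplacian_centrality(adj):
    """Drop in network energy caused by the simulated removal of each node."""
    total = laplacian_energy(adj)
    return {v: total - laplacian_energy(remove_node(adj, v)) for v in adj}

def normalize(scores):
    """Divide every value by the highest one, as described for the table above."""
    top = max(scores.values())
    return {v: s / top for v, s in scores.items()}

# Hypothetical network: a hub H with three contacts, one of which (C) leads on to D.
adj = build_adjacency([("H", "A"), ("H", "B"), ("H", "C"), ("C", "D")])
scores = laplacian_centrality(adj)
print(scores)
print(normalize(scores))
```

Because the energy drop depends on the degrees of a node's neighbors as well as its own, the measure reflects the node's local surroundings rather than only its own tie count, which may help explain why it can rank nodes differently from degree, closeness or betweenness.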

This result, where the pilots are positioned at the very top of the list, is in line with the conclusions of analysts who believe that the pilots were very important links, in particular since their training required a lot of time and money. The Laplacian method, which measures the significance (centrality) of a node by the reduction of the network energy following the deactivation (removal) of that node from the network, has therefore proven to be an excellent measure in this example: it produced a very concrete and applicable result. Haider Butt and his associates went a step further by suggesting the use of a hybrid classifier as an innovative method for the prediction of key nodes in covert networks. The system calculates certain centrality measures for each node in the network and detects key nodes among them by applying a hybrid classifier. The proposed classifier is a combination of k-nearest neighbors (kNN), a Gaussian mixture model (GMM) and a support vector machine (SVM).

The system also applies anomaly detection to predict terrorist activities in order to help security and intelligence agencies destabilize the networks involved. The proposed system concept has been implemented and tested using different case studies, including two publicly available databases and one local network. One of those studies is the 9/11 attack. The results of classification and detection of key network nodes are reported through the following parameters: sensitivity, the number of correctly detected key nodes in relation to the total number of key nodes in the network; specificity, the number of correctly detected ordinary nodes in relation to the total number of ordinary nodes in the network; accuracy, the number of correctly detected nodes (ordinary and key) in relation to the total number of nodes in the network; and AUC, the area under the receiver operating characteristic (ROC) curve.
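The first three parameters follow directly from their definitions and can be sketched as below. The node labels and the predicted set are hypothetical and are not results from the paper; the function assumes both sets contain only nodes of the network, and that there is at least one key node and one ordinary node.

```python
def classification_metrics(true_key, predicted_key, all_nodes):
    """Sensitivity, specificity and accuracy of key-node detection."""
    true_positive = len(true_key & predicted_key)    # key nodes correctly detected
    false_negative = len(true_key - predicted_key)   # key nodes that were missed
    false_positive = len(predicted_key - true_key)   # ordinary nodes flagged as key
    true_negative = len(all_nodes) - true_positive - false_negative - false_positive
    return {
        "sensitivity": true_positive / (true_positive + false_negative),
        "specificity": true_negative / (true_negative + false_positive),
        "accuracy": (true_positive + true_negative) / len(all_nodes),
    }

# Hypothetical network of ten nodes, three of which are truly key nodes.
all_nodes = {f"n{i}" for i in range(10)}
true_key = {"n0", "n1", "n2"}
predicted_key = {"n0", "n1", "n5"}  # hypothetical classifier output

metrics = classification_metrics(true_key, predicted_key, all_nodes)
print(metrics)
```

Since key nodes are usually a small minority, accuracy alone can look high even when key nodes are missed, which is why sensitivity and specificity are reported alongside it.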

It is evident that the use of hybrid classifiers results in significantly better detection of key nodes in terrorist groups, that is, in dark networks. The hybrid classifier accuracy achieved in the three case studies in the paper “Covert Network Analysis for Key Player Detection and Event Prediction Using a Hybrid Classifier” (for the detection of key network nodes) is between 88.73% and 95.91%.

It should be said that there are three key problems of dark network analysis which also apply to a network consisting of members of terrorist groups: incompleteness (it is inevitable that investigators do not have, or cannot uncover, all the missing nodes and links at a particular moment); fuzzy boundaries (it is difficult to decide whom to include in the network and whom to leave out); and dynamics (these networks are not static, they are always changing). It is not sufficient just to establish a link between two entities; one should also establish whether the link is weak or strong, depending on the time and the task. However, this is an additional reason to use these (and some other) analysis techniques in combating terrorism.

A great practical example of SNA and data science application in criminal science in Croatia was demonstrated by my colleagues Horjan and Krnjašić in their paper “The Social Network Analysis of Organized Criminal Groups in the Republic of Croatia”. Their paper provides an overview of the application of social network analysis of organized criminal groups on a representative sample which reflects the Croatian criminal network at the national level of threat. It also describes basic network models and principles that govern them. A relevant model of the Croatian criminal groups’ network is also determined.

The most important conclusions (the practical value of the paper) can be summed up as follows: in the Croatian criminal network (which is, of course, part of the regional and even the European criminal network), there is a clear compactness of the network; the criminal system portrayed is flexible and adaptable to changes; the links between network entities (group members) show the strength and power of particular persons, but they also show the domination of particular groups in the entire network, etc. In any criminal network, 7 roles of network members can be identified: (1) organizers who provide the direction of the network (these are usually core persons); (2) insulators who are responsible for protecting core persons from any danger; they transmit commands, directives and instructions from the network core to the periphery of the network, while ensuring that the communication flow from the periphery does not compromise the very core of the network; (3) communicators who ensure communication flow, transmit instructions and receive feedback (with potential conflict between insulators and communicators); (4) guardians who ensure network security, monitor recruitment in the network and ensure the loyalty of those recruited; (5) extenders of the network who are responsible for recruiting new members and negotiating cooperation with other networks; (6) monitors who ensure the effectiveness of the network, collect information on weaknesses and problems within the network, and ensure that the network is able to adjust to new circumstances and maintain a high degree of flexibility; and (7) crossovers, who are parts of the criminal network infiltrated in other networks, and who provide key information and contribute to network protection.

Krnjašić and Horjan have examined a network of telephone communication between the detected telephone numbers of 280 members of the 10 most powerful national criminal groups (group names are coded by numbers ranging from 1 to 10) and operationally relevant persons at a national level identified through criminal investigations in the period from 2009 until 2011.

The analysis considered the following SNA metrics: the number of components (how many separate components make up the network; if all the nodes, or entities, are connected, the network consists of a single component), the betweenness of the analyzed network, the average geodesic distance (which indicates closeness in the network: how many steps there are on average between the nodes) and the diameter (the distance in steps between the two most distant nodes in the network). The results for the analyzed network of organized criminal groups in the Republic of Croatia show that this is a disproportionate network.
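Three of the metrics named above (number of components, average geodesic distance and diameter) can be computed with breadth-first search, as in this minimal sketch. The six-node network is hypothetical and unrelated to the Croatian data; distances are measured only between nodes that can reach each other.

```python
from collections import deque

def build_adjacency(nodes, links):
    adj = {n: set() for n in nodes}
    for u, v in links:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def bfs_distances(adj, source):
    """Number of steps from source to every node it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def components(adj):
    """Separate groups of nodes with no links between the groups."""
    seen = set()
    result = []
    for node in adj:
        if node not in seen:
            group = set(bfs_distances(adj, node))
            seen |= group
            result.append(group)
    return result

def geodesic_summary(adj):
    """Average geodesic distance and diameter over all connected pairs."""
    total = pairs = diameter = 0
    for source in adj:
        for target, steps in bfs_distances(adj, source).items():
            if target != source:
                total += steps
                pairs += 1
                diameter = max(diameter, steps)
    return total / pairs, diameter

# Hypothetical network: a chain of four nodes plus a separate linked pair.
adj = build_adjacency(range(1, 7), [(1, 2), (2, 3), (3, 4), (5, 6)])
average_distance, diameter = geodesic_summary(adj)
print(len(components(adj)), average_distance, diameter)
```

A single component together with a small average distance and diameter is what one would expect of a compact network.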

To keep a criminal organization functioning, the spread of information, both within a single criminal organization and within the whole network of organized criminal groups acting in the Republic of Croatia and beyond, is of the utmost importance. The information-spreading pattern has the characteristics of a chain reaction, whereby a group member informs the member closest to him, who then “contaminates” the next person in their group or another group, and so on until the information covers the whole area. When we examine the way in which information spreads from the network core, we notice a rising percentage of affected entities in the first three steps and a sharp increase in the fourth step (to around 95% of entities/nodes), after which the possibility for the information to spread lessens. Information flow from the periphery has a small delay, meaning that the information starts to spread in the fourth step (when the information coming from the core has already reached its culmination) and reaches its maximum in the seventh step.
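The chain-reaction pattern can be approximated with a simple step-by-step simulation. The sketch assumes, as a deliberate simplification, that every informed member passes the information to all direct contacts in the next step; the network and node names are hypothetical, so the percentages it prints are not those of the Croatian study.

```python
from collections import deque

def build_adjacency(links):
    adj = {}
    for u, v in links:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def spread_by_step(adj, source):
    """Cumulative share of the network informed after each step, starting at source."""
    step_reached = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in step_reached:
                step_reached[v] = step_reached[u] + 1
                queue.append(v)
    last_step = max(step_reached.values())
    return [
        sum(1 for s in step_reached.values() if s <= step) / len(adj)
        for step in range(last_step + 1)
    ]

# Hypothetical network: core node K, contacts A, B, C, and a periphery node P.
adj = build_adjacency([
    ("K", "A"), ("K", "B"), ("K", "C"),
    ("A", "A1"), ("B", "B1"), ("A1", "P"),
])

from_core = spread_by_step(adj, "K")
from_periphery = spread_by_step(adj, "P")
print([f"{share:.0%}" for share in from_core])
print([f"{share:.0%}" for share in from_periphery])
```

Even in this small example, information starting at the core covers the network in fewer steps than information starting at the periphery, which mirrors the delay described above.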

To enable information flow within a network in such a small number of steps, the network has to meet certain conditions, such as significant density, extensive links and a so-called widespread neighborhood of the entities. Now we come to the question: “What’s next?” And again a reference is made to data science in a broader context. If the security and intelligence structures wish to “disrupt” a particular piece of information or “infiltrate” another piece of information into the network, the part of the network where their activity begins is of the utmost importance.

Furthermore, defining the network position or role of each member of a criminal network is very helpful for combating criminal networks and dismantling organized criminal groups. Understanding how criminal networks function helps neutralize the self-protection measures they use. It is precisely in this case of SNA application that we can observe the key contribution of data science.

[1] Mr. Kopal's professional and scientific experience also includes the following: Special Advisor to the Croatian Prime Minister for national security; Vice Dean for research and development at Algebra University College and Head of the professional master study program of Digital Marketing and Data Science; lecturer and visiting lecturer at numerous university colleges in Croatia and abroad and at CROMA EduCare Programme (Croatian Managers and Entrepreneurs Association).