Using machine learning to tackle arms proliferation in Russian trade data

9 minute read

Context: In 2021 I worked on a project with a few classmates while at BloomTech’s Data Science program. We were given the opportunity to work on a novel data science problem through a Washington-based think tank.


In this post, we describe how we put together a novel approach to use machine learning on a large trade dataset to identify Russian parties that ship to foreign militaries, or parties acting on behalf of Russian state military firms.

Russia is the world’s second-largest arms exporter. The sheer volume of trade data available poses a challenge and an opportunity to researchers and government agencies in the arms non-proliferation field. Recently, our team of data scientists worked with C4ADS, a non-profit in Washington, DC, to figure out if we could apply machine learning techniques to a very large trade dataset they had and find instances of potential Russian arms proliferation.

Our task was twofold: would an ML solution be feasible in this use case, and if so, how would we structure our approach? In our research, we found only one other project attempting to apply machine learning techniques to a similar use case. Project Alpha at King’s College London is working to create a big data platform for non-proliferation analysts.

Our first attempt was a success: in 8 weeks, we were able to identify high-risk companies, suggesting that data science is a feasible approach and of high utility to those working on arms control research and/or in export control.

Working in Amazon Web Services (AWS)

This was our team’s first exposure to AWS so the learning curve was pretty steep. We learnt about parquet files and how the schema info worked. We spent a week exploring different options, including some time spent in AWS Glue where we were unsuccessful at writing lifecycle scripts.

We eventually settled on EMR (Spark & Livy), in conjunction with SageMaker, as it was perfect for our use-case. We picked these two for their ease-of-use in a collaborative work environment and ability to use multiple clusters to process our large dataset. The set-up process was substantial and required some of what would be called a sysadmin skillset. In brief, we created the cluster in the EMR service before setting up the SageMaker notebook. We used a notebook in the root directory of the SageMaker notebook as a central place to pip install relevant libraries for the conda_python3 environment.

Samples of the datasets were explored using Dask (parquet files) and then cleaned using Spark.

Data Exploration & Cleaning

The trade dataset was larger than 170 GB encompassing more than 65 million rows of Russian import and export data from 2016–2018. This included 59 tagged companies as our target variable of a large Russian arms exporter and parties known to have forwarded shipments on its behalf.

There were some significant hurdles with reading the datasets as they did not come with data dictionaries; some columns were mysteries. A key column, Description Good, was in Cyrillic — -while we were fortunate to have two team members with Cyrillic-language abilities, this column should be translated to allow for NLP-derived features in future iterations of model building. The companies included names and unique identifiers in the form of their tax ID number (INN).

We filtered for exporters only and heavily used regex to clean trade in PySpark. We had to convert some of the Russian (Русский язык) into its unicode equivalent for the regex engine to process. We either took the exact words that we needed (e.g. ПАО, ООО, ОО) or we had to take variances of a word to match with (e.g. ТАМОЖНИ, ПОСТ among others) to ensure the most rudimentary standard of cleanliness for those columns. We had also thought about applying the regex to Description Good, but trying to figure out the intricacies of verb/noun/adjective conjugation in Russian proved to be quite a feat.

This cut the trade dataset from 170+ GB to less than 4GB, or about 9 million rows. While we were worried that it was too much of a reduction, it was difficult to ensure that our regex was not too stringent or lenient in cleaning the data. This reduction did allow us to use Pandas for machine learning instead of having to learn another new tool, Apache Spark’s MLlib. An example code snippet of the cleaning process that resulted in two dataframes:

Feature Engineering

We mainly used udfs to apply lambda functions to only allow for specified values (e.g., nones and nulls to be set as 0.0 for float columns). As a first pass over the data, we did manually typed out all the methods and parameters, but in future iterations a more programmatic approach would be cleaner.

Every column had values that were strings or non-ordinal values, so they were all encoded. An example of the workflow can be seen here:

We tried to incorporate some domain knowledge into feature engineering. Our research found that military and security forces have been the main source of illicit Russian arms transfers, either through direct participation in regional conflicts or through sales by corrupt officers in conjunction with criminal organizations.

We created a new column to try and capture this process with a dummy variable, active-russian-military-ties. The criteria was defined as having present-day or recent historical (last 10 years) ties which includes military interventions, peacekeeping, and signed military or technical cooperation agreements. We manually coded this through our own research and found 106 countries that fit the criteria.

Later, however, running feature importance analysis would confirm our intuition that the only column of any importance was the description of goods column.

Prototyping and Training

We pursued both supervised and unsupervised learning approaches simultaneously.

With the list of 59 exporters as our target variable, we tried a few classifiers before settling on scikit-learn’s GradientBoostingClassifier. GBC lets us build a predictive function with a low bias and a lower variance relative to a deep decision tree. Boosting fits trees sequentially at each step, with newer trees trained on the residual errors in the previous trees. GBC addresses overfitting by using a learning rate (between 0,1) to scale contributions from the new tree each step.

Because it trains many trees sequentially before ensembling them at the end, we can carefully control the variance and avoid overfitting, unlike when fitting a single deep decision tree. Decision trees are also non-parametric, which was especially useful to our dataset since so many of our columns could not be identified and therefore contextualized with domain knowledge.

Our initial iterations heavily relied on one feature, so we removed it and retrained the model. To address the highly imbalanced classes, we trained on a subset of the dataset, where we sampled 25,000 and 250,000 known and non-exporters respectively (this 1:10 ratio worked best for our model). Our metrics optimized for recall as it was better to over identify legal trade (false positives) than miss illegal trade (false negatives) due to the legal ramifications. The supervised approach yielded 50 predicted INNs not contained in the original list of 59. We had a sense that we had achieved a good baseline after looking up the INNs on a Russian registry website, where many of them had descriptions that were military-related.

Unsupervised We used scikit-learn’s Nearest Neighbors’ library which implements unsupervised nearest neighbors learning.

Within the Nearest Neighbors algorithm, the parameters are n_neighbors and the type of algorithm used. BallTree was used for the algorithm because of its use ‘for fast generalized N-point problems.’ The n_neighbors was used for grouping of the neighbors in a cluster. The number 10,000 was chosen to produce a prediction set that was wide enough to determine the grouping of INN trade transactions of a known exporter and use that transaction to identify other companies of interest. Out of 37,467 known INN predictor rows, the first 1000 were used to make a predictions dataframe of 10,000,000 neighbors. This dataframe then filtered out the known INN neighbors and the results were value counted to find new potential INNs. The success of this approach was deemed inconclusive after looking up initial results.


We made several assumptions in our approach:

  • Illegal and legal arms exporters exhibit similar features in their trade transactions (e.g. in trading partners, space, goods, volume). Our target variable was the list of 59 known arms exporters in Russia. Using this list of legal exporters was a way to profile how illegal arms exporters might also appear in the trade data. Since we don’t actually know what illegal exporters do differently from legal exporters, this might have been a big leap.
  • The trade data contains information that is accurate and true. It is more likely that illicit arms exporters obfuscate their routes, export destination, volume of goods, or even what the good actually is. A well-known example is that of the Russian arms broker Victor Bout, who used numerous front companies, transported weapons through airplanes, and re-registered aircraft in an attempt to dodge controls through a combination of legal and illegal means. Other exporters may take circuitous routes en route to their true destination, transferring goods between shell companies, and these transit countries or companies will not be captured by the data.

Lessons Learned & Looking Ahead

We had 8 weeks to learn new tools and put together a proof of concept for the project’s stakeholders at C4ADS. We were excited to hear that our baseline results were an effective MVP for further exploration and positively received by the stakeholders.

  • First, documentation is vital: Other teams had previously partnered with them on this challenge, but without any documentation (especially regarding architecture set-up), we spent a significant amount of time trying different setups before setting on one that worked best for the project’s unique needs. To that end, we made sure to document early and often, and passed on an extensively detailed, 22-page, documentation for reproducibility and to enable the next team to ramp up faster.
  • Second, project management kept us on track: We worked as a remote team the entire time, and were able to go from 0 to a promising baseline result so quickly as we had an amazing project manager (thanks Elan!) who worked closely with us across all time zones. We had daily standups, submitted regular release canvases, assessed milestones, and video-called regularly with stakeholders to better understand the business problem and expectations.
  • As a result, we iterated early and often: With this clear work structure in place, and our project manager’s focus to deliver MVP by the 12th week, we were encouraged to try things early, discarding tools or workflows that were proving to be unfruitful quickly. We weren’t able to get to hyperparameter tuning or NLP on key Cyrillic columns in the short timeframe, but we’re happy that our documentation and baseline results will enable the next team to push the edge of what machine learning can do in this space.

Written by: Elliot Gunn

Data science team: Sean Antosiak, Elliot Gunn, Andrew Mikol, Jason Nova

Project Lead: Elan Riznis

Special thanks to C4ADS for the incredible opportunity to work on this novel data science problem.