The Council for Scientific and Industrial Research (CSIR) is mining posts from social media to see if it can detect crime trends in South Africa from the data.
Speaking at a media briefing at the Meraka Institute on the CSIR campus today, CSIR acting research group leader for data science Vukosi Marivate said they use machine learning to detect incidents and classify them.
They trained their algorithm with 1,200 posts, of which 40% had been labelled as crime-related, while the remaining 60% were labelled as not related to crime.
The algorithm analysed users’ posts and classified whether they were talking about a crime, something relating to public safety, a protest, or another incident.
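The article does not say which algorithm the CSIR used, so the following is only a minimal sketch of how labelled posts can train a text classifier, here a naive Bayes model written in plain Python. The example posts, the label names (crime, protest, public_safety, other) and the function names are all hypothetical and are not taken from the CSIR's data or code.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy training set; the CSIR's real set had 1,200 labelled posts.
TRAINING_POSTS = [
    ("armed robbery at the garage on main road just now", "crime"),
    ("house break-in reported in the suburb last night", "crime"),
    ("car hijacked near the mall this morning", "crime"),
    ("protesters marching to the union buildings", "protest"),
    ("students protest over fees outside campus", "protest"),
    ("traffic lights out at the intersection drive carefully", "public_safety"),
    ("flooding on the highway avoid the area", "public_safety"),
    ("great coffee at the new shop in town", "other"),
    ("watching the rugby with friends tonight", "other"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(posts):
    """Count posts per label and words per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in posts:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest naive Bayes log-probability."""
    label_counts, word_counts, vocabulary = model
    total_posts = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total_posts)  # prior for this label
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Laplace (add-one) smoothing so unseen words do not zero the score
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_POSTS)
print(classify("armed robbery hijacked", model))
print(classify("students protest over fees", model))
```

A real system would need far more training data than this toy set, and would be evaluated on held-out posts before its labels were trusted.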
Marivate highlighted a Twitter post from 2015, which contained everything they were looking for in a report – time, place, and the crime committed.
[Embedded tweet from Sophie Ribstein (@SophieRibstein), August 25, 2015]
The CSIR was then able to perform an analysis on which events were regular occurrences in an area, and which stood out as exceptional.
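How the CSIR separated regular occurrences from exceptional ones is not described. One simple way to illustrate the idea, assuming weekly report counts per area and incident type are available, is to flag weeks whose count sits well above the area's own average. The counts, the threshold of two standard deviations and the function name below are assumptions made for the sake of the example.

```python
from statistics import mean, pstdev


def exceptional_weeks(weekly_counts, threshold=2.0):
    """Return the indices of weeks whose count is more than `threshold`
    standard deviations above the mean of the series."""
    mu = mean(weekly_counts)
    sigma = pstdev(weekly_counts)
    if sigma == 0:
        return []  # every week is identical, so nothing stands out
    return [
        week
        for week, count in enumerate(weekly_counts)
        if (count - mu) / sigma > threshold
    ]


# Hypothetical weekly report counts for one area.
traffic_reports = [12, 14, 11, 13, 12, 15, 13, 12]  # frequent and steady
robbery_reports = [1, 0, 2, 1, 1, 9, 1, 0]          # rare, with one spike

print(exceptional_weeks(traffic_reports))
print(exceptional_weeks(robbery_reports))
```

In this toy data the traffic reports are numerous but never flagged because they are consistently high, while the single spike in robbery reports is flagged as exceptional.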
Marivate said that a significant number of tweets they filtered described road traffic issues.
The researchers also looked into author identification to find who originally reported a crime. They matched duplicate reports of incidents, and versions of tweets that had been altered from the original.
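The matching technique the researchers used is not specified. As a rough sketch of the idea, the example below groups posts whose word sets overlap strongly (Jaccard similarity) and treats the earliest post in each group as the original report. The posts, the integer timestamps, the 0.6 threshold and all function names are hypothetical.

```python
import re


def tokens(text):
    """Strip URLs, @mentions and the retweet marker, then return the word set."""
    text = re.sub(r"https?://\S+|@\w+|\bRT\b", " ", text)
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Share of words the two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_duplicates(posts, threshold=0.6):
    """Group (timestamp, author, text) posts that look like the same report.

    Posts are processed in time order, so the first post in each group
    is the earliest one seen and is treated as the original."""
    groups = []
    for post in sorted(posts, key=lambda p: p[0]):
        post_tokens = tokens(post[2])
        for group in groups:
            # Compare against the earliest post of each existing group
            if jaccard(post_tokens, group["tokens"]) >= threshold:
                group["posts"].append(post)
                break
        else:
            groups.append({"tokens": post_tokens, "posts": [post]})
    return groups


# Hypothetical posts: (timestamp, author, text)
posts = [
    (3, "user_c", "RT @user_a: Armed robbery at Main Road garage, 2 suspects fled"),
    (1, "user_a", "Armed robbery at Main Road garage, 2 suspects fled"),
    (2, "user_b", "Armed robbery at Main Road garage!! 2 suspects fled on foot https://example.com/x"),
    (4, "user_d", "Traffic lights out on Main Road"),
]

for group in group_duplicates(posts):
    original = group["posts"][0]
    print(original[1], len(group["posts"]))
```

Word-set overlap is a deliberately crude measure; it catches retweets and lightly edited copies, but would miss reports of the same incident written in entirely different words.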
Currently, the CSIR is looking at ways to identify events and automatically classify them as they happen.
Marivate emphasised that the work is aimed at identifying trends, not at developing crime statistics from social media posts.
Marivate also acknowledged that the results will naturally be biased towards places where there is Internet connectivity.
Ethics of data mining
Working with fellow researcher and data scientist Nyalleng Moorosi, the CSIR group has factored privacy and ethics considerations into its mining of data from social networks.
This follows events like the Cambridge Analytica scandal, which has fortunately not impacted the CSIR’s research and access to Facebook posts.
Marivate said they are not affected as they relied on public posts only.
“For our work, we did not need access to personal account data,” he said.
Marivate said that Twitter has also changed its API, which will have an impact on their current approach.
“Back in 2011, as an academic researcher you could get access to Twitter’s firehose,” he said.
As the company worked to make a profit, it gradually restricted its API access and only allowed access to samples of historical data.
From around September, Marivate said they expect to be limited to Twitter’s streaming API – and will only see tweets as they come in.