Researchers at the U.S. Department of Energy's Lawrence Berkeley National Laboratory (Berkeley Lab) have used a machine learning algorithm called Word2vec to scan millions of scientific papers and then use that knowledge to predict future scientific discoveries. The study showed that an algorithm with no prior training in materials science can uncover new scientific knowledge without any need for human guidance.
“Without telling it anything about materials science, it learned concepts like the periodic table and the crystal structure of metals,” said team lead Anubhav Jain, a scientist at Berkeley Lab's Energy Storage and Distributed Resources Division. “That hinted at the potential of the technique. But probably the most interesting thing we figured out is, you can use this algorithm to address gaps in materials research, things that people should study but haven't studied so far.”
According to lead author Vahe Tshitoyan, a former Berkeley Lab postdoctoral fellow who now works at Google, the project was motivated by the difficulty of making sense of the overwhelming amount of previously published research.
“In every research field there's 100 years of past research literature, and every week dozens more studies come out,” said Tshitoyan. “A researcher can access only a fraction of that.”
Faced with this challenge, the team decided to make a machine learning algorithm that can make use of all of this collective knowledge without needing intervention from human researchers.
As part of this, the team collected 3.3 million abstracts from papers published in more than 1,000 journals between 1922 and 2018. From this corpus, Word2vec took about 500,000 distinct words and turned each one into a 200-dimensional vector, or an array of 200 numbers. Using these vectors, the algorithm could then learn how the words relate to one another.
Word2vec was then trained on the materials science texts. Here, it was able to learn the meaning of scientific terms and concepts based on the positions of words in the abstracts, and how often they occurred with other words. Word2vec even learned the relationships between elements in the periodic table, which the team demonstrated by projecting the vector for each chemical element onto two dimensions.
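A two-dimensional projection like the one described can be done with principal component analysis (PCA). The sketch below uses random vectors as hypothetical stand-ins for the trained element embeddings; in the study, the inputs would be the 200-dimensional Word2vec vectors for each chemical element.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 200-dimensional embeddings for a few chemical elements;
# in the study these came from the trained Word2vec model.
elements = ["Li", "Na", "K", "F", "Cl", "Br"]
vectors = rng.normal(size=(len(elements), 200))

# Project onto the first two principal components (PCA via SVD).
centered = vectors - vectors.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ vt[:2].T  # shape: (6, 2), one point per element

for name, (x, y) in zip(elements, coords_2d):
    print(f"{name}: ({x:.2f}, {y:.2f})")
```

With real embeddings, chemically similar elements (the alkali metals, the halogens) cluster together in the resulting 2-D plot, which is how the model's grasp of the periodic table becomes visible.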
With Word2vec trained using the abstracts, the team tested it to see if it could predict breakthroughs in the development of novel thermoelectric materials. These are materials that can efficiently convert heat into electricity. When the team looked at the top thermoelectric material candidates predicted by the algorithm, they found that all had computed power factors higher than known thermoelectric materials.
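One common way to rank candidates with word embeddings, and a plausible reading of the approach here, is to score each material by the cosine similarity of its vector to the vector for "thermoelectric". The sketch below uses random vectors as stand-ins, so the resulting ranking is illustrative only; the material names are just examples (Bi2Te3 and PbTe are well-known thermoelectrics).

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: how closely two word vectors point the same way."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
query = rng.normal(size=200)  # stand-in for the "thermoelectric" vector

# Stand-in vectors for candidate material names found in the abstracts.
candidates = {name: rng.normal(size=200) for name in ["CsAgGa2Se4", "Bi2Te3", "PbTe"]}

# Rank candidates by similarity to the query concept, best first.
ranked = sorted(candidates, key=lambda n: cosine(query, candidates[n]), reverse=True)
print(ranked)
```

With real embeddings, materials whose vectors sit closest to "thermoelectric" become the top predictions, even if no abstract ever described them as thermoelectric explicitly.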
To further test Word2vec, the team had the algorithm perform experiments “in the past” – that is, the algorithm was only given abstracts up to a certain point in time, for example, the year 2000. From this, Word2vec not only accurately “predicted” the breakthroughs in thermoelectrics that had been made since then, it also identified promising materials that have yet to be explored.
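The backtest described above amounts to restricting the training corpus to abstracts published before a cutoff year. The record format below is an assumption for illustration, not the study's actual data schema.

```python
def training_corpus(abstracts, cutoff_year):
    """Keep only abstracts published before the cutoff, mimicking a
    model that has 'seen' no literature after that year."""
    return [a["tokens"] for a in abstracts if a["year"] < cutoff_year]

# Toy records; the fields are assumptions, not the study's actual schema.
abstracts = [
    {"year": 1995, "tokens": ["early", "thermoelectric", "study"]},
    {"year": 1999, "tokens": ["seebeck", "measurements"]},
    {"year": 2005, "tokens": ["later", "discovery"]},
]

past_only = training_corpus(abstracts, 2000)
print(len(past_only))  # 2 of the 3 toy abstracts predate the cutoff
```

A model trained on `past_only` can then be scored against what was actually discovered after the cutoff, which is how the team verified the algorithm's predictions retroactively.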
With these results, the team is now working to release the top 50 thermoelectric materials predicted by the algorithm, so that scientists can begin developing them. In addition, they're releasing the word embeddings so that others can build their own applications for other classes of materials. Beyond this, the team is also working on a smarter, more powerful search engine based on the algorithm that should give scientists a more useful way to search abstracts.