Bling FIRE Tokenizer Released to Open Source

The Bling (Beyond Language Understanding) team in Bing Web Data is proud to announce that we’ve released the Bling FIRE (FInite state machine and Regular Expression manipulation) library to the open source community.

Bling FIRE is a library for building efficient tokenizers, sentence breakers, and word segmenters, as well as components for multi-word expression matching, unknown-word guessing, stemming/lemmatization, and similar tasks. It is designed for fast, high-quality tokenization of natural language text.

The first application released on this library is the Bling FIRE tokenizer, which Bing uses internally for all its deep-learning-based projects. It supports all whitespace-separated languages and closely follows the NLTK tokenization logic, with additional fixes and added breaking at punctuation inside words and paths (slashes, dashes, backslashes, and periods), as the examples show:

NLTK: The South Florida/Miami area has previously hosted the event 10 times .

FIRE: The South Florida / Miami area has previously hosted the event 10 times .

NLTK: Marconi 's European experiments in July 1899—Marconi may have transmitted the letter S ( dot/dot/dot ) in a naval demonstration

FIRE: Marconi 's European experiments in July 1899 — Marconi may have transmitted the letter S ( dot / dot / dot ) in a naval demonstration

NLTK: Go to C : \Users\Public\Documents\hyper - v\Virtual hard disks\ and delete MSIT_Win10.VHDX .

FIRE: Go to C : \ Users \ Public \ Documents \ hyper - v \ Virtual hard disks \ and delete MSIT_Win10 . VHDX

NLTK: In the confirmation window , click OK. Review the FMT Real - time Report ES .

FIRE: In the confirmation window , click OK . Review the FMT Real - time Report ES .
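
To make the difference concrete, here is a toy sketch in Python that reproduces the extra breaking shown above with a single regular expression. It is only an approximation for illustration: the pattern and the `toy_tokenize` name are our own assumptions, and this is not how Bling FIRE is implemented (the library compiles its rules into finite state machines).

```python
import re

# Hypothetical pattern, for illustration only:
#   's\b      keeps the possessive clitic together, as in "Marconi 's"
#   \w+       a run of letters, digits, or underscores
#   [^\w\s]   any other single non-space character (/, \, -, ., and so on)
TOKEN_RE = re.compile(r"'s\b|\w+|[^\w\s]")

def toy_tokenize(text):
    """Return the tokens of `text` joined by single spaces."""
    return " ".join(TOKEN_RE.findall(text))

print(toy_tokenize("The South Florida/Miami area has previously hosted the event 10 times."))
# The South Florida / Miami area has previously hosted the event 10 times .
print(toy_tokenize("Marconi's letter S (dot/dot/dot)"))
# Marconi 's letter S ( dot / dot / dot )
```

A regex like this handles the simple cases above but none of the harder ones (abbreviations, URLs, numbers with separators), which is where a purpose-built tokenizer earns its keep.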

The key advantage of this library is speed: in our benchmark it runs about 10x faster than SpaCy and more than 20x faster than NLTK:

System        Avg run time (seconds per 10,000 passages)
Bling FIRE    0.823
SpaCy         8.653
NLTK          17.821
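
For readers who want the ratios spelled out, the short Python snippet below derives the relative speedups from the run times in the table. The variable names are ours; the numbers are taken directly from the table.

```python
# Average run time in seconds per 10,000 passages, from the table above.
run_times = {"Bling FIRE": 0.823, "SpaCy": 8.653, "NLTK": 17.821}

baseline = run_times["Bling FIRE"]

# Speedup of Bling FIRE relative to each system (1.0 for itself).
speedups = {name: seconds / baseline for name, seconds in run_times.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
# Bling FIRE: 1.0x
# SpaCy: 10.5x
# NLTK: 21.7x
```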

Since its release, the project has been covered by specialized news sites and has already earned more than 1,000 stars on GitHub.

You can get access to the library and find out more details at: https://github.com/Microsoft/BlingFire. To reach out to the team with questions or comments, connect with us on Stack Overflow.
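
As a quick start, the sketch below shows one way to call the tokenizer from Python. It assumes the `blingfire` package from PyPI is installed (`pip install blingfire`) and uses its `text_to_words` function, which returns the tokens as a single space-separated string. The `tokenize` wrapper and its whitespace fallback are our own additions for illustration, so the snippet still runs where the package is missing; check the repository for the authoritative usage instructions.

```python
def tokenize(text):
    """Tokenize `text` with Bling FIRE if it is installed.

    Falls back to plain whitespace normalization otherwise. The fallback
    is a placeholder for illustration and is not a real tokenizer.
    """
    try:
        from blingfire import text_to_words
    except ImportError:
        return " ".join(text.split())
    return text_to_words(text)

print(tokenize("The South Florida/Miami area has previously hosted the event 10 times."))
```

With the package installed, the output should follow the FIRE lines in the examples above, with the slash and the final period split off as separate tokens.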

- Bling Web Data Team

Copyright © Microsoft Corporation. All rights reserved. Bling Fire is licensed under the MIT License.