The Bling team (Beyond Language and Understanding team) in Bing Web Data is proud to announce that we’ve released the Bling FIRE (FInite state machine and Regular Expression manipulation) library to the open source community.
Bling FIRE is a library for building efficient tokenizers, sentence breakers, word segmenters, multi-word expression matchers, unknown-word guessers, stemmers/lemmatizers, etc. It is designed for fast, high-quality tokenization of natural language text.
The first application released on this library is the Bling FIRE tokenizer, which Bing uses internally for all its deep learning based projects. It supports all whitespace-separated languages and closely follows the NLTK tokenization logic, with additional fixes and added breaking for hyphenated words:
NLTK: The South Florida/Miami area has previously hosted the event 10 times .
FIRE: The South Florida / Miami area has previously hosted the event 10 times .
NLTK: Marconi 's European experiments in July 1899—Marconi may have transmitted the letter S ( dot/dot/dot ) in a naval demonstration
FIRE: Marconi 's European experiments in July 1899 — Marconi may have transmitted the letter S ( dot / dot / dot ) in a naval demonstration
NLTK: Go to C : \Users\Public\Documents\hyper - v\Virtual hard disks\ and delete MSIT_Win10.VHDX .
FIRE: Go to C : \ Users \ Public \ Documents \ hyper - v \ Virtual hard disks \ and delete MSIT_Win10 . VHDX
NLTK: In the confirmation window , click OK. Review the FMT Real - time Report ES .
FIRE: In the confirmation window , click OK . Review the FMT Real - time Report ES .
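Because both outputs above are space-separated token streams, the differences between them can be listed mechanically. The following is a minimal sketch using only Python's standard `difflib`; the helper name `token_differences` is hypothetical, and the code only compares the strings quoted in this post, it does not call NLTK or Bling FIRE.

```python
from difflib import SequenceMatcher


def token_differences(a_line: str, b_line: str) -> list[tuple[list[str], list[str]]]:
    """Return the token runs where two space-separated tokenizations disagree."""
    a_tokens = a_line.split()
    b_tokens = b_line.split()
    matcher = SequenceMatcher(None, a_tokens, b_tokens, autojunk=False)
    # Keep every opcode that is not "equal": each one is a spot where
    # the two tokenizers produced different tokens for the same text.
    return [
        (a_tokens[i1:i2], b_tokens[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Example pairs copied from the comparison above.
NLTK_1 = "The South Florida/Miami area has previously hosted the event 10 times ."
FIRE_1 = "The South Florida / Miami area has previously hosted the event 10 times ."
NLTK_4 = "In the confirmation window , click OK. Review the FMT Real - time Report ES ."
FIRE_4 = "In the confirmation window , click OK . Review the FMT Real - time Report ES ."

if __name__ == "__main__":
    for nltk_line, fire_line in [(NLTK_1, FIRE_1), (NLTK_4, FIRE_4)]:
        for nltk_run, fire_run in token_differences(nltk_line, fire_line):
            print(f"NLTK {nltk_run} -> FIRE {fire_run}")
```

For the first pair this reports that `Florida/Miami` is split into three tokens, and for the last pair that `OK.` is split into `OK` and `.`.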
The key advantage of this library is speed: in the benchmark below it is roughly 10x faster than SpaCy and more than 20x faster than NLTK:
System | Avg Run Time (Seconds per 10,000 Passages) |
---|---|
Bling FIRE | 0.823 |
SpaCy | 8.653 |
NLTK | 17.821 |
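To put the table in perspective, the sketch below derives the speedup ratios from the reported figures and shows one way a "seconds per 10,000 passages" number could be measured for any tokenizer callable. The function names, the `repeats` parameter, and the sample passages are assumptions made for illustration; this is not the benchmark harness the team used. `str.split` stands in for a real tokenizer here; with the `blingfire` Python package installed, its `text_to_words` function could presumably be passed instead.

```python
import time
from typing import Callable, Sequence

# Figures copied from the table above (seconds per 10,000 passages).
REPORTED = {"Bling FIRE": 0.823, "SpaCy": 8.653, "NLTK": 17.821}


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def seconds_per_10k(
    tokenize: Callable[[str], object],
    passages: Sequence[str],
    repeats: int = 3,
) -> float:
    """Average wall-clock time over `repeats` runs, scaled to 10,000 passages."""
    if not passages:
        raise ValueError("passages must not be empty")
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        for passage in passages:
            tokenize(passage)
        total += time.perf_counter() - start
    average = total / repeats
    return average * 10_000 / len(passages)


if __name__ == "__main__":
    for name in ("SpaCy", "NLTK"):
        ratio = speedup(REPORTED[name], REPORTED["Bling FIRE"])
        print(f"Bling FIRE vs {name}: {ratio:.1f}x")

    # Hypothetical sample data; a real comparison needs a representative corpus.
    sample = ["The South Florida/Miami area has previously hosted the event 10 times."] * 1000
    print(f"str.split: {seconds_per_10k(str.split, sample):.3f} s per 10,000 passages")
```

From the reported numbers, the ratios come out to about 10.5x against SpaCy and about 21.7x against NLTK.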
Since its release, the project has been covered by specialized news sites and already has more than 1,000 stars on GitHub.
You can get access to the library and find out more details at: https://github.com/Microsoft/BlingFire. To reach out to the team with questions or comments, connect with us on Stack Overflow.
- Bling Web Data Team
Copyright © Microsoft Corporation. All rights reserved. Bling Fire is licensed under the MIT License.