﻿<?xml version="1.0" encoding="utf-8"?>
<rss
  version="2.0">
  <channel>
    <title>Do not use</title>
    <link>https://blogs.bing.com/Engineering-Blog/Feed</link>
    <description>Do not use</description>
    <language>en</language>
    <item>
      <guid
        isPermaLink="false">0c97c363-b05a-42d9-86cc-5c787ab8b39e</guid>
      <link>https://blogs.bing.com/Engineering-Blog/august-2023/Driving-Performance-at-Microsoft-Bing</link>
      <author>Bing Team</author>
      <category>bing</category>
      <category>Performance</category>
      <category>scale</category>
      <category>speed</category>
      <title>Driving Performance at Microsoft Bing</title>
      <description>Building fast websites is challenging, often hindered by new features and a lack of prioritization and funding for performance work. Many sites suffer as a result. This blog shares Bing&amp;#39;s holistic model for maintaining and enhancing site performance. We hope this template, blending data, tools, teamwork, processes, and commitment, can be a valuable guide for others facing these common challenges.</description>
      <pubDate>Thu, 17 Aug 2023 08:00:00 Z</pubDate>
      <content><![CDATA[Ensuring fast performance is a common challenge faced by teams building web sites, especially given the tendency for new features and functionality to cause performance regressions. A related challenge is that many organizations don&rsquo;t adequately prioritize and fund performance work. These are difficult challenges to overcome, and many sites perform poorly because of it.<br />
&nbsp;<br />
In this blog we present the model Bing uses to successfully maintain and continuously improve site performance. The model comprises components spanning data and tools, a dedicated performance team, processes, and organizational commitment. While each component will be familiar to most people, it&rsquo;s their synthesis into a unified model that we hope teams across the industry will find useful as a template for their own performance missions.<br />
&nbsp;
<h3>The Bing Model for Driving Performance</h3>
A prior <a href="https://blogs.bing.com/search-quality-insights/august-2022/Fast-Front-End-Performance-for-Microsoft-Bing" target="_blank">blog</a> discussed the architecture and many of the technical performance techniques used by the Bing search engine to deliver world-class performance to end users. But architecture and performance techniques alone are not sufficient to ensure fast performance.<br />
&nbsp;<br />
At Bing we employ a set of <em>necessary and sufficient</em> components that have proven successful in the mission to maintain and continuously improve site performance. The model is shown below.<br />
&nbsp;<br />
<img alt="Picture1.png" src="/getattachment/188e64f7-5b8b-4aa3-98fa-f8eccda6aab9/Picture1.png?lang=en-US" title="Picture1.png" /><br />
Figure 1: Bing Model for Driving Performance<br />
<br />
<em>Necessary and sufficient</em> means that each component of the model is necessary, and collectively they are sufficient to drive a successful performance mission. Additionally, each component must be executed with a high standard of <em>excellence</em>. A component that is either missing or poorly executed will undermine the mission.<br />
&nbsp;<br />
This model has been developed over time through experience and lessons learned, both at Bing and in other divisions at Microsoft. The following sections discuss each component of the model.<br />
&nbsp;
<h3>Organizational Commitment</h3>
Strong organizational commitment to performance is the foundation and its importance cannot be overstated because all other components of the model depend on it. At Bing there is an unwavering commitment to performance at the highest levels in the organization. For example:

<ul>
	<li>Performance doesn&rsquo;t need to be justified; it&rsquo;s known to be good for users and good for business.</li>
	<li>Performance isn&rsquo;t traded off against features. There is a win-win attitude of delivering both new features <em>and</em> performance and, importantly, this is reflected in ship criteria that prevent regressions.</li>
	<li>Significant resources are allocated to performance.</li>
</ul>
&nbsp;<br />
An anti-pattern to watch out for in your own organization is paying lip service to performance: the importance may be understood but isn&rsquo;t backed up by treating performance as a top priority when it comes to resource allocation, ship criteria, or architectural decisions.<br />
&nbsp;
<h3>Data &amp; Tools</h3>
Excellent quantitative data and tools for developers are critical. At Bing the performance team discussed later is responsible for providing solutions in this space.<br />
&nbsp;
<h4>Representative Metrics</h4>
Fast performance is in service of desirable business outcomes such as user satisfaction, loyalty, and engagement. Thus, it&rsquo;s important to have performance metrics representing <em>end user perceived performance</em>, such that an improvement to performance metrics can be assumed to have a beneficial impact on business metrics.<br />
&nbsp;<br />
Which performance metrics to use is dependent on the specific experience being measured and what makes users perceive it to be fast. Many metrics are used at Bing depending on the scenario. For the main search results page, we primarily focus on three metrics representing rendering performance: First Render, First Results Render, and Above Fold Render, which correspond to the main rendering phases of the experience.<br />
&nbsp;<br />
<img alt="Picture2.png" src="/getattachment/322fd7fe-253f-42a6-8378-c4cc66910a1b/Picture2.png?lang=en-US" title="Picture2.png" /><br />
Figure 2: Bing Rendering Sequence in Slow Motion for Main Search Results Page
<h4><br />
Real User Measurement</h4>
Collecting performance metrics from real users is called RUM, or Real User Monitoring. Measuring performance in the wild is essential as it represents the &ldquo;truth&rdquo; across the myriad of different conditions (network speeds, device speeds, browsers, etc.) in the real user population. At Bing RUM data is used during the A/B Experimentation and Live Site phases as discussed in the next section.<br />
&nbsp;<br />
Performance is analyzed and tracked across the full distribution of percentiles, but the 90<sup>th</sup> percentile is primarily used for detailed tracking, goal setting, and accountability for end user performance.<br />
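To make percentile-based tracking concrete, here is a minimal sketch of computing the tracked percentiles from a batch of RUM samples. The data and the nearest-rank method are illustrative; Bing's actual aggregation pipeline is not described in this post.

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest value such that at least
    p% of the samples are at or below it."""
    ordered = sorted(samples)
    # 1-based nearest-rank index: ceil(len * p / 100), clamped to >= 1.
    rank = max(1, -(-len(ordered) * p // 100))
    return ordered[rank - 1]

# Hypothetical Above Fold Render times in milliseconds from RUM beacons.
rum_ms = [820, 910, 1050, 1200, 980, 2400, 870, 1130, 990, 3100]
p50 = percentile(rum_ms, 50)
p90 = percentile(rum_ms, 90)
```

Note how the long tail dominates the 90th percentile: a handful of slow sessions (here 2400 ms and 3100 ms) barely move the median but set the p90, which is why p90 is the primary tracking target.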
&nbsp;
<h4>Performance Tools</h4>
Bing uses performance tools at each stage in the ship pipeline as shown in the diagram below. These tools are designed to be used by any engineer in Bing, not just the performance team engineers.<br />
<img alt="Picture3.png" src="/getattachment/15a3c070-3444-45b5-92d0-34ecc1f10f19/Picture3.png?lang=en-US" title="Picture3.png" /><br />
Figure 3: Performance Tools<br />
&nbsp;
<h4><em>Upstream Dev Tools</em></h4>
The Bing performance team operates a system called the Performance Analyzer Service (PAS), which runs in a highly controlled environment of identical machines and is used to measure the performance of front-end code. PAS runs tests against URLs (or scripts) using real browsers in so-called test agents running on the machines. Tests generate the perceived performance metrics discussed earlier, along with a wealth of other useful metrics and diagnostics for performance debugging.<br />
&nbsp;<br />
Under the hood, PAS leverages the open-source WebPageTest, which is an outstanding system for measuring web page performance. PAS uses layered services and a database to implement extensions to WebPageTest.<br />
&nbsp;<br />
PAS has built-in support for executing multiple tests together as a group in support of &ldquo;Lab A/B Testing.&rdquo; Just as developers test the performance impact of a code change in the real world using A/B experimentation (see next section), they use Lab A/B testing to test the perf impact in a controlled lab environment during the development cycle.<br />
&nbsp;<br />
PAS also has implemented Web Page Replay (WPR), which is a mechanism to eliminate test results variance due to servers, CDN, and networking when testing front-end code. It works by running a pre-test (called the Recording) through a WPR service which is essentially a web proxy that captures and stores all HTTP request/response information. Then, the actual test (called the Playback) is run against the WPR service with all requests served by it using the stored responses from the Recording. PAS is built such that a test agent and its WPR service are collocated, so that all HTTP traffic during a Playback is local to a single host; this further reduces variance.<br />
&nbsp;<br />
Because of WPR and other tactics to reduce variance such as disabling background services and non-deterministic browser heuristics on the test agents, PAS provides repeatable test results with high sensitivity in detecting the front-end perf impact of a code change. PAS has proven to be very popular with developers in Bing and other divisions at Microsoft.<br />
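The record-then-playback idea behind WPR can be sketched in a few lines. This is a toy in-memory cache, not a real HTTP proxy (the class and function names are invented for illustration), but it shows why playback eliminates backend variance: after the Recording pass, no request ever leaves the host.

```python
class ReplayProxy:
    """Toy record/replay cache illustrating the WPR idea: capture responses
    once during a Recording pass, then serve Playback runs entirely from
    the stored copies so server/CDN/network variance is eliminated."""

    def __init__(self, fetch):
        self.fetch = fetch        # real network fetch, used only while recording
        self.recorded = {}
        self.recording = True

    def get(self, url):
        if self.recording:
            # Recording pass: hit the network and store the response.
            self.recorded[url] = self.fetch(url)
        # Playback pass: serve the stored response; no network involved.
        return self.recorded[url]

calls = []
def fake_fetch(url):
    calls.append(url)             # track how often the "network" is hit
    return f"<html>body of {url}</html>"

proxy = ReplayProxy(fake_fetch)
proxy.get("https://example.test/page")          # Recording: one network fetch
proxy.recording = False
body = proxy.get("https://example.test/page")   # Playback: served locally
```

In PAS the equivalent service also stores headers and timing, and is collocated with the test agent so that even the "local" hop stays on one host.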
&nbsp;
<h4><em>A/B Experimentation</em></h4>
Bing uses Microsoft&rsquo;s internal A/B testing platform to evaluate performance in the real world prior to shipping. A/B experimentation is used both for evaluating the benefit of a potential perf optimization and for ensuring that standard feature enhancements don&rsquo;t regress performance. An A/B experiment&rsquo;s scorecard includes all relevant performance metrics that enable ship decisions after an experiment has completed (see more in the Enforced No Regressions section).<br />
&nbsp;<br />
The Bing performance team operates a <em>pre-experiment gate</em> that acts as a performance sanity check. It is implemented using the PAS system described in the prior section. The gate measures the control and treatments of an experiment to ensure the treatments&rsquo; performance is within a margin of expectations. This prevents egregiously performing code from being inflicted on real users during experiments. An additional guard is that after an experiment starts, it can trigger performance alerts if perf has regressed beyond a threshold, and the experiment will automatically be shut down if the regression is severe enough.<br />
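The gate's core check reduces to a simple comparison. The sketch below is hypothetical (the function name and the 5% margin are invented for illustration, not Bing's actual tooling): a treatment passes only if its measured latency is within the margin of the control's.

```python
def gate_passes(control_ms, treatment_ms, margin_pct=5.0):
    """Hypothetical pre-experiment gate: the treatment's lab-measured
    latency must be within margin_pct of the control's before the
    experiment is allowed to reach real users."""
    allowed_ms = control_ms * (1 + margin_pct / 100)
    return treatment_ms <= allowed_ms

# A 4% regression fits inside the margin; a 10% regression is blocked.
within_margin = gate_passes(control_ms=1000, treatment_ms=1040)
blocked = gate_passes(control_ms=1000, treatment_ms=1100)
```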
&nbsp;
<h4><em>Live Site Tools</em></h4>
Despite best efforts to prevent performance regressions due to code ships, they sometimes happen. A regression can also occur due to other causes, such as a networking infrastructure problem, a capacity issue, or a feature promotion on the site that is outside the control of engineering. To detect and quickly identify the root cause of a regression, Bing uses various tools.<br />
&nbsp;<br />
Bing uses typical time-series dashboards and reports to track real world performance over time, including drill-down into different dimensions and percentiles. There are flavors of these dashboards based on pre-cooked daily aggregations (unsampled) as well as NRT (near real time). Additionally, a data store with sampled data is used for deeper debugging across a larger set of dimensions; using sampled data allows for faster queries and visualization than is possible with the full data set.<br />
&nbsp;<br />
Because Bing has a vast amount of performance data collected from different sources, it&rsquo;s a challenge to automatically detect regressions. Unlike availability, which can be measured against a fixed threshold like 99.99%, performance metrics can&rsquo;t be measured using fixed thresholds, because expected performance varies with dimensions such as country, device class, and browser, and with seasonality in the data (end user performance varies across time-of-day and day-of-week due to a different mix of users on faster enterprise networks versus slower home networks).<br />
&nbsp;<br />
To automatically detect and alert on regressions, Bing uses an <em>Anomaly Detection</em> tool. Anomaly detection efficiently checks a vast number of dimension combinations to quickly identify unusual patterns or deviations in metrics, such as spikes or steps in latency or traffic. Anomaly detection allows engineers to identify and isolate issues to specific areas of the system, rather than having to search through a vast amount of data.<br />
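A greatly simplified version of such detection flags points that deviate sharply from a local baseline. The sketch below uses a trailing-window z-score and invented data; Bing's actual tool additionally models seasonality and scans many dimension combinations, which this toy does not attempt.

```python
from statistics import mean, stdev

def detect_steps(series, window=5, threshold=3.0):
    """Flag indices whose value is more than `threshold` standard
    deviations from the trailing window's mean -- a minimal stand-in
    for production anomaly detection on a latency time series."""
    anomalies = []
    for i in range(window, len(series)):
        base = series[i - window:i]
        mu, sigma = mean(base), stdev(base)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Hypothetical p90 latency samples with a spike at index 6.
latency_p90 = [100, 102, 99, 101, 100, 101, 160, 100, 99]
spikes = detect_steps(latency_p90)
```

Running the same check per dimension slice (country, browser, data center, ...) is what lets engineers isolate a regression to a specific area instead of searching the full data set by hand.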
&nbsp;
<h3>Team &amp; Process</h3>
Team and process components are the next part of the model. With a foundation of strong organizational commitment and excellent data and tools, Bing uses a centralized performance team and a couple of key processes to round out its performance mission.<br />
&nbsp;
<h4>Dedicated Performance Team</h4>
Bing has a dedicated performance team whose responsibility is to deliver and maintain the data and tools components described above and to oversee the processes described in the following two sections.<br />
&nbsp;<br />
The Bing performance team provides expert guidance, analysis, driving, training, and data and tools that democratize the performance mission &ndash; i.e., enable all feature developers to write and ship optimal code. Optimizing existing code is sometimes done by an engineer from the performance team, but to achieve scale, the performance team frequently identifies opportunities and then partners with other teams, so that many optimizations are implemented by feature developers.<br />
&nbsp;
<h4>Improvement Workstreams</h4>
The Bing performance team drives workstreams whose goal is to improve performance. A backlog is maintained that is populated with opportunities, usually deriving from data analysis of RUM data, lab analysis using the PAS tool, or the emergence of new web technologies (see prior <a href="https://blogs.bing.com/search-quality-insights/august-2022/Fast-Front-End-Performance-for-Microsoft-Bing" target="_blank">blog </a>for examples of optimizations that have come from these workstreams).<br />
&nbsp;<br />
For end user performance, the 90<sup>th</sup> percentile of worldwide performance is the primary focus, but specialized workstreams also are spun up as needed (e.g., to focus on a particular geography, to focus on higher percentiles, etc.).&nbsp; As mentioned above, these workstreams are typically executed in partnership with other teams to scale the optimization efforts.<br />
&nbsp;<br />
Given that Bing is a large organization with many teams and experiences, it&rsquo;s not feasible for the performance team to drive all improvement workstreams. In some cases, they will be driven by feature teams with the performance team in a supporting role.<br />
&nbsp;<br />
To add accountability, formal goals are set and tracked on a regular basis to ensure that perf improvement targets are being met. Without accountability, even the best-intentioned process is susceptible to failure.<br />
&nbsp;
<h4>Policy of No Regressions</h4>
Enforcing a policy of <em>No Regressions</em> is critical to the performance mission. It&rsquo;s all too easy to trade off performance against new features, especially if a business metric is shown to improve due to a new feature during A/B experimentation. But this is a losing game, as over time performance will inevitably degrade.<br />
&nbsp;<br />
Because of the strong organizational commitment to performance discussed earlier, Bing has been able to implement a process that prevents shipping perf regressions. New features must be at least performance-neutral, and Bing uses a formal sign-off process (tooling + humans) to ensure perf regressions don&rsquo;t ship. This forces teams to optimize performance of their features, and in some cases find other code to optimize to stay neutral.<br />
&nbsp;<br />
<img alt="Picture4.png" src="/getattachment/f5b0b1fe-e4e7-4fed-b95a-f218719cdba3/Picture4.png?lang=en-US" title="Picture4.png" /><br />
Figure 4: Process for Preventing Performance Regressions<br />
<br />
An exception mechanism exists so that in rare cases a perf regression can ship; this requires VP approval and must fit within a small annual <em>budget</em> (in msec) for regressions. To counteract these regressions, workstreams exist to improve performance as described above; in effect they refund the regression budget plus a surplus, so that Bing gets net faster over time.<br />
&nbsp;<br />
Like the perf improvement process, formal goals are set and tracked to ensure accountability to the policy of no regressions.<br />
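The budget mechanics can be sketched as a simple ledger. The class, numbers, and method names below are hypothetical, purely to illustrate the described policy: approved regressions draw the budget down, and improvement workstreams refund it (plus a surplus).

```python
class RegressionBudget:
    """Hypothetical annual regression budget in milliseconds."""

    def __init__(self, annual_ms):
        self.remaining_ms = annual_ms

    def request_exception(self, regression_ms):
        """A VP-approved regression ships only if it fits the remaining budget."""
        if regression_ms <= self.remaining_ms:
            self.remaining_ms -= regression_ms
            return True
        return False

    def credit_improvement(self, improvement_ms):
        """Improvement workstreams refund the budget."""
        self.remaining_ms += improvement_ms

budget = RegressionBudget(annual_ms=50)
ok = budget.request_exception(30)       # fits within the 50 ms budget: granted
too_big = budget.request_exception(40)  # exceeds the remaining 20 ms: denied
budget.credit_improvement(25)           # a shipped perf win refunds the budget
```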
&nbsp;
<h3>Continuity Over Time</h3>
Finally, continuity over time is essential. In other words, as an organization morphs due to turnover or new people in leadership positions, it&rsquo;s important to keep the model&rsquo;s execution intact.<br />
&nbsp;<br />
Fortunately, this hasn&rsquo;t been a problem at Bing, and we&rsquo;ve been able to improve the model&rsquo;s execution over time. But we&rsquo;ve seen cases in other divisions where a large reorganization led to a dramatic loss of institutional knowledge and a significant setback in the performance mission. The best way to overcome such situations is to ensure knowledge transfer to the new people, or even better have the model institutionalized company-wide, or at least across a higher level in the organization.<br />
&nbsp;<br />
- <a href="https://www.linkedin.com/in/pauljroy/">Paul Roy</a>, <a href="https://www.linkedin.com/in/jasonyu07msft/" target="_blank">Jason Yu</a>, and the Bing Performance Team]]></content>
    </item>
    <item>
      <guid
        isPermaLink="false">2e55e70a-c922-49d6-bd7d-25aa448b2239</guid>
      <link>https://blogs.bing.com/Engineering-Blog/october-2021/Bing-delivers-more-contextualized-search-using-quantized-transformer-inference-on-NVIDIA-GPUs-in-Azu</link>
      <author>Bing Team</author>
      <category>Bing</category>
      <category>Engineering</category>
      <title>Bing delivers more contextualized search using quantized transformer inference on NVIDIA GPUs in Azure</title>
      <description>Transformer models that power a growing number of intelligent capabilities in Microsoft Bing have significantly increased model complexity over the last couple of years. To ensure Bing will continue to deliver the fast, responsive, and relevant search experience our users expect, we optimized transformer inference for both latency and throughput using NVIDIA T4 GPUs in NCasT4v3 Azure VMs. These optimizations have enabled Microsoft Bing to serve millions of complex transformer inferences per second in a performant and cost-efficient manner.</description>
      <pubDate>Thu, 07 Oct 2021 08:00:00 Z</pubDate>
      <content><![CDATA[A couple of years ago we shared how Bing leveraged transformers for the first time at web-search scale to deliver <a href="https://azure.microsoft.com/en-us/blog/bing-delivers-its-largest-improvement-in-search-experience-using-azure-gpus/#:~:text=%20Bing%20delivers%20its%20largest%20improvement%20in%20search,Bing%20search%20experience%20worldwide%20using%20Azure%E2%80%99s...%20More%20">its largest improvement in search experience</a>.&nbsp; At the time, we used a distilled 3-layer transformer on top of Azure NV-series VMs with NVIDIA M60 GPUs to significantly improve Bing&rsquo;s search relevance within the stringent cost and latency constraints for web search.&nbsp;<br />
<br />
Since then, transformers have become increasingly popular across Bing and now power new capabilities such as <a href="https://blogs.bing.com/search/2020_05/AI-at-Scale-in-Bing">intelligent summarization</a> and <a href="https://blogs.bing.com/search-quality-insights/october-2020/Bing-Releases-Intelligent-Question-Answering-Feature-to-100-Languages">expanding Question-Answering to 100+ languages</a>. &nbsp;For example, by using domain adapted transformers, Bing incorporates signals such as the page&rsquo;s language, location, and a higher proportion of the web page&rsquo;s content to provide more relevant, fresh, and contextualized search experiences.&nbsp; Now when a user in Japan searches &ldquo;精神病院赤羽&rdquo;(mental health clinic Akabane), Bing uses the user&rsquo;s location and language to surface relevant clinic options in Akabane.<br />
<br />
<img alt="image002.png" src="/getattachment/0bafcd83-9bd5-4e1a-88ee-cf229a51a01d/image002.png.aspx?width=798&amp;height=275" style="width: 798px; height: 275px;" title="image002.png" /><br />
<br />
Relative to the initial 3-layer transformer integrated into Bing, the latest transformers are much more complex &ndash; each model has many more layers and needs to support much longer input sequence lengths. To ensure Bing will continue to deliver the fast, responsive, and relevant search experience our users expect, we&rsquo;ve invested heavily in transformer inference optimization across both hardware and software to mitigate the performance and cost impact of higher model complexity.&nbsp;<br />
&nbsp;
<h3>Optimizing transformers using inference-focused NVIDIA T4 GPUs in Azure</h3>
Using our prior experience optimizing transformers using <a href="https://docs.microsoft.com/en-us/azure/virtual-machines/ncv3-series">NCsv3 series Azure VMs</a> with NVIDIA V100 Tensor Core GPUs, we focused on optimizing transformers using <a href="https://docs.microsoft.com/en-us/azure/virtual-machines/nct4-v3-series">NCasT4v3 Azure VMs</a> given the inference-focused nature of NVIDIA T4 Tensor Core GPUs with low precision support like INT8.&nbsp; We optimized model inference along three main dimensions: Operator fusion, INT8 model quantization, and maximizing inference serving throughput.<br />
&nbsp;
<h3>Custom fused multi-head attention kernel</h3>
Operator fusion is an optimization technique that accelerates transformer performance by fusing multiple discrete transformer operations into a single kernel.&nbsp; This minimizes the overhead from memory copies and data transfer across different kernels, improving inference performance.<br />
In close collaboration with NVIDIA, we used a custom fused multi-head attention kernel that combined batch matrix multiplication, softmax, and other operators, and adapted it for each model&rsquo;s specific transformer parameters such as hidden dimension and sequence length.&nbsp; The kernel was also optimized to take advantage of the new INT8 Tensor Core support available on NVIDIA T4 GPUs.<br />
&nbsp;
<h3>Applying INT8 quantization</h3>
Another approach to improve model inference speed is to decrease the overall amount of computation through model compression via quantization.&nbsp; To leverage NVIDIA&rsquo;s Tensor Core GPU mixed-precision support, we applied post-training quantization to each model&rsquo;s FP32 weights in two steps: 1) generate and apply the INT8 scaling factors, and 2) select the operators to quantize in the graph based on the scenario&rsquo;s precision and latency thresholds.&nbsp; By using only post-training quantization, we were able to avoid quantization-aware fine-tuning and deploy models to production faster.<br />
<br />
We leveraged NVIDIA TensorRT&rsquo;s INT8 quantization pipeline to first dump each model&rsquo;s FP32 weights, identify the weight and activation quantization scaling factors based on calibration data, and store the quantized weights.&nbsp; We then evaluated multiple combinations of quantized kernels (levels).&nbsp; In Level 1 quantization, only the projection, embedding, and feed-forward layers were quantized to INT8, which improved model latency by 20.6%.&nbsp; For Level 2, we added INT8 batch matrix multiplication kernels, which improved model latency by 30.5%, and for Level 3 we quantized the additional layers, providing a 35.4% latency improvement at P95.&nbsp; Each quantization level has a different impact on model precision, so depending on the specific use case and model precision requirements, we select different levels of INT8 quantization to apply.<br />
<br />
<img alt="image004.png" src="/getattachment/4448afcf-055f-4766-be11-04be4d12bb31/image004.png.aspx" title="image004.png" /><br />
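The arithmetic behind symmetric INT8 scaling can be sketched as follows. This toy uses max calibration (scale = max |w| / 127) on a plain list of weights; TensorRT's actual pipeline calibrates activations from data and may use more sophisticated calibrators, so the function names and numbers here are purely illustrative.

```python
def int8_scale(weights):
    """Symmetric scale factor mapping the largest-magnitude weight to 127."""
    return max(abs(w) for w in weights) / 127.0

def quantize(weights, scale):
    """Map FP32 weights to INT8 codes, clamped to the symmetric range."""
    return [max(-127, min(127, round(w / scale))) for w in weights]

def dequantize(qweights, scale):
    """Recover approximate FP32 values from the INT8 codes."""
    return [q * scale for q in qweights]

weights = [0.5, -1.27, 0.02, 1.0]
scale = int8_scale(weights)       # 1.27 / 127 = 0.01
q = quantize(weights, scale)
recovered = dequantize(q, scale)  # close to, but not exactly, the originals
```

The gap between `weights` and `recovered` is the quantization error; choosing which layers tolerate that error is exactly the level-selection step described above.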
&nbsp;
<h3>Maximizing inference serving throughput</h3>
These transformers are hosted in Bing&rsquo;s deep learning inference service (DLIS), which runs on a heterogeneous hardware fleet spanning GPU and CPU machines. Each NCasT4v3 Azure VM contains four NVIDIA T4 GPUs and can host four independent model instances. Each model instance is associated with two CUDA streams to further saturate GPU utilization, which means that in total each GPU server can handle eight concurrent inference requests. To minimize service overhead, each server runs a single DLIS service instance. The service instance has a per-model dispatch queue that decides which model instance receives each request and sets a dynamic batch size based on the number of requests in the queue, as shown in the diagram below.<br />
<br />
<img alt="image006.png" src="/getattachment/d22e6e87-832c-4f91-b61c-74a1dd7ac931/image006.png.aspx?width=800&amp;height=315" style="width: 800px; height: 315px;" title="image006.png" /><br />
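The dispatch-queue behavior can be sketched with a toy queue whose batch size adapts to queue depth. The class and the cap of 4 are invented for illustration; DLIS's real dispatcher also routes across model instances and CUDA streams, which this sketch omits.

```python
from collections import deque

class DispatchQueue:
    """Toy per-model dispatch queue illustrating dynamic batching:
    each dispatched batch is as large as the current backlog allows,
    capped at the model's maximum batch size."""

    def __init__(self, max_batch=10):
        self.max_batch = max_batch
        self.pending = deque()

    def enqueue(self, request):
        self.pending.append(request)

    def next_batch(self):
        # Batch size adapts to queue depth, up to max_batch.
        size = min(len(self.pending), self.max_batch)
        return [self.pending.popleft() for _ in range(size)]

q = DispatchQueue(max_batch=4)
for i in range(6):
    q.enqueue(f"req{i}")
first = q.next_batch()    # backlog of 6, capped at max_batch=4
second = q.next_batch()   # the remaining 2 requests
```

Batching like this trades a little per-request queueing delay for much higher GPU throughput, since one kernel launch serves the whole batch.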
While we successfully optimized tensor computation on the GPU, we observed that tokenization emerged as a system bottleneck due to concurrent inference requests and longer input sequences.&nbsp; By offloading the tokenization pre-processing to CPU-dense servers, in combination with the accelerated networking available in NCasT4v3 VMs, we reduced the end-to-end latency of each inference by 2.5 milliseconds.&nbsp; This is quite significant given that each model inference is typically tuned to below 10 milliseconds to ensure fast and responsive Bing search results.<br />
<br />
<h3>Serving complex transformer inference at web-search scale</h3>
We benchmarked a 6-layer transformer with max sequence length 256 and batch size 10 on an NCasT4_v3 Azure VM using the above optimizations.&nbsp; By combining all of them, we observed nearly a 3x increase in throughput per NVIDIA T4 GPU, which was critical to meet our stringent cost-to-serve and latency requirements.<br />
&nbsp;
<table border="1" cellpadding="0" cellspacing="0">
	<tbody>
		<tr>
			<td style="width:174px;"><strong>Benchmarks on T4</strong></td>
			<td style="width:107px;"><strong>Precision</strong></td>
			<td style="width:90px;"><strong>Throughput</strong></td>
			<td style="width:138px;"><strong>Model Latency (ms)</strong></td>
		</tr>
		<tr>
			<td style="width:174px;">Baseline Model</td>
			<td style="width:107px;">FP16</td>
			<td style="width:90px;">1123</td>
			<td style="width:138px;">8.9</td>
		</tr>
		<tr>
			<td style="width:174px;">Fused Kernel</td>
			<td style="width:107px;">FP16</td>
			<td style="width:90px;">2083</td>
			<td style="width:138px;">4.84</td>
		</tr>
		<tr>
			<td style="width:174px;">Fused Kernel</td>
			<td style="width:107px;">INT8 (Level 1)</td>
			<td style="width:90px;">2525</td>
			<td style="width:138px;">3.96</td>
		</tr>
		<tr>
			<td style="width:174px;">Fused Kernel + two CUDA streams</td>
			<td style="width:107px;">INT8 (Level 1)</td>
			<td style="width:90px;">3174</td>
			<td style="width:138px;">6.38</td>
		</tr>
	</tbody>
</table>
Globally we&rsquo;re using these transformer optimizations to support tens of millions of transformer inferences per second across 5 Azure regions on thousands of NCasT4_v3 Azure VMs.&nbsp; Our experience shows that complex large-scale transformer inference can be successfully optimized for both performance and efficiency by using inference-focused Azure NCasT4v3 VMs.&nbsp; Without these software and hardware inference optimizations, the increase in model latency and cost to serve would have made these models impossible to ship and power Bing&rsquo;s continued improvements in search experience.<br />
<br />
If you&rsquo;re interested in exploring some of these optimizations, please see the following resources:
<ul>
	<li><a href="https://docs.microsoft.com/en-us/azure/virtual-machines/nct4-v3-series">NCasT4_v3 series Azure VMs</a> are now generally available in Azure</li>
	<li>To explore applying INT8 quantization for NVIDIA GPUs, see <a href="https://github.com/NVIDIA/TensorRT/tree/master/demo/BERT">NVIDIA&rsquo;s BERT Inference with Nvidia TensorRT demo on GitHub</a> for quantization examples</li>
</ul>
-&nbsp;<a href="https://www.linkedin.com/in/jeffrey-s-zhu/">Jeffrey Zhu</a>, <a href="https://www.linkedin.com/in/mingqin-%E6%9D%8E%E6%98%8E%E7%90%B4-b017b41a/">Mingqin Li</a>, <a href="https://www.linkedin.com/in/zengzhongli/">Jason (Zengzhong) Li</a>, and <a href="https://www.linkedin.com/in/cassandrato/">Cassandra Oduola</a><br />
&nbsp;]]></content>
    </item>
    <item>
      <guid
        isPermaLink="false">24749f0c-13ee-405b-bbb8-3494210cf847</guid>
      <link>https://blogs.bing.com/Engineering-Blog/october-2021/RocksDB-in-Microsoft-Bing</link>
      <author>Bing Team</author>
      <category>bing</category>
      <category>engineering</category>
      <title>RocksDB in Microsoft Bing </title>
      <description>&lt;em&gt;The Microsoft Bing platform has built one of the largest distributed storage systems for Bing web search data, using its home-grown &lt;/em&gt;&lt;a href="https://www.microsoft.com/en-us/research/blog/evolution-bings-objectstore/"&gt;&lt;em&gt;ObjectStore service&lt;/em&gt;&lt;/a&gt;&lt;em&gt;. The system hosts hundreds of petabytes of data and processes hundreds of millions of lookups per second. Open source &lt;/em&gt;&lt;a href="https://rocksdb.org/"&gt;&lt;em&gt;RocksDB&lt;/em&gt;&lt;/a&gt;&lt;em&gt; is used as the storage engine. Multiple techniques are applied to efficiently store and process the massive data with sub-second data freshness. This blog will present those techniques and the results in production.&lt;/em&gt;</description>
      <pubDate>Tue, 05 Oct 2021 08:17:36 Z</pubDate>
      <content><![CDATA[<div><em>The Microsoft Bing platform has built one of the largest distributed storage systems for Bing web search data, using its home-grown </em><a href="https://www.microsoft.com/en-us/research/blog/evolution-bings-objectstore/"><em>ObjectStore service</em></a><em>. The system hosts hundreds of petabytes of data and processes hundreds of millions of lookups per second. Open source </em><a href="https://rocksdb.org/"><em>RocksDB</em></a><em> is used as the storage engine. Multiple techniques are applied to efficiently store and process the massive data with sub-second data freshness. This blog will present those techniques and the results in production.</em></div>

<h3><br />
<span style="font-size: 24px;">Web Search Data Scenario</span></h3>

<div>In early 2018, the team started the effort to build a new platform, based on ObjectStore with RocksDB, for all Microsoft Bing web document storage and processing pipelines. The new platform is not just a replacement for the old one, which had served Bing&rsquo;s offline data-processing workloads for over 10 years; it also delivers a massive increase in data volume and improved freshness. It hosts hundreds of petabytes of data and handles tens of billions of document-processing operations per day. This new platform became the largest RocksDB deployment at Microsoft.<br />
<br />
The following diagram shows a simplified architecture of the web data platform. Processing of a document starts in the Process Controller, which picks the documents that should currently be processed. The first interaction is with the Crawler, where a subset of these documents is recrawled to get the latest content. Each document is then assigned to a Document Processing node, which first reads the prior state of the document from the Web Data table, then generates the in-index representation of the document and updates processing metadata to help with future reprocessing. All the newly created data is stored back into the Web Data table. The Index Building service then reads the in-index version of the document from the Web Data table and creates a merged index that is served to users. Injections (of offline-generated data) are written directly to the Web Data table and picked up by the Index Building service to add directly into the merged index. The Messaging Service is used to send information between different documents (e.g., anchor text collected from a source page to its target page). Lookups of any document&rsquo;s state and data are possible from the Web Data table at any time and are used in web partner pipelines and for debugging.<br />
<br />
<img alt="image002.png" src="/getattachment/Engineering-Blog/october-2021/RocksDB-in-Microsoft-Bing/image002.png?width=798&amp;height=445" style="width: 798px; height: 445px;" title="image002.png" /><br />
&nbsp;
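The flow above can be sketched as a toy processing loop. Every name here (WebDataTable, process_document, the crawl callback) is illustrative, not Bing&rsquo;s actual API:

```python
# Toy sketch of the document pipeline: the controller picks a document, the
# crawler refreshes its content, and the processor reads prior state from the
# Web Data table and writes the new in-index representation back.

class WebDataTable:
    """Stand-in for the Web Data table: per-URL column data."""
    def __init__(self):
        self.rows = {}

    def read(self, url):
        return dict(self.rows.get(url, {}))

    def write(self, url, columns):
        self.rows.setdefault(url, {}).update(columns)

def process_document(url, table, crawl):
    content = crawl(url)                  # Crawler: fetch the latest content
    prior = table.read(url)               # prior state from the Web Data table
    table.write(url, {
        "content": content,
        "in_index": content.lower(),      # toy "in-index representation"
        "version": prior.get("version", 0) + 1,  # metadata for reprocessing
    })

table = WebDataTable()
for _ in range(2):                        # the controller picks this URL twice
    process_document("https://example.com/a", table, crawl=lambda u: "Fresh Content")
assert table.read("https://example.com/a")["version"] == 2
```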
<h3>Compute and Storage Fabric</h3>
The ability to grow computing power and storage independently is in high demand, and separate development and release cycles for compute and storage services are critical. Decoupling storage from compute therefore became the key design principle.<br />
<br />
At the same time, this does not mean the two types of services must be deployed on different machines. On the contrary, physically co-hosting CPU-intensive compute services with IO-intensive table services can greatly improve resource utilization, and networking costs for transferring data between storage and compute can be reduced through affinitization.<br />
<br />
The following illustration shows a co-hosted Document Process Server and ObjectStore Table Server. Coprocessors are user-defined functions that are hosted and executed by the Table Server, which allows running user code close to the data. In this design, some lightweight processing logic is executed in Coprocessors, such as close-to-data filtering to find candidate documents to process.<br />
<br />
<img alt="image004.png" src="/getattachment/Engineering-Blog/october-2021/RocksDB-in-Microsoft-Bing/image004.png?lang=en-US&amp;width=800&amp;height=488" style="width: 800px; height: 488px;" title="image004.png" /><br />
&nbsp;
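As a rough illustration of the coprocessor idea (the names below are hypothetical, not ObjectStore&rsquo;s API), a user-supplied predicate runs on the Table Server next to the data, so only matching document keys cross the network:

```python
# Sketch of close-to-data filtering with a coprocessor: the predicate executes
# on the storage node, and only the qualifying keys are returned to the caller.

class TableServer:
    def __init__(self, rows):
        self.rows = rows  # {doc_key: {"last_processed": timestamp, ...}}

    def run_coprocessor(self, predicate):
        # User code runs here, next to the data; full rows never leave the node.
        return [key for key, row in self.rows.items() if predicate(row)]

server = TableServer({
    "doc-a": {"last_processed": 100},
    "doc-b": {"last_processed": 9000},
})
# Find candidate documents that have not been processed since time 1000.
stale = server.run_coprocessor(lambda row: row["last_processed"] < 1000)
assert stale == ["doc-a"]
```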
<h3>Data Model</h3>
When we prototyped the web data project, a series of tables came up naturally to represent the lifespan of web documents, from crawled raw documents to processed documents. Some common data is duplicated across these tables, so transaction support across tables is required to achieve consistency under concurrent data-processing tasks. We settled on a Bigtable-like data model that joins all sub-tables by primary key as column groups (explained later), which can be used by many applications and pipelines. It improved resource efficiency by avoiding redundant data and reduced management costs by using fewer tables. Also, a batch update of multiple datasets for a document uses a cheap in-row transaction instead of a distributed transaction across tables.<br />
<br />
<img alt="image006.png" src="/getattachment/Engineering-Blog/october-2021/RocksDB-in-Microsoft-Bing/image006.png?lang=en-US&amp;width=800&amp;height=405" style="width: 800px; height: 405px;" title="image006.png" /><br />
&nbsp;
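A minimal sketch of why the batch update is cheap (a simplification, with invented names): because all column groups hang off one primary key, a multi-group update is installed at a single point in one row rather than coordinated across tables:

```python
# In-row "transaction" sketch: stage updates to several column groups of the
# same document, then install them all at once under one primary key.

class Row:
    def __init__(self):
        self.groups = {}   # column group name -> {column: value}

class Table:
    def __init__(self):
        self.rows = {}

    def update_row(self, primary_key, group_updates):
        row = self.rows.get(primary_key, Row())
        # Stage copies of every touched group first...
        staged = {g: dict(row.groups.get(g, {})) for g in group_updates}
        for group, cols in group_updates.items():
            staged[group].update(cols)
        # ...then install in one step: all groups change together or not at all.
        row.groups.update(staged)
        self.rows[primary_key] = row

t = Table()
t.update_row("url-hash-1", {
    "crawl":   {"raw_html": "<html>...</html>"},
    "process": {"in_index": "tokens", "processed_at": 12345},
})
assert set(t.rows["url-hash-1"].groups) == {"crawl", "process"}
```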
<h3>Column Store</h3>
The Web Data table is persisted and served as Column Store, a distributed NoSQL database built on RocksDB that supports column-oriented operations. It provides an efficient way of accessing a portion of the data within a record and guarantees atomicity when multiple columns are updated or read across column groups. Partitioning is done by splitting ranges of the URL hash. The logical storage hierarchy is shown below:<br />
<br />
<img alt="image008.png" src="/getattachment/Engineering-Blog/october-2021/RocksDB-in-Microsoft-Bing/image008.png?lang=en-US&amp;width=800&amp;height=344" style="width: 800px; height: 344px;" title="image008.png" /><br />
<br />
Column Store supports up to 64K predefined columns, with <a href="https://github.com/Microsoft/bond">Bond</a> as the schematization provider.

<ul>
	<li>A column family (different from RocksDB&rsquo;s column family; the term CF will be used for RocksDB column families in this document to avoid ambiguity) is a special column that contains a set of sub-columns indexed by a string key (e.g., K1, K2, K3&hellip;). The number of sub-columns is arbitrary and can differ between records. It is commonly used for storing a varying number of same-typed facets for a record.</li>
	<li>A column group is a set of columns with the same data locality.</li>
</ul>
Column Store uses column storage key schematization to flatten 2-dimensional table cells into K/V pairs stored in RocksDB. A column storage key is composed of the primary key, a 2-byte column Id, and a nullable sub-key for indexing into a column family. This key schematization keeps the columns of a row adjacent in the compacted SST (Sorted String Table) files, which improves the efficiency of key prefix encoding and maximizes the block cache hit rate.<br />
<br />
<img alt="image010.png" src="/getattachment/Engineering-Blog/october-2021/RocksDB-in-Microsoft-Bing/image010.png?width=800&amp;height=493" style="width: 800px; height: 493px;" title="image010.png" /><br />
&nbsp;
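The key layout can be sketched as follows (an assumed encoding for illustration: primary key bytes, then a 2-byte big-endian column Id, then the optional sub-key). Sorting such keys, as RocksDB does in its SST files, keeps all columns of one row contiguous:

```python
# Sketch of column storage key schematization: flatten (row, column, sub-key)
# into a single byte-string key whose sort order groups a row's columns.
import struct

def column_key(primary_key: bytes, column_id: int, sub_key: bytes = b"") -> bytes:
    return primary_key + struct.pack(">H", column_id) + sub_key

keys = [
    column_key(b"rowA", 2),
    column_key(b"rowB", 1),
    column_key(b"rowA", 1),
    column_key(b"rowA", 3, b"K1"),   # a column-family sub-column
]
# After sorting (as in an SST file), rowA's columns are contiguous, ordered by Id,
# which helps key prefix encoding and block-cache locality.
assert sorted(keys)[:3] == [column_key(b"rowA", 1),
                            column_key(b"rowA", 2),
                            column_key(b"rowA", 3, b"K1")]
```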
<h3>Hot and cold data separation using column groups</h3>
Distributed storage engines are often optimized either for high QPS and low latency or for high throughput and low cost, and most popular engines can be configured toward one of these profiles. However, web document processing poses two major challenges:

<ol>
	<li>Hot documents can be refreshed in seconds, while cold ones can stay unchanged for years, and the characteristics of the same document can change dramatically over time.</li>
	<li>System cost cannot be ignored when solving problems at web scale.</li>
</ol>
SSD technology has evolved rapidly in the past decade and provides cheaper IOPS per dollar than HDD. However, for workloads that do not require substantial amounts of random access, HDD is still the more economical solution in terms of volume per dollar.<br />
<br />
Web documents can be classified as relatively hot or cold by update frequency and stored in different column groups accordingly. Each column group maps to a separate LSM-tree with its path assigned to either SSD or HDD.<br />
<br />
Unlike separating hot and cold data into multiple tables, a single table with multiple column groups naturally supports in-row transactions across storage tiers, which allows documents to switch between hot and cold storage easily. Batch access to multiple columns of a document across tiers is also much cheaper.
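A sketch of the tiering decision and of an in-row tier switch (the threshold and group names are invented for illustration):

```python
# Hypothetical sketch: documents are classified hot or cold by update frequency
# and mapped to a column group whose LSM-tree lives on SSD or HDD; switching
# tiers is a single in-row move of the document's columns.

def pick_column_group(updates_per_day: float) -> str:
    # Frequently refreshed documents go to the SSD-backed "hot" group.
    return "hot" if updates_per_day >= 1.0 else "cold"

def switch_tier(row_groups: dict, to_hot: bool) -> dict:
    """Move a document's columns between groups in one in-row update, so the
    delete from one tier and the insert into the other land together."""
    src, dst = ("cold", "hot") if to_hot else ("hot", "cold")
    row_groups[dst].update(row_groups[src])
    row_groups[src].clear()
    return row_groups

assert pick_column_group(24.0) == "hot"     # refreshed hourly
assert pick_column_group(0.001) == "cold"   # rarely changes

row = {"hot": {}, "cold": {"content": "<html>"}}
row = switch_tier(row, to_hot=True)         # the document became popular
assert row == {"hot": {"content": "<html>"}, "cold": {}}
```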
<h3>Columnar processing</h3>
A major portion of document processing is done sequentially, which requires enormous range scans over some column data. Joining columns that are frequently accessed together into a separate column group enables efficient columnar processing.<br />
<br />
Here is a Column Store performance comparison for a sequential scan on HDD without cache. Tests were done after ingesting 500K rows, each with 2 columns of 100KB average size. The test compares read throughput when:
<ul style="list-style-type:circle;">
	<li>Reading one of the two columns by filtering on the column Id.</li>
	<li>Assigning the columns to two column groups and reading one column group.</li>
</ul>

<table border="0" cellpadding="0" cellspacing="0" width="449">
	<tbody>
		<tr>
			<td nowrap="nowrap" style="width:201px;height:15px;">&nbsp;</td>
			<td nowrap="nowrap" style="width:123px;height:15px;"><strong><em>Column Filter</em></strong></td>
			<td nowrap="nowrap" style="width:123px;height:15px;"><strong><em>Column Group</em></strong></td>
		</tr>
		<tr>
			<td nowrap="nowrap" style="width:201px;height:15px;"><strong><em>Sequential Scan </em></strong></td>
			<td style="width:123px;height:15px;">15.6 MB/s</td>
			<td style="width:123px;height:15px;">22.0 MB/s</td>
		</tr>
	</tbody>
</table>
Using a column group avoids reading unnecessary data and provides better performance.<br />
&nbsp;
<h3>Column Groups in RocksDB</h3>
Column groups are mapped to <a href="https://github.com/facebook/rocksdb/wiki/Column-Families">RocksDB&rsquo;s CFs</a>, which physically split the database into separate LSM-trees while still supporting atomic access across CFs.<br />
<br />
Here is the illustration for RocksDB stack:<br />
<br />
<img alt="image012-(1).png" src="/getattachment/Engineering-Blog/october-2021/RocksDB-in-Microsoft-Bing/image012.png?lang=en-US&amp;width=800&amp;height=458" style="width: 800px; height: 458px;" title="image012-(1).png" /><br />
<br />
Three major techniques improve performance: JBOD, a per-storage compaction limiter, and disk prioritization management.<br />
&nbsp;
<h3>JBOD on RocksDB</h3>
JBOD (Just a Bunch Of Disks) is one of the fundamental technologies we picked for the Web Data table. Compared with RAID0, it gives the software stack the ability to access all disks independently. While RAID0 has lower development costs, its throughput and IOPS capacity are much lower than JBOD&rsquo;s.<br />
<br />
JBOD support in RocksDB is implemented by overriding the file system APIs with a JBOD environment wrapper. The performance benefits on HDD are significant, since IOPS is the HDD bottleneck in most scenarios; with JBOD, the IOPS of each individual disk can be used separately.<br />
<br />
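The placement idea can be sketched like this (a simplification: RocksDB&rsquo;s actual JBOD support overrides its file system APIs, and the hash placement below is just one plausible policy). Each file lives whole on one member disk, so each spindle&rsquo;s IOPS is used independently instead of striping every file across all disks as RAID0 does:

```python
# Sketch of JBOD file placement: deterministically pick one member disk per file.
import hashlib

DISKS = ["/data/disk0", "/data/disk1", "/data/disk2"]

def jbod_path(filename: str) -> str:
    digest = hashlib.md5(filename.encode()).hexdigest()
    return DISKS[int(digest, 16) % len(DISKS)] + "/" + filename

# All IO for one file hits exactly one disk; different files spread across disks.
assert jbod_path("000001.sst") == jbod_path("000001.sst")   # stable placement
used = {jbod_path(f"{n:06d}.sst").rsplit("/", 2)[1] for n in range(1000)}
assert used == {"disk0", "disk1", "disk2"}                  # all members get work
```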
In a RocksDB random-overwrite test with 3 HDDs, JBOD achieved more than 2x the write throughput of RAID0, and a later test using 9 HDDs in JBOD showed almost linear improvement with the number of disks.<br />
&nbsp;
<table border="0" cellpadding="0" cellspacing="0" width="582">
	<tbody>
		<tr>
			<td nowrap="nowrap" style="width:205px;height:16px;">&nbsp;</td>
			<td nowrap="nowrap" style="width:125px;height:16px;"><strong><em>3 HDDs RAID0</em></strong></td>
			<td nowrap="nowrap" style="width:125px;height:16px;"><strong><em>3 HDDs JBOD</em></strong></td>
			<td style="width:125px;height:16px;"><strong><em>9 HDDs JBOD</em></strong></td>
		</tr>
		<tr>
			<td nowrap="nowrap" style="width:205px;height:16px;"><strong><em>Write throughput </em></strong></td>
			<td style="width:125px;height:16px;">40 MB/s</td>
			<td style="width:125px;height:16px;">100 MB/s</td>
			<td style="width:125px;height:16px;">285 MB/s</td>
		</tr>
	</tbody>
</table>
&nbsp;

<h3>Compaction Concurrent Limiter</h3>
RocksDB ingestion speed is limited by background compaction throughput: writes will eventually be throttled if compaction cannot catch up with user writes.<br />
<br />
Column groups are built on top of RocksDB column families (CFs), and all the CFs of the same DB instance share components such as the WAL, the flush and compaction schedulers, and the thread pool. After we extended RocksDB to support <a href="https://github.com/facebook/rocksdb/commit/446b32cfc35f38d173673aa461ac9a91b0d0054d">configuring CFs with different working paths</a> across storage types, we found <a href="https://github.com/facebook/rocksdb/issues/3972">a performance issue</a>: SSD and HDD did not work well together, because slow HDD compactions exhausted the compaction thread pool and blocked SSD compactions.<br />
<br />
Enlarging the thread pool is not an option, as HDDs become much less performant under highly concurrent compactions, being bounded by IOPS. The <a href="https://github.com/facebook/rocksdb/commit/a8b9891f95235d14f9119edbe8042cad708b22bc">compaction limiter</a> is designed as an object shareable across CFs that controls the number of outstanding compaction tasks. Global control of the maximum outstanding compaction tasks on an individual drive is achieved by sharing a per-drive compaction limiter across the CFs of all DB instances. Compared with uncontrolled mixed compactions, a fine-tuned compaction limiter gave us an 80% gain in combined ingestion throughput, measured with 3 HDDs in JBOD and 2 SSDs in RAID0.<br />
&nbsp;
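The limiter can be sketched with a counting semaphore per drive (a simplification of the compaction limiter linked above; the limits are invented). Every CF whose LSM-tree lives on a given drive shares that drive&rsquo;s limiter, so slow HDD compactions can never occupy more than their quota of the shared thread pool:

```python
# Per-drive compaction limiter sketch: a semaphore caps concurrent compactions.
import threading

class CompactionLimiter:
    def __init__(self, max_outstanding: int):
        self._slots = threading.Semaphore(max_outstanding)

    def try_acquire(self) -> bool:
        # Non-blocking: if the drive is saturated, the compaction stays queued
        # and the worker thread is free to serve another drive.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()

hdd_limiter = CompactionLimiter(max_outstanding=2)  # shared by all HDD CFs
ssd_limiter = CompactionLimiter(max_outstanding=8)  # SSDs tolerate more concurrency

assert hdd_limiter.try_acquire() and hdd_limiter.try_acquire()
assert not hdd_limiter.try_acquire()   # a third HDD compaction must wait...
assert ssd_limiter.try_acquire()       # ...but SSD compactions still run
hdd_limiter.release()
assert hdd_limiter.try_acquire()       # a freed slot is reusable
```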
<h3>Disk Prioritization Management</h3>
To achieve better serving performance and IO utilization, the Disk IO manager is introduced as a common layer on top of the file system API that controls and prioritizes IO based on the current outstanding requests. For example, RocksDB compaction writes yield to flush writes, and compaction reads yield to user reads when the disk is busy. Disk reads and writes for remote data recovery yield to all other types of IO and are marked as the lowest priority.<br />
<br />
The Disk IO manager maintains IO tasks in a multi-level feedback queue with FCFS (first come, first served) scheduling at each level; IO requests are pushed into the queue when the predefined limit on outstanding IOs or outstanding bytes is reached. When an IO task completes, the next task is picked by FCFS from the highest-priority queue, followed by the lower-priority queues, until the outstanding-task limit is reached again. To prevent starvation of the lower-priority queues, a task is promoted to a higher queue after a certain waiting time.<br />
<br />
<img alt="image014.png" src="/getattachment/Engineering-Blog/october-2021/RocksDB-in-Microsoft-Bing/image014.png?width=800&amp;height=303" style="width: 800px; height: 303px;" title="image014.png" />
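The queueing discipline described above can be sketched as follows (the priority levels and aging threshold are illustrative):

```python
# Sketch of the Disk IO manager's multi-level feedback queue: FCFS within each
# level, highest level first, and aging promotes requests that wait too long.
from collections import deque

class DiskIOManager:
    def __init__(self, levels: int = 3, promote_after: int = 5):
        self.queues = [deque() for _ in range(levels)]  # 0 = highest priority
        self.promote_after = promote_after
        self.clock = 0                                  # logical time

    def submit(self, request, priority: int) -> None:
        self.queues[priority].append((request, self.clock))

    def next_task(self):
        self.clock += 1
        # Anti-starvation: promote the oldest waiter of each lower level.
        for level in range(1, len(self.queues)):
            q = self.queues[level]
            if q and self.clock - q[0][1] >= self.promote_after:
                self.queues[level - 1].append(q.popleft())
        # FCFS from the highest non-empty priority queue.
        for q in self.queues:
            if q:
                return q.popleft()[0]
        return None

mgr = DiskIOManager()
mgr.submit("recovery-read", priority=2)  # lowest: remote data recovery IO
mgr.submit("flush-write", priority=0)    # highest: flush writes win when busy
mgr.submit("user-read", priority=0)
assert mgr.next_task() == "flush-write"  # FCFS within the top level
assert mgr.next_task() == "user-read"
assert mgr.next_task() == "recovery-read"
```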
<h3>Summary</h3>
In this blog post, we introduced RocksDB usage in the Microsoft Bing web data scenario, which to our knowledge is one of the largest Windows RocksDB deployments in the industry. To meet the scalability and agility requirements, we built the storage fabric independently of the compute fabric, on top of the ObjectStore table service. Several key techniques are applied to make the system efficient and performant, including column-based processing, hot-cold data tiering, JBOD, a separate compaction limiter for HDD and SSD, and disk prioritization.<br />
<br />
Tremendous improvements have also been made to the data replication protocol; we will talk about them in the next blog post.<br />
<br />
<br />
- Burton&nbsp;(Pu)&nbsp;Li, Jason&nbsp;(Zengzhong)&nbsp;Li, Max Sigalov,&nbsp;Dafan Liu,&nbsp;Knut Magne Risvik&nbsp;<br />
<br />
<br />
&nbsp;</div>
]]></content>
    </item>
    <item>
      <guid
        isPermaLink="false">721d2211-3248-4980-9a94-f39f8b12fc04</guid>
      <link>https://blogs.bing.com/Engineering-Blog/october-2021/Welcome-to-the-engineering-blog</link>
      <author>Bing Team</author>
      <category>Bing</category>
      <category>Engineering</category>
      <title>Welcome to the engineering blog</title>
      <description>Bing and all search and recommendation experiences at Microsoft are powered by infrastructure that runs at extreme scale and speed. The platform team is a world-class engineering team with a presence around the world. Our mission is to build platforms that empower the scale and scenarios for search today and tomorrow.&lt;br /&gt;
&amp;nbsp;</description>
      <pubDate>Tue, 05 Oct 2021 08:14:01 Z</pubDate>
      <content><![CDATA[Bing and all search and recommendation experiences at Microsoft are powered by infrastructure that runs at extreme scale and speed. The platform team is a world-class engineering team with a presence around the world. Our mission is to build platforms that empower the scale and scenarios for search today and tomorrow.<br />
<br />
Some of our infrastructure is specialized for the extreme scale and performance we will need, but we also embrace open-source technology to a great extent. We stand on the shoulders of giants.<br />
<br />
In this new blog we will showcase our work on infrastructure, presented by amazing engineers. We are excited to share, and even more excited to learn.<br />
Happy reading!<br />
&nbsp;<br />
Andrey Proskurin, Corporate Vice President<br />
Knut Risvik, Distinguished Engineer<br />
<br />
&nbsp;]]></content>
    </item>
    <item>
      <guid
        isPermaLink="false">25749f88-3626-4344-b9c3-bdbd44eadd86</guid>
      <link>https://blogs.bing.com/Engineering-Blog/march-2020/Bling-FIRE-Tokenizer-for-BERT</link>
      <author>Bing Team</author>
      <category>BERT</category>
      <category>Bling</category>
      <category>fast</category>
      <category>FIRE</category>
      <category>Tokenizer</category>
      <title>Bling FIRE Tokenizer for BERT</title>
      <description>Bling Fire Tokenizer is a blazing fast tokenizer that we use in production at Bing for our Deep Learning models. We&amp;rsquo;ve added support for the BERT-style tokenizers, available for Windows, Linux and Mac OS X.</description>
      <pubDate>Thu, 19 Mar 2020 07:00:00 Z</pubDate>
      <content><![CDATA[Bling Fire Tokenizer is a blazing fast tokenizer that we use in production at Bing for our Deep Learning models. For online scenarios, where the tokenizer is part of the critical path to return a result to the user in the shortest amount of time, every millisecond matters. This is where Bling FIRE&rsquo;s performance helps us achieve sub-second response times, allowing more execution time for complex deep models rather than spending that time in tokenization.<br />
<br />
We released Bling Fire Tokenizer to open source a year ago, to great community interest. We also got feedback and feature asks from you that we&rsquo;ve addressed since then.<br />
<br />
We&rsquo;ve since added support for BERT-style tokenizers with normalization and sub-token extraction. We provide precompiled tokenization models for BERT base/large, BERT base/large cased, BERT Chinese, and BERT Multilingual Cased, and we&rsquo;ve added instructions for creating your own tokenization models <a href="https://github.com/microsoft/BlingFire/wiki/How-to-add-a-new-BERT-tokenizer-model">here</a> and <a href="https://github.com/microsoft/BlingFire/wiki/How-to-change-linguistic-resources">here</a>.<br />
<br />
In terms of speed, we&rsquo;ve now measured how Bling Fire Tokenizer compares with the current BERT-style tokenizers: the original WordPiece BERT tokenizer and the Hugging Face tokenizer. Using the BERT Base Uncased tokenization task, we ran the original BERT tokenizer, the latest Hugging Face tokenizer, and Bling Fire v0.0.13, with the following results:<br />
<img alt="" src="/BingBlogs/media/Developer-Blog/BlingFireSpeedBERT2.PNG" /><br />
As you can see, Bling Fire is much faster than existing tokenizers for BERT-based models. You can find the details of the benchmark <a href="https://github.com/microsoft/BlingFire/wiki/Comparing-performance-of-Bling-Fire-and-Hugging-Face-Tokenizers">here</a>.<br />
<br />
Bling Fire provides state-of-the-art latency and is available for Windows, Linux, and Mac OS X. You can get access to the library and find out more details at <a href="https://github.com/Microsoft/BlingFire">https://github.com/Microsoft/BlingFire</a>. To reach out to the team with questions or comments, connect with us on <a href="https://stackoverflow.com/questions/tagged/blingfire">Stack Overflow</a>.<br />
&nbsp;<br />
- Bling Web Data Team
<div>&nbsp;</div>
]]></content>
    </item>
    <item>
      <guid
        isPermaLink="false">f10a93bf-7d05-4123-a884-689319ccfab5</guid>
      <link>https://blogs.bing.com/Engineering-Blog/2019-10/learn-connect-explore-with-the-bing-search-apis-team-at-microsoft-ignite-2019</link>
      <author>Bing Team</author>
      <category>Bing Search APIs</category>
      <category>Cognitive Services</category>
      <category>Microsoft Ignite 2019</category>
      <category>Search APIs</category>
      <title>Learn, Connect, Explore with the Bing Search APIs team at Microsoft Ignite 2019 in Orlando, Florida</title>
      <description>The Bing Search APIs team will be at Microsoft Ignite 2019, in Orlando, Florida, November 4th through the 8th. If you are registered for the event, stop by the Azure Cognitive Services Web Search booth in the AI Apps &amp;amp; Agents area to learn more about the rich Cognitive Services search features and solutions available via the Bing Search APIs.</description>
      <pubDate>Wed, 02 Oct 2019 09:00:39 Z</pubDate>
      <content><![CDATA[<p>The Bing Search APIs team will be at <a href="https://www.microsoft.com/en-us/ignite" target="_blank">Microsoft Ignite 2019</a>, in Orlando, Florida, November 4<sup>th</sup> through the 8<sup>th</sup>.<br />
<br />
If you are registered for the event, stop by the Azure Cognitive Services Web Search booth in the AI Apps &amp; Agents area to learn more about the rich Cognitive Services search features and solutions available via the <a href="https://azure.microsoft.com/en-us/services/cognitive-services/directory/search/" target="_blank">Bing Search APIs</a>, including:</p>

<ul>
	<li>Bing Search APIs that enable you to query for web, images, news, video, entity, autosuggest and spell check</li>
	<li>Bing Custom Search API &ndash; tailor-made, ad-free, and adjustable search experiences</li>
	<li>Bing Visual Search API - Image-based search options for your app or website</li>
</ul>

<p><img alt="Microsoft Ignite 2019" src="/BingBlogs/media/Developer-Blog/2019/10/Ignite2019_Image.png" /><br />
<br />
Additionally, sign up to attend our Bing Search APIs session, to learn more about our flexible, state-of-the-art, end-to-end search services, that enable users to access the right information from Bing&rsquo;s powerful knowledge graph, including intelligent tools such as fraud and plagiarism detection.</p>

<h4>BRK3030 &ndash; AI powered web search with Bing Search APIs and more</h4>

<p>In today&rsquo;s competitive landscape, a business&rsquo;s success is predicated on its ability to provide its users with the right products and services. In this session, learn how easy it is to create a tailored, end-to-end solution using Bing APIs such as Bing Custom Search, Bing Web Search, Bing Visual Search, Bing Entity Search, Bing Autosuggest, Bing Spell Check, and more. Next, learn how to triangulate multiple data sources to develop more nuanced insights and an improved understanding of customer sentiments using a combination of Cognitive AI API capabilities. Discover how companies around the world are using these APIs to access the powerful Bing knowledge graph and index for scenarios like plagiarism detection, fraud detection and reinforcing their existing datasets, and many more. Leverage these APIs to grow your business.</p>

<p>To learn more about how you can infuse applications and webpages with contextual intelligence and engaging experiences, visit the <a href="https://www.bing.com/partners/developers" target="_blank">Bing for Partners site</a>, follow us on <a href="https://www.facebook.com/BingForPartners" target="_blank">Facebook</a>, and join the <a href="https://www.linkedin.com/groups/12150358/" target="_blank">Bing APIs LinkedIn group</a> for news and updates.</p>

<p>- Bing APIs Team</p>
]]></content>
    </item>
    <item>
      <guid
        isPermaLink="false">8af1a06f-b01f-4311-8119-2d5fabfbc203</guid>
      <link>https://blogs.bing.com/Engineering-Blog/2019-06/connect-with-the-bing-search-apis-team-at-microsoft-inspire-2019-in-las-vegas</link>
      <author>Bing Team</author>
      <category>Microsoft Inspire</category>
      <category>Microsoft Inspire 2019</category>
      <title>Connect with the Bing Search APIs team at Microsoft Inspire 2019</title>
      <description>The Bing Search APIs team will be at Microsoft Inspire 2019, in Las Vegas, Nevada, July 14 through the 18th. If you are registered for the event, stop by the Bing Search APIs booth in the AI area to learn more about the rich Cognitive Services search features and solutions available on the Bing Search APIs platform.</description>
      <pubDate>Wed, 05 Jun 2019 09:00:12 Z</pubDate>
      <content><![CDATA[<p>The Bing Search APIs team will be at <a href="https://partner.microsoft.com/en-US/inspire" target="_blank"> Microsoft Inspire 2019</a>, in Las Vegas, Nevada, July 14 through the 18<sup>th</sup>. If you are registered for the event, stop by the Bing Search APIs booth in the AI area to learn more about the rich Cognitive Services search features and solutions available on the Bing Search APIs platform, including:</p>

<ul>
	<li>Bing Search APIs, including web, news, image, video, entity, and more</li>
	<li>Bing Custom Search API</li>
	<li>Bing Visual Search API</li>
	<li>Bing Statistics, Bing Autosuggest, and Bing Spellcheck</li>
	<li>And more</li>
</ul>

<p>Bing Search APIs provide flexible, state-of-the-art, end-to-end search services that enable users to access the right information from billions of web documents, images, videos, and more. Help your customers extract various kinds of content, gain valuable insights, and protect your business with features such as fraud and plagiarism detection.<br />
<br />
<img alt="Microsoft Inspire" src="/BingBlogs/media/Developer-Blog/2019/06/BingSearchAPIs_Ignite.jpg" /></p>

<p>To learn more about how you can transform your apps, webpages, and other experiences, visit the <a href="https://www.bing.com/partners/developers" target="_blank">Bing for Partners site</a> and join the <a href="https://www.linkedin.com/groups/12150358/" target="_blank">Bing APIs LinkedIn group</a> for news and updates.</p>

<p>- Bing APIs Team</p>
]]></content>
    </item>
    <item>
      <guid
        isPermaLink="false">fa91b039-a7c1-41af-b0b8-267434a605ad</guid>
      <link>https://blogs.bing.com/Engineering-Blog/2019-04/bling-fire-tokenizer-released-to-open-source</link>
      <author>Bing Team</author>
      <category>Beyond Language and Understanding team</category>
      <category>Bling</category>
      <category>Bling FIRE Tokenizer</category>
      <category>FInite state machine and Regular Expression manipulation</category>
      <category>FIRE</category>
      <category>lemmatization</category>
      <category>tokenizer</category>
      <title>Bling FIRE Tokenizer Released to Open Source</title>
      <description>The Bling team in Bing Web Data is proud to announce that we&amp;rsquo;ve released Bling FIRE (FInite state machine and Regular Expression manipulation) library to the open source community.</description>
      <pubDate>Thu, 25 Apr 2019 09:08:28 Z</pubDate>
      <content><![CDATA[<p>The Bling team (Beyond Language and Understanding team) in Bing Web Data is proud to announce that we&rsquo;ve released <a href="https://github.com/Microsoft/BlingFire" target="_blank">Bling FIRE</a> (FInite state machine and Regular Expression manipulation) library to the open source community.</p>

<p>Bling FIRE is a library for constructing efficient tokenizers, sentence breakers, word segmenters, multi-word expression matchers, unknown-word guessers, stemmers/lemmatizers, and more. It is designed for high-speed, high-quality tokenization of natural language text.</p>

<p>The first application released on this library is the Bling FIRE tokenizer, which is used internally by Bing for all its Deep Learning based projects. It supports all whitespace-separated languages and closely follows the NLTK tokenization logic, with additional fixes and added breaking of hyphenated words:</p>

<p><tt><code><strong>NLTK:</strong> The South <mark>Florida/Miami</mark> area has previously hosted the event 10 times .</code></tt></p>

<p><tt><code><strong>FIRE:</strong> The South <mark>Florida / Miami</mark> area has previously hosted the event 10 times .</code></tt></p>

<p><tt><code><strong>NLTK:</strong> Marconi &#39;s European experiments in July <mark>1899&mdash;Marconi</mark> may have transmitted the letter S ( dot/dot/dot ) in a naval demonstration</code></tt></p>

<p><tt><code><strong>FIRE:</strong> Marconi &#39;s European experiments in July <mark>1899 &mdash; Marconi</mark> may have transmitted the letter S ( dot / dot / dot ) in a naval demonstration</code></tt></p>

<p><tt><code><strong>NLTK:</strong> Go to C : <mark>\Users\Public\Documents\hyper - v\Virtual hard disks\</mark> and delete <mark>MSIT_Win10.VHDX</mark> .</code></tt></p>

<p><tt><code><strong>FIRE:</strong> Go to C : <mark>\ Users \ Public \ Documents \ hyper - v \ Virtual hard disks \</mark> and delete <mark>MSIT_Win10 . VHDX</mark></code></tt></p>

<p><tt><code><strong>NLTK:</strong> In the confirmation window , click <mark>OK.</mark> Review the FMT Real - time Report ES .</code></tt></p>

<p><tt><code><strong>FIRE:</strong> In the confirmation window , click <mark>OK .</mark> Review the FMT Real - time Report ES .</code></tt></p>

<p>The key advantage of this library is speed &ndash; it is 10x faster than existing open source tokenizers:</p>

<table border="1" bordercolor="#ccc" cellpadding="5" cellspacing="0" style="border-collapse:collapse;" width="300">
	<thead>
		<tr>
			<th>System</th>
			<th>Avg Run Time (Second Per 10,000 Passages)</th>
		</tr>
	</thead>
	<tbody>
		<tr>
			<td>Bling FIRE</td>
			<td>0.823</td>
		</tr>
		<tr>
			<td>SpaCy</td>
			<td>8.653</td>
		</tr>
		<tr>
			<td>NLTK</td>
			<td>17.821</td>
		</tr>
	</tbody>
</table>

<p>Since its release, the project has been covered by specialized news sites and already has more than 1,000 stars on GitHub.</p>

<p>You can get access to the library and find out more details at: <a href="https://github.com/Microsoft/BlingFire" target="_blank">https://github.com/Microsoft/BlingFire</a>. To reach out to the team with questions or comments, connect with us on <a href="https://stackoverflow.com/questions/tagged/blingfire" target="_blank">Stack Overflow</a>.</p>

<p>- Bling Web Data Team</p>

<p><small>Copyright &copy; Microsoft Corporation. All rights reserved. Bling Fire is licensed under the MIT License.</small></p>
]]></content>
    </item>
    <item>
      <guid
        isPermaLink="false">ba02f983-d12c-430d-b1df-1968559ffef3</guid>
      <link>https://blogs.bing.com/Engineering-Blog/2019-04/talk-shop-with-the-bing-search-apis-team-at-microsoft-build-2019</link>
      <author>Bing Team</author>
      <category>Bing Search APIs</category>
      <category>Build</category>
      <category>Build 2019</category>
      <category>Cognitive Services</category>
      <category>Microsoft Build</category>
      <category>Microsoft Build 2019</category>
      <title>Talk Shop with the Bing Search APIs Team at Microsoft Build 2019</title>
      <description>Microsoft Build 2019 is taking place in Seattle, Washington, May 6 to May 8, and the Bing Search APIs team will be there. If you are registered for the event, stop by the Bing Search APIs booth to check out the rich web search API solutions available, all powered by Bing.</description>
      <pubDate>Mon, 15 Apr 2019 09:30:37 Z</pubDate>
      <content><![CDATA[<p><a href="https://www.microsoft.com/en-us/build" target="_blank">Microsoft Build 2019</a> is taking place in Seattle, Washington, May 6 to May 8, and the Bing Search APIs team will be there. If you are registered for the event, stop by the Bing Search APIs booth to check out the rich web search API solutions available, all powered by Bing.</p>

<p>The Internet is always changing and evolving. To tap into this vast knowledge of the Web, you can use the various Bing search APIs, with solutions that help you extract many kinds of content (e.g., websites, images, videos, entities, and local businesses) as well as insights such as trending news, images, videos, and visually similar products.<br />
<br />
Come by our booth to learn more about these flexible Cognitive Services that can be customized to fit your application needs. Also, hear first-hand about the new features available for scenarios such as fraud detection, plagiarism detection, extracting sentiments from across the web, and more.<br />
<br />
<img alt="Microsoft Build" src="/BingBlogs/media/Developer-Blog/2019/04/MicrosoftBuild_BingSearchAPIs.jpg" style="vertical-align: middle; display: block; margin: 0 auto;" /></p>

<p>If you are not able to attend Microsoft Build 2019, check for updates on the <a href="https://blogs.bing.com/Developers-Blog" target="_blank">Bing Developers blog</a> after the conference. To learn more about the API solutions from Bing, visit the <a href="https://www.bing.com/partners/developers" target="_blank">Bing for Partners site</a> and join the <a href="https://www.linkedin.com/groups/12150358/" target="_blank">Bing APIs LinkedIn group</a>.</p>

<p>- Bing Search APIs Team</p>
]]></content>
    </item>
    <item>
      <guid
        isPermaLink="false">5195349d-4c10-440e-be0e-1d414bbec201</guid>
      <link>https://blogs.bing.com/Engineering-Blog/2019-03/how-to-track-customer-sentiment-online-with-bing-news-search-api-and-text-analytics-api</link>
      <author>Bing Team</author>
      <category>Bing News Search API</category>
      <category>Microsoft Cognitive Services</category>
      <category>Reputation Management</category>
      <category>Text Analytics</category>
      <title>How to track customer sentiment online with Bing News Search API and Text Analytics API</title>
      <description>Nothing is more important than your reputation. The reputation of a business or organization is built from several factors&amp;mdash;search results, social media mentions, and customer reviews. Bing News Search can serve as a powerful data provider to help you track and manage public sentiment about you, an organization, or a brand.</description>
      <pubDate>Thu, 21 Mar 2019 10:03:23 Z</pubDate>
<content><![CDATA[<p>Nothing is more important than your reputation. The old adage has never been more true, now that our reputations are largely formed by our presence online. For brands, this is doubly true. The reputation of a business or organization is built from several factors&mdash;search results, social media mentions, and customer reviews. How you show up in search results shapes how you are perceived by the world. We work and live within a pull economy, where people, customers, and employers look for you online. And when they find you, their decision to purchase, connect, or move on is ultimately based on your reputation.<br />
<br />
In a world where how you show up on search is key to how you are perceived, reputation management has become a function of monitoring and analyzing what your customers are saying about you online and in person. With the goal of building trust and maintaining positive control over your brand&rsquo;s reputation, there are many tools available to help with just that. There are tools that can monitor social media mentions, compare you to your competitors, notify you whenever your brand is referenced online, understand sentiment, and much more. With such a vast landscape of content out there, both positive and negative about your brand, keeping up can feel like a daunting task.</p>

<h3>Reputation Management with Bing News Search API</h3>

<p><a href="https://azure.microsoft.com/en-us/services/cognitive-services/bing-news-search-api/" target="_blank">Bing News Search</a> can serve as a powerful data provider to help you track and manage public sentiment about you, an organization, or a brand. A Bing News Search query for a company&rsquo;s own name can populate an internal dashboard with new or trending news content about that company. News Search can also be pointed outward to help you track your competitors&rsquo; latest releases or initiatives.</p>

<p>With the Bing News Search API, you can extract headlines and excerpts from articles and return them in the response, making it easier to scan and track the content in an article, and much easier to build additional tools that leverage that data stream.</p>
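As a sketch of what that data stream looks like in practice, the code below builds and issues a Bing News Search v7 request. The endpoint and the `Ocp-Apim-Subscription-Key` header follow the Cognitive Services conventions of the time; the subscription key and the `BING_NEWS_KEY` environment variable are placeholders you would supply yourself:

```python
import json
import os
import urllib.parse
import urllib.request

NEWS_ENDPOINT = "https://api.cognitive.microsoft.com/bing/v7.0/news/search"

def build_news_request(query, key, count=10):
    """Return a urllib Request for the Bing News Search API."""
    params = urllib.parse.urlencode({"q": query, "count": count, "mkt": "en-US"})
    return urllib.request.Request(
        f"{NEWS_ENDPOINT}?{params}",
        headers={"Ocp-Apim-Subscription-Key": key},
    )

def fetch_headlines(query, key):
    """Fetch matching articles and return (headline, excerpt) pairs."""
    with urllib.request.urlopen(build_news_request(query, key)) as resp:
        body = json.load(resp)
    return [(a["name"], a["description"]) for a in body.get("value", [])]

# Only call the live API when a real subscription key is configured.
news_key = os.environ.get("BING_NEWS_KEY")
if news_key:
    for name, excerpt in fetch_headlines("Microsoft", news_key):
        print(name, "-", excerpt)
```

Each article in the response carries both a headline (`name`) and an excerpt (`description`), which is exactly the pair you would feed into downstream tooling.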

<h3>How to track sentiment</h3>

<p>Tracking sentiment online can be done quickly with the help of Bing News Search as one of the tools in your reputation management toolkit. Search for news about businesses, people, and products that you care about and then use text analytics to programmatically detect sentiment of articles, social media mentions, and more.</p>

<p>For example, in the screenshot below, there are a couple of headlines that came back when we searched for news about &ldquo;Microsoft&rdquo; that we can check:</p>

<ul>
	<li>&ldquo;Microsoft&#39;s popular Halo Xbox games are coming to the PC&rdquo;</li>
	<li>&ldquo;Microsoft to start nagging users in April about the January 2020 Windows 7 end-of-support deadline&rdquo;</li>
</ul>
<img alt="Bing News Search Example" src="/BingBlogs/media/Developer-Blog/2019/03/BingNewsSearchExample.png" /><br />
&nbsp;
<p>Using the <a href="https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/" target="_blank">Text Analytics API</a>, also available via Cognitive Services, we can take those headlines and quickly detect sentiment that is returned as a numeric score, ranging from 0% (negative) to 100% (positive). See our results below:</p>

<p><strong>Positive Sentiment Score (93%)</strong> - &ldquo;Microsoft&#39;s popular Halo Xbox games are coming to the PC&rdquo;<br />
<br />
<img alt="Text Analytics Example - Positive Sentiment" src="/BingBlogs/media/Developer-Blog/2019/03/TextAnalyticsExample.png" /></p>

<p><strong>Neutral Sentiment Score (50%)</strong> &ndash; &ldquo;Microsoft to start nagging users in April about the January 2020 Windows 7 end-of-support deadline&rdquo;<br />
<br />
<img alt="Text Analytics Example - Neutral Sentiment" src="/BingBlogs/media/Developer-Blog/2019/03/TextAnalyticsExampleNeutral.png" /></p>
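The two-step flow described above (fetch headlines, then score them) can be sketched against the v2-era Text Analytics REST shape: a `documents` array goes in, and a per-document `score` between 0.0 and 1.0 comes back. The region-specific endpoint and the `TEXT_ANALYTICS_KEY` environment variable below are assumptions you would replace with your own resource's values; only the payload builder runs without credentials:

```python
import json
import os
import urllib.request

# Region-specific endpoint is an assumption; substitute your own resource's URL.
SENTIMENT_ENDPOINT = (
    "https://westus.api.cognitive.microsoft.com/text/analytics/v2.1/sentiment"
)

def build_sentiment_payload(headlines):
    """Build the Text Analytics documents payload for a list of headlines."""
    return {
        "documents": [
            {"id": str(i), "language": "en", "text": text}
            for i, text in enumerate(headlines, start=1)
        ]
    }

def score_headlines(headlines, key):
    """POST headlines to Text Analytics; return {id: score} with 0.0-1.0 scores."""
    payload = json.dumps(build_sentiment_payload(headlines)).encode("utf-8")
    req = urllib.request.Request(
        SENTIMENT_ENDPOINT,
        data=payload,
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return {doc["id"]: doc["score"] for doc in body["documents"]}

# Only call the live API when a real subscription key is configured.
ta_key = os.environ.get("TEXT_ANALYTICS_KEY")
if ta_key:
    scores = score_headlines(
        ["Microsoft's popular Halo Xbox games are coming to the PC"], ta_key
    )
    print(scores)
```

Multiplying the returned score by 100 gives the percentage figures shown above, so a score of 0.93 corresponds to the 93% positive result for the Halo headline.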

<p>Tracking mentions and gauging sentiment to help manage perception can be done just like that.<br />
<br />
To learn more about <a href="https://azure.microsoft.com/en-us/services/cognitive-services/bing-news-search-api/" target="_blank">Bing News Search API</a> and <a href="https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/" target="_blank">Text Analytics API</a>, go to <a href="https://azure.microsoft.com/en-us/services/cognitive-services/" target="_blank">https://azure.microsoft.com/en-us/services/cognitive-services/</a>.</p>

<p>&nbsp;- Bing APIs Team</p>
]]></content>
    </item>
  </channel>
</rss>