Thursday, January 14, 2016

Case Study: Machine Learning at American Express


American Express has a rich history of using data and analytics to create deeper relationships with potential and current customers, but it’s the advent of machine learning that has allowed their scientists to harness the full power of their data. American Express’ Risk & Information Management team in partnership with the company’s Technology group embarked on a journey to build world-class Big Data capabilities nearly five years ago. Big data analytics helps American Express drive commerce, service their customers more effectively and detect fraud, according to Chao Yuan, senior vice president and head of Decision Science for American Express.
Sastry Durvasula, senior vice president of Enterprise Information Management and Digital Partnerships for American Express explains that the American Express "big data ecosystem has been built leveraging Hadoop and other industry leading technologies, and supports all business units with multi-petabyte scale. The platform boasts best-in-class engineering metrics, which includes a 45 second TeraSort and a 1.65 TB MinuteSort. The platform is highly scalable with flexible architecture to meet the growing business demands."

Machine Learning at American Express: Benefits and Requirements

Curious to know how American Express uses machine learning successfully, in production, at very large scale? An audience of over 300 recently got a peek into this big data story thanks to a presentation by Chao Yuan, SVP at American Express who heads their Modeling and Decision Science Team for US Consumer Business, and by co-presenter Ted Dunning, Chief Application Architect at MapR Technologies, at an event organized by the Hive Data Think Tank in Palo Alto. Chao talked about a collection of production big data use cases in which American Express has seen big benefits from using machine learning to improve decisions and better leverage their data. Ted then explained what is required of a big data platform in order to support large-scale machine learning projects such as these in production settings.

Data from both sides of business

American Express is used to operating at large scale. In business for 165 years, it has continued to transform itself to keep up with changing demand, beginning primarily as a shipping company, becoming a travel business, and now operating as a major credit card issuer that handles over 25% of US credit card spending. And in 2014, the company reached a milestone: one trillion dollars in transactions. The nature of the company gives it the opportunity to see data from both the customer and merchant side of business, in fact, from millions of sellers and millions of buyers. As Chao says, one thing American Express is never short of is data. But the question is, how can they best leverage this data to improve the decisions they make?
About five years ago, American Express recognized that traditional databases would not be enough to effectively handle the level of data and analytics needed for their projects, and decided that a big data infrastructure would be the solution. They began to employ a Hadoop platform for their infrastructure and turned to machine learning experts such as co-presenter Ted Dunning to help them learn how to get inside the data in order to become “more intelligent”. They went on to apply machine learning techniques across a wide range of key interactions.
Data volume is not only increasing, but data sources are also changing. More people do business online or via their mobile devices. Chao explained that as part of American Express’s ongoing journey, they must keep up with these changes in style of interaction as well as with the increasing volume. Part of that involves making a huge number of decisions, millions every day. If American Express can become just a little bit smarter in these decisions, it can deliver a huge advantage to customers and to the company. That’s why they are expanding how they use machine learning at large scale. With access to big data, machine learning models can produce superior discrimination and thus better understand customer behavior.

Machine learning implemented in production

Chao talked in particular about three classes of big data machine learning use cases that American Express has implemented in production: fraud detection, new customer acquisition, and recommendations for a better customer experience.

USE CASE 1: FRAUD DETECTION

In the case of fraud detection and prevention, machine learning has helped improve American Express’s already excellent track record, including in their online business interactions. To do this, modeling methods make use of a variety of data sources, including card membership information, spending details, and merchant information. The goal is to stop fraudulent transactions before substantial loss is incurred, while allowing normal business transactions to proceed in a timely manner. A customer who has swiped their card to make a purchase, for instance, expects to get approval immediately. In addition to accurately finding fraud, the fraud detection system is required to have these two characteristics:
  • Detect suspicious events early
  • Make decisions in a few milliseconds against a vast dataset
Large-scale machine learning techniques done correctly are able to meet these criteria and offer an improvement over traditional linear regression methods, taking the precision of predictions to a new level.
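To make the offline-train, online-score pattern concrete, here is a minimal sketch in Python using scikit-learn. It is an illustration only, not American Express’s actual system: the synthetic features, the label rule, and the approval threshold are all assumptions.

    # Minimal sketch: train a fraud scorer offline, then answer each
    # authorization request with a single fast model evaluation.
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    rng = np.random.default_rng(0)

    # Toy training data: [amount, merchant_risk, distance_from_home]
    X = rng.random((5000, 3))
    y = (X[:, 0] + X[:, 1] > 1.4).astype(int)   # synthetic fraud label

    model = GradientBoostingClassifier().fit(X, y)   # heavy lifting, done offline

    def approve(transaction_features, threshold=0.9):
        """Online path: one predict_proba call, milliseconds per decision."""
        p_fraud = model.predict_proba([transaction_features])[0, 1]
        return p_fraud < threshold

    print(approve([0.2, 0.1, 0.3]))   # likely True: approve
    print(approve([0.9, 0.9, 0.9]))   # likely False: hold for review

The essential split is that the expensive model fit happens ahead of time, while the per-transaction decision is a single model evaluation against features that can be looked up quickly.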

USE CASE 2: NEW CUSTOMER ACQUISITION

Finding new customers is a widespread need in business, and American Express is no exception. For example, when a prospective customer visits their website, there are many products (different credit card plans) from which to choose. Previously, around 90% of new customers came from direct mail campaigns, but now, with the web and with the advantage of targeted marketing through machine learning models, the share of new customer acquisition via online interactions has risen to 40%. This is especially advantageous because the costs involved online are lower than for direct mail contact.

USE CASE 3: RECOMMENDATION FOR IMPROVED CUSTOMER EXPERIENCE

Chao mentioned that one of his favorite uses of machine learning at American Express is a mobile phone application that provides customized recommendations for restaurant choices. When the customer gives permission, the application uses recent spending histories and card member profile data from a huge number of transactions to train a recommendation model. The model predicts which restaurants a particular customer might enjoy and makes those recommendations. (The technical basis for this approach was further explained by co-presenter Ted Dunning, as described below.)
The level of success of this improved customer experience is not only of interest to the card issuer but also to restaurant merchants who get feedback on how good a particular offer may be.

What Are the Requirements for the Data Platform?

Running these kinds of large-scale machine learning applications successfully in production places certain requirements on the big data platform that supports them, which was the main focus of Ted’s presentation. Machine learning applications need to work with large amounts of data, from a wide range of sources, that has been prepared and staged in specific ways. The MapR data platform is well suited to store, stream and facilitate search on data that is big and needs to move fast.
MapR’s real-time read/write file system, integrated NoSQL database and large array of Hadoop ecosystem tools meet the needs of large-scale machine learning applications. MapR’s ability to use legacy code directly, to make consistent snapshots for data versioning and to use remote mirroring for applications synchronized across multiple data centers are especially useful.

Ted explained that recommendation systems similar to the mobile application Chao described can leverage large amounts of data on user behavior histories to train a machine learning model that predicts which items each user is likely to prefer. The model identifies recommendation indicators based on the historical co-occurrence of users and items (or actions). The beauty of this type of design is that the computational heavy lifting, in which the learning algorithm trains the model, can be done ahead of time, offline. Then conventional technologies such as a search engine can be used to deploy the system easily, enabling it to deliver recommendations online in real time.
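As a rough sketch of the offline indicator computation, the following Python builds item-to-item indicators from toy restaurant visit histories, keeping positively associated pairs whose log-likelihood ratio (LLR) score clears a cutoff; the LLR here is the standard G² statistic for a 2x2 contingency table, and the toy data and cutoff value are assumptions for illustration.

    # Minimal sketch: co-occurrence recommendation indicators scored by the
    # log-likelihood ratio (G^2) over a 2x2 contingency table.
    import math
    from collections import defaultdict

    def xlogx(x):
        return 0.0 if x == 0 else x * math.log(x)

    def llr(k11, k12, k21, k22):
        """G^2 statistic for co-occurrence counts of an item pair."""
        row = xlogx(k11 + k12) + xlogx(k21 + k22)
        col = xlogx(k11 + k21) + xlogx(k12 + k22)
        mat = xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
        total = xlogx(k11 + k12 + k21 + k22)
        return 2.0 * (total + mat - row - col)

    # Toy histories: which restaurants each card member visited.
    histories = {
        "u1": {"thai_palace", "sushi_go", "taco_town"},
        "u2": {"thai_palace", "sushi_go"},
        "u3": {"sushi_go", "burger_barn"},
        "u4": {"thai_palace", "taco_town"},
    }
    items = sorted({i for h in histories.values() for i in h})
    n = len(histories)

    indicators = defaultdict(list)
    for a in items:
        for b in items:
            if a == b:
                continue
            k11 = sum(1 for h in histories.values() if a in h and b in h)
            k12 = sum(1 for h in histories.values() if a in h and b not in h)
            k21 = sum(1 for h in histories.values() if b in h and a not in h)
            k22 = n - k11 - k12 - k21
            positive = k11 * n > (k11 + k12) * (k11 + k21)  # above expectation
            if positive and llr(k11, k12, k21, k22) > 1.0:  # assumed cutoff
                indicators[a].append(b)

    print(dict(indicators))   # e.g. thai_palace <-> taco_town on this toy data

In a real deployment, these indicator lists would be indexed in a search engine offline, so that serving a recommendation reduces to a fast query of a user’s recent items against the indicator fields.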
Although the specific design and choice of algorithms differs with different types of machine learning, these applications do share some commonalities in the needs that they place on the big data platform that supports them.

Scalability

The quality of recommendations, as with other machine learning applications, depends in part on the quality and quantity of available data. Models learn from motifs observed across a large number of historical actions, so one requirement for the data platform is scalability of storage. This is true for different types of use cases from fraud detection at secure websites to predictive maintenance in large industrial settings.

Speed

Another need placed on a data platform by machine learning applications is to handle large-scale queries fast. Take the example of detecting anomalies in the propagation of telecom data. When special events occur, large groups of people may suddenly put a localized burden on telecommunications, such as tens of thousands of people in a sports arena using their phones to tweet. To avoid having such a situation overload the communication system, it’s useful to temporarily activate localized higher bandwidth to serve this “flash mob”. If you can detect these anomalies quickly, you can prioritize service appropriately, including maintaining service for first responders in an emergency. This need for speed is similar to the requirement for rapid response when validating a credit card transaction – either way, the machine learning system must be able to rapidly query millions of records and return a response in a second or less.
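A minimal sketch of what such detection can look like, assuming a simple trailing-average baseline per cell (the window size and threshold are illustrative choices, not a production algorithm):

    # Minimal sketch: flag a "flash mob" when the current per-second message
    # count far exceeds a trailing average for the same cell.
    from collections import deque

    def spike_detector(window=60, factor=5.0):
        history = deque(maxlen=window)
        def check(count_this_second):
            baseline = (sum(history) / len(history)) if history else float(count_this_second)
            history.append(count_this_second)
            return count_this_second > factor * max(baseline, 1.0)
        return check

    check = spike_detector()
    stream = [10] * 120 + [300] * 5      # steady load, then a sudden spike
    flags = [check(c) for c in stream]
    print(flags.index(True))             # 120: the spike is caught immediately

The same shape of check, run against a platform that can answer large queries in well under a second, is what makes it possible to react while the event is still happening.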

Compatible non-Hadoop File Access

Hadoop is a bit of a revolution, but in order to make the best use of existing experience as well as new ideas, it’s helpful for a data platform to seamlessly support both Hadoop and non-Hadoop applications and code, particularly alongside other machine learning systems. Ted explained that because MapR has a real-time, fully read/write file system that supports NFS access, legacy code can be used along with Hadoop applications – a particular advantage in large-scale machine learning projects.
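As a small, hypothetical illustration: ordinary Python that knows nothing about Hadoop can read cluster data through a plain POSIX path when the file system is NFS-mounted. The mount point and file path below are assumptions, not a real cluster layout.

    # Legacy-style code using plain POSIX I/O against an NFS-mounted cluster
    # path; no Hadoop client or rewrite of the program is needed.
    import os

    csv_path = "/mapr/my.cluster.com/projects/fraud/training_data.csv"  # hypothetical mount

    if os.path.exists(csv_path):
        with open(csv_path) as f:
            header = f.readline().strip().split(",")
            n_rows = sum(1 for _ in f)
        print(header, n_rows)
    else:
        print("cluster mount not present on this machine")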

Data Versioning

What separates a machine learning project that is an interesting “arts and crafts” study from one that is a serious work of engineering is the capability for repeatable processes and, in particular, for version control. It’s a challenge to do version control at the scale of terabytes and more of data, because it’s too expensive in space and time to make full copies. What is needed instead are transactionally consistent snapshots for data versioning, such as those available with the MapR data platform. Consistent snapshots let you understand and reference the state of training data from an exact point in time, compare results for old data and new models or vice versa, and ultimately see what is really causing changes in observed behavior.

Federation Across Multiple Data Centers

At some large companies, machine learning projects need to be deployed across multiple data centers, either to share results or, more often, to share a consistent collection of data for use in independent development projects. This requirement is met by MapR mirroring, which creates a consistent remote replica of data that is useful for geographic distribution as well as for providing the basis for disaster recovery if needed.


Sunday, January 10, 2016

Don't believe the hype about Big Data in 2016
Political campaigns are known for being a step behind when it comes to innovation.
If you've ever watched a bunch of cookie-cutter campaign ads and wondered why they look like they were produced by a couple of college students who just learned how to use Final Cut Pro — and not, say, Madison Avenue execs capable of creating heart-wrenching 30-second film masterpieces — you're not alone. But if there's one place where campaigns are supposed to be utilizing the latest in techno-whizbangery, it's in their exploitation of Big Data. With the information tools now at their disposal, they can microtarget voters down to the depths of their very souls.
This vision of highly sophisticated, algorithm-driven campaigns isn't completely inaccurate. But it's missing an important piece, the piece that is supposed to make it all seem either exciting or sinister, depending on your perspective. That missing piece is persuasion. In short, campaigns know how to find you in ways they never did before. But once they've found you, they haven't gotten any better at winning you over.
In a front-page story on Monday, The Washington Post reported on the data operation that may have propelled Ted Cruz to his current position near the front of the Republican pack:

Friday, January 8, 2016


38 great resources for learning data mining concepts and techniques

With today’s tools, anyone can collect data from almost anywhere, but not everyone can pull the important nuggets out of that data. Whacking your data into Tableau is an OK start, but it’s not going to give you the business critical insights you’re looking for. To truly make your data come alive you need to mine it. Dig deep. Play around. And tease out the diamond in the rough.
Jumpstarting your data mining journey can be an uphill battle if you didn’t study data science in school. Not to worry! Few of today’s brightest data scientists did. So, for those of us who may need a little refresher on data mining or are starting from scratch, here are 38 great resources to learn data mining concepts and techniques.

Learn data mining languages: R, Python and SQL

W3Schools - Fantastic set of interactive tutorials for learning different languages. Their SQL tutorial is second to none. You’ll learn how to manipulate data in MySQL, SQL Server, Access, Oracle, Sybase, DB2 and other database systems.
Treasure Data - The best way to learn is to work towards a goal. That’s what this helpful blog series is all about. You’ll learn SQL from scratch by following along with a simple, but common, data analysis scenario.
10 Queries - This course is recommended for the intermediate SQL-er who wants to brush up on his/her skills. It’s a series of 10 challenges coupled with forums and external videos to help you improve your SQL knowledge and understanding of the underlying principles.
TryR - Created by Code School, this interactive online tutorial system is designed to step you through R for statistics and data modeling. As you work through their seven modules, you’ll earn badges to track your progress helping you to stay on track.
Leada - If you’re a complete R novice, try Leada’s introduction to R. In their 1 hour 30 min course, they’ll cover installation, basic usage, common functions, data structures, and data types. They’ll even set you up with your own development environment in RStudio.
Advanced R - Once you’ve mastered the basics of R, bookmark this page. It’s a fantastically comprehensive style guide to using R. We should all strive to write beautiful code, and this resource (based on Google’s R style guide) is your key to that ideal.
Swirl - Learn R in R - a radical idea certainly. But that’s exactly what Swirl does. They’ll interactively teach you how to program in R and do some basic data science at your own pace. Right in the R console.
Python for beginners - The Python website actually has a pretty comprehensive and easy-to-follow set of tutorials. You can learn everything from installation to complex analyses. It also gives you access to the Python community, who will be happy to answer your questions.
PythonSpot - A complete list of Python tutorials to take you from zero to Python hero. There are tutorials for beginners, intermediate and advanced learners.

Read all about it: data mining books

Data Jujitsu: The Art of Turning Data into Product - This free book by DJ Patil gives you a brief introduction to the complexity of data problems and how to approach them. He gives nice, understandable examples that cover the most important thought processes of data mining. It’s a great book for beginners but still interesting to the data mining expert. Plus, it’s free!
Data Mining: Concepts and Techniques - The third (and most recent) edition will give you an understanding of the theory and practice of discovering patterns in large data sets. Each chapter is a stand-alone guide to a particular topic, making it a good resource if you’re not into reading in sequence or you want to know about a particular topic.  
Mining of Massive Datasets - Based on the Stanford Computer Science course, this book is often cited by data scientists as one of the most helpful resources around. It’s designed at the undergraduate level with no formal prerequisites. It’s the next best thing to actually going to Stanford!
Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners - This book is a must-read for anyone who needs to do applied data mining in a business setting (i.e., practically everyone). It’s a complete resource for anyone looking to cut through the Big Data hype and understand the real value of data mining. Pay particular attention to the section on how modeling can be applied to business decision making.
Data Smart: Using Data Science to Transform Information into Insight - The talented (and funny) John Foreman from MailChimp teaches you the “dark arts” of data science. He makes modern statistical methods and algorithms accessible and easy to implement.
Hadoop: The Definitive Guide - As a data scientist, you will undoubtedly be asked about Hadoop. So you’d better know how it works. This comprehensive guide will teach you how to build and maintain reliable, scalable, distributed systems with Apache Hadoop. Make sure you get the most recent edition to keep up with this fast-changing technology.

Learn from the best: top data miners to follow

John Foreman - Chief Data Scientist at MailChimp and author of Data Smart, John is worth a follow for his witty yet poignant tweets on data science.
DJ Patil - Author and Chief Data Scientist at The White House OSTP, DJ tweets everything you’ve ever wanted to know about data in politics.
Nate Silver - He’s Editor-in-Chief of FiveThirtyEight, a blog that uses data to analyze news stories in Politics, Sports, and Current Events.
Andrew Ng - As the Chief Data Scientist at Baidu, Andrew is responsible for some of the most groundbreaking developments in Machine Learning and Data Science.
Bernard Marr - He might know pretty much everything there is to know about Big Data.
Gregory Piatetsky - He’s the author of popular data science blog KDNuggets, the leading newsletter on data mining and knowledge discovery.
Christian Rudder - As the co-founder of OkCupid, Christian has access to one of the most unique datasets on the planet, and he uses it to give fascinating insight into human nature, love, and relationships.
Dean Abbott - He’s contributed to a number of data blogs and authored his own book on Applied Predictive Analytics. At the moment, Dean is Chief Data Scientist at SmarterHQ.

Practice what you’ve learned: data mining competitions

Kaggle - This is the ultimate data mining competition. The world’s biggest corporations offer big prizes for solving their toughest data problems.
Stack Overflow - The best way to learn is to teach. Stack Overflow offers the perfect forum for you to prove your data mining know-how by answering fellow enthusiasts' questions.
TunedIT - With a live leaderboard and interactive participation, TunedIT offers a great platform to flex your data mining muscles.
DrivenData - You can find a number of nonprofit data mining challenges on DrivenData. All of your mining efforts will go towards a good cause.
Quora - Another great site to answer questions on just about everything. There are plenty of curious data lovers on there asking for help with data mining and data science.

22 Big Data & Data Science experts' predictions for 2016


Will machines become smarter than man? What technology will dominate Data Science? What is smart data? Read Big Data experts' predictions for 2016.



We can’t help but look forward to what the new year will bring.
Will you finally be able to buy a self-driving car? Will machines become smarter than man? And what will happen to the world of data science?

Bernard Marr, Big Data Guru and Bestselling Author
“2016 will be exciting for Big Data – Big Data will go even more mainstream. 2016 will also be the year when companies without solid big data strategies will start to fall behind. In terms of technology, I see particular growth in real-time data analytics and increasing use of machine-learning algorithms.”

Kirk Borne, Principal Data Scientist at Booz Allen Hamilton and founder of RocketDataScience.org
“In 2016, the world of big data will focus more on smart data, regardless of size. Smart data are wide data (high variety), not necessarily deep data (high volume). Data are “smart” when they consist of feature-rich content and context (time, location, associations, links, interdependencies, etc.) that enable intelligent and even autonomous data-driven processes, discoveries, decisions, and applications.”




Thursday, January 7, 2016


Simple Linear Regression Example


In this lesson, we apply regression analysis to some fictitious data, and we show how to interpret the results of our analysis.
Note: Regression computations are usually handled by a software package or a graphing calculator. For this example, however, we will do the computations "manually", since the gory details have educational value.

Problem Statement

Last year, five randomly selected students took a math aptitude test before they began their statistics course. The Statistics Department has three questions.
  • What linear regression equation best predicts statistics performance, based on math aptitude scores?
  • If a student made an 80 on the aptitude test, what grade would we expect her to make in statistics?
  • How well does the regression equation fit the data?

How to Find the Regression Equation

In the table below, the xi column shows scores on the aptitude test. Similarly, the yi column shows statistics grades. The last two rows show sums and mean scores that we will use to conduct the regression analysis.

  Student      xi      yi     (xi - x̄)   (yi - ȳ)   (xi - x̄)²   (xi - x̄)(yi - ȳ)
  1            95      85       17          8          289           136
  2            85      95        7         18           49           126
  3            80      70        2         -7            4           -14
  4            70      65       -8        -12           64            96
  5            60      70      -18         -7          324           126
  Sum         390     385                              730           470
  Mean         78      77

The regression equation is a linear equation of the form: ŷ = b0 + b1x. To conduct a regression analysis, we need to solve for b0 and b1. Computations are shown below.
b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² = 470/730 = 0.644
b0 = ȳ - b1x̄ = 77 - (0.644)(78) = 26.768
Therefore, the regression equation is: ŷ = 26.768 + 0.644x.

How to Use the Regression Equation

Once you have the regression equation, using it is a snap. Choose a value for the independent variable (x), perform the computation, and you have an estimated value (ŷ) for the dependent variable.
In our example, the independent variable is the student's score on the aptitude test. The dependent variable is the student's statistics grade. If a student made an 80 on the aptitude test, the estimated statistics grade would be:
ŷ = 26.768 + 0.644x = 26.768 + 0.644 * 80 = 26.768 + 51.52 = 78.288
Warning: When you use a regression equation, do not use values for the independent variable that are outside the range of values used to create the equation. That is called extrapolation, and it can produce unreasonable estimates.
In this example, the aptitude test scores used to create the regression equation ranged from 60 to 95. Therefore, only use values inside that range to estimate statistics grades. Using values outside that range (less than 60 or greater than 95) is problematic.
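For readers who would rather let software handle the gory details after all, here is a minimal sketch in plain Python that reproduces the computations above from the five score pairs in the table:

    # Least-squares fit of y-hat = b0 + b1*x for the five aptitude/statistics
    # score pairs from the table above.
    xs = [95, 85, 80, 70, 60]   # aptitude test scores (xi)
    ys = [85, 95, 70, 65, 70]   # statistics grades (yi)

    n = len(xs)
    x_bar = sum(xs) / n         # 78.0
    y_bar = sum(ys) / n         # 77.0

    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))   # 470.0
    sxx = sum((x - x_bar) ** 2 for x in xs)                        # 730.0

    b1 = sxy / sxx              # 0.6438..., which rounds to 0.644
    b0 = y_bar - b1 * x_bar     # 26.781; the lesson's 26.768 comes from
                                # rounding b1 to 0.644 before computing b0

    print(f"y-hat = {b0:.3f} + {b1:.3f}x")
    print(f"estimate at x = 80: {b0 + b1 * 80:.3f}")   # 78.288, as above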


Wednesday, January 6, 2016


20 Questions to Detect Fake Data Scientists





Now that the Data Scientist is officially the sexiest job of the 21st century, everyone wants a piece of the pie. 

That means there are a few data posers out there. People who call themselves Data Scientists, but who don't actually have the right skill set. 

This isn't always done out of a desire to deceive. The newness of data science and the lack of a widely understood job description mean that many people may think they are data scientists purely because they deal with data.

The first way to detect fake Data Scientists is to understand the skill set you should be looking for. Knowing the difference between a Data Scientist, a Data Analyst, and a Data Engineer is important, especially if you're planning on hiring one of these rare specimens. To help you sort the true data scientists from the fake (or misguided) ones, we've compiled a list of 20 interview questions you can ask when interviewing data scientists.




Tuesday, January 5, 2016


Statistical Data Analysis

I am a freelancer in statistical data analysis.

I can perform analysis using any kind of statistical software, such as R, SPSS, MINITAB, EViews, etc.

I will perform analyses such as:

  • Regression, time series, and panel data analysis
  • Experimental designs
  • Factor analysis and principal component analysis
  • Bayesian analysis and other types of data analysis

Best Data Analysis

I will also provide:

  • Output with well-explained interpretations
  • A comprehensive report

At the same time, I can help you with any statistics-related:

  • Reports
  • Presentations
  • Homework
  • Tutorials
  • Questionnaires

I'm a graduate of the University of Peradeniya, where I completed a Special degree in Statistics, and I have three years of research experience.
I will provide the best quality work within the minimum time.


Contact https://t.co/flwgBhGlSo